URI encoding in location bar, different strategies

Sunday, February 8th, 2009

In my previous post, I mentioned that some browsers escape non-ASCII characters in URIs in an inconsistent manner.

I looked a little deeper into the issue and tried a few more characters in the location bar to see which strategies the browsers adopt.

Default character set

First, when I tried characters that are defined in Windows-1252 but not in ISO-8859-1, I found that both Firefox and IE actually use my machine’s default encoding, Windows-1252, rather than plain ISO-8859-1, to escape characters in the query string.

The euro symbol € (U+20AC) was replaced by %80 (Windows-1252) whereas it should be %E2%82%AC (utf-8). The € symbol is not part of ISO-8859-1. I had similar results with the em dash — (U+2014): %97 instead of %E2%80%94. The em dash is also part of Windows-1252 but not of ISO-8859-1.
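The difference is easy to reproduce outside a browser. The sketch below uses Python's standard `urllib.parse.quote` to percent-encode the two characters under each encoding. The helper name `escape` is my own and only there for illustration.

```python
from urllib.parse import quote

def escape(text, encoding):
    """Percent-encode text after converting it to bytes in the given encoding."""
    return quote(text, safe="", encoding=encoding)

for char in ("€", "—"):
    print(char, escape(char, "cp1252"), escape(char, "utf-8"))
# € %80 %E2%82%AC
# — %97 %E2%80%94
```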

Of course, this means that these escaped byte sequences will make little sense to a server running Linux or Mac OS, which has no reason to assume Windows-1252.
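To illustrate the server's side of the problem, here is a minimal sketch of what happens when the escaped euro symbol is decoded under different assumptions. It uses Python's `urllib.parse.unquote`; which encoding a real server assumes depends on its configuration, so the three cases are only examples.

```python
from urllib.parse import unquote

escaped = "%80"  # what Firefox and IE sent for the euro symbol

# Assuming utf-8: the lone byte 0x80 is invalid and becomes U+FFFD.
as_utf8 = unquote(escaped, encoding="utf-8", errors="replace")
# Assuming ISO-8859-1: the byte decodes to the control character U+0080.
as_latin1 = unquote(escaped, encoding="latin-1")
# Assuming Windows-1252: the euro symbol is recovered.
as_cp1252 = unquote(escaped, encoding="cp1252")

print(ascii(as_utf8), ascii(as_latin1), ascii(as_cp1252))
```

Only the last assumption gives back the character that was typed.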

When outside the range

Then I looked at what happens with characters that are not part of Windows-1252, for example the Greek small letter alpha α (U+03B1). Here IE and Firefox started to differ. Firefox opted for utf-8 this time, so I got %CE%B1 as expected. IE assumed that I must have wanted a Latin small letter a, and replaced the α (alpha) with a plain a. With the Greek small letter lambda λ (U+03BB), Firefox used utf-8 again and gave me %CE%BB. IE, unable to provide a suitable replacement, replaced my character with a question mark ?.

At least, Firefox provides a way for me to access sites using a different character set. IE on the other hand assumes that I would only type the URL of sites which use the same character set. So if I want to type the URL of a site which uses say Greek characters, it probably won’t work.
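IE's behaviour can be approximated as: force everything into Windows-1252, substitute a look-alike where one is known, and use a question mark otherwise. The sketch below is only a model of what I observed. The `BEST_FIT` table and the function name `ie_like_escape` are hypothetical, and the table contains only the single substitution seen in this test.

```python
from urllib.parse import quote

# Hypothetical look-alike table: only the one substitution observed here.
BEST_FIT = {"α": "a"}

def ie_like_escape(text):
    """Force text into Windows-1252, substituting a look-alike or '?'."""
    out = []
    for ch in text:
        try:
            ch.encode("cp1252")
        except UnicodeEncodeError:
            ch = BEST_FIT.get(ch, "?")
        out.append(ch)
    # Keep '?' literal so the substitution stays visible, as in the browser.
    return quote("".join(out), safe="?", encoding="cp1252")

print(ie_like_escape("α"), ie_like_escape("λ"))  # a ?
```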

Mixing characters

I decided to test the strategy adopted when mixing Unicode characters that only appear in different ISO-8859-x character sets. I used ‘é’ (U+00E9) from ISO-8859-1 and ‘α’ (U+03B1) from ISO-8859-7.

Firefox used utf-8 for transforming that part of the query string and replaced it with %C3%A9%CE%B1. However, IE persisted with Windows-1252 and replaced it with %E9a.

I am glad that Firefox realised that it required a universal character set to correctly represent the URL. But this means that the encoding of a character in the query string depends on the other characters around it: without the alpha, my e with an acute accent is encoded using Windows-1252. IE is more consistent, but makes it impossible for me to type a URL that contains characters from another character set.
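The context dependence is easier to see in code. The following is a minimal sketch of the fallback as I observed it, not Firefox's actual implementation. The function name `firefox_like_escape` is made up, and the idea that the decision is taken once for the whole string is an assumption that happens to match the results above.

```python
from urllib.parse import quote

def firefox_like_escape(text, default="cp1252"):
    """Use the default charset if it covers the whole string, else utf-8."""
    try:
        text.encode(default)
        encoding = default
    except UnicodeEncodeError:
        encoding = "utf-8"
    return quote(text, safe="", encoding=encoding)

print(firefox_like_escape("é"))   # %E9
print(firefox_like_escape("éα"))  # %C3%A9%CE%B1
```

The same é comes out as one byte or two depending on whether an α sits next to it.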

Summary

  • Firefox assumes the default character set, and if it cannot use it, it falls back on utf-8;
  • IE always uses the default character set, and therefore tries to find replacement characters; if it cannot, it gives up;
  • Google Chrome uses utf-8 only.

Additional Info

I tested this on Firefox 3.0.6, Internet Explorer 7.0.6 and Google Chrome 1.0.

Below are the characters that I have used and their byte representation in various character encodings.

Char  Unicode  ISO-8859-1  CP-1252  ISO-8859-7  UTF-8
é     U+00E9   E9          E9       n/a         C3 A9
€     U+20AC   n/a         80       A4          E2 82 AC
—     U+2014   n/a         97       n/a         E2 80 94
α     U+03B1   n/a         n/a      E1          CE B1
λ     U+03BB   n/a         n/a      EB          CE BB
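A table like this can be regenerated with a few lines of Python. This is only a sketch: the codec names are Python's own, and its ISO-8859-7 codec may follow a different revision of the standard than the one I looked up, so treat any difference in that column with care.

```python
CHARS = ["é", "€", "—", "α", "λ"]
CODECS = ["latin-1", "cp1252", "iso8859_7", "utf-8"]

def hex_bytes(char, codec):
    """Return char's bytes in codec as spaced hex, or 'n/a' if unencodable."""
    try:
        return " ".join(f"{b:02X}" for b in char.encode(codec))
    except UnicodeEncodeError:
        return "n/a"

for char in CHARS:
    cells = [hex_bytes(char, codec) for codec in CODECS]
    print(char, f"U+{ord(char):04X}", *cells, sep="\t")
```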

Below are the results of the characters I tested in the query string of a URI typed in the location bar:

Char(s)  IE    Firefox        Chrome
é        %E9   %E9            %C3%A9
€        %80   %80            %E2%82%AC
—        %97   %97            %E2%80%94
α        a     %CE%B1         %CE%B1
λ        ?     %CE%BB         %CE%BB
éα       %E9a  %C3%A9%CE%B1   %C3%A9%CE%B1