Posts Tagged ‘Character Encoding’

URI encoding in location bar, different strategies

Sunday, February 8th, 2009

In my previous post, I mentioned that some browsers escape non-ASCII characters in URIs in an inconsistent manner.

I looked a little deeper into the issue and tried a few more characters in the location bar to see what strategies the browsers adopt.

Default character set

First, when I tried characters that are defined in Windows-1252 but not in ISO-8859-1, I found that both Firefox and IE actually use my machine’s default encoding, Windows-1252, rather than plain ISO-8859-1 to escape characters in the query string.

The euro symbol € (U+20AC) was replaced by %80 (Windows-1252), whereas it should be %E2%82%AC (UTF-8). The € symbol is not part of ISO-8859-1. I had similar results with the em dash — (U+2014): %97 instead of %E2%80%94. The em dash is also part of Windows-1252 but not of ISO-8859-1.

Of course, this means that these escaped byte sequences will make little sense to a server that does not assume Windows-1252, such as a typical Linux or Mac OS server.
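The difference is easy to reproduce outside a browser. Below is a minimal sketch in Python (my choice for illustration, not something the browsers use) that relies on the standard urllib.parse module; "cp1252" is Python's name for Windows-1252.

```python
from urllib.parse import quote, unquote

# Characters from this post that exist in Windows-1252 but not in ISO-8859-1.
EURO = "\u20ac"     # euro symbol
EM_DASH = "\u2014"  # em dash

for char in (EURO, EM_DASH):
    legacy = quote(char, encoding="cp1252")  # bytes IE and Firefox sent
    utf8 = quote(char, encoding="utf-8")     # bytes the UTF-8 convention gives
    print(f"U+{ord(char):04X}  cp1252={legacy}  utf-8={utf8}")

# ISO-8859-1 has no byte for the euro symbol, so a strict encode fails.
try:
    EURO.encode("iso-8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False
print("euro in ISO-8859-1:", in_latin1)

# A server that assumes UTF-8 cannot recover the character from %80:
# the byte is invalid on its own and decodes to U+FFFD.
print(repr(unquote("%80", encoding="utf-8", errors="replace")))
```

The last line is the server-side view of the problem: the byte only means "euro" to a reader who already knows the client used Windows-1252.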

When outside the range

Then I tried to see what happens with characters that are not part of Windows-1252, for example the Greek small letter alpha α (U+03B1). IE and Firefox started to differ. Firefox opted for UTF-8 this time, so I got %CE%B1 as expected. IE assumed that I must have wanted a Latin small letter a, and replaced the α (alpha) with a. With the Greek small letter lambda λ (U+03BB), Firefox used UTF-8 again and gave me %CE%BB. IE, unable to provide a suitable replacement, replaced my character with a question mark (?).
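To check which character sets can represent a given character at all, a small Python sketch is enough. The helper name can_encode is mine, and the "replace" error handler only approximates IE's question-mark fallback; I am not claiming IE is implemented this way.

```python
ALPHA = "\u03b1"   # Greek small letter alpha
LAMBDA = "\u03bb"  # Greek small letter lambda

def can_encode(char, charset):
    """Return True if charset has a byte sequence for char."""
    try:
        char.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

for charset in ("cp1252", "iso-8859-7", "utf-8"):
    print(charset, can_encode(ALPHA, charset), can_encode(LAMBDA, charset))

# With the "replace" error handler, an unencodable character becomes "?",
# which matches what IE sent for lambda.
replaced = LAMBDA.encode("cp1252", errors="replace")
print(replaced)
```

Python has no equivalent of the alpha-to-a substitution, so that part of IE's behaviour is not reproduced here.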

At least Firefox provides a way for me to access sites that use a different character set. IE, on the other hand, assumes that I will only type the URLs of sites that use the same character set as my machine. So if I want to type the URL of a site that uses, say, Greek characters, it probably won’t work.

Mixing characters

I decided to test the strategy adopted when mixing Unicode characters that appear only in different ISO-8859-x character sets. I used ‘é’ (U+00E9) from ISO-8859-1 and ‘α’ (U+03B1) from ISO-8859-7.

Firefox used UTF-8 to transform that part of the query string and replaced it with %C3%A9%CE%B1. IE, however, persisted with Windows-1252 and replaced it with %E9a.

I am glad that Firefox realised that it required a universal character set to represent the URL correctly. But this means that the encoding of a character in the query string depends on the other characters: without the alpha, my e with an acute accent is encoded using Windows-1252. IE is more consistent, but makes it impossible for me to type a URL that contains characters from another character set.


To summarise:

  • Firefox assumes the default character set, and if it cannot use it, it falls back on UTF-8;
  • IE always uses the default character set, and therefore tries to find replacement characters; if it cannot, it gives up;
  • Google Chrome uses UTF-8 only.
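These three strategies can be sketched as three small Python functions. This is only a model of the behaviour I observed, not the browsers' actual code: the function names are mine, the default character set is hard-coded to my machine's Windows-1252, and BEST_FIT is a hypothetical stand-in that lists only the one substitution seen in my tests.

```python
from urllib.parse import quote

DEFAULT_CHARSET = "cp1252"  # my machine's default (Windows-1252)

# Hypothetical stand-in for IE's replacement table; only the single
# substitution observed in this post (alpha -> a) is listed.
BEST_FIT = {"\u03b1": "a"}

def firefox_query(text, default=DEFAULT_CHARSET):
    """Use the default charset for the whole string, else fall back to UTF-8."""
    try:
        return quote(text, safe="", encoding=default, errors="strict")
    except UnicodeEncodeError:
        return quote(text, safe="", encoding="utf-8")

def ie_query(text, default=DEFAULT_CHARSET):
    """Always use the default charset; substitute or give up with '?'."""
    chars = []
    for ch in text:
        try:
            ch.encode(default)
        except UnicodeEncodeError:
            ch = BEST_FIT.get(ch, "?")
        chars.append(ch)
    # Keep "?" literal, since that is what appeared in the request.
    return quote("".join(chars), safe="?", encoding=default)

def chrome_query(text):
    """Always use UTF-8."""
    return quote(text, safe="", encoding="utf-8")

for sample in ("\u00e9", "\u03b1", "\u03bb", "\u00e9\u03b1"):
    print(sample, ie_query(sample), firefox_query(sample), chrome_query(sample))
```

Note that firefox_query decides the encoding for the string as a whole, which is exactly why the é is escaped differently once an alpha is present.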

Additional Info

I tested this on Firefox 3.0.6, Internet Explorer 7.0.6 and Google Chrome 1.0.

Below are the characters that I have used and their byte representation in various character encodings.

Char  Unicode  ISO-8859-1  CP-1252  ISO-8859-7  UTF-8
é     U+00E9   E9          E9       n/a         C3 A9
€     U+20AC   n/a         80       A4          E2 82 AC
—     U+2014   n/a         97       n/a         E2 80 94
α     U+03B1   n/a         n/a      E1          CE B1
λ     U+03BB   n/a         n/a      EB          CE BB

Below are the results of the characters I tested in the query string of a URI typed in the location bar:

Char(s)  IE    Firefox       Chrome
é        %E9   %E9           %C3%A9
€        %80   %80           %E2%82%AC
—        %97   %97           %E2%80%94
α        a     %CE%B1        %CE%B1
λ        ?     %CE%BB        %CE%BB
éα       %E9a  %C3%A9%CE%B1  %C3%A9%CE%B1

A puzzling URI encoding affair

Friday, February 6th, 2009

While working on a URI template/pattern Java library, I started looking into the details of URI encoding.

I have realised that browsers make puzzling decisions when it comes to encoding URIs that the user types in the location bar. They seem to use different encodings for the different parts of the URI.

Consider the following URL:é?variété=#caféine

This isn’t a valid URI because the URI generic syntax (RFC 3986) requires all bytes that are not 7-bit ASCII to be percent-encoded. But it is perfectly legitimate from the user’s perspective. Users do not know about RFC 3986, and they should not have to, so the browser should do the work on their behalf.

The W3C has indicated how non-ASCII characters in URI attribute values should be handled in an appendix of the HTML specification:

Although URIs do not contain non-ASCII values (see [URI], section 2.1) authors sometimes specify them in attribute values expecting URIs (i.e., defined with %URI; in the DTD). For instance, the following href value is illegal:

<A href="åkon">...</A>

We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

  1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
  2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

This procedure results in a syntactically legal URI (as defined in [RFC1738], section 2.2 or [RFC2141], section 2) that is independent of the character encoding to which the HTML document carrying the URI may have been transcoded.
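The two steps of that convention translate almost literally into code. Here is a minimal Python sketch; the function name escape_non_ascii is mine, and it deliberately touches only non-ASCII characters, since that is all the convention covers (reserved ASCII characters are left alone).

```python
def escape_non_ascii(uri):
    """Apply the two-step convention to every non-ASCII character."""
    out = []
    for ch in uri:
        if ord(ch) < 128:
            out.append(ch)  # ASCII characters pass through unchanged
        else:
            # Step 1: represent the character in UTF-8 as one or more bytes.
            # Step 2: escape each byte as %HH.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)

print(escape_non_ascii("\u00e5kon"))  # the href value quoted above
```

Because the result depends only on the characters and not on any document or machine encoding, every client produces the same bytes.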

I thought this rule would similarly apply to addresses typed in the location bar. But both Firefox and Internet Explorer (and apparently Opera too) generate the following URI in their HTTP request:


The ‘é’ (U+00E9) has been escaped differently:

  • The path and fragment parts contain the percent-encoded UTF-8 bytes: C3 A9.
  • The query string, however, was encoded using ISO-8859-1 (or a superset): E9.

This leaves us with a URI that uses two different character encodings. What a mess!
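To see why this is a mess for whoever receives the request, here is a rough Python sketch that mimics the per-component behaviour and then tries to decode the result with a single character set. The function name encode_as_observed is mine, and only the path, query and fragment of my example are modelled.

```python
from urllib.parse import quote, unquote

def encode_as_observed(path, query, fragment):
    """UTF-8 for path and fragment, ISO-8859-1 for the query string."""
    return (
        quote(path, encoding="utf-8")
        + "?" + quote(query, safe="=&", encoding="iso-8859-1")
        + "#" + quote(fragment, encoding="utf-8")
    )

mixed = encode_as_observed("\u00e9", "vari\u00e9t\u00e9=", "caf\u00e9ine")
print(mixed)

# No single charset decodes every part correctly:
as_utf8 = unquote(mixed, encoding="utf-8", errors="replace")
as_latin1 = unquote(mixed, encoding="iso-8859-1")
print(as_utf8)    # query string is damaged (U+FFFD replacement characters)
print(as_latin1)  # path and fragment are damaged (two characters per é)
```

A server would have to know which component it is looking at, and guess the client's default character set, before it could decode the query string reliably.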

Google Chrome, however, produces the URI that the W3C recommends, using UTF-8 for all parts of the URL.