A puzzling URI encoding affair

Friday, February 6th, 2009

While working on a URI template/pattern library in Java, I started looking into the details of URI encoding.

I have realised that browsers make puzzling decisions when encoding URIs that the user types into the location bar: they seem to use different character encodings for different parts of the URI.

Consider the following URL:

  http://acme.com/café?variété=#caféine

This isn’t a valid URI because the URI generic syntax (RFC 3986) only allows a subset of ASCII characters; everything else, including every non-ASCII character, must be percent-encoded. But it is perfectly legitimate from the user’s perspective. Users do not know about RFC 3986, nor should they, so the browser should do the work on their behalf.

The W3C has indicated how non-ASCII characters in URI attribute values should be handled in an appendix of the HTML 4.01 specification (section B.2.1):

Although URIs do not contain non-ASCII values (see [URI], section 2.1) authors sometimes specify them in attribute values expecting URIs (i.e., defined with %URI; in the DTD). For instance, the following href value is illegal:

<A href="http://foo.org/Håkon">...</A>

We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

  1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
  2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

This procedure results in a syntactically legal URI (as defined in [RFC1738], section 2.2 or [RFC2141], section 2) that is independent of the character encoding to which the HTML document carrying the URI may have been transcoded.
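
The two steps of this convention can be sketched in Java as follows. This is only a minimal illustration, not the code of any browser or library: the class name `Utf8Escaper` and the method `escapeNonAscii` are made up for this example, and the sketch deliberately leaves every ASCII character untouched (including reserved ones), since the quoted convention only concerns non-ASCII characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the W3C convention quoted above.
class Utf8Escaper {

    // Replaces each non-ASCII character with the %HH escapes of its UTF-8 bytes.
    // ASCII characters are copied as-is; this is not a full RFC 3986 encoder.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                // Step 1: represent the character in UTF-8 as one or more bytes.
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                // Step 2: escape each byte as %HH.
                for (byte b : bytes) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e5 is 'å'; written as an escape so the source file stays ASCII-only.
        System.out.println(escapeNonAscii("http://foo.org/H\u00e5kon"));
        // prints http://foo.org/H%C3%A5kon
    }
}
```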

I thought this rule would similarly apply to addresses typed in the location bar. But both Firefox and Internet Explorer (and apparently Opera too) generate the following URI for the HTTP request:

  http://localhost/caf%C3%A9?vari%E9t%E9=#caf%C3%A9ine

The ‘é’ (U+00E9) has been escaped differently:

  • The path and fragment parts contain the percent-encoded UTF-8 bytes: C3 A9.
  • The query string, however, was encoded using ISO-8859-1 (or a superset of it): E9.
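
To see where the two byte sequences come from, here is a small Java sketch that percent-encodes the same character under both charsets. The class and method names are invented for this illustration, and the helper escapes every byte it is given, so it is only meant to be called on the non-ASCII character itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares the byte encodings of U+00E9.
class EncodingComparison {

    // Percent-encodes every byte of s in the given charset (ASCII included).
    static String percentEncode(String s, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            out.append(String.format("%%%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // 'é'
        // Two bytes in UTF-8, as seen in the path and fragment.
        System.out.println(percentEncode(eAcute, StandardCharsets.UTF_8));      // %C3%A9
        // One byte in ISO-8859-1, as seen in the query string.
        System.out.println(percentEncode(eAcute, StandardCharsets.ISO_8859_1)); // %E9
    }
}
```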

This leaves us with a URI that uses two different character encodings. What a mess!
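
The practical consequence is that no single charset decodes the whole URI correctly. The sketch below is one way to observe this, assuming Java 10 or later for the `URLDecoder.decode(String, Charset)` overload; the class and method names are hypothetical. Note that `URLDecoder` targets form encoding (it also turns `+` into a space), which does not matter for these particular inputs.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes the path and query with each charset.
class MixedEncodingDemo {

    static String decodeUtf8(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    static String decodeLatin1(String s) {
        return URLDecoder.decode(s, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String path = "caf%C3%A9";     // encoded by the browser as UTF-8
        String query = "vari%E9t%E9";  // encoded by the browser as ISO-8859-1

        // Each part decodes cleanly only with the charset it was encoded in.
        System.out.println(decodeUtf8(path));
        System.out.println(decodeLatin1(query));

        // Swapping the charsets garbles the text: the lone E9 byte is not valid
        // UTF-8, and the C3 A9 pair becomes two separate ISO-8859-1 characters.
        System.out.println(decodeUtf8(query));
        System.out.println(decodeLatin1(path));
    }
}
```

A server that receives such a request therefore has to decode the path and the query string separately, or guess the charset of each.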

Google Chrome, however, produces the URI that the W3C prescribes, using UTF-8 for all parts of the URL.