Archive for the ‘Internet’ Category

URI encoding in location bar, different strategies

Sunday, February 8th, 2009

In my previous post, I mentioned that some browsers escape non-ASCII characters in URIs in an inconsistent manner.

I started to look a little deeper into the issue, trying a few more characters in the location bar to see what strategies the browsers adopted.

Default character set

First, when I tried characters which are defined in Windows-1252 but not in ISO-8859-1, I found that both Firefox and IE actually use my machine’s default encoding, Windows-1252, rather than plain ISO-8859-1, to encode characters in the query string.

The euro symbol € (U+20AC) was replaced by %80 (Windows-1252) whereas it should be %E2%82%AC (utf-8); the € symbol is not part of ISO-8859-1. I had similar results with the em dash — (U+2014): %97 instead of %E2%80%94. The em dash is also part of Windows-1252 but not of ISO-8859-1.

Of course, this means that these escaped byte sequences will make little sense for a Linux or MacOS server.
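To make the difference concrete, here is a minimal Java sketch using java.net.URLEncoder and URLDecoder (the class name is mine, and this only illustrates the byte-level behaviour, not what the browsers actually do internally):

  import java.net.URLDecoder;
  import java.net.URLEncoder;

  // Minimal sketch (not the browsers' actual code). URLEncoder is really a
  // form-data encoder, but for a single non-ASCII character it shows the
  // byte-level difference between the two strategies.
  public class EuroEncoding {
      public static void main(String[] args) throws Exception {
          String euro = "\u20AC"; // the euro symbol

          // Query-string behaviour of IE/Firefox on a Windows-1252 machine:
          System.out.println(URLEncoder.encode(euro, "windows-1252")); // %80

          // What the HTML recommendation (and Chrome) produces:
          System.out.println(URLEncoder.encode(euro, "UTF-8"));        // %E2%82%AC

          // A server that assumes utf-8 cannot make sense of the lone 0x80 byte:
          // the malformed sequence decodes to U+FFFD, the replacement character.
          System.out.println(URLDecoder.decode("%80", "UTF-8"));
      }
  }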

When outside the range

Then I tried to see what happened with characters that were not part of Windows-1252, for example the Greek small letter alpha α (U+03B1). Here IE and Firefox started to differ. Firefox opted for utf-8 this time, so I got %CE%B1 as expected. IE assumed that I must have wanted a Latin small letter a, and replaced the α (alpha) with a. With the Greek small letter lambda λ (U+03BB), Firefox used utf-8 again and gave me %CE%BB, while IE, unable to provide a suitable replacement, substituted a question mark ?.

At least Firefox provides a way for me to access sites that use a different character set. IE, on the other hand, assumes that I would only type the URL of sites which use the same character set as my machine. So if I want to type the URL of a site which uses, say, Greek characters, it probably won’t work.
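The same fallback can be reproduced in Java, again assuming a Windows-1252 default: a character that the code page cannot represent comes out as ? (0x3F), much like IE’s lambda, whereas utf-8 always has a byte sequence for it. (Java does not attempt IE’s transliteration of α to a, so the alpha becomes a question mark too.)

  import java.net.URLEncoder;

  // Illustration only: encoding Greek letters with a code page that lacks them
  // versus with utf-8.
  public class OutsideTheCodePage {
      public static void main(String[] args) throws Exception {
          System.out.println(URLEncoder.encode("\u03B1", "windows-1252")); // %3F  ('?')
          System.out.println(URLEncoder.encode("\u03BB", "windows-1252")); // %3F  ('?')
          System.out.println(URLEncoder.encode("\u03B1", "UTF-8"));        // %CE%B1
          System.out.println(URLEncoder.encode("\u03BB", "UTF-8"));        // %CE%BB
      }
  }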

Mixing characters

I decided to test the strategy adopted when mixing Unicode characters that only appear in different ISO-8859-x character sets. I used ‘é’ (U+00E9), which is in ISO-8859-1, and ‘α’ (U+03B1), which is in ISO-8859-7.

Firefox used utf-8 for that part of the query string and replaced it with %C3%A9%CE%B1. IE, however, persisted with Windows-1252 and produced %E9a.

I am glad that Firefox realised that it needed a universal character set to correctly represent the URL. But this means that the encoding of a character in the query string depends on the other characters around it: without the alpha, my e with an acute accent is encoded using Windows-1252. IE is more consistent, but makes it impossible for me to type a URL that contains characters from another character set.
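My guess at the strategy Firefox seems to follow, sketched in Java (this is an assumption about the logic, not Firefox’s actual implementation): use the platform default character set if it can encode the whole string, otherwise fall back to utf-8.

  import java.net.URLEncoder;
  import java.nio.charset.Charset;

  // Hypothetical reconstruction of the observed behaviour.
  public class FirefoxLikeStrategy {
      static String encodeQuery(String query, Charset platformDefault) throws Exception {
          // Fall back to UTF-8 only when the default set cannot represent the input.
          String charset = platformDefault.newEncoder().canEncode(query)
                  ? platformDefault.name()
                  : "UTF-8";
          return URLEncoder.encode(query, charset);
      }

      public static void main(String[] args) throws Exception {
          Charset cp1252 = Charset.forName("windows-1252");
          System.out.println(encodeQuery("\u00E9", cp1252));       // %E9           (é alone)
          System.out.println(encodeQuery("\u00E9\u03B1", cp1252)); // %C3%A9%CE%B1  (é followed by α)
      }
  }

With a Windows-1252 default, this reproduces the results in the table further down.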

Summary

  • Firefox assumes the default character set, and if it cannot use it, it falls back on utf-8;
  • IE always uses the default character set, and therefore tries to find replacement characters; if it cannot, it gives up;
  • Google Chrome uses utf-8 only.

Additional Info

I tested this on Firefox 3.0.6, Internet Explorer 7.0.6 and Google Chrome 1.0.

Below are the characters that I have used and their byte representation in various character encodings.

Char   Unicode   ISO-8859-1   CP-1252   ISO-8859-7   UTF-8
é      U+00E9    E9           E9        -            C3 A9
€      U+20AC    -            80        A4           E2 82 AC
—      U+2014    -            97        -            E2 80 94
α      U+03B1    -            -         E1           CE B1
λ      U+03BB    -            -         EB           CE BB

Below are the results of the characters I tested in the query string of a URI typed in the location bar:

Char(s)   IE     Firefox        Chrome
é         %E9    %E9            %C3%A9
€         %80    %80            %E2%82%AC
—         %97    %97            %E2%80%94
α         a      %CE%B1         %CE%B1
λ         ?      %CE%BB         %CE%BB
éα        %E9a   %C3%A9%CE%B1   %C3%A9%CE%B1

A puzzling URI encoding affair

Friday, February 6th, 2009

While working on a URI template/pattern Java library, I started looking into the details of URI encoding.

I have realised that browsers make puzzling decisions when it comes to encoding URIs that the user types in the location bar. They seem to use different encodings for the different parts of the URI.

Consider the following URL:

  http://acme.com/café?variété=#caféine

This isn’t a valid URI because the URI generic syntax (RFC 3986) requires any character that is not 7-bit ASCII to be percent-encoded. But it is perfectly legitimate from the user’s perspective: users do not know about RFC 3986, and they should not, so the browser should do the work on their behalf.

The W3C has indicated how non-ASCII characters in URI attribute values should be handled in an appendix of the HTML specifications:

Although URIs do not contain non-ASCII values (see [URI], section 2.1) authors sometimes specify them in attribute values expecting URIs (i.e., defined with %URI; in the DTD). For instance, the following href value is illegal:

<A href="http://foo.org/Håkon">...</A>

We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

  1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
  2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

This procedure results in a syntactically legal URI (as defined in [RFC1738], section 2.2 or [RFC2141], section 2) that is independent of the character encoding to which the HTML document carrying the URI may have been transcoded.
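For what it is worth, the two-step convention quoted above is easy to implement by hand. Below is a minimal Java sketch; the helper name is mine, and it deliberately ignores the separate question of reserved ASCII characters:

  import java.nio.charset.StandardCharsets;

  // Sketch of the HTML appendix rule: represent the string in UTF-8,
  // then percent-escape every non-ASCII byte as %HH.
  public class W3cConvention {
      static String escapeNonAscii(String s) {
          StringBuilder out = new StringBuilder();
          for (byte b : s.getBytes(StandardCharsets.UTF_8)) {   // step 1: UTF-8 bytes
              int v = b & 0xFF;
              if (v < 0x80) {
                  out.append((char) v);                          // ASCII left as-is (sketch only)
              } else {
                  out.append(String.format("%%%02X", v));        // step 2: %HH
              }
          }
          return out.toString();
      }

      public static void main(String[] args) {
          System.out.println(escapeNonAscii("http://foo.org/H\u00E5kon"));
          // -> http://foo.org/H%C3%A5kon
      }
  }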

I thought this rule would apply similarly to addresses typed in the location bar. But both Firefox and Internet Explorer (and apparently Opera too) will generate the following URI for the HTTP request:

  http://localhost/caf%C3%A9?vari%E9t%E9=#caf%C3%A9ine

The ‘é’ (U+00E9) has been escaped differently:

  • The path and fragment part have a percent encoded version of the UTF-8 byte encoding: C3 A9.
  • The query string however was encoded using ISO-8859-1 (or a superset): E9.

This leaves us with a URI that uses two different character encodings — what a mess!
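A small Java sketch of the server’s predicament with such a request: no single character set decodes both parts of the URI correctly (the class name is mine; java.net.URLDecoder stands in for whatever the server framework actually uses).

  import java.net.URLDecoder;

  // Decoding the two differently-encoded parts of the request above.
  public class MixedEncodingMess {
      public static void main(String[] args) throws Exception {
          System.out.println(URLDecoder.decode("caf%C3%A9", "UTF-8"));        // café
          System.out.println(URLDecoder.decode("vari%E9t%E9", "UTF-8"));      // vari?t?  (U+FFFD twice)
          System.out.println(URLDecoder.decode("vari%E9t%E9", "ISO-8859-1")); // variété
          System.out.println(URLDecoder.decode("caf%C3%A9", "ISO-8859-1"));   // cafÃ©
      }
  }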

Google Chrome however produces the URI that the W3C prescribes by using UTF-8 for all parts of the URL.

Strato-cumulus

Wednesday, January 28th, 2009

It is only after reading Jason Scott’s F*ck the Cloud that I realised that my two previous posts were in fact touching upon the same subject from two different angles. Though I do not necessarily agree with all of Jason’s points, since his definition of “The Cloud” seems a little vague, he makes several good ones.

I have to admit I am still not sure what “The Cloud” is — people seem to have many different views, a bit like with Web2.0. I note that the Wikipedia entry for cloud computing is move-protected due to vandalism, and a large number of techies prefer surrounding the term with inverted commas.

To clarify this post, I will refer to “the cloud” as the collection of software and hardware services which, by using the Internet as the underlying infrastructure, enable data to live outside the home/office space. It therefore relies on SaaS, virtualisation and Web2.0 to make it happen. This definition includes GMail, blog platforms, social networks as well as Amazon EC2. To me, the term is simply a convenient way to refer to the current trend in Web development; even if, given the lack of integration and interoperability, we should really use the plural form…

In my post on Google’s approach to business rationalisation, I was looking at the service provider’s end. I was wondering about the effect of shutting down Web services for a company which is actively promoting cloud computing. Because companies like Google and Amazon are at the steering wheel, people are watching their moves; especially service providers in search of a good business model. Freemium might be the way to go because it allows the service to reach critical mass, but I am sure that other models will emerge.

What I was implying was that providers are not only selling productivity, they are selling trust as well. The issue of trust isn’t new, but when you have control over the software and hardware, it is easier to take responsibility for the data. When users lose the direct control of their data, trust seems vital. After all, there could be thorny legal issues regarding data retention, liability, etc… At the moment, providers take no responsibility (read the fine print!) which makes it theoretically risky to utilise “the cloud” for anything that is mission critical or sensitive.

But people are bad risk assessors, and if “the cloud” solves their problem, they will embrace it. As Dare Obasanjo mentioned on his blog, given the massive adoption, trust might already be an issue that is ready to evaporate. To follow on his example, it took a few decades for people to realise that seat belts and airbags might be good ideas, and drink driving and speeding not such good ones… The fatality rate did not deter people from using cars: gradually, manufacturers made the cars safer and traffic authorities enforced rules and educated people.

In my other post, I mentioned an article published in a French magazine that reconstructed the life of an Internet user from all the information that he had left on the Internet. What I found interesting was that people were putting so much data, and therefore so much faith, in “the cloud”. Of course, in the case of social applications such as Facebook or Twitter, the data is generally trivial and can hardly be considered mission critical or sensitive — although a lot of people would not appreciate losing their list of friends, photos, business contacts, etc…

I was pointing out that the coming generation, not trailing the same history of problems as the previous ones, is making anyone 30+ sound grumpy — in fact most of the criticism was coming from experienced professionals. There was a time when people would print everything they typed because their hard drive was not safe enough; nowadays, they say that only their hard drive (for personal users) or their data center (for businesses) is safe enough, and they see “the cloud” as a big fluffy thing that will disappear. Maybe they would appreciate my old dot matrix printer.

My guess is that users will continue to take advantage of “the cloud”, and they will learn to decide what data is important. Businesses will learn too, and because they are better risk assessors, they will pay a premium for better guarantees and service when needed. And providers will probably start providing better interoperability, and continue to adapt their services to a growing demand.

Trust (or lack thereof) did not affect adoption, but risk awareness eventually changed the behaviour of users and manufacturers. In that regard, what happened with the car industry will happen with the Web.

“The cloud” is no silver bullet; we just need to understand better when it is appropriate to use it. The term will gradually disappear though, but only because it is a silly one.

The Life Of – related articles

Tuesday, January 27th, 2009

A few articles worth mentioning relating to my previous post “The Life Of”.

From the New Scientist:

From ReadWriteWeb:

Interestingly, I just saw this:

I might add a few more later…

Google’s new year clean-up

Wednesday, January 21st, 2009

There have been many reactions to Google’s announcements that they will cut back on services and shelve several of their pet projects. I have to admit that I am actually not familiar with several of the services being shut down, and there are already lots of well-informed comments on each individual project, so this is more of a general comment.

I understand the need for a company to focus on its core assets and shut down products which bring little revenue, especially during a downturn. In fact, I wonder why it took them so long to deal with Google Video given that YouTube is a lot more popular and offers a similar service. It was a redundant service, so I guess they could have merged it or removed it a lot earlier — they didn’t have to wait for the credit crunch.

I am sure that the same could be said for other services.

But that did not seem to be the case for several other services that were given the knife. A lot of bloggers felt that projects like Jaiku or Google Notebook had a lot of potential and were never given a real chance, which led to speculation about Google’s intentions in the first place — was it just to acquire new talented development teams? If that is the case, it confirms that Google is adopting conventional strategies. Fine.

But I wonder if Google may also be shooting themselves in the foot…

Rationalising a network of services

A few years ago, I became really interested in small world networks; one example stuck in my mind because it was counter-intuitive. In one of the books there was a chapter about transport optimisation, explaining how having some lines operate at a loss enabled other lines to be more profitable, thereby generating an overall profit.

There were several examples of transport networks eventually collapsing because, instead of evaluating the network as a whole, lines were evaluated individually and independently. When non-profitable lines were chopped off or the frequency of service was reduced to reflect attendance, users would start looking at other options. But they would also stop using other lines, which would then become less profitable. As the rationalisation process continued, more lines would be chopped off until only a handful of profitable lines survived, or none at all. Globally though, the network would not generate any more revenue; it would simply provide a lot less service, and not necessarily more efficiently.

The structural properties of the network were what made the whole structure sustainable, and the non-profitable lines were not valued properly. The network had to be rationalised as a whole to be made more efficient, not purely assessed line by line: management had failed to understand that these services were interdependent and what it meant to remove them. Commuters did not want to go just from one station to another, they wanted to go from one place to another, and would choose the transportation means that best suited their needs.

What does that have to do with Google?

I certainly don’t think that Google is going to shrink anytime soon due to some drastic cost-cutting — they are smarter than that and I am sure that they do consider each project carefully. And of course, they have every right to shut down any service they please, they run a business after all and most of these services are provided for free.

But I think they may be neglecting the interdependence between their products and the relationship with their users. People don’t want to use just that particular service, they are trying to find solutions to their problems. Google uses the Web to provide solutions and I believe that the interconnected nature of the Web should make them consider business rationalisation differently.

One reason why we use Google services instead of others is not necessarily because they are better, but because we feel Google is more reliable. A start-up could go bust, and since I don’t want to lose my data or change my habits, I will be more hesitant before committing to one of their services. As Pisani mentioned on Transnets (in French), there is a moral contract between users and Google: by interrupting some web services, Google reminds us that maybe we should not entrust the Internet giant with all our online documents, emails, blogs, videos, feeds, applications, etc…

We are in an interesting transition period where the Web is supplanting the Operating System as a development platform. By shutting down these services even if they operate at a loss, Google is giving a cold shower to those who believe in moving applications to the web and were counting on Google to spearhead the move. The trend towards web-based applications is not going to stop, but we now have a reason to think twice before using Google’s services and their APIs.

I am still glad that there are companies out there like Google to break new ground, but their recent house cleaning is a good reminder that if we follow, we may also need to backtrack.

The Life of …

Sunday, January 18th, 2009

There was an interesting article in Le Monde today referring to an article published in Le Tigre, an independent French magazine. The article in Le Tigre was simply the reconstructed biography of “Marc L” – a person the paper claimed to have chosen at random on the Internet, using Google and data collected from social web sites.

They posted a number of details about him such as his age, sexual preference, the schools he attended, the music he listened to, and his friends and partners over the last few years… It reads just like a mix between the people section of a newspaper and a Wikipedia entry.

The persistence of information

All the information was legally obtained, since it was publicly available on the Internet – although they claim many details were removed at his request. While we’re all aware (to varying degrees) of the trails we leave on the Internet, it is easy to forget that information we thought was transient is still there and can be collected to produce a portrait of us and extract information about us.

The conversation we had on a forum 5 years ago may still be there somewhere, and unflattering photos posted by friends may also still be around. Since the emergence of social websites, there have been many articles on the subject and on the impact of leaving too much information on these websites.

But as data becomes better organised, more searchable, and not so easy to remove, we have all the more reason to be careful about what we say on the Net. Let’s not forget Google’s mission to organise the world’s information and make it universally accessible. Information is not transient.

A false sense of privacy

An email is no more secure than a postcard, as two legal secretaries in a Sydney law firm were painfully reminded a few years ago when their incendiary email exchange was forwarded to pretty much everyone in the company before appearing in overseas newspapers.

But the so-called “Generation Y” who like to collect hundreds, if not thousands, of friends on FaceBook or MySpace and maintain personal blogs do not seem to mind. In fact, I am surprised by how much information people are willing to share with others — strangers and friends alike.

It seems that a lot of people are lulled into a false sense of privacy, not realising how much of the information they publish becomes public, and how much of it could be used against them.

Identity theft

Given that most of the information behind the typical “security question” can be gleaned from the Internet, we are increasingly vulnerable to identity theft, especially since a lot of people still use basic passwords, and generally the same one, to access all their accounts.

Anyone’s personal information is, as marketers would put it, “at your fingertips”, so new techniques are going to be needed to protect us from identity theft. And I hope that the options will be better than coming up with passwords that include combinations of symbols and numbers which are impossible to remember, or providing even more personal information about ourselves.

The future

All this makes me wonder how this will evolve.

It is not hard to imagine online reputation management software coming out to help people clean up their traces, possibly optimising their friends and links to improve their online identity, or removing what should not be there.

On the other side, you could have increasingly sophisticated automated online portraits for use by marketers and recruiters – primitive versions already exist. Or identity thieves could start collecting photos of us for when face recognition becomes more widely available.

So should we learn not to disclose too much information about ourselves, just like we learned not to undress in front of an open window? Or should we get used to watching the neighbour walking around naked?

Afterword: Internet speed

Monday, December 22nd, 2008

Following my post on Australia’s Internet filter, and looking around for other opinions on the subject, it seems that a lot of people were concerned about the speed of the Internet here and the potential network performance degradation once a filter is in place — especially after the results of the ISP-level Internet content filtering laboratory trials were released in July 2008.

Looking at the 2008 broadband rankings published by the ITIF last June, it seems that international comparison does not place Australia all that well at the moment when it comes to broadband speed. When I visited Japan, France and Finland last year, I noticed that the Internet was significantly faster. Here are the numbers that show why…

[Chart: average download speed in Mbps per country]

Admittedly, Finland and France have amongst the fastest broadband speeds in the world — 21.7 Mbps and 17.6 Mbps respectively; not to mention Japan with a whopping 63.6 Mbps. At these speeds, even if these countries were to put in place the worst ISP-based filter in terms of performance degradation (87%), their broadband speed would still be faster than in Australia (1.7 Mbps).
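Taking the trials’ worst-case figure of 87% degradation at face value, a quick back-of-the-envelope check:

  • Finland: 21.7 Mbps × (1 − 0.87) ≈ 2.8 Mbps
  • France: 17.6 Mbps × (1 − 0.87) ≈ 2.3 Mbps
  • Japan: 63.6 Mbps × (1 − 0.87) ≈ 8.3 Mbps

All three would remain above Australia’s 1.7 Mbps average.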

Hopefully, the National Broadband Network will address the performance issue, but in the meantime we have to deal with sluggish connections, hoping that an Internet filter will not make them even slower.

Australia’s Internet filter

Saturday, November 15th, 2008

After a year-long trip that took me to countries such as Syria, Egypt and China, where Internet censorship keeps a whole bunch of public servants busy, I was certainly not impressed to learn that Australia was going ahead with an Internet filtering system and that tests were under way. In fact, I originally thought it was a joke! But I wasn’t sure whether I should be amused or concerned.

From my experience of browsing the Web in these countries, the task is futile. For one, censors – computers or humans – are always too late; they simply cannot keep up with the amount of information out there. And even when they do, there are so many ways around the blocks that I am not sure it is even worth trying. In Syria, it took less than a minute for the guy minding the Internet cafe in Damascus to give me access to Facebook (blocked at the time) via one of the numerous proxies they use.

If people go to the trouble of looking for “inappropriate content”, they will also be savvy enough to find a proxy or set one up. I bet that as soon as the Internet filter is declared effective, there will be just as many ways to circumvent it posted on the net. And there will always be ways around it, for the simple reason that secure protocols and cryptography cannot be made illegal – otherwise it would be the end of secure banking and e-commerce, notwithstanding a serious encroachment upon our privacy.

Adverse effects

Without even going into the ethical aspects, I can think of many other adverse effects to an Internet filter scheme:

  • False positives – how many legitimate sites will risk being blocked?
  • Internet speed, already below the standard of comparable countries – how will this affect the performance of the Internet?
  • The black list, even secret – will it risk becoming a reference list for people looking for Internet content?
  • Censorship creep – how can we ensure that the black list will not be used for blocking other sites? How do we ensure that the scope of the censorship scheme will not be extended to cover other areas which have nothing to do with protecting children from harmful content?
  • More monitoring is less monitoring – will it drive people accessing “illegal content” to tighten their security and anonymity?
  • Blame shifting – How can someone clear their site if it has been a victim of an attack which inserts links to “illegal content”? Would the same happen if your site shares the same IP as a site that has been blocked?
  • And let’s not forget that there is nothing more appealing for teenagers than something they are not allowed to do or see…

Back when I travelled in Syria, I remember meeting a programmer who also made a living out of software piracy – quite common there due to the US export ban. He explained that most of his work consisted of removing the potential security threats from hacked programs so that he could sell them. Without going at length into computer security, these threats are designed to infect computers so that they propagate links to porn sites, poker sites, and the like; in other words, legitimate websites are turned into hosts for less legitimate content, shifting the blame and risk onto people who have nothing to do with it.

A friend of mine was recently blackmailed by an attacker who launched a DoS attack and wanted to include hidden porn and online gaming links on my friend’s portal (which serves thousands of blogs). The site was down for a few hours with all the engineers working frantically to repel the attack.

These two examples are certainly not an excuse for not taking security seriously, but as any administrator knows, anyone can be a victim of this kind of attack and end up with links to inappropriate content. When you have been battling an Internet attack, I doubt that you’re in the mood for battling with ACMA to try to get your site removed from the black list.

The Finnish experiment

It is often mentioned that the scheme will cause Australia to join the ranks of North Korea, Burma, Iran, China, Cuba, Belarus and Syria – states which are hardly known for their progressive policies. It is true that finding yourself in such company is generally a sign of bad policy, even if I trust the Australian government a whole lot more than the Syrian one.

But it is interesting to know that Finland has also experimented with an Internet filter, and it seems that the results are less than conclusive. This article on Finnish Internet censorship by EFFI, a Finnish online civil rights organisation, outlines several of the shortcomings, as does this one. The Finnish scheme seems to have shown several of the side effects listed above: some legitimate sites were blocked; the black list circulated on the Net; and there were plans to include more sites for other reasons (such as hate speech, breach of copyright and online gambling), which were not in the scope of the original law.

It is worth noting that the Finnish scheme does not seem to have led to any arrests in relation to child pornography: since most of the blocked sites are outside Finland, they cannot be reached by law. The law itself, however, as is the case in Australia, is sufficient to prosecute offenders regardless of the presence of an Internet filter.

Little debate, a lot of expense

All that said, I haven’t been able to find much reliable information about the Australian plan apart from the fact that $125.8 million will be dedicated to cyber-safety over the next four years. I received an email from GetUp without any link to the relevant information. Apart from a couple of articles and opinion pieces from the BBC and ABC, I could not find much detail about the plan either, though a lot of people have commented on the subject in the blogosphere.

I tend to believe that there must be more efficient and creative ways of fighting the distribution of inappropriate content online. No-one denies that the Internet brings new challenges to the access of information, but I am sceptical that ISP filtering is an effective way to address the issue.

It will be interesting to see how this debate evolves in Australia, but I wish there was more information on the scheme and a little more debate in the public sphere, before large sums of money are committed to a plan that is legally tricky, ethically debatable and technically impractical.