September 2008


Did you ever wonder why URLs look like they do?

With a bit of common sense, it was not too hard to find reasonable explanations. Please note that all of this is speculation, but I hope it’s right, because it’s just the way that makes sense (please write me if you can contribute).

Remember, back before the invention of the web, there were no URLs at all. FTP sites were shared as free text, and their addresses were transferred to the command-line client by copy and paste (at best). Then you had to change to the directory you wanted and issue a fetch.

TimBL’s genius idea was to find a way to address everything on the Internet; he didn’t want to limit it to the web, but also to allow addressing FTP, GopherSpace and some even more obscure systems that were in use back then.

Consider this URL: http://example.org/bar/baz

Since URLs were meant to be uniform, there had to be a way to determine the protocol to use, and the protocol had to be separated from the host name in some way: a colon is the common way to say “this is that”, and that’s why it is http:. Host names, of course, existed long before: DNS was invented in 1983, and was a reasonable thing to build on. However, it was often criticized for being “the wrong way ’round”, with the most significant part of the hierarchy last (the top-level domain org). Having a Unix background, Tim decided to keep the path hierarchy like in Unix, and that’s why it is /bar/baz. It works well for FTP, too.
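As a sketch, Python’s standard urllib.parse decomposes a URL into exactly these parts (the URL here is a made-up example):

```python
# Split a URL into the parts discussed above: the scheme before the
# colon, the DNS host name, and the Unix-style path hierarchy.
from urllib.parse import urlsplit

parts = urlsplit("http://example.org/bar/baz")
print(parts.scheme)  # -> "http"
print(parts.netloc)  # -> "example.org"
print(parts.path)    # -> "/bar/baz"
```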

Remember, when HTTP/0.9 was state of the art, there was only GET. (Actually, the initial design already listed some other methods.)

However, Tim quickly discovered that serving static data all day was boring: there had to be a way to, for example, make a search page.

And there was. Hands up, who remembers <ISINDEX>? Add this older-than-the-stones tag to your HTML, and the browser will automatically render a text box and a button that let you call a search script.

How was the data transferred? With only GET available, the data had to go into the URL. Which character would you use to delimit the path from the query? Indeed, a question mark.
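As a sketch of that split (the URL and search term are made up), a naive partition on the first question mark recovers both sides:

```python
# The question mark delimits the path from the query; splitting on the
# first "?" separates them.
url = "http://example.org/search?chocolate"
path, _, query = url.partition("?")
print(path)   # -> "http://example.org/search"
print(query)  # -> "chocolate"
```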

Now, <ISINDEX> directly specifies the query string, so in the beginning it was more common to see URLs with just a bare search term after the question mark. Later, when forms were added, there had to be a way to specify field names (what could be more logical than = for this purpose? Maybe :…), and a way to separate multiple fields (& sounded like a good idea, though the fact that it strictly needs to be escaped as &amp; in HTML tells enough about the state of SGML parsing back in the old days).
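A quick sketch of the = and & convention, again with urllib.parse (the field names and values here are made up):

```python
# "=" joins a field name to its value, "&" separates fields; parse_qs
# reverses the encoding.
from urllib.parse import urlencode, parse_qs

query = urlencode({"name": "Tim", "q": "hypertext"})
print(query)            # -> "name=Tim&q=hypertext"
print(parse_qs(query))  # -> {'name': ['Tim'], 'q': ['hypertext']}
```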

Whitespace in URLs was a problem! How can you tell where the URL ends? CGI defines + as a surrogate for space, but then, how do you transmit a literal +? Obviously, URL escaping had to be %69%6e%76%65%6e%74%65%64. This is where things got ugly… for example, you really can’t tell whether a URL-escaped string has already been escaped. Big fun. Why percent encoding? Probably because the percent sign was not in use yet?
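A sketch of those escaping rules with urllib.parse (the input string is made up); the double-escaping ambiguity shows up directly:

```python
# "+" stands in for a space in query strings, so a literal "+" must be
# percent-escaped; escaping an already-escaped string mangles its "%"
# signs, and unescaping once then only gets you back one level.
from urllib.parse import quote_plus, unquote_plus

once = quote_plus("1 + 1")   # -> "1+%2B+1"
twice = quote_plus(once)     # "%" becomes "%25", "+" becomes "%2B"
print(unquote_plus(twice))   # gives back once, not the original
```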

Finally, I admit, I have a problem: I can’t figure out at all why there is a // between the scheme and the hostname. I suppose it was meant for something special (e.g. some URIs/URNs don’t have it), but what is it good for in HTTP URLs? Update 17mar2009: tbl doesn’t know himself.

But apart from that, I think the design of URLs was perfectly reasonable, wasn’t it?

Copyright © 2008–2021