Did you ever wonder why URLs look like they do?
With a bit of common sense, I found it not too hard to come up with
reasonable explanations. Please note that all of this is speculation,
but I hope it’s right, because it’s just the sensible way to do it
(please write me if you can contribute).
Remember, back before the invention of the web, there were no URLs at
all. FTP sites were shared as free-text and transferred to the
command-line client using copy and paste (at most). Then, you had to
go to the directory you wanted and issue a fetch.
TimBL’s genius idea now was to find a way to address everything on
the Internet. He didn’t want to limit it to the web, but also wanted
to address FTP, Gopherspace and some even more obscure systems
that were in use back then.
Consider this URL:
http://foo.org/bar/baz?aleph=1&beta=2
Since URLs were meant to be uniform, there had to be a way to
determine the protocol to use, and the protocol had to be separated
from the host name in some way: a colon is the common way to say “this
is that”, and that’s why http: comes first.
Host names like foo.org, of course, existed long before. DNS was
invented in 1983, and was a reasonable thing to build on. However, it
was often criticized for being “the wrong way ‘round”, with the
most significant part of the hierarchy last (the top-level domain org).
Having a Unix background, Tim decided to keep the path hierarchy like
in Unix, and that’s why it is /bar/baz. It works well for FTP, too.
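To make the pieces visible, here is a small sketch that takes the example URL apart with Python’s standard urllib.parse module. The variable names (parts, host_labels, path_segments) are my own choice and only serve as illustration.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://foo.org/bar/baz?aleph=1&beta=2")

print(parts.scheme)  # the protocol, i.e. everything before the first colon
print(parts.netloc)  # the host name
print(parts.path)    # the Unix-style path

# DNS puts the most significant label last, paths put it first.
host_labels = parts.netloc.split(".")
path_segments = [segment for segment in parts.path.split("/") if segment]

print(host_labels)
print(path_segments)
```

Reading host_labels from right to left and path_segments from left to right gives one single hierarchy from the top-level domain down to the file.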
Remember, when HTTP/0.9 was state of the art, there was only GET.
(Actually, the initial design shows some other methods.)
However, Tim quickly discovered that serving static data all day was
boring: there had to be a way to, for example, make a search page.
And there was. Hands up, who remembers <ISINDEX>? Add this
older-than-the-stones tag to your HTML and the browser will
automatically place a text box and a button to allow you to call a
search script.
How is the data transferred? Having only GET, the data has to go into
the URL. Which character would you use to delimit the path and the
query? Indeed, a question mark.
Now, <ISINDEX> directly specifies the query string, so in the
beginning, it was more common to see URLs like
http://goo.org/search?meaning+of+life
Later, when forms were added, there had to be a way to specify field
names (what could be more logical than = for this purpose? Maybe :…),
and a way to separate multiple fields (& sounded like a good idea, but
it says a lot about the state of SGML parsing back in the old days:
& also starts an entity reference, so inside HTML it has to be written
as &amp;).
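The two query styles can be compared side by side. This is only a sketch using Python’s standard urllib.parse module, fed with the two example URLs from above; the names keywords and fields are made up for the illustration.

```python
from urllib.parse import parse_qsl, unquote_plus, urlsplit

# ISINDEX style: the whole query string is the search text.
isindex_query = urlsplit("http://goo.org/search?meaning+of+life").query
keywords = unquote_plus(isindex_query).split(" ")

# Form style: "&" separates the fields, "=" separates name and value.
form_query = urlsplit("http://foo.org/bar/baz?aleph=1&beta=2").query
fields = parse_qsl(form_query)

print(keywords)
print(fields)
```

Note that nothing in the URL itself says which of the two styles is in use; the script on the server simply has to know what to expect.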
Whitespace in URLs was a problem! How can you know where the URL ends?
CGI defines + to stand in for a space, but then, how do you transmit a +?
Obviously, URL escaping had to be %69%6e%76%65%6e%74%65%64. This is where
things got ugly… for example, you really can’t tell whether a string
has already been URL-escaped. Big fun. Why percent
encoding? Probably because the percent sign was not used yet?
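Both the escaping and the “has it already been escaped?” problem are easy to try out. The following is a minimal sketch with Python’s standard urllib.parse module; the sample strings "1 + 1" and "100%" are my own and not taken from any specification.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# Decode the escaped word from the text above.
word = unquote("%69%6e%76%65%6e%74%65%64")
print(word)

# "+" stands for a space, so a literal "+" has to be percent-escaped.
encoded = quote_plus("1 + 1")
print(encoded)
print(unquote_plus(encoded))

# Escaping is not idempotent: escaping twice escapes the "%" again.
once = quote("100%")
twice = quote(once)
print(once, twice)

# Given only the string in `once`, nothing tells you whether it is raw
# text that still needs escaping or escaped text that needs decoding.
print(unquote(once), unquote(twice))
```

The last two lines are the ugly part: the same string is a perfectly valid input for both quote and unquote, so the caller has to keep track of its state.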
Finally, I admit, I have a problem. I can’t figure out at all why
there is a // between the scheme and the host name. I suppose it was
meant for something special (e.g. some URIs/URNs don’t have it), but
what is it good for in HTTP URLs?
Update 17mar2009: tbl doesn’t know himself.
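For what it’s worth, here is how one parser treats the difference. This sketch only shows the behaviour of Python’s standard urllib.parse module and does not explain the original design decision; the address someone@foo.org is made up.

```python
from urllib.parse import urlsplit

with_slashes = urlsplit("http://foo.org/bar/baz")
without_slashes = urlsplit("mailto:someone@foo.org")

# With "//", the part up to the next "/" is taken as the host.
print(repr(with_slashes.netloc), repr(with_slashes.path))

# Without "//", there is no host part; everything lands in the path.
print(repr(without_slashes.netloc), repr(without_slashes.path))
```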
But apart from that, I think the design of URLs was perfectly
reasonable, wasn’t it?