leah blogs

March 2004

28mar2004 · Comparing operating systems by their email storage formats

Much of the design and philosophy of an operating environment can be inferred from the file formats it uses. In this note, I'll compare the file formats of the default mail programs of Unix, ITS and Windows.

On Unix, I take mail(1) as the reference mail program, since many other mail programs use the same file format, mbox; I'll also take a look at maildir. On ITS, I examine RMail and Babyl. Finally, on Windows, Outlook Express is the only mail program included by default.

The mbox format of Unix is very simple: it's a file with a number of RFC (2)822 messages separated by lines like

 From MAILER-DAEMON Tue Dec 17 16:53:03 2002

Lines starting with "From " (the word followed by a space) are taken as message delimiters.

The simplicity of this format speaks for Unix: the file is very easy to edit and manipulate, and it's just as easy to search it and retrieve messages from it.

However, as (unfortunately) so often in Unix, this was not thought through: what happens if the message itself contains a "From" line? Here, mbox(5) tells us what to do: In order to avoid misinterpretation of lines in message bodies which begin with the four characters "From", followed by a space character, the character ">" is commonly prepended in front of such lines.

What an ugly hack! And still, no one says what to do with lines that start with ">From"... (Incidentally, you sometimes see newspaper articles that contain the word ">From"...)
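To make the problem concrete, here is a minimal Python sketch of the quoting rule. The function names are my own, not taken from any mail program. The first pair implements the classic rule quoted above; the second pair implements the variant usually called mboxrd, which also quotes lines that already start with ">From " and is therefore reversible.

```python
import re

def quote_classic(body: str) -> str:
    """Prepend '>' to every body line that begins with 'From '."""
    return re.sub(r"^From ", ">From ", body, flags=re.MULTILINE)

def unquote_classic(body: str) -> str:
    """Naively undo the quoting: strip one '>' from every '>From ' line."""
    return re.sub(r"^>From ", "From ", body, flags=re.MULTILINE)

def quote_mboxrd(body: str) -> str:
    """Also quote lines that already look quoted ('>From ', '>>From ', ...)."""
    return re.sub(r"^(>*From )", r">\1", body, flags=re.MULTILINE)

def unquote_mboxrd(body: str) -> str:
    """Strip exactly one '>' from every line of the form '>...>From '."""
    return re.sub(r"^>(>*From )", r"\1", body, flags=re.MULTILINE)

original = "From here on it gets tricky.\n>From a quoted reply.\n"

stored = quote_classic(original)
restored = unquote_classic(stored)

# The classic rule cannot tell which '>From ' lines it created itself,
# so the round trip damages the second line.
print(restored == original)

# The mboxrd variant round-trips cleanly.
print(unquote_mboxrd(quote_mboxrd(original)) == original)
```

With the classic rule, the reader has no way of knowing whether a stored ">From " line was written that way by the sender or produced by the quoting.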

ITS mainly used two MUAs, RMail and Babyl. Both of them have almost the same structure and are very similar to mbox too. All messages are concatenated as RFC (2)822 messages, but they are separated by ^_ (ASCII 31, octal 037, also known as Unit Separator (US)) on a single line.

This is obviously the right thing: a special character was made for this purpose, so it's used. Furthermore, it's a non-printable character, so it would be encoded using quoted-printable or another MIME encoding anyway. (At least if that had existed back then.) Still, this format has all the good sides of mbox stated above.
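A rough sketch of how little code such a format needs, assuming the simplified layout described above (messages separated by a line containing only ^_); real RMail and Babyl files carry additional header information that I ignore here. The function name is made up for illustration.

```python
US = b"\x1f"  # ASCII 31, Unit Separator, written ^_

def split_messages(data: bytes) -> list[bytes]:
    """Split a folder into messages at lines that contain only ^_."""
    messages, current = [], []
    for line in data.splitlines(keepends=True):
        if line.rstrip(b"\r\n") == US:
            messages.append(b"".join(current))
            current = []
        else:
            current.append(line)
    if current:  # tolerate a missing final separator
        messages.append(b"".join(current))
    return messages

folder = (
    b"Subject: one\n\nFrom now on, no quoting is needed.\n"
    b"\x1f\n"
    b"Subject: two\n\n>From stays exactly as written.\n"
    b"\x1f\n"
)

parts = split_messages(folder)
print(len(parts))
```

Note that the message bodies are stored untouched: neither "From " nor ">From " needs any special treatment.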

Outlook Express, the default mail program on recent Windows versions, uses magical, binary and proprietary DBX files that are not readable by humans. Some tools and special libraries exist to access these files; the format, however, is neither open nor portable, and it is not used by any other program (except in input filters). This is the usual way of making a monopoly: first force the users to use something, and later force them to stay there, as they cannot switch (you cannot export your mails into some other format with Outlook Express).

All these formats have one thing in common: each of them stores a whole mailbox in a single file. This can easily cause data corruption, for example if several processes write to the same file at once. While that is not fatal in the case of mbox and RMail, Outlook Express files are likely to be fubar.

Therefore D. J. Bernstein invented a new way to store mail, the maildir format. Here, mail is stored (as the name says) in directories. Furthermore, maildir doesn't need locking, as two processes can write into the same directory concurrently. This helps a lot, as many networked file systems handle locking badly or not at all.

Basically, a maildir directory includes three subdirectories: tmp/, new/ and cur/. tmp/ holds messages while they are still being written; a finished message is then moved into new/ under a unique name. new/ and cur/ have exactly the same structure, except that new/ contains unread mail and cur/ contains mail already seen by the MUA: each holds one file per message, in RFC (2)822 format without any content escaping at all.
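A minimal sketch of a lock-free delivery into such a directory. The file name scheme (time, process id, counter, host name) and the use of rename() are simplifying assumptions of mine; the actual maildir specification is stricter about unique names and describes delivery in more detail.

```python
import itertools
import os
import socket
import tempfile
import time

_counter = itertools.count()  # distinguishes deliveries within one process

def deliver(maildir: str, message: bytes) -> str:
    """Write the message into tmp/, then move it into new/ in one step."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)

    host = socket.gethostname().replace("/", "_").replace(":", "_")
    unique = f"{int(time.time())}.{os.getpid()}_{next(_counter)}.{host}"

    tmp_path = os.path.join(maildir, "tmp", unique)
    with open(tmp_path, "xb") as f:  # "x": fail if the name already exists
        f.write(message)
        f.flush()
        os.fsync(f.fileno())

    # Readers only look at new/ and cur/, so they never see a half-written file.
    new_path = os.path.join(maildir, "new", unique)
    os.rename(tmp_path, new_path)
    return new_path

with tempfile.TemporaryDirectory() as root:
    deliver(root, b"Subject: hi\n\nFrom the body, unescaped.\n")
    delivered = os.listdir(os.path.join(root, "new"))
    leftover = os.listdir(os.path.join(root, "tmp"))

print(len(delivered), len(leftover))
```

Since every writer picks its own file name, two deliveries never touch the same file, which is why no locking is needed.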

maildir is available for and being used on many Unixes and clones, including GNU/Linux and various BSDs.

It truly is the best format of them all, without any hacks, while still being open, independent and easy to use. In fact, a user could read his mail without any MUA at all, using only the standard file utilities found on any system.

So, what can you learn from this comparison?

Looking at how elementary things are done, you learn a lot about how the rest of the stuff works. You immediately see whether it's closed, complex and opaque (Windows); open, simple and flexible, but not always well thought out (Unix); or open, simple, flexible and done as well as possible (ITS; please note that ITS didn't support nested directories, so maildir wasn't possible way back then).

And sometimes there's a new technology that is different, but better than everything before. Then go ahead and use it, and drop the old things, but keep compatibility with them (there is maildir2mbox), at least as much as possible and as long as it's reasonable.
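To illustrate the compatibility point, here is a sketch of such a conversion using the mailbox module from Python's standard library. This is not how maildir2mbox itself is implemented; the function name and the paths are made up for the example.

```python
import mailbox
import os
import tempfile

def maildir_to_mbox(maildir_path: str, mbox_path: str) -> int:
    """Append every message of a maildir to an mbox file; return the count."""
    source = mailbox.Maildir(maildir_path, create=False)
    target = mailbox.mbox(mbox_path)
    target.lock()  # mbox is a single shared file, so it needs a lock
    count = 0
    try:
        for message in source:
            target.add(mailbox.mboxMessage(message))
            count += 1
        target.flush()
    finally:
        target.unlock()
        target.close()
    return count

with tempfile.TemporaryDirectory() as root:
    maildir_path = os.path.join(root, "Maildir")
    md = mailbox.Maildir(maildir_path, create=True)
    md.add(b"Subject: hi\n\nFrom the body.\n")
    md.close()

    mbox_path = os.path.join(root, "inbox.mbox")
    converted = maildir_to_mbox(maildir_path, mbox_path)
    with open(mbox_path, "rb") as f:
        mbox_bytes = f.read()

print(converted)
print(mbox_bytes.decode())
```

The output shows both costs of going back to the old format: the converter has to lock the target file, and the body line starting with "From " comes out with a ">" in front.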

NP: The Overprivileged—Power Shift

Copyright © 2004–2022