
"mbox" is a family of several mutually incompatible mailbox formats - gnosis
http://homepage.ntlworld.com./jonathan.deboynepollard/FGA/mail-mbox-formats.html
======
JonnieCache
Interesting URL there. A dot after the tld, and hosted on ntlworld, which was
once the UK's most hated ISP (now owned by Virgin and hated very slightly
less.)

~~~
tententen
_A dot after the tld_

That's actually how it's supposed to be done:

<http://www.dns-sd.org/TrailingDotsInDomainNames.html>

"It's a little-known fact, but fully-qualified (unambiguous) DNS domain names
have a dot at the end. People running DNS servers usually know this (if you
miss the trailing dots out, your DNS configuration is unlikely to work) but
the general public usually doesn't. A domain name that doesn't have a dot at
the end is not fully-qualified and is potentially ambiguous. This was
documented in the DNS specification, RFC 1034, way back in 1987 ..."

but the site works just fine without the trailing dot as well.

~~~
djcapelis
> but the site works just fine without the trailing dot as well.

In your current environment! And probably in all the environments connected to
the Internet. But potentially without the dot it would be perfectly valid for
this to route elsewhere in someone's environment. :)
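
A rough sketch of why the dot matters, assuming a resolv.conf-style search
list. The function name `candidate_names` and the search domain `corp.example`
are made up for illustration, and real resolvers have more options than this
simplified model covers:

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the fully-qualified names a stub resolver might try, in order.

    Simplified model of search-list handling; not any particular resolver's
    exact behaviour.
    """
    if name.endswith("."):
        # Trailing dot: the name is already fully qualified, so the search
        # list is never consulted.
        return [name]

    absolute = name + "."
    searched = [f"{name}.{domain.rstrip('.')}." for domain in search_domains]

    if name.count(".") >= ndots:
        # Enough dots: try the name as written first, then the search list.
        return [absolute] + searched
    # Too few dots: the search list is tried before the name as written.
    return searched + [absolute]


if __name__ == "__main__":
    print(candidate_names("homepage.ntlworld.com.", ["corp.example"], ndots=5))
    print(candidate_names("homepage.ntlworld.com", ["corp.example"], ndots=5))
```

With a search list of `["corp.example"]` and `ndots=5`, the dotless
`homepage.ntlworld.com` is tried as `homepage.ntlworld.com.corp.example.`
first, while the version with the trailing dot is only ever looked up as
written.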

~~~
easp
Nitpick: "resolve elsewhere," not "route elsewhere"

~~~
djcapelis
Upvote for correction. Thanks!

------
riobard
A side question: what's the best way to build an email message store which
stores millions of messages (and adding tens of thousands on average each day)
on disk so that it can support:

1) Fast looking up messages sent/received by a particular email address, time,
tag, priority, and a handful of other properties

and

2) Appending thousands of new messages per second at peak?

~~~
JoshTriplett
Maildir (or a very similar one-message-per-file store) plus an index of some
kind.

If not for the continuous addition of messages, I'd suggest notmuch
(<http://notmuchmail.org/>), which does a great job of indexing in any way you
could want. Unfortunately, as far as I know notmuch does not have an "online"
indexing mechanism; someone wrote a notmuch-deliver that adds a message and
indexes it, but I doubt it would meet your performance requirements. You could
probably make notmuch do what you want, but you might want a new indexing
mechanism instead.

I still think you want something very similar to Maildir underneath though.

~~~
riobard
I was using the one-message-per-file approach, but keeping millions of small
files in a filesystem is quite a challenge: I have to worry about available
inodes, and listing/manipulating these files takes minutes!

~~~
mike-cardwell
I think the best system would be one in between mbox and maildir. You specify a
filesize limit of say "10MB". Emails are repeatedly added to the same file,
until it hits 10MB, then a new file is created. Emails larger than 10MB would
not be split over multiple files. An index is kept of all of the messages.
When a message is deleted, rather than immediately removing it from the file
and blocking, as most mbox implementations do, it should just be flagged as
deleted in the index. A background process can clean up these messages when
the file is not in use.
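
A minimal sketch of that idea, under some assumptions: the class name
`SegmentStore`, the file naming, and the in-memory index are all hypothetical,
a segment rolls over when the next message would push it past the cap, and the
background cleanup process is left out entirely:

```python
import os


class SegmentStore:
    """Append-only message store split into size-capped segment files."""

    def __init__(self, directory, max_segment_bytes=10 * 1024 * 1024):
        self.directory = directory
        self.max_segment_bytes = max_segment_bytes
        self.index = {}       # message id -> (segment number, offset, length)
        self.deleted = set()  # ids flagged as deleted; data stays on disk
        self.segment = 0
        self.next_id = 0
        os.makedirs(directory, exist_ok=True)

    def _path(self, segment):
        return os.path.join(self.directory, f"segment-{segment:06d}.dat")

    def append(self, message: bytes) -> int:
        path = self._path(self.segment)
        size = os.path.getsize(path) if os.path.exists(path) else 0
        # Start a new segment if this message would exceed the cap. A message
        # larger than the cap goes into its own segment and is never split.
        if size > 0 and size + len(message) > self.max_segment_bytes:
            self.segment += 1
            path = self._path(self.segment)
            size = 0
        with open(path, "ab") as f:
            f.write(message)
        msg_id = self.next_id
        self.next_id += 1
        self.index[msg_id] = (self.segment, size, len(message))
        return msg_id

    def read(self, msg_id: int) -> bytes:
        if msg_id in self.deleted:
            raise KeyError(msg_id)
        segment, offset, length = self.index[msg_id]
        with open(self._path(segment), "rb") as f:
            f.seek(offset)
            return f.read(length)

    def delete(self, msg_id: int) -> None:
        if msg_id not in self.index:
            raise KeyError(msg_id)
        # Only flag the message; a later compaction pass would reclaim space.
        self.deleted.add(msg_id)
```

A million small messages end up in a few hundred files instead of a million,
which is the whole point as far as inodes go. A real version would need a
persistent index and some locking around appends.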

~~~
JoshTriplett
Seems like in such a system you've gone to a lot of work to recreate the
purpose of a filesystem: to keep the data of different files separate from
each other. Why do you want to keep emails together in 10MB chunks?

~~~
mike-cardwell
To address the limited inode issue

------
Karellen
See also JWZ's article on the "Content-Length" header (mboxcl and mboxcl2
formats from the original article)

<http://www.jwz.org/doc/content-length.html>

------
kijin
One way to get around the different escaping rules between mboxo and mboxrd is
to insert a space (0x20) at the beginning of any line that could be
misinterpreted. This eliminates the need for any further escaping. Leading
whitespace is almost always ignored by mail clients when displaying the
message, especially when the text is flowed.

I've received e-mails that were formatted this way, though I'm not
sure exactly which software, whether client or server, is responsible for
doing it.
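
To make the difference concrete, here is a minimal sketch of the three
approaches applied to a message body split into lines. The mboxo and mboxrd
rules follow the linked article; the function names are hypothetical, and the
leading-space variant is only one reading of the technique described above:

```python
import re

# Matches a line that mboxrd must quote: zero or more ">" followed by "From ".
QUOTED_FROM = re.compile(r"^>*From ")
# Matches a line that an mboxrd reader unquotes: one or more ">" then "From ".
ESCAPED_FROM = re.compile(r"^>+From ")


def escape_mboxo(lines):
    """mboxo: prefix ">" only to lines that start with "From "."""
    return [">" + line if line.startswith("From ") else line for line in lines]


def escape_mboxrd(lines):
    """mboxrd: prefix ">" to "From " lines and to already-quoted ones."""
    return [">" + line if QUOTED_FROM.match(line) else line for line in lines]


def unescape_mboxrd(lines):
    """Reverse escape_mboxrd by stripping exactly one leading ">"."""
    return [line[1:] if ESCAPED_FROM.match(line) else line for line in lines]


def escape_leading_space(lines):
    """Leading-space variant: prefix a space (0x20) to "From " lines."""
    return [" " + line if line.startswith("From ") else line for line in lines]


if __name__ == "__main__":
    body = ["From here on it gets odd", ">From a quoted reply", "plain text"]
    print(escape_mboxo(body))
    print(escape_mboxrd(body))
    print(escape_leading_space(body))
```

Note that mboxo turns `From ...` into `>From ...`, which a reader can no longer
tell apart from a line that was `>From ...` to begin with. mboxrd avoids that
by quoting both, and the leading-space variant sidesteps it by never touching
`>` at all.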

~~~
dsr_
Now you have n+1 standards to keep track of.

As a side effect, this breaks every reader that is unaware of it and is
handling binary content in mail (many, many messages) or languages or
encodings that are space-sensitive.

~~~
kijin
Binary content and non-ASCII characters are almost always base64-encoded, and
base64 never produces a From_ sequence (because of the space).

Anyway, I'm not endorsing the leading-space technique. I'm just observing
that it's out there in the wild.

------
slaven
Another wrinkle: some mailers stuff full 8-bit chars into mbox which will
choke some parsers and end parsing prematurely (most expect 7-bit ASCII).

But it beats mdir handily in one way: if you're dealing with mailboxes with
100s of thousands of messages it is way more efficient.

~~~
ZenPsycho
Way more efficient along which dimension? Space, or processing speed? By
roughly how much? Enough to be worth the costs in potential data loss?

~~~
slaven
Each mdir email is a file, so depending on your OS it consumes disk fragments
very quickly and relies on your OS file handlers to allocate and load
messages. mbox as a mapped file and a good index blows it away in speed and
space usage.

~~~
__david__
Very true, but reclaiming space in the middle of a huge mailbox is
dramatically worse in the mbox case.

Doing time-machine style rsync backups is also dramatically more space
efficient in the maildir case.

~~~
ars
Why dramatically more? rsync does a perfectly fine job transmitting and
storing just the diffs on an mbox file.

~~~
__david__
With the time-machine style backups, every time you append an email to your
mbox the whole thing gets backed up. Yes, the rsync protocol makes the wire
transmission efficient (only really matters if your backup server is remote)
but the fact that your 2Gig mbox file gets backed up every night instead of
hard-linked means it's _not_ space effective on your backup disk.

A mail-dir, on the other hand, works nicely with this type of backup. All the
old mails never change so they get hard-linked and only the new mails take up
any new space on the backup.
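
A minimal sketch of this kind of hard-link snapshot, similar in spirit to what
rsync's `--link-dest` option does. The function name `snapshot` and its
arguments are hypothetical, and it compares file contents byte by byte, which
a real backup tool would likely avoid:

```python
import filecmp
import os
import shutil


def snapshot(source, previous, current):
    """Copy `source` into `current`, hard-linking files unchanged since
    the `previous` snapshot. Returns the number of bytes actually copied."""
    copied_bytes = 0
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        dest_dir = os.path.normpath(os.path.join(current, rel))
        os.makedirs(dest_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            new = os.path.join(dest_dir, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                # Unchanged since the last snapshot: share the same data
                # on disk via a hard link.
                os.link(old, new)
            else:
                # New or modified: the whole file is stored again.
                shutil.copy2(src, new)
                copied_bytes += os.path.getsize(new)
    return copied_bytes
```

Run against a maildir, each nightly snapshot only copies the messages that
arrived that day. Run against a single mbox file, any append makes the
comparison fail, so the whole file is copied every time.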

~~~
ars
"time-machine style backup" seems to be a marketing term, not a type of
backup.

Do you mean a hard link style backup? Where diffs are never stored, you just
keep making hard links of the files? And if they change even one byte you
store the file fresh?

There are plenty of solutions to this, the simplest being storing reverse
diffs - so the latest file is stored plain, but older ones can be generated
from diffs (since doing so is rare).

You can also use modern filesystems like btrfs that can do COW and block
deduplication and store only changed blocks in such a way that the file appears
to be complete, but actually stores only what changed.

