
How Mailinator compresses its email stream by 90% - zinxq
http://mailinator.blogspot.com/2012/02/how-mailinator-compresses-email-by-90.html
======
jrockway
Mailinator is a great product. My favorite part about it is how whenever I
register for something, the clever form validation software always rejects the
mailinator.com email address. Then I visit mailinator, see their alternate
domain name du jour in image form (so bots can't harvest it, hah!), and then
that works perfectly. It makes me giggle with joy every time I do it.

It's also nice not receiving ads in the mail every hour of every day just
because I wanted to try some new YC-startup's product.

~~~
ohashi
I worry we're going to need disposable social IDs now too, like email. That's
something Facebook/Twitter probably won't appreciate. I would LOVE to try a
lot of services that require one of those logins, but am not comfortable
giving them access to my info.

~~~
xp84
Create an "alter ego" account on Facebook.
<http://www.fakenamegenerator.com/gen-random-us-us.php> if you need fake name
and address ideas, or just use whatever you wish your name was. If you get too
funny with the name though, FB's filter won't believe your name.

Then, keep a separate Chrome profile that's logged into the fake facebook, and
your identity is safe. Probably best to lock down all privacy settings on the
fake user to absolute maximum lockdown, to avoid information leaking onto
Facebook's public layer that may tie you to it.

That was the strategy I used when I worked for a company that did business
with Facebook games. Those things were so spammy, I would never want my actual
friends to be spammed with all the viral content. To this day I keep my fake
account for any and all untrusted FB-related things.

~~~
markokocic
Facebook now requires your phone number in order to register. Do you have a
way to use a fake phone number too?

~~~
lambada
I'm fairly certain this is untrue. Looking at the signup form, all I can see
is name, email, password, birthdate and gender. They may ask for it later, but
that almost certainly is optional.

~~~
zizee
This is sort of true. If you want to use the account as a developer account
you have to register your phone number. Probably also if you want to
advertise.

I recently wanted to create an "Admin" account for a client I was developing a
web-app for, as I didn't want to link my personal Facebook account to my work.
But with the phone number requirement, the account got flagged and rejected
from the system.

Frustrating!

------
davesmylie
I run a similar (though waaaay less popular) site. My mail is stored on disk
in a MySQL db, so I don't have quite the same memory constraints as
Mailinator.

I had originally created the site naively, stashing the uncompressed source
straight into the db. For the ~100,000 mails I'd typically retain, this would
take up anywhere from 800MB to slightly over a gig.

At a recent rails camp, I was in need of a mini project so decided that some
sort of compression was in order. Not being quite so clever I just used the
readily available Zlib library in ruby.

This took about 30 minutes to implement and a couple of hours to test and
debug. An obvious bug (very large emails were causing me to exceed the BLOB
size limit and truncating the compressed source) was the main problem there...

I didn't quite reach Mailinator's 90%: my database is now typically around
200-350MB, so about 70-80% compression. But I did manage to implement it in
about 6 lines of code =)
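
If anyone wants the same quick win on the JVM, here's a minimal sketch using
the JDK's built-in deflate (roughly what the Ruby Zlib call does; the
BLOB-limit check reflects the bug above, and the 16MB MEDIUMBLOB figure is
just an example):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;
    import java.util.zip.InflaterInputStream;

    public class MailBlob {
        static final int MAX_BLOB = 16 * 1024 * 1024;  // e.g. MySQL MEDIUMBLOB

        // Compress raw message source before the INSERT.
        static byte[] compress(byte[] raw) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DeflaterOutputStream out = new DeflaterOutputStream(
                     buf, new Deflater(Deflater.BEST_COMPRESSION))) {
                out.write(raw);
            }
            byte[] packed = buf.toByteArray();
            if (packed.length > MAX_BLOB)  // fail loudly instead of truncating
                throw new IOException("compressed mail exceeds BLOB limit");
            return packed;
        }

        // Inflate the stored bytes back to the original source on read.
        static byte[] decompress(byte[] stored) throws IOException {
            try (InflaterInputStream in = new InflaterInputStream(
                     new ByteArrayInputStream(stored))) {
                return in.readAllBytes();
            }
        }
    }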

~~~
paol
Can you tell us what the site is? I've once or twice found sites that blocked
even mailinator's alternate domains. Having a less well known alternative to
mailinator would be nice.

~~~
modoc
10MinuteMail.com is another option.

(biased creator here)

~~~
kamjam
There's also 20minutemail.com - random address, allows reply, forward and
download of message

yopmail.com - any address (like Mailinator) and allows download of attachments
directly

There are a ton of these available, just depends on your specific needs. I
regularly use them for signing up to those sites that need sign-up before
access to a file/article/image (e.g. forums).

I've also used them in testing web apps: using Selenium to grab the email
address, register a user, and then go back and check that the welcome email
had arrived.
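
In rough Java/Selenium terms the flow is something like this (a sketch only:
the element ids, URLs and form field names are all hypothetical placeholders,
and a real disposable-mail page would need its actual markup inspected):

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.firefox.FirefoxDriver;

    public class SignupSmokeTest {
        public static void main(String[] args) {
            WebDriver driver = new FirefoxDriver();

            // 1. Grab a throwaway address (element id is made up).
            driver.get("http://10minutemail.com/");
            String address = driver.findElement(By.id("mailAddress")).getText();

            // 2. Register on the app under test with that address.
            driver.get("http://app-under-test.example/signup");
            driver.findElement(By.name("email")).sendKeys(address);
            driver.findElement(By.name("signup")).click();

            // 3. Return to the inbox and check the welcome mail arrived.
            driver.get("http://10minutemail.com/");
            boolean arrived = driver.getPageSource().contains("Welcome");
            System.out.println("welcome mail arrived: " + arrived);

            driver.quit();
        }
    }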

------
hello_moto
Both blogs (Mailinator's and paultyma's) are awesome. I need more stuff like
this, rather than the typical Web 2.0 "how we use NoSQL" and cache-everything
posts (written without a clue how to do caching properly; 37signals' cache
solution is the rare exception, in line with Mailinator's techniques: smart
and elegant).

~~~
mseebach
It's awesome, yes - but it's worth keeping in mind that this trick works
because he has an intensely write-heavy application (he probably does orders
of magnitude more writes than reads). Very few applications have such a data
access pattern.

------
markbao
Great read. I wish there were more articles like these.

~~~
fleaflicker
The rest of the Mailinator blog and his personal blog
(<http://paultyma.blogspot.com/>) are filled with gems like this.

There are a few entries on multi-threaded synchronous I/O vs. async I/O; he's
a big proponent of the former.

------
pkulak
Redis works great as an LRU cache and is much more space-efficient than an in-
process LinkedHashMap, especially when the keys and values are small. Plus, an
LRU wreaks havoc with the Java generational garbage collector as soon as it
fills up (every entry you put in is all but guaranteed to survive into the
oldest generation, and then likely be removed).
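
For reference, the in-process cache being compared against is typically just
this (a minimal sketch):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal LRU cache: accessOrder=true keeps entries in recency order,
    // and removeEldestEntry evicts the least-recently-used entry past capacity.
    class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true);  // third arg: access order, not insertion order
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }

The GC pain follows directly: entries that live long enough to be evicted get
promoted to the old generation first, exactly where a generational collector
expects objects not to die.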

~~~
willvarfar
Redis would blow the latency budget though, right?

~~~
tfb
Maybe I'm missing something here, but I was under the impression Redis is one
of the fastest data stores out there. What do you mean it would blow the
latency budget? I'm curious because I've switched my startup's backend to
node+redis.

Thanks in advance!

~~~
amalag
He means it is being compared to CPU cache (from the article), so there is an
orders-of-magnitude difference. From this summary on Stack Overflow:
[http://stackoverflow.com/questions/433105/exactly-how-fast-a...](http://stackoverflow.com/questions/433105/exactly-how-fast-are-modern-cpus)

CPU registers (8-32 registers) – immediate access (0-1 clock cycles)
L1 CPU caches (32 KiB to 128 KiB) – fast access (3 clock cycles)
L2 CPU caches (128 KiB to 12 MiB) – slightly slower access (10 clock cycles)
Main physical memory (RAM) (256 MiB to 4 GiB) – slow access (100 clock cycles)
Disk (file system) (1 GiB to 1 TiB) – very slow (10,000,000 clock cycles)
Remote memory (such as other computers or the Internet) (practically unlimited) – speed varies

~~~
pkulak
But if you have a Redis cache on the same box (he says he only has one box
anyway) it's still in the same category: "Main physical memory", with maybe
some communication overhead.

~~~
willvarfar
"maybe some communication overheard" is orders of magnitude slower than L1/L2.

Hmm, there is something very wrong here. I'll try and explain in a blog post.

~~~
pkulak
But we're not talking about registers or CPU caches. 800 megs stashed in a
giant Java hash map are not going to be in L1 or L2 cache anyway.

~~~
willvarfar
[http://williamedwardscoder.tumblr.com/post/18065079081/cogs-...](http://williamedwardscoder.tumblr.com/post/18065079081/cogs-
bad#comment-446733732) might be interesting

------
newman314
Reading up on another algorithm (Locality Sensitive Hashing) referenced in the
first comment.

<http://www.stanford.edu/class/cs345a/slides/05-LSH.pdf>

~~~
pjscott
See also this library for locality sensitive hashing:

<http://lshkit.sourceforge.net/>

------
Maascamp
Great write up. One of the more interesting things I've read on here in a
while. Thanks for sharing.

------
pbiggar
In an aside he mentions you should use bubblesort instead of quicksort for
small arrays, due to cache locality, etc. I'd recommend using insertion sort
instead of bubblesort - it does much better in both cache locality and branch
performance (one branch prediction miss per key).
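
For reference, the insertion sort in question is tiny (a generic sketch, not
code from the article):

    // Insertion sort: scans sequentially (cache-friendly), and the inner
    // loop's exit branch mispredicts only about once per inserted key.
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];  // shift larger elements one slot right
                j--;
            }
            a[j + 1] = key;
        }
    }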

~~~
cobrausn
That seems a strange assertion - I was under the impression that insertion
sort beats bubble sort on small arrays, pretty much always, and even in
'nearly sorted' arrays, which is often touted as a strength of bubble sort.

<http://www.sorting-algorithms.com/> gives a decent overview with n equal to
20, 30, 40, or 50, but nothing smaller. Now I'm curious.

------
wolf550e
1. I think the author calls DEFLATE an LZW algorithm. It isn't.

2. Has the author looked at Google Snappy? It does 500MB/sec.
[http://code.google.com/p/snappy/source/browse/trunk/format_d...](http://code.google.com/p/snappy/source/browse/trunk/format_description.txt)

There is a pure-C implementation that might be easier to port:
<https://github.com/zeevt/csnappy>

~~~
electrum
There is already an excellent native port of Snappy to Java:
<https://github.com/dain/snappy>
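
Usage is a one-liner either way; a sketch using the xerial snappy-java binding
(a different Java port than the one linked above, assumed to be on the
classpath):

    import org.xerial.snappy.Snappy;

    public class SnappyDemo {
        public static void main(String[] args) throws Exception {
            byte[] raw = "the raw email source...".getBytes("UTF-8");
            byte[] packed = Snappy.compress(raw);     // very fast; lighter ratio than deflate
            byte[] back = Snappy.uncompress(packed);  // lossless round-trip
            System.out.println(packed.length + " compressed, " + back.length + " restored");
        }
    }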

------
steffes
Just when I thought I knew everything there is to know about compression
algorithms, along came Pauli, and Voila, mind now blown.

~~~
alyrik
Check out what RainStor does to compress relational data by a factor of 20-40
or so (that's 95%-97.5%):
[http://www.youtube.com/watch?v=rIsVcaMaAgg&feature=playe...](http://www.youtube.com/watch?v=rIsVcaMaAgg&feature=player_embedded)

------
skrebbel
_Voila. 90%. (Two notes: 1: that's a reasonable average at least... sometimes
better, sometimes worse and 2: I realize I'm not exactly sure what "Voila"
means, looking that up now)._

If mailinator wasn't already awesome, his writing about it sure is.

------
jorangreef
Something else that might work is content-dependent deduplication, with
variable chunk boundaries determined by a sliding Rabin-Karp-style (or XOR)
fingerprint on the content and a second cryptographic hash calculated for
chunks where the cheap fingerprint has a match. It's naive and can find
matches across headers, body and attachment parts.
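
A minimal sketch of the chunking step (a buzhash-style rotate-and-XOR rolling
fingerprint, i.e. the "XOR" variant; WINDOW and MASK are arbitrary tuning
knobs, and a real dedup store would then key each chunk by a cryptographic
hash such as SHA-256):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;

    // Content-defined chunking: cut wherever the low bits of a rolling hash
    // are zero, so boundaries follow the content rather than byte offsets.
    class Chunker {
        static final int WINDOW = 48;            // bytes in the rolling window
        static final long MASK = (1 << 13) - 1;  // => ~8 KiB average chunks
        static final long[] T = new long[256];   // random per-byte hash table
        static {
            Random r = new Random(42);
            for (int i = 0; i < 256; i++) T[i] = r.nextLong();
        }

        static List<byte[]> chunks(byte[] data) {
            List<byte[]> out = new ArrayList<>();
            int start = 0;
            long h = 0;
            for (int i = 0; i < data.length; i++) {
                h = Long.rotateLeft(h, 1) ^ T[data[i] & 0xFF];  // roll in new byte
                if (i >= WINDOW)                                // roll out old byte
                    h ^= Long.rotateLeft(T[data[i - WINDOW] & 0xFF], WINDOW);
                if (i + 1 - start >= WINDOW && (h & MASK) == 0) {
                    out.add(Arrays.copyOfRange(data, start, i + 1));
                    start = i + 1;
                }
            }
            if (start < data.length)
                out.add(Arrays.copyOfRange(data, start, data.length));
            return out;
        }
    }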

------
ShabbyDoo
I recently worked on a project where, to cut down on space, I built a custom
"compressor" for lists of tree-like records. You might think of a Person
record with N past addresses, although this was not the actual domain. No
records were cross-linked (at least not formally in the code) and the records
were trees, not more general DAGs. The data contained a lot of enumerated data
types (address type, gender, etc.). I didn't really care about the space usage
for 1K records, but I cared about 1M.

I used variable-length encoding for integers (a hacked-up subset of the Apache
Avro project) to take advantage of most of them being small in the datasets;
lots of lookup tables for enumerated values and for string values that were
commonly repeated in practice; and implicit "keying" based on a record schema
to avoid specifying key identifiers all over (our data was not very sparse, so
this beat out a serialized pattern of KVKVKV etc.). I thought about taking
advantage of most strings having very limited character sets and doing Huffman
encoding per data element type, but the results were good enough before I got
there. A co-worker also noted that, because parts of these records were
de-normalized data, subtree pattern recognition could provide huge gains.

I added some record-length prefixes to allow individual records to be read one
at a time, so that the entire dataset would not have to be read into memory at
once. IIRC, compression speed was 2-3x gzip(9), and large record sets were
1/10th the size of using Java serialization plus gzip. [Yes, Java
serialization is likely the worst way to serialize this sort of data.]
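
For the variable-length integer piece, the standard zig-zag varint that Avro
and protobufs use looks like this (a generic sketch, not the project's actual
code):

    import java.io.ByteArrayOutputStream;

    public class VarInt {
        // Zig-zag maps small negatives to small positives (-1 -> 1, 1 -> 2, ...),
        // then each byte carries 7 payload bits; the high bit means "more follows".
        static void writeVarLong(ByteArrayOutputStream out, long v) {
            long z = (v << 1) ^ (v >> 63);  // zig-zag encode
            while ((z & ~0x7FL) != 0) {
                out.write((int) ((z & 0x7F) | 0x80));
                z >>>= 7;
            }
            out.write((int) z);
        }
    }

A value like 7 costs one byte instead of eight, which is where the win over
fixed-width fields comes from.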

Was all of this worth it? It solved the problem of not burning through network
and memory, but it was a local optimum. The root problem was that this data
came from another system which did not provide repeatable reads, and providing
them would have been a massive effort. However, our users wanted to meander
through a consistent data set over the course of an hour or so. To provide
this ability to browse, we throw these records into a somewhat transient
embedded H2 DB instance. The serialized format is required primarily to
provide high availability via a clustered cache. In retrospect, I would have
pushed for using a MongoDB-esque cluster, which could have replaced both H2
(query-ability) and the need for the serialized format (HA).

It surprised me that there were no open source projects (at least Java-
friendly ones) which provided compression schemes taking advantage of the
combination of a well-defined record schema and redundant-in-practice data.
Kryo (<http://code.google.com/p/kryo>) comes closest as a space-efficient
serializer, but it treats each record individually. Protobufs, Thrift, Avro,
etc. are designed as RPC/messaging wire formats and, as an explicit design
decision (at least in the protobufs case), optimize for speed and the size of
an individual record vs. the size of many records. The binary standards for
JSON and XML beat the hell out of their textual equivalents, but they don't
have any tricks which optimize on patterns in the repeated record structures.

Is this just an odd use case? Does anyone else have a similar need?

~~~
ehsanu1
Most people are probably happy relying on gzip for compression. It's "Good
Enough" for most use cases. It seems like it made sense for you though.

------
dredmorbius
Further efficiencies can be gained by removing extraneous apostrophes from
possessive "its".

~~~
pjscott
While that would result in decreased entropy that a decent general-purpose
compression algorithm like LZ77 could take advantage of, this preprocessing
step is lossy and probably not worth the bytes it would save you.

------
__alexs
The line-based thing smells like doing LCS but with string elements of length
N rather than 1...

------
funkah
Mailinator, even considering any praise it has ever gotten, is still one of
the most underrated tools on the internet. I love it and use it all the time.

------
iag
Reading this article makes me giggly inside.

