How Mailinator compresses its email stream by 90% (mailinator.blogspot.com)
424 points by zinxq on Feb 21, 2012 | 69 comments



Mailinator is a great product. My favorite part about it is how whenever I register for something, the clever form validation software always rejects the mailinator.com email address. Then I visit mailinator, see their alternate domain name du jour in image form (so bots can't harvest it, hah!), and then that works perfectly. It makes me giggle with joy every time I do it.

It's also nice not receiving ads in the mail every hour of every day just because I wanted to try some new YC-startup's product.


I worry we're going to need disposable social IDs now too, just like disposable email addresses. That's something facebook/twitter probably won't appreciate. I would LOVE to try a lot of services that require one of those logins, but I'm not comfortable giving them access to my info.


Create an "alter ego" account on Facebook. http://www.fakenamegenerator.com/gen-random-us-us.php if you need fake name and address ideas, or just use whatever you wish your name was. If you get too funny with the name though, FB's filter won't believe your name.

Then, keep a separate Chrome profile that's logged into the fake facebook, and your identity is safe. Probably best to lock down all privacy settings on the fake user to absolute maximum lockdown, to avoid information leaking onto Facebook's public layer that may tie you to it.

That was the strategy I used when I worked for a company that did business with Facebook games. Those things were so spammy, I would never want my actual friends to be spammed with all the viral content. To this day I keep my fake account for any and all untrusted FB-related things.


Facebook now requires your phone number in order to register. Is there a way to use a fake phone number too?


I'm fairly certain this is untrue. Looking at the signup form, all I can see is name, email, password, birthdate and gender. They may ask for it later, but that almost certainly is optional.


This is sort of true. If you want to use the account as a developer account you have to register your phone number. Also probably to advertise.

I recently wanted to create an "Admin" account for a client I was developing a web-app for, as I didn't want to link my personal facebook account to my work. But with the phone number requirement the account got flagged and rejected from the system.

Frustrating!


There are SIP providers which offer phone numbers on their free tiers. But I'm sure they'd appreciate being used to register throwaway accounts even less than Facebook and its ilk do.


If that's true, a $15 tracfone takes care of it.


Sorry, but a fake facebook account isn't worth that much. YMMV.


BrowserID solves that one: a disposable email becomes a disposable identity.


According to this[0] you can also use your own domain and send all the mail to mailinator. I don't think that form validation software would be so clever as to check your MX records (otherwise their alternate domain names du jour wouldn't work).

[0] http://mailinator.blogspot.com/2008/01/your-own-private-mail...


Warning! If you do this, anyone might be able to create HTTPS certificates for your domain name!

A lot of certification authorities will allow you to get a certificate for example.com if you can prove that you can reply to an email sent to an email address @example.com. If you configure an MX record pointing to the mailinator MTAs, then anybody can request a cert and ask to validate domain ownership by contacting whatever@example.com, and can subsequently browse the whatever@ mailbox on mailinator to read and reply to the confirmation email.


Are you skipping over something in that explanation about which email(s) one must be able to read? Because the way you said it makes it sound as if anyone could make fake certs for hotmail, gmail, etc. And that can't be right... I hope.


This is partially right(!)

A security researcher once obtained a certificate for Microsoft's live.com domain by registering an email account sslcertificates@live.com and using it to reply to the CA's verification email.

http://www.theregister.co.uk/2011/04/11/state_of_ssl_analysi...


Presumably something like "root" or "webmaster," but those aren't blocked on mailinator: http://mailinator.com/maildir.jsp?email=webmaster&x=0...


You're right, but I was just proposing to use one of your own domains, not your real, business, holy domain. You can register a new one for less than $10/year, or you can just use a free third-level subdomain (like trashbin.example.com)
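For illustration, the DNS side is just one MX record on the throwaway subdomain (a hypothetical BIND-style zone snippet; the mailinator MX hostname here is an assumption, check the linked post for the real one):

  ; hypothetical zone entries for a throwaway mail subdomain
  trashbin.example.com.    IN    MX    10    mail.mailinator.com.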


Nice. Hopefully I can just CNAME something.example.com to mail.mailinator.com, and everything will just work.

Edit: yes, this works perfectly.

It would be nice if there was a quick-and-easy open-source version of an in-memory mail server. I'd like to run my own service, but not enough to write a robust SMTP server from scratch.


I still think that mail servers could really use a disruption. It is still way too complex just to get something simple set up.

I haven't surveyed the field extensively but this has been my experience with the common recommendations. If anyone knows of a server that already exists and would make it trivial to set up basic SMTP/IMAP/POP3 with authentication let me know.

Ideally the only configuration info I would need would be something like this:

  allow_imap = True
  allow_pop3 = True
  allow_smtp = True
  accept_mail_for = ['this_domain.com', 'another_domain.com']

  # users
  users = ['jeff@this_domain.com', 'molly@another_domain.com']

  # forwards
  forwards = {'jeff@this_domain.com': 'cookiecaper@some_domain.com'}

and then, a program like mail_server_passwd would ask me for my username and let me set a password, and that would be that. It shouldn't be difficult to bootstrap a mail server, yet right now setting up something with POP3, IMAP, and SMTP can take hours for experienced admins unless they already know the mail server(s) very well.


Keep an eye on OpenSMTPD. For a taste, look at the sample configurations at https://calomel.org/opensmtpd.html.

[Warnings: OpenSMTPD apparently hasn't eaten anyone's mail yet, but it's hardly done. By design, it doesn't include as many options as other mail servers, some of which may be useful. I don't think OpenSMTPD has been ported to non-OpenBSD systems yet. calomel.org tends to be out of date, inaccurate or dangerously incomplete.]


OpenSMTPD comes very close to your ideal simple configuration; a 5-line configuration gets you up and running. Tie this together with OpenBSD's extremely simple POP3 server and you have something that is both extremely simple and easy to set up.

http://opensmtpd.org


What about using a ramdisk? Of course, it won't be as optimized since the software thinks it's writing to disk, but..

Also, some SMTP servers are probably pretty easy to modify. I use qpsmtpd[1], which should be quite easy to adapt. Then, all you need is an IMAP daemon that can read from memory..

I would like to have my mail in something like Redis. It would give easy access from different languages etc..

[1]: http://en.wikipedia.org/wiki/Qpsmtpd


This is what you're looking for: http://quintanasoft.com/dumbster/


My favorite part about it is that most sites I register for do NOT reject the mailinator.com email address. I always think "I'll probably get rejected and have to hunt down an alternate", but 80-90% of the time my mailinator address is accepted.


I run a similar (though waaaay less popular) site. My mail is stored on disk in a mysql db so I don't have quite the same memory constraints as this.

I had originally created this site naively, stashing the uncompressed source straight into the db. For the ~100,000 mails I'd typically retain, this would take up anywhere from 800MB to slightly over a gig.

At a recent rails camp, I was in need of a mini project so decided that some sort of compression was in order. Not being quite so clever I just used the readily available Zlib library in ruby.

This took about 30 minutes to implement and a couple of hours to test and debug. An obvious bug (very large emails were exceeding the BLOB size limit, which truncated the compressed source) was the main problem there...

I didn't quite reach 90% compression, but my database is now typically around 200-350MB, so about 70-80% compression. Not 90%, but I did manage to implement it in about 6 lines of code =)
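For anyone curious what those few lines amount to, here is roughly the same idea sketched in Java with java.util.zip (the original was Ruby's Zlib; this is an illustration, not the actual code):

  import java.io.ByteArrayOutputStream;
  import java.util.zip.Deflater;
  import java.util.zip.Inflater;

  // Deflate the raw message bytes before storing the BLOB, inflate on read.
  class MailBlob {
      static byte[] compress(byte[] raw) {
          Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
          deflater.setInput(raw);
          deflater.finish();
          ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length / 4 + 16);
          byte[] buf = new byte[4096];
          while (!deflater.finished()) {
              out.write(buf, 0, deflater.deflate(buf));
          }
          deflater.end();
          return out.toByteArray();
      }

      static byte[] decompress(byte[] packed) throws Exception {
          Inflater inflater = new Inflater();
          inflater.setInput(packed);
          ByteArrayOutputStream out = new ByteArrayOutputStream(packed.length * 4);
          byte[] buf = new byte[4096];
          while (!inflater.finished()) {
              out.write(buf, 0, inflater.inflate(buf));
          }
          inflater.end();
          return out.toByteArray();
      }
  }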


Can you tell us what the site is? I've once or twice found sites that blocked even mailinator's alternate domains. Having a less well known alternative to mailinator would be nice.


http://dudmail.com is my site. (@scatmail.com and @trayna.com addresses will also end up there)

I made a submission here about it a few months back: http://news.ycombinator.org/item?id=3026892

In terms of mailinator being blocked . . . if that's the case, it's probably in the database at http://www.block-disposable-email.com (as are my domains). The guy is pretty quick at finding and adding new disposable email domains.

I plan to add a few new domains every now and then to see if I can at least keep some of them getting past his filter...


10MinuteMail.com is another option.

(biased creator here)


There's also 20minutemail.com - random address, allows reply, forward and download of message

yopmail.com - any address (like Mailinator) and allows download of attachments directly

There are a ton of these available, just depends on your specific needs. I regularly use them for signing up to those sites that need sign-up before access to a file/article/image (e.g. forums).

I've also used them in testing web apps, using Selenium to grab an email address, register a user, and then go back to check that the welcome email had arrived.


ah you're one of the sites I'm trying to catch in page rank =)

I'm currently floating in google near the top of page two =/


Both blogs (mailinator and paultyma) are awesome. I need more stuff like this and less of the typical Web 2.0 "how we use NoSQL" and cache-everything posts (usually written without a clue how to do caching properly; 37signals' caching approach is the exception, very much in line with mailinator's techniques: smart and elegant).


It's awesome, yes - but it's worth keeping in mind that this trick works because he has an intensely write-heavy application (he probably does orders of magnitude more writes than reads). Very few applications have such a data access pattern.


Great read. I wish there were more articles like these.


The rest of the mailinator blog and his personal blog (http://paultyma.blogspot.com/) are filled with gems like this.

There are a few entries on multi-threaded synchronous I/O vs. async I/O; he's a big proponent of the former.


Redis works great as an LRU cache and is much more space-efficient than an in-process LinkedHashMap, especially when the keys and values are small. Plus, an LRU wreaks havoc with the Java generational garbage collector as soon as it fills up (every entry you put in is all but guaranteed to be promoted to the oldest generation before it gets evicted).
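For comparison, the in-process LinkedHashMap LRU being referred to is only a few lines (a minimal sketch; capacity and synchronization policy are up to you):

  import java.util.LinkedHashMap;
  import java.util.Map;

  // Access-ordered LinkedHashMap that evicts the eldest entry once full.
  class LruCache<K, V> extends LinkedHashMap<K, V> {
      private final int capacity;

      LruCache(int capacity) {
          super(16, 0.75f, true);    // accessOrder = true gives LRU ordering
          this.capacity = capacity;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
          return size() > capacity;  // evict when over capacity
      }
  }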


Redis would blow latency budget though, right?


Maybe I'm missing something here, but I was under the impression Redis is one of the fastest data stores out there. What do you mean it would blow the latency budget? I'm curious because I've switched my startup's backend to node+redis.

Thanks in advance!


He means it is being compared to CPU cache (from the article), so there are orders of magnitude of difference. From this summary on stack overflow: http://stackoverflow.com/questions/433105/exactly-how-fast-a...

  CPU registers (8-32 registers) – immediate access (0-1 clock cycles)
  L1 CPU caches (32 KiB to 128 KiB) – fast access (3 clock cycles)
  L2 CPU caches (128 KiB to 12 MiB) – slightly slower access (10 clock cycles)
  Main physical memory (RAM) (256 MiB to 4 GiB) – slow access (100 clock cycles)
  Disk (file system) (1 GiB to 1 TiB) – very slow (10,000,000 clock cycles)
  Remote memory (such as other computers or the Internet) (practically unlimited) – speed varies


But if you have a Redis cache on the same box (he says he only has one box anyway) it's still in the same category: "Main physical memory", with maybe some communication overhead.



"maybe some communication overheard" is orders of magnitude slower than L1/L2.

Hmm, there is something very wrong here. I'll try and explain in a blog post.


But we're not talking register caches. 800 megs stashed in a giant Java hash are not going to be in L1 or L2 cache.



There is a HUGE overhead when going through the network. Even if you don't (localhost), there's overhead when using TCP/IP. Even if you don't, there's overhead when using UNIX sockets or whatever you use for inter-process communication.

It probably doesn't matter though..

And yes, Redis is very fast and you gain a lot when using it compared to just a Hash in the same process.


So are you saying that because of the overhead involved with "talking" to redis, the fastest datastore would actually be an implementation of my own version (or a readily available version like node-lru-cache) of an LRU cache in my node app script - the datastore would essentially be a simple JSON object embedded directly in the script with methods specific to LRU data sets, gets, and backup?


well, the point is only that in-process is the fastest there is. By using something like Redis you give up some speed but gain features: queues, pub/sub etc etc.. But that doesn't mean you have to implement those features yourself, they might be available as a library for your language of choice as well.

The bigger win, IMHO, is that you gain flexibility: since Redis (or whatever) is decoupled from your process it can run on another processor, another machine or perhaps run on many machines etc..

Not sure how in-process cache would work in node, being async and all, but yes in-process is faster. But then you have to think about stuff like:

  - how do you avoid losing everything when node crashes / restarts?
  - what if another process needs to read write to the cache?
  - what if you need more memory than a single machine provides (probably not going to happen). 
  - implementation bugs
In-process: Faster but probably harder to scale. But then again, it might be so fast you don't have to.


Would Redis evicting something from the LRU make some older mails unreadable?


Don't know.. But you make it sound like a bad thing. If you're out of memory you kind of have to evict stuff, don't you?

I get the feeling you are kind of anti-Redis and I don't get why? Redis is a very cool project and could be useful for a lot of things.. It's not Redis' fault some people misuse it..


I am not at all anti-redis. It's not at all redis's fault it gets misused either; I say so in the blog post, even.

So why would you compress something that you can only decompress if it's been recently reused?

How would you do mailinator with your strings in redis, taking O(n) calls to redis to recover them when decompressing an email, where n is the number of lines (or runs of consecutive lines, granted) in the email?


"I am not at all anti-redis. Its not at all redis's fault it gets misused either; I say so that in the blog-post, even."

Yeah, you actually do.. sorry.

"So why would you compress something that you can only decompress if its recently-reused?"

Not sure I understand your questions, and I've only just started looking at Redis. I guess you could do it the same way, but the added latency may make it infeasible. The better answer is probably that you don't: you would modify the implementation to fit Redis' (or whatever's) strengths and weaknesses.


Trying to work out how to fit mailinator into Redis rather than questioning whether Redis fits into mailinator is exactly the cargo-cult cool-kids zombism I was ranting against, though ;)

You really can process mailinator-quantities of email with a simple Java server using a synchronized hash-map and linked-list LRU and have some CPUs left over for CPU-intensive opportunistic LZMAing.
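To make the shape of that concrete (a rough, purely illustrative sketch of in-process line sharing; not mailinator's actual code):

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Store each distinct line once; an email becomes a list of references into
  // the shared pool. Eviction, refcounting and the LZMA pass are left out.
  class LinePool {
      private final Map<String, String> pool = new HashMap<String, String>();

      synchronized List<String> intern(List<String> emailLines) {
          List<String> refs = new ArrayList<String>(emailLines.size());
          for (String line : emailLines) {
              String canonical = pool.get(line);
              if (canonical == null) {
                  pool.put(line, line);  // first time we've seen this line
                  canonical = line;
              }
              refs.add(canonical);       // later emails share the same String object
          }
          return refs;
      }
  }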

Trying to do it with an IPC/TCP ping-pong for each and every line, though... well, I'm not sure you could process mailinator-quantities of email within any reasonable hardware budget...

Luckily you have a chance to see the error of your ways :)


Well, I never claimed Redis should be used for this; you are the one who asked how to do it.

But you have to remember that most people can't have important data in just one process; it's going to crash and your data is gone. The LMAX guys solved this in a cool way, but I wouldn't call it easy: http://martinfowler.com/articles/lmax.html#KeepingItAllInMem...


Reading about another algo (Locality Sensitive Hashing) as referenced in the first comment.

http://www.stanford.edu/class/cs345a/slides/05-LSH.pdf


See also this library for locality sensitive hashing:

http://lshkit.sourceforge.net/


Great write up. One of the more interesting things I've read on here in a while. Thanks for sharing.


In an aside he mentions you should use bubblesort instead of quicksort for small arrays, due to cache locality, etc. I'd recommend using insertion sort instead of bubblesort - it does much better in both cache locality and branch performance (one branch prediction miss per key).
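For reference, the version in question is tiny (a minimal int[] sketch):

  // Sift each element back into place; for small arrays the working set stays
  // in cache and the inner-loop branch is well predicted (about one miss per key).
  static void insertionSort(int[] a) {
      for (int i = 1; i < a.length; i++) {
          int key = a[i];
          int j = i - 1;
          while (j >= 0 && a[j] > key) {
              a[j + 1] = a[j];
              j--;
          }
          a[j + 1] = key;
      }
  }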


That seems a strange assertion - I was under the impression that insertion sort beats bubble sort on small arrays, pretty much always, and even in 'nearly sorted' arrays, which is often touted as a strength of bubble sort.

http://www.sorting-algorithms.com/ gives a decent overview with n equal to 20, 30, 40, or 50, but nothing smaller. Now I'm curious.


1. I think the author calls DEFLATE an LZW algorithm. It isn't.

2. Has the author looked at Google Snappy? It does 500MB/sec. http://code.google.com/p/snappy/source/browse/trunk/format_d...

There is a pure-C implementation that might be easier to port: https://github.com/zeevt/csnappy


There is already an excellent native port of Snappy to Java: https://github.com/dain/snappy


Just when I thought I knew everything there is to know about compression algorithms, along came Pauli, and Voila, mind now blown.


Check out what RainStor does to compress relational data by a factor of 20-40 or so (that's 95%-97.5%): http://www.youtube.com/watch?v=rIsVcaMaAgg&feature=playe...


Voila. 90%. (Two notes: 1: that's a reasonable average at least... sometimes better, sometimes worse and 2: I realize I'm not exactly sure what "Voila" means, looking that up now).

If mailinator wasn't already awesome, his writing about it sure is.


Something else that might work is content-defined deduplication, with variable chunk boundaries determined by a sliding Rabin-Karp-style (or XOR) fingerprint over the content, and a second, cryptographic hash calculated for chunks where the cheap fingerprint finds a match. It's naive about the mail format, yet it can find matches across headers, body and attachment parts.
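A rough sketch of what the cheap rolling fingerprint might look like, assuming a simple polynomial hash (the window size and boundary mask here are arbitrary illustrations, not tuned values):

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  // A cheap polynomial rolling hash over a small window decides chunk
  // boundaries, so identical content chunks the same way wherever it appears.
  // Matching chunks would then be confirmed with a cryptographic hash.
  class Chunker {
      static final int WINDOW = 48;            // rolling-window size (arbitrary)
      static final int MASK = (1 << 12) - 1;   // ~4 KiB average chunks (arbitrary)
      static final int POW;                    // 31^(WINDOW-1), to drop the oldest byte
      static { int p = 1; for (int k = 1; k < WINDOW; k++) p *= 31; POW = p; }

      static List<byte[]> chunk(byte[] data) {
          List<byte[]> chunks = new ArrayList<byte[]>();
          int hash = 0, start = 0;
          for (int i = 0; i < data.length; i++) {
              if (i >= WINDOW) hash -= (data[i - WINDOW] & 0xff) * POW;  // byte leaves window
              hash = hash * 31 + (data[i] & 0xff);                       // byte enters window
              if (i + 1 - start >= WINDOW && (hash & MASK) == MASK) {    // boundary found
                  chunks.add(Arrays.copyOfRange(data, start, i + 1));
                  start = i + 1;
              }
          }
          if (start < data.length) chunks.add(Arrays.copyOfRange(data, start, data.length));
          return chunks;
      }
  }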


I recently worked on a project where, to cut down on space, I built a custom "compressor" for lists of tree-like records. You might think of a Person record with N past addresses, although this was not the actual domain. No records were cross-linked (at least not formally in the code) and the records were trees, not more general DAGs. The data contained a lot of enumerated data types (address type, gender, etc.). I didn't really care about the space usage for 1K records, but I cared about 1M.

I used variable-length encoding (a hacked-up subset of the Apache Avro project) for integers to take advantage of most of them being small in the datasets. Lots of lookup tables for enumerated values and commonly-repeated-in-practice string values. Implicit "keying" based on a record schema to avoid specifying key identifiers all over (our data was not very sparse, so this beat out a serialized pattern of KVKVKV etc.). I thought about taking advantage of most strings having very limited character sets and doing Huffman encoding per data element type, but the results were good enough before I got there. A co-worker also noted that, because parts of these records were de-normalized data, subtree pattern recognition could provide huge gains.

I added some record-length prefixes to allow individual records to be read one at a time so that the entire dataset would not have to be read into memory at once. IIRC, compression speed was 2-3x gzip(9), and large record sets were 1/10th the size of using Java serialization plus gzip. [Yes, Java serialization is likely the worst way to serialize this sort of data]
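For reference, the variable-length integer encoding is the usual 7-bits-per-byte varint (a sketch of the unsigned case; Avro additionally zig-zag encodes signed values first):

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.io.InputStream;

  // Varint: 7 payload bits per byte, high bit set on all but the last byte.
  // Small values (the common case here) take one byte instead of 4 or 8.
  class Varint {
      static void write(long value, ByteArrayOutputStream out) {
          while ((value & ~0x7FL) != 0) {
              out.write((int) ((value & 0x7F) | 0x80));  // more bytes follow
              value >>>= 7;
          }
          out.write((int) value);                         // final byte, high bit clear
      }

      static long read(InputStream in) throws IOException {
          long value = 0;
          int shift = 0, b;
          do {
              b = in.read();
              if (b < 0) throw new IOException("EOF in varint");
              value |= (long) (b & 0x7F) << shift;
              shift += 7;
          } while ((b & 0x80) != 0);
          return value;
      }
  }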

Was all of this worth it? It solved the problem of not burning through network and memory, but it was a local optimum. The root problem was that this data came from another system which did not provide repeatable reads, and providing them would have been a massive effort. However, our users wanted to meander through a consistent data set over the course of an hour or so. To provide this ability to browse, we throw these records into a somewhat transient embedded H2 DB instance. The serialized format is required primarily to provide high availability via a clustered cache. In retrospect, I would have pushed for using a MongoDB-esque cluster which could have replaced both H2 (query-ability) and the need for the serialized format (HA).

It surprised me that there were no open source projects (at least Java-friendly ones) which provided compression schemes taking advantage of the combination of well-defined record schema and redundant-in-practice data. Kryo (http://code.google.com/p/kryo) comes closest as a space-efficient serializer, but it treats each record individually. Protobufs, Thrift, Avro, etc. are designed for RPC/Messaging wire formats and, as an explicit design decision (at least in the protobufs case) optimize on speed and the size of an individual record vs. the size of many records. The binary standards for JSON and XML beat the hell out of their textual equivalents, but they don't have any tricks which optimize on patterns in the repeated record structures.

Is this just an odd use case? Does anyone else have a similar need?


Most people are probably happy relying on gzip for compression. It's "Good Enough" for most use cases. It seems like it made sense for you though.


Further efficiencies can be gained by removing extraneous apostrophes from possessive "its".


While that would result in decreased entropy that a decent general-purpose compression algorithm like LZ77 could take advantage of, this preprocessing step is lossy and probably not worth the bytes it would save you.


The line-based thing smells like doing LCS, but with elements that are whole lines rather than single characters...


Mailinator, even considering any praise it has ever gotten, is still one of the most underrated tools on the internet. I love it and use it all the time.


Reading this article makes me giggly inside.



