
How Mail.Ru Reduced Email Storage From 50 To 32 PB - riqbal
https://www.smashingmagazine.com/2017/01/reducing-email-storage-mailru/
======
Xeoncross
Reading about their fix with a "magic" number, it seems that when there was
corruption, the file would never be deleted again, since you've lost
information at that point. (As they claim, even if the magic number went back
to "0" you would still need to keep the file.)

Rather than this questionable solution, why not create a transaction-like
system where a delete results in a tombstone/marker placed on the email
database, with a rollback mechanism if one or both of the services crashed.

For example: mark the email as deleted in the index, then delete the file,
then, if the file was successfully deleted, actually clear the email record
(see the sketch after the steps below).

So the email table would be: {id uint64, ... deleted bool}

1. If the file delete service failed, then unmark the deleted index (or try
again).

2. If marking the email as deleted failed, then you're back to square one
with no change.

3. If the email is marked and the file is deleted, but actually deleting the
email record failed: restart/retry/etc. until you get that pesky email record
removed.
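
A minimal sketch of that flow, with hypothetical "index" and "filedb"
services (all method names here are made up):

    def delete_email(index, filedb, email_id, file_id):
        # Step 2 above: if marking fails, nothing has changed yet.
        if not index.mark_deleted(email_id):
            return False
        # Delete the file; on failure, undo the marker (step 1 above).
        try:
            filedb.delete(file_id)
        except IOError:
            index.unmark_deleted(email_id)
            return False
        # Step 3: keep retrying until the email record is actually removed.
        while not index.remove(email_id):
            pass
        return True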

~~~
PSIAlt
Thanks for your question. In your solution there are some consistency
vulnerabilities:

\- (Step 1) If "delete service failed" \- timeout is a fail? The decrement
probably was done, but your app do not see response. In this case you dont
know whatever decrement was done really or not. You cannot do re-try with this
error (unless you use "magic" =))

- (Step 1) If you "unmark the deleted index", this does not mean it was fixed
on disk forever. Your app or even the server can shut down/reboot before you
ensure the disk write.

- (Step 2) If marking failed, you generally cannot roll back. You can try to
roll back the decrement, but that operation can possibly fail too. In this
case, on the next delete iteration the counter in filedb will be corrupted.

- (Step 3) The indexes become the exclusive data owner. If you somehow lose
the indexes for one mailbox, you cannot rely on the counters in filedb
anymore. So you cannot do any decrements until a full double-sided recheck.

- Any bug in any app in this chain can destroy filedb consistency.
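
Roughly, the counter-plus-"magic" idea works like this (an illustrative
sketch, not the production code; names are made up):

    import random

    class FileEntry:
        def __init__(self):
            self.counter = 0   # reference count in filedb
            self.magic = 0     # running sum of per-link random values

    def link(entry):
        r = random.randrange(1, 2**32)
        entry.counter += 1
        entry.magic += r
        return r               # stored with the email record in the index

    def unlink(entry, r):      # r comes back from the index at delete time
        entry.counter -= 1
        entry.magic -= r

    def reclaimable(entry):
        # Delete the physical file only when both hit zero. counter == 0
        # with magic != 0 signals a lost or doubled operation, so the file
        # is kept for an offline recheck instead.
        return entry.counter == 0 and entry.magic == 0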

~~~
zzzcpan
Your magic numbers don't seem to guarantee consistency either; with multiple
faulty exchanges the file could be gone forever: maybe the random numbers
happened to be the same, or the sum of subtractions during faulty exchanges
got to 0, or there was an overflow, etc.

Distributed algorithms are not easy, but for this case there are proven
solutions; look up how Riak deletes records, for example, with "tombstones"
and "reaping". Basically you have to think in terms of CRDTs and have a log of
operations, a versioning scheme, synchronization, etc., all of which help
you make sure that no operation is applied twice anywhere and that in case
of faulty communications nodes can synchronize and apply missing operations.
This gives you consistency, eventually, and a reliable way to know when files
can be physically deleted. Or, you know, just use Riak.
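
A toy sketch of the tombstone-and-reap idea (not Riak's actual protocol; all
names here are made up): every operation carries a unique id, replicas apply
each id at most once, and the tombstone is reaped only once every replica
has it.

    class Replica:
        def __init__(self):
            self.files = {}        # file_id -> data
            self.tombstones = {}   # file_id -> delete op id
            self.applied = set()   # op ids already applied (at-most-once)

        def apply(self, op_id, action, file_id, data=None):
            if op_id in self.applied:
                return             # replayed message: safe no-op
            self.applied.add(op_id)
            if action == "put":
                self.files[file_id] = data
            elif action == "delete":
                self.files.pop(file_id, None)
                self.tombstones[file_id] = op_id

    def reap(replicas, file_id):
        # Physically delete only after every replica carries the tombstone.
        if all(file_id in r.tombstones for r in replicas):
            for r in replicas:
                del r.tombstones[file_id]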

~~~
grogenaut
Or Dynamo, or Cassandra, or several others that all exhibit good behavior
around these things. Also, all of the above (and Riak) have event streams for
when tombstones go away, so you can hook into those and reap the file.

------
kevincox
This is a surprisingly primitive and simple system. A twice-replicated file is
a huge space overhead for not-very-reliable storage. (I mean, the probability
of losing two disks at the same time isn't high, but it's also not
immeasurably low.)

For example, fb talked about how they use erasure coding to get 1.4x space
overhead with far higher reliability.

[https://code.facebook.com/posts/1433093613662262/-under-
the-...](https://code.facebook.com/posts/1433093613662262/-under-the-hood-
facebook-s-cold-storage-system-/)
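
The back-of-the-envelope numbers, assuming a Reed-Solomon style 10+4 layout
(the parameters that match the 1.4x figure; see the post for fb's actual
scheme):

    # 10 data shards + 4 parity shards per stripe (assumed parameters).
    data, parity = 10, 4
    overhead = (data + parity) / data
    print(overhead)  # 1.4x raw bytes per logical byte, survives any 4 lost shards
    # Two-copy replication, by comparison, costs 2.0x and survives only 1 loss.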

While erasure coding can have a performance impact (although mostly only on
failures), a multi-tier storage system will eliminate that. I presume that
most attachments aren't read more than a week or two after being received.

So while it is definitely nice to get some savings through deduplication,
there are a lot more gains to be had through more intelligent storage of the
data.

------
petercooper
As an aside, this provides an interesting before/after example of the editing
process. The same article was on
[http://highscalability.com/blog/2017/1/2/efficient-
storage-h...](http://highscalability.com/blog/2017/1/2/efficient-storage-how-
we-went-down-from-50-pb-to-32-pb.html) earlier this month, but has been edited
here.

The title also changed from "Efficient Storage: How We Went Down From 50 PB To
32 PB" to "How Mail.Ru Reduced Email Storage From 50 To 32 PB" - the latter
of which seems a lot better for places like HN, where headline edits for
context can be tricky to come by.

------
loumf
Instead of using random numbers and adding/subtracting, you could use random
primes and multiply/divide. If the deletion is doubled, the random prime is no
longer a factor of the product and you ignore it.

Of course, you need to be able to store a very large number :)
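
A toy version of that idea; Python bignums make the "very large number" part
easy. (A real system would draw large random primes, and zzzcpan's collision
caveat still applies if two references happen to draw the same prime.)

    import random

    PRIMES = [101, 103, 107, 109, 113, 127, 131, 137, 139, 149]  # tiny demo pool

    class RefProduct:
        def __init__(self):
            self.product = 1

        def link(self):
            p = random.choice(PRIMES)
            self.product *= p
            return p                   # stored alongside the referencing email

        def unlink(self, p):
            if self.product % p == 0:  # p is still a factor: apply the delete
                self.product //= p
            # otherwise the delete was already applied once; ignore it

        def reclaimable(self):
            return self.product == 1   # no outstanding references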

------
mike-cardwell
I got my email storage down to ~66% of its previous requirement by enabling
the zlib plugin for Dovecot and running a batch job to gzip-compress the mail
in my Maildir.
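
For anyone curious, a rough outline of that batch job (this ignores Dovecot's
locking and filename caveats, so treat it as a sketch rather than a finished
tool):

    import gzip, os, shutil

    def compress_maildir(cur_dir):
        # Gzip every message in a Maildir "cur" directory, in place.
        for name in os.listdir(cur_dir):
            path = os.path.join(cur_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                if f.read(2) == b"\x1f\x8b":   # gzip magic: already compressed
                    continue
            tmp = path + ".tmpgz"
            with open(path, "rb") as src, gzip.open(tmp, "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.rename(tmp, path)               # atomic replace on POSIX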

~~~
kiwijamo
That sounds like a good idea and I've bookmarked your post for future
reference. Did you notice any performance issues after switching to gzip?

~~~
mike-cardwell
I didn't, no, but it's not something I measured.

------
vmateixeira
I don't get this:

If two systems (one with the index, another with the file) need to communicate
in order for an email to be deleted, why would you need a random number to
track email deletion in case of error? Wouldn't it make more sense to
decrement the index only once the file deletion was successful?

~~~
Jakob
This was answered in the comments section of the High Scalability article [0]:

> Deleting an email (and generally any file) is not transactional. It can be
> rolled back by a server poweroff, for example. This can generate an
> additional decrement, and that is the problem. Generally, 2-phase commit
> between the "email deleter" and the "counter decrement" is not enough. You
> would also need 2-phase commit between the "deleter" and the physical disk
> (not its cache), and so on. Also, a network error can occur between these 2
> phases. We do not want to lose a user's data in _any_ case.

[0] [http://highscalability.com/blog/2017/1/2/efficient-
storage-h...](http://highscalability.com/blog/2017/1/2/efficient-storage-how-
we-went-down-from-50-pb-to-32-pb.html)

~~~
vmateixeira
Thanks for pointing me in the right direction! :)

------
floatboth
How does application-level deduplication compare to ZFS deduplication?

~~~
GrayShade
Application-level dedup usually works at the file level, while ZFS dedups data
blocks (of variable size). The ZFS way of doing it allows saving space on
large files that are similar, like VM images. On the other hand, it requires
storing block hashes in memory.

A ZFS dedup table entry is 320 bytes in size [1], so, assuming 128 KB blocks,
that comes to 2.5 GB per TB. Moreover, the tables are kept in the ZFS cache
(ARC), where they have to compete with other data and metadata. Since metadata
is by default limited to 75% of the ARC [2], you'll need at least 3.3 GB per
TB, and on the Internet you'll find even worse figures, like 20 GB [3].
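
The arithmetic, spelled out:

    entry_bytes = 320                      # per DDT entry [1]
    block_bytes = 128 * 1024               # assumed block size
    blocks_per_tib = 2**40 // block_bytes  # 8,388,608 blocks per TiB
    ddt_bytes = blocks_per_tib * entry_bytes
    print(ddt_bytes / 2**30)               # 2.5 GiB of dedup table per TiB
    print(ddt_bytes / 2**30 / 0.75)        # ~3.3 GiB of ARC at the 75% metadata cap [2]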

[1] [http://www.oracle.com/technetwork/articles/servers-
storage-a...](http://www.oracle.com/technetwork/articles/servers-storage-
admin/o11-113-size-zfs-dedup-1354231.html) [2]
[https://github.com/zfsonlinux/zfs/commit/9907cc1cc8c16fa2c7c...](https://github.com/zfsonlinux/zfs/commit/9907cc1cc8c16fa2c7c03aa33264153ca2bdad6c)
[3] [http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-
ded...](http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe)

~~~
raattgift
tl;dr: I agree with you, but would put it this way: DON'T DO DEDUP AT ALL,
since even having all that RAM doesn't save you from real-world problems.

Ultimately, the deduplication table (DDT) is consulted on each write to any
dataset or zvol where deduplication is not "off".

The DDT layout on the underlying storage is effectively random. It is roughly
an AVL tree with each node corresponding to a block checksum. While the tree
is packed into as few underlying blocks as possible, as the DDT grows a larger
number of seeks are needed to reach each node. This is important because back-
to-back writes even of very similar data will need to consult spatially
separated parts of the DDT.

Bear in mind that the DDT is brought into ARC on a demand basis, i.e., all
that memory you advise goes unused if there are no writes to a dedup!=off
dataset or zvol.

Where you have enough IOPS to service a couple of random reads per write, you
do not need the DDT in memory, so it's good that it is pushed out by hotter
zfs metadata. This can be the case for high-4k-random-IOPS SSDs, for instance.

On rotating media -- spinning disks -- you will NEVER have enough IOPS.

[1] In perfectly ordinary circumstances, after a reboot, pool import, or major
ARC shrink, the DDT will have to be brought in by future writes, which
increases write latency by a factor of the track-to-track seek time for every
write. The DDT reads are synchronous, affecting the entire TXG sync, so the
extra latency affects all writes to the same pool, even to other (dedup=off)
datasets and zvols. The extra seeking will also interfere with uncached
asynchronous reads from the same pool.

[2] After a crash it is wholly possible that the DDT has to be walked during
the import phase, so your import will have to wait on many, many
track-to-track seeks before it becomes available; on openzfs, this is liable
to stall all sync tasks for all pools (i.e., "zpool list" is likely to hang,
and "zfs snapshot" on another already-imported pool might too). This is
especially likely in the presence of a deferred destroy.

[3] Dataset and snapshot destroys require consulting the DDT for practically
every DMU object in the snapshot or dataset; that's basically every file and
directory at the POSIX layer, or the inside-the-zvol filesystem's metadata
(and possibly many, many ordinary blocks because of UNMAP). If your DDT isn't
in RAM during this, that's multiple seeks per object. Additionally, the DDT
and the zfs metadata pointing to it need to be updated as well, and that is
almost never seek-free.

[4] 2 & 3 combine: if you crash or reboot out of frustration while a destroy
of a deduped dataset or zvol is in progress, your import will have to wait
until the destroy is completed, only this time NONE of the DDT will be in main
memory to start with.

[5] Resilvers and scrubs always walk the whole DDT first. If you have to
replace a leaf vdev and you have a substantial DDT, you will have a bad time.

[6] The times involved in dealing with a deferred destroy or a cold DDT walk
can range from hours to _weeks_, depending on the deduplicated data and the
free space at the time each write was first committed.

Mitigations like persistent L2ARC, or vdev pinning (allocating a top level
vdev for DDT metadata, for instance, so you can use a mirrored nvram to hold
the DDT for a pool with rotating media holding the file data) have not yet
landed in openzfs.

On fast nvram for datasets that are read-mostly and which are highly
deduplicatable, dedup=skein can be a win. However, even if seeks are virtually
"free", heavy writing to such a dataset is almost certainly going to make
one's ARC less effective for caching other zfs metadata and file data, and
main RAM is almost always much much more expensive and scarce than solid state
leaf vdev hardware. Additionally, the benefits of queueing on the hardware
itself tend to evaporate because of the additional random read load (random
reads are the enemy of throughput!).

So even where the DDT goes from merely survivable (which it never is on media
where random reads take on the order of milliseconds, because one fault will
lead you to data and system unavailability) to pretty tolerable (a wide
n-way-mirrored SSD-based pool), ZFS deduplication is unlikely to be worth
turning on, and it's hard to back out of.

It's better to focus on shared subtree strategies (zfs clone, zfs receive -o
origin=snap, checksums that allow nopwrite, etc.) and compression rather than
deduplication. And where you can't, it's better to spend the money adding more
top level vdevs than it is to spend the time dealing with a situation where
deduplication leads to unavailability.

------
Animats
They realized they're mostly storing multiple copies of spam.

~~~
jandrese
TFA says that they just deduped attachments with their homegrown system.

As I think about it more, it would make a lot of sense for a Warez group to
use big storage email services with lots of accounts and some bot where you
request a file and it mails you all of the compressed/encrypted chunks for you
to put together. I've never seen anything like that in the wild, but it seems
like it might work.

~~~
jacksonsabey
You just described Usenet; requests and fills are usually done over IRC.

~~~
jandrese
Usenet was a different system entirely. It was an application layer broadcast
mechanism. All of the content pushed to Usenet was replicated all across the
world so people could locally access it. Fantastic system for distributing
current information to thousands of people worldwide over slow backbone links.
Unfortunately it was also fantastic for broadcasting advertising to the entire
world for free, which is why Usenet for news is dead.

------
ramshanker
Does Gmail do something similar too?

This should be a standard feature of mail software. At my organization,
whenever a 2 MB file is mailed by management to the entire organization, the
mail system crawls for some time.

EDIT: Our mail system is hosted in-house.

~~~
jodrellblank
Organisations have different concerns; Microsoft Exchange used to deduplicate
email attachments internally with SIS (Single Instance Storage). It did it
within a mailbox database, not company-wide.

But when there was a single attachment that lots of people wanted to get, it
generated a lot of disk IO looking up where the attachment was stored and
going to get it. Storage space is cheap, disk IO is more expensive, and in the
Exchange 2007-to-2010 timeframe - seven or eight years ago - they took SIS out
of Exchange, so it no longer does this. Now every user reads their own mailbox
with all the data right there, and the disk IO is reduced.

Disk IO went down, and the storage design of Exchange changed to the point
where it's recommended to run it entirely on cheaper SATA disks, which is a
bigger saving than reducing storage size a bit.

[https://social.technet.microsoft.com/Forums/exchange/en-
US/e...](https://social.technet.microsoft.com/Forums/exchange/en-
US/e41d1146-0d00-45b5-a939-84b83db81e96/why-microsoft-remove-sis-features-
from-exchange-2010?forum=exchange2010)

Mail.ru has different needs from a company-internal mail server; companies can
enforce quotas and provide alternative file storage.

------
rodionos
Notice that the mail.ru audience is some 25% off its April 2015 peak. Not sure
what caused the prior spikes in usage; acquisitions, maybe?

[https://top.mail.ru/visits?id=250&period=2&date=&gender=0&ag...](https://top.mail.ru/visits?id=250&period=2&date=&gender=0&agegroup=0&ytype=visitors&ytype=visitors&aggregation=sum&days=4383)

------
IgorPartola
Is there an attack one can execute by finding collisions in SHA-1 and
delivering a different payload than expected?

~~~
phil21
If you could do that, you'd likely have much better and more profitable things
to use your SHA-1 cracking skills on.

~~~
dchest
The Secret Service shouldn't protect the vice president, because there's a
higher-value target!

This gets repeated so many times that I wonder if we should even use crypto
for anything that costs less than a billion dollars. Imagine that you can
steal a billion by figuring out how to generate a SHA-1 collision. You
generate one, then steal the money. Then what? Then you still have a way to
generate collisions.

Yes, the possibility of hash collisions is real, and it's generally assumed
we'll have SHA-1 collisions "real soon now". Just because there are more
profitable things to attack doesn't mean that we shouldn't try protecting
other systems. For example, consider MD5 collisions: anyone can generate one
for free. If someone invented a way to produce collisions for SHA-1, after
profiting from the most profitable target, they would probably try to profit
from the next most profitable target, etc., and eventually we'd learn that
there are collisions. What's next? The next step is to protect our systems
from collisions. So, we're back to square one.

The simplest way to protect from SHA-1 collisions is, of course, to switch to
a better hash function. For example, BLAKE2 - it's faster and more secure.
We'll win some time until it's broken (prediction: that won't happen anytime
soon).

Another way to protect against collisions is to build your system in such a
way that collisions don't matter. For example, they can use HMAC with a secret
key.
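
A sketch of that last idea applied to a dedup index, using keyed BLAKE2 since
Python exposes it directly (illustrative only; key management is the hard
part and is omitted here):

    import hashlib, os

    # Without the secret server-side key, an attacker cannot craft two
    # attachments that the dedup index will treat as identical.
    DEDUP_KEY = os.urandom(32)   # in practice: a fixed, persisted secret

    def dedup_key(attachment: bytes) -> str:
        return hashlib.blake2b(attachment, key=DEDUP_KEY,
                               digest_size=32).hexdigest()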

------
lucio
Are they "reinventing" reference counting and transactions?

Both concepts are in the article, but the words "reference counting" and
"transaction" are nowhere to be found

------
bluedino
Is it uncommon for email servers to NOT use de-duplication?

~~~
coredog64
ISTR something about an older version of Exchange that did dedup, but that as
soon as someone _opened_ the attachment it was no longer a candidate for
dedup.

------
wildchild
One of the crappiest services in Russia. You could remove all the propaganda
posts and double the free space.

------
kondro
I can understand an initial mail product not having deduplication implemented,
but getting to 1 PB, let alone 50 PB, before implementing this seems a little
extreme.

Glad they were able to avoid upgrading for a while now. :)

------
ajnin
To those who, like me, were confused at first: there is a typo in the title of
the article; it should read "from 50 PB to 32 PB".

------
nik736
I love such articles!

