
Optimizing Magic Pocket for cold storage - el_duderino
https://blogs.dropbox.com/tech/2019/05/how-we-optimized-magic-pocket-for-cold-storage/
======
wallflower
> We killed the project after more than nine months of active work. As an
> engineer, it is not easy to give up something you have been trying to make
> work for so long. Some of the early conversations were controversial but
> ultimately everyone agreed killing the project was the right decision for
> Dropbox and our users.

They found that the solution did not work once it was deployed for testing in
staging. Being able to walk away from that sunk cost deserves credit.

~~~
NKosmatos
They don’t provide that much info on why erasure coding didn’t work out for
them. I’m curious to see how Backblaze will handle something similar as they
scale up, now that they’re going to have multiple data centers (2 in the US
and 1 in the EU).

------
flas9sd
The author's writing style in a well-structured post resonates positively with
me. I've read some of their technical posts, though not methodically. I'm
curious what the process is at Dropbox to bring such an article to life -
whether it is done in isolation and left to the writer's abilities, or whether
there are feedback loops with an editor. Certainly a document I can take cues
from.

~~~
simonebrunozzi
Most probably PR-approved. In my experience, pretty much all public (or
soon-to-go-public) companies have a strict PR process in place to handle
almost everything that goes out.

------
NKosmatos
Nice article, well written and a good read. Always nice to see a technical
post on something many of us use daily. Extra points for them being honest
about how they treat incoming files :-) “...such as perform OCR, parse content
to extract search tokens, or generate web previews for Office documents...”

------
lukax
It looks like they use something very similar to Erasure Coding in Ceph with
RadosGW [1].

[1] https://ceph.com/planet/using-erasure-coding-with-radosgw/

~~~
kdkeyser
They have been doing erasure coding since at least 2016 (it is mentioned in
their original Magic Pocket post), just like almost all distributed storage
systems, so nothing really new there.

This article is about single-region storage vs. multi-region storage (and how
to reduce the cost in this case). There is very little public info available
about distributed storage systems in multi-region setup with significant
latency between the sites.

------
kdkeyser
The Dropbox technical blog is always a very interesting read; I love that they
provide so much detail on the technical background of their solutions. Still,
I was left with quite a few questions when trying to understand this article:

\- Dropbox did not want to be running a single version of the software / a
single region, because they consider the risk of a single software bug / human
error resulting in data loss too high. However, the alternative they chose
introduces a completely new code base which will have to be battle tested.
This increases the risk of a data-loss bug; it would affect a smaller fraction
of the data, but any significant data loss would be game-over for a company
like Dropbox. Did they consider partitioning the system into smaller subsets
(some single-region, others multi-region), using staged rollouts of new
software versions? Or is there really some fundamental incompatibility between
Magic Pocket and multi-region?

\- The "New Replication Model" story sounds a bit too simplified. It seems to
re-introduce some issues that the single-region Magic Pocket solution had
already solved: the size of the IO operations becomes quite small again
(fractions of the 4M block size), and placement of data on the disks becomes
less predictable, which could cause increasing rebuild times when a disk
fails. Also, the number of IOs to read or write an object increases
significantly (2-3x in the example), which means the observed latency
advantages go hand in hand with a 2-3x lower maximum supported load than in
the Magic Pocket case before latency explodes due to running out of IOPS on
the HDDs. The whole design seems to ask for far more IOPS than the Magic
Pocket solution, which sounds like an odd match for SMR HDDs.

These issues are maybe alleviated by the fact that moving data to the cold
tier happens asynchronously, and the cold data is accessed very infrequently,
resulting in far fewer IOPS being required for the cold storage region.
However, it also makes the option of combining hot and cold data on a single
disk much more difficult (which for HDDs is the way to make optimal use of
their limited IOPS vs. huge capacity - I suspect Amazon / Google use this for
their near-line storage solutions). Moving from the 2+1 example to e.g. 4+1,
to reduce cross-region storage costs even more, now becomes a tough call, as
it goes hand in hand with an even larger increase in IOPS cost.
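To make that last trade-off concrete with some back-of-the-envelope numbers
(my own illustrative arithmetic, not figures from the post): in a k+1 scheme,
each block is split into k data fragments plus 1 parity fragment, so you pay
(k+1)/k in storage but need k disk reads to reconstruct a block.

```python
# Back-of-the-envelope for a k+1 erasure-coding scheme across regions
# (illustrative only, not Dropbox's actual cost model).

def storage_overhead(k: int) -> float:
    """Bytes stored per byte of user data: (k + 1) / k."""
    return (k + 1) / k

def reads_per_block(k: int) -> int:
    """Fragment reads needed to reconstruct one block."""
    return k

for k in (2, 4):
    print(f"{k}+1: {storage_overhead(k):.2f}x storage, "
          f"{reads_per_block(k)} IOs per block read")
# 2+1: 1.50x storage, 2 IOs per block read
# 4+1: 1.25x storage, 4 IOs per block read
```

So going from 2+1 to 4+1 shaves storage from 1.5x down to 1.25x but doubles
the reads per block - exactly the IOPS-vs-capacity tension above.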

\- The claimed "simplicity" of deleting data in the proposed scenario is
rather relative. If they are using SMR drives, deleting data and reclaiming
space are complex and expensive operations. They might reduce it to a non-
distributed problem (which is still a significant gain, of course), but it is
far from trivial.

Probably a lot of the finer, left-out details of their cross-region system
address these issues, and if not, maybe the cross-region system and the
single-region Magic Pocket solution will converge again in a later phase.

~~~
preslavle
This is Preslav from Dropbox here. All great questions! We would have
absolutely loved to put all the interesting details in the blog post, but we
need to keep the length limited so as not to overwhelm the reader. I'll try
to answer your questions here:

> Dropbox did not want to be running a single version of the software / a
> single region, because they consider the risk of a single software bug /
> human error resulting in data loss too high. However, the alternative they
> chose introduces a completely new code base which will have to be battle
> tested. This increases the risk of a data-loss bug; it would affect a
> smaller fraction of the data, but any significant data loss would be
> game-over for a company like Dropbox. Did they consider partitioning the
> system into smaller subsets (some single-region, others multi-region), using
> staged rollouts of new software versions? Or is there really some
> fundamental incompatibility between Magic Pocket and multi-region?

Magic Pocket already employs partitioning, staged rollouts, multiple
versions, stringent operator controls, and extensive testing. This is
discussed in more detail in
https://blogs.dropbox.com/tech/2016/07/pocket-watch/

There’s no real incompatibility between Magic Pocket and multi-region, just a
general trade-off in software that we’re not willing to make in this case.
Globally replicated state would elevate availability and durability risks.
It’s true that we can introduce protections to avoid this, and we do employ
these protections, but it’s not a silver bullet - a single system would still
be vulnerable to rare “black swan” events we may not anticipate. (There is a
great example of how unexpected correlation triggered a subtle bug in
third-party vendor software at the beginning of
https://www.infoq.com/presentations/dropbox-infrastructure.)

In our approach the additional codebase for cold storage is extremely small
relative to the entire Magic Pocket codebase and importantly does not mutate
any data in the live write path: data is written to the warm storage system
and then asynchronously migrated to the cold storage system. This provides us
an opportunity to hold data in both systems simultaneously during the
transition and run extensive validation tests before removing data from the
warm system.
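Roughly, the invariant is: a block is never removed from the warm system until
its cold copy has been written and validated. In toy form (a sketch with plain
dicts, not our actual code):

```python
# Toy sketch of the warm-to-cold migration invariant: copy, validate,
# and only then delete the warm copy. Stores are modeled as dicts.

def migrate_block(block_id: str, warm: dict, cold: dict) -> None:
    data = warm[block_id]
    cold[block_id] = data                  # block now held in both systems
    assert cold[block_id] == data          # validate the cold copy first
    del warm[block_id]                     # only then drop the warm copy

warm = {"blk1": b"payload"}
cold = {}
migrate_block("blk1", warm, cold)
assert "blk1" not in warm and cold["blk1"] == b"payload"
```

The key property is that a crash at any point leaves at least one complete,
validated copy of the block somewhere.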

We use the exact same storage zones and codebase for storing each cold storage
fragment as we use for storing each block in the warm data store. It’s the
same system storing the data, just for a fragment instead of a block. In this
respect we still have multi-zone protections since each fragment is stored in
multiple zones.

> The "New Replication Model" story sounds a bit too simplified. It seems to
> re-introduce some issues that the single-region Magic Pocket solution had
> already solved: the size of the IO operations becomes quite small again
> (fractions of the 4M block size), and placement of data on the disks becomes
> less predictable, which could cause increasing rebuild times when a disk
> fails. Also, the number of IOs to read or write an object increases
> significantly (2-3x in the example), which means the observed latency
> advantages go hand in hand with a 2-3x lower maximum supported load than in
> the Magic Pocket case before latency explodes due to running out of IOPS on
> the HDDs. The whole design seems to ask for far more IOPS than the Magic
> Pocket solution, which sounds like an odd match for SMR HDDs.
>
> These issues are maybe alleviated by the fact that moving data to the cold
> tier happens asynchronously, and the cold data is accessed very
> infrequently, resulting in far fewer IOPS being required for the cold
> storage region. However, it also makes the option of combining hot and cold
> data on a single disk much more difficult (which for HDDs is the way to make
> optimal use of their limited IOPS vs. huge capacity - I suspect Amazon /
> Google use this for their near-line storage solutions). Moving from the 2+1
> example to e.g. 4+1, to reduce cross-region storage costs even more, now
> becomes a tough call, as it goes hand in hand with an even larger increase
> in IOPS cost.

Yes, the new replication model does change the average block size. There are
implications for IO, the metadata-to-file-data ratio, and the memory-to-disk
ratio, which we took into account when building the system. As you noted,
these issues are largely alleviated by the data being cold. Also, even with
SMR disks, Magic Pocket is not limited only by the IOs for serving live user
requests but also by load from background operations, such as repairs after
disk or machine failures, or compaction.

> The claimed "simplicity" of deleting data in the proposed scenario is rather
> relative. If they are using SMR drives, deleting data and reclaiming space
> are complex and expensive operations. They might reduce it to a non-
> distributed problem (which is still a significant gain, of course), but it
> is far from trivial.

Yes, compaction is a complicated problem in general, and the claimed
simplicity is relative to the other proposals discussed in the blog post. We
are not changing what each Magic Pocket region needs to do internally: after
we delete a fragment from a region, that region needs to reclaim the space
separately. This is the same problem for both the warm and cold storage
systems.

~~~
toomuchtodo
Thank you for taking the time to write this reply, all very interesting!

------
bluedino
>> Maintaining a globally available data structure with these pairs of blocks
came with its own set of challenges. Dropbox has unpredictable delete patterns
so we needed some process to reclaim space when one of the blocks gets
deleted.

I wonder what that means.

~~~
Scaevolus
If you're storing A, B, and A+B, and then the file holding block A is deleted,
what happens? You can't immediately remove A, because then you'd lose the
redundancy for B. You'd need to somehow find a _new_ pair for B, say C (maybe
another block in the same situation, whose partner D is supposed to be
deleted?), write B+C, and only then delete A, A+B, D, and C+D.

Pretty fiddly.
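Roughly, in toy form (bytewise XOR parity as a stand-in; this is my own
illustration, not Dropbox's actual scheme):

```python
# Toy sketch of XOR-parity pairing: a pair (A, B) is protected by storing
# the parity A+B, the bytewise XOR of the two blocks. Hypothetical
# illustration, not Dropbox's actual implementation.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def recover(survivor: bytes, parity: bytes) -> bytes:
    """Recover the lost half of a pair from the survivor and the parity."""
    return xor_blocks(survivor, parity)

A = b"\x01\x02\x03\x04"
B = b"\x10\x20\x30\x40"
parity_AB = xor_blocks(A, B)

# Losing B: reconstruct it from A and the parity.
assert recover(A, parity_AB) == B

# Deleting A: parity_AB alone no longer protects B, so B must first be
# re-paired with another orphaned block C (write B+C), and only then can
# A and A+B be dropped.
C = b"\xaa\xbb\xcc\xdd"
parity_BC = xor_blocks(B, C)
assert recover(C, parity_BC) == B
```

The re-pairing step is the fiddly part: every delete potentially forces a new
parity write for the surviving partner before any space can be reclaimed.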

