> Dropbox did not want to be running a single version of the software / a single region, because they consider the risk of a single software bug / human error resulting in data loss too high. However, the alternative they chose introduces a completely new code base which will have to be battle tested. This increases the risk of a data loss bug, which would affect a smaller fraction of the data, but any significant data loss issue would be game-over for a company like Dropbox. Did they consider partitioning the system into smaller subsets (some single-region, others multi-region), using staged roll-outs of new software versions? Or is there really some fundamental incompatibility between Magic Pocket and multi-region?
Magic Pocket already employs partitioning, staged rollouts, multiple versions, stringent operator controls, and extensive testing. This is discussed in more detail in https://blogs.dropbox.com/tech/2016/07/pocket-watch/
There’s no real incompatibility between Magic Pocket and multi-region, just a general trade-off in software that we’re not willing to make in this case. Globally replicated state would elevate availability and durability risks. It’s true that we can introduce protections to avoid this, and we do employ these protections, but they’re not a silver bullet: a single system would still be vulnerable to rare “black swan” events we may not anticipate. (There is a great example of how unexpected correlation triggered a subtle bug in third-party vendor software at the beginning of https://www.infoq.com/presentations/dropbox-infrastructure.)
In our approach the additional codebase for cold storage is extremely small relative to the entire Magic Pocket codebase and, importantly, does not mutate any data in the live write path: data is written to the warm storage system and then asynchronously migrated to the cold storage system. This gives us the opportunity to hold data in both systems simultaneously during the transition and run extensive validation tests before removing data from the warm system.
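The dual-hold pattern described above can be sketched roughly as follows. This is an illustrative toy, not Dropbox code: the `KVStore` class and function names are hypothetical stand-ins for the warm and cold storage systems.

```python
class KVStore:
    """Minimal in-memory stand-in for a storage system."""
    def __init__(self):
        self.blocks = {}

    def put(self, key, data):
        self.blocks[key] = data

    def get(self, key):
        return self.blocks.get(key)


def migrate_to_cold(warm, cold, key):
    """Copy a block to cold storage, validate, then delete from warm.

    The block exists in BOTH systems during the transition, so a
    validation failure never leaves us without a good copy.
    """
    data = warm.get(key)
    cold.put(key, data)
    # Validation step: read back from cold and compare before deleting.
    if cold.get(key) != data:
        raise RuntimeError("cold copy failed validation; warm copy retained")
    del warm.blocks[key]


warm, cold = KVStore(), KVStore()
warm.put("block-1", b"payload")
migrate_to_cold(warm, cold, "block-1")
assert cold.get("block-1") == b"payload" and warm.get("block-1") is None
```

The key property is ordering: the warm copy is only removed after the cold copy has been written and validated, so the live write path never touches the cold system.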
We use the exact same storage zones and codebase for storing each cold storage fragment as we use for storing each block in the warm data store. It’s the same system storing the data, just for a fragment instead of a block. In this respect we still have multi-zone protections since each fragment is stored in multiple zones.
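To make the fragment scheme concrete, here is a toy sketch of the 2+1 example from the post, under the simplifying assumption of a plain XOR parity: a block is split into two halves plus an XOR parity fragment, and any two of the three fragments suffice to reconstruct the block. Function names are illustrative, not part of Magic Pocket.

```python
def make_fragments(block: bytes):
    """Split a block into two data fragments plus an XOR parity fragment."""
    half = (len(block) + 1) // 2
    a = block[:half]
    b = block[half:].ljust(half, b"\0")  # pad so fragments align for XOR
    parity = bytes(x ^ y for x, y in zip(a, b))
    return a, b, parity, len(block)


def reconstruct(a, b, parity, orig_len):
    """Rebuild the block from any two of the three fragments (None = lost)."""
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, parity))
    elif b is None:
        b = bytes(x ^ y for x, y in zip(a, parity))
    return (a + b)[:orig_len]  # drop any padding


a, b, p, n = make_fragments(b"hello world")
assert reconstruct(None, b, p, n) == b"hello world"  # survive losing a
assert reconstruct(a, None, p, n) == b"hello world"  # survive losing b
assert reconstruct(a, b, None, n) == b"hello world"  # survive losing parity
```

Each of the three fragments would live in a different region, and within each region the fragment is stored by the same multi-zone Magic Pocket machinery that stores warm blocks.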
> The "New Replication Model" story sounds a bit too simplified. It seems to re-introduce some issues that the single region Magic Pocket solution had already solved: the size of the IO operations becomes quite small again (fractions of the 4M block size), placement of data on the disks becomes less predictable, which could cause increasing rebuild times when a disk fails. Also, the number of IOs to read or write an object increases significantly (2-3x in the example), which means that the observed advantages in latency go hand in hand with a 2-3x lower maximum supported load than in the Magic Pocket case, before the latency explodes due to running out of IOPS on the HDDs. The whole design seems to ask for far more IOPS than the Magic Pocket solution, which sounds like an odd match to SMR HDDs.
> These issues are maybe alleviated by the fact that moving data to the cold tier happens asynchronously, and the cold data is accessed very infrequently, resulting in far less IOPS being required for the cold storage region. However, it also makes the option of combining hot and cold data on a single disk much more difficult (which for HDDs is the way to make optimal use of the limited IOPS vs. their huge capacity - I suspect Amazon / Google use this for their near-line storage solution). Moving from the 2+1 example to e.g. 4+1, to reduce cross-region storage costs even more, now becomes a tough call, as it goes hand in hand with an even larger increase in IOPS cost.
Yes, the new replication model does change the average block size. There are implications for IO, the metadata-to-file-data ratio, and the memory-to-disk ratio, which we took into account when building the system. As you noted, these issues are largely alleviated by the data being cold. Also, even with SMR disks, Magic Pocket's IO budget is not consumed solely by serving live user requests: background operations, such as repair after disk or machine failures and compaction, also generate significant load.
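The storage-versus-IO trade-off behind the 2+1 vs. 4+1 question above can be captured with back-of-the-envelope arithmetic. This is a simplified model (not Dropbox's actual accounting) assuming a healthy read of a k+1 scheme must fetch k fragments:

```python
def scheme_costs(k: int):
    """For a k+1 erasure scheme, return (storage overhead factor,
    fragment reads per block read). One extra fragment of parity is
    stored for every k data fragments, and a read must gather k of them.
    """
    return (k + 1) / k, k


assert scheme_costs(2) == (1.5, 2)   # 2+1: 1.5x storage, 2 reads per block
assert scheme_costs(4) == (1.25, 4)  # 4+1: 1.25x storage, 4 reads per block
```

Under this model, going from 2+1 to 4+1 trims storage overhead from 1.5x to 1.25x but doubles the fragment reads per block, which is exactly the IOPS-versus-cost tension the question raises.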
> The claimed "simplicity" of deleting data in the proposed scenario is rather relative. If they are using SMR drives, deleting data and reclaiming space are complex and expensive operations. They might reduce it to a non-distributed problem (which is still a significant gain, of course), but it is far from trivial.
Yes, compaction is a complicated problem in general, and the claimed simplicity is relative to the other proposals discussed in the blog post. We are not changing what each Magic Pocket region needs to do internally: after we delete a fragment from each region, each region needs to reclaim the space separately. This is the same problem for both the warm and cold storage systems.