Apologies for taking so long to get this blog post out! We'll follow up with a few more posts going into more technical details.
I'm curious why you chose what seems to be an identical bucket stored on many nodes. I recently built a specialized system resembling Facebook's Haystack for storing several petabytes of images, and I think it's much simpler and more efficient to precreate all the possible 'slabs' on each machine and place the data in round-robin fashion over all open slabs. You can periodically grow these with system calls that allocate blocks of disk to make it efficient, maybe several hundred megabytes at a time when running at scale. That way each server makes its own placement decisions, and no master is necessary to figure out data placement. Especially since you went with a MySQL architecture, I think this makes a lot of sense. I guess it does make things more complex on the erasure coding side, since each block would then have to be erasure coded individually. It might be worth exploring, though, since you could completely eliminate the requirement of a master.
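To make the scheme concrete, here's a minimal sketch of masterless round-robin placement over precreated slabs. The slab size, watermark, and class names are my own assumptions for illustration, not anyone's actual implementation:

```python
import itertools

SLAB_CAPACITY = 32 * 1024**3   # assumed 32 GB slabs, grown in large increments
HIGH_WATERMARK = 0.95          # assumed: close a slab once it is 95% full

class Slab:
    def __init__(self, slab_id):
        self.slab_id = slab_id
        self.used = 0

    @property
    def open(self):
        return self.used < SLAB_CAPACITY * HIGH_WATERMARK

class SlabStore:
    """Each server owns its slabs and makes its own placement decisions."""
    def __init__(self, num_slabs):
        self.slabs = [Slab(i) for i in range(num_slabs)]
        self._rr = itertools.cycle(range(num_slabs))

    def place(self, blob_size):
        # Round-robin over open slabs; no central master involved.
        for _ in range(len(self.slabs)):
            slab = self.slabs[next(self._rr)]
            if slab.open:
                slab.used += blob_size
                return slab.slab_id
        raise RuntimeError("no open slabs; allocate more")
```

With this shape, placement state is purely local, which is what removes the need for a master in the write path.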
Of course my system is considerably simpler, since my smallest atomic unit is a whole file (keyed by SHA-1 hash + length to deduplicate); like Facebook's Haystack, it was designed primarily as an image store. I also do full replication rather than erasure coding, which I only plan to implement later.
One thing you didn't go into very much, and which would be very interesting, is the on-disk format of the bucket. I used a modified version of folly's RecordIO that returns the disk offsets, and stored my payload using Thrift with the SHA-1 hash and some flags. This allows for both quick indexing to each block and a nice recovery story. I'd also be interested in what sort of index format you are using; I store mine both in MySQL and in a Google sparsehash table keyed by record token. Storing the index locally has great benefits if the MySQL server goes down; I'd be curious whether your system does that as well.
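For readers unfamiliar with this style of on-disk format, here's a toy sketch of a length-prefixed record with an embedded SHA-1 digest, loosely in the spirit of RecordIO. The magic marker and header layout are hypothetical, not the actual folly or Dropbox formats:

```python
import hashlib
import io
import struct

MAGIC = b"RIO1"  # hypothetical marker used to resync during recovery scans

def append_record(buf, payload, flags=0):
    """Append one record to a file-like object; return its offset."""
    offset = buf.tell()
    digest = hashlib.sha1(payload).digest()
    # Header: magic (4B) + flags (4B) + payload length (8B) + SHA-1 (20B).
    header = MAGIC + struct.pack("<IQ", flags, len(payload)) + digest
    buf.write(header + payload)
    return offset

def read_record(buf, offset):
    """Read back a record by offset, verifying magic and digest."""
    buf.seek(offset)
    assert buf.read(4) == MAGIC, "corrupt or misaligned record"
    flags, length = struct.unpack("<IQ", buf.read(12))
    digest = buf.read(20)
    payload = buf.read(length)
    assert hashlib.sha1(payload).digest() == digest, "payload corrupted"
    return payload
```

The returned offsets are what you'd keep in the index (MySQL or an in-memory table), and the magic marker plus digest is what gives you the recovery story: you can rescan a slab and rebuild the index from the records themselves.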
Note that I just described the replicated data storage in the blog post. Approximately 24 hours later these volumes get packaged up and erasure coded into a much more efficient storage representation, so the initial replicated writes are just there for high durability and efficient writes while the buckets are mutable.
I'm not sure I understand exactly what your question is but I think it's asking why we use 1GB buckets allocated all over the place rather than just picking a set of disks and filling them up RAID-style?
This is a pretty important point: if we picked X nodes and stored all the same data on them, then losing one of those nodes would leave only X-1 other nodes to re-replicate from. This creates a hotspot in the system and limits how quickly we can re-replicate, which can be a pretty big deal if we lose a machine storing 1PB of data: we need to amortize that reconstruction cost over a large number of NICs and disk spindles (this is covered in more detail in the blog post).
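A quick back-of-the-envelope calculation shows why this matters. The NIC speed and node counts below are illustrative assumptions, not Dropbox's actual numbers:

```python
# Time to re-replicate 1 PB when recovery reads are limited by a
# 10 Gb/s NIC per source node (assumed numbers throughout).
LOST_BYTES = 1e15        # data on the failed machine (1 PB)
NIC_BPS = 10e9 / 8       # 10 Gb/s NIC ~= 1.25 GB/s per source node

def recovery_hours(source_nodes):
    return LOST_BYTES / (source_nodes * NIC_BPS) / 3600

# Mirrored placement: only X-1 = 2 peers hold copies of the lost data.
print(f"{recovery_hours(2):.0f} h from 2 sources")        # ~111 h
# Spread placement: buckets scattered across, say, 1000 peers.
print(f"{recovery_hours(1000):.2f} h from 1000 sources")  # ~0.22 h
```

Days versus minutes of exposure to a second failure is the durability argument for spreading buckets widely.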
The other advantage of having "small" buckets is that they're a manageable unit for combining into larger volumes when erasure coding, or for tracking partial repair operations, or for garbage collection and compaction when data is deleted.
Seems like it might be worth going into more detail on erasure coding in a future blog post.
Interesting approach. It definitely achieves your goal of spreading out the load. I think this can be achieved without always storing the blocks together, though, which might help move towards a less coordinated system.
Here is how I solved the problem:
You can arrange each instance (you call it a cell in your post) so it has different pools of storage. One arrangement that might make sense: the machine at the top of each rack forms one pool, all the next machines down form another pool, and so on. Replication is always done across pools. The client writes to the machines in a pool in round-robin fashion, and the end user can choose how many pools they want to replicate over.
Abstract each machine as a storage unit. Each storage unit actually contains a series of 'slabs': largish files (generally 32 GB in size) that grow in some large increment and close out when they hit a high watermark. Each write to a machine goes to a different writable slab in round-robin fashion. Each server knows the path and contents of each slab; on disk failure, all the individual pieces are re-replicated to another set of open slabs just as if the blocks came in fresh (excluding the slabs that already have a copy). Since all requests are balanced across the pools, and each client manages its own round-robin list locally, the individual pieces end up evenly distributed over the machines in the other pools.
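The client side of this scheme is small. Here's a sketch of per-client round-robin replica selection across pools; the pool layout and class name are invented for illustration:

```python
import itertools

class PoolClient:
    """Each client keeps its own round-robin position per pool, so writes
    spread evenly with no central master (a sketch of the scheme above,
    with assumed pool definitions)."""

    def __init__(self, pools):
        # pools: list of lists of machine names, e.g. one list per rack row.
        self._cycles = [itertools.cycle(machines) for machines in pools]

    def pick_replicas(self, num_pools):
        # One machine from each of the first num_pools pools;
        # replication always crosses pools, never stays within one.
        return [next(self._cycles[i]) for i in range(num_pools)]
```

Because every replica set spans distinct pools, losing an entire pool (e.g. a rack position) still leaves a full copy elsewhere.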
I tried hard to optimize for a zero-master situation, so my clients handle contacting all the different machines and do their own caching and retrying. It sounds like I had different goals than you guys did, so I'm not surprised you have different trade-offs.
Do you think a 512MB bucket might be more optimal than a 1GB bucket? Half the data to read when restoring an erasure-coded bucket, and 2x quicker restore times, for only a 2x increase in bucket metadata costs. Erasure-coded bucket slices (around 85MB per slice, assuming 6/9 Reed-Solomon) should be just right to amortize disk seeks.
It's always a bit of a tradeoff. A 512MB bucket size would double the size of the Replication Table, and we want to make sure we keep the entire working set in the buffer cache. 1GB also has the advantage of filling up more slowly, which gives us more time to close full buckets and create new ones as needed.
That said, 1GB is a somewhat arbitrary number that just happens to work pretty well; 512MB might work just fine. The exact number isn't that significant, since we typically lose a whole disk at a time and thus need to recover a certain number of TB regardless of the bucket size.
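The numbers behind the tradeoff are easy to check. Assuming the 6/9 Reed-Solomon scheme mentioned above (6 data shards):

```python
def bucket_stats(bucket_mb, data_shards=6):
    """Slice size and metadata rows per TB for a given bucket size."""
    slice_mb = bucket_mb / data_shards          # each data slice
    rows_per_tb = (1024 * 1024) // bucket_mb    # Replication Table rows per TB
    return slice_mb, rows_per_tb

print(bucket_stats(1024))  # 1GB buckets: ~170.7 MB slices, 1024 rows/TB
print(bucket_stats(512))   # 512MB buckets: ~85.3 MB slices, 2048 rows/TB
```

So halving the bucket does yield the ~85MB slices and 2x restore granularity, at the cost of doubling the metadata rows that need to stay in the buffer cache.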
Has the investment into bespoke storage technologies and data centers paid off already, given the real costs to build and set it up, and opportunity costs for other uses of engineering time? How long did it take to pay off through reduced costs, or when do you expect it will in the future? It would be interesting to learn at what point it might be worthwhile for a large player in a space like file storage to switch from existing commodity cloud solutions or proprietary on-premise solutions to custom ones -- as an extrapolation basis for other problems.
The general implication behind my question is: a custom solution can provide lower cost when viewed in isolation on a per-unit basis, but the total cost to the business must also account for the engineering work to develop, troubleshoot, deploy, and migrate to the storage system; the time needed to lease and build data centers; the long-term operation of facilities and software stacks; and hiring people to do all of that specifically. It must also account for the opportunity cost of the things one could have done but can't, given the decision to focus on this problem (e.g., if the investment purely lowers cost, that might be a tradeoff against another project that adds differentiating features; I know your blog post also mentioned performance, so it's not either-or). To make a good decision, one must also weigh the expected value of the alternatives: the current and predicted feature sets of existing cloud or on-premise storage technologies, their expected prices over time, and their expected regional availability. There's also the complication of owning substantially more assets rather than carrying operational expenses with cloud providers or other third parties.
I'm just curious to learn more about how you think about navigating the decision space. Of course, there's significant strategic value in "owning one's own destiny", but it also seems to have required significant investment and attention from the Dropbox engineering team for perhaps a few years.
Any general advice regarding how big your use case needs to be before this kind of engineering is worthwhile? How do you analyze the situation to identify when that's the case, and how do you identify the inflection point with confidence? Was it your prototyping that helped you determine that you could execute on this approach?
It's a tradeoff that depends on your business and your team, your risk tolerance, and your time scales. But as a rule of thumb: if you don't have more than 10PB or so, you probably shouldn't bother.
Of course if you don't need 11+ nines of durability or multi-datacenter redundancy, or if you have lower access requirements, then you can exploit those reduced requirements to your advantage.
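To give a feel for how replication factor drives those durability nines, here's a deliberately simplified toy model. The failure rate, repair window, and independence assumption are all mine; real systems must also model correlated failures, which this ignores:

```python
import math

# Toy model (all numbers assumed): probability that every replica of a
# block fails before re-replication completes, assuming independence.
AFR = 0.04            # assumed 4% annual disk failure rate
REPAIR_HOURS = 24     # assumed window before a lost replica is restored

def durability_nines(replicas):
    p_window = AFR * REPAIR_HOURS / (365 * 24)  # failure prob within window
    loss = p_window ** replicas
    return -math.log10(loss)

print(f"2 replicas: ~{durability_nines(2):.1f} nines")
print(f"3 replicas: ~{durability_nines(3):.1f} nines")
```

Even this crude model shows why 3-way replication lands in the 11-nines neighborhood while 2-way doesn't, and why shorter repair windows (the fast re-replication discussed above) matter as much as copy count.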
We simply benefit from our very large scale and from close alignment between hardware and software, which lets us tailor our infrastructure to our workloads.
I'm curious about how you're managing the data on a drive itself. Are you storing the blocks as individual files on a filesystem? Are you doing direct management of the block device itself? Something else?
That said, up until somewhat recently our data really was written into 1GB extents stored as files on an XFS filesystem. This mostly worked fine but we switched away from this so that we could directly manage disk layout for SMR storage (a new type of hard drive that has bad random write performance), along with some performance improvements, increases in storage density, and some minor reliability improvements from avoiding the filesystem.
It takes a lot more operational tooling to directly manage a block device since you obviously can't use all the standard filesystem tools.
a) a huge number of smallish files, accessed by a large number of clients
b) file writes, rather than appends
c) cross-cell, cross-zone replication
Is this impression correct, or am I missing other major design differences?
If I remember correctly GFS has a coordinator node in the critical path, much like the name node in HDFS, which impacts availability if it goes down. The nice thing about the Master in MP is that it's completely disjoint from the data-plane and not a critical component in the system.
Your assessment is pretty accurate, and the cross-zone replication component is one distinguishing characteristic. MP also makes heavy use of highly-efficient erasure coding as a first-class design element, which wasn't included in the original implementation of GFS but was added later.
The biggest difference from an API perspective is that MP is only a blob store whereas GFS is an actual filesystem (we implement the filesystem at a higher abstraction layer). From what I understand MP is closer in spirit to the underpinnings of Colossus than GFS.
I'd be very interested in any information on how you gave yourselves confidence in the product. There are so many ways a software issue could bring down all that redundancy.
I'll give a slightly lazy answer here, however, and point you to a talk I gave about building durable systems, which covers a lot of this material:
Memory usage/control, easy FFI with C, safety/correctness of code. Those were the big advantages for our projects.
Edit: deterministic destruction of resources is pretty great too. Yay RAII.
Edit 2: So is the ability to use the "standard" C tool suite for performance, memory, debugging, etc. valgrind, jemalloc profiling, gdb/lldb, perf. All that stuff.
There are higher-order problems that result in that insight.