
Hi HN! A couple of us from the Magic Pocket software team are around to answer questions if anyone has some.



The article makes a brief mention of Go causing issues with RAM usage. Was this due to large heap usage, or was it a problem of GC pressure/throughput/latency? If the former, what were some of the core problems that could not be further optimized in Go? If the latter, I've heard recent versions have had some significant improvements -- has your team looked at that and thought that you would have been OK if you just waited, or was there a fundamental gap that GC improvements couldn't close?

Could you comment more generally on what advantages Rust offered and where your team would like to see improvement? Are there portions of the Dropbox codebase where you'd love to use Rust but can't until <feature> RFCs/lands/hits stable? Are there portions where the decision to use Rust caused complications or problems?


Good questions, let me try to tackle them one by one.

> The article makes a brief mention of Go causing issues with RAM usage. Was this due to large heap usage, or was it a problem of GC pressure/throughput/latency? If the former, what were some of the core problems that could not be further optimized in Go?

There were many reasons for using Rust, and memory was one of them.

Primarily, for this particular project, the heap size is the issue. One of the games in this project is optimizing how little memory and compute you can use to manage 1GB (or 1PB) of data. We utilize lots of tricks like perfect hash tables, extensive bit-packing, etc. Lots of odd, custom, inline and cache-friendly data structures. We also keep lots of things on the stack when we can to take pressure off the VM system. We do some lock-free object pooling stuff for big byte vectors, which are common allocations in a block storage system.

It's much easier to do these particular kinds of optimizations using C++ or Rust.
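
For readers who haven't seen bit-packing before, here is a minimal Rust sketch of the general idea. The `Locator` type, its field names and the 24/28/12-bit split are invented for illustration; nothing here is taken from the Magic Pocket code.

```rust
/// Hypothetical block locator, packed into one 64-bit word:
///   bits 40..64  volume id (24 bits)
///   bits 12..40  offset within the volume, in 4 KiB units (28 bits)
///   bits  0..12  length, in 4 KiB units (12 bits)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Locator {
    pub volume: u32,
    pub offset: u32,
    pub length: u16,
}

const VOLUME_BITS: u32 = 24;
const OFFSET_BITS: u32 = 28;
const LENGTH_BITS: u32 = 12;

/// Packs a locator into a u64, or returns None if a field does not fit.
pub fn pack(loc: Locator) -> Option<u64> {
    let (v, o, l) = (loc.volume as u64, loc.offset as u64, loc.length as u64);
    if v >= (1u64 << VOLUME_BITS) || o >= (1u64 << OFFSET_BITS) || l >= (1u64 << LENGTH_BITS) {
        return None;
    }
    Some((v << (OFFSET_BITS + LENGTH_BITS)) | (o << LENGTH_BITS) | l)
}

/// Inverse of `pack`: shifts and masks each field back out of the word.
pub fn unpack(word: u64) -> Locator {
    Locator {
        volume: (word >> (OFFSET_BITS + LENGTH_BITS)) as u32,
        offset: ((word >> LENGTH_BITS) & ((1u64 << OFFSET_BITS) - 1)) as u32,
        length: (word & ((1u64 << LENGTH_BITS) - 1)) as u16,
    }
}
```

The unpacked struct occupies 12 bytes (two `u32` fields plus a padded `u16`), while the packed form is 8, which adds up when an index holds a very large number of entries.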

In addition to basic memory reasons, saving a bit of CPU was a useful secondary goal, and that goal has been achieved. The project also has a fair amount of FFI work with various C libraries, and a kernel component. Rust makes it very easy and zero-cost to work closely with those libraries/environments.

For this project, pause times were not an issue. This isn't a particularly latency-sensitive service. We do have some other services where latency does matter, though, and we're considering Rust for those in the future.

> Could you comment more generally on what advantages Rust offered and where your team would like to see improvement?

The advantages of Rust are many. Really powerful abstractions, no null, no segfaults, no leaks, yet C-like performance and control over memory and you can use that whole C/C++ bag of optimization tricks.

On the improvements side, we're in close contact with the Rust core team--they visit the office regularly and keep tabs on what we're doing. So no, we don't have a ton of things we need. They've been really great about helping us out when those things have sprung up.

Our big ask right now is the same as everyone else's--improve compile times!

> Are there portions where the decision to use Rust caused complications or problems?

Well, Dropbox is mostly a golang shop on the backend, so Rust is a pretty different animal than everyone was used to. We also have a huge number of good libraries in golang that our small team had to create minimal equivalents for in Rust. So, the biggest challenge in using Rust at Dropbox has been that we were the first project! So we had a lot to do just to get started...

The other complication is that there is a ton of good stuff that we want to use that's still being debated by the Rust team, and therefore marked unstable. As each release goes on, they stabilize these APIs, but it's sometimes a pain working around useful APIs that are marked unstable just because the dust hasn't settled yet within the core team. Having said that, we totally understand that they're being thoughtful about all this, because backwards compatibility implies a very serious long-term commitment to these decisions.


> On the improvements side, we're in close contact with the Rust core team

One small note here: this is something that we (Rust core team) are interested in doing generally, not just for Dropbox. If you use Rust in production, we want to hear from you! We're very interested in supporting production users.


Thanks very much for the detailed and thoughtful answers!

I've read before (somewhere, I think) that Dropbox effectively maintains a large internal "standard library" rather than relying on external open source efforts. How much does Magic Pocket rely on Rust's standard library and the crates.io ecosystem? Could you elaborate on how you ended up going in whichever direction you chose with regards to third-party open source code?


We use 3rd parties for the "obvious" stuff. Like, we're not going to reinvent JSON serialization. But we typically don't use any 3rd party frameworks on the backend. So things like service management/discovery, RPC, error handling, monitoring, metadata storage, etc. are a big in-house stack.

So, we use quite a few crates for the things it makes no sense to specialize in Dropbox-specific ways.


Cool. This might be getting into the weeds a bit, but are you still on rustc-serialize for json or are you trying to keep up with serde/serde_json? If you're using serde, are you on nightly? From your comment above I got the impression that only using stable features was very important, so I'm curious how your codebase implements/derives the serde traits.


We're on rustc-serialize. JSON is not really a part of our data pipeline, just our metrics pipeline. So the performance of the library is not especially critical.


Are you guys hiring Rust developers by chance? Asking for a friend :)


Very possibly :) Drop me an email at james@dropbox.com.


How do you do network I/O with Rust? Thread-per-connection, non-blocking (using mio or?), or something else?


We have an in-house futures-based framework (inspired by Finagle) built on mio (a non-blocking, libevent-like library for Rust). All I/O is async, but application work is often done on thread pools. Those threads are freed as soon as possible, though, so that I/O streams can be handled purely by "the reactor", and we keep the pools as small as possible.
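
A rough sketch of the "hand application work to a small pool, then get out of the way" part, using only the standard library. This is not mio and not futures, and it is certainly not the in-house framework; `Job`, `checksum` and `run_on_pool` are hypothetical names, and the calling thread here simply stands in for the reactor.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// A unit of application work: a request id plus the bytes to process.
pub struct Job {
    pub id: u64,
    pub payload: Vec<u8>,
}

/// Hypothetical CPU-bound step: a simple multiplicative checksum.
pub fn checksum(data: &[u8]) -> u32 {
    data.iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u32))
}

/// Runs `jobs` on `workers` threads and returns (id, checksum) pairs sorted by id.
pub fn run_on_pool(jobs: Vec<Job>, workers: usize) -> Vec<(u64, u32)> {
    let (job_tx, job_rx) = mpsc::channel::<Job>();
    let (done_tx, done_rx) = mpsc::channel::<(u64, u32)>();
    // std's Receiver cannot be cloned, so the workers share it behind a mutex.
    let job_rx = Arc::new(Mutex::new(job_rx));

    let mut handles = Vec::new();
    for _ in 0..workers {
        let job_rx = Arc::clone(&job_rx);
        let done_tx = done_tx.clone();
        handles.push(thread::spawn(move || loop {
            // The lock guard is dropped at the end of this statement,
            // so the job itself runs without holding the lock.
            let next = job_rx.lock().unwrap().recv();
            match next {
                Ok(job) => done_tx.send((job.id, checksum(&job.payload))).unwrap(),
                Err(_) => break, // all senders dropped: no more work
            }
        }));
    }
    drop(done_tx); // only the workers hold result senders now

    for job in jobs {
        job_tx.send(job).unwrap();
    }
    drop(job_tx); // lets idle workers exit

    // Ends once every worker has finished and dropped its sender.
    let mut results: Vec<(u64, u32)> = done_rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    results.sort();
    results
}
```

In a real reactor the results would be polled as they arrive instead of collected in one go; the point is only that worker threads hold a job for as short a time as possible.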


Any plans to open-source the futures-based Rust framework? :)


From a parallel conversation[0] on the Rust subreddit:

>Are you going to open source anything?

>Probably. We have an in-house futures-based I/O framework. We're going to collaborate with the rust team and Carl Lerche to see if there's something there we can clean up and contribute.

[0]: https://www.reddit.com/r/rust/comments/4adabk/the_epic_story...


Did you guys use a custom allocator for Rust? And if so how did it differ from jemalloc and how could it be compared to C++ allocators like tbb::scalable_allocator?


We use a custom build of jemalloc, with profiling enabled so that we can use its profiling support.

We also tweak jemalloc pretty heavily for our workload.


Please do a technical blog post on some of these:

1. Go vs Rust at Dropbox's scale and requirements.

2. Maintaining availability whilst moving from AWS to Diskotech and Magic Pocket.

3. Internals of Magic Pocket (file-system, storage engine, availability guarantees, scaling up and scaling out, compression details, load balancing, cloning etc)

4. Improvements in perf, stability, security, and cost.

Thanks.


Yep we're going to get at least a few actual technical blog posts online in the coming month. We haven't got around to writing them yet tho so feel free to surface any requests :)

Will most likely start with the following:

1. Overall architecture and internals.

2. Verification and protection mechanisms, how we keep data safe.

3. How we manage a large amount of hardware (hundreds of thousands of disks) with a small team, how we automatically deal with alerts etc.

4. A deep-dive into the Diskotech architecture and how we changed our storage layer to support SMR disks.

Hopefully these will be of a sufficient level of detail. We certainly won't be getting into any details on cost but we're pretty open from a technical perspective.

(Lemme just take a moment to say how great Backblaze's tech blog posts are btw.)


> Yep we're going to get at least a few actual technical blog posts online in the coming month. We haven't got around to writing them yet tho so feel free to surface any requests :)

Here's a request: how did you migrate all that data? Are you going to be open-sourcing any of the tools that you built up? We'd be interested in hearing about that process. A lot of folks using B2 right now are trying to do this exact thing (or making copies if not actual migrations). Cheers! ;-)


I would be especially interested to hear how a focus on actual customer experience was maintained during all of this deep technology work.

One of the ways tech companies go wrong is by putting the tech first and the users second. I think Dropbox's early competitive advantage was putting the user experience first. Everybody else had weird, complex systems; you folks just had a magic folder that was always correct. User focus like that is easier to pull off when it's just a few people. But once you break the customer problem up into enough pieces that hundreds of engineers can get involved, sometimes the tail starts wagging the dog and technically-oriented decisions start harming the user experience.

How did you folks avoid that?


Yev here -> Let me just say you're welcome and thanks for joining the party :D


Do you know what Amazon is going to do with their excess 500PB of capacity? Are they scaling up fast enough that it isn't a big deal? Is Glacier selling your vacated storage?

You must have left a large hole in AWS' revenue stream. I assume their fees for transferring your data out of their system helped cover that short-term, but 500PB of data is a lot of storage capacity to sell.

Your other answers indicate your separation from AWS was very amicable, much more amicable than I would have expected. Either "it's just business" or they have plans to backfill the hole you left.


One of the AWS people at Re:Invent mentioned that Amazon is storing 8EB across all storage services (S3, Glacier etc.)

So 500PB is ~6% of that... Although I'm sure Amazon would have liked to keep Dropbox as a customer, it's probably a pretty small percentage of their revenue. Also, if you assume that AWS is growing at ~50% year over year then 6% of S3 is only about one or two months' worth of growth.

I think it was this talk but I'm not sure: https://www.youtube.com/watch?v=3HDQsW_r1DM
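
The back-of-the-envelope numbers above can be checked with a few lines of Rust. This assumes decimal units (1 EB = 1000 PB) and smooth compound growth, neither of which is stated in the thread; the function names are just for illustration.

```rust
/// Share of `part_pb` petabytes in a total of `total_eb` exabytes, in percent.
/// Assumes decimal units: 1 EB = 1000 PB.
pub fn share_percent(part_pb: f64, total_eb: f64) -> f64 {
    100.0 * part_pb / (total_eb * 1000.0)
}

/// Months of compound growth needed for capacity to grow by `fraction`
/// (0.0625 means 6.25%), given a yearly growth rate (0.5 means 50% per year).
pub fn months_to_grow(fraction: f64, yearly_growth: f64) -> f64 {
    12.0 * (1.0 + fraction).ln() / (1.0 + yearly_growth).ln()
}
```

With these assumptions, `share_percent(500.0, 8.0)` gives 6.25, and `months_to_grow(0.0625, 0.5)` comes out a little under two months, which agrees with the "one or two months" estimate.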


With storage you're constantly mothballing and replacing old iron behind the scenes, so Amazon might just ditch a bunch of old depreciated drives without hard feelings. AWS being on the up and up for SMEs, they are probably not crying over losing one large but very demanding customer -- and DB are still on AWS for European customers anyway.


Who's part of your security team over there? Most of the big cloud companies have people I trust working for them and I've been pretty comfortable telling most of my clients that cloud services are often going to have a better security team than they will. :)

But now that you've moved away from AWS, have you expanded your security team to help make up for the fact that you need folks to cover all your infrastructure security needs now too? Have you made the hires you needed to deal with the low level security issues in hw and kernel? How have you all been dealing with this so far?

I ask, because I help people make sense of their security issues and my clients explicitly ask me about your company. So keeping track of what you all are doing there is pretty relevant!


We have several dozen very, very good engineers working on a few discrete teams--infrastructure security, product security, etc. I don't feel comfortable getting any more specific than that; we keep security information pretty close to the vest.


To be honest, that response is less than comforting. But I understand how that'd be your answer given your position in the organization. Hopefully someone else at Dropbox feels like sharing more at some point and being more forthright about your company's practices in protecting data others entrust to them.

Feel free to refer others to this thread if you think it'd be helpful.


That's disingenuous. The awesome sec teams at AWS or Google are not going to review and maintain your internet-facing VM network config, which is where the real threats are. Sure, their firewalls might be a bit more robust (you don't really know, you can't see them), but anything behind them is just as insecure as you configure it yourself, and that's where penetration happens in real life. Threats at hw or kernel level in a hypervisor are extremely hard to exploit; by the time you get close to that stack you can usually see much easier and juicier targets to harvest.


Sure. So your point is they've always needed a good security team? I'd agree with that. Either way, they should still be open about their security practices and have good answers to the questions I asked.


Still only one shared private key for all of dropbox...


I'm curious what your inter-language interop situation looks like. Do you have any Go code calling directly into Rust code (or vice versa)? Are they segregated into standalone programs? If so, are they doing communication via some form of IPC?

Cool stuff, really interesting read!


Not yet, but we will very soon. We'll be using Rust as a library with a C ABI, and calling it via CGO.
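
For anyone curious what "Rust as a library with a C ABI" looks like in general, here is a minimal sketch. The function `mp_checksum` is made up for illustration and has nothing to do with the actual Dropbox code; it only shows the shape of an export that cgo could call.

```rust
use std::slice;

/// Hypothetical exported function: a simple multiplicative checksum
/// over a byte buffer.
///
/// Built as a `staticlib` or `cdylib`, this symbol could be declared on the
/// C side as `uint32_t mp_checksum(const uint8_t *data, size_t len);`
/// and called from Go through cgo.
///
/// # Safety
/// `data` must be null or point to `len` readable bytes.
#[no_mangle] // keep the symbol name; the 2024 edition spells this #[unsafe(no_mangle)]
pub unsafe extern "C" fn mp_checksum(data: *const u8, len: usize) -> u32 {
    if data.is_null() || len == 0 {
        return 0;
    }
    // The caller promised that `data` is valid for `len` bytes.
    let bytes = unsafe { slice::from_raw_parts(data, len) };
    bytes
        .iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u32))
}
```

Only plain C types cross the boundary (a pointer, a length, an integer), so neither side needs to know about the other's allocator or runtime.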


Does Magic Pocket use any non-standard error-correction algorithms, or just parity-style RAID5 or 6?


We use a variant on Reed-Solomon coding that's optimized for lower reconstruction cost. This is similar to Local Reconstruction Codes but a design/implementation of our own.

The data placement isn't RAID. We encode aggregated extents of data in "volumes" that are placed on a random set of storage nodes (with sufficient physical diversity and various other constraints). Each storage node might hold a few thousand volumes, but the placement for each volume is independent of the others on the disk. If one disk fails we can thus reconstruct those volumes from hundreds of other disks simultaneously, unlike in RAID where you'd be limited in IOPS and network bandwidth to a fixed set of disks in the RAID array.

That probably sounded confusing but it's a good topic for a blog post once we get around to it.
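
Until that blog post exists, here is a toy illustration of why local groups lower reconstruction cost. It uses plain XOR parity per group, which is much simpler than Reed-Solomon or Local Reconstruction Codes (those work over a finite field and add global parities too), and it is only an assumption-laden sketch, not the scheme described above. The function names are hypothetical.

```rust
/// XORs all blocks of a local group into a single parity block.
/// Blocks are assumed to have equal length.
pub fn xor_parity(group: &[Vec<u8>]) -> Vec<u8> {
    let len = group.first().map_or(0, |b| b.len());
    let mut parity = vec![0u8; len];
    for block in group {
        for (p, b) in parity.iter_mut().zip(block) {
            *p ^= *b;
        }
    }
    parity
}

/// Rebuilds the block at index `lost` from the group's parity and the
/// surviving blocks of the same group. The entry at `lost` is never read.
pub fn rebuild(group: &[Vec<u8>], parity: &[u8], lost: usize) -> Vec<u8> {
    let mut out = parity.to_vec();
    for (i, block) in group.iter().enumerate() {
        if i == lost {
            continue;
        }
        for (o, b) in out.iter_mut().zip(block) {
            *o ^= *b;
        }
    }
    out
}
```

With six data blocks split into two groups of three, repairing one lost block reads two surviving blocks and one parity block from its own group, rather than touching all six.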


That is extremely cool. Please tell the author(s) of that system that a stranger on the internet has appreciation for that feat!


You might be interested in a paper Microsoft published a few years ago, entitled "Erasure Coding in Windows Azure Storage", which describes some similar concepts in greater detail:

http://msr-waypoint.com/en-us/um/people/yekhanin/Papers/Usen...


Hey that is cool. I worked on the first commercial disk array at Thinking Machines Corporation in 1988. We used Reed-Solomon encoding because we didn't trust that the drives would catch all the errors (they did). Good work solving some of those bandwidth constraints in rebuilding the data. John (Smitty) Hayes


We use Reed-Solomon variants with local reconstruction codes.


What's it like working with Rust full-time? Is it similar to using any other language for a job? Do you find yourself enjoying coding more? Do you get unnecessarily caught up in micro-optimizations for things you wouldn't blink at in Go/Java?


Do you use much in the way of formal specifications? I've heard that Amazon currently uses TLA+ to verify the semantics of their systems.


Interesting. Here's the paper Amazon published: http://research.microsoft.com/en-us/um/people/lamport/tla/fo...

It's only covering a few use cases and doesn't go into too much detail and shows example code but it seems like using TLA+ has been beneficial for them.


> If a bunch of people shared some files via Dropbox, the company stored the files on Amazon’s Simple Storage Service, or S3, while housing all the metadata related to those files—who they belonged to, who was allowed to download them, and more—on its own machines inside its own data center space.

I'm surprised at this. I would have expected the opposite: using AWS for the application and metadata but managing your own storage infrastructure. Much like how Netflix runs on AWS but streams video through its self-managed CDN. Can you shed some light on why it was designed this way?


Building a massive, super-reliable storage system was not really feasible given Dropbox's resources back when it first started up.


Not related to Magic Pocket, but when the article talks about Dropbox's in-house cloud, are they saying that you have your own instance of some minimal cloud-computing services? If you're permitted to disclose, I would love to know whether it's based on OpenStack.


Which OS and FS is the Rust code running on?


No filesystem - the Rust codebase directly handles the disk layout and scheduling.

Our previous version of the storage layer ran on top of XFS which was probably a bad choice since we ran into a few XFS bugs along the way, but nothing serious.


This is really cool! Can you give more details on how blocks, replications, metadata, etc. are being handled?


Lazy answer but we'll blog about this in the next month.

On a high level it's variable-sized blocks packed into 1GB extents, which are then aggregated into volumes and erasure coded across a set of disks on different machines/racks/rows/etc. We also replicate cross-country in addition to this. Live writes are written into non-erasure-coded volumes and encoded in the background.

The volume metadata on the disks contains enough information to be self-describing (as a safety precaution), but we also have a two-level index that maps blocks to volumes and volumes to disks.
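
A two-level index of that shape can be sketched with two hash maps. The string keys, disk names and method names below are placeholders chosen for illustration, not the real schema, and a production index would obviously not live in a single in-memory struct.

```rust
use std::collections::HashMap;

/// Hypothetical two-level index: block key -> volume id -> disks.
#[derive(Default)]
pub struct Index {
    block_to_volume: HashMap<String, u32>,
    volume_to_disks: HashMap<u32, Vec<String>>,
}

impl Index {
    /// Records which volume a block lives in.
    pub fn add_block(&mut self, block: &str, volume: u32) {
        self.block_to_volume.insert(block.to_string(), volume);
    }

    /// Sets (or replaces) the set of disks holding a volume.
    pub fn place_volume(&mut self, volume: u32, disks: &[&str]) {
        let disks: Vec<String> = disks.iter().map(|d| d.to_string()).collect();
        self.volume_to_disks.insert(volume, disks);
    }

    /// Resolves a block to the disks that hold its volume.
    pub fn locate(&self, block: &str) -> Option<&[String]> {
        let volume = self.block_to_volume.get(block)?;
        self.volume_to_disks.get(volume).map(|d| d.as_slice())
    }
}
```

The indirection means that moving or re-encoding a volume updates one volume entry, and every block in that volume resolves to the new disks without its own entry changing.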

More info to come later.


Are live writes replicated in real time, or are they locally staged (which is what Facebook does, I believe)?

Also, do you mimic the eventually consistent behaviour of AWS, or do you offer a stronger form of consistency?


Live writes are written out with 4x redundancy in the local zone, then asynchronously replicated out to the remote zone. Some time later, they are erasure coded into a more space-efficient format, independently in each zone.



