
Millions of Tiny Databases [pdf] - aratno
https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c2390/millions-of-tiny-databases.pdf
======
mjb
This is my paper (along with Tao and Fan). It's a great feeling to have this
published and available, and I'm super proud of the team behind Physalia.

There's a lighter-weight introduction to the work here:
[https://www.amazon.science/blog/amazon-ebs-addresses-the-cha...](https://www.amazon.science/blog/amazon-ebs-addresses-the-challenge-of-the-cap-theorem-at-scale)
and for those attending NSDI, I'll be talking about Physalia in the
"Deployment Experience" session on Wednesday.

~~~
maxmcd
Very cool. I haven't read the whole paper yet, but from a quick overview it
seems somewhat similar to SLOG (which also deals with world-scale replication
by trying to keep data closer to the nodes that use it):

- [http://www.vldb.org/pvldb/vol12/p1747-ren.pdf](http://www.vldb.org/pvldb/vol12/p1747-ren.pdf)

- [https://blog.acolyer.org/2019/09/04/slog/](https://blog.acolyer.org/2019/09/04/slog/)

Any thoughts on this comparison?

~~~
mjb
Very interesting, I hadn't seen SLOG before. It seems like there's a fairly
similar core insight: placing data near where it's used can help with some
system properties. They appear to be more latency-focused and we were
(primarily) aiming at availability.

The other difference in Physalia is our focus (again, for availability) on
placement for 'blast radius'. That means we try to limit the number of cells
that any one failure (software, infrastructure, etc) can touch. Geo-replicated
systems can have similar concerns, but I haven't seen the same level of focus
on it as a key design goal.
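
To make 'blast radius' concrete, here is a rough sketch (not Physalia's actual code) that counts how many cells a single failure touches, and how many would lose a majority. The names `blast_radius`, `cells_losing_quorum`, the rack labels, and the three-node cells are all assumptions made for illustration.

```python
from collections import defaultdict


def blast_radius(cells, domain_of):
    """Map each failure domain to the set of cells with a node in it."""
    touched = defaultdict(set)
    for cell, nodes in cells.items():
        for node in nodes:
            touched[domain_of[node]].add(cell)
    return dict(touched)


def cells_losing_quorum(cells, domain_of, failed_domain):
    """Return cells that no longer have a strict majority of nodes alive."""
    lost = []
    for cell, nodes in cells.items():
        alive = sum(1 for n in nodes if domain_of[n] != failed_domain)
        if alive <= len(nodes) // 2:
            lost.append(cell)
    return sorted(lost)


# Hypothetical topology: six nodes spread over three racks.
domain_of = {
    "n1": "rackA", "n2": "rackA",
    "n3": "rackB", "n4": "rackB",
    "n5": "rackC", "n6": "rackC",
}
# Hypothetical placement of three cells, three nodes each.
cells = {
    "c1": ["n1", "n2", "n3"],  # two nodes share rackA
    "c2": ["n3", "n5", "n6"],  # two nodes share rackC
    "c3": ["n1", "n3", "n5"],  # one node per rack
}

radius = blast_radius(cells, domain_of)
```

In this toy placement, a rackA failure touches c1 and c3 but only c1 loses its majority, because c3 spreads its nodes across racks. Shrinking the sets in `radius` is the kind of goal a blast-radius-aware placement would aim for.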

------
kthejoker2
Thought for sure this'd be a thinkpiece on Excel in the enterprise ...

Seriously, though, this whole paper uses an amazing amount of terminology -
blast radius, colony, color, game day, split brain - and an awesome biological
metaphor of the Portuguese man o'war.

Great read even if you don't care about fault tolerance, CAP theorem, or
distributed balancing at AWS-scale.

One sample quote of the value of cheap heuristics over full-blown number-
crunching:

> Globally optimizing the placement of Physalia volumes is not feasible for
> two reasons, one is that it’s a non-convex optimization problem across huge
> numbers of variables, the other is that it needs to be done online because
> volumes and cells come and go at a high rate in our production environment.
> Figure 11 shows the results of using one very rough placement heuristic: a
> sort of bubble sort which swaps nodes between two cells at random if doing
> so would improve locality. In this simulation, we considered 20 candidates
> per cell. Even with this simplistic and cheap approach to placement,
> Physalia is able to offer significantly (up to 4x) reduced probability of
> losing availability.
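
For anyone who wants to play with the idea, here is a minimal sketch of that kind of random-swap heuristic. It is not the paper's simulation: the one-dimensional "distance" cost, the function names, and the tiny example topology are all assumptions, and it does not model the 20 candidates per cell.

```python
import random


def cell_cost(client_pos, nodes):
    """Locality cost of one cell: total distance from its client to its nodes."""
    return sum(abs(client_pos - n) for n in nodes)


def total_cost(clients, cells):
    """Sum of locality costs over all cells."""
    return sum(cell_cost(c, nodes) for c, nodes in zip(clients, cells))


def improve_placement(clients, cells, rounds, rng):
    """Swap random node pairs between random cells, keeping only improvements."""
    cells = [list(nodes) for nodes in cells]  # work on a copy
    for _ in range(rounds):
        a, b = rng.sample(range(len(cells)), 2)
        i = rng.randrange(len(cells[a]))
        j = rng.randrange(len(cells[b]))
        before = cell_cost(clients[a], cells[a]) + cell_cost(clients[b], cells[b])
        cells[a][i], cells[b][j] = cells[b][j], cells[a][i]
        after = cell_cost(clients[a], cells[a]) + cell_cost(clients[b], cells[b])
        if after >= before:
            # Not an improvement: undo the swap.
            cells[a][i], cells[b][j] = cells[b][j], cells[a][i]
    return cells


# Hypothetical setup: two clients at positions 0 and 100, nodes at integer positions.
clients = [0, 100]
start = [[90, 95, 5], [10, 100, 2]]
result = improve_placement(clients, start, rounds=200, rng=random.Random(0))
```

Because a swap is kept only when it strictly lowers the combined cost of the two cells, `total_cost` never goes up, and cell sizes never change. It can still get stuck in a local optimum, which fits the quote's point that this is cheap and rough rather than globally optimal.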

------
ignoramous
Abstract at:
[https://www.amazon.science/publications/millions-of-tiny-dat...](https://www.amazon.science/publications/millions-of-tiny-databases)

> _...Physalia is a transactional key-value store, optimized for use in large-
> scale cloud control planes, which takes advantage of knowledge of
> transaction patterns and infrastructure design to offer both high
> availability and strong consistency to millions of clients. Physalia uses
> its knowledge of data center topology to place data where it is most likely
> to be available. Instead of being highly available for all keys to all
> clients, Physalia focuses on being extremely available for only the keys it
> knows each client needs, from the perspective of that client._

> _...We believe that the same patterns, and approach to design, are widely
> applicable to distributed systems problems like control planes, configuration
> management, and service discovery._

It'd be interesting to contrast this approach with Route53's or IAM's
datastores, which need to be globally replicated with time-bounded eventually-
consistent reads, and transactional but verifiable writes.

I hope AWS begins publishing about S3, now. One can look at the patents AWS
engineers author to get a feel for some of the internals, but they are
(intentionally?) hard to read.

For instance, patents filed by two of the many S3 founding-engineers:
[https://patents.google.com/?inventor=James+Christopher+Soren...](https://patents.google.com/?inventor=James+Christopher+Sorenson%2c+III,Allan+H.+Vermeulen&oq=inventor:\(James+Christopher+Sorenson%2c+III\)+or+inventor:\(Allan+H.+Vermeulen\))

Also see:

[https://aws.amazon.com/builders-library/](https://aws.amazon.com/builders-library/)

[https://research.google/pubs/](https://research.google/pubs/)

~~~
mjb
It's not really about the design of S3, but if you're interested in some of
the philosophy and thinking behind S3 you might enjoy "Beyond eleven nines:
Lessons from Amazon S3 culture of durability"
[https://www.youtube.com/watch?v=DzRyrvUF-C0](https://www.youtube.com/watch?v=DzRyrvUF-C0)

------
vimota
Tangentially related, BigQuery uses a similar usage-based approach to place
and replicate data in a manner that's likely to be available for users:

[https://cloud.google.com/blog/products/data-analytics/how-bi...](https://cloud.google.com/blog/products/data-analytics/how-bigquery-zone-assignments-work)

