The whole setup seems a lot simpler than nearly every 'web scale' solution I've heard discussed.
EDIT: I'm not saying that's a bad thing - it's a good thing. Most of the articles about scaling on HN deal with very complex, Cassandra-style technical solutions. This article is the opposite, with a very 'just do it the simplest way' feel to it. One big MySQL table for all users (probably with some memcache in front) is the sort of thing you can put together in an hour or two - yet it is the very core of their system.
I (and I bet a lot of other engineers reading HN) try to focus too much on perfect implementations - if someone asked me to design the shard look-up for something as heavily hit as this, I'd probably reach for a distributed in-memory DB with broadcasted updates - and I'd probably be wrong to do so. The mysql/memcache approach is just going to be simpler to keep running.
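For the curious, a minimal sketch of what that MySQL/memcache look-up path could look like (the table, key names and TTL are my own guesses, not Pinterest's actual code):

    # Hypothetical sketch of the "one big table + memcache" shard look-up.
    # Table, key names and TTL are guesses, not Pinterest's actual code.
    import memcache
    import MySQLdb

    mc = memcache.Client(["127.0.0.1:11211"])
    db = MySQLdb.connect(host="users-db", user="app", passwd="secret", db="users")

    def shard_for_user(user_id):
        key = "user_shard:%d" % user_id
        shard = mc.get(key)
        if shard is None:
            cur = db.cursor()
            cur.execute("SELECT shard_id FROM users WHERE id = %s", (user_id,))
            row = cur.fetchone()
            if row is None:
                return None                    # unknown user
            shard = row[0]
            mc.set(key, shard, time=3600)      # hot mappings stay cached for an hour
        return shard

Cache misses fall through to the single big table, and hot entries stay in memcache, so the database mostly only sees look-ups it hasn't answered recently.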
The alternatives aren't nice.
I think, of the existing solutions, one is not better than the other. The choice in the end is which one fits best with your team and infrastructure, and especially which disadvantages you're willing to put up with.
And that's just not a very big table these days.
User table is unsharded. They just use a big database and have a uniqueness constraint on user name and email. Inserting a User will fail if it isn’t unique. Then they do a lot of writes in their sharded database.
Get the user name from the URL. Go to the single huge database to find the user ID.
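A rough sketch of how that could look in practice (the schema and column names here are assumptions, not their actual code):

    # Hypothetical sketch: let MySQL's unique indexes reject duplicate
    # usernames/emails rather than checking first. Schema/names are my own.
    import MySQLdb

    db = MySQLdb.connect(host="users-db", user="app", passwd="secret", db="users")

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS users (
        id       BIGINT AUTO_INCREMENT PRIMARY KEY,
        username VARCHAR(64)  NOT NULL,
        email    VARCHAR(255) NOT NULL,
        shard_id INT          NOT NULL,
        UNIQUE KEY uniq_username (username),
        UNIQUE KEY uniq_email (email)
    )
    """

    def create_user(username, email, shard_id):
        cur = db.cursor()
        try:
            cur.execute(
                "INSERT INTO users (username, email, shard_id) VALUES (%s, %s, %s)",
                (username, email, shard_id),
            )
            db.commit()
            return cur.lastrowid          # the new user id
        except MySQLdb.IntegrityError:
            db.rollback()
            return None                   # username or email already taken

    def user_id_for(username):
        # "Get the user name from the URL. Go to the single huge database
        # to find the user ID."
        cur = db.cursor()
        cur.execute("SELECT id FROM users WHERE username = %s", (username,))
        row = cur.fetchone()
        return row[0] if row else None

The point is that the database's unique indexes do the de-duplication; the application just attempts the insert and handles the failure.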
I believe the parent is correct.
They wouldn't need Memcached, since the OS caches files anyway; replication with rsync becomes easy; they don't need transactions anyway; and they wouldn't have so much software to manage.
Isn't storing, tracking, and managing billions of entries exactly what a database does anyway?
Try taking 300,000 files and copying them somewhere. Then copy 1 file which has the size of the 300,000 combined. The single file is MUCH faster (it's also why we usually do a tar operation before copying stuff, even if it's already compressed). Any database that's not a toy will usually lay the 300,000 records out in a single file (depending on settings, sizes and filesystem limits).
The 300,000 files end up sitting all over the drive, and disk seeks kill you at run-time. This may not be true for an SSD, but I don't have any evidence to suggest it one way or the other.
Even if the physical storage is fine with this I suspect you may run into filesystem issues when you lay out millions if not hundreds of millions of files over a directory and then hit it hard.
I have played with 1,000,000 files before when experimenting with crawling/indexing things, and it becomes a real management pain. It may seem cleaner to lay each record out as a single file, but in the long run, if you hit a large size, the benefits aren't worth it.
It doesn't have good querying utilities. You'd have to build your own indexer and query engine. Since you can't put billions of files in a single directory, you'd have to split them into a directory tree. That alone requires some basic indexing functionality and rebalancing tools (in case a single directory grows too large or too small). This is without any more sophisticated querying capabilities like “where X=val” where X isn't the object ID/filename.
Write performance is going to be very very horrible.
Existing tools for managing, backing up, restoring, performance optimisation and monitoring aren't suited to handling a huge number of small files, or to working with just a subset of them (given certain criteria related to the data itself).
You could build specialized tools to resolve all of these issues, but in the end, you'd end up with some kind of database after hundreds of man-years anyway.
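To make the directory-tree point concrete, here's a rough sketch of the kind of home-grown layout you'd end up writing (the paths and fan-out are made up):

    # Rough sketch of the home-grown "database" you end up writing if each
    # object is its own file. Paths and fan-out are made up.
    import hashlib
    import os

    ROOT = "/data/objects"  # hypothetical storage root

    def path_for(object_id):
        digest = hashlib.sha1(object_id.encode("utf-8")).hexdigest()
        # Two levels of hash-prefix fan-out so no single directory gets huge:
        # /data/objects/ab/cd/<object_id>
        return os.path.join(ROOT, digest[:2], digest[2:4], object_id)

    def put(object_id, payload):
        path = path_for(object_id)
        directory = os.path.dirname(path)
        if not os.path.isdir(directory):
            os.makedirs(directory)
        with open(path, "wb") as f:
            f.write(payload)

    def get(object_id):
        with open(path_for(object_id), "rb") as f:
            return f.read()

    # Even with this you still have no secondary indexes, no "WHERE X = val",
    # no atomic multi-object updates - the things a database gives you for free.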
My understanding is that the term of art is horizontal scalability.
I don't quite understand how it reduces connections. If you have your front-end tier connecting to your service tier, you'll still have the same number of requests going to the database, just with a little middle-man... if I understand correctly?
Looking at it another way, when I talk to the GitHub API, I have no idea what sort of database I'm ultimately talking to. Their interface abstracts away all of that complexity. Each request they receive from me might be handled by a different machine for all I know. It makes no difference.
Additionally, a services layer forms an interface between the client and the service provider, which makes it easier to scale.
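As a toy illustration of that kind of interface (entirely hypothetical, not anyone's actual code):

    # Toy sketch of a services-layer interface: callers only ever see
    # UserService; the backing store can change underneath it.
    class MySQLUserStore(object):
        def __init__(self, db):
            self._db = db

        def get_by_username(self, username):
            cur = self._db.cursor()
            cur.execute("SELECT id, username, email FROM users WHERE username = %s",
                        (username,))
            return cur.fetchone()

    class UserService(object):
        def __init__(self, store):
            self._store = store  # anything with get_by_username()

        def get_user(self, username):
            return self._store.get_by_username(username)

    # Swapping MySQLUserStore for a cached, sharded or remote implementation
    # doesn't change any calling code - that's what makes it easier to scale.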
We select a shard randomly for new users only. Of course we capacity plan this, open new shards, and even out users eventually.
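A minimal sketch of that placement policy (the shard ids here are made up):

    # Minimal sketch: new users land on a randomly chosen "open" shard;
    # the pool of open shards grows as part of capacity planning.
    import random

    OPEN_SHARDS = [12, 13, 14, 15]  # made-up ids of shards accepting new users

    def pick_shard_for_new_user():
        # Only new sign-ups roll the dice; existing users keep their shard.
        return random.choice(OPEN_SHARDS)

    def open_new_shard(shard_id):
        # Capacity-planning step: add a freshly provisioned shard to the pool.
        OPEN_SHARDS.append(shard_id)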
This is probably the best benchmark of Cassandra/MongoDB/Couchbase: http://www.slideshare.net/renatko/couchbase-performance-benc...
If you want a scientific, independent, peer-reviewed NoSQL benchmark, read this:
Cassandra is a clear winner here.
That paper is very useful so thanks for posting the link but it has a number of issues as I see it.
1) It considers Cassandra, Redis, VoltDB, Voldemort, HBase and MySQL. It does not cover either MongoDB or Couchbase.
2) Latency values are given as average and do not show p95/p99. In my experience, Cassandra in particular is susceptible to high latency at these values.
3) Even considering average values, the read latency of Cassandra is higher than you would see with either MongoDB or Couchbase.
4) Cassandra does not deal well with ephemeral data. There are issues while GC'ing a large number of tombstones, for example, that will hurt a long-running system.
The long and short of it is that Cassandra is a fantastic system for write-heavy situations. What it is not good at are read-heavy situations where deterministic low latency is required, which is pretty much what the Pinterest guys were dealing with.
Another reason it is marketing is that it lacks essential information on the setup of each benchmarked system. E.g. for Cassandra, I don't even know which version they used, what the replication factor was, what consistency level they read data at, whether they enabled the row cache (which decreases latency a lot), etc. Cassandra has improved read throughput and latency by a huge factor since version 0.6 and is constantly improving, so the version really matters.
The p95 latency issues were largely caused by GC pressure from having a large amount of relatively static data on-heap. In 1.2, the two largest of these (bloom filters and compression data) were moved off-heap. It's my experience that with 1.2, most of the p95 latency is now caused by network and/or disk latency, as it should be.
I'm not going to compare it with other data stores in this comment, but I'd encourage people to consider that Cassandra is designed for durable persistence and larger-than-RAM datasets.
As for #4, this is mostly false. Tombstones (markers for deleted rows/columns) CAN cause issues with read performance, but "issues while GC'ing a large number of tombstones" is a bit of a hand-wavey statement. The situation in which poor performance results from tombstone pile-up is when you have rows where columns are constantly inserted and then removed before GC grace (10 days). Tombstones sit around until GC grace, so effectively consider data you insert to live for at least 10 days, unless of course you do something about it.
Usually people just tune the GC grace, as it's extremely conservative. It's also much better to use row-level deletes if possible. If the data is time-ordered and needs to be trimmed, a row-level delete with the timestamp of the trim point can improve performance dramatically. This is because a row-level tombstone will cause reads to skip any SSTables with max_timestamp < the tombstone. It also means compaction will quickly obsolete any superseded row-level tombstones.
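For example, with the DataStax Python driver, a trim like that might look roughly like this (keyspace, table and column names are made up; the point is just the row-level DELETE ... USING TIMESTAMP):

    # Rough sketch of the row-level "trim" delete described above, using the
    # DataStax Python driver. Keyspace, table and column names are made up.
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect("timelines")

    def trim_row(row_key, trim_point_micros):
        # A row-level tombstone at the trim point shadows every column written
        # before it, and lets reads skip SSTables whose max timestamp is older.
        session.execute(
            "DELETE FROM events USING TIMESTAMP %s WHERE row_key = %s",
            (trim_point_micros, row_key),
        )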
Here's a graph of p99 latency as observed from the application for wide-row reads (involving ~60 columns on average, CL.ONE) from a real 12-node hi1.4xlarge Cassandra 1.2.3 cluster running across 3 EC2 availability zones. The p99 RTT between these hosts is ~2ms.
This also happens to be on data that is "ephemeral" as our goal is to keep it bounded at ~100 columns. The read:write ratio is about even. It has a mix of row and column-level deletes, LeveledCompactionStrategy, and the standard 10 day GC grace.
DataStax Enterprise is a different product than Cassandra. Cassandra is one of its components, but there are more things bundled, e.g. Solr and Hadoop.
Pyres looks like a good alternative for Celery in the Python world when you don't need Celery's cron-like features.
thanks for posting!