What you're describing (uploading photos + storing metadata) sounds like something Facebook has given tech talks about at length, at multiple venues. Their solution was to use a distributed FS for images (such as HDFS, though FB uses their internal "Haystack") and then use HBase for the metadata. To be honest, your solution, while it works now, looks like a weak home-grown HBase, but leveraging PL/PGSQL for unique IDs. Why not go the snowflake+hbase route? While it may add ops complexity, it is a fairly battle-proven stack, and JVM ops is pretty well documented.
Or, if you insist on using an RDBMS for metadata, why not just throw money (and not that much) at the problem and buy an SSD for your DB? Increase your IOPS from 100 or 200 up to 30,000 or 40,000 with a cheap drive, and call it a day. Surely this would be less expensive than the engineering effort that went into (and will continue to go into) this project. This has the added benefit of having no impact on ops complexity and should scale to quite a staggering number of QPS.
We're on EC2, which has its set of limitations but means we can run a 10-million-plus-user system with two and a half engineers (and no ops team / overhead). So while we hear about more and more folks putting SSDs in their DBs, it's not an option in our near-term future.
For SQL vs HBase/Haystack, we don't really have to worry about the photo storage itself, since S3 handles all of it. The data we shard out is more suited to an RDBMS, and since we're way more familiar with that world than with HBase and similar, it was the choice that let us make the most progress in a short time with a small team. Hope that's a helpful description of how we thought about it.
So, if one day you want to shard the users table, it will render all current sharding useless, right?
We're generating more GUIDs than ever with this system and those boxes are more or less idle on every metric. They're right in that we don't meet their time-ordered requirement, but I just wanted to say that writing (or reading) is not a bottleneck.
I like that you guys are using your database's stock features to accomplish this. Most of the time you don't need complicated systems to get things done. Reducing the mental overhead of YET another system is huge.
The database ran on a machine with enough RAM to let SQL Server do its thing and cache almost the whole table in memory. At the time I left the team in 2005, it was executing over 25,000 requests per second, with an average latency of under 3ms. Pretty sure that on modern hardware it would handle much, much more.
Edit: I like how constructive criticism got downvoted. Thanks for discouraging technical discussion.
They've defined 2^13 virtual shards, which is fine-grained enough that they can move entire shards between physical servers to eliminate hot spots.
That feels flaky to me. It's a nice techy solution, and the write-up was interesting, but there are definitely a few weak points in their implementation.
Their computers have to be absolutely in sync for time, their ID space is entirely guessable once you know the algorithm, and they have effectively hard-coded their current architecture into their software.
Edit: you don't need an object id / shard id mapping. Each time you need to locate a shard for an object, you take the last few bits of the object id modulo the number of shards to get your shard id.
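To make that concrete, here's a small sketch of recovering a shard ID from an object ID alone, assuming the bit layout from the Instagram post (41 bits of timestamp, 13 bits of shard ID, 10 bits of per-shard sequence); the specific values are illustrative:

```python
# Assumed Instagram-style layout: [41 bits time][13 bits shard][10 bits sequence]
SHARD_ID_BITS = 13
SEQUENCE_BITS = 10

def shard_for(object_id: int) -> int:
    """Pull the embedded shard ID back out of an object ID."""
    return (object_id >> SEQUENCE_BITS) & ((1 << SHARD_ID_BITS) - 1)

# Round-trip check: build an ID, then recover its shard.
ts, shard, seq = 1387263000, 1341, 92
object_id = (ts << (SHARD_ID_BITS + SEQUENCE_BITS)) | (shard << SEQUENCE_BITS) | seq
assert shard_for(object_id) == 1341
```

Since the shard ID is embedded in the object ID itself, no lookup table is needed, which is the point being made above.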
The way we move shards is to use PostgreSQL's built-in streaming replication to create an exact, in-sync copy of a set of tablespaces, then 'fail over' to a new machine and start reading/writing to a subset of those tablespaces (this is similar to how Facebook describes their shard-moving process).
Right. I'm just sayin' that you have to be careful when you move data with a caveat like that. Moving the shard keyspace (the 13 bits) to a new machine that started generating IDs even one second behind (the first 4 bytes) would be troublesome, no?
They should definitely be monitoring their clocks though :)
One thing that bothered me was, "Let’s walk through an example: let’s say it’s September 9th, 2011, at 5:00pm and our ‘epoch’ begins on January 1st, 2011. There have been 1387263000 milliseconds since the beginning of our epoch..."
The number of seconds between 5pm one day and midnight another obviously doesn't end in 3, so I looked into it. It looks like that number is taken from the epoch that is used later in the SQL which starts on Aug 24th.
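You can check this arithmetic directly (a quick verification sketch; the dates are from the quoted example):

```python
from datetime import datetime, timedelta

target = datetime(2011, 9, 9, 17, 0, 0)

def ms_between(epoch: datetime, moment: datetime) -> int:
    return int((moment - epoch).total_seconds() * 1000)

# Milliseconds since a Jan 1st, 2011 epoch: nowhere near the quoted number.
jan_epoch = datetime(2011, 1, 1)
print(ms_between(jan_epoch, target))   # 21747600000

# Working backwards from the quoted 1387263000 ms lands the epoch on Aug 24th.
implied_epoch = target - timedelta(milliseconds=1387263000)
print(implied_epoch)                   # 2011-08-24 15:38:57
```

So the prose example and the epoch used in the SQL really are inconsistent with each other.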
What if you have two tables with the same autoincrement value being updated at the same millisecond by two users with the same UserID%NumShards?
Or is there some relationship between physical and logical shards that makes this impossible?
There's a 1:1 mapping between the user's hash modulo the number of shards, and the table they're writing to. So, if we had 1000 logical shards, we have 1000 schemas/tablespaces, with the same tables in each. And the database's own 'nextval()' feature makes sure we never have the same ID twice. Hope that clarified things.
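A toy sketch (not Instagram's actual PL/pgSQL) of why two inserts in the same millisecond can't collide: both the shard ID and the shard-local sequence value end up inside the final 64-bit ID, so they disambiguate:

```python
# Assumed layout: [timestamp][13 bits shard][10 bits sequence]
SHARD_BITS, SEQ_BITS = 13, 10

def make_id(ms_since_epoch: int, shard_id: int, seq_val: int) -> int:
    return (ms_since_epoch << (SHARD_BITS + SEQ_BITS)) \
         | (shard_id << SEQ_BITS) \
         | (seq_val % (1 << SEQ_BITS))

# Same millisecond, same sequence value, different shards -> different IDs.
a = make_id(1387263000, shard_id=5, seq_val=42)
b = make_id(1387263000, shard_id=9, seq_val=42)
assert a != b

# Same millisecond, same shard -> nextval() hands out differing seq_vals.
c = make_id(1387263000, shard_id=5, seq_val=43)
assert a != c
```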
Does anybody know of a database product / solution that will handle hundreds of thousands of shards? I don't have a need for joins across shards; each user account only uses its own data. And data and access within a shard would be very small, so a SQLite instance for every user would be a theoretical solution. The idea would be to have one (in-memory) table to connect the app to the correct shard for the user and then get "perfect" horizontal scaling. Is anybody doing this?
btw, I'm also a Brazilian living in San Francisco (https://twitter.com/#!/artilheiro)
Something still puzzles me though. You wrote a sequence is unique to each table in each schema. However, your example seems to use the same sequence name for all tables within a schema.
Also, shouldn't your mod operation have only one %?
In the 18 months this system has been running, we have never experienced any issues.
You could avoid having to create a per-schema id function by passing the shard id and the sequence oid as params to the function, so you'd have a single generic function shared by every schema.
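In Python terms, the idea looks roughly like this (a hedged sketch of the suggestion, not the actual PL/pgSQL; the epoch constant and names are illustrative):

```python
import itertools

EPOCH_MS = 1314220021721  # illustrative custom epoch (falls on Aug 24th, 2011)

def next_id(now_ms: int, shard_id: int, nextval) -> int:
    """One generic ID builder: shard and sequence source arrive as parameters
    instead of being baked into a per-schema copy of the function."""
    seq = nextval() % 1024                 # keep 10 bits of the sequence
    return ((now_ms - EPOCH_MS) << 23) | (shard_id << 10) | seq

counter = itertools.count(1)
print(next_id(1316181593000, shard_id=5, nextval=lambda: next(counter)))
```

The trade-off is that every caller must now supply the right shard/sequence pair, whereas the per-schema function can't be called with the wrong one.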
By the way we use & love Sentry!
Or I might be misunderstanding, and you're saying that you're just mapping which physical server has the schemas:
user_id % 1000 -> schema ID
schema ID -> database ID
The schemas are definitely a neat way to handle this though, so you don't have to worry about table names.
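The two-level mapping described above can be sketched like this (all names and counts are illustrative); the point is that moving a schema between machines only touches the second map:

```python
NUM_SCHEMAS = 1000

# schema ID -> physical database; pretend we have 4 physical DBs.
schema_to_db = {s: "db%d" % (s % 4) for s in range(NUM_SCHEMAS)}

def locate(user_id: int):
    """user_id -> (logical schema, physical database)."""
    schema_id = user_id % NUM_SCHEMAS
    return schema_id, schema_to_db[schema_id]

print(locate(31337))   # (337, 'db1')

# Rebalancing = editing the second map only; user IDs never change.
schema_to_db[337] = "db3"
assert locate(31337) == (337, "db3")
```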
I couldn't even imagine having to _think_ about this sort of thing; CouchDB makes this an absolute no-brainer. I mean, the act of creating a document assigns it a UUID in the database _by_ _default_.
Or, do you want to fetch a UUID before assigning it to any data? localhost:5984/_uuids
Want to fetch 10? localhost:5984/_uuids?count=10
Want to fetch a million? localhost:5984/_uuids?count=1000000
Instagram seems like the absolutely perfect candidate for CouchDB -- unstructured data, speed, HTTP querying, big data, attachments...
Also, we'd still have to write the middleware to assign data to shards and fetch data from shards in our system (unless we used something like BigCouch), so having a more tried-and-tested solution that we already understood well was more appealing.
Come on, 25 pics + 90 likes per second... that's almost like... [wait for it]... nothing.
I'm pretty sure you got your number wrong.
1. Define your keyspace in bits.
2. Shard your keyspace in CIDR notation.
3. Resharding always splits a keyspace in half, e.g. 0.0.0.0/0 becomes 0.0.0.0/1 and 128.0.0.0/1.
4. Assign IDs randomly.
This gives a couple of interesting properties. For replication, odds are very high that your corresponding server has a shard of similar size, and if not, it's generally only split over two servers. Servers also only need to speak directly to their one or two replicas in the other data center. The other really nice property, if you're using spindle disks, is that you can copy the entire HD and just delete the half of the keyspace not used after the split.
It's all written up here if you want to dig in, with a full description of all the pieces:
Luckily for the rest of us, there is not just one way to do sharding (in the original version of your post you assertively referred to yours as "how to do sharding right").
It depends on the individual architecture and the tradeoffs people consider important. That means you don't get to have an undeserved but legal monopoly holding the entire scalable web hostage, just a part of it. Nevertheless: you should be ashamed of yourself.