
Sharding and IDs at Instagram - sassyboy
http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
======
mitchellh
Cool technology, good explanation. Legitimate questions below.

What you're describing (uploading photos + storing metadata) sounds like
something Facebook has given tech talks about at length at multiple venues.
Their solution was to use a distributed FS for images (such as HDFS, though FB
uses its internal "Haystack") and then use HBase for the metadata. To be
honest, your solution, while it works now, looks like a weak home-grown HBase
that leverages PL/pgSQL for unique IDs. Why not go the snowflake+HBase route?
While it may add ops complexity, it is a fairly battle-proven stack, and JVM
ops is pretty well documented.

Or, if you insist on using an RDBMS for metadata, why not just throw money
(and not that much) at the problem and buy an SSD for your DB? Increase your
iops from 100 or 200 up to 30,000 or 40,000 with a cheap drive, and call it a
day. Surely this would be less expensive than the engineering effort that went
into (and will continue to go into) this project. This has the added benefit
of having no impact on Ops complexity and should scale to quite a staggering
number of QPS.

Thanks!

~~~
mikeyk
Valid questions!

We're on EC2, which has its own set of limitations but means we can run a
10-million-plus-user system with two-and-a-half engineers (and no ops team /
overhead). So while we hear about more and more folks using SSDs in their DBs,
it's not an option in our near-term future.

For SQL vs HBase/Haystack, we don't really have to worry about the photo
storage itself, since S3 handles all of it. The data we shard out is more
suited to an RDBMS, and since we're way more familiar with that world than
with HBase and similar, it was the choice that let us make the most progress
in a short time with a small team. Hope that's a helpful description of how we
thought about it.

~~~
siong1987
From what I understand from the post, you are actually sharding other tables
like photos, likes, and comments, but not the main users table, right?

So, if one day you want to shard the users table, it will render all current
sharding useless, right?

------
mncaudill
I work at Flickr, and I see they mentioned Flickr's ticket server idea, which
(ab)uses MySQL's autoincrement and the "REPLACE INTO" trick, and noted that a
con was the write bottleneck.

We're generating more GUIDs than ever with this system and those boxes are
more or less idle on every metric. They're right in that we don't meet their
time-ordered requirement, but I just wanted to say that writing (or reading)
is not a bottleneck.
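
For anyone unfamiliar with the trick, the pattern is roughly this (a minimal
sketch; the table definition, driver, and connection details are illustrative,
not our actual setup):

    # Minimal sketch of a MySQL ticket server (table and column names are illustrative):
    #
    #   CREATE TABLE tickets64 (
    #     id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    #     stub CHAR(1) NOT NULL DEFAULT '',
    #     PRIMARY KEY (id),
    #     UNIQUE KEY (stub)
    #   ) ENGINE=InnoDB;
    #
    # REPLACE INTO keeps the table at a single row while bumping the
    # auto_increment counter, so every call hands out a fresh 64-bit ID.
    import mysql.connector  # assumed client library

    def next_ticket(conn):
        cur = conn.cursor()
        cur.execute("REPLACE INTO tickets64 (stub) VALUES ('a')")
        conn.commit()
        return cur.lastrowid  # the freshly generated auto_increment value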

~~~
mikeyk
Thanks for the info, I updated the post to mention that the write bottleneck
isn't an issue at Flickr.

~~~
mncaudill
Thanks!

I like that you guys are using your database's stock features to accomplish
this. Most of the time you don't need complicated systems to get things done.
Reducing the mental overhead of YET another system is huge.

------
ww520
I think it's a mistake to tie the shard id into the object id. The shard id
should be derived from the object id dynamically, based on the placement of
the object among the shards. If the shards grow or shrink, or an object is
migrated, a different shard id is generated, but the object id doesn't have to
change.

Edit: I like how constructive criticism got downvoted. Thanks for discouraging
technical discussion.

~~~
underwater
If you do that then you need to keep a mapping of object IDs to shards. You
can't store it on the application servers because of race conditions and
memory constraints, so each object lookup will require two network trips.

They've defined 2^13 virtual shards, which is fine-grained enough that they
can move entire shards between physical servers to eliminate hot spots.

~~~
bandushrew
I hear what you are saying, but what they have done is derive their IDs based
partially on their current choice of architecture. If, for any reason, they
would like to change their architecture later, suddenly they have a real
problem.

That feels flaky to me. It's a nice techy solution, and the write-up was
interesting, but there are definitely a few weak points in their
implementation: their computers have to be absolutely in sync for time, their
ID space is entirely guessable once you know the algorithm, and they have
effectively hard-coded their current architecture into their software.

------
seattlematt
Back in 2003 I was a developer on a team at Microsoft that was responsible for
building a new storage backend for all the MSN Messenger and Hotmail contact
lists. This store had to hold data for close to 300 million user accounts,
which we sharded out to a few hundred SQL Server databases. Our sharding
system consisted of a database with a single table with a row for each user
that mapped their 128-bit GUID user ID to the ID of their assigned shard. New
user creation involved generating a new GUID and inserting it into this table.
Each read operation involved a select from this table.

The database ran on a machine with enough RAM to let SQL Server do its thing
and cache almost the whole table in memory. At the time I left the team in
2005, it was executing over 25,000 requests per second, with an average
latency of under 3ms. Pretty sure that on modern hardware it would handle
much, much more.
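
A rough sketch of that directory-based approach (table, column, and helper
names here are made up for illustration, not the actual MSN schema):

    # Directory-based sharding sketch: one small, heavily cached table maps
    # each user's GUID to its shard (names are illustrative).
    import random
    import uuid

    LOOKUP_SQL = "SELECT shard_id FROM UserShardMap WHERE user_id = %s"
    INSERT_SQL = "INSERT INTO UserShardMap (user_id, shard_id) VALUES (%s, %s)"

    def create_user(directory_conn, num_shards):
        user_id = uuid.uuid4()                   # new 128-bit GUID
        shard_id = random.randrange(num_shards)  # stand-in for a real placement policy
        directory_conn.cursor().execute(INSERT_SQL, (str(user_id), shard_id))
        return user_id, shard_id

    def shard_for(directory_conn, user_id):
        cur = directory_conn.cursor()
        cur.execute(LOOKUP_SQL, (str(user_id),))
        return cur.fetchone()[0]                 # one extra round trip per read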

~~~
revorad
Can you share some more details? How many shards did you have? Did it all run
on a single machine? How much RAM did it have?

------
jconley
I've used the "snowflake"-like approach in the past with great success. It's
really not all that complicated. The reliance on time in the Instagram
approach is a bit scary. A few ms off here and there could really hurt this
scheme. How do you handle seamlessly transitioning these across machines when
your shards move?

~~~
stevefink
Not really. It's like MongoDB, where the first four bytes of an ObjectId are a
timestamp. Whether the timestamp is synced with other instances isn't
paramount, because the actual value of the timestamp does not matter. What
does matter is that the timestamp is new and increasing every second. This is
to retain sorting capabilities. With the 13 bits that represent the logical
shard ID from this article, Instagram can guarantee uniqueness of an ID within
the granularity of a second.
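
A rough sketch of that 64-bit layout as described in the article (41 bits of
milliseconds since a custom epoch, 13 bits of logical shard ID, 10 bits of a
per-shard sequence); the epoch constant below is illustrative, not the
article's actual value:

    # Sketch of the 64-bit ID layout: 41 bits of time, 13 bits of shard ID,
    # 10 bits of sequence (auto-increment mod 1024).
    import time

    CUSTOM_EPOCH_MS = 1293840000000  # illustrative: Jan 1, 2011 UTC

    def make_id(shard_id, seq, now_ms=None):
        if now_ms is None:
            now_ms = int(time.time() * 1000)
        elapsed = now_ms - CUSTOM_EPOCH_MS
        return (elapsed << 23) | ((shard_id % 8192) << 10) | (seq % 1024)

    def shard_of(some_id):
        # The shard is recoverable from the ID itself -- no lookup table needed.
        return (some_id >> 10) & 0x1FFF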

~~~
jconley
"What does matter is that the timestamp is new and increasing every second."

Right. I'm just sayin' that you have to be careful when you move data with a
caveat like that. Moving the shard keyspace (the 13 bits) to a new machine
that started generating IDs even one second behind (the first 4 bytes) would
be troublesome, no?

~~~
mikeyk
Yep--definitely something to watch out for. At worst, though, you'd have a
duplicate key when trying to insert, and can re-try without the risk of having
a duplicate ID floating around your system.

------
fratrik
Interesting article. It seems like a pretty good solution to the problem,
especially when judged on ease of maintenance.

One thing that bothered me was, "Let’s walk through an example: let’s say it’s
September 9th, 2011, at 5:00pm and our ‘epoch’ begins on January 1st, 2011.
There have been 1387263000 milliseconds since the beginning of our epoch..."

The number of seconds between 5pm on one day and midnight on another obviously
can't end in 3 (it should be a whole number of hours), so I looked into it. It
looks like that number is taken from the epoch that is used later in the SQL,
which starts on Aug 24th.
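
A quick sanity check of the arithmetic, using the dates from the example:

    # Quick check of the figures in the post's example.
    from datetime import datetime

    jan1 = datetime(2011, 1, 1)
    sep9_5pm = datetime(2011, 9, 9, 17, 0, 0)

    print((sep9_5pm - jan1).total_seconds() * 1000)  # 21747600000.0 -- not 1387263000
    print(1387263000 / 86400000.0)                   # ~16.06 days, i.e. an epoch in late August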

------
kenjackson
I'm sure I missed something, but this doesn't guarantee unique keys across the
database does it?

What if you have two tables with the same autoincrement value being updated at
the same millisecond by two users with the same UserID%NumShards?

Or is there some relationship between physical and logical shards that makes
this impossible?

~~~
mikeyk
Hey Ken, (author here)

There's a 1:1 mapping between the user's hash modulo the number of shards and
the table they're writing to. So, if we have 1000 logical shards, we have 1000
schemas/tablespaces, with the same tables in each. And the database's own
'nextval()' feature makes sure we never have the same ID twice. Hope that
clarified things.
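
A minimal sketch of that layout (the schema/table names, sequence name, and
connection string here are illustrative, not our actual DDL):

    # Minimal sketch of the schema-per-logical-shard layout.
    import psycopg2  # assumed PostgreSQL driver

    conn = psycopg2.connect("dbname=instagram")  # illustrative DSN
    cur = conn.cursor()

    NUM_LOGICAL_SHARDS = 1000
    for shard in range(NUM_LOGICAL_SHARDS):
        cur.execute("CREATE SCHEMA insta%d" % shard)
        # The same tables exist in every schema, each with its own sequence,
        # so nextval() never hands out the same value twice within a shard.
        cur.execute("CREATE SEQUENCE insta%d.table_id_seq" % shard)
        cur.execute(
            "CREATE TABLE insta%d.photos ("
            " id bigint NOT NULL PRIMARY KEY,"
            " user_id bigint NOT NULL)" % shard
        )
    conn.commit()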

~~~
speleding
I understand 1000 logical shards is probably a good number to use with
PostgreSQL. I'm currently looking at doing something like this myself, but I
was thinking that it would be so much more elegant if each user could live in
his own shard.

Does anybody know of a database product / solution that will handle hundreds
of thousands of shards? I don't have a need for joins across shards; each user
account only uses its own data. The data and access within a shard would be
very small, so an SQLite instance for every user would be a theoretical
solution. The idea would be to have one (in-memory) table to connect the app
to the correct shard for the user and then get "perfect" horizontal scaling.
Is anybody doing this?

------
santosh79
Did you evaluate using something like Redis for this? It's got an atomic
increment command that guarantees unique IDs, and the performance is stupid
fast.
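
For reference, the pattern is just this (a minimal sketch using the redis-py
client; the key name is made up):

    # Minimal sketch of ID generation with Redis's atomic INCR.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def next_id():
        # INCR is atomic, so concurrent callers always receive distinct values.
        return r.incr("global:next_photo_id")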

~~~
mikeyk
We love Redis at Instagram (it powers a lot of our systems...more write-ups on
these soon), and considered it for our ID generation, but it would have
introduced a single point of failure, unless we split the load between several
Redis instances, at which point it would be hard to make the IDs time-
sortable. Also, most of our Redis systems are durable within a minute (we
write to disk on a slave every minute), but if we were to lose the master and
slave simultaneously (imagine an EC2 network issue or such), then it would be
hard to know what the last known 'good' ID was--not an insurmountable problem,
but one more moving part to worry about. That said, using Redis with something
like the upcoming Redis Cluster could be a good choice.

------
artilheiro
I like it. Question: did you consider using composite keys of shard_id and id
to make up a single primary key? If so, what pros/cons did you find with that
approach?

btw, I'm also a Brazilian living in San Francisco
(<https://twitter.com/#!/artilheiro>)

~~~
mikeyk
We considered the composite key, but we often have to store keys in other
systems like Redis, where having a single 64-bit integer makes them more
portable and compactly stored (Redis, for example, has an optimization when
storing integer values vs string values in its lists). Thanks for the comment!

------
pabitas
Hi, very helpful post. Thank you.

Something still puzzles me though. You wrote a sequence is unique to each
table in each schema. However, your example seems to use the same sequence
name for all tables within a schema.

Also, shouldn't your mod operation have only one %?

Thanks again!

------
ryansking
One nit about snowflake in your article (I'm the author of snowflake)--the
ZooKeeper integration is optional and only used for sanity-checking the
configuration (that the worker IDs are distinct, etc.).

------
iampims
We use a solution similar to snowflake at Formspring:

<https://github.com/formspring/flake>

In the 18 months this system has been running, we have never experienced any
issues.

------
andrewdunstan
Neat.

You could avoid having to create a per-schema id function by passing the shard
id and the sequence oid as params to the function, so the column default would
be public.next_id(5, 'insta5.table_id_seq'::regclass).

------
sean9
Hi Mike, excellent article! One question about sharding: do you find that it
reduces high availability overall, as your uptime depends on the vagaries of
additional database servers?

------
zeeg
What I'd be more curious to hear about is how they deal with super nodes, and
whether they're storing a map (or any kind of routing table) or simply using
generic modulo hashing on a key.

~~~
mikeyk
Right now, it's a lookup dict in our Django app--which involves brief downtime
just to update the shard map when moving the data (more on this in another
post). We hash on user ID (in most cases), look up the shard # in the dict,
then look up that shard # in a logical-to-physical dict.

By the way, we use & love Sentry!

~~~
zeeg
Ah, so you don't actually have 1000 (or whatever) schemas set up right off the
bat?

Or I might be misunderstanding, and you're saying that you're just mapping
which physical server has the schemas.

~~~
mikeyk
We set up the schemas ahead of time, and then each lookup is (with, say, 1000
schemas):

    
    
      user_id % 1000 -> schema ID
      schema ID -> database ID
    

then select FROM schemaID.tablename etc on that particular database.
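
In code, the routing boils down to something like this (a rough sketch; the
dict contents, schema names, and DSNs are illustrative):

    # Rough sketch of the two-step lookup.
    import psycopg2

    NUM_SCHEMAS = 1000
    # schema ID -> database; edited (with a brief pause) when shards are moved
    SCHEMA_TO_DB = {n: "host=db%d dbname=instagram" % (n % 4) for n in range(NUM_SCHEMAS)}

    def photo_ids_for(user_id):
        schema_id = user_id % NUM_SCHEMAS                 # user_id % 1000 -> schema ID
        conn = psycopg2.connect(SCHEMA_TO_DB[schema_id])  # schema ID -> database
        cur = conn.cursor()
        cur.execute(
            "SELECT id FROM insta%d.photos WHERE user_id = %%s" % schema_id,
            (user_id,),
        )
        return cur.fetchall()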

~~~
zeeg
Ah ok, that's what I figured. We do a lot of the same things right now with
various DBs (we just describe it as pre-sharding so we can avoid the actual
re-sharding dilemma). We've had to do this with both PostgreSQL and Redis now.

The schemas are definitely a neat way to handle this, though, so you don't
have to worry about table names.

------
mark242
Holy god. The post mentions "We evaluated a few different NoSQL solutions, but
(...)".

I couldn't even imagine having to _think_ about this sort of thing; CouchDB
makes this an absolute no-brainer. I mean, the act of creating a document
assigns it a UUID in the database _by_ _default_.

Or, do you want to fetch a UUID before assigning it to any data?
localhost:5984/_uuids -- Want to fetch 10? localhost:5984/_uuids?count=10 --
Want to fetch a million? localhost:5984/_uuids?count=1000000

Instagram seems like the absolutely perfect candidate for CouchDB --
unstructured data, speed, HTTP querying, big data, attachments...

~~~
jbyers
Requirement number one was that IDs be sortable by time. CouchDB UUIDs are
random.

~~~
mduvall
The default _uuids API call generates random UUIDs; however, you could
override it in the _config with a different algorithm field, one that could
potentially use "utc_random", which still includes a timestamp in the string
to sort by later. Was this thought about when CouchDB was considered?

------
trim
This seems fairly neat. Is it possible to do something like this with MySQL?

~~~
mikeyk
(author here) I think this should also be possible with MySQL's stored
functions, but one huge benefit of PostgreSQL is the schema/tablespace
feature, since it means all our logical shards live inside one database (you
could do something similar with MySQL, but it would mean multiple databases or
prefixing table names with the shard ID).

~~~
HarrisonFisk
PostgreSQL schemas are very similar to MySQL databases in functionality. In
fact, in MySQL you can use SCHEMA in all the places you can use the term
DATABASE, i.e. CREATE SCHEMA foo; instead of CREATE DATABASE foo;

------
sirmiller
I just love how startups keep reinventing the wheel to solve problems that
haven't been problems for decades.

Come on, 25 pics + 90 likes per second... that's almost like... [wait for
it]... nothing.

I'm pretty sure you got your number wrong.

~~~
sirmiller
Almost forgot: Sharding 101 "Always shard via directory..."

------
fleitz
How to do sharding:

1\. Define your keyspace in bits.

2\. Shard your keyspace in CIDR notation

3\. Resharding always splits a keyspace in half

    
    
      eg. 0.0.0.0/0 becomes 0.0.0.0/1 and 128.0.0.0/1
    

4\. Store your shard keyspace in DNS with SRV/TXT records

5\. Assign IDs randomly

This gives a couple of interesting properties: for replication, the odds are
very high that your corresponding server has a shard of similar size, and if
not, it's generally only split over two servers. Servers also only need to
speak directly to their one or two replicas in the other data center. The
other really nice property, if you're using spindle disks, is that you can
copy the entire HD and just delete the half of the keyspace not used after the
split.
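
A toy sketch of the split step, treating a shard as a prefix plus prefix
length over the ID space (only an illustration of the idea, not code from the
actual system):

    # Toy sketch of splitting a keyspace in half, CIDR-style.
    import random

    KEY_BITS = 32  # size of the ID space, in bits

    def split(prefix, length):
        # One shard becomes two: the same prefix, and the prefix with the next
        # bit set, e.g. 0.0.0.0/0 -> 0.0.0.0/1 and 128.0.0.0/1.
        return (prefix, length + 1), (prefix | (1 << (KEY_BITS - length - 1)), length + 1)

    def owns(prefix, length, key):
        # A shard owns a key iff the key's top `length` bits equal the prefix.
        mask = ((1 << length) - 1) << (KEY_BITS - length)
        return (key & mask) == prefix

    whole = (0, 0)                         # the entire keyspace, 0.0.0.0/0
    a, b = split(*whole)                   # 0.0.0.0/1 and 128.0.0.0/1
    key = random.getrandbits(KEY_BITS)     # step 5: IDs are assigned randomly
    assert owns(*a, key) != owns(*b, key)  # exactly one half owns any given key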

It's all written up here, with a full description of all the pieces, if you
want to dig into it: <http://bit.ly/FredPatent>

~~~
Udo
It takes some nerve to actually be proud of that patent.

Luckily for the rest of us, there is not just one way to do sharding (which in
the original version of your post you assertively referred to as "how to do
sharding _right_ ").

It depends on the individual architecture and the tradeoffs people consider
important. That means you don't get to have an undeserved but legal monopoly
holding the entire scalable web hostage, just a part of it. Nevertheless: you
should be ashamed of yourself.

