Another interesting detail is that this is roughly the 4th iteration on the FriendFeed backend since we launched 17 months ago. If you look at the graphs at the bottom of Bret's post, you can see that our previous system was about to die -- average pageview latency had increased from about 135ms to 260ms in less than a month! (not a good trend) This new design also accommodates some important upcoming features that would have been problematic in the old system.
This experience reinforces my belief that it's better to be quick than brilliant. If we had wasted a lot of time trying to build some really smart, super-scalable system right from the start, it would have inevitably been over-optimized for the wrong things, since the product and requirements have changed quite a bit since launch. By continually rebuilding the system, it stays relatively simple and close to our actual needs (plus it incorporates everything we learned from the previous iterations).
After reflecting on my own experience with a failed B2C product, I definitely agree with the statement in your second paragraph. We spent a significant amount of time worrying about 'scaling' on a lot of fronts, not just technical, but the highest CCU we ever hit was a couple hundred users over the course of a year and a half.
From a design standpoint, I think it's perfectly acceptable to keep said objectives a high priority... but from an execution standpoint, it's more important to be as nimble and flexible as possible. Re-writes shouldn't be feared too much; they take less and less time if the team is applying what they've learned.
Increasing the number of shards is similar to changing the backend infrastructure, but simpler. Downtime obviously isn't acceptable, so when switching from one system to the next, we have a period during which we write to both so that it is safe to read from either one. Other schemes could be used for resharding, but this is simple enough and also works for other changes, and in practice we've changed the schema more often than the number of shards.
Also, keep in mind that it is ok to have more shards than you really need (multiple shards can run on the same machine, for example), so resharding needn't be a common operation.
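A minimal sketch of that double-write window (all names hypothetical; plain dicts stand in for the database shards):

```python
# Sketch of the double-write window during a reshard (hypothetical names,
# not FriendFeed's actual code). While migrating, every write goes to both
# the old and new shard layouts, so reads can safely come from either.

def shard_index(key, num_shards):
    """Pick a shard for a key under a given shard count."""
    return hash(key) % num_shards

class Datastore:
    def __init__(self, old_shards, new_shards):
        self.old_shards = old_shards  # list of dicts standing in for DBs
        self.new_shards = new_shards
        self.migrating = True         # True while the double-write window is open

    def put(self, key, value):
        # During the migration window, write to both layouts.
        self.old_shards[shard_index(key, len(self.old_shards))][key] = value
        if self.migrating:
            self.new_shards[shard_index(key, len(self.new_shards))][key] = value

    def get(self, key):
        # Safe to read from either layout once double-writing has begun
        # (plus a backfill of pre-existing rows, omitted here).
        new = self.new_shards[shard_index(key, len(self.new_shards))].get(key)
        if new is not None:
            return new
        return self.old_shards[shard_index(key, len(self.old_shards))].get(key)
```

Once the backfill of old rows completes, reads move to the new layout, the old shards are dropped, and `migrating` is switched off.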
We store data in a really similar way at Spock, pointing all our reads at a big MySQL table which essentially just has an id column and a really big text column containing all the values, JSON-encoded, w/roughly the same number of rows as the FriendFeed table. Our DBA wrote a blog post about it a while back: http://www.frankf.us/wp/?p=17
The way you guys create/delete indices and avoid all the replication nightmares sounds super cool.
All of these solutions ultimately end up sounding like a poor man's Big Table.
We did something similar for GameClay. I stored game properties as JSON-encoded dicts stored in MogileFS, then had a "regular" MySQL table that would point to the MogileFS key for the file, then the Python code would just read it out, use a JSON library to parse it, and manipulate it as a Python object. We had normal MySQL indexes on all the game metadata that appeared in the UI, so if you want all arcade games that nostrademons has posted, it would do a normal index search on (nostrademons, "arcade"), find the Mogile key for that, and then fetch it.
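A rough sketch of that scheme, with in-memory dicts standing in for MogileFS and the MySQL metadata table (all names hypothetical):

```python
import json

# In-memory stand-ins for the real stores: MogileFS held the JSON blobs,
# MySQL held the indexed metadata pointing at the Mogile key.
mogile = {}          # mogile_key -> JSON text
metadata_index = []  # rows of (author, category, mogile_key)

def save_game(author, category, game_dict):
    key = "game:%s:%s" % (author, game_dict["title"])
    mogile[key] = json.dumps(game_dict)             # blob store write
    metadata_index.append((author, category, key))  # "normal" MySQL index row

def games_by(author, category):
    # Index lookup on (author, category), then fetch + parse each blob.
    return [json.loads(mogile[key])
            for a, c, key in metadata_index
            if a == author and c == category]
```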
We never got to the point where we'd need to shard, alas. I'd like to think I would've thought of the index-per-table approach independently, but probably not.
Curiously, I did things this way for ease of development, not performance (and there were probably lots of little things that would've killed our performance anyway - it's really hard to diagnose performance bugs until you have real users banging away). We needed a storage system - MogileFS and MySQL work out of the box. We needed a data format - JSON works out of the box. We needed to be able to change game schemas rapidly, since we kept finding our existing design was not really fun to play with (and that's what killed us, ultimately). The rest of it followed pretty obviously.
I just recently wrapped up using a similar "store serialized data" set-up myself. In my example, I'm allowing the user to insert/remove/reorder items on a list. This type of operation is a pain to do with SQL operations. If you have a list A,B,C,D,E,F and you want to insert G before C, then you either have to:
1. delete all, then insert A,B,G,C,D,E,F.
2. set "sortnumber" on G to "3", and increment sortnumber on all >= 3.
3. set "sortnumber" of G to be the average of B and C.
The last option is the cleverest, but eventually you run out of space in floating-point land unless you have a cron come in and clean everything up periodically. And of course, reordering (the example was an insert) and removing need to be considered as well. So it's a choice between lots of deletes and inserts (but really easy), the semi-annoying logic of reordering/removing plus a few updates, or a clever hack that requires a cron to cover your ass.
What I did instead was to store the list as JSON, convert it to an array, and use array-splicing functions to reorder things. Then I convert it back to JSON and store it. It's worked extraordinarily well... It takes a fraction of the time to do native data-structure stuff compared to touching the DB several times.
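That decode-splice-re-encode round trip might look something like this (a Python sketch; the original code was presumably PHP given the isset() mention later):

```python
import json

# The serialized column value, as it would come back from the database.
stored = json.dumps(["A", "B", "C", "D", "E", "F"])

def insert_before(serialized, new_item, target):
    """Decode the JSON list, splice new_item in before target, re-encode."""
    items = json.loads(serialized)
    items.insert(items.index(target), new_item)
    return json.dumps(items)

# One row read + one row write, no sortnumber bookkeeping, no cron.
stored = insert_before(stored, "G", "C")
```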
test=# select * from l;
 id | prev | mydata
----+------+--------
 A  | F    | dA
 B  | A    | dB
 C  | B    | dC
 D  | C    | dD
 E  | D    | dE
 F  | E    | dF
to insert a new item into the list, you would do:
update l set prev='G' where prev='C';
insert into l (id, prev, mydata) values ('G', 'C', 'data for G');
so that's one update and one insert for an insertion into the list. Note that the two commands have to be in one transaction, because between them the foreign key constraint is briefly violated (which is allowed by the deferrable initially deferred modifier).
Let's inspect our list again:
test=# select * from l;
 id | prev | mydata
----+------+------------
 A  | F    | dA
 B  | A    | dB
 C  | B    | dC
 E  | D    | dE
 F  | E    | dF
 D  | G    | dD
 G  | C    | data for G
so the predecessor of G is C, and the predecessor of D is G, as specified.
Of course, you lose the ability to sort with 'order by', but that's no big deal: you know the predecessor and successor of each item, so it's easy to traverse the list in either order. This could be done on the client side [probably the best solution in your case], in the application code, or inside the database with a stored procedure or a recursive query (coming in PostgreSQL 8.4); in Oracle it could probably be done with 'connect by'.
In reality, you would of course choose other datatypes for id and prev (probably integer), but I wanted to translate your example as literally as possible. Another problem that's easily solved: how do I get all elements of one list? Solution: either start from one 'starting element' and traverse the list, or introduce a listId attribute and select by that, which is probably faster but doesn't give you the sort order.
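The application-side traversal mentioned above might be sketched like this, with the rows represented as an {id: prev} mapping (hypothetical names):

```python
def in_order(rows, first_id):
    """Walk a circular prev-linked list. `rows` maps each id to its
    predecessor's id; stop once first_id comes around again."""
    # Invert the prev pointers to get successor pointers: prev -> next.
    next_of = {prev: item for item, prev in rows.items()}
    order, current = [], first_id
    while True:
        order.append(current)
        current = next_of[current]
        if current == first_id:
            break
    return order
```

For the example table above, starting from 'A' this yields A, B, C, D, E, F; walking `rows` directly instead of `next_of` would give the reverse order.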
Aha! I did forget the linked list approach. So essentially, each item on the list stores what is before (or) after it. I'm assuming it's an arbitrary choice that you're using "prev" instead of "next," correct?
I'd never heard of the deferred modifier, but it makes perfect sense in this case. Unless, of course, you want to insert the record first and then modify the update to exclude the item you just inserted.
Right now, I'm using MySQL.. I only have 4 tables, and product is not launched. Would you advise switching to Postgres?
> I'm assuming it's an arbitrary choice that you're using "prev" instead of "next," correct?
yes, in effect it's a doubly linked list (circularly doubly linked), so you could remember the id of the first element of the list and then traverse in any order as long as this id does not reappear.
> Unless, of course, you want to insert the record first and then modify the update to exclude the item you just inserted.
yes - in this case you would get a violation of the unique constraint:
insert into l (id, prev, mydata) values ('G', 'C', 'data for G');
update l set prev='G' where prev='C';
this would violate the unique constraint for prev, because after the insert (but before the update!), both the new element and the element not yet updated have prev set to 'C' - the transaction will then be rolled back.
In standard SQL this would be possible because it allows to declare unique constraints (and I think other constraints, such as check clauses) as deferrable, too - PostgreSQL doesn't implement this, it allows deferrable only for foreign key constraints. But usually it's no problem to order the commands in a way that only foreign key constraints get violated during a transaction.
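As an aside, SQLite also supports DEFERRABLE INITIALLY DEFERRED foreign keys, so the transaction above can be reproduced from Python's standard library without a PostgreSQL install (a sketch mirroring the example schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FKs by default
conn.execute("""
    CREATE TABLE l (
        id     TEXT PRIMARY KEY,
        prev   TEXT UNIQUE REFERENCES l(id) DEFERRABLE INITIALLY DEFERRED,
        mydata TEXT
    )""")
conn.executemany("INSERT INTO l VALUES (?, ?, ?)",
                 [("A", "F", "dA"), ("B", "A", "dB"), ("C", "B", "dC"),
                  ("D", "C", "dD"), ("E", "D", "dE"), ("F", "E", "dF")])
conn.commit()

# Inside the transaction, D's prev briefly points at a not-yet-inserted 'G';
# the deferred FK is only checked at COMMIT, by which time it holds.
with conn:
    conn.execute("UPDATE l SET prev = 'G' WHERE prev = 'C'")
    conn.execute("INSERT INTO l VALUES ('G', 'C', 'data for G')")
```

Note the ordering: the update runs before the insert, so only the deferrable FK (never the non-deferrable unique constraint on prev) is violated mid-transaction, exactly as described above.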
> Right now, I'm using MySQL.. I only have 4 tables, and product is not launched. Would you advise switching to Postgres?
as an entrepreneur, you should probably do what's best for your customers - and they will likely not care which RDBMS you use :-)
Perhaps you could play around a little bit with PostgreSQL on the side and (perhaps) make the switch once you are comfortable with it. And keep your JSON-based lists - if the system works, why bother with a rewrite / schema change (for now?).
As for momentum, I think PostgreSQL is gaining steam while MySQL is losing it (some key developers left after the acquisition by Sun) - of course, that's my subjective impression.
The data I'm having the user order is not only order-specific but recursive. That is, one of the items on the list, rather than being a letter like "B", can be a list in and of itself. It's basically like "files and folders".
A major problem isn't only the UI, which took me days to make (drag and drop for files/folders ... anybody seen anything like this done before in HTML+JS?), but mostly the updating/validation. I receive information like "moved item: /path/to/item to before /some/other/path" and I now need to make sure this is a valid action (e.g.: can't put a folder into one of its own subfolders, can't put a file in a file, etc.), and also update the database to reflect it.
I chose to use JSON to encode objects in the DB, then decode them into an array, and do some easy/fast array stuff to perform the action they requested. Just judging on the array code, which does several "isset()" calls and splices, I'd imagine doing this with a DB would be a major heartache... but I have no doubt you'd be able to come up with some brilliant way to do it.
yeah, of course! And if it works, then, by definition, it's good for your customers and thus for you! Also, your approach is probably more flexible (no fixed schema for the recursive list structure), which makes it easier and faster for you to iterate and react to customer feedback.
If at some time you want to have a more fixed structure or let the database do some of the server-side validation work for your tree-like structure, here is a good writeup by Phil Greenspun showing how to model and query tree-like data structures with an RDBMS:
(the rest of the document 'SQL for Web Nerds' is also quite good stuff.)
He uses 'connect by', which is a non-standard Oracle extension; the same thing can be achieved with 'with recursive' queries, which are standard SQL (1999) and part of PostgreSQL 8.4.
Also, Joe Celko's book 'SQL for Smarties' has a good chapter covering hierarchical data structures in SQL (he also has a separate book about trees and hierarchies in SQL).
Oh, and good luck and much success with your startup!
This is really interesting...could you do the same thing using S3 instead of MogileFS? The advantage would be cost and simple scalability, especially if you're running on EC2 so you don't have to pay for all the back and forth transfer from EC2 to S3. Concerns might be latency issues and whether this would scale to billions of objects. Other than that, it would seem like you could run a pretty huge site off of just a couple beefy servers.
Absolutely. It'd work with any sort of distributed filestorage. It wouldn't be too hard to wrap the S3 methods so that the API is consistent (the MogileFS API is basically a dict), so app code wouldn't even have to change.
We used Mogile over S3 mostly because I didn't realize there was no minimum monthly fee for S3.
Excellent article. I work in an environment where we have serious vendor lock-in and the people in charge of infrastructure have a real fear of change. I have free rein to implement any improvements to our performance and code so long as I don't make things difficult for them.
Thus, the idea of a MySQL cache system is something I've been planning to look at for a while as it plays with the existing infrastructure. This (seemingly proven) system will make things a lot easier, so thanks for sharing.
"However, none of them seemed widely-used enough by large sites to inspire confidence. In the tests we read about and ran ourselves, none of the projects were stable or battle-tested enough for our needs"
Ok, just a few hours ago I released the beta-3 of Redis (http://code.google.com/p/redis/ if you care) and I'm near feature-freeze, with exactly this goal: to make it rock solid (I'm going to use it in my startup's web stuff with a lot of users/month, so I care about stability).
The question is: what's, in your opinion, the right path to make a system like Redis stable and reliable for real-world usage? What to publish on the site in order to inspire a good feeling about stability? Thanks
Not to be a naysayer, but I don't know that there's a lot you can do except run it yourself on a large, popular site that people have heard of. Companies like FriendFeed will use memcached because LiveJournal/Facebook use it, they'll use MySQL because just about every web startup uses it, but no matter how awesome a project is, they're not going to use something complex that a few developers or some other startup wrote that hasn't been battle-tested on a large, well-known site. It's too risky compared to writing another one yourself that you understand.
I've been evaluating databases recently. It doesn't matter if the site is not well-known in the US as long as you can post traffic numbers. TokyoTyrant is Japanese, after all.
Also these help:
* able to handle large (>200GB) datasets
* client libraries for top N languages
* easy way to write client libs (eg simple protocol, a C library, etc)
* connection pooling and/or cheap connections
* easy to install on Mac, Linux and Windows development machines
* repl (this is not much of a problem with Python & Ruby)
* stable dump / restore format
One problem for you is that MySQL plus serialize() basically does these already, with 10+ years of testing on top. Your system has to do a lot more to make it worth the risk.
Scale enough in which way? Read traffic? Write traffic? Locking? If you use MySQL as a btree it is pretty fast and consistent. I gather from this post that a lot of companies started in the last 18 months are doing this. It's a hack but a remarkably useful one. Good luck! I will keep my eye on redis.
Probably it just depends on the dataset. The issue we have with our service (http://lloogg.com) is that you need to take the last N items of log for every site. Asking for the latest M items should be fast; pushing a new log line onto the list should be fast. Every MySQL configuration we tried was unable to reach the 10k writes/second we reach with Redis. Obviously: even when you use MySQL as a btree implementation you get a lot of overhead, starting from the protocol and the format of the statements, for example.
The idea of encoding things with JSON or other formats in a blob/text column is just an ugly hack. People are using this because they are desperate, not because it is good computer science. They started with MySQL, know MySQL, hacked with MySQL. Clearly they will try to fix their site with MySQL.
The json+blob approach can work as long as the data stored in these fields is trivial to serialize/deserialize. What about having a 10000-element list in every blob, when at every page view you need to append an element?
So: great hack, you found a way to work with the tools you have, but this does not mean in any way that fast key-value persistent DBs don't have something to say in the web-scale theater.
I added consistent ring hashing and a rediscloud class to the Ruby client library today. So you can now use a cluster of redis servers and have the client use ring hashing to determine the server for a given key.
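For reference, a minimal consistent-hash ring might look like this (a Python sketch, not the actual Ruby client code):

```python
import hashlib
from bisect import bisect, insort

class Ring:
    """Minimal consistent-hash ring: each server gets several points on the
    ring so keys spread evenly, and adding/removing a server only remaps
    the keys nearest its points (unlike plain modulus hashing)."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self.points = []  # sorted (hash, server) pairs
        for server in servers:
            self.add(server)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server):
        for i in range(self.replicas):
            insort(self.points, (self._hash("%s:%d" % (server, i)), server))

    def server_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect(self.points, (self._hash(key),))
        return self.points[idx % len(self.points)][1]
```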
It's higher-level, not plain key-value stuff. For instance, as a value you can have a list or a set, push/pop elements, ask the server for all the keys matching a given glob-style pattern, and so on. Most of these operations are atomic, in order to make sure there are no race conditions.
Has anyone here got TokyoTyrant working under Windows? I found a static library binary for Windows, but the header files needed to compile are either not there, or the code references some bizarre POSIX stuff which isn't easy to hack into working under Windows.
That's reassuring, I thought I was the only crazy fool who was storing json objects in database columns =). We do something similar for some of our data models at thesixtyone.com. It's really nice for not having to bring down the site for schema upgrades.
Actually we're on postgres... It just makes me nervous when the migration takes upwards of 10 minutes to complete, since some of our tables have millions of rows and lots of concurrent reads/writes are happening.
If the site isn't taken down during an update, some parts of it will always be broken during a schema migration since our ORM expects each table corresponding to a data model to be in sync after a code update. I guess you could always version your data models and handle the pre-upgrade and post-upgrade cases, but that seems like a lot of work in our situation.
I find this interesting because for a previous project, we had to manage separate database instances for each client (regulations). Even though we had a database management tool that allowed us to push schema changes out to the various DBs automatically, it would be better to not have to push those changes out at all.
True in this case, but not always. Often the issue isn't upgrading the schema; it's that there can be no single schema that can hold the data, because it varies on a per-row basis. No RDBMS can deal with this well, no matter how mature - it's just not what they're designed for. This is trivial for an OODBMS, which just stores raw objects, no matter their shape.
We're doing something very similar to store RDF data. Our reason for doing it this way isn't performance but rather RDF's pretty intricate schema requirements (and opportunities). The drawback of using this scheme is that it roughly doubles the amount of data stored.
Contrary to Friendfeed, we have to use joins a lot because analysing data is the purpose of our application. We tried to do it with mysql, but mysql turns out to be completely unsuitable for the task due to its lack of merge or hash joins.
I'm quite surprised that someone like Friendfeed would change their entire data model for performance reasons instead of considering a stronger RDBMS (of which there are many). Their problem with index maintenance isn't exactly new. It's a solved problem that needs no wheel reinventing and doesn't merit the increased complexity of asynchronous index updates in my view.
Really a great way to do things... thank you so much for sharing this.
I can't really see any disadvantages to doing things this way. You still get access to the data in a relational sense, though where you'd normally compare columns of a table, you now do joins. For example, you can get the unique users that have submitted a link by joining the index_link table to the entities table, then joining that to the index_user table.
The problem here seems to be that sharding would prevent this type of operation... so how do they get around it? It's possible they just don't need this data, but let's assume they do. I'm presuming they have some slave(s) munging non-realtime data into whatever real relational tables they choose. But if the need becomes realtime, I'm at a loss as to how they'd do it.
They explained the sharding in the text - I glossed over it on first read-through too. When they say "join" in quotation marks, they don't actually mean a join in the MySQL sense. Rather, the initial call to user_id_index.get_all reads all entity_ids for that user into the Python code (they say it consults all shards for this, but isn't the index sharded on user_id, so all entities for a given user_id live on one shard?). The Python code then uses whatever shard function applies to the entities table to query its database backends, selecting the relevant entities. Then the Python code filters the returned records by the indexed field (in case the indices are out of date) and returns it.
As for disadvantages - well, it's denormalized, for starters. ("Normalization is for sissies", says Cal Henderson.) If an indexed field changes, you need to update it in both the index and the relevant entities. There're also a bunch of little inefficiencies, places where they traded performance for scalability. Imagine if you naively plugged this engine into an app with only 10 records: instead of a simple index search, it'd have to go to the index table, fetch the relevant entities, go to the entities table, fetch them, filter on indexed value, and then return them all. But then, if your database fits on one machine, you don't have the same sort of engineering challenges FriendFeed does.
When they say it's sharded on "user_id", what they mean is that's the field that decides which database the record is stored in. It might go something like: if the user_id starts with 0-8, store in DB1; otherwise, store in DB2. It's up to their Datastore controller to decide how to hash based on the user_id and the number of databases.
OK, but the # of shards is fixed "forever" under the modulus scheme? You pick it once, when you first shard, and then you're looking at downtime to adjust it?
In order to split across multiple DBs, you're looking at creating, say, 100/1000 DBs in your initial split (when you've got maybe 2-3 machines). And that number then caps the number of machines you can scale to without adding another layer (sharding the shards) or having downtime?
> You pick it once, when you first shard and then you're looking at downtime to adjust it?
Yes, but you say that like it's a bad thing - it's not. If you're ever forced to reshard, that means you grew beyond what you ever hoped for... hurray, awesome, nice problem to have. In reality, though, there's a 99% chance that'll never happen.
Maybe 1 in 1000 people ever runs into a scaling problem of this magnitude, but if you read the blogs you'd come away thinking scaling issues that require sharding are common and everyone needs this stuff. They aren't, and they don't.
I don't see how this scheme can scale simply for the reason that there's no built in balancer. What's to stop shardN from becoming overwhelmed when all the power users end up there, while shardN-1 has no activity?
In practice it's going to be very rare that you'll overwhelm a single shard, and adding an extra layer to point to where users actually are is quite simple and fast. You can use a simple cache key (read-through cache, of course) that, if it exists, means the user is on a specific shard, overriding the default algorithmic pick.
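That override scheme might be sketched as follows (hypothetical names; a dict stands in for the read-through cache):

```python
shard_override_cache = {}  # stand-in for a memcached-style cache

def default_shard(user_id, num_shards):
    """The normal algorithmic pick: hash the user id modulo shard count."""
    return hash(user_id) % num_shards

def shard_for(user_id, num_shards):
    # If a cache entry exists, the user has been pinned to a specific shard;
    # otherwise fall back to the default algorithm.
    return shard_override_cache.get(user_id,
                                    default_shard(user_id, num_shards))

def move_user(user_id, new_shard):
    # After migrating a hot user's rows off an overloaded shard,
    # record their new home so future lookups go there.
    shard_override_cache[user_id] = new_shard
```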
We don't do any joins in MySQL. We query the indexes in all of the shards in parallel to get a list of entity IDs, and then query the entities tables in all of the shards in parallel to get the entity bodies as a second operation.
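The two-step parallel read described here might be sketched like this (in-memory dicts stand in for the MySQL shards; all names hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Step 1 queries the index on every shard in parallel for entity ids;
# step 2 fetches the entity bodies, again in parallel across shards.
index_shards = [{"user:1": ["e1", "e3"]}, {"user:1": ["e2"]}]
entity_shards = [{"e1": "body1", "e3": "body3"}, {"e2": "body2"}]

def get_entities(user_id):
    with ThreadPoolExecutor() as pool:
        # Step 1: index lookup fanned out to all shards.
        id_lists = list(pool.map(lambda s: s.get(user_id, []), index_shards))
        entity_ids = [eid for ids in id_lists for eid in ids]
        # Step 2: fetch the matching entity bodies from every shard.
        bodies = list(pool.map(
            lambda shard: [shard[e] for e in entity_ids if e in shard],
            entity_shards))
    return [b for chunk in bodies for b in chunk]
```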
I like the trick for getting around index creation and deletion; I wasn't aware that MySQL required a full table lock. I checked the docs for DB2, and it allows reads/writes during index creation -- is this a MySQL-only limitation, or do the other major databases impose the same restriction?
Most of the MySQL posts here are actually about clever ways to do things most DBAs and developers have taken for granted for years. Do you wonder why no-one seems to be writing blog posts about "sharding" Sybase or DB2 or Oracle or Postgres...?
MySQL is popular with startups though, so it's not surprising that there's a lot written about it.
It's true that MySQL has some lame limitations, but I don't believe that there are any silver bullets out there. Google tried to switch their ads system from MySQL to a "real" database once, and it was basically a disaster and had to be abandoned in favor of MySQL (I wasn't working on it, so I can't really give all the details).
Another problem we had with MySQL that Bret didn't mention was that it would try to be "smart", and sometimes it would "randomly" (from our perspective) choose a very inefficient strategy, and we would have to waste a lot of time figuring out what it was doing and how to force it to do the right thing. The approach Bret describes basically avoids any MySQL "smarts" and treats it as a dumb, but fast and well-tested, B-tree. This gives us fairly reliable and predictable performance characteristics because we know exactly what it's doing (mostly).
No, there are no silver bullets, but it seems to me that people reach for Mysql a bit too quickly, without considering the pros and cons. And while it's improving, Mysql has had many frustrating things in the past... to me it's always seemed like a "worse is better" kind of thing. Sure, it's "fast", but at what cost? Once you go to InnoDB, you lose that speed advantage.
One thing that's not a tech tradeoff, and is generally a Postgres win, is the BSD style licensing. You can take Postgres and do whatever you want with it with no worries.
> Google tried to switch their ads system from MySQL to a "real" database once,
I hope the part that actually handles money has been fiddled with by Google to be robust.
Actually, Postgres doesn't have any great out-of-the-box solution for partitioning the database across machines. The usual suggestion is Slony, but that is nowhere near as robust and widely deployed as MySQL replication. The GPL licence for MySQL isn't really a problem for webapps anyway.
OTOH, Postgres does somewhat better than MySQL on a single box with multiple cores (it's fairly linear up to 8 CPUs, which is much better than MySQL) - mainly because of the work Sun put into scaling it before they bought MySQL. (At least - that's according to some people from Sun who do a lot of performance work with both databases)
> Postgres does somewhat better than MySQL on a single box with multiple cores ... mainly because of the work Sun put into scaling it before they bought MySQL.
That's not really the case. Sun has done some benchmarking, but most of the work on Postgres SMP performance was done by others (mostly Tom Lane, who works for Red Hat, and various EnterpriseDB employees).
How do hash joins help with sharding? Surely once you go above a certain number of writes per second you're going to need to have more than one writable database server, at which point you need to start partitioning and better join algorithms aren't going to do anything to help.
If you can't hash join then you can't join over large datasets anyway, so sharding costs you nothing in that respect.
Right now, using off the shelf kit and doing nothing particularly clever, running a major commercial RDBMS you could do 10,000 commits/sec and handle 100T of data on a single instance. Sure it would cost you a pretty penny, but the thing is, unless running a database is the one competitive advantage your company has, you're better off keeping your people focussed on the thing that does make you money.
I'm not too sure if you are arguing for or against the commercial RDBMS.
To me, that sounds like a pretty good argument against it - the great thing about MySQL/Postgres+Sharding is that it scales down, as well as up. You can start out with a single server, then gradually add in extra servers as you need them. With the "single big DB server model" it doesn't work like that - you have to make a pretty decent investment early on in the software licence, build your software to use it, and then the pricing isn't linear, either.
Plus, the backup/redundancy thing sucks too - with sharded MySQL you need a couple of spare cheap servers, but with Oracle etc you need to pay twice for the licence and a second server.
OTOH, in a corporate environment it's much easier to predict your usage and the pricing is easier to justify. Also your priority is safety+justifiability first rather than price or even price/performance.