
How FriendFeed uses MySQL to store schema-less data - paul
http://bret.appspot.com/entry/how-friendfeed-uses-mysql
======
paul
Another interesting detail is that this is roughly the 4th iteration on the
FriendFeed backend since we launched 17 months ago. If you look at the
graphs at the bottom of Bret's post, you can see that our previous system was
about to die -- average pageview latency had increased from about 135ms to
260ms in less than a month! (not a good trend) This new design also
accommodates some important upcoming features that would have been problematic
in the old system.

This experience reinforces my belief that it's better to be quick than
brilliant. If we had wasted a lot of time trying to build some really smart,
super-scalable system right from the start, it would have inevitably been
over-optimized for the wrong things, since the product and requirements have
changed quite a bit since launch. By continually rebuilding the system, it
stays relatively simple and close to our actual needs (plus it incorporates
everything we learned from the previous iterations).

~~~
jbert
Hi, is there any chance you could have a look at this comment:

<http://news.ycombinator.com/item?id=497070>

and let me know how you resolve this problem in practice? (Or correct my
understanding if there is no problem really).

~~~
paul
Increasing the number of shards is similar to changing the backend
infrastructure, but simpler. Downtime obviously isn't acceptable, so when
switching from one system to the next, we have a period during which we write
to both so that it is safe to read from either one. Other schemes could be
used for resharding, but this is simple enough and also works for other
changes, and in practice we've changed the schema more often than the number
of shards.
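
To illustrate, here's a rough sketch of that dual-write window in Python
(hypothetical names, not our actual code):

    
    
        def save_entity(entity):
            # During the migration window, write to both the old and the
            # new shard layouts, so reads from either are consistent.
            old_shards[entity['user_id'] % len(old_shards)].save(entity)
            new_shards[entity['user_id'] % len(new_shards)].save(entity)
    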

Also, keep in mind that it is ok to have more shards than you really need
(multiple shards can run on the same machine, for example), so resharding
needn't be a common operation.

------
wayne
Thanks for sharing. That was very educational.

We store data in a really similar way at Spock, pointing all our reads at a
big MySQL table which essentially just has an id column and a really big text
column containing all the values, JSON-encoded, with roughly the same number
of rows as the FriendFeed table. Our DBA wrote a blog post about it a while
back: <http://www.frankf.us/wp/?p=17>

The way you guys create/delete indices and avoid all the replication
nightmares sounds super cool.

All of these solutions ultimately end up sounding like a poor man's Big Table.

------
nostrademons
This is really interesting.

We did something similar for GameClay. I stored game properties as JSON-
encoded dicts in MogileFS, and had a "regular" MySQL table that pointed to the
MogileFS key for each file; the Python code would just read the blob out,
parse it with a JSON library, and manipulate it as a Python object. We had
normal MySQL indexes on all the game metadata that appeared in the UI, so if
you wanted all arcade games that nostrademons has posted, it would do a normal
index search on (nostrademons, "arcade"), find the Mogile key for that, and
then fetch it.
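
In rough Python terms (made-up names; the db handle and the MogileFS client
call are just illustrative, not real APIs), the read path was:

    
    
        import json
        
        def load_games(author, genre):
            # normal MySQL index lookup on the metadata columns
            rows = db.query("SELECT mogile_key FROM games"
                            " WHERE author = %s AND genre = %s", author, genre)
            games = []
            for row in rows:
                blob = mogile.get(row['mogile_key'])  # hypothetical client
                games.append(json.loads(blob))        # parse the JSON blob
            return games
    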

We never got to the point where we'd need to shard, alas. I'd like to think I
would've thought of the table-per-index approach independently, but probably
not.

Curiously, I did things this way for ease of development, not performance (and
there were probably lots of little things that would've killed our performance
anyway - it's really hard to diagnose performance bugs until you have real
users banging away). We needed a storage system - MogileFS and MySQL work out
of the box. We needed a data format - JSON works out of the box. We needed to
be able to change game schemas rapidly, since we kept finding our existing
design was not really fun to play with (and that's what killed us,
ultimately). The rest of it followed pretty obviously.

~~~
sam_in_nyc
I just recently wrapped up a similar "store serialized data" set-up myself. In
my case, I'm allowing the user to insert/remove/reorder items on a list. This
type of operation is a pain to do with SQL. If you have a list A,B,C,D,E,F and
you want to insert G before C, then you have to do one of the following:

    
    
      1. delete all, then insert A,B,G,C,D,E,F.
      2. set "sortnumber" on G to "3", and increment sortnumber on all >= 3.
      3. set "sortnumber" of G to be the average of B and C.
    

The third option is the cleverest, but eventually you run out of precision in
floating-point land, unless you have a cron job come in and clean everything
up periodically. And of course, reordering (the example was an insert) and
removing need to be considered as well. So it's a choice between lots of
deletes and inserts (but really easy), the semi-annoying logic of
reordering/removing with a few updates, or a clever hack that requires a cron
job to cover your ass.

What I did instead was store the list as JSON, convert it to an array, and use
array splicing functions to reorder things. Then I convert it back to JSON and
store it. It's worked extraordinarily well... it takes a fraction of the time
to do native data-structure manipulation compared to touching the DB several
times.
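
A minimal sketch of the idea in Python (my actual code differs, but the shape
is the same):

    
    
        import json
        
        def insert_before(raw_json, new_item, target):
            items = json.loads(raw_json)  # e.g. ["A","B","C","D","E","F"]
            items.insert(items.index(target), new_item)
            return json.dumps(items)      # store this back in the text column
        
        # insert_before('["A","B","C","D","E","F"]', "G", "C")
        #   -> '["A", "B", "G", "C", "D", "E", "F"]'
    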

~~~
jsrn
While your approach of doing this in JSON is cool, I think you have overlooked
the 'direct' way to do this in an RDBMS: with a linked list. Here is a quick
suggestion (works in Postgres):

    
    
        create table l (
               id char primary key references l(prev) deferrable initially deferred,
               prev char unique not null references l(id) deferrable initially deferred,
               mydata text not null
        );

then I populate the table with your example items:

    
    
        insert into l (id, prev, mydata) values ('A', 'F', 'dA'),
                                                ('B', 'A', 'dB'),
                                                ('C', 'B', 'dC'),
                                                ('D', 'C', 'dD'),
                                                ('E', 'D', 'dE'),
                                                ('F', 'E', 'dF');
    

let's see what that looks like:

    
    
        test=# select * from l;
         id | prev | mydata 
        ----+------+--------
         A  | F    | dA
         B  | A    | dB
         C  | B    | dC
         D  | C    | dD
         E  | D    | dE
         F  | E    | dF
         (6 rows)
    

to insert a new item into the list, you would do:

    
    
        begin;
        update l set prev='G' where prev='C';
        insert into l (id, prev, mydata) values ('G', 'C', 'data for G');
        commit;
    

so that's one update and one insert for an insertion into the list. Note that
the two commands have to be in one transaction, because inside the transaction
the foreign key constraint is violated (as allowed by the "deferrable
initially deferred" modifier).

Let's inspect our list again:

    
    
        test=# select * from l;
         id | prev |   mydata   
        ----+------+------------
         A  | F    | dA
         B  | A    | dB
         C  | B    | dC
         E  | D    | dE
         F  | E    | dF
         D  | G    | dD
         G  | C    | data for G
         (7 rows)
    

so the predecessor of G is C, and the predecessor of D is G, as specified.

Of course, you lose the ability to sort with 'order by', but that's no big
deal: you know the predecessor and successor of each item, so it's easy to
traverse the list in either order. This could be done on the client side
[probably the best solution in your case], in the application code, or inside
the database with a stored procedure or a recursive query (coming in
PostgreSQL 8.4); in Oracle it could probably be done with 'connect by'.
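
For example, a client-side traversal in Python could look like this (just a
sketch; it assumes the rows were already fetched into a dict of id -> (prev,
mydata)):

    
    
        def in_list_order(rows, start_id):
            # invert prev -> id to get "next" links
            next_of = dict((prev, i) for i, (prev, _) in rows.items())
            result, current = [], start_id
            for _ in range(len(rows)):  # the list is circular, so bound it
                result.append(rows[current][1])
                current = next_of[current]
            return result
    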

In reality, you would of course choose other datatypes for id and prev
(probably integer), but I wanted to translate your example as literally as
possible. Another problem is easily solved: how do I get all elements of one
list? Either take one 'starting element' and traverse the list from there, or
introduce a listId attribute and select by that, which is probably faster but
doesn't give you the sort order.

~~~
sam_in_nyc
There's one more thing I forgot to mention...

The data I'm having the user order is not only order-specific, but recursive.
That is, one of the items on the list, rather than being a letter like "B",
can be a list in and of itself. It's basically like "files and folders".

A major problem isn't only the UI, which took me days to make (drag and drop
for files/folders... has anybody seen anything like this done before in
HTML+JS?), but mostly the updating/validation. I receive information like
"moved item: /path/to/item to before /some/other/path", and I now need to make
sure this is a valid action (e.g. you can't put a folder into one of its own
subfolders, can't put a file in a file, etc.), and also update the database to
reflect this.
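
The subtree check itself is simple, though. A Python sketch of that
validation, with paths as in my example:

    
    
        def is_valid_move(src_path, dest_folder, src_is_folder):
            # a folder must not be moved into itself or its own subtree
            if src_is_folder:
                if dest_folder == src_path:
                    return False
                if dest_folder.startswith(src_path.rstrip('/') + '/'):
                    return False
            return True
        
        # is_valid_move('/a/b', '/a/b/c', True)  -> False
        # is_valid_move('/a/b', '/a/x', True)    -> True
    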

I chose to use JSON to encode objects in the DB, then decode them into an
array and do some easy/fast array stuff to perform the action the user
requested. Judging just by the array code, which does several "isset()" calls
and splices, I'd imagine doing this with a DB would be a major heartache...
but I have no doubt you'd be able to come up with some brilliant way to do it.

To each their own :)

~~~
jsrn
> To each their own :)

yeah, of course! And if it works, then, by definition, it's good for your
customers and thus for you! Also, your approach is probably more flexible (no
fixed schema for the recursive list structure), which makes it easier and
faster for you to iterate and react to customer feedback.

If at some time you want to have a more fixed structure or let the database do
some of the server-side validation work of your tree-like structure, here is a
good writeup by Phil Greenspun showing how to model and query tree-like
datastructures with an RDBMS: <http://philip.greenspun.com/sql/trees.html>
(the rest of the document 'SQL for Web Nerds' is also quite good stuff:
<http://philip.greenspun.com/sql> )

He uses 'connect by', which is a non-standard Oracle extension; the same thing
can be achieved with 'with recursive' queries, which are standard SQL (1999)
and part of PostgreSQL 8.4. Also, Joe Celko's book 'SQL for Smarties' has a
good chapter covering hierarchical data structures in SQL (he also has a
separate book about 'trees and hierarchies' in SQL).

Oh, and good luck and much success with your startup!

------
joshu
This is great. I sketched out a similar design for the Delicious backend, but
the team that built it went in a completely different and stupid direction. Oh
well.

Mine also had:

- some way to do server locality (I don't think FF needs this)

- some way to do table locality (I don't think FF needs this either)

- per-blob versioning and type

- a virtual shard layer, so new machines can be added inexpensively

Sorry for any inadvertent whining, incidentally.

------
DavidMcLaughlin
Excellent article. I work in an environment where we have serious vendor
lock-in and the people in charge of infrastructure have a real fear of change.
I have free rein to implement any improvements to our performance and code so
long as I don't make things difficult for them.

Thus, the idea of a MySQL cache system is something I've been planning to look
at for a while, as it plays nicely with the existing infrastructure. This
(seemingly proven) system will make things a lot easier, so thanks for
sharing.

------
antirez
"However, none of them seemed widely-used enough by large sites to inspire
confidence. In the tests we read about and ran ourselves, none of the projects
were stable or battle-tested enough for our needs"

OK, just a few hours ago I released beta-3 of Redis
(<http://code.google.com/p/redis/> if you care), and I'm near feature-freeze
with exactly this goal: to make it rock solid (I'm going to use it in my
startup's web stack, which gets a lot of users/month, so I care about
stability).

The question is: what, in your opinion, is the right path to make a system
like Redis stable and reliable for real-world usage? What should I publish on
the site in order to inspire confidence in its stability? Thanks

~~~
richcollins
How is Redis different from Tokyo Tyrant?

~~~
antirez
It's higher level, not plain key-value stuff. For instance, a value can be a
list or a set; you can push/pop elements, ask the server for all the keys
matching a given glob-style pattern, and so on. Most of these operations are
atomic, in order to make sure there are no race conditions.
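
For example, with the Python client (a minimal sketch; exact method names may
differ between client versions):

    
    
        import redis
        
        r = redis.Redis(host='localhost', port=6379)
        
        # values can be lists, with atomic push/pop
        r.rpush('timeline:paul', 'entry1')
        r.rpush('timeline:paul', 'entry2')
        first = r.lpop('timeline:paul')      # 'entry1'
        
        # or sets
        r.sadd('tags:entry1', 'mysql')
        members = r.smembers('tags:entry1')
        
        # glob-style key matching on the server
        keys = r.keys('timeline:*')
    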

You can read more about the difference between Redis and other key-value
stores here: <http://code.google.com/p/redis/wiki/FAQ>

Basically, the long-term goal is to have something between a relational DB and
a key-value DB. Not all the higher-level features must be killed in order to
be scalable.

------
hsiung
That's reassuring; I thought I was the only crazy fool who was storing JSON
objects in database columns =). We do something similar for some of our data
models at thesixtyone.com. It's really nice for not having to bring down the
site for schema upgrades.

~~~
gaius
_It's really nice for not having to bring down the site for schema upgrades._

All the other popular databases let you modify the schema online. This feature
has been taken for granted for over a decade.

~~~
newt0311
Indeed. This is not an RDBMS problem. It's a MySQL problem, and the correct
solution is to use a mature DB.

~~~
apgwoz
Even with Oracle, we've had to schedule 45 minutes of downtime for a single-
column schema change and population. I would call Oracle "mature."

Note: I wasn't involved in doing the update, so it's possible there was a
better way to do it. The table is used by about 50 different applications.

~~~
fendale
Almost all major Oracle operations can be done online: index rebuilds, table
reorgs, etc. How much it affects things depends on how much hardware you
have...

------
fauigerzigerk
We're doing something very similar to store RDF data. Our reason for doing it
this way isn't performance but rather RDF's pretty intricate schema
requirements (and opportunities). The drawback of this scheme is that it
roughly doubles the amount of data stored.

Unlike FriendFeed, we have to use joins a lot, because analysing data is the
purpose of our application. We tried to do it with MySQL, but MySQL turns out
to be completely unsuitable for the task due to its lack of merge or hash
joins.

I'm quite surprised that someone like FriendFeed would change their entire
data model for performance reasons instead of considering a stronger RDBMS (of
which there are many). Their problem with index maintenance isn't exactly new.
It's a solved problem that needs no wheel reinventing and doesn't merit the
increased complexity of asynchronous index updates, in my view.

------
sam_in_nyc
Really a great way to do things... thank you so much for sharing this.

I can't really see any disadvantages to doing things this way. You still get
access to data in a relational sense, though where normally you'd compare
columns of a table, you'd now do joins. For example, you can get the unique
users that have submitted a link by joining the index_link table to the
entities table, then joining that to the index_user table.

The problem here seems to be that sharding would prevent this type of
operation... so how do they get around it? It's possible they just don't need
this data, but let's assume they do. I'm presuming they have some slave(s)
munging non-realtime data into whatever _real_ relational tables they choose.
But if the problem becomes realtime, I'm at a loss as to how they'd do it.

~~~
nostrademons
They explained the sharding in the text - I glossed over it on first read-
through too. When they say "join" in quotation marks, they don't actually mean
a join in the MySQL sense. Rather, the initial call to user_id_index.get_all
reads all entity_ids for that user into the Python code (they say it consults
all shards for this, but isn't the index sharded on user_id, so that all
entities for a given user_id live on one shard?). The Python code then uses
whatever shard function applies to the entities table to query its database
backends, selecting the relevant entities. Then the Python code filters the
returned records by the indexed field (in case the indices are out of date)
and returns them.
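
In rough Python pseudocode (made-up names and sharding details, not
FriendFeed's actual code), that read path is something like:

    
    
        def shard_for_entity(entity_id):
            # made-up shard function for the entities table
            return hash(entity_id) % num_shards
        
        def entities_for_user(user_id):
            # the index is sharded on user_id, so one shard has all the ids
            entity_ids = shards[user_id % num_shards].query(
                "SELECT entity_id FROM index_user WHERE user_id = %s", user_id)
        
            # group the entity fetches by the shard each entity lives on
            by_shard = {}
            for eid in entity_ids:
                by_shard.setdefault(shard_for_entity(eid), []).append(eid)
            entities = []
            for shard_id, ids in by_shard.items():
                entities.extend(shards[shard_id].fetch_entities(ids))
        
            # drop rows whose index entry is stale
            return [e for e in entities if e['user_id'] == user_id]
    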

As for disadvantages - well, it's denormalized, for starters. ("Normalization
is for sissies," says Cal Henderson.) If an indexed field changes, you need to
update it in both the index and the relevant entities. There are also a bunch
of little inefficiencies, places where they traded performance for
scalability. Imagine if you naively plugged this engine into an app with only
10 records: instead of a simple index search, it'd have to go to the index
table, fetch the relevant entity_ids, go to the entities table, fetch the
entities, filter on the indexed value, and then return them all. But then, if
your database fits on one machine, you don't have the same sort of engineering
challenges FriendFeed does.

~~~
sam_in_nyc
When they say it's sharded on "user_id", what they mean is that user_id is the
field that decides which database the record is stored in. It might go
something like: if the user_id starts with 0-8, store in DB1; otherwise, store
in DB2. It's up to their datastore controller to decide how to hash based on
the user_id and the number of databases.

~~~
nostrademons
Yeah. They actually explained that in the article too - the shard number =
user_id % num_of_shards. So user 1 is on DB1, user 2 is on DB2, etc. If they
have 10 shards, user 11 wraps back around to DB1.

~~~
jbert
I don't understand how that scheme can work, since changing the number of
shards changes the location of most users.

e.g. we have 4 shards, so user 5 is on shard 1. If we go to 6 shards, user 5
is now on shard 5.

I guess it works with downtime to move the users, or another layer of
indirection, where the newly created shards can "point back" to existing
shards, but otherwise I don't see it.

My understanding of sharding is that it's best to have a global lookup table
user->shard. That's not a huge amount of data, even for millions of users.

Anyone care to educate me as to how they get user%num_shards working in
practice?

~~~
apgwoz
I don't see how this scheme can scale, simply because there's no built-in
balancer. What's to stop shardN from becoming overwhelmed when all the power
users end up there, while shardN-1 has no activity?

~~~
aaronblohowiak
Edge-Case YAGNI. Have _lots_ of shards, put multiple shards on each machine,
shuffle the shards around as necessary. Rinse. Repeat.
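
A sketch of that indirection in Python (hypothetical numbers):

    
    
        NUM_SHARDS = 1024  # far more logical shards than machines, up front
        
        # logical shard -> physical machine; rebalance by editing this map
        # and moving only the shards you reassign
        shard_to_machine = dict((s, s % 4) for s in range(NUM_SHARDS))
        
        def machine_for(user_id):
            return shard_to_machine[user_id % NUM_SHARDS]
    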

------
derwiki
I like the trick for getting around index creation and deletion; I wasn't
aware that MySQL required a full table lock. I checked the docs for DB2, and
it allows reads/writes during index creation -- is this a MySQL-only
limitation, or do the other major databases impose the same restriction?

~~~
gaius
Pretty much it's MySQL-only.

Most of the MySQL posts here are actually about clever ways to do things most
DBAs and developers have taken for granted for years. Do you wonder why no-one
seems to be writing blog posts about "sharding" Sybase or DB2 or Oracle or
Postgres...?

~~~
paul
Actually, they do, e.g.
<http://highscalability.com/skype-plans-postgresql-scale-1-billion-users>

MySQL is popular with startups though, so it's not surprising that there's a
lot written about it.

It's true that MySQL has some lame limitations, but I don't believe that there
are any silver bullets out there. Google tried to switch their ads system from
MySQL to a "real" database once, and it was basically a disaster and had to be
abandoned in favor of MySQL (I wasn't working on it, so I can't really give
all the details).

Another problem we had with MySQL that Bret didn't mention was that it would
try to be "smart", and sometimes it would "randomly" (from our perspective)
choose a very inefficient strategy, and we would have to waste a lot of time
figuring out what it was doing and how to force it to do the right thing. The
approach Bret describes basically avoids any MySQL "smarts" and treats it as a
dumb but fast and well-tested B-tree. This gives us fairly reliable and
predictable performance characteristics, because we know exactly what it's
doing (mostly).

~~~
davidw
No, there are no silver bullets, but it seems to me that people reach for
MySQL a bit too quickly, without considering the pros and cons. And while it's
improving, MySQL has had many frustrating things in the past... to me it's
always seemed like a "worse is better" kind of thing. Sure, it's "fast", but
at what cost? Once you go to InnoDB, you lose that speed advantage.

One thing that's not a tech tradeoff, and is generally a Postgres win, is the
BSD-style licensing. You can take Postgres and do whatever you want with it
with no worries.

> Google tried to switch their ads system from MySQL to a "real" database
> once,

I hope the part that actually handles money has been fiddled with by Google to
be robust.

~~~
nl
Actually, Postgres doesn't have any great out-of-the-box solution for
partitioning the database across machines. The usual suggestion is Slony, but
that is nowhere near as robust and widely deployed as MySQL replication. The
GPL licence for MySQL isn't really a problem for webapps anyway.

OTOH, Postgres does somewhat better than MySQL on a single box with multiple
cores (it's fairly linear up to 8 CPUs, which is much better than MySQL) -
mainly because of the work Sun put into scaling it before they bought MySQL.
(At least, that's according to some people from Sun who do a lot of
performance work with both databases.)

~~~
davidw
It's not as robust? Do you have a source for that? You're doubtless right
about it not being as widely used.

I've always preferred the fact that Postgres tries to do things correctly.
Most recently, I bumped into this with MySQL, and it reminded me why I get
irritated when I use it:

<http://journal.dedasys.com/2008/11/11/another-mysql-doesnt-do-that-moment>

I've often bumped into things like that that just irk me.

------
yesimahuman
Reading this makes me realize how naive my DB setup is for my startup.

~~~
easp
Did you read the part where they said this was the third or fourth iteration
of their DB backend? Or maybe I read that somewhere else.

Point being, experience will come.

------
dmaclay
This looks better suited to the B-tree version of Berkeley DB than MySQL.

------
rads
I love this key-value store trend. Looking at CouchDB in particular has made
databases fun again.

------
zafarali
cool. very interesting.

