
Facebook's Top Open Data Problems - huangwei_chang
https://research.facebook.com/blog/1522692927972019/facebook-s-top-open-data-problems/
======
EricBurnett
I strongly dislike Facebook the product, and to a lesser extent Facebook the
company, but I'm continually impressed with Facebook's approach to engineering
in the open. I find this an interesting dichotomy. Would I want to work there?
I still don't think so, but my opinion on that front is getting less firm
over time.

~~~
mFixman
Former Facebook intern here. Facebook the company is a lot more 'hacky', in
the right sense of the word, than it looks from the outside. The company
and its products are extremely open, and the projects you do working there
involve very little management and corporate BS.

My experience was really similar to my Google internship, and probably even
closer to a "cool startup". I know several people who worked at Google,
Facebook, and X (with X being another major Silicon Valley company) who say
that the first two were a lot closer to each other than to X.

~~~
Kiro
Where do you work now?

~~~
mFixman
I'm currently finishing my degree while working at a smallish start-up in
Buenos Aires.

------
ransom1538
I had a really great time talking to the Facebook engineers during my
interviews there. The main pattern I noticed was Harvard (I was applying in
management). Beyond that, the people interviewing me were extremely talented
and smart. What always weirded me out was... the problems they work on are not
that difficult. Once you grasp sharding and operations, you are pretty much
set. These guys are _not_ the Manhattan Project. The truly hard problems in
their space (developing their own mobile hardware, keeping teens engaged,
pushing the boundaries of design, losing tracking systems on mobile, etc.)
they don't face head-on. Moving petabytes around or caching lots of things in
memcache is something my roommate and I could do with an AWS account and a few
beers. Memcached, for god's sake, is what, 300 lines of C?

~~~
e12e
No idea how much you're trolling, but the single file:

[https://github.com/memcached/memcached/blob/master/memcached...](https://github.com/memcached/memcached/blob/master/memcached.c)

is more like 4000 lines of C (guesstimating the number of comments, etc.).

~~~
mqsiuser
I liked his comment (don't shoot me for that); it made me smile.

4000 isn't that much either, and what's memcached? A hashmap?

He may mean 300 lines of _relevant_ code (+ ~3700 lines of sugar).
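For what it's worth, the "hashmap" core really is tiny. A toy sketch of a memcached-style LRU key-value store in Python, as an illustration of the idea only (the real memcached adds slab allocation, a network protocol, expiry, threading, and much more):

```python
from collections import OrderedDict

class ToyCache:
    """A toy LRU key-value cache: the 'hashmap core' of a memcached-like
    store, stripped of everything that makes the real thing 4000+ lines."""

    def __init__(self, max_items=1024):
        self.max_items = max_items
        self.store = OrderedDict()  # insertion order doubles as LRU order

    def set(self, key, value):
        if key in self.store:
            del self.store[key]
        elif len(self.store) >= self.max_items:
            self.store.popitem(last=False)  # evict least-recently-used
        self.store[key] = value

    def get(self, key):
        if key not in self.store:
            return None
        self.store.move_to_end(key)  # mark as recently used
        return self.store[key]

cache = ToyCache(max_items=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # touch "a", so "b" becomes the LRU entry
cache.set("c", 3)  # evicts "b"
```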

~~~
e12e
Hehe. Well, this is just _one_ file. I might agree that memcached is 300 lines
of relevant pseudocode -- but then again, it's implemented in C, not
pseudocode...

I agree that 4k lines isn't that much, but it's an order of magnitude off from
300. And again, that's just for that one single file.

------
mandeepj
Can anybody shed some light on how Facebook's database is designed? I am
sure it would be an interesting read.

I read somewhere a while back that each user at FB has their own
database. I don't think that is possible.

edit: I am googling this topic again now. The first link I found is
[http://www.quora.com/What-is-Facebooks-database-schema](http://www.quora.com/What-is-Facebooks-database-schema)

~~~
nbm
There isn't one database, although there are a few major types.

The majority of core information (attributes of people and places and pages
and so forth, as well as posts and comments) is stored in MySQL and queried
through TAO.

Some data, such as messages, is primarily stored in systems like HBase.

Non-primary-storage data (indexes and so forth) exists in various forms
optimised for different workloads - so data in either MySQL or HBase might
also exist in Hive for data-warehouse queries, or in Unicorn for really fast
search-style queries.

Other data (such as logs) might reside in one or more of the various data
stores, such as Scuba, Hive, HBase, and accessible via Presto, for example.

TAO:
[https://www.facebook.com/publications/507347362668177/](https://www.facebook.com/publications/507347362668177/)

Unicorn:
[https://www.facebook.com/publications/219621248185635](https://www.facebook.com/publications/219621248185635)

Hive:
[https://www.facebook.com/publications/374595109278618/](https://www.facebook.com/publications/374595109278618/)

Scuba:
[https://www.facebook.com/publications/148418812023978/](https://www.facebook.com/publications/148418812023978/)

Presto: [http://facebook.github.io/presto/](http://facebook.github.io/presto/)

------
crazypyro
This is slightly off topic, but has anyone experienced an increase in "fake"
toasts from Facebook mobile? It seems that if I haven't used Facebook mobile
in a few days, or I don't respond to their toasts about very minor people in
my life uploading a photo, I tend to start getting toasts that say "You have 5
notifications, 3 pokes and 2 messages." Then I open the app and it takes me
to an unknown-error page.

Am I being too cynical in thinking that Facebook is intentionally misleading
its users in an attempt to bump up their metrics? It interests me that they
are seeing jumps in their mobile users (and, consequently, ad sales) at the
same time that I have been receiving more notifications than ever.
Interestingly, the slowdown in fake toast notifications coincided with their
quarterly earnings report, which showed mobile ads accounting for an
increasingly large portion of revenue and also mentioned an increase in mobile
usage.

Comparing Q1, Q2, and Q3, the Q2-to-Q3 jump showed double the increase in the
percentage of ad revenue from mobile (59% to 62% to 66%). Maybe this is all
anecdotal evidence, but it seems like these sorts of fake notifications either
should not be sent out (a failure of the system that tracks which user
receives which toasts) or were a conscious effort....

~~~
coolsunglasses
It's been doing this for me via email lately and it's really annoying.

~~~
srcmap
I use Gmail filters to dump them, and 99% of my emails, into various labels.
Once every 2-4 weeks, when I am in the mood, I check them out and then delete
them all. :-)

------
beagle3
Something does not add up about Hive: they say it holds 300 PB and generates
4 PB per day - which means, at this rate, all the data was generated within
the last 75 days.

~~~
boomzilla
Most likely the 300 PB is distilled/normalized/compacted data, whereas the
4 PB per day is raw logs.

~~~
bbillings
This is the correct answer. Source: I am a FB employee.

~~~
beagle3
Thanks.

So how much does the 4 PB of data shrink to when it goes into Hive? My guess
would be something like 200 TB.

Is it zipped (or lz4/lzo/zopfli/lzma, whatever)? Or is it just "distilled"?

~~~
valarauca1
99% of raw data is useless, just as a rule.

Most of it likely gets fed into a program that determines whether somebody
actually needs to do anything, or whether something is actually breaking.

For example, raw user-interaction data doesn't really grow. While the event is
likely ~1 kB of raw data, at the end you're just incrementing a 64-bit
counter.

This is a baseless, exaggerated post, but it should shed some light.
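The counter point can be made concrete. A sketch with a hypothetical event shape (not Facebook's actual log format), showing how a verbose ~1 kB raw event collapses into a single increment:

```python
from collections import Counter

def aggregate_events(raw_events):
    """Collapse verbose raw interaction events into per-type counters.

    Each raw event may carry a kilobyte of detail (user, timestamp,
    user agent, referrer, ...), but for many reporting purposes all
    that survives is an incremented counter per event type."""
    counts = Counter()
    for event in raw_events:
        counts[event["type"]] += 1
    return counts

# Hypothetical raw events; the extra fields stand in for the ~1 kB of detail.
raw = [
    {"type": "like", "user": 1, "ts": 1700000000, "ua": "...", "referrer": "..."},
    {"type": "like", "user": 2, "ts": 1700000001, "ua": "...", "referrer": "..."},
    {"type": "comment", "user": 3, "ts": 1700000002, "ua": "...", "referrer": "..."},
]
aggregate_events(raw)  # Counter({'like': 2, 'comment': 1})
```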

~~~
beagle3
> 99% of raw data is useless. Just as a rule.

Well, as a non-facebook user, I think 99.9999% of facebook data is useless :)
But facebook is in the habit of tracking everyone's surfing habits across the
web (through "like" links), and I assume they do more than just "increment a
counter with it", even if they don't keep every single detail.

------
Cakez0r
I'm really curious how they handle paging if they're only using memcached.
E.g. if a photo node has 10,000 comment nodes (and thus 10,000 edges linking
the photo to the comments), chances are you only want to display the most
recent 50 comments. Are all 10,000 edges stored in memcached under one
key and then paged on the application servers? Are they stored in chunks under
multiple keys? How is cache consistency maintained if somebody makes a new
comment (maintaining the time ordering seems tricky and expensive)?

This is a problem I'm actively trying to solve for a project, so if somebody
knows the answer, please get in touch!

~~~
alexgartrell
That's what TAO (mentioned in the article) is for

~~~
Cakez0r
Yes, I've read the TAO paper and referenced it heavily for my project. I'm
curious about the implementation specifics of the caching layer.

EDIT: More specifically, how do you efficiently cache your data such that an
assoc_range query can be answered from cache without O(n) operations on your
application servers? Memcached can't answer a query like "give me 50 items
starting from position 0 in the list" as far as I'm aware, so you'd need to
pull the whole list to the application server and slice it up there. When you
consider that you want the 50 most recent items too, maintaining sorted lists
adds extra complexity.

~~~
yuliyp
TAO is used in place of memcached. You don't ask TAO for a whole list and
place it into memcached. You ask TAO for "give me 50 items starting from
position 0 in the list", and the TAO cache keeps this list in sorted order.

~~~
Cakez0r
Yes, that's the mechanism I'm wondering about. Let's say I have:

    
    
        Application -> TAO -> Cache -> Database
    

I have a photo node (P1) and 125 comment nodes [C1, C2, ..., Cn] attached to
P1 by the edges [(P1,C1), (P1,C2), ..., (P1,Cn)]. I'll ignore the fact that
there can be different edge types, for simplicity.

Let's say my page size is 50 and I want to view 3 pages of comments for the
photo from my application. My application makes the following TAO queries:

    
    
        assoc_range(P1, 0, 50)
        assoc_range(P1, 50, 50)
        assoc_range(P1, 100, 50)
    

My question is, assuming all the necessary data cached such that all of those
queries will be a cache hit, how are those edges stored and retrieved from
memcached? How are the keys named in memcached?

A naive implementation might be to store the list of all edges for P1 under a
key of "P1". To answer the above 3 queries, TAO then needs to pull "P1" (all
125 edges) from memcached 3 times, once per query, and slice the edge list up
on the TAO application server... Not great, but probably an improvement over
hitting the DB for it (up to a certain list length, at least).

A less naive implementation might be to store the edges in buckets of 50, such
that the 125 edges are stored under keys "P1_0_50", "P1_51_100", and
"P1_101_150", but then time ordering comes into play...

If my application now wants the 50 most recent items, we could store the edge
lists in descending order of creation date, and I could retrieve "P1_0_50"
from the cache and guarantee I have the 50 most recent items. However, let's
say 10 new comments are posted... Now I need to update all my cache pages to
ensure the ordering is correct, which is horrendously inefficient!

To fix this issue, edge lists could be stored in ascending order of creation
date instead, but then how do I know which cache page to fetch to retrieve the
50 most recent comments (seeing as "P1_0_50" now holds the oldest 50 items)?

I hope that makes sense!

~~~
nbm
There is no memcache - there is just TAO, and it talks to the database. TAO is
a read/write-through cache, so the only way the data changes in the database
is through TAO. TAO contains the indices necessary to answer queries like that
efficiently (although there may be other systems for doing similar but more
specialized or slightly different queries), as well as a cache of the data.

So, when you add a comment, TAO updates its internal structures with the new
comment in the right place (after the DB is updated), and there are no "keys"
that need to be updated beyond that.
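A minimal sketch of that idea in Python: a write-through cache that keeps each node's edge list sorted by recency, so an assoc_range call is just a slice and a new comment is a single sorted insert. This is an illustration under assumptions, not TAO's actual implementation:

```python
import bisect

class EdgeCache:
    """Write-through association cache in the spirit of TAO (a sketch).

    Per-node edge lists are kept sorted newest-first, so a range query
    is a slice -- no per-query re-sorting and no memcached-style bucket
    keys to invalidate when a new edge arrives."""

    def __init__(self, db):
        self.db = db      # stand-in for the backing database
        self.edges = {}   # node_id -> sorted list of (-timestamp, edge_id)

    def assoc_add(self, node_id, edge_id, ts):
        self.db.write(node_id, edge_id, ts)       # write the DB first...
        lst = self.edges.setdefault(node_id, [])
        bisect.insort(lst, (-ts, edge_id))        # ...then the sorted index

    def assoc_range(self, node_id, offset, limit):
        lst = self.edges.get(node_id, [])
        return [edge_id for _, edge_id in lst[offset:offset + limit]]

class FakeDB:
    def __init__(self):
        self.rows = []
    def write(self, node_id, edge_id, ts):
        self.rows.append((node_id, edge_id, ts))

cache = EdgeCache(FakeDB())
for i in range(125):
    cache.assoc_add("P1", f"C{i}", ts=i)  # C124 is the newest comment
cache.assoc_range("P1", 0, 3)   # newest three: ['C124', 'C123', 'C122']
```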

~~~
tkinom
Are the TAO APIs asynchronous? Is there any documentation on the parameters of
those APIs? :-)

~~~
Cakez0r
There is a whitepaper about the details of TAO at
[https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers...](https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/11730-atc13-bronson.pdf)

------
swah
I'd like to use this opportunity to ask: is it a technical limitation that
users still can't search their timeline?

------
mmmooo
So, ~650M daily active users... 4 PB of data warehouse created each day; that
means ~7 MB of new data for each active user per day. Given that it's a data
warehouse, I'm going to guess it's not images, so that seems like a lot to me.
I guess it shouldn't surprise anyone that every interaction on and off the
site is heavily tracked.

~~~
nbm
A lot of that data is duplicated to allow for efficient querying or
transformation. It often is too slow to process the data as it comes in, so an
initial process will write the data in a raw form, and some other process
might select a subset of the data to process, and then submit it in an
"annotated" form (filling in, say, the AS number of the client IP). Another
process will run later in a batched fashion and perhaps annotate the full set
of information and summarize it into a bunch of easily-queried tables.
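That staged flow can be sketched roughly like this (the record shapes and the AS-number lookup are hypothetical stand-ins, not Facebook's actual pipeline):

```python
from collections import defaultdict

# Stage 1: land raw records as-is -- fast, no processing at ingest time.
def ingest(raw_lines):
    return [line.strip() for line in raw_lines]

# Stage 2: a later process annotates records with derived fields
# (here, a fake IP-to-AS-number lookup stands in for a real one).
def annotate(records, asn_lookup):
    out = []
    for rec in records:
        ip, path = rec.split(" ")
        out.append({"ip": ip, "path": path, "asn": asn_lookup.get(ip, 0)})
    return out

# Stage 3: a batch job summarizes into an easily queried table.
def summarize(annotated):
    hits = defaultdict(int)
    for rec in annotated:
        hits[rec["asn"]] += 1
    return dict(hits)

raw = ["1.2.3.4 /home", "1.2.3.4 /photos", "5.6.7.8 /home"]
annotated = annotate(ingest(raw), {"1.2.3.4": 32934, "5.6.7.8": 15169})
summarize(annotated)  # {32934: 2, 15169: 1}
```

Note that the same underlying data now exists three times (raw, annotated, summarized), which is part of why warehouse volume outpaces genuinely "new" information.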

A lot of that data is also not tied to individuals - for example, the
access logs for the CDN (which, being on a different domain by design, does
not share cookies and so is not attached to an account), even reasonably
heavily sampled, probably come to tens of gigabytes a day, and are rolled up
into efficient forms for queries in various ways. A lot of it isn't even about
requests coming through the web site/API - it may just be internal
inter-service request information, or inter-datacenter flow analysis, or
per-machine service metrics ("Oh, look, process A on machines B through E went
from 2GB resident to 24GB in 30 seconds a few seconds before the problem
manifested").

(Not that it makes too much of a difference at this scale, but it is closer to
860M daily actives.)

------
doque
_3\. Hive is Facebook's data warehouse, with 300 petabytes of data in 800,000
tables. Facebook generates 4 new petabytes of data and runs 600,000 queries
and 1 million map-reduce jobs per day._

So 4 PB per day, but only 300 PB total?

~~~
ddoolin
Was wondering the same thing. My guess is that some also gets removed each
day, but it seems unlikely.

~~~
evgen
Think of it like monitoring data. You may collect one-second data on 500
counters per system over 1,000 systems, but then you do a weekly or
monthly roll-up where you drop some of the granularity to save space, and
after a year you have aggregates that are basically daily trend lines. The
more you collect, the smaller the percentage you actually keep.
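A toy version of that roll-up, assuming simple (timestamp, value) counter samples:

```python
def roll_up(samples, bucket_seconds):
    """Downsample (timestamp, value) counter samples into coarser buckets.

    Keeps sum, min, and max per bucket; the per-second granularity is
    discarded, which is where the space savings come from."""
    buckets = {}
    for ts, value in samples:
        key = ts - (ts % bucket_seconds)  # start of this bucket's window
        if key not in buckets:
            buckets[key] = {"sum": 0, "min": value, "max": value}
        b = buckets[key]
        b["sum"] += value
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
    return buckets

# 120 one-second samples rolled into two 60-second buckets:
samples = [(t, 1) for t in range(120)]
roll_up(samples, 60)  # {0: {...sum: 60...}, 60: {...sum: 60...}}
```

A second pass with a larger `bucket_seconds` turns these into daily trend lines, exactly the repeated-compaction pattern described above.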

------
Thaxll
Still using memcache, wow.

~~~
toomuchtodo
DevOps here, using their mcrouter tool [1] in production. It's a phenomenal
Swiss Army knife for using memcached.

[1]
[https://github.com/facebook/mcrouter](https://github.com/facebook/mcrouter)

~~~
alexgartrell
Aw shucks.

Mcrouter was the first piece of software I worked on at Facebook. It's nice to
see that you like it! (Though the only thing of mine that lives on is the
umbrella protocol.)

