Facebook's Top Open Data Problems (facebook.com)
180 points by huangwei_chang on Nov 4, 2014 | 61 comments

I strongly dislike Facebook the product, and to a lesser extent Facebook the company, but I'm continually impressed with Facebook's approach to engineering in the open. I find this an interesting dichotomy. Would I want to work there? I still don't think so, but my opinion on that front is getting less strong over time.

Former Facebook intern here. Facebook the company is a lot more 'hacky', in the good sense of the word, than it looks from the outside. The company and its products are extremely open, and the projects you work on there have very little management and corporate BS.

My experience was really similar to my Google internship, and probably even closer to a "cool startup". I know several people who worked at Google, Facebook, and X (with X being another major Silicon Valley company), and they say that the first two were a lot closer to each other than to X.

Where do you work now?

I'm currently finishing my degree while working in a smallish start-up in Buenos Aires.

Well, Facebook the company can feel rather juvenile to deal with. Some of their decisions in dealing with other businesses sometimes feel like they were made by fifteen-year-old kids with no life or business experience.

In terms of the Facebook product, one cannot deny the obvious: they touched THE nerve on the internet. Worldwide. Across languages and cultures.

I don't like the embodiment of the product at all. At the very least it has usability and privacy problems. Yet more people use it successfully than any other web app in the world. So, what do I know? What do experts know?

A similar thing could be said about CraigsList. It's 2014. Every time I use the site I cannot believe what I am looking at. Yet I and lots of other people keep using it. It works.

Facebook has a general lack of elegance (whatever that means). The kind that happens when a product is thrown together and evolved over time. Evolving anything over time means the output "naturally selects" (subverting the theory here) to fit the environment created by its users.

They survive because they optimized for what is important to their users. Grandma couldn't care less about UI issues or searchability. She wants to see her grandchildren's pictures and videos. And for that it works very well for a huge percentage of the planet.

At some point it becomes almost impossible to break the mold and clean-up what might be less than ideal. Why would you? It works. Another "Innovator's Dilemma" [1] situation to a large extent.

[1] http://www.amazon.com/The-Innovators-Dilemma-Revolutionary-B...

Glad somebody said that. Facebook is clumsy and weird. It looks like a legacy system, which I suppose it is. I don't even understand the basic premise. You have a 'wall'? Is that a thing anymore? Maybe it's a feed now, or a timeline, or something. Whatever; it's not obvious from the front page. Not obvious what to do, or where to go, or how to friend people, or what will happen if you do that.

I was interviewing with them some time ago and just gave up half way. Those guys are absolute assholes and are hugely arrogant. Not a place I would want to work.

Assuming you are referring to the hardware group and it was after 2010 (when the current team formed), then I have no idea why you would think that. I interviewed there back in 2013, and was quite impressed by how open and nice they were. While not working at Facebook, I have continued to work within the Open Compute Project, and the FB engineers have been great in my experience.

That, I am sure, is also what they said about the Manhattan Project. Anyway, the train has left the station.

The Manhattan Project was engineering in the open?

EricBurnett was talking about appreciating (some of) FB's values, not relishing the scale of their challenges (another reason people join FB). It's the latter reason that I imagine engineers joined The Manhattan Project (other than those who viewed it as a way to protect against nefarious forces in the world, valid or not).

I had a really great time talking to the Facebook engineers during my interviews there. The main pattern I noticed was Harvard (I was applying in management). Even more so, the guys interviewing me were extremely talented and smart. What always weirded me out was... the problems they work on are not that difficult. Once you grasp sharding and operations, you are pretty much set. These guys are not the Manhattan Project. The truly hard problems in their space (developing their own mobile hardware, keeping teens engaged, pushing the boundaries of design, losing tracking systems in mobile, etc.) they don't face head on. Moving petabytes around or caching lots of things in memcache is something my roommate and I could do with an AWS account and a few beers. Memcached, for God's sake, is what, 300 lines of C?

No idea how much you're trolling, but the single file is more like 4000 lines of C (guesstimating the amount of comments etc.).

I liked his comment (don't shoot me for that), made me smile.

4000 isn't that much either. And what's memcached? A hashmap?

He maybe means 300 lines of relevant code (+ ~3700 lines of sugar).

Hehe. Well, this is just one file. I might agree that memcached is 300 lines of relevant pseudocode -- but then again, it's implemented in C, not pseudocode...

I agree that 4k lines isn't that much, but it's an order of magnitude off from 300. And again, that's just for that one single file.

Nice guess. There are 4486 lines of C when the set of lines that start with {/* , * , */} + the set of empty lines are excluded.
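Purely for fun, the counting heuristic described above (exclude blank lines and lines beginning with the block-comment markers) can be sketched in a few lines of Python. This is an approximation, not a real C parser:

```python
# Rough sketch of the line-count heuristic described above: drop blank
# lines and lines that begin with block-comment markers, count the rest.
# (A real count would need a proper C tokenizer; this is approximate.)

def count_code_lines(source: str) -> int:
    """Count lines that are neither empty nor comment-only."""
    comment_prefixes = ("/*", "*", "*/")
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith(comment_prefixes):
            continue  # skip block-comment lines
        count += 1
    return count

sample = """\
/* memcached-style header comment
 * spanning several lines
 */
int conn_count = 0;

int main(void) { return 0; }
"""
print(count_code_lines(sample))  # -> 2
```

Note that this undercounts pointer-dereference lines like `*p = 0;`, which is part of why it is only a guesstimate.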

True, Facebook scales linearly. If you're interested in really hard problems in distributed systems and blockchains, let me know. Facebook should not own social data; the future will be end-to-end encrypted and decentralised.

Can anybody shed some light on how Facebook's database is designed? I am sure it would be an interesting read.

I read somewhere some time back that each user at FB has their own database. I think that is not possible.

edit: I am googling now again on this topic. First link found is http://www.quora.com/What-is-Facebooks-database-schema

There isn't one database, although there are a few major types.

The majority of core information (attributes of people and places and pages and so forth, as well as posts and comments) is stored in MySQL and queried through TAO.

Some data, such as messages, is primarily stored in systems like HBase.

Non-primary-storage data (indexes and so forth) exist in various forms optimised for different workloads - so data in either MySQL or HBase might also exist in Hive for data warehouse queries, or in Unicorn for really fast search-style queries.

Other data (such as logs) might reside in one or more of the various data stores, such as Scuba, Hive, HBase, and accessible via Presto, for example.

TAO: https://www.facebook.com/publications/507347362668177/

Unicorn: https://www.facebook.com/publications/219621248185635

Hive: https://www.facebook.com/publications/374595109278618/

Scuba: https://www.facebook.com/publications/148418812023978/

Presto: http://facebook.github.io/presto/

This is slightly off topic, but has anyone experienced an increase in "fake" toasts from Facebook mobile? It seems that if I haven't used Facebook mobile in a few days, or I don't respond to their toasts about very minor people in my life uploading a photo, I tend to start getting toasts that say "You have 5 notifications, 3 pokes and 2 messages." Then I open the app and it takes me to an unknown-error page.

Am I being too cynical in thinking that Facebook is intentionally misleading its users in an attempt to bump up their metrics? It interests me that they are seeing jumps in their mobile users (and consequently, ad sales) at the same time that I have been receiving more notifications than ever. Interestingly, the slowdown in fake toast notifications coincided with their quarterly earnings report that show mobile ads accounting for an increasingly large portion of revenue and also mentions an increase in mobile user usage.

Comparing Q1 with Q2 with Q3, Q2-Q3 showed double the increase in ad revenue percent from mobile (59% to 62% to 66%). Maybe this is just all anecdotal evidence, but it seems like these sort of fake notifications should either not be sent out (failure of the system that keeps track of what user receives what toasts) or there was a conscious effort to send these notifications....

If anyone cares, I went and looked at their metrics, and it seems that Q2 to Q3 was one of their biggest increases in mobile alone (albeit not by a whole lot) in quite a while. Yet if you look at the raw user metrics across all platforms, it was slower than almost every other quarter in terms of users gained. I'm not sure if this adds any credibility to my wild theory, but it does at least show there is something affecting the increase in mobile usage, although that could just be market factors.

Interestingly, Twitter's metrics don't appear to show any similar rise in the rate of adoption.

It's been doing this for me via email lately and it's really annoying.

I use Gmail filters to dump them (and 99% of my email) into various labels. Once every 2-4 weeks, when I am in the mood, I check them and then delete them all. :-)

Sorry if this is a silly question, but what are "toasts"?

Thank you! I never knew what those were called.

Something does not add up about Hive: they say it has 300 PB and generates 4 PB per day, which means, at this rate, all the data was generated within the last 75 days.

Most likely the 300 PB are distilled/normalized/compacted data, whereas the 4 PB per day are raw logs.

This is the correct answer. Source: I am a FB employee.


So, how much does the 4 PB of data shrink to when it goes into Hive? My guess would be something like 200 TB.

Is it zipped (or lz4/lzo/zopfli/lzma whatever)? Or is it just "distilled"?

99% of raw data is useless. Just as a rule.

Most of it likely gets tossed into a program to determine if somebody actually needs to do anything, or if something is actually breaking.

For example, raw user interaction data doesn't really grow. While the event is likely ~1 KB or so of raw data, at the end you're just incrementing a 64-bit counter.
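A toy sketch of that point (the event shape here is made up): a kilobyte of raw interaction data often collapses into a single counter bump, after which the raw bytes can be discarded:

```python
# Illustrative sketch (hypothetical event shape): a ~1 KB raw interaction
# event arrives, but the durable state it changes is just a counter.
from collections import defaultdict

counters = defaultdict(int)  # e.g. ("photo_view", photo_id) -> count

def ingest(event: dict) -> None:
    """Collapse a raw event into a counter increment; the raw bytes
    can then be discarded (or sampled into cold storage)."""
    key = (event["type"], event["object_id"])
    counters[key] += 1

raw_event = {
    "type": "photo_view",
    "object_id": 42,
    "user_agent": "Mozilla/5.0 ...",    # ~1 KB of context like this
    "referrer": "https://example.com",  # is dropped after aggregation
}
ingest(raw_event)
ingest(raw_event)
print(counters[("photo_view", 42)])  # -> 2
```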

This is a baseless, exaggerated post, but it should shed some light.

> 99% of raw data is useless. Just as a rule.

Well, as a non-facebook user, I think 99.9999% of facebook data is useless :) But facebook is in the habit of tracking everyone's surfing habits across the web (through "like" links), and I assume they do more than just "increment a counter with it", even if they don't keep every single detail.

Also, 800,000 tables? Surely not in the sense I'm used to, that is, normalized forms where a table corresponds roughly to some business object/noun? Does "table" mean something else, or are there 800k different types of data in there?

Yeah, that's hard to wrap my head around. We have 2k in a large app and that's daunting. Really curious how that came to be (though that's probably the least interesting thing in this article).

Data is not stored forever. If most data is only stored ~30 days, then the numbers make sense.

I'm really curious how they handle paging if they're only using memcached. E.g. if a photo node has 10,000 comment nodes (and thus 10,000 edges linking the photo to the comments), chances are you only want to display the most recent 50 comments. Are all 10,000 edges stored in memcached under one key and then paged on the application servers? Are they stored in chunks under multiple keys? How is cache consistency maintained if somebody makes a new comment (maintaining the time ordering seems tricky and expensive)?

This is a problem I'm actively trying to solve for a project, so if somebody knows the answer, please get in touch!

That's what TAO (mentioned in the article) is for.

Yes, I've read the TAO paper and referenced it heavily for my project. I'm curious about the implementation specifics of the caching layer.

EDIT: More specifically, how do you efficiently cache your data such that an assoc_range query can be answered from cache without O(n) operations on your application servers? Memcached can't answer a query like "give me 50 items starting from position 0 in the list", as far as I'm aware, so you'd need to pull the whole list to the application server and slice it up there. When you consider that you want the 50 most recent items too, maintaining sorted lists adds extra complexity.
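For illustration, the O(n) problem being described can be sketched like this, with a plain dict standing in for a memcached client (the key and function names are made up):

```python
# Sketch of the problem described above: a plain key/value cache
# (memcached-style) has no range queries, so answering "50 items from
# position 0" means pulling the whole edge list and slicing app-side --
# O(n) network transfer and work per page, however small the page.
cache = {}  # stand-in for a memcached client: get/set on opaque blobs

def cache_edge_list(photo_id: str, comment_ids: list) -> None:
    cache[photo_id] = comment_ids  # one key holds the entire list

def assoc_range_naive(photo_id: str, pos: int, limit: int) -> list:
    full_list = cache[photo_id]        # transfers ALL edges every time
    return full_list[pos:pos + limit]  # slicing happens on the app server

cache_edge_list("P1", [f"C{i}" for i in range(125)])
page = assoc_range_naive("P1", 0, 50)
print(len(page), page[0])  # -> 50 C0
```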

TAO is used in place of memcached. You don't ask TAO for a whole list and place it into memcached. You ask TAO for "give me 50 items starting from position 0 in the list", and the TAO cache keeps this list in sorted order.

Yes, that's the mechanism I'm wondering about. Let's say I have:

    Application -> TAO -> Cache -> Database
I have a photo node (P1) and 125 comment nodes [C1, C2, ..., Cn] attached to P1 by the edges [(P1,C1), (P1,C2), ..., (P1,Cn)]. I'll ignore the fact that there can be different edge types, for simplicity.

Let's say my page size is 50 and I want to view 3 pages of comments for the photo from my application. My application makes the following TAO queries:

    assoc_range(P1, 0, 50)
    assoc_range(P1, 50, 50)
    assoc_range(P1, 100, 50)
My question is, assuming all the necessary data is cached such that all of those queries will be cache hits, how are those edges stored and retrieved from memcached? How are the keys named in memcached?

A naive implementation might be to store the list of all edges for P1 with a key of "P1". To answer the above 3 queries, TAO then needs to pull "P1" (all 125 edges) from memcached 3 times, once per query, and slice the edge list up on the TAO application server... Not great, but probably an improvement over hitting the DB for it (up to a certain list length, at least).

A less naive implementation might be to store the edges in buckets of 50, such that the 125 edges are stored with keys of "P1_0_50", "P1_51_100", "P1_101_150", but then time ordering comes into play...

If my application now wants the 50 most recent items, we could store the edge lists by created date descending, and I can retrieve "P1_0_50" from the cache and guarantee I have the 50 most recent items. However, let's say 10 new comments are posted... Now I need to update all my cache pages to ensure the ordering is correct, which is horrendously inefficient!

To fix this issue, edge lists could be stored in created date ascending order instead, but then how do I know which cache page to fetch to retrieve the 50 most recent comments (seeing as "P1_0_50" is now the oldest 50 items)?

I hope that makes sense!

There is no memcache - there is just TAO, and it talks to the database. TAO is a read/write-through cache, so the only way the data changes in the database is through TAO. TAO contains the indices necessary to answer queries like that efficiently (although there may be other systems for doing similar but more specialized or slightly different queries), as well as a cache of the data.

So, when you add a comment, TAO updates its internal structures with the new comment in the right place (after the DB is updated), and there are no "keys" that need to be updated beyond that.
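A minimal sketch of that write-through idea (the names here are hypothetical, not TAO's actual API): the cache owns a sorted edge list per object, writes hit the database first, and the in-memory index is then updated in place, so there are no bucketed cache keys to invalidate:

```python
# Minimal write-through edge cache sketch; names are made up, not TAO's API.
import bisect

class EdgeCache:
    """Caches (timestamp, comment_id) edges per photo, sorted by time."""
    def __init__(self, db):
        self.db = db      # stand-in database: list of (photo_id, ts, cid)
        self.index = {}   # photo_id -> edges sorted ascending by timestamp

    def _edges(self, photo_id):
        if photo_id not in self.index:  # demand-fill from the database
            self.index[photo_id] = sorted(
                (ts, cid) for pid, ts, cid in self.db if pid == photo_id)
        return self.index[photo_id]

    def assoc_add(self, photo_id, ts, comment_id):
        edges = self._edges(photo_id)               # ensure cache is loaded
        self.db.append((photo_id, ts, comment_id))  # write database first
        bisect.insort(edges, (ts, comment_id))      # then the cache, in place

    def assoc_range(self, photo_id, pos, limit):
        """Return `limit` edges starting at `pos`, newest first."""
        return self._edges(photo_id)[::-1][pos:pos + limit]

db = []
tao = EdgeCache(db)
for ts in range(125):
    tao.assoc_add("P1", ts, f"C{ts}")
page = tao.assoc_range("P1", 0, 50)
print(page[0])  # -> (124, 'C124')
```

Because all writes flow through the cache, a new comment is a single sorted insert rather than a rewrite of every "page" key, which is the point being made above.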

Nice. BTW, TAO itself is not open source, right? :-)

I'd love to try building a Golang version of TAO as a good mental exercise.

Does anyone know if something similar already exists in Golang, or in any other open-source package?

Anyone else interested in such a thing?

Are the TAO APIs asynchronous? Is there any documentation on the parameters of those APIs? :-)

There is a whitepaper about the details of TAO at https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers...

Thank you!

I'd like to use this opportunity to ask: is it a technical limitation that users still can't search their timeline?

So ~650M daily active users... 4 PB of data warehouse created each day; that means ~7 MB of new data per active user per day. Given that it's a data warehouse, I'm going to guess it's not images, which seems like a lot to me. I guess it shouldn't surprise anyone that every interaction, on and off the site, is heavily tracked.

A lot of that data is duplicated to allow for efficient querying or transformation. It often is too slow to process the data as it comes in, so an initial process will write the data in a raw form, and some other process might select a subset of the data to process, and then submit it in an "annotated" form (filling in, say, the AS number of the client IP). Another process will run later in a batched fashion and perhaps annotate the full set of information and summarize it into a bunch of easily-queried tables.

A lot of that data is also not tied to individuals either - for example the access logs for the CDN (which, being on a different domain by design, does not share cookies so is not attached to an account) even reasonably heavily sampled is probably tens of gigabytes a day, and is rolled up into efficient forms for queries in various ways. A lot of it isn't even about requests coming through the web site/API - it may just be internal inter-service request information, or inter-datacenter flow analysis, or per-machine service metrics ("Oh, look, process A on machines B through E went from 2GB resident to 24GB in 30 seconds a few seconds before the problem manifested").
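A toy sketch of the staged duplication described above (the lookup table and field names are invented): the same event can exist in raw, annotated, and summarized form, which is one way a day's logs multiply in the warehouse:

```python
# Sketch of a staged log pipeline: raw -> annotated -> summarized.
# The AS-number lookup table and event fields below are toy data.
from collections import Counter

AS_LOOKUP = {"203.0.113.": "AS64500", "198.51.100.": "AS64501"}

def annotate(event: dict) -> dict:
    """Stage 2: enrich the raw event, e.g. fill in the client's AS number."""
    prefix = event["client_ip"].rsplit(".", 1)[0] + "."
    return {**event, "asn": AS_LOOKUP.get(prefix, "unknown")}

def summarize(events: list) -> Counter:
    """Stage 3: batch roll-up into an easily queried summary table."""
    return Counter(e["asn"] for e in events)

raw = [{"client_ip": "203.0.113.7", "path": "/photo/1"},
       {"client_ip": "203.0.113.9", "path": "/photo/2"},
       {"client_ip": "198.51.100.3", "path": "/photo/1"}]
annotated = [annotate(e) for e in raw]  # second copy of the data
summary = summarize(annotated)          # third, much smaller copy
print(summary["AS64500"])  # -> 2
```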

(Not that it makes too much of a difference at this scale, but it is closer to 860M daily actives.)

FB and Google can clone what you're thinking, maybe predict what you will be thinking? :-)

I wonder if they can predict, with some percentage accuracy, what any particular active US user might vote for today, based on the user's graph data?

3. Hive is Facebook's data warehouse, with 300 petabytes of data in 800,000 tables. Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day.

So 4 PB per day, but only 300 PB total?

Was wondering the same thing. My guess is that some also gets removed each day as well, but it seems unlikely.

Think of it like monitoring data. You may collect one-second data on 500 counters per system over 1000 systems, but then you do a weekly or monthly roll-up where you dump some of the granularity to save space, and after a year you have aggregates that are basically daily trend lines. The more you collect, the smaller the percentage you actually keep.
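As a toy illustration of that roll-up (the counter names and aggregate shape are invented), a day of one-second samples collapses into a single aggregate row per counter:

```python
# Sketch of a daily roll-up: per-second samples collapse into one
# (min, max, sum, count) aggregate per counter per day, so retained
# bytes shrink dramatically as data ages.
from collections import defaultdict

def daily_rollup(samples):
    """samples: iterable of (counter_name, unix_ts, value) at 1s resolution.
    Returns {(counter_name, day): (min, max, sum, count)}."""
    agg = defaultdict(lambda: [float("inf"), float("-inf"), 0, 0])
    for name, ts, value in samples:
        day = ts // 86400
        a = agg[(name, day)]
        a[0] = min(a[0], value)
        a[1] = max(a[1], value)
        a[2] += value
        a[3] += 1
    return {k: tuple(v) for k, v in agg.items()}

# 86,400 one-second samples become a single row for that counter and day
samples = [("cpu_user", ts, 1) for ts in range(86400)]
rolled = daily_rollup(samples)
print(rolled[("cpu_user", 0)])  # -> (1, 1, 86400, 86400)
```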

Still using memcache, wow.

DevOps engineer here, using their mcrouter tool [1] in production. It's a phenomenal Swiss Army knife for using memcached.

[1] https://github.com/facebook/mcrouter

Aw shucks.

Mcrouter was the first piece of software I worked on at Facebook. It's nice to see that you like it! (Though the only thing of mine that lives on is the umbrella protocol.)

What's wrong with memcached? Why are you so surprised?

I would have thought that it's too simple for their workload/needs; wouldn't something like Redis have been better?

Simple scales better than complex. What they need is key->value caching and for that memcached is a perfect match. I'm not saying that Redis is bad, but when all you need is key-value memory caching, redis isn't needed.

> when all you need is key-value memory caching, redis isn't needed.

Balderdash. memcached is actively hostile to modern infrastructure and it's not being actively developed except for routine maintenance.

The first time someone stores an entire data structure in a memcached key is the moment you've lost. Playing "read blob, deserialize, update, serialize, write blob" is just dumb when you can avoid it.

Other moments of great failure with memcached: needing to implement client-side or proxy-based replication, having the memcached process refuse to cache more things because slabs become broken, the first time your developers don't understand that memcached's LRU is per-size-class and not global, ...

Just Say No to Memcached.

You shouldn't be doing read-modify-write cycles in memcached. Its use is as a demand-filled, look-aside cache. Modern open-source memcached has slab reallocation that works quite well. It certainly beats a malloc() heap, which will become highly fragmented and inefficient.
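For anyone unfamiliar, the demand-filled, look-aside pattern mentioned here can be sketched as follows (a dict stands in for the memcached client; names are illustrative). Reads go through the cache and fill it on miss; writes go to the database and invalidate, rather than mutate, the cached blob:

```python
# Minimal look-aside (cache-aside) cache sketch: demand-fill on read miss,
# invalidate on write. A dict stands in for a memcached client here.
cache = {}
database = {"user:1": {"name": "Ada"}}

def get(key):
    if key in cache:              # cache hit
        return cache[key]
    value = database.get(key)     # miss: read the source of truth
    if value is not None:
        cache[key] = value        # demand-fill for subsequent reads
    return value

def put(key, value):
    database[key] = value         # write goes to the database...
    cache.pop(key, None)          # ...and the stale cache entry is dropped

assert get("user:1")["name"] == "Ada"  # first read fills the cache
put("user:1", {"name": "Grace"})       # invalidate; no read-modify-write
print(get("user:1")["name"])  # -> Grace
```

The contrast with the grandparent's complaint is that nothing ever deserializes a cached blob, updates it, and writes it back; the database stays the only writer of record.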

> it's not being actively developed except for routine maintenance

Why gild the lily?

> needing to implement client-side or proxy-based replication

It's best practice to replicate the persistent store for availability, not the cache.

Almost all your other concerns appear to arise out of fundamental misuse. It's a cache, not a panacea to cover up basic design flaws in a primary data store.
