Tao: Facebook’s distributed data store for the social graph

sizediterable · on Oct 30, 2021

I'm surprised that they published the more recent paper discussed in part 2 [0] since, IIRC, TAO has supported atomic transactions since at least when I was there a few years ago, but better late than never I suppose. I wonder if that means there's still hope that someone at the company will write a blog post about frameworks like:

  - EntSchema/EntPattern
  - EntQL
  - EntAction
  - PrivacyPolicy
  - Ent Deletion framework
  - Ent consistency checkers/fixers (I forget the name of the framework/tool)

It's sad to think that people's only glimpse into these tools is through https://entgo.io which is a really crude approximation of the real thing due to Go's lack of generics and expressiveness. Imagine if such amazing tools could be used at actually ethical companies.

[0] https://www.micahlerner.com/2021/10/23/ramp-tao-layering-ato...

mhlakhani · on Oct 30, 2021

There’s now a paper on the deletion framework at least. [1] is a high level summary that links to the paper.

[1] https://engineering.fb.com/2020/08/12/security/delf/

sizediterable · on Oct 30, 2021

Oh cool, I didn't know about this. I guess they tend to publish more infrastructure-oriented papers, which is understandable, but I wish there was more in general about the developer APIs that they enable and how they benefit productivity from a product engineer's point of view.

katrielalex · on Oct 30, 2021

Lead author on the deletion paper here. Thanks for the feedback! I agree it would be cool to talk about the Ent APIs some more; they have a lot of obvious and less-obvious benefits. (For example, we are only able to add deletion constraints to Ents as described in the paper because the Ent specification language is expressive enough to do it.)

cletus · on Oct 30, 2021

Technically those are unrelated to TAO. Those are all part of layers that sit on top of TAO. The TAO part is optional too. For example, you can put the Ent data access layer on top of MySQL directly.

sizediterable · on Oct 30, 2021

Right, I wasn't trying to imply a relationship. I only meant to say that if they're willing to write about old news for one piece of tech, then maybe they'd do it for another.

tehlike · on Oct 30, 2021

Isn't Ent pretty much an ORM like (N)Hibernate?

sizediterable · on Oct 30, 2021

It is an ORM, but I wouldn't call it just an ORM. I don't have experience with NHibernate, but most ORMs I have experience with feel like just a way to programmatically build queries. I found that EntMutator provides nice abstractions for coupling data changes with business logic, such as mutation triggers and observers. Again, all the additional tooling built around the Ent framework adds to the value.

tehlike · on Nov 6, 2021

Things like observers are fairly available in the orm land too. Same with mutators and unit of work and so on.

ahupp · on Oct 30, 2021

> IIRC, TAO has supported atomic transactions since at least when I was there a few years ago

TAO has supported single-shard atomic updates forever, but this work is about cross-shard transactions. I left a few years ago as well and this didn't exist at the time.

Yeah I think a lot of people would enjoy writeups of those frameworks.

There's also this very nice Ent framework, written by ex-FB engineer: https://ent.dev/

sizediterable · on Oct 30, 2021

Ah, that clears things up. That sounds huge!

> There's also this very nice Ent framework, written by ex-FB engineer: https://ent.dev/

Awesome, hope it succeeds!

numair · on Oct 30, 2021

> I was there a few years ago

> Imagine if such amazing tools could be used at actually ethical companies.

Congratulations. How long did it take you to realize things are, in fact, as bad as the "haters" (to use a favorite term of the FB inner circle) said they were?

On a more on-topic note, is entGo actually worth implementing as an ORM for Go? I've always been suspicious of most ORMs for all of the various reasons we all know about letting other people manage your database connections, in terms of performance/etc...

sizediterable · on Oct 30, 2021

I didn't leave due to the realization that the company was unethical, but there is a correlation there. I left because I got burnt out by the culture built around performance incentives. I didn't encounter anyone at the company who had inherently malicious or exploitative intent, but the culture drives the individual behaviors that in aggregate lead to all the harm caused by the company. That being said, everyone there, including myself, should feel somewhat accountable, but it's really mostly on the leadership for having the power to change the culture but neglecting to do so.

> is entGo actually worth implementing as an ORM for Go

I can't speak to how valuable it would be when judged on its own. I just think that it currently does not and probably will not come close to delivering the same level of value as the FB Ent framework.

numair · on Oct 31, 2021

I appreciate this reply. The sentiment you’ve shared is the same that a lot of current/former Facebook employees have privately emailed / called / etc and told me. The interesting thing is, it’s a lot of the seriously talented engineers and designers who care about these things the most, so there is hope for our future in what all of you will do based on these bad experiences.

EntGo seems like it would be useful if used in combination with a bunch of other stuff that was built to take advantage of it, but that’s just my quick impression after looking through the docs. If entity modeling is your thing, I would be interested to know what you thought of Ecto in the Phoenix/Elixir ecosystem.

southerntofu · on Oct 30, 2021

> ethical companies

What a cute oxymoron!

colesantiago · on Oct 30, 2021

why do tools like this and others always get invented at unethical companies?

jitl · on Oct 30, 2021

Is there another kind of company that has 2 billion plus users? Can you store 500 edges * 2 billion users in an off-the-shelf database system? One that was available in 2009? It is from scale and necessity that this kind of software is created.

koolba · on Oct 30, 2021

Developing tools costs money and unethical companies will always be more profitable than ethical ones as they have access to both ethical and unethical revenue streams. More profits means more money to throw at NIH projects.

nefitty · on Oct 30, 2021

The fewer barriers the people in an organization perceive, the more likely they are to innovate, whether for the greater good or not.

int0x2e · on Oct 30, 2021

I agree with your point, but I've personally also experienced the reverse as well - having very difficult constraints while being extremely motivated and well funded has often resulted in extremely innovative solutions. All of my examples are classified, sadly.

nefitty · on Oct 31, 2021

I can see how more constraints can lead to more focused problem solving for sure.

baby · on Oct 30, 2021

Unethical from your point of view

colesantiago · on Oct 30, 2021

still unethical no matter how you slice it.

sizediterable · on Oct 30, 2021

Unethical companies get rewarded for their unethical behavior instead of being held accountable in any meaningful way. They then use their vast resources to grow their engineering force at an incredible rate. Moving fast means they are also accruing technical debt at an astounding rate, which then leads to throwing more of their resources at people to focus solely on solving these scaling problems.

peterburkimsher · on Oct 30, 2021

Tangentially related: On Facebook, "The average person is connected to every other person by an average of 3.57 steps."

Three and a half degrees of separation

https://research.fb.com/blog/2016/02/three-and-a-half-degree...

When the article was published, there was a tool that helped to visualise that, and I was surprised but excited to see my estimate at 2.91, presumably helped by extensive travelling and diverse social circles.
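To make the metric concrete, here is a minimal sketch of what an "average number of steps" means on a toy friendship graph. The graph and names are made up, and the exhaustive breadth-first search is only an illustration: it would not scale to billions of users, so it should not be read as how the figure in the linked post was computed.

```python
from collections import deque

def distances_from(graph, start):
    """Breadth-first search: number of steps from start to every reachable person."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def average_separation(graph):
    """Mean shortest-path length over all ordered pairs of distinct, connected people."""
    total = pairs = 0
    for person in graph:
        for other, steps in distances_from(graph, person).items():
            if other != person:
                total += steps
                pairs += 1
    return total / pairs

# Hypothetical four-person chain: ann - bob - cat - dan
friends = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
}
print(average_separation(friends))
```

A per-person estimate like the 2.91 mentioned above would correspond to averaging the distances from a single starting person instead of over everyone.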

I believe that if Facebook's success metric is connectedness, rather than profit, the world will benefit.

It's easy to look the other way, and claim that big problems (plastic in the ocean, climate change) aren't my personal responsibility. And for 3 degrees of separation, that's reasonable: Switzerland has no ocean, so can't fix the plastic problem. But here in New Zealand, I bike past the ocean every day. To "love your neighbour" and stop to pick up litter while biking home from work will have a direct impact on the lives of birds and fish right here, and a butterfly effect impact on the whole ecosystem.

By creating software tools like Tao to map the social graph, it's possible to measure the connectedness, and report these metrics. Links are also bidirectional (there's a "thank you" feedback loop) like Xanadu.

For comparison, Wikipedia's connectedness is 3.019 degrees of separation.

https://www.sixdegreesofwikipedia.com/blog/search-results-an...

DaiPlusPlus · on Oct 30, 2021

Looking at TAO just as a data-model (instead of as a database system) and I'm really liking its simplicity.

But when you have a huge system like this, there are always engine-level rules (e.g. referential integrity) and business-rules that need to apply: for example, a Location-object can't "follow" a Comment, a PM cannot "check-in" to a User, and Human Users can't "friend" a declared bot User (I think?). Given their overall design, I assume the rules about what objects and associations can and cannot exist are enforced by the Association APIs when an association is created, so rejecting an invalid create-association request is straightforward. But how does it handle ever-changing business-requirements that can retroactively apply to existing objects and associations? E.g. suppose that FB decides to allow Human users to befriend bot users, so Mark Zuckerberg suddenly now has real Friend associations (shots fired), and then they decide to undo that policy change: how is the rule-change applied? What happens to the existing objects? What happens if two different physical FB servers running different versions of the software want to apply conflicting business rules?
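The create-time check assumed above could look roughly like the sketch below. Everything in it is hypothetical (the rule table, the type names, the `assoc_add` signature); it is not TAO's or Ent's actual API, just an illustration of rejecting an association whose endpoint types are not allowed.

```python
from dataclasses import dataclass

# Hypothetical rule table: (source object type, association type, destination object type).
ALLOWED_ASSOCS = {
    ("user", "friend", "user"),
    ("user", "checkin", "location"),
    ("user", "authored", "comment"),
}

@dataclass(frozen=True)
class Obj:
    id: int
    otype: str

class InvalidAssociation(Exception):
    pass

def assoc_add(store, src, atype, dst):
    """Reject the write unless the (src type, assoc type, dst type) triple is allowed."""
    if (src.otype, atype, dst.otype) not in ALLOWED_ASSOCS:
        raise InvalidAssociation(f"{src.otype} -[{atype}]-> {dst.otype} is not allowed")
    store.add((src.id, atype, dst.id))

store = set()
alice = Obj(1, "user")
cafe = Obj(2, "location")
comment = Obj(3, "comment")

assoc_add(store, alice, "checkin", cafe)        # accepted
try:
    assoc_add(store, cafe, "follows", comment)  # a Location cannot "follow" a Comment
except InvalidAssociation as err:
    print(err)
```

Note that a check like this only runs at write time: editing `ALLOWED_ASSOCS` later does nothing to associations already in `store`, which is exactly the retroactive-rule problem being asked about.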

Another thing I don't yet understand about using MySQL (or any SQL-first RDBMS for that matter) as a storage layer for an ostensible object/document system (hierarchical structured data) is that now you run into the object-relational impedance mismatch problem, which has many unpleasant solutions: e.g. using the EAV anti-pattern, which defeats the point of using an RDBMS in the first place, or going all-in with Codd's normalization and ending up with a rigid, strict data model schema design which is very difficult to update when business requirements change (i.e. using a table for each Codd relation (tuple) type for each TAO object type and having to run ALTER TABLE ADD/DROP COLUMN every time you want to update the TAO object schema, which obviously does not scale). I assume each TAO object can be treated as a keyed blob with type-name (i.e. a 3-tuple of (id, type, blob) ) - in which case using an RDBMS is overkill and introduces many overheads, so why use an RDBMS? Especially as I understand that pretty much every major web-service will disable things like referential integrity constraints for the sake of performance (and you can't have a constraint w.r.t data in a blob column anyway).

...so I assume I'm missing something, but what?
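For illustration, the keyed-blob layout guessed at above might be sketched like this. It uses SQLite as a stand-in for MySQL, and the table names, column names, and helper functions are assumptions made for the example rather than Facebook's real schema or API.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
conn.executescript("""
    CREATE TABLE objects (
        id    INTEGER PRIMARY KEY,
        otype TEXT NOT NULL,
        data  TEXT NOT NULL           -- serialized fields, opaque to the database
    );
    CREATE TABLE assocs (
        id1   INTEGER NOT NULL,
        atype TEXT NOT NULL,
        id2   INTEGER NOT NULL,
        time  INTEGER NOT NULL,
        data  TEXT NOT NULL,
        PRIMARY KEY (id1, atype, id2)
    );
    CREATE INDEX assocs_by_time ON assocs (id1, atype, time);
""")

def obj_add(oid, otype, fields):
    conn.execute("INSERT INTO objects VALUES (?, ?, ?)", (oid, otype, json.dumps(fields)))

def obj_get(oid):
    row = conn.execute("SELECT otype, data FROM objects WHERE id = ?", (oid,)).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))

def assoc_add(id1, atype, id2, time, fields=None):
    conn.execute("INSERT INTO assocs VALUES (?, ?, ?, ?, ?)",
                 (id1, atype, id2, time, json.dumps(fields or {})))

def assoc_range(id1, atype, limit):
    """Newest-first destination ids for one (id1, atype) pair."""
    rows = conn.execute(
        "SELECT id2 FROM assocs WHERE id1 = ? AND atype = ? ORDER BY time DESC LIMIT ?",
        (id1, atype, limit))
    return [row[0] for row in rows]

obj_add(1, "user", {"name": "alice"})
obj_add(2, "user", {"name": "bob", "nickname": "b"})  # extra field, no ALTER TABLE
assoc_add(1, "friend", 2, time=100)
assoc_add(1, "friend", 3, time=200)
print(obj_get(2))
print(assoc_range(1, "friend", 10))
```

In this sketch a new object field only changes the serialized blob, so no ALTER TABLE is needed, while the database still serves the indexed, time-ordered association lookups. Whether that trade-off is the actual reason for keeping MySQL is not something the sketch can answer.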

bcherny · on Oct 30, 2021

> how does it handle ever-changing business-requirements that can retroactively apply to existing objects and associations

If the change is backwards-compatible:

1. Update the schema definition

2. Run codegen

3. Done

If it’s not:

1. Create a new field/assoc

2. Double-write data to the new and old fields/assoc

3. Backfill data from the old one to the new one

4. Delete the old one

5. Done

> I assume each TAO object can be treated as a keyed blob with type-name (i.e. a 3-tuple of (id, type, blob) ) - in which case using an RDBMS is overkill and introduces many overheads, so why use an RDBMS?

I believe data is stored in tuples, same as most other graph DBs. I can’t speak to why MySQL in particular.

DaiPlusPlus · on Oct 30, 2021

Facebook's database is distributed - it's not as simple as "updating the schema definition" because that process-step alone is non-trivial with many cases to consider.

bcherny · on Oct 30, 2021

The process is more complicated under the hood. I’m describing it from the POV of an engineer that wants to update a field — from that POV, it really is that simple.

DaiPlusPlus · on Oct 30, 2021

> The process is more complicated under the hood.

Right, but that's exactly what I'm asking about :)

RandomBK · on Oct 30, 2021

The key is in making sure every step-wise change is backwards-compatible.

Creating a new field, double-writing, and backfilling are all relatively easy tasks in a distributed environment. Individual nodes can take their time catching up to the latest schema.

After a day or two, the engineer can verify that all data have been correctly migrated and double-written to the new ent/assoc. They then turn off double-writing to begin deprecating the old assoc. This can be done in an A/B test to ensure no major breakages.

Once the double-writing is fully turned off, the engineer checks one last time that all the data have been migrated. They check that the read/write rate to the old ent/assoc is 0, which then confirms that the old field is now safe to delete.

Overall, the trick is to think of the migration as a series of safe, backwards-compatible, easily-distributed steps, rather than a single atomic operation.
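The staged migration described above can be sketched as follows. The `Store` class, the field names, and the `convert` function are all hypothetical, and a real system would spread these steps over days and many machines; the sketch only shows the order of the steps and the checks between them.

```python
class Store:
    """Toy store: field name -> {object id -> value}, plus an access counter per field."""
    def __init__(self):
        self.fields = {}
        self.accesses = {}  # reads + writes per field since the counter was last reset

    def write(self, field, oid, value):
        self.fields.setdefault(field, {})[oid] = value
        self.accesses[field] = self.accesses.get(field, 0) + 1

    def read(self, field, oid):
        self.accesses[field] = self.accesses.get(field, 0) + 1
        return self.fields.get(field, {}).get(oid)

    def drop(self, field):
        self.fields.pop(field, None)

OLD, NEW = "name", "display_name"  # hypothetical field names

def convert(value):
    return value.strip()  # hypothetical change between the old and new format

def app_write(store, oid, value, double_write):
    """Application write path: always writes the new field, optionally the old one too."""
    store.write(NEW, oid, convert(value))
    if double_write:
        store.write(OLD, oid, value)

def backfill(store):
    """Copy rows that predate double-writing, never overwriting newer data."""
    new_rows = store.fields.setdefault(NEW, {})
    for oid, value in store.fields.get(OLD, {}).items():
        if oid not in new_rows:
            new_rows[oid] = convert(value)

def fully_migrated(store):
    return set(store.fields.get(OLD, {})) <= set(store.fields.get(NEW, {}))

store = Store()
store.write(OLD, 1, " alice ")                      # data from before the migration
app_write(store, 2, " bob ", double_write=True)     # new field exists, double-writing on
backfill(store)                                     # copy the older rows across
assert fully_migrated(store)

store.accesses[OLD] = 0                             # start watching traffic to the old field
app_write(store, 3, " carol ", double_write=False)  # double-writing turned off
if fully_migrated(store) and store.accesses[OLD] == 0:
    store.drop(OLD)                                 # no reads or writes left: safe to delete
print(store.fields)
```

Each step leaves the store readable by code that only knows the old field or only knows the new one, which is why nodes can catch up at their own pace.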

bcherny · on Oct 30, 2021

Sorry, I must have misunderstood the question! :)

pbalau · on Oct 30, 2021

> ...so I assume I'm missing something, but what?

Facebook started, like any other web app, with a relational database used as a relational database (FKs, indexes, all that stuff). They built tooling around that and acquired a great deal of knowledge on how to run MySQL at scale. Switching to a different system is a huge investment with no real benefits. In short, they are using MySQL because of inertia, caused by the difficulty of running such a system at this scale.

DaiPlusPlus · on Oct 30, 2021

> Switching to a different system is a huge investment with no real benefits

But there would be "real benefits": there are significant performance overheads, especially at Facebook's scale, with using an RDBMS compared to something specifically written for their use-case. If a different storage system has even, say, 10% better performance overall then that translates to a 10% reduction in hardware-costs, which at Facebook's scale is easily tens-of-millions of dollars per year.

The first thing that comes to mind is RocksDB - of course then I remembered just now that Facebook does use RocksDB, but with MyRocks (which combines RocksDB with MySQL - though the MyRocks website and docs don't clearly explain exactly how and where MyRocks sits in-relation to application servers and MySQL - or different MySQL storage engines... ).

pbalau · on Oct 30, 2021

> But there would be "real benefits": there are significant performance overheads, especially at Facebook's scale, with using an RDBMS compared to something specifically written for their use-case. If a different storage system has even, say, 10% better performance overall then that translates to a 10% reduction in hardware-costs, which at Facebook's scale is easily tens-of-millions of dollars per year.

And how would you know that? Based on what did you come up with those numbers? I find that hyperbole puts me off discussions of this kind.

cranekam · on Oct 30, 2021

There's some info on MyRocks here:

https://engineering.fb.com/2017/09/25/core-data/migrating-a-...

Of note is that it halved the storage requirements for the user DB tier. That's a pretty big win.

As for why MySQL persists: I no longer work at FB so I'm not up to speed on the current thinking, but one thing to remember is that TAO isn't the only thing that talks to MySQL. The web tier itself still has (or had) data types that were backed by MySQL and memcache. Lots of other systems for data processing, configuration, and so on stored data in MySQL. Replacing all of that would be a huge undertaking. I doubt there's a 10% win lying around just waiting for the taking with all this work, especially after MyRocks rolls out.

Also note that MySQL isn't the only storage system at Facebook. There are several others that were specifically written for specific use cases (e.g. ZippyDB, https://engineering.fb.com/2021/08/06/core-data/zippydb/).

techbio · on Oct 30, 2021

Later in the article the name is explained in that TAO is used as short for "The associations and objects" and while it's clever and I'm not really offended it really should be TAAO.

Regardless, I'm surprised not to see anything about caching objects as application state using APCu [1] or similar.

[1] https://www.php.net/manual/en/book.apcu.php

ehnto · on Oct 30, 2021

Is there a philosophical or historical reason behind the word Tao being used for things that are distributed?

travisd · on Oct 30, 2021

In the post, they give the reason for the name in this case: The Associations and Objects.