
Scuba: Diving into Data at Facebook [pdf] - mpweiher
https://research.fb.com/wp-content/uploads/2016/11/scuba-diving-into-data-at-facebook.pdf
======
sophiebits
I'm so happy to see this posted. I've been at Facebook for two years, and
Scuba is hands-down one of my favorite internal tools we have (and we
have a lot of good ones).

The article focuses a lot on the implementation, but thankfully you don't need
to worry about that when using it. The flexible/quick/easy UI is what seals
the deal for me, combined with the fast query times and realtime data -- you
can use it to query things that happened literally seconds ago. On the React
team we use it to collect all dev-time JS warnings from React that our
engineers see, so we can easily track which messages fire most often.

I haven't tried them extensively, but honeycomb.io and Interana are both
Scuba-inspired products by ex-Facebookers. If this tool sounds at all
interesting to you, I'd definitely look into using them.

~~~
krn1p4n1c
I really miss the awesome tooling like Scuba since I left.

~~~
emfree
I'll second the post above -- if you miss Scuba, honeycomb.io is for you.
[https://honeycomb.io/blog/2016/11/honeycomb-faq-in-140-chars-getting-started/](https://honeycomb.io/blog/2016/11/honeycomb-faq-in-140-chars-getting-started/)

~~~
jmtulloss
There's also Snorkel, mostly written by Okay Zed (one of the authors listed on
that Scuba paper) [http://snorkel.logv.org/](http://snorkel.logv.org/)

------
nocarrier
I was at FB from 2006-2015, and Scuba caused a phase change in how Facebook's
performance was analyzed and how quickly insights were gained. Many of us in
Infrastructure spoke of the pre-Scuba and post-Scuba eras. It was such a
joy to have aggregated realtime performance metrics for our internet-scale
services and have the ability to drill down on just about any dimension when
doing perf investigations. It gave us so much more confidence in our systems,
both in our ability to detect issues, as well as pinpoint what was causing
them.

I'd also like to echo what spicyj said about the UI: it is very usable. It was
common for managers, PMs, and even people in completely non-technical business
roles to use Scuba. It was my favorite internal product at FB and the one I
miss the most.

------
inlined
Former Facebook employee and current Googler. I really really miss Scuba. The
visualizations are what really clinched things for me.

At Parse I was able to diagnose some seriously complex performance issues for
customers by splitting queries (precomputed in logs) into families
and looking at probability distributions. It was amazing. Every time the
customer told me XYZ was the problem I could send them a screenshot and
refocus the conversation where the data sent us.

Some ex-Parse employees left FB to build a visualization tool based on the
same white papers (honeycomb.io). I'm really hoping to add that to my tool
belt again.

------
liorabraham
There's also a blog post about it here
[https://www.facebook.com/notes/facebook-engineering/under-the-hood-data-diving-with-scuba/10150599692628920/](https://www.facebook.com/notes/facebook-engineering/under-the-hood-data-diving-with-scuba/10150599692628920/)

My YC company [https://www.interana.com](https://www.interana.com) took a lot
of lessons from this and is doing something that I think is even better for
cos like Reddit, Sonos, Comcast, Bing. You can sign up if you'd like a demo :)

~~~
anymoonus
It would be really nice if I didn't have to fill out a form to see a demo --
having used Scuba I'd love to advocate for Interana at my current company, but
it keeps getting pushed down my TODO list because of the extra friction.

------
georgewfraser
This is a special-purpose time-series data warehouse and a UI for querying it.
If you are not Facebook, it is _almost always_ better to do projects like this
by using a standard data warehouse like Redshift or BigQuery, a queue like
Kinesis, and a BI tool like Looker or Tableau. Your data won't be quite as
real-time and your queries won't be quite as fast, but it will take much less
engineering effort and you'll be able to use the same tools for other
projects.

~~~
siliconc0w
Here is a good list of Redshift 'gotchas': [https://github.com/open-guides/og-aws#redshift-gotchas-and-limitations](https://github.com/open-guides/og-aws#redshift-gotchas-and-limitations)

We ran into several of those. Notably, it's difficult to achieve the
'real-time' promise of Redshift because of the huge performance hit while
loading data into the DB, so you have to do it off-hours. You can update a replica and
then 'hot-swap' it in but this gets expensive. For operational analytics it's
better to go with one of the purpose-built timeseries databases and dual write
to that and your data warehouse.

~~~
georgewfraser
The biggest gotcha listed there is how Redshift gets bogged down if you're
loading a lot of tables frequently. You can't run a production Redshift with
lots of tables at <15m latency. But in most cases, Redshift is still an
overall better choice than a timeseries database because:

* It has all of SQL, including JOINs.

* You can use it for both timeseries data and all your other data.

------
spimmy
Yay!! So happy to see Scuba get more visibility. I can't even count the number
of times I said, and heard other engineers say, that the thing they would miss
most about Facebook was Scuba.

That's the whole reason why we built honeycomb.io. If you're a FB expat, check
us out.

------
amenghra
Former Scuba dev, Okay Zed, wrote
[http://snorkel.logv.org/](http://snorkel.logv.org/). It's open source.

~~~
Solinoid
I gotta say, I don't think a snorkel will help the poor person in your logo.

~~~
soothseer
seems apt for this:
[https://www.flickr.com/groups/stickfiguresinperil/](https://www.flickr.com/groups/stickfiguresinperil/)

------
CptJamesCook
The engineer who built this, Lior Abraham, went on to start data analytics
startup Interana (YC W13).

~~~
nvais
Yes, indeed. Cofounded with Ann Johnson and fellow ex-Facebooker, Bobby
Johnson, who led the infrastructure team for many years.

Interana draws much of its inspiration from Scuba, combined with lessons
learned from analyzing massive amounts of data at scale.

Many high-growth companies use Interana for behavioral analysis of their event
logs (Asana, Reddit, Imgur, Nextdoor, Bing, Azure, Tinder, SurveyMonkey,
Sonos...).

------
buremba
How does Scuba differ from Presto, which is also developed by Facebook? It
seems that it stores data in memory and has a data-expiration feature, but the
two also share many features such as SQL and distributed processing.

~~~
ot
Scuba is a complete system of log collection, storage and retrieval, and
UI/visualization.

Presto would only cover the storage/retrieval part. Scuba has its own backend
for that, which is very optimized for the kind of queries the UI needs to
support, while Presto is a generic SQL engine for analytics.

~~~
buremba
How does Scuba optimize the stored data-sets for aggregation queries compared
to Presto (Raptor connector)? They both use common columnar data storage
techniques such as compression, delta encoding and dictionary encoding. The
main difference seems to be the real-time nature of Scuba and the UI.
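
The encodings mentioned above are easy to illustrate. Here is a toy Python sketch of dictionary encoding and delta encoding (purely illustrative -- this is not Scuba's or Raptor's actual on-disk format):

```python
def dict_encode(values):
    """Dictionary encoding: map each distinct value to a small integer id."""
    table = {}
    ids = []
    for v in values:
        if v not in table:
            table[v] = len(table)
        ids.append(table[v])
    return table, ids

def delta_encode(values):
    """Delta encoding: store the first value, then successive differences.
    Works well for sorted columns like timestamps, where deltas are small."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

# A low-cardinality string column compresses to tiny integer ids...
table, ids = dict_encode(["US", "DE", "US", "US", "DE"])
print(table, ids)  # {'US': 0, 'DE': 1} [0, 1, 0, 0, 1]

# ...and a mostly-increasing timestamp column compresses to small deltas.
print(delta_encode([1000, 1001, 1001, 1004]))  # [1000, 1, 0, 3]
```

Both transforms shrink the value range dramatically, which is what makes the subsequent general-purpose compression pass effective.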

~~~
ot
Oh yeah, good point -- I had forgotten that Presto does not support realtime.
About optimizations, I don't know the details, but for one, Scuba is C++ and
Presto is Java.

------
mnort9
Seems very similar to BigQuery. I wonder how the architectures compare.

~~~
kajecounterhack
[https://research.google.com/pubs/pub36632.html](https://research.google.com/pubs/pub36632.html)

^ This is Google's equivalent (mentioned at the end of the paper in "Related
Work")

------
bajsejohannes
Is this replacing the Gorilla database? Or is it using it under the hood? Or
do they co-exist? If so, how are they used differently?

For those who don't know, the Gorilla database is also from Facebook and they
published a paper about it roughly a year ago:
[http://www.vldb.org/pvldb/vol8/p1816-teller.pdf](http://www.vldb.org/pvldb/vol8/p1816-teller.pdf)

~~~
spicyj
They're complementary. The ODS system mentioned there is more for monitoring
numerical metrics (like Graphite) and doesn't support logging arbitrary data.

ODS is good for top-line metrics and can handle more volume but doesn't
compare to Scuba if you want to dig in and look at individual rows in your
data (or even just analyze your data and group based on certain columns).

------
debt
So it samples data less than or equal to a second old and at a rate determined
by the person making the query?

I wonder how often the data is inaccurate given the potentially low sample
size?

~~~
nbm
Not sure where you got the "less than or equal to a second old"? Maybe I'm
misunderstanding what you mean?

There is no single system-wide imposed sampling rate, so it's up to you to set
the sampling rate based on what sort of queries you want to be able to do with
good enough accuracy. We have 1:1 data for some things (say, errors served
by a particular service), and ten-thousand- or hundred-thousand-to-one
sampling for other things where there are, say, tens of millions of log lines
per second.
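
The paper describes storing the sampling rate alongside each row so that aggregates can be scaled back up to an estimate of the true total. A minimal sketch of that idea (hypothetical field names, not Scuba's actual schema):

```python
# Each kept row records the rate it was sampled at ("1 in N kept"),
# so an estimated true count is the sum of per-row weights N.
rows = [
    {"service": "www",  "sample_rate": 1},      # errors: logged 1:1
    {"service": "feed", "sample_rate": 10000},  # high-volume: 1 in 10,000
    {"service": "feed", "sample_rate": 10000},
]

def estimated_count(rows, **filters):
    """Estimate the true event count by weighting each kept row by its rate."""
    return sum(
        r["sample_rate"]
        for r in rows
        if all(r.get(k) == v for k, v in filters.items())
    )

print(estimated_count(rows, service="feed"))  # 2 kept rows -> ~20,000 events
print(estimated_count(rows, service="www"))   # 1
```

Because the rate travels with the row, tables can mix 1:1 and heavily sampled data and still produce consistent aggregate estimates.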

~~~
debt
Ah yeah. I misread the PDF; rows arrive at millions per second and expire
later, not after 1 second.

I was wondering about the size of the sampling error. Apparently it's
negligible.

------
quotemstr
Can confirm --- was at FB. Scuba is awesome --- especially the surprisingly
sophisticated statistical aggregations and the call-stack view that somebody
added.

~~~
aristus
That was me. Or rather, I created a table with columns s0-s255 and built a
primitive tree interface for a stacktrace dataset called Strobelight.
Searching was literally if s0 == 'foo' || s1 == 'foo'... etc. This horrified
the real Scuba devs enough to add a proper vector type and search operators.
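
The anecdote above is easy to mechanize. A hypothetical sketch of why the flattened s0-s255 schema hurt, versus what a vector type buys you:

```python
FRAMES = 256  # fixed column count from the anecdote: s0..s255

def flatten_stack(stack):
    """Pad a stack trace out into fixed columns s0..s255 (the original hack)."""
    row = {f"s{i}": None for i in range(FRAMES)}
    for i, frame in enumerate(stack[:FRAMES]):
        row[f"s{i}"] = frame
    return row

def naive_contains(row, frame):
    """The 'if s0 == "foo" || s1 == "foo" || ...' search, mechanized:
    every query must test all 256 columns."""
    return any(row[f"s{i}"] == frame for i in range(FRAMES))

def vector_contains(stack, frame):
    """With a real vector type, the same search is one membership test."""
    return frame in stack

stack = ["main", "handle_request", "foo"]
print(naive_contains(flatten_stack(stack), "foo"))  # True
print(vector_contains(stack, "foo"))                # True
```

The flattened form also wastes 253 null columns per three-frame stack, which is presumably part of what horrified the Scuba devs.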

------
_fsjdf_
How does this compare to Splunk on machine data or Tableau on relational data?

~~~
nbm
Compared to Tableau, Scuba is schema-less and doesn't require any setup beyond
creating the table (there's no approval process in the way, and you get a
reasonable amount of scratch space to test before you get serious) and then
having data arrive. Once Scuba sees a column in the data you submit, it lets
you query and group on it very quickly. Scuba is entirely real-time; there are
some other projects to do pre-computed aggregates if that's important to you.

I haven't used Splunk in ~7 years, so I can't remember enough about it to
compare well. Splunk has some structured-on-top-of-unstructured stuff in it,
whereas Scuba is always structured. If you want to turn something unstructured
into something structured, you generally run a separate pipeline to do that
(using one of the tailing frameworks for your preferred language). In terms of
the alarm system in Splunk, we have other systems for handling that using the
data that flows into Scuba.

------
jayeshsalvi
The title made me picture Mark Zuckerberg diving into Facebook user data like
Scrooge McDuck.

