Scuba: Diving into Data at Facebook [pdf]

sophiebits · on Jan 23, 2017

I'm so happy to see this posted. I've been at Facebook for two years, and Scuba is hands-down one of my favorite internal tools (and we have a lot of good ones).

The article focuses a lot on the implementation, but thankfully you don't need to worry about that when using it. The flexible/quick/easy UI is what seals the deal for me, combined with the fast query times and realtime data -- you can use it to query things that happened literally seconds ago. On the React team we use it to collect all dev-time JS warnings from React that our engineers see so that we can easily track what the most high-firing messages are and how frequently they occur.

I haven't tried them extensively, but honeycomb.io and Interana are both Scuba-inspired products by ex-Facebookers. If this tool sounds at all interesting to you, I'd definitely look into using them.

krn1p4n1c · on Jan 23, 2017

I really miss the awesome tooling like Scuba since I left.

emfree · on Jan 23, 2017

I'll second the post above -- if you miss Scuba, honeycomb.io is for you. https://honeycomb.io/blog/2016/11/honeycomb-faq-in-140-chars...

jmtulloss · on Jan 24, 2017

There's also Snorkel, mostly written by Okay Zed (one of the authors listed on that Scuba paper) http://snorkel.logv.org/

nocarrier · on Jan 23, 2017

I was at FB from 2006-2015, and Scuba caused a phase change in how Facebook's performance was analyzed and how quickly insights were gained. For many of us in Infrastructure, we spoke of the pre-Scuba, post-Scuba eras. It was such a joy to have aggregated realtime performance metrics for our internet scale services and have the ability to drill down on just about any dimension when doing perf investigations. It gave us so much more confidence in our systems, both in our ability to detect issues, as well as pinpoint what was causing them.

I'd also like to echo what spicyj said about the UI: it is very usable. It was common for managers, PMs, and even people in completely non-technical business roles to use Scuba. It was my favorite internal product at FB and the one I miss the most.

inlined · on Jan 23, 2017

Former Facebook employee and current Googler. I really really miss Scuba. The visualizations are what really cinched things for me.

At Parse I was able to diagnose some seriously complex performance issues for customers by splitting queries (precomputed in logs) into families and looking at probability distributions. It was amazing. Every time the customer told me XYZ was the problem I could send them a screenshot and refocus the conversation where the data sent us.

Some ex-Parse employees left FB to build a visualization tool based on the same white papers (honeycomb.io). I'm really hoping to add that to my tool belt again.

liorabraham · on Jan 23, 2017

There's also a blog post about it here https://www.facebook.com/notes/facebook-engineering/under-th...

My YC company https://www.interana.com took a lot of lessons from this and is doing something that I think is even better for cos like Reddit, Sonos, Comcast, Bing. You can sign up if you'd like a demo :)

anymoonus · on Jan 24, 2017

It would be really nice if I didn't have to fill out a form to see a demo-- having used Scuba I'd love to advocate for Interana at my current company, but it keeps getting pushed down my TODO list because of the extra friction

georgewfraser · on Jan 23, 2017

This is a special-purpose time-series data warehouse and a UI for querying it. If you are not Facebook, it is almost always better to do projects like this by using a standard data warehouse like Redshift or BigQuery, a queue like Kinesis, and a BI tool like Looker or Tableau. Your data won't be quite as real-time and your queries won't be quite as fast, but it will take much less engineering effort and you'll be able to use the same tools for other projects.

buremba · on Jan 23, 2017

Shameless plug: I'm the maintainer of the Rakam project. You don't have to deal with all of these complexities. There are open-source projects that can set up that infrastructure for you. For example, we provide CloudFormation scripts that set up a data analytics cluster for you (Kinesis, S3, PrestoDB and a RESTful API for collecting and querying data-sets) so that you can use it similar to how you use a SaaS product. We also have an integrated visualization project that can connect to your Rakam API and allow you to create dynamic reports and custom dashboards. You can run queries via a user interface similar to the Scuba UI, and also complex behavioral queries such as funnel and cohort queries, as well as SQL. https://github.com/rakam-io/rakam

siliconc0w · on Jan 23, 2017

Here is a good list of redshift 'gotchas' - https://github.com/open-guides/og-aws#redshift-gotchas-and-l...

We ran into several of those. Notably, it's difficult to achieve the 'real-time' promise of redshift because of the huge performance hit while loading data into the DB so you have to do it off-hours. You can update a replica and then 'hot-swap' it in but this gets expensive. For operational analytics it's better to go with one of the purpose-built timeseries databases and dual write to that and your data warehouse.
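
A minimal sketch of the dual-write idea above. Everything here is an assumption for illustration: `InMemorySink` is a hypothetical stand-in for a real time-series client or warehouse loader, and the batching policy is made up. The point is only that each event goes to the low-latency store immediately, while the warehouse copy is buffered and loaded in bulk.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

Event = Dict[str, Any]


@dataclass
class InMemorySink:
    """Hypothetical stand-in for a real client (time-series DB or warehouse loader)."""
    name: str
    rows: List[Event] = field(default_factory=list)

    def write(self, event: Event) -> None:
        # Copy the event so later mutation by the caller does not affect stored rows.
        self.rows.append(dict(event))


def dual_write(event: Event, timeseries: InMemorySink, warehouse_buffer: InMemorySink) -> None:
    """Send the event to the operational store now and queue it for the warehouse."""
    timeseries.write(event)
    warehouse_buffer.write(event)


def drain_batch(warehouse_buffer: InMemorySink) -> List[Event]:
    """Take everything queued so far, e.g. for one bulk load during off-hours."""
    batch = list(warehouse_buffer.rows)
    warehouse_buffer.rows.clear()
    return batch


if __name__ == "__main__":
    ts = InMemorySink("timeseries")
    buf = InMemorySink("warehouse-buffer")
    dual_write({"time": 1485200000, "endpoint": "/feed", "latency_ms": 120}, ts, buf)
    dual_write({"time": 1485200001, "endpoint": "/feed", "latency_ms": 80}, ts, buf)
    print(len(ts.rows), "rows queryable now;", len(drain_batch(buf)), "rows in the next bulk load")
```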

georgewfraser · on Jan 24, 2017

The biggest gotcha listed there is how Redshift gets bogged down if you're loading a lot of tables, frequently. You can't run a production Redshift with lots of tables at <15m latency. But in most cases, Redshift is still an overall better choice than a timeseries database because:

* It has all of SQL, including JOINs

* You can use it for both timeseries data and all your other data.

sophiebits · on Jan 23, 2017

I think you could build a Scuba-like UI over BigQuery that is simple to use and works great.

_fsjdf_ · on Jan 23, 2017

Wouldn't InfluxDB be a better comparison?

georgewfraser · on Jan 24, 2017

The advantage of general-purpose data warehouses is they give you all of SQL, and they are compatible with BI tools. Time-series data is just one of the types of data you will want to analyze. It's best to choose tools that will work for all your data sources, even if these tools are suboptimal for time-series in particular.

spimmy · on Jan 23, 2017

Yay!! So happy to see Scuba get more visibility. I can't even count the number of times I said, and heard other engineers say, that the thing they would miss most about Facebook was Scuba.

That's the whole reason why we built honeycomb.io. If you're a FB expat, check us out.

amenghra · on Jan 23, 2017

Former Scuba dev, Okay Zed, wrote http://snorkel.logv.org/. It's open source.

Solinoid · on Jan 23, 2017

I gotta say, I don't think a snorkel will help the poor person in your logo.

soothseer · on Jan 23, 2017

seems apt for this: https://www.flickr.com/groups/stickfiguresinperil/

AlexCoventry · on Jan 23, 2017

Is there a comparison to scuba?

CptJamesCook · on Jan 23, 2017

The engineer who built this, Lior Abraham, went on to start data analytics startup Interana (YC W13).

nvais · on Jan 23, 2017

Yes, indeed. Cofounded with Ann Johnson and fellow ex-Facebooker, Bobby Johnson, who led the infrastructure team for many years.

Interana draws much of its inspiration from Scuba, combined with lessons learned from analyzing massive amounts of data at scale.

Many high-growth companies use Interana for behavioral analysis of their event logs (Asana, Reddit, Imgur, Nextdoor, Bing, Azure, Tinder, SurveyMonkey, Sonos...).

buremba · on Jan 23, 2017

How does Scuba differ from Presto, which is also developed by Facebook? It seems that it stores data in-memory and has a data expiration feature, but it also has many features in common, such as SQL and distributed processing.

lstyls · on Jan 23, 2017

Scuba made decisive tradeoffs in the functionality that it provides. Notable ones include that it doesn't support joins within a table, and doesn't provide any cross-table operations. Mostly it is used for basic filtering on constant values, and gathering summary statistics on those values. This is less of a limitation than it sounds like because when you know this ahead of time you just log your writes in a denormalized way and you don't need to join anything later.
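
To make the denormalized-logging point concrete, here is a minimal sketch in plain Python. It is not Scuba's API; the column names (`endpoint`, `datacenter`, `server_role`, `latency_ms`) and the helper functions are hypothetical. Each row carries every dimension you might later filter or group on, so the query side only needs constant-value filters and summary statistics.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]

# Denormalized rows: attributes that would live in separate tables in a
# normalized schema (server role, datacenter) are written into every row.
ROWS: List[Row] = [
    {"time": 1485200000, "endpoint": "/feed", "datacenter": "dc1", "server_role": "web", "latency_ms": 120},
    {"time": 1485200001, "endpoint": "/feed", "datacenter": "dc2", "server_role": "web", "latency_ms": 80},
    {"time": 1485200002, "endpoint": "/profile", "datacenter": "dc1", "server_role": "web", "latency_ms": 200},
    {"time": 1485200003, "endpoint": "/feed", "datacenter": "dc1", "server_role": "api", "latency_ms": 50},
]


def filter_rows(rows: List[Row], **equals: Any) -> List[Row]:
    """Keep rows where every named column equals the given constant."""
    return [r for r in rows if all(r.get(col) == val for col, val in equals.items())]


def group_stats(rows: List[Row], by: str, metric: str) -> Dict[Any, Dict[str, float]]:
    """Group rows by one column and compute count/avg/max of a numeric column."""
    groups: Dict[Any, List[float]] = defaultdict(list)
    for r in rows:
        groups[r[by]].append(r[metric])
    return {
        key: {"count": len(vals), "avg": sum(vals) / len(vals), "max": max(vals)}
        for key, vals in groups.items()
    }


if __name__ == "__main__":
    web_rows = filter_rows(ROWS, server_role="web")
    print(group_stats(web_rows, by="endpoint", metric="latency_ms"))
```

No join is needed to answer "latency per endpoint on web servers" because `server_role` was logged with each row.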

As @ot said, Presto is just a query engine and it doesn't provide a backend. It provides an API that allows it to be plugged in to different data warehousing systems. I would assume functionality depends to some extent on what your data is stored in, but in general Presto supports the full suite of standard relational db style queries.

Source: I work at FB as well. In fact I was using Scuba just now to do a quick analysis of our storage requirements for Scuba itself :)

nvais · on Jan 23, 2017

Here is a great post from Bobby Johnson (ex-Facebooker and CTO of Interana) with his opinion on "in-memory" data stores: https://community.interania.com/t5/Blogs/The-Myth-of-In-Memo....

This was the reasoning behind a very key architectural decision at Interana that makes it different than Scuba - instead of developing an in-memory system, Interana created a custom data store that is heavily optimized around using spinning disk and CPU cache. This makes it incredibly fast and less expensive to operate massive clusters at scale.

ot · on Jan 23, 2017

Scuba is a complete system of log collection, storage and retrieval, and UI/visualization.

Presto would only cover the storage/retrieval part. Scuba has its own backend for that which is very optimized for the kind of queries the UI needs to support, while Presto is a generic SQL store for analytics.

buremba · on Jan 23, 2017

How does Scuba optimize the stored data-sets for aggregation queries compared to Presto (Raptor connector)? They both use common columnar data storage techniques such as compression, delta encoding and dictionary encoding. The main difference seems to be the real-time nature of Scuba and the UI.
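
For readers unfamiliar with the two encodings named above, here is a textbook-style sketch in Python. It makes no claim about how Scuba or Presto actually implement them; the function names are made up for illustration. Dictionary encoding replaces repeated strings with small integer codes, and delta encoding stores differences between consecutive values, which suits slowly increasing columns such as timestamps.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, codes): each distinct string is stored once."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def dictionary_decode(dictionary: List[str], codes: List[int]) -> List[str]:
    """Map integer codes back to the original strings."""
    return [dictionary[c] for c in codes]


def delta_encode(values: List[int]) -> List[int]:
    """Store the first value, then the difference from each previous value."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values with a running sum."""
    out: List[int] = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


if __name__ == "__main__":
    print(dictionary_encode(["web", "api", "web", "web"]))
    print(delta_encode([1485200000, 1485200001, 1485200001, 1485200005]))
```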

ot · on Jan 23, 2017

Oh yeah good point, I had forgotten that Presto does not support realtime. About optimizations, I don't know the details, but for one, Scuba is C++ and Presto is Java.

mnort9 · on Jan 23, 2017

Seems very similar to BigQuery. I wonder how the architectures compare.

kajecounterhack · on Jan 23, 2017

https://research.google.com/pubs/pub36632.html

^ This is Google's equivalent (mentioned at the end of the paper in "Related Work")

bajsejohannes · on Jan 23, 2017

Is this replacing the Gorilla database? Or is it using it under the hood? Or do they co-exist? If so, how are they used differently?

For those who don't know, the Gorilla database is also from Facebook and they published a paper about it roughly a year ago: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf

sophiebits · on Jan 23, 2017

They're complementary. The ODS system mentioned there is more for monitoring numerical metrics (like Graphite) and doesn't support logging arbitrary data.

ODS is good for top-line metrics and can handle more volume but doesn't compare to Scuba if you want to dig in and look at individual rows in your data (or even just analyze your data and group based on certain columns).

debt · on Jan 24, 2017

So it samples data less than or equal to a second old and at a rate determined by the person making the query?

I wonder how often the data is inaccurate given the potentially low sample size?

nbm · on Jan 24, 2017

Not sure where you got the "less than or equal to a second old"? Maybe I'm misunderstanding what you mean?

There is no single system-wide imposed sampling rate, so it's up to you to set the sampling rate based on what sort of queries you want to be able to do with good enough accuracy. We have 1:1 rate data for some things (say errors served on a particular service), while other things are sampled at ten thousand or a hundred thousand to one where there are, say, tens of millions of log lines per second.
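
A minimal sketch of per-table sampling, assuming each kept row records the rate it was sampled at so that counts can be scaled back up at query time. The `sample_rate` column name and both helpers are hypothetical, not Scuba's actual interface.

```python
import random
from typing import Any, Dict, Iterable, List, Optional

Row = Dict[str, Any]


def sample_events(events: Iterable[Row], sample_rate: int,
                  rng: Optional[random.Random] = None) -> List[Row]:
    """Keep each event with probability 1/sample_rate and tag it with that rate."""
    if sample_rate < 1:
        raise ValueError("sample_rate must be >= 1")
    if rng is None:
        rng = random.Random()
    kept: List[Row] = []
    for event in events:
        # random() is in [0, 1), so a rate of 1 keeps every event.
        if rng.random() < 1.0 / sample_rate:
            row = dict(event)
            row["sample_rate"] = sample_rate
            kept.append(row)
    return kept


def estimated_count(rows: Iterable[Row]) -> int:
    """Each kept row stands in for sample_rate original events."""
    return sum(r["sample_rate"] for r in rows)


if __name__ == "__main__":
    events = [{"status": 500} for _ in range(100_000)]
    exact = sample_events(events, sample_rate=1)
    sampled = sample_events(events, sample_rate=100, rng=random.Random(0))
    print("1:1 estimate:", estimated_count(exact))
    print("1:100 rows kept:", len(sampled), "estimate:", estimated_count(sampled))
```

At 1:1 the estimate is exact; at higher rates it is an estimate whose accuracy depends on how many rows survive sampling and filtering.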

debt · on Jan 24, 2017

Ah yeah. I misread the pdf; the rows expire at millions per second and not after 1 second.

I was wondering about the size of the sampling error. Apparently it's negligible.
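
As a rough back-of-the-envelope check on the size of the sampling error, here is a sketch that assumes each row is kept independently with probability p = 1/rate (a binomial model, which is my assumption and not something the paper or thread states). Under that model the relative standard error of a scaled-up count is sqrt((1 - p) / n), where n is the expected number of kept rows. The traffic numbers in the example are hypothetical.

```python
import math


def relative_std_error(true_count: float, sample_rate: float) -> float:
    """Relative standard error of a scaled-up count under independent sampling."""
    p = 1.0 / sample_rate
    expected_kept = true_count * p
    return math.sqrt((1.0 - p) / expected_kept)


if __name__ == "__main__":
    # Hypothetical: 10 million lines/second for 60 seconds, sampled at 1:100,000,
    # leaves about 6,000 rows in the query window.
    err = relative_std_error(true_count=10_000_000 * 60, sample_rate=100_000)
    print(f"relative standard error: {err:.2%}")
```

The error grows as filters and group-bys shrink the number of matching rows, so "negligible" holds for broad queries more readily than for narrow ones.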

quotemstr · on Jan 24, 2017

Can confirm --- was at FB. Scuba is awesome --- especially the surprisingly sophisticated statistical aggregations and the call-stack view that somebody added.

aristus · on Jan 24, 2017

That was me. Or rather, I created a table with columns s0-s255 and built a primitive tree interface for a stacktrace dataset called Strobelight. Searching was literally if s0 == 'foo' || s1 == 'foo'... etc. This horrified the real Scuba devs enough to add a proper vector type and search operators.
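
A small sketch contrasting the two layouts described above, in plain Python rather than Scuba's query language. The s0..s255 column names come from the comment; the `stack` vector column and the helper names are hypothetical.

```python
from typing import Dict, List, Optional

MAX_DEPTH = 256  # columns s0..s255, as in the comment


def to_flat_row(frames: List[str]) -> Dict[str, Optional[str]]:
    """One column per stack depth; unused depths are None, deeper frames are dropped."""
    return {f"s{i}": (frames[i] if i < len(frames) else None) for i in range(MAX_DEPTH)}


def flat_contains(row: Dict[str, Optional[str]], frame: str) -> bool:
    """Equivalent of: s0 == frame || s1 == frame || ... || s255 == frame."""
    return any(row[f"s{i}"] == frame for i in range(MAX_DEPTH))


def vector_contains(row: Dict[str, List[str]], frame: str) -> bool:
    """With a vector-typed column, the search is a single membership test."""
    return frame in row["stack"]


if __name__ == "__main__":
    frames = ["main", "render", "foo"]
    print(flat_contains(to_flat_row(frames), "foo"))
    print(vector_contains({"stack": frames}, "foo"))
```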

_fsjdf_ · on Jan 23, 2017

How does this compare to Splunk on machine data or Tableau on relational data?

nbm · on Jan 23, 2017

Comparing to Tableau, Scuba is schema-less and doesn't require any setup beyond creating the table (which doesn't have any approval process in the way and gives you a reasonable amount of scratch space to test before you get serious) and then having data arrive. Once Scuba is aware of your column by you submitting data for it, it allows you to query/group based on it very quickly. Scuba is entirely real-time. There are some other projects to do pre-computed aggregates if that's important to you.
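
A toy illustration of the schema-less behavior described above: the table learns about a column the first time a row mentions it, and that column can then be grouped on right away. `SchemalessTable` is a hypothetical in-memory class written for this example and says nothing about how Scuba stores data.

```python
from collections import defaultdict
from typing import Any, Dict, List


class SchemalessTable:
    """In-memory table with no declared schema; columns are discovered from data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: List[Dict[str, Any]] = []
        self.columns: Dict[str, str] = {}  # column name -> type name first seen

    def add(self, row: Dict[str, Any]) -> None:
        # Register any column not seen before, then store a copy of the row.
        for col, value in row.items():
            self.columns.setdefault(col, type(value).__name__)
        self.rows.append(dict(row))

    def count_by(self, column: str) -> Dict[Any, int]:
        """Count rows per value of a column; rows lacking it fall under None."""
        counts: Dict[Any, int] = defaultdict(int)
        for r in self.rows:
            counts[r.get(column)] += 1
        return dict(counts)


if __name__ == "__main__":
    table = SchemalessTable("my_scratch_table")
    table.add({"time": 1485200000, "endpoint": "/feed"})
    table.add({"time": 1485200001, "endpoint": "/feed", "browser": "firefox"})
    print(table.columns)
    print(table.count_by("browser"))
```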

I haven't used Splunk in ~7 years, so I can't remember enough about it to compare well. Splunk has some structured-on-top-of-unstructured stuff in it, whereas Scuba is always structured. If you want to turn something unstructured into something structured, you generally run a separate pipeline to do that (using one of the tailing frameworks for your preferred language). In terms of the alarm system in Splunk, we have other systems for handling that using the data that flows into Scuba.

lstyls · on Jan 23, 2017

I don't know those platforms, but Scuba doesn't support relational operations. It's not as much of a limitation as it sounds like because you can log your data in a denormalized way.

jayeshsalvi · on Jan 23, 2017

The title made me picture Mark Zuckerberg diving into Facebook user data like Scrooge McDuck