Scuba: Diving into Data at Facebook [pdf] (fb.com)
206 points by mpweiher on Jan 23, 2017 | 41 comments

I'm so happy to see this posted. I've been at Facebook for two years, and Scuba is hands-down one of my favorite internal tools (and we have a lot of good ones).

The article focuses a lot on the implementation, but thankfully you don't need to worry about that when using it. The flexible/quick/easy UI is what seals the deal for me, combined with the fast query times and realtime data -- you can use it to query things that happened literally seconds ago. On the React team we use it to collect all dev-time JS warnings from React that our engineers see so that we can easily track what the most high-firing messages are and how frequently they occur.

I haven't tried them extensively, but honeycomb.io and Interana are both Scuba-inspired products by ex-Facebookers. If this tool sounds at all interesting to you, I'd definitely look into using them.

I really miss the awesome tooling like Scuba since I left.

I'll second the post above -- if you miss Scuba, honeycomb.io is for you. https://honeycomb.io/blog/2016/11/honeycomb-faq-in-140-chars...

There's also Snorkel, mostly written by Okay Zed (one of the authors listed on that Scuba paper) http://snorkel.logv.org/

I was at FB from 2006-2015, and Scuba caused a phase change in how Facebook's performance was analyzed and how quickly insights were gained. For many of us in Infrastructure, we spoke of the pre-Scuba, post-Scuba eras. It was such a joy to have aggregated realtime performance metrics for our internet scale services and have the ability to drill down on just about any dimension when doing perf investigations. It gave us so much more confidence in our systems, both in our ability to detect issues, as well as pinpoint what was causing them.

I'd also like to echo what spicyj said about the UI: it is very usable. It was common for managers, PMs, and even people in completely non-technical business roles to use Scuba. It was my favorite internal product at FB and the one I miss the most.

Former Facebook employee and current Googler. I really really miss Scuba. The visualizations are what really cinched things for me.

At Parse I was able to diagnose some seriously complex performance issues for customers by splitting queries (precomputed in logs) into families and looking at probability distributions. It was amazing. Every time the customer told me XYZ was the problem I could send them a screenshot and refocus the conversation where the data sent us.

Some ex Parse employees left FB to build a visualization tool based on the same white papers (honeycomb.io). I'm really hoping to add that to my tool belt again.

There's also a blog post about it here https://www.facebook.com/notes/facebook-engineering/under-th...

My YC company https://www.interana.com took a lot of lessons from this and is doing something that I think is even better for cos like Reddit, Sonos, Comcast, Bing. You can sign up if you'd like a demo :)

It would be really nice if I didn't have to fill out a form to see a demo-- having used Scuba I'd love to advocate for Interana at my current company, but it keeps getting pushed down my TODO list because of the extra friction

This is a special-purpose time-series data warehouse and a UI for querying it. If you are not Facebook, it is almost always better to do projects like this by using a standard data warehouse like Redshift or BigQuery, a queue like Kinesis, and a BI tool like Looker or Tableau. Your data won't be quite as real-time and your queries won't be quite as fast, but it will take much less engineering effort and you'll be able to use the same tools for other projects.

Shameless plug: I'm the maintainer of the Rakam project. You don't have to deal with all of these complexities. There are open-source projects that can set up that infrastructure for you. For example, we provide CloudFormation scripts that set up a data analytics cluster for you, with Kinesis, S3, PrestoDB and a RESTful API for collecting and querying data sets, so that you can use it much like a SaaS product. We also have an integrated visualization project that can connect to your Rakam API and lets you create dynamic reports and custom dashboards. You can run queries via a user interface similar to the Scuba UI, as well as complex behavioral queries such as funnel and cohort queries, and plain SQL. https://github.com/rakam-io/rakam

Here is a good list of redshift 'gotchas' - https://github.com/open-guides/og-aws#redshift-gotchas-and-l...

We ran into several of those. Notably, it's difficult to achieve the 'real-time' promise of redshift because of the huge performance hit while loading data into the DB so you have to do it off-hours. You can update a replica and then 'hot-swap' it in but this gets expensive. For operational analytics it's better to go with one of the purpose-built timeseries databases and dual write to that and your data warehouse.

The biggest gotcha listed there is how Redshift gets bogged down if you're loading a lot of tables, frequently. You can't run a production Redshift with lots of tables at <15m latency. But in most cases, Redshift is still an overall better choice than a timeseries database because:

* It has all of SQL, including JOINs

* You can use it for both timeseries data and all your other data.

I think you could build a Scuba-like UI over BigQuery that is simple to use and works great.

Wouldn't InfluxDB be a better comparison?

The advantage of general-purpose data warehouses is they give you all of SQL, and they are compatible with BI tools. Time-series data is just one of the types of data you will want to analyze. It's best to choose tools that will work for all your data sources, even if these tools are suboptimal for time-series in particular.

Yay!! So happy to see Scuba get more visibility. I can't even count the number of times I said, and heard other engineers say, that the thing they would miss most about Facebook was Scuba.

That's the whole reason why we built honeycomb.io. If you're a FB expat, check us out.

Former Scuba dev, Okay Zed, wrote http://snorkel.logv.org/. It's open source.

I gotta say, I don't think a snorkel will help the poor person in your logo.

Is there a comparison to scuba?

The engineer who built this, Lior Abraham, went on to start data analytics startup Interana (YC W13).

Yes, indeed. Cofounded with Ann Johnson and fellow ex-Facebooker, Bobby Johnson, who led the infrastructure team for many years.

Interana draws much of its inspiration from Scuba combined with learnings from analyzing massive amounts of data at scale.

Many high-growth companies use Interana for behavioral analysis of their event logs (Asana, Reddit, Imgur, Nextdoor, Bing, Azure, Tinder, SurveyMonkey, Sonos...).

How does Scuba differ from Presto, which is also developed by Facebook? It seems that it stores data in memory and has a data expiration feature, but the two also have many features in common, such as SQL and distributed processing.

Scuba made decisive tradeoffs in the functionality that it provides. Notable ones include that it doesn't support joins within a table, and doesn't provide any cross-table operations. Mostly it is used for basic filtering on constant values, and gathering summary statistics on those values. This is less of a limitation than it sounds like because when you know this ahead of time you just log your writes in a denormalized way and you don't need to join anything later.
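
To make the "log denormalized, never join" point concrete, here is a minimal sketch in Python. It is not Scuba's API; the row fields and the `summarize` helper are made up for illustration. Every attribute you might want to filter or group on is copied onto the row at write time, so a query is just a filter on constant values plus summary statistics.

```python
from collections import defaultdict

# Hypothetical denormalized log rows: every field needed for analysis is
# copied onto the row at write time, so nothing has to be joined later.
ROWS = [
    {"time": 100, "endpoint": "/home", "datacenter": "dc1", "latency_ms": 120},
    {"time": 101, "endpoint": "/home", "datacenter": "dc1", "latency_ms": 80},
    {"time": 102, "endpoint": "/home", "datacenter": "dc2", "latency_ms": 200},
    {"time": 103, "endpoint": "/feed", "datacenter": "dc1", "latency_ms": 50},
]


def summarize(rows, group_by, metric, **filters):
    """Filter on constant values, then compute count/avg/max per group."""
    groups = defaultdict(list)
    for row in rows:
        if all(row.get(key) == value for key, value in filters.items()):
            groups[row[group_by]].append(row[metric])
    return {
        key: {"count": len(vals), "avg": sum(vals) / len(vals), "max": max(vals)}
        for key, vals in groups.items()
    }


if __name__ == "__main__":
    # Latency of /home requests, broken down by datacenter.
    print(summarize(ROWS, "datacenter", "latency_ms", endpoint="/home"))
```

The cost of this style is wider rows and repeated values, which is the tradeoff the comment above describes.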

As @ot said, Presto is just a query engine and it doesn't provide a backend. It provides an API that allows it to be plugged in to different data warehousing systems. I would assume functionality depends to some extent on what your data is stored in, but in general Presto supports the full suite of standard relational db style queries.

Source: I work at FB as well. In fact I was using Scuba just now to do a quick analysis of our storage requirements for Scuba itself :)

Here is a great post from Bobby Johnson (ex-Facebooker and CTO of Interana) with his opinion on "in-memory" data stores: https://community.interania.com/t5/Blogs/The-Myth-of-In-Memo....

This was the reasoning behind a very key architectural decision at Interana that makes it different from Scuba: instead of developing an in-memory system, Interana created a custom data store that is heavily optimized around using spinning disk and CPU cache. This makes it incredibly fast and less expensive to operate massive clusters at scale.

Scuba is a complete system of log collection, storage and retrieval, and UI/visualization.

Presto would only cover the storage/retrieval part. Scuba has its own backend for that which is very optimized for the kind of queries the UI needs to support, while Presto is a generic SQL store for analytics.

How does Scuba optimize the stored data-sets for aggregation queries compared to Presto (Raptor connector)? They both use common columnar data storage techniques such as compression, delta encoding and dictionary encoding. The main difference seems to be the real-time nature of Scuba and the UI.
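
For anyone unfamiliar with the two encodings named above, here is a toy Python sketch of the generic techniques. It says nothing about the actual on-disk or in-memory formats of Scuba or Presto, and the function names are my own. Dictionary encoding replaces repeated strings with small integer ids; delta encoding stores differences between consecutive numbers, which stay small for things like timestamps.

```python
def dictionary_encode(values):
    """Replace repeated values with integer ids plus a lookup table."""
    table, ids, index = [], [], {}
    for value in values:
        if value not in index:
            index[value] = len(table)
            table.append(value)
        ids.append(index[value])
    return table, ids


def delta_encode(numbers):
    """Store the first value, then each difference from the previous value."""
    if not numbers:
        return []
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]


def delta_decode(deltas):
    """Invert delta_encode by keeping a running sum."""
    out, total = [], 0
    for delta in deltas:
        total += delta
        out.append(total)
    return out


if __name__ == "__main__":
    print(dictionary_encode(["web", "web", "db", "web"]))
    print(delta_encode([1000, 1001, 1003, 1006]))
```

Both outputs are small integers, which is what makes a later general-purpose compression pass effective on columnar data.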

Oh yeah good point, I had forgotten that Presto does not support realtime. About optimizations, I don't know the details, but for one, Scuba is C++ and Presto is Java.

Seems very similar to BigQuery. I wonder how the architectures compare.


^ This is Google's equivalent (mentioned at the end of the paper in "Related Work")

Is this replacing the Gorilla database? Or is it using it under the hood? Or do they co-exist? If so, how are they used differently?

For those who don't know, the Gorilla database is also from Facebook and they published a paper about it roughly a year ago: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf

They're complementary. The ODS system mentioned there is more for monitoring numerical metrics (like Graphite) and doesn't support logging arbitrary data.

ODS is good for top-line metrics and can handle more volume but doesn't compare to Scuba if you want to dig in and look at individual rows in your data (or even just analyze your data and group based on certain columns).

So it samples data less than or equal to a second old and at a rate determined by the person making the query?

I wonder how often the data is inaccurate given the potentially low sample size?

Not sure where you got the "less than or equal to a second old"? Maybe I'm misunderstanding what you mean?

There is no single system-wide imposed sampling rate, so it's up to you to set the sampling rate based on what sort of queries you want to be able to do with good enough accuracy. We have 1:1 rate data for some things (say errors served on a particular service), while we have ten- or hundred-thousand-to-one data for other things where there are, say, tens of millions of log lines per second.
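
A rough sketch of how per-table sampling can still give usable counts, assuming each kept row records the rate it was sampled at. The `sample_rate` field name and both helpers are my own for illustration, not a statement of how Scuba does it internally. Each kept row stands in for `sample_rate` original events, so counts are estimated by summing the rates.

```python
import random


def maybe_log(event, sample_rate, sink, rng=random):
    """Keep roughly 1 in `sample_rate` events and record the rate on the row."""
    if rng.random() < 1.0 / sample_rate:
        sink.append({**event, "sample_rate": sample_rate})


def estimated_count(rows):
    """Estimate the original event count: each kept row represents
    `sample_rate` events."""
    return sum(row["sample_rate"] for row in rows)


if __name__ == "__main__":
    sink = []
    for i in range(1_000_000):
        maybe_log({"request_id": i}, 1000, sink)
    # About 1000 rows are kept; the estimate lands near 1,000,000.
    print(len(sink), estimated_count(sink))
```

As a rule of thumb, the relative error of such an estimate shrinks roughly like 1/sqrt(k) for k kept rows, so heavily sampled tables are fine for big aggregates and poor for rare events.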

Ah yeah. I misread the pdf; the rows expire at millions per second and not after 1 second.

I was wondering about the size of the sampling error. Apparently it's negligible.

Can confirm --- was at FB. Scuba is awesome --- especially the surprisingly sophisticated statistical aggregations and the call-stack view that somebody added.

That was me. Or rather, I created a table with columns s0-s255 and built a primitive tree interface for a stacktrace dataset called Strobelight. Searching was literally if s0 == 'foo' || s1 == 'foo'... etc. This horrified the real Scuba devs enough to add proper vector type and search operators.
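
For the curious, the column-per-frame hack versus a vector column looks roughly like this in Python. This is only an illustration of the idea described above; the helper names and the `stack` field are hypothetical and not Scuba's query syntax.

```python
MAX_FRAMES = 256  # columns s0..s255, as in the comment above


def to_columns(stack):
    """Flatten a stack trace into one column per frame: s0, s1, ..."""
    return {f"s{i}": frame for i, frame in enumerate(stack[:MAX_FRAMES])}


def matches_columns(row, needle):
    """The workaround: s0 == needle || s1 == needle || ... || s255 == needle."""
    return any(row.get(f"s{i}") == needle for i in range(MAX_FRAMES))


def matches_vector(row, needle):
    """With a vector-typed column, the same search is one containment test."""
    return needle in row["stack"]


if __name__ == "__main__":
    stack = ["main", "render", "foo"]
    print(matches_columns(to_columns(stack), "foo"))
    print(matches_vector({"stack": stack}, "foo"))
```

The first form needs 256 comparisons per row and silently truncates deeper stacks, which is a decent argument for a proper vector type.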

How does this compare to Splunk on machine data or Tableau on relational data?

Compared to Tableau, Scuba is schema-less and doesn't require any setup beyond creating the table (which doesn't have any approval process in the way and gives you a reasonable amount of scratch space to test before you get serious) and then having data arrive. Once Scuba is aware of your column by you submitting data for it, it allows you to query/group based on it very quickly. Scuba is entirely real-time. There are some other projects to do pre-computed aggregates if that's important to you.

I haven't used Splunk in ~7 years, so I can't remember enough about it to compare well. Splunk has some structured-on-top-of-unstructured stuff in it, whereas Scuba is always structured. If you want to turn something unstructured into something structured, you generally run a separate pipeline to do that (using one of the tailing frameworks for your preferred language). In terms of the alarm system in Splunk, we have other systems for handling that using the data that flows into Scuba.

I don't know those platforms, but Scuba doesn't support relational operations. It's not as much of a limitation as it sounds like because you can log your data in a denormalized way.

The title made me picture Mark Zuckerberg diving into Facebook user data like Scrooge McDuck.
