The article focuses a lot on the implementation, but thankfully you don't need to worry about that when using it. The flexible/quick/easy UI is what seals the deal for me, combined with the fast query times and realtime data -- you can use it to query things that happened literally seconds ago. On the React team we use it to collect all dev-time JS warnings from React that our engineers see, so we can easily track which messages fire most often and how frequently they occur.
I haven't tried them extensively, but honeycomb.io and Interana are both Scuba-inspired products by ex-Facebookers. If this tool sounds at all interesting to you, I'd definitely look into using them.
I'd also like to echo what spicyj said about the UI, it is very usable. It was common for managers, PMs, and even people in completely non-technical business roles to use Scuba. It was my favorite internal product at FB and the one I miss the most.
At Parse I was able to diagnose some seriously complex performance issues for customers by splitting queries (precomputed in logs) into families and looking at probability distributions. It was amazing. Every time the customer told me XYZ was the problem, I could send them a screenshot and refocus the conversation where the data sent us.
Some ex-Parse employees left FB to build a visualization tool based on the same white papers (honeycomb.io). I'm really hoping to add that to my tool belt again.
My YC company https://www.interana.com took a lot of lessons from this and is doing something that I think is even better for companies like Reddit, Sonos, Comcast, and Bing. You can sign up if you'd like a demo :)
We ran into several of those. Notably, it's difficult to achieve the 'real-time' promise of Redshift because of the huge performance hit while loading data into the DB, so you have to do it off-hours. You can update a replica and then 'hot-swap' it in, but this gets expensive. For operational analytics it's better to go with one of the purpose-built timeseries databases and dual-write to both that and your data warehouse.
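The dual-write pattern described above can be sketched roughly like this (a toy illustration, not any particular product's API -- the class and sink names here are made up): every event fans out to a real-time timeseries sink immediately, while warehouse rows are staged and flushed in bulk later.

```python
import time

class DualWriter:
    """Toy sketch: fan each event out to a real-time timeseries sink
    and a staging buffer destined for an off-hours bulk warehouse load."""

    def __init__(self):
        self.timeseries = []       # stand-in for a real-time timeseries DB
        self.warehouse_batch = []  # rows staged for a later bulk COPY/load

    def write(self, event):
        record = {"ts": time.time(), **event}
        self.timeseries.append(record)       # queryable within seconds
        self.warehouse_batch.append(record)  # loaded in bulk off-hours

    def flush_warehouse(self):
        # Hand the staged batch to the bulk loader and reset the buffer.
        batch, self.warehouse_batch = self.warehouse_batch, []
        return batch

w = DualWriter()
w.write({"metric": "latency_ms", "value": 42})
w.write({"metric": "latency_ms", "value": 55})
batch = w.flush_warehouse()
assert len(w.timeseries) == 2          # real-time sink sees both events
assert len(batch) == 2                 # warehouse gets them in one batch
assert w.warehouse_batch == []         # buffer cleared after flush
```

The point of the split is that the operational side never waits on the warehouse's expensive load path.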
* It has all of SQL, including JOINs
* You can use it for both timeseries data and all your other data.
That's the whole reason why we built honeycomb.io. If you're a FB expat, check us out.
Interana draws much of its inspiration from Scuba, combined with lessons learned from analyzing massive amounts of data at scale.
Many high-growth companies use Interana for behavioral analysis of their event logs (Asana, Reddit, Imgur, Nextdoor, Bing, Azure, Tinder, SurveyMonkey, Sonos...).
As @ot said, Presto is just a query engine and it doesn't provide a backend. It provides an API that allows it to be plugged into different data warehousing systems. I would assume functionality depends to some extent on what your data is stored in, but in general Presto supports the full suite of standard relational-DB-style queries.
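The engine-vs-backend split can be illustrated with a toy sketch (this is not Presto's actual connector SPI, just the shape of the idea): the engine handles filtering and projection, while pluggable connectors supply rows from whatever store actually holds the data.

```python
class ListConnector:
    """Toy connector backed by an in-memory list of dicts; a real
    connector would scan Hive, Cassandra, MySQL, etc."""

    def __init__(self, rows):
        self.rows = rows

    def scan(self):
        yield from self.rows

def execute(connector, predicate, projection):
    # The "engine" part: filter and project rows from any connector,
    # without knowing or caring how the rows are stored.
    for row in connector.scan():
        if predicate(row):
            yield {col: row[col] for col in projection}

conn = ListConnector([
    {"service": "web", "errors": 3},
    {"service": "db", "errors": 0},
])
out = list(execute(conn, lambda r: r["errors"] > 0, ["service"]))
assert out == [{"service": "web"}]
```

Swap `ListConnector` for a connector over a different store and `execute` doesn't change -- which is why the same engine can sit on top of many backends.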
Source: I work at FB as well. In fact I was using Scuba just now to do a quick analysis of our storage requirements for Scuba itself :)
This was the reasoning behind a very key architectural decision at Interana that makes it different from Scuba: instead of developing an in-memory system, Interana created a custom data store that is heavily optimized around using spinning disk and CPU cache. This makes it incredibly fast and less expensive to operate massive clusters at scale.
Presto would only cover the query/retrieval part. Scuba has its own backend for that, which is heavily optimized for the kind of queries the UI needs to support, while Presto is a generic SQL query engine for analytics.
^ This is Google's equivalent (mentioned at the end of the paper in "Related Work")
For those who don't know, the Gorilla database is also from Facebook and they published a paper about it roughly a year ago: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
ODS is good for top-line metrics and can handle more volume but doesn't compare to Scuba if you want to dig in and look at individual rows in your data (or even just analyze your data and group based on certain columns).
I wonder how often the data is inaccurate given the potentially low sample size?
There is no single system-wide imposed sampling rate, so it's up to you to set the sampling rate based on what sort of queries you want to be able to answer with good enough accuracy. We have 1:1 data for some things (say, errors served by a particular service), and ten-thousand-to-one or hundred-thousand-to-one data for other things where there are tens of millions of log lines per second.
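The arithmetic behind picking a rate is simple to sketch (my own back-of-envelope model, not how Scuba itself reports error): with 1-in-N sampling you scale sampled counts back up by N, and under a Poisson approximation the relative error of that estimate shrinks as roughly one over the square root of the sampled count.

```python
import math

def estimate_count(sampled, rate):
    # With 1-in-`rate` sampling, scale the observed count back up.
    return sampled * rate

def relative_stderr(sampled):
    # Poisson approximation: relative standard error ~ 1/sqrt(sampled).
    return 1.0 / math.sqrt(sampled) if sampled else float("inf")

# 10,000 sampled rows at 1:1000 -> ~10M estimated events, ~1% relative error
est = estimate_count(10_000, 1000)
err = relative_stderr(10_000)
assert est == 10_000_000
assert abs(err - 0.01) < 1e-12
```

That's why heavy sampling is fine for hot datasets: even at 1:100,000, anything frequent enough to matter still leaves thousands of sampled rows behind.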
I was wondering about the size of the sampling error. Apparently it's negligible.
I haven't used Splunk in ~7 years, so I can't remember enough about it to compare well. Splunk has some structured-on-top-of-unstructured stuff in it, whereas Scuba is always structured. If you want to turn something unstructured into something structured, you generally run a separate pipeline to do that (using one of the tailing frameworks for your preferred language). In terms of the alarm system in Splunk, we have other systems for handling that using the data that flows into Scuba.