Hacker News
Procella: Unifying serving and analytical data at YouTube (acolyer.org)
31 points by feross 4 days ago | 7 comments

This paper finally got published! This is basically Dremel (or BigQuery) capable of handling both real-time and ad-hoc analytic queries. Also, being able to serve production traffic and do efficient ad-hoc analysis on the same data source could drastically simplify infrastructure; sometimes the scale of the data doesn't allow a single representation for all use cases, and that brings lots of unnecessary complications. Artus also made interesting design trade-offs on performance. I hope more details on this format get published later.

I'm skeptical of the comparison to Capacitor. Best I can tell from the paper, they're using about the same techniques as the seminal compressed columnar storage paper by Abadi et al. As a result, I'm having a hard time believing that Artus is several times faster than Capacitor on similar analytical workloads, unless we're seeing hand-picked queries where it does really well, or unless the performance difference is attributable to better metadata caching and/or the ability to use inverted indexes on those particular queries (such indexes are usually not much help unless you're running a point query), rather than to the lower-level format and runtime itself.

TBH Artus sounds to me like Capacitor with indexes and a different header format. Probably something mmap-able like FlatBuffers, if I were to guess; I don't have any direct info.

Hi, I'm a technical lead on Procella.

You are correct that Artus shows a significant benefit over Capacitor primarily for point lookups and where indexes can be used. However, we also see better scan performance with Artus today for a few reasons:

1) General-purpose compression is used only on disk, not in memory. The decompression cost is then paid only at cache miss time, not on every scan, at the cost of larger in-cache sizes.

2) tighter integration with our evaluation engine allows us to, for example, directly use the on-disk dictionary and run-length encoding at evaluation time rather than just for pushdowns. This could be addressed as we continue to work with the Capacitor team.

3) Artus offers a much wider API surface (beyond just #2 above) and many more on-disk encodings (to some degree, forced by #1), allowing for more optimization opportunities at the cost of increased implementation and client complexity.
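
To make point 2 concrete, here is a minimal sketch of evaluating a predicate directly on run-length and dictionary encoded columns instead of decoding to rows first. It is not Artus or Capacitor code; the function names and the sample data are made up for illustration, and real engines use far more compact layouts.

```python
from typing import List, Sequence, Tuple


def rle_encode(values: Sequence[str]) -> List[Tuple[str, int]]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs: List[Tuple[str, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def count_equal_rle(runs: Sequence[Tuple[str, int]], target: str) -> int:
    """COUNT(*) WHERE col = target: one comparison per run, not per row."""
    return sum(length for value, length in runs if value == target)


def dict_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with an integer code into a sorted dictionary."""
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]


def count_equal_dict(dictionary: Sequence[str], codes: Sequence[int], target: str) -> int:
    """Resolve the target to a code once, then compare integers per row."""
    if target not in dictionary:
        return 0  # the value never occurs, so no rows need scanning
    target_code = dictionary.index(target)
    return sum(1 for c in codes if c == target_code)


# Hypothetical sample column.
column = ["US", "US", "IN", "IN", "IN", "BR", "US"]
runs = rle_encode(column)
dictionary, codes = dict_encode(column)
print(count_equal_rle(runs, "US"), count_equal_dict(dictionary, codes, "US"))
```

Both counters return the same answer as a row-by-row scan of the decoded column, but the RLE version touches one entry per run and the dictionary version does a single string lookup followed by integer comparisons.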

Having their own file format could really help them out, as it lets them be the sole reader and writer of it. File formats for warehouses are generally a little out of date in terms of performance because all the compute engines need to be able to read them, and some will lag behind others in updating.

The metadata caching helps a ton too, I’m sure. When you have to issue the same get-file-handle call on lots of nodes instead of just, say, one, you lose out a lot on tight latencies and can cause lots of problems for the underlying storage system.

This isn't actually about "get file handle", although for, say, a million files that could take a while. This is about having metadata (columns, types, ranges for range partitions, etc) already available to be used in query planning.
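
As a rough illustration of metadata being used at planning time, the sketch below prunes files by their cached key ranges before anything is opened. The `FileMeta` record, the paths, and the key ranges are all hypothetical; this is not how Procella's metadata store is actually laid out.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical cached metadata for one range-partitioned file."""
    path: str
    min_key: int  # smallest partition key in the file (inclusive)
    max_key: int  # largest partition key in the file (inclusive)


def prune_files(files: Sequence[FileMeta], lo: int, hi: int) -> List[FileMeta]:
    """Keep only files whose [min_key, max_key] overlaps the query range [lo, hi]."""
    return [f for f in files if f.max_key >= lo and f.min_key <= hi]


# Hypothetical catalog, already resident in the planner's memory.
catalog = [
    FileMeta("part-000", 0, 99),
    FileMeta("part-001", 100, 199),
    FileMeta("part-002", 200, 299),
    FileMeta("part-003", 300, 399),
]

# A query with WHERE key BETWEEN 150 AND 210 only needs two of the four files.
to_scan = prune_files(catalog, 150, 210)
print([f.path for f in to_scan])
```

The planner decides which files to read using only the in-memory ranges, so no storage call is made for files that cannot contain matching rows.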

But these kinds of optimizations only give you dramatic benefits on a very specific and relatively small subset of queries rather than on a realistically mixed workload, so for a fair comparison you'd need to run a realistic workload on a realistic dataset.

Seems cool. But a lot of the things this blog gets excited over are better understood as features of Google’s platform that are available to and exploited by many of their systems. E.g., disaggregated storage with append-only immutable files is just what Colossus does.

The benefits of using Google’s internal tools aren’t what this paper is mainly about, though. They do benefit from them, but much more of the paper explains how extremely careful caching and deep control over the file format allow them to do things insanely fast.

Reading from Colossus is only done on a data cache miss, and the data cache has a 90% hit rate. So in effect they’re getting in-memory speeds instead of always needing to hit storage.
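
For a back-of-the-envelope feel for what a 90% hit rate means, the average read latency is a weighted mix of the hit and miss paths. Only the 90% figure comes from this thread; the two latency values below are invented placeholders, not measurements of Procella or Colossus.

```python
def expected_read_latency_ms(hit_rate: float, cache_ms: float, storage_ms: float) -> float:
    """Average latency = hit_rate * cache latency + (1 - hit_rate) * storage latency."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_ms + (1.0 - hit_rate) * storage_ms


HIT_RATE = 0.90    # figure quoted in the thread
CACHE_MS = 0.1     # hypothetical in-memory read latency
STORAGE_MS = 10.0  # hypothetical remote storage read latency

avg = expected_read_latency_ms(HIT_RATE, CACHE_MS, STORAGE_MS)
print(f"average read latency: {avg:.2f} ms")  # roughly 1.09 ms with these placeholders
```

With these made-up numbers, nine reads out of ten complete at cache speed, while the remaining misses still contribute most of the average. How much they matter in practice depends on the real gap between cache and storage latency.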

Otherwise there wouldn’t be too much of a difference versus Dremel.
