
Procella: Unifying serving and analytical data at YouTube - feross
https://blog.acolyer.org/2019/09/11/procella/
======
summerlight
This paper finally got published! This is basically Dremel (or BigQuery)
capable of handling both real-time and ad-hoc analytic queries. Also, being
able to serve production traffic and do efficient ad-hoc analysis on the
same data source could drastically simplify infrastructure; sometimes the
scale of the data doesn't allow a single representation for all use cases,
and that fact brings lots of unnecessary complications. Artus also made
interesting design trade-offs on performance. I hope more details on this
format can be published later.

------
m0zg
I'm skeptical of the comparison to Capacitor. Best I can tell from the paper,
they're using about the same techniques as the seminal Abadi et al.
compressed columnar storage paper. As a result I'm having a hard time
believing that Artus is several times faster than Capacitor on similar
analytical workloads, unless we're seeing hand-picked queries here where it
does really well, or unless the performance difference is attributable to
better metadata caching and/or the ability to use inverted indexes on those
particular queries (such indexes are usually not much help unless you're
running a point query), rather than to the lower-level format and runtime itself.

TBH, Artus sounds like Capacitor with indexes and a different header format to
me. Probably something mmap-able like FlatBuffers if I were to guess; I don't
have any direct info.

~~~
WookieRushing
Having their own file format could really help them out, as it lets them be
the sole reader and writer of it. File formats for warehouses are generally a
little out of date in terms of performance because they require all the
compute engines to be able to read them, and some engines will lag behind
others in updating.

The metadata caching helps a ton too, I'm sure. When you have to issue the
same get-file-handle call on lots of nodes instead of, say, just one, you
lose out a lot on tight latencies and can cause lots of problems for the
underlying storage system.

~~~
m0zg
This isn't actually about "get file handle", although for, say, a million
files that could take a while. This is about having metadata (columns, types,
ranges for range partitions, etc.) already available to be used in query
planning.

But these kinds of optimizations only give you dramatic benefits on a very
specific and relatively small subset of queries rather than on a
realistically mixed workload, so to have a fair comparison you'd need to run
such a realistic workload on a realistic dataset.
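To make the query-planning point concrete, here is a minimal sketch of how cached per-file metadata lets a planner skip files without touching storage. The file names, column ranges, and predicate are all made up for illustration; the paper does not describe Procella's metadata store at this level.

```python
# Hypothetical sketch: pruning files at planning time using cached metadata.
# With min/max ranges already in the planner's memory, no per-file I/O
# (and no "get file handle" round trip) is needed to narrow the scan set.

# Cached per-file metadata: min/max range of a partition column (invented data)
file_metadata = {
    "part-0001": {"event_date": ("2019-01-01", "2019-03-31")},
    "part-0002": {"event_date": ("2019-04-01", "2019-06-30")},
    "part-0003": {"event_date": ("2019-07-01", "2019-09-30")},
}

def prune(files, column, lo, hi):
    """Keep only files whose cached [min, max] range overlaps [lo, hi]."""
    kept = []
    for name, meta in files.items():
        fmin, fmax = meta[column]
        if fmax >= lo and fmin <= hi:  # interval overlap test
            kept.append(name)
    return kept

# A query over one month touches a single file; deciding that required
# only the cached metadata, not opening any of the three files.
print(prune(file_metadata, "event_date", "2019-05-01", "2019-05-31"))
# -> ['part-0002']
```

This is exactly the kind of optimization that shines on selective queries and does little for a full-table scan, which is the point about mixed workloads above.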

------
bt848
Seems cool. But a lot of the things this blog gets excited over are better
understood as features of Google's platform that are available to, and
exploited by, many of their systems. E.g., disaggregated storage with
append-only immutable files is just what Colossus does.

~~~
WookieRushing
The benefits of using Google's internal tools aren't what this paper is
mainly about, though. They do benefit from them, but much more of the paper
explains how extremely careful caching and deep control over the file format
allow them to do things insanely fast.

Reading from Colossus is only done on a data-cache miss, and the data cache
has a 90% hit rate. So in effect they're getting in-memory speeds instead of
always needing to hit storage.
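A quick back-of-the-envelope calculation shows why that hit rate matters. The latency numbers below are illustrative assumptions, not figures from the paper:

```python
# Sketch: effective read latency with a 90% data-cache hit rate.
# Both latency figures are assumed for illustration only.
mem_read_ms = 0.1        # assumed in-memory cache read
colossus_read_ms = 10.0  # assumed distributed-storage read

hit_rate = 0.9
effective_ms = hit_rate * mem_read_ms + (1 - hit_rate) * colossus_read_ms
print(f"effective read latency: {effective_ms:.2f} ms")
# -> effective read latency: 1.09 ms  (~9x better than always hitting storage)
```

Under these assumed numbers, the average read is dominated by the 10% of misses, but it's still nearly an order of magnitude faster than going to storage every time.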

Otherwise there wouldn't be too much of a difference versus Dremel.

