
I'm skeptical of the comparison to Capacitor. Best I can tell from the paper, they're using about the same techniques as the seminal Abadi et al. paper on compressed columnar storage. As a result I'm having a hard time believing that Artus is several times faster than Capacitor on similar analytical workloads. Either we're seeing hand-picked queries where it does really well, or the performance difference is attributable to better metadata caching and/or the ability to use inverted indexes on those particular queries (such indexes are usually not much help unless you're running a point query), rather than to the lower-level format and runtime itself.

TBH Artus sounds like Capacitor with indexes and a different header format to me. Probably something mmap-able like FlatBuffers, if I were to guess, but I don't have any direct info.

Hi, I'm a technical lead on Procella.

You are correct that Artus shows a significant benefit over Capacitor primarily for point lookups and where indexes can be used. However, we also see better scan performance with Artus today for a few reasons:

1) general-purpose compression is used only on disk, not in memory. The decompression cost is then paid only at cache miss time, not on every scan, at the cost of larger in-cache sizes.

2) tighter integration with our evaluation engine allows us to, for example, directly use the on-disk dictionary and run-length encoding at evaluation time rather than just for pushdowns. This could be addressed as we continue to work with the Capacitor team.

3) Artus offers a much wider API surface (beyond just #2 above) and many more on-disk encodings (to some degree, forced by #1), allowing for more optimization opportunities at the cost of increased implementation and client complexity.
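
To make point 2 concrete, here is a minimal sketch of what evaluating directly on dictionary plus run-length encoded data can look like. This is not Artus or Capacitor code; the `DictRleColumn` layout and the `encode` / `count_equal` helpers are hypothetical names made up for illustration. The idea is that an equality filter can be answered from the dictionary and the run lengths without ever materializing the decoded column.

```python
from dataclasses import dataclass


@dataclass
class DictRleColumn:
    """Hypothetical in-memory layout: a dictionary plus (code, run_length) pairs."""
    dictionary: list  # distinct values, indexed by code
    runs: list        # list of (code, run_length) tuples


def encode(values):
    """Dictionary-encode the values, then run-length encode the codes."""
    dictionary, codes, runs = [], {}, []
    for v in values:
        code = codes.get(v)
        if code is None:
            code = len(dictionary)
            codes[v] = code
            dictionary.append(v)
        if runs and runs[-1][0] == code:
            # Same code as the previous row: extend the current run.
            runs[-1] = (code, runs[-1][1] + 1)
        else:
            runs.append((code, 1))
    return DictRleColumn(dictionary, runs)


def count_equal(col, value):
    """COUNT(*) WHERE col = value, evaluated on the encoded form.

    One dictionary lookup, then one comparison per run instead of per row.
    """
    try:
        code = col.dictionary.index(value)
    except ValueError:
        return 0  # value is not in the dictionary, so no row can match
    return sum(length for c, length in col.runs if c == code)
```

With long runs the work is proportional to the number of runs rather than the number of rows, which is roughly why keeping the encoding visible to the evaluation engine can help scans and not only pushdowns.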

Having their own file format could really help them out, since it lets them be the sole reader and writer of it. File formats for warehouses are generally a little out of date in terms of performance, because every compute engine has to be able to read them and some engines will lag behind others in updating.

The metadata caching helps a ton too, I'm sure. When you have to issue the same "get file handle" call on lots of nodes instead of just, say, one, you lose out a lot on tight latencies and can cause lots of problems for the underlying storage system.

This isn't actually about "get file handle", although for, say, a million files that could take a while. This is about having metadata (columns, types, ranges for range partitions, etc) already available to be used in query planning.
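
As a rough illustration of why having ranges on hand at planning time matters, here is a small sketch of range-based file pruning. It assumes each file's metadata carries a min/max for a partition column; `FileMeta`, `min_ts`, `max_ts`, and `prune` are hypothetical names, not anything from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical cached per-file metadata used by the planner."""
    path: str
    min_ts: int  # smallest value of the partition column in the file
    max_ts: int  # largest value of the partition column in the file


def prune(files, lo, hi):
    """Keep only files whose [min_ts, max_ts] range overlaps the query range [lo, hi]."""
    return [f for f in files if f.max_ts >= lo and f.min_ts <= hi]
```

If the metadata is already cached, this check runs in the planner without touching storage, so files that cannot match are never opened at all.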

But these kinds of optimizations only give you dramatic benefits on a very specific and relatively small subset of queries rather than on a realistically mixed workload, so for a fair comparison you'd need to run that realistic workload on a realistic dataset.
