TBH, Artus sounds to me like Capacitor with indexes and a different header format. Probably something mmap-able like FlatBuffers, if I were to guess; I don't have any direct info.
You are correct that Artus shows a significant benefit over Capacitor primarily for point lookups and where indexes can be used. However, we also see better scan performance with Artus today for a few reasons:
1) General-purpose compression is used only on disk, not in memory. The decompression cost is then paid only at cache-miss time, not on every scan, at the cost of larger in-cache sizes.
2) Tighter integration with our evaluation engine lets us, for example, use the on-disk dictionary and run-length encodings directly at evaluation time rather than only for pushdowns. This could be addressed as we continue to work with the Capacitor team.
3) Artus offers a much wider API surface (beyond just #2 above) and many more on-disk encodings (to some degree, forced by #1), allowing for more optimization opportunities at the cost of increased implementation and client complexity.
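To make #2 concrete, here's a toy sketch (this is not Artus's actual API; the function names and encodings are made up for illustration) of evaluating a predicate directly on a dictionary + run-length encoded column: the predicate runs once per dictionary entry, and run lengths are aggregated directly, rather than decoding every row value first.

```python
# Hypothetical sketch: counting matching rows on a dictionary + RLE
# encoded column without materializing per-row values.

def count_matching(dict_values, run_value_ids, run_lengths, predicate):
    """Count rows where predicate(value) holds.

    dict_values:   the column's dictionary, e.g. ["DE", "FR", "US"]
    run_value_ids: per-run index into dict_values
    run_lengths:   number of consecutive rows in each run
    """
    # Evaluate the predicate once per dictionary entry, not once per row.
    matches = [predicate(v) for v in dict_values]
    # Sum the lengths of runs whose dictionary id matched.
    return sum(n for vid, n in zip(run_value_ids, run_lengths) if matches[vid])

# 10 rows of "US", 5 of "DE", then 3 more of "US" -> 13 matching rows
print(count_matching(["DE", "FR", "US"], [2, 0, 2], [10, 5, 3],
                     lambda v: v == "US"))  # prints 13
```

For low-cardinality columns this turns an O(rows) scan into O(dictionary size + runs), which is why keeping the on-disk encoding usable at evaluation time matters.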
The metadata caching helps a ton too, I'm sure. When you have to issue the same get-file-handle request on lots of nodes instead of, say, just one, you give up tight latencies and can cause lots of problems for the underlying storage system.
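The idea can be sketched with a memoized handle lookup (the path and return value here are invented for illustration; real metadata caching would live in the serving layer, not a Python decorator):

```python
# Hypothetical sketch: cache file handles per process so repeated opens
# of the same file don't each turn into a metadata-service lookup.
import functools

lookups = 0  # stands in for RPCs sent to the metadata service

@functools.lru_cache(maxsize=1024)
def get_file_handle(path):
    global lookups
    lookups += 1  # only incremented on a cache miss
    return f"handle:{path}"

for _ in range(100):  # 100 reads of the same (made-up) file...
    get_file_handle("/cns/table/shard-0001")
print(lookups)        # ...cost only 1 metadata lookup; prints 1
```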
But these kinds of optimizations only give you dramatic benefits on a very specific and relatively small subset of queries, not on a realistically mixed workload, so a fair comparison would need to run a realistic workload on a realistic dataset.
Reading from Colossus happens only on a data-cache miss, and the data cache has a 90% hit rate. So in effect they're getting in-memory speeds instead of always having to hit storage.
Otherwise there wouldn't be too much of a difference versus Dremel.
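As a back-of-envelope illustration of what a 90% hit rate buys (the latency figures below are assumptions for the sake of arithmetic, not measured numbers):

```python
# Expected read latency with a 90% data-cache hit rate.
# Latency values are illustrative assumptions, not measurements.

hit_rate = 0.90
cache_latency_ms = 0.1      # assumed in-memory cache read
colossus_latency_ms = 10.0  # assumed distributed-storage read

expected_ms = hit_rate * cache_latency_ms + (1 - hit_rate) * colossus_latency_ms
print(f"{expected_ms:.2f} ms")  # prints "1.09 ms"
```

Under these assumptions the average read is ~9x faster than always going to storage, even though misses are 100x slower than hits.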