Having their own file format could really help them out, as it lets them be the sole reader and writer of it. File formats for warehouses are generally a little out of date in terms of performance, because every compute engine has to be able to read them and some will lag behind others in updating.

The metadata caching helps a ton too, I'm sure. When you have to issue the same "get file handle" call on lots of nodes instead of just, say, one, you lose out a lot on tight latencies and can cause lots of problems for the underlying storage system.

This isn't actually about "get file handle", although for, say, a million files that could take a while. This is about having metadata (columns, types, ranges for range partitions, etc) already available to be used in query planning.
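
To make that concrete, here is a minimal sketch of how a planner could use already-cached range metadata to skip files without touching storage. Everything in it (the `FileMeta` record, the `prune` helper, the file names and key ranges) is made up for illustration and is not how any particular engine does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical cached metadata for one data file."""
    path: str
    min_key: int  # smallest partition key in the file (inclusive)
    max_key: int  # largest partition key in the file (inclusive)


def prune(files, lo, hi):
    """Keep only files whose key range overlaps the query range [lo, hi]."""
    return [f for f in files if f.max_key >= lo and f.min_key <= hi]


# Assumed to be sitting in memory at planning time, so no file is opened.
CACHED = [
    FileMeta("part-0", 0, 99),
    FileMeta("part-1", 100, 199),
    FileMeta("part-2", 200, 299),
]

# A predicate like "key BETWEEN 150 AND 220" only needs two of the three files.
selected = prune(CACHED, 150, 220)
print([f.path for f in selected])  # ['part-1', 'part-2']
```

With a million files the same overlap check runs over the cached entries instead of a million metadata reads, which is where the planning-time win would come from.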

But these kinds of optimizations only give you dramatic benefits on a very specific and relatively small subset of queries rather than on a realistically mixed workload, so to have a fair comparison you'd need to run a realistic workload on a realistic dataset.
