Hacker News

What open-source columnar storage do you use for 100 trillion records? It's very impressive.



Hive/Tez + ORC on HDFS and on S3. I do not think it is very impressive just by the numbers mentioned so far. The performance of data access, like how long an average query runs, is much more interesting. You can easily store 100T lines, but it is much harder to make them queryable. There is a lot of interesting work being done in that area; many in-memory engines are out there to provide access to a subset of the entire dataset at much higher speed.


Of course, I'm asking in the context of massively parallel realtime queries from customers, as in Yandex Metrika or Google Analytics. And yes, storing 100T lines with Hive+ORC is not an issue, but that is another league.

I assume that only Kudu is a competitor to ClickHouse now. Or maybe Greenplum.

Thank you for your answer.



