Hacker News

What open-source columnar storage do you use for 100 trillion records? It's very impressive.



Hive/Tez + ORC on HDFS and on S3. I do not think it is very impressive just by the numbers mentioned so far. The performance of data access, like how long an average query runs, is much more interesting. You can easily store 100T lines, but it is much harder to make them queryable. There is a lot of interesting work being done in that area; many in-memory engines are out there to provide access to a subset of the entire dataset at much higher speed.


Of course, I'm asking in the context of massively parallel realtime queries from customers, as in Yandex Metrika or Google Analytics. And yes, storing 100T lines with Hive+ORC is not an issue, but that is another league.

I assume that only Kudu is a competitor to ClickHouse now. Or maybe Greenplum.

Thank you for your answer.



