
What I find interesting about the whole TiFlash part (which is not open source, as far as I can tell) is that, looking at the symbol table of the binary, they have embedded the entirety of the ClickHouse database as their processing engine.

I hope they do open source it at some point. The more I think about it, the more I like the idea of having a transaction database do analytic predicate pushdown by just transparently querying an actual OLAP database.

I'm the product owner of TiFlash. Yes, we used ClickHouse as the compute engine for TiFlash. The project started as a modification of ClickHouse (and more or less still is). Two years ago it was in the "push the query down to an actual OLAP database" style. Later on we made the integration a lot tighter (Raft instead of binlog sync from TP or data replication for TiFlash itself; the same type system, txn, online DDL, and Coprocessor interface as TiKV; and so on) to make it more "transparent" to the query layer.

We will explain this in more detail soon.

It will be open sourced in a year or two. We need to make the code open-source ready rather than just flip a GitHub setting.

What do you mean by open-source ready? Make the project more modular and easier to contribute to? Just curious.

That is an interesting observation/claim that TiFlash is based on ClickHouse. I’m not sure what benefits ClickHouse has over the ORC/Parquet based Open Source engines like Presto/Impala.

What is emerging in HTAP is two patterns: scale-up like HANA and scale-out like TiDB 4.0. The engine/system in both cases transparently handles the merge between the OLTP delta row store and the OLAP column store (AutoETL) and there is a transparent federated query that is aware of both store types.

Does Presto or another scale-out solution transparently perform these two HTAP functions?

The reason for ClickHouse is simple: it's fast. We need it to function like a TiKV coprocessor, which mainly supports filtering and aggregation, and ClickHouse is good at aggregation and filtering. Also, a seamless, compatible integration with TiDB might take more time and be dirtier with Impala or Presto on top of an MPP layer. The price we paid is that we now have to implement the MPP layer ourselves.

Almost all data-lake-based products lose full control over the storage system, which makes it very hard to build the delta-main engine we need. To make HTAP storage transparent to the query layer, TiFlash needs a lot more control over the storage engine than a data lake can provide.
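To make the delta-main idea above a bit more concrete, here is a toy Python sketch, not TiFlash's actual design. Everything in it (the `DeltaMainStore` class, its method names, integer-only columns) is a hypothetical illustration: writes land in a row-oriented delta, reads transparently combine the delta with a column-oriented main store, and a merge step folds the delta into the main store.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Row = Dict[str, int]


@dataclass
class DeltaMainStore:
    """Toy delta-main store (hypothetical): row-oriented delta, columnar main."""

    columns: List[str]
    main_keys: List[int] = field(default_factory=list)
    main_cols: Dict[str, List[int]] = field(default_factory=dict)
    # Recent writes keyed by primary key; None marks a deleted row.
    delta: Dict[int, Optional[Row]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for name in self.columns:
            self.main_cols.setdefault(name, [])

    def upsert(self, key: int, row: Row) -> None:
        self.delta[key] = dict(row)

    def delete(self, key: int) -> None:
        self.delta[key] = None

    def scan(
        self, predicate: Callable[[Row], bool] = lambda row: True
    ) -> Iterator[Tuple[int, Row]]:
        """Read both stores; a delta entry shadows the main row with the same key."""
        for i, key in enumerate(self.main_keys):
            if key in self.delta:
                continue  # superseded or deleted by a newer delta entry
            row = {c: self.main_cols[c][i] for c in self.columns}
            if predicate(row):
                yield key, row
        for key, row in self.delta.items():
            if row is not None and predicate(row):
                yield key, row

    def merge(self) -> None:
        """Fold the delta into the columnar main store and clear the delta."""
        merged = dict(self.scan())
        self.main_keys = sorted(merged)
        self.main_cols = {
            c: [merged[k][c] for k in self.main_keys] for c in self.columns
        }
        self.delta.clear()
```

The point of the sketch is only that `scan` and `merge` both need to see and rewrite the physical layout of the main store, which is the kind of storage control the comment says a data lake does not give you.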

"I’m not sure what benefits ClickHouse has over the ORC/Parquet based Open Source engines like Presto/Impala": performance, for one.

Interesting. Why is open source important to you?

Getting downvotes... probably because my comment sounds like I think it should not be important. That's not what I mean.

There are database products like MongoDB that are closed source. I was curious about the author's reasoning in this particular case, e.g. apart from transparency, community contributions, etc.

MongoDB is not a closed source product. It is licensed under the SSPL and the source code is available to all.
