What I find interesting about the whole TiFlash part (which is not open source, as far as I can tell) is that, looking at the symbol table of the binary, they have embedded the entirety of the ClickHouse database as their processing engine.
I hope they do open source it at some point. The more I think about it, the more I like the idea of having a transactional database do analytic predicate pushdown by transparently querying an actual OLAP database.
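To make the idea concrete, here's a minimal sketch of that kind of transparent routing, with hypothetical names and thresholds of my own (not TiDB's actual planner): the SQL layer stays the single entry point, and a cost rule quietly sends analytic plans to the columnar replica.

    package main

    import "fmt"

    // Engine is where a (sub)query gets executed.
    type Engine int

    const (
        RowStore    Engine = iota // transactional engine: point reads/writes
        ColumnStore               // OLAP engine: big scans and aggregation
    )

    // Query is a toy logical plan: a scan with an optional aggregate
    // and an estimated row count.
    type Query struct {
        Table     string
        Aggregate string
        EstRows   int64
    }

    // route picks an engine per query. A real optimizer would do this
    // cost-based; this only illustrates the "transparent pushdown" idea.
    func route(q Query) Engine {
        if q.Aggregate != "" || q.EstRows > 100000 {
            return ColumnStore // heavy scan/agg goes to the OLAP replica
        }
        return RowStore // point lookups stay on the transactional store
    }

    func main() {
        q := Query{Table: "orders", Aggregate: "sum(amount)", EstRows: 50000000}
        fmt.Println(route(q) == ColumnStore) // true: analytic query goes to OLAP
    }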
I'm the product owner of TiFlash. Yes, we used ClickHouse as the compute engine for TiFlash. The project started as a modification of ClickHouse (and more or less still is one).
It was a "push the query down to an actual OLAP database" design two years ago. But since then we have built a much tighter integration (Raft replication instead of binlog sync, both for keeping up with the TP side and for TiFlash's own data replication, plus the same type system, transaction model, online DDL, and Coprocessor interface as TiKV) to make it more "transparent" to the query layer.
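As a rough sketch of what the Raft-based part means (types and names here are made up for illustration, not TiFlash's real interfaces): TiFlash nodes consume the same replicated log as TiKV replicas but apply it into columnar storage, so no separate binlog/ETL pipeline is needed.

    package main

    import "fmt"

    // RaftEntry is a committed log entry replicated to every replica,
    // including non-voting learners (illustrative, not the real types).
    type RaftEntry struct {
        Index uint64
        Key   string
        Value string
    }

    // ColumnStore stands in for the columnar replica's delta layer.
    type ColumnStore struct {
        applied uint64
        rows    map[string]string
    }

    // Apply consumes the Raft log in order, exactly like a voting
    // replica would, but writes into columnar storage instead of the
    // row engine. Reads at a timestamp wait until enough of the log is
    // applied, which is what keeps the replica transactionally consistent.
    func (c *ColumnStore) Apply(e RaftEntry) {
        if e.Index != c.applied+1 {
            panic("raft log must be applied in order")
        }
        c.rows[e.Key] = e.Value
        c.applied = e.Index
    }

    func main() {
        learner := &ColumnStore{rows: map[string]string{}}
        for i, kv := range [][2]string{{"k1", "v1"}, {"k2", "v2"}} {
            learner.Apply(RaftEntry{Index: uint64(i + 1), Key: kv[0], Value: kv[1]})
        }
        fmt.Println(learner.applied, learner.rows["k2"]) // 2 v2
    }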
We will publish more details on this soon.
It will be open sourced in a year or two. We need to make the code open-source ready rather than just flipping a GitHub setting.
That is an interesting observation/claim, that TiFlash is based on ClickHouse. I'm not sure what benefits ClickHouse has over the ORC/Parquet-based open source engines like Presto/Impala.
Two patterns are emerging in HTAP: scale-up, like HANA, and scale-out, like TiDB 4.0. In both cases the engine/system transparently handles the merge between the OLTP delta row store and the OLAP column store (AutoETL), and there is a transparent federated query layer that is aware of both store types.
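For concreteness, here is a minimal sketch of the merge-read half of that pattern as I understand it (my own toy illustration, not any vendor's code): a scan unions the compacted column store with the fresh row-store delta, and the delta wins on conflict.

    package main

    import "fmt"

    // mergeRead answers a scan over a delta-main store: mainPart holds
    // the compacted columnar portion; delta holds recent OLTP writes
    // not yet merged. Newer delta versions shadow main rows, and
    // tombstones (empty value) hide deleted rows.
    func mergeRead(mainPart, delta map[string]string) map[string]string {
        out := make(map[string]string, len(mainPart))
        for k, v := range mainPart {
            out[k] = v
        }
        for k, v := range delta {
            if v == "" { // tombstone: row deleted since the last merge
                delete(out, k)
                continue
            }
            out[k] = v // fresh write shadows the compacted value
        }
        return out
    }

    func main() {
        mainStore := map[string]string{"a": "1", "b": "2", "c": "3"}
        deltaStore := map[string]string{"b": "20", "c": ""} // b updated, c deleted
        fmt.Println(mergeRead(mainStore, deltaStore))       // map[a:1 b:20]
    }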
Does Presto or another scale-out solution transparently perform these two HTAP functions?
The reason for ClickHouse is simple: it's fast. We also need it to function like a TiKV coprocessor, which mainly means supporting filtering and aggregation, and ClickHouse is good at both. A seamless, compatible integration with TiDB would have taken longer and been messier if we had put Impala or Presto on top as the MPP layer. The price we paid is that we now have to implement the MPP layer ourselves.
Almost all data-lake-based products give up full control over the storage system, which makes it very hard for them to build the delta-main engine we need. To make HTAP storage transparent to the query layer, TiFlash needs far more control over the storage engine than a data lake can provide.
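To illustrate what "function like a TiKV coprocessor" means here (a sketch with made-up types, not the real Coprocessor protocol): the query layer ships the filter and a partial aggregate down to each storage node, and only small partial results cross the network.

    package main

    import "fmt"

    // CopRequest is a toy stand-in for a coprocessor request: a
    // predicate pushed down to the node that owns the data.
    type CopRequest struct {
        Filter func(v int64) bool
    }

    // handleCop runs on the storage node: it scans local data, applies
    // the filter, and returns only a partial SUM, not the matching rows.
    func handleCop(local []int64, req CopRequest) (sum int64) {
        for _, v := range local {
            if req.Filter(v) {
                sum += v
            }
        }
        return sum
    }

    func main() {
        // Data is spread across storage nodes; the query layer fans out
        // the same request and merges the partial aggregates.
        shards := [][]int64{{1, 5, 9}, {12, 3}, {7, 20}}
        req := CopRequest{Filter: func(v int64) bool { return v >= 5 }}

        var total int64
        for _, shard := range shards {
            total += handleCop(shard, req) // only a scalar crosses the network
        }
        fmt.Println(total) // 5+9+12+7+20 = 53
    }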
Getting downvotes... probably because my comment sounds like the question shouldn't matter; that's not what I mean.
There are database products, like MongoDB, that are closed source. I was curious about the author's reasoning in this particular case, e.g. apart from transparency, community contributions, etc.
I recently benchmarked TiDB (pre 4.0), CockroachDB, and YugabyteDB. TiDB outperformed the others in write throughput and latency, probably because it offers somewhat weaker isolation guarantees (snapshot isolation vs. serializable). However, for all read operations, CockroachDB performed significantly better than TiDB. It would be interesting to do a new comparison with version 4.0, as it seems from this article that they have improved performance quite a bit.
I mean, CockroachDB does not target OLAP use cases, but TiDB does. This is a fundamental design choice with tradeoffs, so I think some performance gap for OLTP use cases is expected.
I used oltpbenchmark and ran automated tests on Hetzner Cloud using Terraform and some automation scripts. The comparisons are based on the YCSB workloads as executed by oltpbenchmark.
May I ask how many instances were in the CRDB/TiDB clusters, and what the concurrency was in the benchmark?
We found that for small clusters (for example, 3 instances), CRDB can be fast on read-only workloads. Because CRDB is a single binary, about 1/3 of read operations will not involve an RPC, whereas TiDB needs an RPC for every request. For larger clusters or higher concurrency, it is a different story.
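Back-of-the-envelope version of that effect, assuming ranges and client requests are spread uniformly (my own simplification, not a CRDB measurement): with N combined SQL+storage nodes, a gateway node owns about 1/N of the data, so about 1/N of point reads skip the RPC entirely.

    package main

    import "fmt"

    func main() {
        // Fraction of point reads a combined SQL+storage node can serve
        // locally, assuming data and requests are spread uniformly.
        for _, n := range []int{3, 6, 12} {
            fmt.Printf("%2d nodes: ~%.0f%% of reads avoid an RPC\n", n, 100.0/float64(n))
        }
    }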
I tested clusters of between 3 and 12 nodes, and the differences were similar across sizes. I'm not sure how it performs for larger clusters than that, though. Also, the results on the large clusters may have been somewhat misleading because of the low scale factor, which led to higher contention on some rows.