
TiDB 4.0 GA Release - ceohockey60
https://pingcap.com/blog/tidb-4.0-ga-gearing-you-up-for-an-unpredictable-world-with-real-time-htap-database/
======
posnet
What I find interesting about the whole TiFlash part (which is not open source
as far as I can tell) is that, looking at the symbol table of the binary, they
have embedded the entirety of the ClickHouse database as their processing
engine.

I hope they do open source it at some point. The more I think about it, the
more I like the idea of having a transactional database do analytic predicate
pushdown by just transparently querying an actual OLAP database.

~~~
sradman
That is an interesting observation/claim, that TiFlash is based on ClickHouse.
I’m not sure what benefits ClickHouse has over the ORC/Parquet-based open
source engines like Presto/Impala.

Two patterns are emerging in HTAP: scale-up like HANA and scale-out like TiDB
4.0. In both cases the engine/system transparently handles the merge between
the OLTP delta row store and the OLAP column store (AutoETL), and there is a
transparent federated query that is aware of both store types.

Does Presto or another scale-out solution transparently perform these two HTAP
functions?
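
For concreteness, the merge pattern I mean looks roughly like this (toy Python,
illustrative names only, not HANA or TiDB internals): recent writes live in a
row-format delta, older data in a column-format base, and an analytic read
merges both so the query layer sees one table.

```python
column_base = {"id": [1, 2, 3], "amount": [100, 200, 300]}       # compacted OLAP store
row_delta = [{"id": 2, "amount": 250}, {"id": 4, "amount": 50}]  # fresh OLTP writes


def merged_rows():
    # start from the column store, then overlay/append the delta (newest wins)
    rows = {i: a for i, a in zip(column_base["id"], column_base["amount"])}
    for r in row_delta:
        rows[r["id"]] = r["amount"]
    return rows


def total_amount():
    # an analytic query that is already correct before the delta is compacted
    return sum(merged_rows().values())


print(total_amount())   # 100 + 250 + 300 + 50 = 700
```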

~~~
ilovesoup
The reason for ClickHouse is simple: it's fast. We need it to function like a
TiKV coprocessor, which mainly means supporting filtering and aggregation, and
ClickHouse is good at both. It also would have taken more time, and been
messier, to do a seamless and compatible integration with TiDB if we had put
Impala or Presto on top of an MPP layer. The price we pay is that we are now
implementing the MPP layer ourselves.

Almost all data-lake-based products give up full control over the storage
system, which makes it very hard for them to build the delta-main engine we
need. To make HTAP storage transparent to the query layer, TiFlash needs far
more control over the storage engine than a data lake can provide.
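
To illustrate the coprocessor-style pushdown I mean (a generic toy sketch in
Python, not TiKV/TiFlash code): each storage node evaluates the filter and a
partial aggregate locally and only sends the partial result back, so the query
layer merges numbers instead of pulling raw rows.

```python
REGIONS = [  # pretend each list is the data held by one storage node
    [{"k": 1, "v": 10}, {"k": 2, "v": 20}],
    [{"k": 3, "v": 30}, {"k": 4, "v": 40}],
    [{"k": 5, "v": 50}],
]


def coprocessor_sum(region, predicate):
    # runs "inside" the storage node: filter + aggregate, return one number
    return sum(row["v"] for row in region if predicate(row))


def distributed_sum(predicate):
    # the query layer fans out the pushed-down request and adds the partials
    return sum(coprocessor_sum(region, predicate) for region in REGIONS)


print(distributed_sum(lambda row: row["v"] >= 30))   # 30 + 40 + 50 = 120
```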

------
orhanhh
I recently benchmarked TiDB (pre 4.0), CockroachDB and YugabyteDB. TiDB
outperformed the others for write throughput and latency, probably because it
has a bit weaker isolation guarantees (snapshot vs serializable). However, for
all read operations, CockroachDB performed significantly better than TiDB. It
would be interesting to do a new comparison with version 4.0, as it seems from
this article that they have improved performance quite a bit.

~~~
nujabe
How did you benchmark?

~~~
orhanhh
I used oltpbenchmark and ran automated tests on Hetzner Cloud using Terraform
and some automation scripts. The comparisons are based on the YCSB workloads
executed by oltpbenchmark.

~~~
youjiali1995
Hi, I want to reproduce it. Could you tell me what scale factor you used and
whether you changed the default weights?

~~~
orhanhh
I used 100 as the scale factor and used different weights for benchmarks A
through F, defined by the original YCSB project here:
[https://github.com/brianfrankcooper/YCSB/tree/master/workloa...](https://github.com/brianfrankcooper/YCSB/tree/master/workloads)
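
For reference, the core workload mixes are roughly the following (please
double-check the workload files in the linked repo for the exact parameters and
request distributions):

```python
# Standard YCSB core workload mixes, as defined in the linked YCSB repo.
YCSB_WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},             # update heavy
    "B": {"read": 0.95, "update": 0.05},             # read mostly
    "C": {"read": 1.00},                             # read only
    "D": {"read": 0.95, "insert": 0.05},             # read latest
    "E": {"scan": 0.95, "insert": 0.05},             # short range scans
    "F": {"read": 0.50, "read_modify_write": 0.50},  # read-modify-write
}

for name, mix in YCSB_WORKLOADS.items():
    print(name, mix)
```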

~~~
shenli3514
May I know how many instances there were in the CRDB/TiDB clusters, and the
concurrency used in the benchmark? We found that for small clusters (for
example, 3 instances) CRDB can be fast on read-only workloads: because CRDB is
a single binary, about 1/3 of read operations do not involve an RPC, whereas
TiDB needs an RPC for every request. For a larger cluster or higher
concurrency, it is a different story.

~~~
orhanhh
I tested clusters with between 3 and 12 nodes, and the differences were
similar for the different sizes. I’m not sure how it performs for larger
clusters than that though. Additionally, the results might have been a bit
misleading on the large clusters because of the low scale factor, leading to
higher contention on some rows.

------
sanxiyn
Does anyone find it unfortunate that they abandoned optimistic concurrency
control as default because it is less compatible with MySQL?
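
At least it is still configurable; if I remember the variable name correctly,
tidb_txn_mode lets a session opt back into optimistic transactions. A minimal
sketch using pymysql (host/port/credentials are placeholders and the accounts
table is hypothetical):

```python
import pymysql

# connect to a TiDB cluster (placeholders; TiDB commonly listens on port 4000)
conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", database="test")
with conn.cursor() as cur:
    # switch this session back to optimistic concurrency control
    cur.execute("SET SESSION tidb_txn_mode = 'optimistic'")
    cur.execute("BEGIN")
    cur.execute("UPDATE accounts SET balance = balance - 1 WHERE id = 1")
    cur.execute("COMMIT")
conn.close()
```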

