
We Chose a Distributed SQL Database to Complement MySQL - Caitin_Chen
https://pingcap.com/case-studies/why-we-chose-a-distributed-sql-database-to-complement-mysql
======
throwaway_pdp09
If anyone from the company's reading this: I'd guess that the requirement
"...to ensure that TiFlash processes the latest (not just fresh) data" is a
major problem. If you can tolerate data that's even a few minutes old, then I
really think you could use some dumb kind of replication to another machine,
which would be much simpler.

Also a billion rows of data isn't really so much, depending on what you're
storing.

(I'm trying to get 250 million rows (of 1 character) into a db to see how fast
it counts when in memory. Still inserting)

Edit: T-SQL, counting 268 million rows, all memory-resident, on a 7-year-old
CPU: forcing it to use 1 core takes 12 secs; allowing it to parallelise over
2 cores (all I have) takes 8 secs.

This is for comparison to "I did a simple test with five TiKV nodes, one
TiFlash node, and one table with 250 million rows of data. I tried to count
the table records. When I queried by index in TiKV, it took me more than 40
seconds to get the result. But when I queried in a single TiFlash node, it
took only 10 seconds."

I used single bytes per row but I know from prior testing that fat rows and
thin rows count about the same speed.
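If anyone wants to poke at this themselves, here's a rough, scaled-down sketch of the same kind of test using SQLite's in-memory mode (not SQL Server, and 1M rows instead of 250M, so the numbers are illustrative only):

```python
import sqlite3
import time

# In-memory database: the whole table stays RAM-resident, like the test above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c TEXT)")

# One million single-character rows (scaled down from 250M for a quick run).
conn.executemany("INSERT INTO t VALUES (?)", (("x",) for _ in range(1_000_000)))
conn.commit()

start = time.perf_counter()
(count,) = conn.execute("SELECT COUNT(*) FROM t").fetchone()
elapsed = time.perf_counter() - start

print(count)
print(f"count took {elapsed:.3f}s")
```

Obviously SQLite is single-threaded per connection, so there's no MAXDOP-style knob to compare 1 core vs 2 here.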

~~~
gregwebs
The TiFlash replication is sub-second and happens automatically. We are going
to be publishing a more extensive paper on TiFlash, but the point is that the
operational burden is equivalent to a read replica but you get the column-
based storage speed of a data warehouse.
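A toy way to see why column layout helps an aggregate-style scan (this illustrates the general idea only, not TiFlash's actual internals): an aggregate over one column only has to touch that column, while a row store drags every field of every row through memory.

```python
import time

N = 1_000_000

# Row layout: each record is a tuple of several fields.
rows = [(i, i * 2.0, "payload" * 4) for i in range(N)]

# Column layout: the same data stored as one list per field.
col_a = list(range(N))

start = time.perf_counter()
row_sum = sum(r[0] for r in rows)  # must walk every whole row
row_time = time.perf_counter() - start

start = time.perf_counter()
col_sum = sum(col_a)               # touches only the needed column
col_time = time.perf_counter() - start

assert row_sum == col_sum
print(f"row scan: {row_time:.3f}s, column scan: {col_time:.3f}s")
```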

I agree the data size doesn't seem extraordinarily big from some of the
description. Later in the article it states a size of 2.4 TB.

It is also true that in-memory databases should be faster. TiDB is designed
for data that is too large to keep in memory.

Note that having 5 TiKV nodes isn't going to make this single count query any
faster than having 1 TiKV node. TiDB is designed for high availability, so a
default TiKV setup has 3 nodes each with a copy of the data. TiKV can be
scaled out horizontally to provide more compute power to handle a larger
workload but it is not likely to make a single query on an idle machine any
faster.

~~~
jinqueeny
The link to the paper (on VLDB 2020) is now available:
[http://www.vldb.org/pvldb/vol13/p3072-huang.pdf](http://www.vldb.org/pvldb/vol13/p3072-huang.pdf)

------
exabrial
TiDB sounds cool, except it's not MySQL 8.0 compatible. We use the hell out
of CTEs and window functions.

~~~
gregwebs
It actually has all the window functions:
[https://docs.pingcap.com/tidb/stable/window-functions](https://docs.pingcap.com/tidb/stable/window-functions)

For CTEs, stay tuned for the next release (TiDB 5.0) or perhaps track this
github issue:
[https://github.com/pingcap/tidb/issues/17472](https://github.com/pingcap/tidb/issues/17472)
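For anyone wondering what the parent means by combining the two features, here's a hypothetical query that uses a CTE feeding a window function, run here against SQLite (which, like MySQL 8.0, supports both); the table and data are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 30), ("west", 20), ("west", 40)],
)

# A CTE feeding a window function: per-region running totals.
query = """
WITH ordered AS (
    SELECT region, amount FROM sales
)
SELECT region,
       amount,
       SUM(amount) OVER (PARTITION BY region ORDER BY amount) AS running
FROM ordered
"""
for row in conn.execute(query):
    print(row)
```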

------
qaq
Vitess might have been a decent option, considering they are already a MySQL
shop

~~~
Nican
I might be wrong here, but can someone comment on the process required to add
additional nodes on Vitess?

Every time I start reading about Vitess, I fall into a hole about manual
resharding, hosting multiple vitess instances per host (I think 250GB of
storage each?), and that cross-shard reads may be inconsistent.

I am taking a quick look at the TiDB documentation, and they have automatic
data re-balancing, 2PC Raft-based transactions, global snapshot isolation,
and actually some nice dashboards to analyze cluster health.

Vitess just does not seem nice from an operations perspective.

~~~
hilbertseries
Yes, if you want to add a node you have to manually reshard. It’s quite the
process. I’d add that each node is a MySQL database, and you can make each
shard as large as you want.

The data on different vitess shards is disjoint, so I don’t know how much you
need to worry about consistency for cross shard reads. And vitess doesn’t
really support cross shard transactions either.

------
scarmig
What is PingCAP's story when it comes to data protection and privacy,
particularly with its managed product?

~~~
gregwebs
We just had a SOC 2 (Type 1) examination for Security, Availability, and
Confidentiality. [1] We believe independent audits for compliance are more
meaningful than any words of assurance.

At an architectural level, TiDB Cloud (managed TiDB) currently dedicates
infrastructure to each tenant, although we are working on developing
lower-cost options with less isolation.

[1] [https://pingcap.com/blog/pingcap-successfully-completes-soc-2-type-1-examination-for-tidb-cloud](https://pingcap.com/blog/pingcap-successfully-completes-soc-2-type-1-examination-for-tidb-cloud)

