
TiDB – a global scale distributed DB - eigenrick
https://pingcap.github.io/blog/2017/10/17/announcement/
======
eloff
From their blog: "TiKV is a distributed key-value database. It is the core
component of the TiDB project and is the open source implementation of Google
Spanner." I think the Cockroach DB guys would take offence at that.

So how do they use a Spanner-like design without specialized hardware? I had
to really dig but the answer is "We are using the Timestamp Allocator
introduced in Percolator, a paper published by Google in 2006. The pros of
using the Timestamp Allocator are its easy implementation and no dependency on
any hardware. The disadvantage lies in that if there are multiple datacenters,
especially if these DCs are geologically distributed, the latency is really
high."

Using Spanner's design but without the real hardware that makes it practical
seems like a step backward. A Calvin[1] based approach like FaunaDB would
probably have been better - but they can't do that and still have MySQL
compatibility (you can't do interactive client-server transactions with
Calvin.)

I'm skeptical of a database built over a generic KV-store. There was a much-
hyped database a few years ago that failed to live up to the hype because of
exactly that architecture. I can't even remember the name now that I'm trying
to find the post-mortem analysis for it on Google.

I'm also skeptical of a database claiming to be good at both OLAP and OLTP.
One requires a column store, the other a row store. You can be half-decent as
an OLAP store and good as an OLTP store, but there are no column stores that
are also good at OLTP. And OLAP is where the real big-data problems of today
lie, so using something half-decent for that is likely to get you in trouble.
There's no reason a database can't do both by using two separate storage
engines under the hood, but that doesn't seem to be the case here.

[1] https://fauna.com/blog/distributed-consistency-at-scale-spanner-vs-calvin

~~~
jasonwatkinspdx
> There was a much-hyped database a few years ago that failed to live up to
> the hype because of exactly that architecture. I can't even remember the
> name now that I'm trying to find the post-mortem analysis for it on Google.

FoundationDB, acquired and taken internal by Apple. There were a few different
blog posts about the layering aspect of it, with mixed opinions.

They built a transactionally consistent distributed database that was MySQL
client compatible and within 50% of MySQL single-node performance. If creating
that and getting it bought by Apple is failing, I hope I'm lucky enough to
fail like that.

> I'm also skeptical of a database claiming to be good at both OLAP and OLTP.
> One requires a column store, the other a row store. You can be half-decent
> as an OLAP store and good as an OLTP store.

This is a rigid and narrow view. There are lots of other design possibilities,
such as fractured mirrors, column decomposition only within pages, etc.
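
(For the curious: "column decomposition only within pages" is the PAX-style layout, where a page holds a batch of rows but stores each column contiguously inside the page, so single-column scans stay cache-friendly without giving up row-at-a-time access. A rough sketch of the idea in Go; this is an illustration, not any particular engine's page format.)

```go
package main

import "fmt"

// paxPage stores a batch of rows column-major *within* the page: all ids
// together, all amounts together. A whole-row read touches one page; a scan
// over a single column touches only that column's slice.
type paxPage struct {
	ids     []int64
	amounts []float64
}

func (p *paxPage) appendRow(id int64, amount float64) {
	p.ids = append(p.ids, id)
	p.amounts = append(p.amounts, amount)
}

// sumAmounts is the OLAP-ish access path: it scans one column only.
func (p *paxPage) sumAmounts() float64 {
	var total float64
	for _, a := range p.amounts {
		total += a
	}
	return total
}

// row is the OLTP-ish access path: reassemble a full row by position.
func (p *paxPage) row(i int) (int64, float64) {
	return p.ids[i], p.amounts[i]
}

func main() {
	page := &paxPage{}
	page.appendRow(1, 9.99)
	page.appendRow(2, 25.00)
	fmt.Println(page.sumAmounts())
	fmt.Println(page.row(1))
}
```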

I wouldn't call a single db that can do both OLAP and OLTP easy or a slam
dunk, but I also wouldn't rule it out. VoltDB is well proven and worth looking
at.

~~~
eloff
Thank you, it was FoundationDB I was thinking of, post-mortem analysis from
the VoltDB guys here[1]. I'll contend they failed though because Apple bought
them as an acqui-hire and killed the product. They were acquiring good
engineers with a track record of executing well at a premium. It's a common
recruiting strategy for tech companies. It doesn't mean that FoundationDB was
a successful product - certainly it means it was not successful enough for the
founders to risk turning down a lucrative acqui-hire deal.

> This is a rigid and narrow view. There are lots of other design
> possibilities, such as fractured mirrors, column decomposition only within
> pages, etc.

Yes, definitely I over-simplified to make my point, but I'll stand by it. OLAP
and OLTP are different enough workloads, and the engineering tradeoffs for
being good at one or the other make it hard to excel at both. VoltDB is a row
store, best suited for OLTP. Vertica is a column store, best suited for OLAP.
VoltDB can also make a decent OLAP database; Vertica does not make a good OLTP
database. Both databases come out of Michael Stonebraker's work (although
given the number of people required to make complex products like VoltDB and
Vertica happen, it seems unfair to credit him unduly.)

But especially in the case of a sharded, distributed database where
communication overhead is likely to dominate many types of queries (VoltDB,
TiDB), one might not notice the inefficiency of using a poorly-suited design,
like a row store for OLAP. This can make the differences less striking. This
might also hide the inefficiency of basing everything on a generic KV store in
TiDB.

[1] https://www.voltdb.com/blog/2015/04/01/foundationdbs-lesson-fast-key-value-store-not-enough/

~~~
voidmain
Ah, yes, a completely speculative "post-mortem" by someone who knew nothing
about the technology, but wanted some free press for his own database out of
the news, knowing that the FDB team wouldn't be able to respond.

And I still can't respond to equally uninformed speculation about the FDB
transaction, because that was confidential.

But I can say that the article's technical thesis about the design space is
wrong. A layered design (and, really, all databases are layered, so we just
mean a design where distributed concurrency control is below the layer of the
data model) is good for performance, exactly because it makes it practical in
engineering terms to use the right data model for the job, often getting
asymptotic wins rather than tiny constant factor ones. Row and column stores,
hierarchical document or "table family" stores, conventional indexes,
compressed bitmaps, spatial indexes, etc can all be reasonably efficiently
mapped to K/V pairs and then take advantage of the same fast distributed
storage and concurrency control. You can use them for different indexes in the
same database or the same table. You can, just to throw out an example, store
recently changed rows in a row store while transactionally moving them into a
column representation in the background, and combine both seamlessly for
reads. And you can have things that aren't a good fit for the relational
model, like queues, graphs, and full text search indexes, in the same database
and operate on them transactionally. You can still co-locate compute with the
data, and you can actually scale indexes because they _don't_ have to be co-
located with the data as most databases wind up doing. And let's not forget
that you have at least a _chance_ to actually get the hard distributed systems
part right, as so few products do, because it is not hopelessly entangled with
your data model and execution engine.
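
To make the K/V mapping concrete, here is a rough sketch of how a row and a column representation of the same table could coexist in one ordered key space. The key scheme is invented for illustration; it is not FoundationDB's or TiKV's actual encoding.

```go
package main

import "fmt"

// Two representations of the same logical table, side by side in one key
// space. The scheme is purely illustrative:
//
//   row form:    t/<table>/r/<pk>            -> whole row
//   column form: t/<table>/c/<column>/<pk>   -> single value
//
// A background job could transactionally delete the row-form key and write
// the column-form keys; reads would merge both forms.

func rowKey(table, pk string) string {
	return fmt.Sprintf("t/%s/r/%s", table, pk)
}

func columnKey(table, column, pk string) string {
	return fmt.Sprintf("t/%s/c/%s/%s", table, column, pk)
}

func main() {
	kv := map[string]string{} // stand-in for an ordered distributed store

	// A freshly written row goes into the row representation.
	kv[rowKey("orders", "1001")] = `{"user":"alice","total":"25.00"}`

	// A background "compaction" transaction moves it to the column form.
	kv[columnKey("orders", "user", "1001")] = "alice"
	kv[columnKey("orders", "total", "1001")] = "25.00"
	delete(kv, rowKey("orders", "1001"))

	// In a real ordered store, scanning the "total" column would be a range
	// scan over keys with prefix t/orders/c/total/; here we just dump the map.
	for k, v := range kv {
		fmt.Println(k, "=>", v)
	}
}
```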

To address some specific speculations from the article: you don't have to read
metadata every transaction; you can just cache it in memory and just ask the
distributed store to make sure it hasn't changed in conflict with the
transaction (via a tiny read of a serial number, for example, or a usually-
empty range read of a log of metadata changes). A key/value store designed for
layering needs to be able to go full speed with real, cross-node, multi-key
transactions rather than focus on a single node "fast path" (like, if I recall
correctly, VoltDB does). It's also highly recommended to be able to do
concurrency control on arbitrary ranges rather than just individual keys, both
for performance and serializability reasons. Indexes, as I've already
mentioned, are _vastly_ more scalable in this type of design than most others.
And you get to decide whether you want to read from them serializably or not.
One way you can do push down (which in the modern world of 10GbE networking is
way more important for OLAP workloads than OLTP, but let's be perfectionist)
is to co-locate SQL and K/V store nodes and to let the higher layer see where
data is located in the lower layer so that it can partition queries (at least
approximately) to one of its own co-located nodes. You keep all the nice
abstractions and the statelessness of the SQL nodes, and you can read data
locally (a nice optimization would be to have a super fast shared memory path
for doing this in the database driver).
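
A minimal sketch of that metadata-caching idea, with invented names and an in-memory map standing in for the distributed store; in a real system the tiny version read would be part of the transaction's read set, so the normal conflict detection covers concurrent schema changes.

```go
package main

import (
	"errors"
	"fmt"
)

var errStaleSchema = errors.New("cached schema is stale, refresh and retry")

// kvStore stands in for the distributed K/V store.
type kvStore map[string]string

// schemaCache is the SQL layer's in-memory copy of the metadata, tagged with
// the version it was built from.
type schemaCache struct {
	version string
	tables  map[string]string
}

// useCachedSchema is what a transaction does before relying on the cache:
// one small read of the metadata-version key instead of re-reading all
// metadata on every transaction.
func useCachedSchema(kv kvStore, cache *schemaCache) error {
	if kv["meta/version"] != cache.version {
		return errStaleSchema
	}
	return nil // safe to plan and execute against cache.tables
}

func main() {
	kv := kvStore{"meta/version": "42"}
	cache := &schemaCache{
		version: "42",
		tables:  map[string]string{"orders": "(id, user, total)"},
	}

	fmt.Println(useCachedSchema(kv, cache)) // <nil>: cache is valid

	kv["meta/version"] = "43"               // a DDL elsewhere bumps the version
	fmt.Println(useCachedSchema(kv, cache)) // stale: refresh and retry
}
```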

There is _some_ cost for every abstraction. But because human ingenuity is
finite, successful abstractions on net greatly increase what we can do. We
don't write everything in machine code, or build new custom networking
hardware and protocol stacks for every piece of data we need to send, even
though there might be some microoptimization opportunities if we did.

The real performance problem that any new SQL database is going to face is
that its query planner isn't _exactly the same_ as the one people are trying
to switch from. So if the customer's random web app has 1000 queries, and 995
of them are faster on the new database and 5 of them are worse, guess what
happens? The 1000 queries by selection bias ran _acceptably_ on the old
database, but the 5 worse ones can be arbitrarily worse, because the badness
of query planning in such a rich query model isn't really bounded. So the
application is now brokenly slow. So there is going to be a long, long game of
whack-a-mole, and users' experience of the performance is going to tend to be
negative no matter how good or bad it really is. In my view, the ideal
database interface for scalable production "OLTP" applications would be as
powerful as SQL, but do less magic in the query planner - it would make it
more explicit in the query how challenging the execution plan is supposed to
be, so that you don't have queries that explode into unscalability. Pushing a
little more work onto developers in order to save the skins of the ops people
when the cardinality of some table changes.

Sorry, /rant. I don't know anything about the engineering of TiDB or TiKV and
can't comment on them. But I really strongly believe that a layered
architecture is the right one both for databases and for lots of things that
people don't think of as databases but that wind up having to solve all the
same concurrency control and fault tolerance problems.

~~~
jhugg
As I mentioned in a previous comment, you can fix some of the problems;
caching metadata is a great example.

The real issue is not whether abstractions are good, but at what layer and how
pure/leaky they should be.

If I build a SQL engine on top of RocksDB, I still need a way to scan a bunch
of tuples and apply a predicate. It's probably faster if RocksDB lets me hand
over a predicate and returns an iterator of matching tuples than if I have to
iterate on top of RocksDB. Maybe this difference is large -- maybe not. It
depends on a lot of details. Certainly a custom storage layer tuned to apply
predicates fast _is_ substantially faster.

If I build a SQL engine on top of a _distributed_ KV store, then I really want
to push the predicate scan down to the individual nodes, and I probably still
want to push the predicate down even lower. For most queries, I also want an
understanding of how the data is partitioned.
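
The difference in miniature: filtering on top of a generic iterator pulls every tuple up to the SQL layer (and, in a distributed store, across the network) before the predicate runs, while a pushdown-friendly storage interface takes the predicate and returns only matches. The interfaces below are invented for illustration; this is not RocksDB's or TiKV's API.

```go
package main

import "fmt"

// Iterator is a minimal cursor over encoded tuples, the kind of interface a
// generic K/V store exposes. Predicate is what the SQL layer wants applied.
type Iterator interface {
	Next() (tuple []byte, ok bool)
}

type Predicate func(tuple []byte) bool

// filterAbove is the "iterate on top of the store" approach: every tuple
// crosses the storage/SQL boundary before the predicate ever runs.
func filterAbove(it Iterator, pred Predicate) [][]byte {
	var out [][]byte
	for t, ok := it.Next(); ok; t, ok = it.Next() {
		if pred(t) {
			out = append(out, t)
		}
	}
	return out
}

// PushdownScanner is what a storage layer built for layering could offer
// instead: hand it the predicate, get back only matching tuples, evaluated
// where the data lives.
type PushdownScanner interface {
	Scan(pred Predicate) Iterator
}

// sliceIter is a toy Iterator over an in-memory slice, for the example.
type sliceIter struct {
	tuples [][]byte
	pos    int
}

func (s *sliceIter) Next() ([]byte, bool) {
	if s.pos >= len(s.tuples) {
		return nil, false
	}
	t := s.tuples[s.pos]
	s.pos++
	return t, true
}

func main() {
	it := &sliceIter{tuples: [][]byte{[]byte("order:1:paid"), []byte("order:2:open")}}
	paid := func(t []byte) bool { return string(t[len(t)-4:]) == "paid" }
	for _, t := range filterAbove(it, paid) {
		fmt.Println(string(t))
	}
}
```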

You can do all of this, but the abstraction gets leakier and leakier as you
start to get reasonable performance. At the time, the FDB SQL layer didn't
seem to do any of this. Maybe now, at Apple, it is much smarter and more
intertwined.

The planner issue you mention is real, but I'm slightly more optimistic that
engineers are willing to identify slow queries and figure out how to adapt
them to the new system if the rewards are clear.

N.B. If you're using SQL for KV gets/puts, or if you're joining one row to a
handful of others by primary key (e.g. looking up the items in an order), then
this stuff doesn't matter much. But if you give someone a SQL layer, odds are
they'll want to run a query sooner or later, even an OLTP-ish one.

To address the "completely speculative "post-mortem" by someone who knew
nothing about the technology" bit: I was only talking about the FDB SQL layer
performance and design, much of which was public at the time of the
acquisition.

------
foohak42
So, how does it compare to CockroachDB?

So far I've seen:

- Apache v2 license
- Aims at compatibility with MySQL (vs. Postgres for Cockroach)

~~~
shenli3514
There are some key differences between TiDB and Cockroach.

1. User interface and ecosystem

Although TiDB and CockroachDB both support SQL, TiDB is compatible with the
MySQL protocol while Cockroach chooses PostgreSQL. You can connect directly to
a TiDB server with any MySQL client (a minimal connection sketch is at the end
of this comment).

2. Architecture

The whole TiDB project is logically divided into two parts: the stateless SQL
layer (TiDB) and the distributed storage layer (TiKV). Because TiDB is built
on top of TiKV, developers are free to use TiDB or TiKV, depending on their
own business needs. If you only want a distributed key-value database, you can
use TiKV alone for higher performance and lower latency.

In a word, our system is highly layered and modularized, while CockroachDB is
a P2P system. This design also lets us use two programming languages: Go for
TiDB and Rust for TiKV, to improve storage performance.

Benefiting from the highly layered architecture, we built another project[1]
to run Apache Spark on top of TiDB/TiKV to answer complex OLAP queries. It
takes advantage of both the Spark platform and the distributed TiKV cluster.

3. Transaction model

Even though CockroachDB and TiDB both support ACID transactions, TiDB uses a
model introduced by Google's Percolator. The key feature of this model is that
it needs an independent timestamp allocator. Like Spanner, each transaction in
TiDB has a timestamp to isolate it from other transactions.

The model that CockroachDB uses is similar to the TrueTime API that Google
described in its paper. However, unlike Google, CockroachDB didn't build the
atomic clocks and GPS receivers to keep time consistent across different data
centers. Instead, it uses NTP for clock synchronization, which leads to the
problem of uncertain clock error. To solve this problem, CockroachDB adopts
the Hybrid Logical Clocks (HLC) algorithm.

4. Programming language

TiDB uses Go for the SQL layer and Rust for the storage engine layer. Because
Go has a garbage collector (GC) and a runtime, we expect it would cost us a
lot of time to tune performance. Therefore, we use Rust, a statically compiled
language without GC, for TiKV; its performance is much better. CockroachDB
only uses Go.

[1] Spark on TiKV: https://github.com/pingcap/tispark
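
Regarding point 1: since TiDB speaks the MySQL wire protocol, "any MySQL client" includes ordinary drivers. A minimal Go sketch using the stock go-sql-driver/mysql (the DSN below is a placeholder; TiDB listens on port 4000 by default):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // ordinary MySQL driver, no TiDB-specific code
)

func main() {
	// Placeholder DSN: user, password, host, and database are assumptions.
	// Because TiDB speaks the MySQL protocol, a stock driver connects as-is.
	db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var version string
	if err := db.QueryRow("SELECT VERSION()").Scan(&version); err != nil {
		log.Fatal(err)
	}
	fmt.Println("connected, server reports:", version)
}
```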

~~~
foohak42
Thanks for the details!

I'm pretty sure Cockroach uses RocksDB for the underlying storage, so it's
written in C++.

~~~
shenli3514
I mean the Raft/MVCC/transaction layers, which are on top of RocksDB and below
the SQL layer.

------
otterley
Contrary to the assertion in the press release, there's no evidence Abraham
Lincoln ever said, "the best way to predict the future is to create it." That
would be Alan Kay.

https://quoteinvestigator.com/2012/09/27/invent-the-future/

~~~
qq66
Anyone who has read any of Lincoln's papers knows that simply isn't the way he
would talk, nor really the way anyone from his time period would. The language
that we use today around innovation, envisioning futures, boldly creating the
world we want to live in, etc. is post-WWII.

------
olegkikin
Benchmark from half a year ago:

https://pingcap.github.io/blog/2017/05/23/perconalive17/#sysbench

Previous discussions:

https://news.ycombinator.com/item?id=13298664

https://news.ycombinator.com/item?id=10180503

~~~
shenli3514
We will release a new benchmark result soon.

------
BlackjackCF
Congrats to PingCAP on the 1.0 launch.

I've used TiDB for some side projects and I was really impressed. Excited to
see how things continue to develop!

------
api
Paging Jepsen. Paging Jepsen.

Edit: some work's been done:

https://pingcap.github.io/blog/2017/09/01/tidbmeetsjepsen/

~~~
anishathalye
They're also working on using a custom-built system written in Golang to test
TiDB (though this is in the early stages):
https://medium.com/@siddontang/use-chaos-to-test-the-distributed-system-linearizability-4e0e778dfc7d

~~~
siddontang
Hi @anishathalye, thanks for https://github.com/anishathalye/porcupine. It's a
great project! Would you like to help me check whether our model checking is
right with Porcupine? That would be highly appreciated!

~~~
anishathalye
Sure! Sent you an email.

------
itaifrenkel
How would you run spark on top of the production db without affecting its
performance?

~~~
c4pt0r
TiDB doesn't try to solve all the problems; there's no silver bullet, right?
And I think it's all about the workload: some analytical workloads need
CPU/network resources instead of I/O, and OLAP workloads aren't that frequent.
The storage layer and the computing layer are separated in the TiDB stack, so
I think it's possible for some workloads.

~~~
itaifrenkel
The problem is that OLAP is run by analysts/business/data-science people, and
they may mess up their queries/workloads. Without isolation, degraded
production performance could hurt the user experience.

~~~
shenli3514
We use different isolation levels and priorities for OLAP and OLTP workloads.

------
galkk
Their reference to a use case in the article sounds funny: "Migration from
MySQL to TiDB to handle tens of millions of rows of data per day".

Come on, even Excel can handle tens of millions of rows of data per day.

~~~
buryat
Excel cannot handle more than 1,048,576 rows.

https://support.office.com/en-us/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

~~~
aneutron
Not to mention that file locking would make it literally impossible to
concurrently edit the file.

