So how do they use a Spanner-like design without specialized hardware? I had to really dig but the answer is "We are using the Timestamp Allocator introduced in Percolator, a paper published by Google in 2006. The pros of using the Timestamp Allocator are its easy implementation and no dependency on any hardware. The disadvantage lies in that if there are multiple datacenters, especially if these DCs are geologically distributed, the latency is really high."
Using Spanner's design but without the real hardware that makes it practical seems like a step backward. A Calvin based approach like FaunaDB would probably have been better - but they can't do that and still have MySQL compatibility (you can't do interactive client-server transactions with Calvin.)
I'm skeptical of a database built over a generic KV-store. There was a much-hyped database a few years ago that failed to live up to the hype because of exactly that architecture. I can't even remember the name now when I was trying to find the post-mortem analysis for it in Google.
I'm also skeptical of a database claiming to be good at both OLAP and OLTP. One requires a column store, the other a row store. You can be half-decent as an OLAP store and good as a OLTP store. There are no column-stores that are also good at OLTP. But OLAP is the real big-data problems of today, and using half-decent for that is likely to get you in trouble. There's no reason a database can't do both by using two separate storage engines under the hood, but that doesn't seem to be the case here.
FoundationDB, acquired and taken internal by Apple. There were a few different blog posts about the layering aspect of it, with mixed opinions.
They built a transactionally consistent distributed database that was mysql client compatible and within 50% of mysql single node performance. If creating that and getting it bought by Apple is failing, I hope I'm lucky enough to fail like that.
> I'm also skeptical of a database claiming to be good at both OLAP and OLTP. One requires a column store, the other a row store. You can be half-decent as an OLAP store and good as a OLTP store.
This is a rigid and narrow view. There's lots of other design possibilities such as fractured mirrors, column decommission only within pages, etc.
I wouldn't call a single db that can do both OLAP and OLTP easy or a slam dunk, but I also wouldn't rule it out. VoltDB is well proven and worth looking at.
> his is a rigid and narrow view. There's lots of other design possibilities such as fractured mirrors, column decommission only within pages, etc.
Yes definitely I over-simplified to make my point, but I'll stand by it. OLAP and OLTP are different enough workloads and the engineering tradeoffs for being good at one of the other make it hard to excel at both. VoltDB is a row store, best suited for OLTP. Vertica is a column store best suited for OLAP. VoltDB can also make a decent OLAP database. Vertica does not make a good OLTP database. Both databases come out of Michael Stonebraker's work (although given the number of people required to make complex products like VoltDB and Vertica happen, it seems unfair to credit him unduly.)
But especially in the case of a sharded, distributed database where communication overhead is likely to dominate many types of queries (VoltDB, TiDB), one might not notice the inefficiency of using a poorly-suited design, like a row store for OLAP. This can make the differences less striking. This might also hide the inefficiency of basing everything on a generic KV store in TiDB.
The VoltDB post is also wrong in many of its particulars, but I’ve gotten into that argument on this site way too many times by now (and last I checked John agreed that some of its claims were overly hasty).
If there’s a particular question about FoundationDB’s performance characteristics that you’re curious about, I can probably give you a straight answer provided that it was once public information.
To be clear, I have a lot of respect for FDB and what their team did; it was only the SQL layer put on top that seemed poorly thought through.
And I still can't respond to equally uninformed speculation about the FDB transaction, because that was confidential.
But I can say that the article's technical thesis about the design space is wrong. A layered design (and, really, all databases are layered, so we just mean a design where distributed concurrency control is below the layer of the data model) is good for performance, exactly because it makes it practical in engineering terms to use the right data model for the job, often getting asymptotic wins rather than tiny constant factor ones. Row and column stores, hierarchical document or "table family" stores, conventional indexes, compressed bitmaps, spatial indexes, etc can all be reasonably efficiently mapped to K/V pairs and then take advantage of the same fast distributed storage and concurrency control. You can use them for different indexes in the same database or the same table. You can, just to throw out an example, store recently changed rows in a row store while transactionally moving them into a column representation in the background, and combining both seamlessly for reads. And you can have things that aren't a good fit for the relational model, like queues, graphs, and full text search indexes, in the same database and operate on them transactionally. You can still co-locate compute with the data, and you can actually scale indexes because they don't have to be co-located with the data as most databases wind up doing. And let's not forget that you have at least a chance to actually get the hard distributed systems part right, as so few products do, because it is not hopelessly entangled with your data model and execution engine.
To address some specific speculations from the article: you don't have to read metadata every transaction; you can just cache it in memory and just ask the distributed store to make sure it hasn't changed in conflict with the transaction (via a tiny read of a serial number, for example, or a usually-empty range read of a log of metadata changes). A key/value store designed for layering needs to be able to go full speed with real, cross-node, multi-key transactions rather than focus on a single node "fast path" (like, if I recall correctly, VoltDB does). It's also highly recommended to be able to do concurrency control on arbitrary ranges rather than just individual keys, both for performance and serializability reasons. Indexes, as I've already mentioned, are vastly more scalable in this type of design than most others. And you get to decide whether you want to read from them serializably or not. One way you can do push down (which in the modern world of 10GBE networking is way more important for OLAP workloads than OLTP, but let's be perfectionist) is to co-locate SQL and K/V store nodes and to let the higher layer see where data is located in the lower layer so that it can partition queries (at least approximately) to one of its own co-located nodes. You keep all the nice abstractions and the statelessness of the SQL nodes, and you can read data locally (a nice optimization would be to have a super fast shared memory path for doing this in the database driver).
There is some cost for every abstraction. But because human ingenuity is finite, successful abstractions on net greatly increase what we can do. We don't write everything in machine code, or build new custom networking hardware and protocol stacks for every piece of data we need to send, even though there might be some microoptimization opportunities if we did.
The real performance problem that any new SQL database is going to face is that its query planner isn't exactly the same as the one people are trying to switch from. So if the customer's random web app has 1000 queries, and 995 of them are faster on the new database and 5 of them are worse, guess what happens? The 1000 queries by selection bias ran acceptably on the old database, but the 5 worse ones can be arbitrarily worse, because the badness of query planning in such a rich query model isn't really bounded. So the application is now brokenly slow. So there is going to be a long, long game of whack-a-mole, and users' experience of the performance is going to tend to be negative no matter how good or bad it really is. In my view, the ideal database interface for scalable production "OLTP" applications would be as powerful as SQL, but do less magic in the query planner - it would make it more explicit in the query how challenging the execution plan is supposed to be, so that you don't have queries that explode into unscalability. Pushing a little more work onto developers in order to save the skins of the ops people when the cardinality of some table changes.
Sorry, /rant. I don't know anything about the engineering of TiDB or TiKV and can't comment on them. But I really strongly believe that a layered architecture is the right one both for databases and for lots of things that people don't think of as databases but that wind up having to solve all the same concurrency control and fault tolerance problems.
The real issue is not whether abstractions are good, but at what layer and how pure/leaky they should be.
If I build a SQL engine on top of RocksDB, I still need a way to scan a bunch of tuples and apply a predicate. It's probably faster if RocksDB lets me hand over a predicate and returns an iterator of matching tuples than if I have to iterate on top of rocks DB. Maybe this difference is large -- maybe not. It depends on a lot of details. Certainly a custom storage layer turned to apply predicates fast is substantially faster.
If I build a SQL engine on top of a distributed KV store, then I really want to push the predicate scan down to the individual nodes, and I probably still want to push the predicate down even lower. For most queries, I also want to have understanding of how data is partitioned.
You can do all of this, but the abstraction gets leakier and leakier as you start to get reasonable performance. At the time, the FDB SQL layer didn't seem to do any of this. Maybe not at Apple it is much smarter and more intertwined.
The planner issue you mention is real, but I'm slightly more optimistic that engineers are willing to identify slow queries and figure out how to adapt them to the new system if the rewards are clear.
N.B. If you're using SQL for KV gets/puts, or if you're joining one row to a handful of others by primary key (e.g. lookup order items in order), then this stuff doesn't matter much. But if you give someone a SQL layer, odds are they'll want to run a query sooner or later, even an OLTP-ish one.
To address the "completely speculative "post-mortem" by someone who knew nothing about the technology" bit: I was only talking about the FDB SQL layer performance and design, much of which was public at the time of the acquisition.
Regarding FDB's architecture, I'm going to agree with you wholeheartedly that the world is built on layers of abstractions. It's how we make any progress at anything. It's useful, and it can save time that can be focused on more important or beneficial areas. I would hope nobody in CS would disagree with your premise there.
However, there's a reason nobody writes a high performing database in Python. Or just uses the filesystem as a KV store. Abstractions are obviously useful, but it's equally trite to say they're always beneficial.
So the question with FDB: Ddoes basing everything on a generic KV abstraction yield enough benefits to make up for the losses in performance? I'm not qualified to say. Clearly the developers of FDB think so, and they'd be the ones most likely to know.
My opinion, as a database developer with many years experience is that any performance losses from building over a generic KV-store are probably solvable or mostly solvable. They also likely pale into insignificance next to the performance losses from trying to run a relational database over a distributed cluster. I'd question the sanity of doing that when the vast majority of people can fit their entire OLTP dataset in memory on a single server. I know I'm in the minority with that opinion though.
I think the parent is referring to performance. FoundationDB had a fast key-value store, but they supposedly weren't able to get good performance out of their SQL layer.
Careful, though, because one has to consider life after the acquisition if one loves the technology.
The article you're referring to is about Foundation DB written by John Hugg of VoltDB: https://www.voltdb.com/blog/2015/04/01/foundationdbs-lesson-...
The database world is rapidly changing and we're seeing new technologies that are enabling a lot more than we could do before. Just looking at some of the SQL on X engines alone shows massive improvements.
However, I would argue that page layout doesn't matter that much in terms of system performance in very large scale-out systems. Other factors often dominate. I've seen row stores crush columnar stores for query performance, not because the row store was efficient (it wasn't) but because the inefficiency due to relative organization was offset by even greater efficiencies elsewhere in the architecture relative to the columnar store. Hence why using one of the "pretty good but not ideal for read and write" page engine designs and focusing elsewhere for query performance is a reasonable design choice.
I would agree that a generic KV-store is not a particularly efficient basis for a general-purpose database. You need to get much more query selectivity out of the content-addressability than a KV-store typically gives you for non-trivial data models.
The only database I can remember that did something like this was FoundationDB. They had the core layer as a K/V, and a layer above that was FSQL or similar. They got acquired by Apple, and most of their material was taken off line. I don't know if they were a failure (IIRC, some people were pretty impressed with the core NoSQL DB), but an exit to Apple may not have been all bad - although I would bet that Apple is not using the technology internally.
IME, my beef with the new NewSQL solutions is that they tend to be worse at both NoSQL and SQL workloads (specifically none are "the best" at OLAP and OLTP) except at one specific use case - global-replication. Getting global replication requires a cost (either monetary like Spanner, or performance wise like everything else), and global-replication just doesn't seem like a killer feature to me. Most times I need data in two datacenters globally, I wouldn't replicate the data anyways (don't want EU data in the US), and other times my SLAs just aren't tight enough to warrant the cost.
So far i've seen:
- Apache v2 license
- Aims at compatibility with Mysql vs postgres for cockroach
1. User interface and eco-system
Despite that TiDB and CockroachDB both support SQL, TiDB is compatible with MySQL protocol while Cockroach chooses PostgreSQL. You can directly connect to TiDB server with any MySQL client.
The whole TiDB project is logically divided into two parts: the stateless SQL layer (TiDB) and distributed storage layer (TiKV). As TiDB is built on top of TiKV, developers have the freedom to choose to use TiDB or TiKV, depending on their own business. If you only want a distributed Key-Value database, you can just use TiKV alone for higher performance and lower latency.
In a word, our system is highly-layered and modularized while CockroachDB is a P2P system. The design of our system results in the fact that we use two programming languages: Go for TiDB and Rust for TiKV to improve the storage performance.
And benefit by the highly-layered architecture, we build another project to run Apache Spark to on top of TiDB/TiKV to answer the complex OLAP queries. It takes advantages of both the Spark platform and the distributed TiKV cluster.
3. Transaction model
Even though CockroachDB and TiDB both support ACID transaction, TiDB uses a model introduced by Google’s Percolator. The key feature of this model is that it needs an independent timestamp allocator. Like Spanner, each transaction in TiDB will have a timestamp to isolate different transactions.
The model that CockroachDB uses is similar to the TrueTime API that Google described in its paper. However, unlike Google, CockroachDB didn’t build the atomic clocks and GPS receivers to keep the time consistent across different data centers. Instead, it uses NTP for clock synchronization, which leads to the problem of uncertain errors. To solve this problem, CockroachDB adapts the Hybrid Logical Clocks (HLC) algorithm.
4. Programming Language
TiDB uses Go for the SQL layer and Rust for the storage engine layer. As Go has a Garbage Collector (GC) and runtime, we think it will cost us days to tune the performance. Therefore, we use Rust, a static language, for TiKV. Its performance is much better. CockroachDB only uses Go.
 Spark on TiKV: https://github.com/pingcap/tispark
CockroachDB uses RocksDB for the storage engine on each node, and that's written in C++ (which can be viewed as good or bad).
I'm pretty sure cockroach use RocksDB for the underlying storage so it's written in C++.
I've used TiDB for some side projects and I was really impressed. Excited to see how things continue to develop!
Edit: some work's been done:
Come on, even Excel can handle tens of millions of rows of data per day.