
Technical Challenges Developing a Distributed SQL Database - sickeythecat
https://blog.yugabyte.com/6-technical-challenges-developing-a-distributed-sql-database/
======
kthejoker2
As a precursor to this (excellent, in depth) post I also recommend Martin
Kleppmann's Designing Data-Intensive Applications, which is (to date) the
definitive 101 book on the topic.

He also did these awesome Tolkien-esque maps of the database engine ecosystem:
[https://martin.kleppmann.com/2017/03/15/map-distributed-
data...](https://martin.kleppmann.com/2017/03/15/map-distributed-data-
systems.html)

Anyway, I inject this sort of stuff directly into my veins, so thanks very
much for the post!

------
rekoros
Good post! May sound more credible if you have mentioned that YugaByte started
as a fork of Apache Kudu and some of the design decisions and technical
challenges that you mention were addressed by the code that was already there
before your fork.

~~~
rkarthik007
Hi rekoros,

I am the CTO/Founder of YugaByte and author of the above post. Thanks for your
comments, glad you liked the post! You make a great point about YugaByte DB
using Apache Kudu to start out, but wanted to clarify a few things.

* This is a post about our architectural decisions in building a distributed SQL database, and less so about how we actually implement it. Hence there is no mention of any starting points as far as the codebase.

* Also, we have used the RocksDB and PostgreSQL codebases in their entirety in addition to Apache Kudu, and make no secret of this fact ([https://docs.yugabyte.com/latest/architecture/concepts/ackno...](https://docs.yugabyte.com/latest/architecture/concepts/acknowledgements/)).

~~~
Justsignedup
I genuinely never thought I'd see a fully distributed db. This is the magic
bullet people been looking for in terms of having a simple ish db compared to
complex data layers.

Incredible work. And supporting full postgresql.

This does sound a lot like a marketing post. But me no cares. This is pretty
awesome I'm rarely this hyped for a technology.

~~~
ldng
I always though it was one of the pursuit of Google but NoSQL was all the rage
and I was never taken seriously ... until F5 that is.

Funnily enough, all what I would call "NoSQL papers" are very well known, yet
F5 is hardly ever talked about. As if some people would not like to admit they
might have been wrong ;-)

~~~
ddorian43
It's not talked about cause it's named F1 ;)

~~~
ldng
Yes ! Thanks for the correction.

------
Nican
I find Spanner an absolute beautiful database, and I recommend people to
checkout CockroachDB as well, which is inspired by Spanner.

~~~
zenithm
Shouldn't talk about Spanner and Percolator without discussing Calvin, either.
[https://www.infoq.com/articles/relational-nosql-
fauna](https://www.infoq.com/articles/relational-nosql-fauna)

~~~
sidch
YugaByte DB product manager here. we have compared the Spanner and Calvin
architectures in depth previously ([https://blog.yugabyte.com/google-spanner-
vs-calvin-global-co...](https://blog.yugabyte.com/google-spanner-vs-calvin-
global-consistency-at-scale/)). one key difference comes from the fact that
Calvin’s deterministic locking protocol requires advance knowledge of all
transactions’ read/write sets before transaction execution can begin -- this
leads to repeated transaction restarts if the secondary index is on a column
whose value is changing frequently as well as no support for client-initiated
sessions transactions. in other words, Calvin is better suited for a
transactional NoSQL database as opposed to a true distributed SQL database.

~~~
freels
This take on Calvin is inaccurate:

Under contention, a Calvin-based system will behave similarly to others which
use optimistic locking schemes for Serializable isolation such as Postgres, or
YB itself. There are advantages to the Calvin approach as well. For example,
under Calvin, the system doesn't have to write out speculative intents to all
data replicas in order to detect contention: The only writes to data storage
are successful ones. The original paper only describes this briefly, but you
can read about how FaunaDB has implemented it in more detail:
[https://fauna.com/blog/consistency-without-clocks-faunadb-
tr...](https://fauna.com/blog/consistency-without-clocks-faunadb-transaction-
protocol)

It's also not a stretch to see how the protocol described in that post can be
extended to support session transactions: Rather than executing a transaction
per request, the transaction context is maintained for the life of the session
and then dispatched to Calvin on commit. (This is in fact how we are
implementing it in our SQL API feature.)

I would instead say that one of the more significant differences between
Calvin and Spanner is the latter's much stricter requirements it places on its
hardware (i.e. clock accuracy) in order to maintain its correctness
guarantees; a weakness its variants also share.

~~~
sidch
Only if things were that simple :) Calvin avoids the need to track clock
accuracy by making every transaction go through a single consensus leader
which inherently becomes a single point of bottleneck for performance and
availability. Spanner and its derivatives including YugaByte DB chose not to
introduce such a bottleneck in their architecture with the use of per-shard
consensus — this means caring about clock skews for multi-shard transactions
and not allowing such transactions to proceed on nodes that exhibit skew more
than the max configured. The big question is which is more acceptable: lower
performance & availability on the normal path OR handling offending nodes with
high clock skew on the exceptions path?

~~~
freels
Sorry I can't let this go unchallenged. Again, you are inventing an
architectural deficiency where the is none. The log in Calvin is partitioned
and does not require all transactions to go through a single physical
consensus leader. There is no single node in a Calvin cluster which must
handle the entire global stream of transactions. The Calvin paper itself
extensively covers how this works in detail:
[http://cs.yale.edu/homes/thomson/publications/calvin-
sigmod1...](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf)

------
mikece
The article makes mention of Aurora and Spanner as inspiration; did CosmosDB
not exist yet? I realize Cosmos is really more of a document database but
given its ability to be queried with SQL in addition to massive horizontal
scale and global replication seems like it should be more or less in the
discussion.

Reference to querying Cosmos with SQL: [https://docs.microsoft.com/en-
us/azure/cosmos-db/how-to-sql-...](https://docs.microsoft.com/en-
us/azure/cosmos-db/how-to-sql-query)

~~~
sidch
YugaByte DB product manager here. Yes, CosmosDB and its underlying
architecture were indeed not yet publicly available when we started the
YugaByte DB project early 2016. However, classifying CosmosDB as a distributed
SQL database is a bit of stretch given no support for SQL on the write path
(the write path uses a custom document API). This means no support for multi-
row/multi-shard ACID transactions as well as features such as client-initiated
session transactions.

------
chalianwar
Following research papers from academia also explored similar problem:

1) BESPOKV: Application Tailored Scale-Out Key-Value Stores
[http://people.cs.vt.edu/~butta/docs/sc18-bespokv.pdf](http://people.cs.vt.edu/~butta/docs/sc18-bespokv.pdf)

2) ClusterOn: Building Highly Configurable and Reusable Clustered Data
Services using Simple Data Nodes
[http://people.cs.vt.edu/~butta/docs/hotstorage16-clusteron.p...](http://people.cs.vt.edu/~butta/docs/hotstorage16-clusteron.pdf)

DynomiteDB from Netflix allowed to scale singe server NoSQL stores.
[https://github.com/Netflix/dynomite](https://github.com/Netflix/dynomite).

~~~
sidch
Thanks for sharing these. Hadn’t heard of the first 2, so look fwd to reading
them. We did review the Amazon Dynamo architecture (which is used in Dynomite)
in depth but found it to be lacking for supporting even single-row
linearizability. multi-row ACID with serializable isolation builds on top that
property. These are must-haves in a storage architecture to power a
distributed SQL API.

------
protomyth
I wonder where the lessons of NonStop SQL fit into the picture? It seems like
there is some knowledge to mine there.

~~~
sidch
assume you are referring to
[https://en.wikipedia.org/wiki/NonStop_SQL](https://en.wikipedia.org/wiki/NonStop_SQL)
\-- was not aware of it so thanks for bringing it to attention. first
impression is that it was a technology way ahead of its times. in late 80s
(and some would argue even recently), majority of apps can be run easily on
monolithic/single-node SQL databases. they are not generating data at a volume
that requires multiple nodes (in this case parallel computers) and user
perception can indeed tolerate high latencies, manual failover and repair. we
are now in a different era when it comes to data volume and user perception.

~~~
qaq
And if we are being honest not much has changed 99% of the projects still can
run fine on monolithic/single-node SQL databases.

------
jayd16
Interesting that spanner doesn't support foreign keys. Is this a fundamental
limitation or just a missing feature?

~~~
joatmon-snoo
Spanner uses a hierarchical table model instead, where a parent table key is a
strict prefix of the child table key.

[https://cloud.google.com/spanner/docs/schema-and-data-
model#...](https://cloud.google.com/spanner/docs/schema-and-data-
model#creating_a_hierarchy_of_interleaved_tables)

From a technical perspective, I'd guess that the reason was because it makes
local transactions trickier to do. Spanner has a philosophy of avoiding work
that isn't required, and knows how to optimize transactions that don't require
2PC-ing data on different servers.

------
tyopiuy
Not really developing a distributed dB but using one. Inaccurate and
misleading title

~~~
penagwin
No this is about developing one. They talked about how they looked at the
existing database options and how they choose different bits/pieces.

------
triplee
That wasn't immediately relevant to me, but is definitely a good read since I
imagine I'll be dealing with these problems within a year.

Thanks!

~~~
triplee
Why was this downvoted? I literally thanked the poster for sharing this
because it was interesting and a thing I may have to run into soon.

I'm sorry you're all offended that I'm on legacy projects right now and we're
not at this point yet.

