
Designing Schemaless, Uber Engineering’s Scalable Datastore Using MySQL (2016) - mangatmodi
https://eng.uber.com/schemaless-part-one/
======
dcposch
FYI, MySQL has a fresh new JSON data type now.

It has some great properties. It lets you mix data with a strict schema and
data without a strict schema, getting some of the benefits of both worlds.

The JSON datatype avoids many of the annoying legacy considerations that other
SQL column types have. You don't have to specify a length--so you won't make a
VARCHAR(255), then get burned when one day a value has more than 255
characters. You don't have specify a character encoding--JSON is always
utf8mb4, the right one. (MySQL's 'utf8' encoding, perversely, supports only a
subset of utf8 and will break if you try to write an emoji.)

Here's a table that illustrates some of the power:

    
    
      create table unitType (
        id bigint not null auto_increment,
        buildingId bigint not null,
      
        info json,
        name varchar(255) as (info->>'$.name') not null,
      
        primary key(id),
        foreign key (buildingId) references building(id) on delete cascade,
        unique key(buildingId, name)
      );
    

We're modeling unit types in a building. For example, one building might
contain 1-bedrooms, some nicer 1-bedrooms, and some 2-bedroom units.

\- It's very easy to add new fields. If, tomorrow, we decide that each unit
type needs a `minSqft` and `maxSqft`, I can add them with no database
migration.

\- We still get most of the benefits of a schema. The database makes it
impossible for a unitType to exist that does not belong to a building. The
database also makes it impossible for a single building to have two unitTypes
with the same name. (With a truly schemaless DB like Mongo, the complexity of
preventing or dealing with those kinds of invalid data end up in the
application code.)

\- It makes it easy to use SQL directly, with no ORM. SQL is a powerful
language; ORMs are often a leaky abstraction and a source of unessential
complexity. With JSON columns for extensibility, you end up with way fewer
migrations and way less need for auto-generated SQL.

\- Computed columns (like name above) are really powerful.

Most of the above is possible in Postgres as well. Postgres does not have
computed columns, as far as I can tell.

\--

This is just to say: 99% of people on Hacker News are closer to where we are
(rapid prototype phase) than where Uber is (Web ScaleTM). If that's you,
consider just using JSON columns to maximize your development velocity! You
can always do something fancier (like Schemaless) later on.

~~~
nerfhammer
> MySQL's 'utf8' encoding, perversely, supports only a subset of utf8 and will
> break if you try to write an emoji.

Almost every programming platform makes the same mistake. Characters outside
of the basic multilingual plane (i.e., characters taking more than two bytes
to store) will break certain string functions.

It's due to using (fixed length) ucs2 for in-memory storage, instead of
variable-length utf8. Imagine, for example, if you tried to replace a one-byte
ascii character at a given index with a 4 byte cuneiform character - you would
need to reallocate, copy+transpose the entire string buffer. Every string
function would need to take variable byte length into account and would need
to traverse the entire string with a state machine to do any operation, plus
maybe a full reallocate+strcpy.

So instead we can make every character a fixed byte length. We could make
every character 4 bytes in memory, but since strings are 99% ascii that seems
wasteful, so instead let's just make every character a fixed-length 2 bytes
(ucs2), that should cover 99.9% of it.

~~~
dcposch
Well, sort of.

Java and Javascript use 2-byte strings in memory, yet both can represent an
emoji just fine. (They do so via a hack that uses multiple indices in a String
to represent a single character. If you want to go down the rabbit hole of how
much this sucks, check out the MDN pages for charCodeAt vs codePointAt ...)

Modern languages like Go just represent strings as UTF8 in memory. This has
lots of advantages:

\- Most strings are mostly 1-byte-per-character, saving memory

\- Full Unicode support.

\- Faster IO, since you don't have to re-encode UTF8 strings to/from the
network or disk.

MySQL did something weird and enormously stupid with its `utf8` encoding. It
supports up-to-3-byte-per-character UTF8. This is idiosyncratic, nobody else
does this. It supports _some_ emoji and _some_ Chinese characters but not
others.

Fortunately, you don't have to worry about it or learn about it. When using
MySQL, just always use utf8mb4 ; never use utf8.

~~~
nerfhammer
huh, I had remembered unicode as being broken in javascript, python and mysql
in the same way. I hadn't remembered three-byte utf8. mysql also is the only
platform that I can think of that supports three byte integers.

------
ronnier
Summary:

> We ended up building a key-value store which allows you to save any JSON
> data without strict schema validation, in a schemaless fashion (hence the
> name). It has append-only sharded MySQL with buffered writes to support
> failing MySQL masters and a publish-subscribe feature for data change
> notification which we call triggers. Lastly, Schemaless supports global
> indexes over the data.

~~~
samstave
Let’s also not be smug, let’s explain for everyone who comes here: why is this
good or why is it bad. So describe why this is good or describe why it’s bad

That comment was edited with more info...

------
whalesalad
Still blows my mind that it took Uber so long to migrate away from a single db
solution. The bit about wanting an event system to handle downstream trip
processing w/o having one failure block the whole job was shocking.

I’m all for avoiding premature optimization but this was taken to the extreme.

PostgreSQL is capable of all of this out of the box. Wonder why a custom tool
was built instead?

~~~
subway
_Still blows my mind that it took Uber so long to migrate away from a single
db solution. The bit about wanting an event system to handle downstream trip
processing w /o having one failure block the whole job was shocking._

I'm pretty sure they've seen a few iterations on their data stores. I remember
attending a meetup at Urban Airship in 2012 or so where Uber engineers gave a
presentation about a data store migration (I _think_ from Mongo to MySQL)

------
mewse
I wish I had video of the faces I was undoubtedly pulling, during the few
seconds I spent puzzling out the pronunciation and meaning of the word
"Schemaless". /Shema-leez/ ? /Szhee-males/ ?

Naming products is demonstrably a hard problem.

~~~
singularity2001
God forgive me for reading 'she-males'

------
mangatmodi
Any critique on Uber's use of Triggers for triggering billing service? I have
been reading that Triggers shouldn't be used to esp, call external services
as, the external service might not be ACID compliant(no rollback?) and if
expensive, they can hold the DB lock on the row for really long time.

------
verst
This was discussed here 2 years ago.

[https://news.ycombinator.com/item?id=10894047](https://news.ycombinator.com/item?id=10894047)

~~~
alpb
Indeed discussed 4 times before
[https://hn.algolia.com/?query=schemaless%20datastore%20mysql...](https://hn.algolia.com/?query=schemaless%20datastore%20mysql&sort=byPopularity&prefix&page=0&dateRange=all&type=story)

This also should be titled "(2016)"

------
rubyn00bie
Can this have a 2016 added to it, please?

~~~
dang
Yes.

~~~
stmw
Great catch!

