Hacker News new | comments | show | ask | jobs | submit login
Why Uber Engineering Switched from Postgres to MySQL (uber.com)
731 points by myhrvold 333 days ago | hide | past | web | 294 comments | favorite



> MySQL supports multiple different replication modes:

> Statement-based replication replicates logical SQL statements (e.g., it would literally replicate literal statements such as: UPDATE users SET birth_year=770 WHERE id = 4)

Postgres has that too (using a 3rd party tool, but it's an officially supported tool). We were using it on reddit 10 years ago. It caused a lot of problems. I wouldn't call that an advantage for Mysql.

Honestly, reading this it seems like the summary is: "We don't follow great engineering practices so we need a database more forgiving". Which is fine if that's how you want to run your business, but isn't really the death knell for Postgres.

A specific example:

> This problem might not be apparent to application developers writing code that obscures where transactions start and end. For instance, say a developer has some code that has to email a receipt to a user. Depending on how it’s written, the code may implicitly have a database transaction that’s held open until after the email finishes sending. While it’s always bad form to let your code hold open database transactions while performing unrelated blocking I/O, the reality is that most engineers are not database experts and may not always understand this problem, especially when using an ORM that obscures low-level details like open transactions.

Your developer should understand database transactions. But you should make it easier for them by abstracting it so that they don't have to. And in this particular case, I'd say they shouldn't be using the database to do locking around sending a receipt. It should be put into a queue and that queue should be processed separately, which avoids the transaction problem altogether.


The actual summary of the article is "The design of Postgres means that updating existing rows is inefficient compared to MySQL".

Yes, there were some other points that were just extra annoyances for them but clearly that point was the most important to them. It's what the header image and the first 60% of the article was talking about and yet nobody seems to be engaging with that point in this thread.

Is the design choice bad? They never said it was. It's just an engineering trade off. It's very possible that most workloads benefit from this design. But if you workload involves updating lots of existing rows at large scale, then MySQL is going to be a better choice for you.


But it's only an issue if you rely on lots of transactions for data consistency and my point was that it sounds like they are relying on transactions too much which is why they need a more "forgiving" database, which is the part I quoted.

Also they didn't mention anything about the auto vacuumer, which mostly solved the issue they are talking about.

Their lack of mention of the vacuumer and not seeming to know that Postgres supports statement level replication makes me wonder if they took a deep dive into the wrong part of the technology.


None of what you said addresses the issue that I (or they) are talking about.

On Postgres an update requires a rewrite of every index of the row.

On MySQL it only requires an update of the indexes that were touched by the update.

If you have a table with 10 indexes then this means doing 10 extra writes physically to the disk.


As I know, not every updates rewrite indexes in PG. It has single page clean up and HOT (heap only tuple) update optimization.

Please refer to 64~ page of https://momjian.us/main/writings/pgsql/mvcc.pdf.


10 indexes on one table seems a bit much. It sounds like a table that hasn't been normalized.


Though that is quite proper for a data warehouse.


A data warehouse should use much fewer transactions though.


Both vacuuming and logical replication are discussed in the article. In particular, vacuuming is easier with InnoDB since the changed records all exist in the redo log whereas PostgreSQL needs to scan the whole table. pglogical is mentioned for people running PG9.4+ as a way of doing minimal downtime cross version upgrades, which wasn't an option back with PG9.2 unless you go with something like slony1 or londiste.


> vacuuming is easier with InnoDB ... PostgreSQL needs to scan the whole table

There's been quite a few improvements to VACUUM in 9.6 [1], including avoiding full-table scans.

[1] https://www.postgresql.org/docs/9.6/static/release-9-6.html#...


Agreed, they even admit Postgres is faster for querying because MySQL requires 2 index lookups. I like the immutable data design of Postgres, is more robust and enables fast transactional DDL. My design never deletes or alters a tuple, so all of these problems that Uber is obsessing about, go away. Question is - Why do they need so many updates? Is there a problem with their data capture?


Experience with Mysql replication (even simple master/slave) leads me to believe Uber is going to have some rather nasty surprises at some point.


MySQL has had solid and flexible replication options for a long time. Postgres has only just started to catch up in the last couple of years.

Don't get me wrong, I would generally choose Postgres over MySQL for an RDBMS with replication requirements these days, but I'm not sure I would have made that same descion a few years ago.

There are valid reasons that long established companies such as Google, Twitter, Facebook and countless others chose MySQL as their primary data store.


it's not that "replication options just started to catch on" with postgresql, it was that when the postgresql team take something into core it has to be fully fleshed out and fully supported with no surprises.

And with that in mind, I would take the slow to deliver replication options postgresql supplies in base over the previous third-party options which where comparable to the built-in options in mysql... and I'd take it over mysql's built-in options.

and I've used both postgresql and mysql in highly transactional/reliable production environments for over 5 years.


I guess I'll be "that guy" who brings up non-technical concerns. I'd think very long and hard about using MySQL these days. Oracle is pretty much bizzaro treating it and widening the feature gap between the open source and enterprise version. Wikimedia is using MariaDB so I'd guess it works well for certain high scale use cases.

[I know there's different opinions but I'd rather have a fully open database if I have a choice. For the record I default to PostgreSQL]


> There are valid reasons that long established companies such as Google, Twitter, Facebook and countless others chose MySQL as their primary data store.

I think you're misrepresenting things here. While all do use MySQL for some specific tasks, it's clear that all also have many other central datastores which are not MySQL. In Google's case this is a gross overstatement.


Not really, YouTube runs on MySQL. Pretty sure it's still the largest video site in the world.


YouTube runs several MySQL instances behind Vitess (source: http://blog.vitess.io/2015/03/scaling-mysql-in-cloud-with-vi... )


At the same time, YouTube is only one of Google’s services, and there is no evidence that the remaining services run on MySQL as well.


Twitter's underlying data stores are and always have been MySQL.


Could you elaborate?


You can read it in their next blog post: "How we migrated back to PostgreSql from mySql after migrating from PostgreSql to mySql". My experience with mySql replication was OK, but I did nothing fancy - just master/slave standard stuff. However the database itself was allowing you to do stuff by default (at the time at least) that is really bad. Not to mention scaling while I was using it was pretty much "you're on your own buddy" kind of deal. Today I'm scared to touch mySql because I don't trust Oracle. I use MariaDB on an old project which used mySql but anything new is postgres.


Personally I prefer Postgres. But at the shop we use MySQL. One of the leads personally believe in Percona Server[0] and mentions some of the old mysql folks work there. I like their docs and the idea of a more tuned mysql config.

[0]: https://www.percona.com/doc/percona-server/5.7/index.html


I can second using Percona; I wasn't much of a believer in MySQL until I inherited a medium sized application using Percona 5.7. Once you get past a few of the initial warts (like making sure the database doesn't silently truncate your data--yuck), it's been really performant, stable and impressive (and we're running a setup with a fairly complex schema with master operating at ~3000 QPS average, a slave actively replicating for read load balancing and backup, and Sphinx for FTS).

Knowing what to be careful with on MySQL, I would certainly consider using the Percona flavor again in a new application.


I hit an in production bug with the silent truncation field thing the other day, old system (2009ish) that was been moved.

Table had a field that was varchar(128) and wasn't used (original filename but system generated a hashed filename and served it with that name (for sharding, lots of files)) so for 5 years MySQL had been silently killing everything > 128 characters and it didn't matter.

Then along comes our hero (me in this story) who does all the upgrades, tests everything thoroughly (he thought) and it's all fine, deploy, test everything fine.

Next morning, 9 bug reports "Can't upload files at all FIXITNOW!!!".

Turns out company was auto generating PDF's from some report software that dumped all the data into the filename and none of them where less than 128 chars.

MySQL 5.7 started not silently dumping data into the bitbucket in the sky and instead threw an error (when quite reasonably you tried to shove 300 characters into a varchar(128)).

TLDR: MySQL doing the right thing broke the broken.


Heh, yes came here to say something rather similar...


Try enforcing this on teams that use ORMs like hibernate with 500 developers.

Super, duper, common issue, you will find this at every large shop at some point in its life time, usually around the time of hiring people and expanding extremely fast, and taking on some tech debt.

All functions of extreme scale, hyper growth, and yeah, not following the absolute best practices all the time, but tech debt is like any debt, you get something now, and pay later. If they continue going up and to the right they will be able to afford it.


ORMs have mostly been just painful at ever shop I've been at. I've used ActiveRecord, Squirl, Hibernate, django.db .. they're all various level of suck.

The one huge advantage of an ORM is the ability to support multiple databases, but that only really works if you can do everything using the ORM. The moment you have a function too complex that you need to write some SQL, now you need some case statements and multiple integration tests for all the database your product needs to support.

They remove some boiler plate while adding others. In one of my own projects, I just created several files (pgsql.commands, mysql.commands, etc.), a basic set of classes around them and a base set of commands that will work for all the DBs I wanted to support (so the command files had an inheritance model, albeit only one layer).

With all that being said, most ORMs I've used do have explicit transaction support. I know Squirl had a `transaction {}` block you could wrap commands around. Transactions shouldn't be an excuse. They should be off by default and explicit added around blocks of things that need to be atomic.

> Try enforcing this on teams that use ORMs like hibernate with 500 developers.

I realize this is a hyperbole (I hope) because you really shouldn't have 500 developers all on the same monolithic project (unless you're developing like...the Linux kernel). Getting your team to at least try to implement best practices does take some effort, but with things like weekly demos and code reviewed commits, it's do-able.


I used to be a huge proponent of ORMs everywhere, but I've come to realize that if you're writing your app in such a way that a developer needs to be able to do any arbitrary data fetch or transformation whenever they want, that's your real problem. The set of retrievals and transformations you want to support should be well-defined, and abstracted into a layer whose interface allows the level above it to only think in terms of models and not their backing store.

After you have that, then it doesn't even matter on the backend. The models you present to the layer above can have complex and changing relationships to the actual storage -- maybe they contain data which is derived from what's in the database, but transformed after being fetched so that none of their properties actually correspond to a column or a field in a store. In my experience -- having seen the tragedy that is a Rails project gone full ActiveRecord -- this pattern enforces an excellent separation of concerns and constrains a problem which can otherwise grow unboundedly in complexity.


I don't really agree; I think you either need to have a very thin layer between your DB facts and your domain, or else use the DB as a kind of persistence layer for a complex graph.

The latter only really works if you've got a primary implementation language and aren't integrating lots of applications / libraries written in different languages communicating with the same database. You need to go down the SOA / distributed RPC / FFI route to integrate different languages, and that has its own complexities.

Personally I prefer treating the DB as a canonical store of facts. Models with a lot of code are pretty suspect. Retrievals can be tuned to just the facts required, ma'am - you don't accidentally drag in the banana + gorilla + whole forest. Doesn't stop you building a higher-level service layer if that's what you need, either. You'll need that when you scale up anyway; chatty models won't work at that level either.


Yeah, I'm saying that as your app grows out of being a simple CRUD app into something more useful and involved, there will be less of a relationship between what you need to store things efficiently, and what you need to present them well. Your model will become more graph-like, probably. For this reason, patterns designed around AR-style models will fail to scale. I disagree that this only works in a mono-lingual environment, although you will need tooling and infrastructure to support it; a model-centric architecture typically doesn't afford the possibility of multi-lingual support.

The code doesn't go in the models, it goes in the service/arbitration layer. DB as a store of facts is obvious -- DB as a 1:1 representation of what yet-unforseen features, UIs and platforms will need is a naive and limiting assumption. You have to build your application in a way that future product needs won't be constrained by storage and modeling decisions, which is a tension that Rails apps frequently encounter.


> The one huge advantage of an ORM is the ability to support multiple databases, but that only really works if you can do everything using the ORM.

It has several advantages: support for multiple databases (which is useful, sometimes), the ability to serialize/deserialize an object graph in one go, and sometimes a decent query builder which lets you compose queries as opposed to concatenating strings.

Unfortunately, it's also terribly easy to destroy performance by using lazy collections configured the wrong way for your use case and not notice it, to the point where I strongly advocate using query builders instead.

> I realize this is a hyperbole (I hope) because you really shouldn't have 500 developers all on the same monolithic project (unless you're developing like...the Linux kernel). Getting your team to at least try to implement best practices does take some effort, but with things like weekly demos and code reviewed commits, it's do-able.

The problem is that when you come in later, that the codebase grew way too fast and the deadlines are tight, retrofitting best practices on an existing ball-of-mud can be daunting.


> It has several advantages: [1] support for multiple databases (which is useful, sometimes), [2] the ability to serialize/deserialize an object graph in one go, and [3] sometimes a decent query builder which lets you compose queries as opposed to concatenating strings.

[1] was the point of the comment you replied to and it provided a very important constraint as well

[2] rarely needed & easy to implement with recursive queries in native SQL

[3] building queries is pretty straight forward; what ORMs usually tend to bring to the table is knowledge about the schema and therefore compile-time error reporting - but this can be done in any language where one has the level of reflection, or, in worst case, by a two-stage compilation process where stage 1 generates code from the schema that can then be used by the compiler for verification in stage 2


I don't see how recursive queries are going to help you serialize a new Foo with a new field of type Bar with a new field of type FooBar, each going in a different table. That's what I mean by serializing an object graph.

As for building queries, SQL is straightforward (mostly). The problem is that it composes very badly. Any time you need to implement something like an advanced search (ie, lookup the same information, but with a number of different search criteria only known at runtime), the best you can do is concatenating partial queries and hoping you got the parentheses count right. Not to mention that a query builder will help with stuff the SQL syntax is miserable for, like IN clauses.


> I don't see how recursive queries are going to help you

You construct a graph of objects with very few queries, transform it and write it back; with knowledge of the db-schema this will outperform any ORM-based solution and give much greater flexibility.

> As for building queries [...] The problem is that it composes very badly. [...] the best you can do is concatenating partial queries and hoping you got the parentheses count right.

There is an area between full-fledged ORMs and string concatenation. In a purely functional approach queries are composed by composing functions that compile to queries. Postmodern[1][2] is a good, open source example, though in most commercial projects we just built our own wrappers for the tasks and databases at hand. This also allows for much better performing code since for a lot of tasks hooking up the db-reader to the json emitter without going through object instantiation reduces memory and cpu consumption by an order of magnitude (or two), while in the same project, some code benefits from a OOP approach (for which you just use a reader that constructs the objects on the fly).

[1] http://marijnhaverbeke.nl/postmodern/

[2] http://marijnhaverbeke.nl/postmodern/s-sql.html (yes, it does string concatenation at run time, but it does it for you, you don't worry about getting the parenthesis count right)


This is why I like SQL Alchemy. It provides an ORM that acts as a veneer over a very powerful SQL expression library, and you can move back and forth between the two seamlessly, even mixing them in individual statements.


That's what I've often wondered about ORMs. First they try to abstract away all the database internals so that the developer hardly even knows what's happening. However, ultimately the abstraction always ends up having some limitations. In the end the internals need to be exposed, but since your code base is still mostly based around the ORM it can end up being a mess.


If you're team is 500 strong you definitely should be at a point where you can do things right. If not, the 'debt' is not mainly technical, it's clusterfuck of bad decision making.


Yea sure some people in the sea of 500 will be doing elegant clean things, working in the core part of the business ( usually at or near the cash register program ) but if you have reason to have 500 developers you don't simply get the luxury of having every project be written perfectly.

Landing somewhere in-between by getting things done not perfect but shippable, and doing it fast as possible is par for the course ime. The best code is always written the 2nd or 3rd time, never the first.

Especially on the case of hyper growth worrying your competitors like lyft or postmates or amazon might get something pivotal out first.

People have to learn somewhere, usually things like this they learn at scale, on the job.


Yes, but there is no point in blaming the database for that. It's seriously not a database issue. Particularly Postgres has well exposed internals for monitoring and killing of long running transactions (if you are not setting statement timeout).


Sure, but the show must go on.


Something that hit me researching how build a relational language and thinking how hard could be to remove SQL and put instead my own flavor, (super-oversimplification):

- SQL "bad" - ORM "good" - NoSql apis "good"

So, the thing is that when facing with SQL "everyone" try to "abstract" it more.

Or instead use NoSql, because is "easier".

Fine.

Then if exist a market demand for a better API for the databases, why the databases guys not DO IT?

Yep, I know SQL is supposely the way for it, but bear with me: I live in the FoxPro era so I know what is code "to the metal" in database without SQL (and it was fine and easy).

If the SQL layer could be optional and more bare layer is provided (you can copy the dbase ideas!) then the problem of ORM mappers could be solved far easier (I imagine!).

How this could be?

With a AST api layer, for example. So I can send:

TABLE "Customer" SELECT "*"

So, imagine a kind of LLVM but for databases...


LINQ: https://msdn.microsoft.com/en-us/library/bb397926.aspx

Most of the time, the problem is to "embed" a foreign language (SQL) into an existing one (Python, Ruby, etc.), but you can do it in a proper way such as LINQ, then you have type-checking etc.


I wish your thinking would be more widespread. I never understood how come that restful HTTP became very popular without having any universal query language around and merely providing an HTTP client with each service built, while in the DB world everybody is obsessed with SQL? Why not just provide the developer with an API/library for the underlying algorithms/datastructures?

For example I hate so much when I have to read mySQL documentation jus to find out what went wrong when I used ORDER BY and the optimization didn't kick in? I mean, if I have to keep in my head the whole optimization mechanism of the DB engine and all the steps of the filesort algorithm going on behind the scenes just to write a single wretched SQL statement, then what's the point of SQL in the first place?

And this was just one example. The truth is that the SQL abstraction is leaking. Most of the commands don't encapsulate anything at all.


it's very hard to abstract away from SQL. to be honest, many nosql have integrated concepts similar to grouping, having, joins in their API as well as they matured, because that's what data elaboration needs.


But still everyone do it. That must tell something, right?

And I'm not talking about NoSql borrowing some relational concepts, but the opposite, and more directly, the API.

I know that most folks (including the ones that downvote!) have no clue what I'm talking about, because almost nobody (recently) have experience in talking against a database without SQL. Is like if the only way to talk to a NoSql was using Json + REST. That constrain badly your mind.

The inner relational model is far more rich, and simpler, SQL was not designed to be used by developers and it show (specially when you have contrived syntax as with CTE)

And before you ask what is the problem with CTE, is exactly because the way to have that is create a contrived syntax that obscure what is goin on. SQL is too restricted, yet too broad for interfacing.


Well, data transformations are much easier to read and write in SQL than e.g. Java, what with temporary collections built up in memory, random maps, lists, etc.

CTEs aren't required very often - you generally only need them for recursive CTEs, and that's iterative retrieval analogous to pointer-chasing. It's typically a sign of a data model that's poorly suited to relational storage, e.g. trees and graphs.

I have issues with the irregularity of SQL syntax - it deeply annoys me that we have both 'where' and 'having' for doing exactly the same thing, one pre-fold and one post-fold - and I don't like my lack of access to the potential the index has for related fetches (window functions are not pleasant to work with) - but mostly my problems come not from lack of access to the relational nature, but lack of power over poor implementation / query planner / etc. details.


I started using CTEs much more often for complex queries actually. IMHO, they're more clear and easier to read than sub-queries, especially when those sub-queries get nested 3 or 4 layers deep.

I agree with the ideas of irregularities. The WHERE vs HAVING doesn't bother me much, and I can't think of a better syntax off hand. My SQL pet peeves are that the SELECT list is at the beginning - I don't know what columns I want exactly until after I type out the JOINs, so I usually type SELECT * and then fill in later. I'd rather put it between WHERE and ORDER BY. I'd also like UPDATE to put the WHERE before the SET, so you don't risk blowing up a ton of data if you forget or miss the WHERE.

Probably a lost cause to get that in, but I would think it wouldn't be too hard to at least support those syntax changes in addition to the current standard.


Complex queries using CTEs may be easier for you to reason about when you are writing them, but are a nightmare to understand if you are not the one who wrote them. We have heaps of CTE-intensive technical debt and usually it takes longer to understand what they are doing than it would take to write them from scratch from a specification.

If you are using them to ease your understanding, chances are that you are doing the job in a very memory intensive way. It also tends to be slower because the generated intermediate results do not have indexes nor good statistics that could help the query planner to be efficient. Essentially, you are taking upon yourself the query planner job and assuming you'll do it better yourself.


Agree, but that are problems at the end, when most developers have issues at the start (basics). Baffle me when i read that relational databases are "hard" and you need to teach some basics even for experienced developers. Plus, the thing here is that yeah, everyone know SQL have issues.

So they make a ORM

Then it create worse issues. Is like the Database-side and the app-side are at war (without intention!) with each other and the database-side only concern themselves for the kind of issues that only happened for specialized tasks.

And the app-side still try to interface to the database, but poorly.

----

Articulating this better, I hope: Is like the movement now to improve the syntax and semantics JS. Or like ASM.js.

So, why not have SQL 2.0 with the fixes everyone since 20 or more years ago already know? And the libraries/API made to acknowledge that databases are not primarily used by "end-users" but app developers?

But that is probably like ask why not clean C++... ;)

I know this is just ask for the moon (obviously not exist one "SQL", only several more or less alike implementations), but the trouble with ORM and databases is more than a decade old and the answer is mostly "yep, we have that problem. Move along"


> SQL was not designed to be used by developers

Sure it was. Who else would it have been designed to be used by?


> > > SQL was not designed to be used by developers

> Sure it was. Who else would it have been designed to be used by?

More accurately than the grandparent: SQL wasn't designed exclusively to be used by developers. It was designed to be accessible to analysts (domain-focused analysts, not systems analysts.)


I've tried to setup a replicated postgres with autofailover and it honestly is a pita.

the only sources of failover are rando scripts over the internet, that you have to download hammer in to your version dialect and hope you don't trigger one of the many uncovered failover modes.

sure log shipping works, but that's far, FAR from a working solution. the gap requires ton of development hour, testing etc.

can't really blame people for using things with mature tooling.

(but I didn't switch to MySQL that'd be madness! there are plenty of good, replicated redundant stores out there, both nosql and sql)


Does repmgr handle most of this?


most of it. but in the master-slave configuration the clients can only work if connecting to the master, and when the master switch, scripts needs to go to each clients and update their config. not immensely bad if you also use pgbouncer so you can do it on the fly without restarting the whole client, but exceptionally vulnerable to split brains and the like.


I've handled this with a DNS CNAME record that points to the active master, though pgBouncer is a valid solution as well.


Uber engineering in general is extremely disappointing. They consistently oversell/overhype mundane processes and technologies both software and otherwise. I suspect this is mostly because they are driven more by marketing and business needs than actual engineering bottlenecks. That is the only way I can explain all the technological missteps they keep bragging about.


Sending an email to a local MTA should be pretty fast.

That is, if you cannot and will not do async stuff in your program, there are ready-made tools that will do that particular thing asynchronously for you, and have been doing so for years (or even decades).


True, but really your local MTA is just acting like a specialized queue, since the first thing it will do is send the message to your relay.


Yes, the whole point is to avoid building your own queue where a ready-made, purpose-made solution already exists.


Well there are pros and cons. If you have already have a queue structure set up in your environment, it may not be a good idea to have to maintain a second queue structure in the form of a bunch of local MTAs. You'll need monitoring (how many messages are queued in each local MTA?) and a way to keep those MTAs up to date.

If you just put an item in to a queue, and then have a specific cluster of machines that does nothing but grab items from that queue and push them through an email infrastructure, then it'll be a lot easier to maintain.


I was wondering how could that happen? It sounds like someone is trying to do two things in parallel.

I would expect

    def my_view(request):
        try:
            receipt = generate_receipt(...)
            receipt.send_email(...)
        except SomeKindOfEmailError as e:
            # okay do something else
and this should be synchronous and thus blocking. So to not block, they either wrote co-routines (asynchronous) or execute things in parallel. Have I interpreted their problem incorrectly?


Open db connection

get data for receipt

generate receipt

send email

write success to database

close connection

To a junior programmer this would probably look reasonable, and to be fair, it takes some experience and getting burned, or good training, to know it is not.


I was about to argue but then I realized: I've been working in the MS world for so long, I forgot that not everybody has connection pools.

The MS recommendation is the opposite: get the connection early, finish with it late, let us worry about the "real" connection. I've been working on multi-TB databases like that, with hundreds of concurrent requests, and never had problems.


This has nothing to do with pooling, you can't use a connection pool in this scenario because you're in a transaction. Even in the MS world, that requires a single connection to be used throughout without releasing it back to the pool. If one isn't in a transaction then each access to the db uses a fresh connection from the pool and this problem never comes up, but the OP here is assuming the time between open and close is a transaction.


So the correct version is, I guess:

Open db connection 1

get data for receipt

Close db connection 1

generate receipt

send email

open db connection 2

write success to database

close db connection 2

I guess you could speed this up a lot by doing it in bulk instead of opening and closing db connections twice for every email. Anyway, the version you wrote sounds reasonable for everything but really big operations to me, but then again I'm fairly junior.


You can still do it in one transaction, sort of. Replace "send email" with "queue email" and then either eliminate the step that writes the success to the database (pushing that off to the queue processor, who will open or already have open a database connection) or instead write to the DB that you queued the item.

The advantage you gain is that the queue is a nice buffer if something gets held up with email sending, and also, the queue processor can work in bulk, say sending 100 emails, and then opening a connection to the DB and writing them all in one statement.


Depending on the number of email sending workers, it could be closer to:

    transaction 1:
        mark an unsent receipt as being processed
        get the marked receipt
    send email
    transaction 2:
        mark the receipt as sent


So if the "send email" step fails (temporarily), the next worker to come along will grab the same receipt and send it again?

I think a better solution would be to use a centralized queue that actually does the mail sending, and retries in case of failure.


That depends on the implementation of the queue. Most queues have a "failed/retry" concept, where they mark failed jobs to be retried in some set future point in time. So one failed job does not hold up your entire queue processing.


> a centralized queue that actually does the mail sending, and retries in case of failure.

A bit like SMTP?


not necessarily, in the first write we can update receipt status as "queued" so queue system can update to "sent" or retry


You don't need to close the database connection to not have an open transaction, right?


No you don't. But depending on the ORM, some of them batch writes unless you explicitly tell them not to, but will flush when you close the connection. So some programmers will just close the connection because they don't understand the difference or because they've learned that closing the connection guarantees that the data is written.


> So some programmers will just close the connection because they don't understand the difference or because they've learned that closing the connection guarantees that the data is written.

To be fair, this problem isn't limited to databases. Filesystems, even HDDs/SSDs, have been known to readily ignore flush() calls in order to achieve better benchmark results.


This still can be a pretty reasonable and correct course of action. Everything depends on the amount of data you are locking, and amount of transactions open simultaneously.

When there's a noticeable contention due to the number of parallel transactions, one should consider ways to loosen their data guarantees (the status can be unset for some time) and go for an explicit asynchronous approaches.


Thanks for the illustration. But if Uber team was using ORM, then expect ORM to take care of the pool of connections.

get data for receipt would be a SELECT and only write success to database would do INSERT or UPDATE. I expect junior programmer to complete the above in at least two SQL calls. I have a feeling they were trying something smart.

EDIT: hmm reading the other commenter above, probably they are trying to lock on the data for full data integrity. Okay. That makes sense then. I was looking at the problem from the wrong angle.


Grapevine says that they were holding a transaction open for the entire duration of the Uber ride.


The fact that Postgres didn't have official replication feature prior to a recent version is just unacceptable.


I would argue that most of these Postgres "flaws" are actually advantages over MySQL when you look at them holistically rather than the very specific Uber use-case.

Postgres's MVCC is superior (can rollback DDL, can add indexes online, can have open read transactions for a VERY long time without impacting other parts of the system)

Postgres supports many types of indexes, not just b-tree. One thing it doesn't have is clustered b-tree indexes... which is really what MySQL does that makes it somewhat "better." I wonder how Uber adds an index to a table that already has 1B+ rows in it with mysql?

Postgres have WAL level replication is a better guarantee of actually replicating the data correctly. I cannot tell you how many times I've had to tell my boss that the "mysql replicas might be slightly out of sync with the master" because of various replication issues. The way it handles triggers and scheduled events alone is garbage and can very easily break replication and/or silently cause inconsistency.

As for data corruption, if there is a bug that causes corruption, then there is a bug. I don't think that is a fundamental design flaw as implied in this article. You shouldn't rely on 1/2 assed replication design to accidentally save you from the data corruption bug. There are many downsides to the design MySQL has that are simply not listed here.

I have been both a professional MySQL administrator as well as Postgresql (as well as SQL Server and many NoSQL engines). Many of these Postgres issues are only issues at crazy huge scale, and I would say at that point you probably want to move away from relational anyway. MySQL has its own very large set of problems at scale as well.

It sounds like Uber is using MySQL as just a data bucket with primary keys ("Schemaless") which is good -- because you can't alter tables to save your life with MySQL.

At the end of the data each developer/business needs to use what works for them, but I would really shy away from pointing to this article as a linchpin in the "MySQL vs. Postgres" war (if there even is such a thing.)


Can't agree with this post enough.

I find their whole writeup to be terribly myopic. When they started their service, Postgres was almost certainly the right choice for what they were building and their MySQL setup was not. Now Postgres is less effective for them.

These kind of tech switches are _inevitable_ if you're making the right choices for your organization.

This strikes me as very similar to the article where Twitter ditched Rails. The focus should be inward... how they chose a tool that didn't support their use case and how they solved the problem, but instead they're about the flaws (that aren't really flaws) of the tool.

It's always the craftsman.


While the article is discussing PostgreSQL vs MySQL, Uber's actual win seems to be from switching their data model to their sharded Schemaless key=value store. I expect they would have got this win no matter what backend they chose for Schemaless.


"1B+ rows in it with mysql?"

Been there, really no fun.


As long as you can shard (across multiple instances, or even within same instance, to avoid B-Tree latching), 1B+ rows within MySQL is piece of cake.

Also, MySQL* is getting LSM-Tree support lately, which makes high performance data ingestion combined with OLTP workload quite feasible.

* https://github.com/facebook/mysql-5.6/tree/webscalesql-5.6.2...


"I wonder how Uber adds an index to a table that already has 1B+ rows in it with mysql?"

"As long as you can shard "

Not sure how sharding helps with 1B+ tables when adding indices, care to share?


This is: "Split the index in your application code."

And then do your joins, in your application code.


Percona to the rescue [1]. Works flawlessly, at least in our case.

[1] https://www.percona.com/doc/percona-toolkit/2.1/pt-online-sc...


Also been there. Percona Toolkit works fine (though it's not load-aware enough, so we changed the backfill logic).

Plus, in 5.7, there are a fair number of online-DDL changes, and adding indexes is (usually) one of them.


Am there right now, it's not really a problem - you just have to plan your schema changes and use Percona OST.


https://github.com/soundcloud/lhm

Saved my life a few time :)


Thanks!


> It sounds like Uber is using MySQL as just a data bucket with primary keys

They have a couple posts about "Schemaless", but I still don't understand why they used MySQL as the data store instead of something like Cassandra. ( https://eng.uber.com/schemaless-part-one/ ) From that post it looks like they basically built a no-sql database on top of a relational database.

The only reason given was operational trust ( "If we get paged at 3 am when the datastore is not answering queries and takes down the business, would we have the operational knowledge to quickly fix it?" ). The project took nearly a year to roll out, and in that time the operation knowledge could surely be trained, hired, or contracted.


Operating Cassandra at the scale that Uber is going to require is going to be painful and as operationally draining as MySQL if not more.

There are really not a large number of options here anymore with the departure of FoundationDB from the market. CockroachDB might be an option in a few years, though I'm still confused why they are moving towards a SQL-ish vs key-value interface...


"departure of FoundationDB from the market"

Pissed me off so much. Only thing close to Google's F0 RDBMS on the market, at a reasonable rate, and the beginning of a good offer to enterprises. Then, "poof!" It's a good example of why I tell companies to not put anything critical into something from a startup. If they do, better have a synchronized, backup option tested and ready to go.

"why they are moving towards a SQL-ish vs key-value interface..."

That's easy: most databases and buyers use SQL. Key-value is preferred by startups & non-critical, side projects in big companies you see here a lot but aren't representative of most of the market. Need first-rate, SQL support. I think EnterpriseDB shows that it's also a good idea to clone a market leader's features onto alternative database.


> Only thing close to Google's F0 RDBMS

Did you mean F1 (instead of F0)?


Yeah, yeah. I keep getting the 0 and 1 mixed up. Thank you.


What dbs would you suggest in their scale ? that are easier operationally than cassandra ?


I was at MesosCon and ended up talking to some Uber people. They are currently using Cassandra in prod. I can't speak as to why they use MySQL the way they do though.


I gave a talk at MesosCon about how we (are starting to) run Cassandra across multiple datacenters at Uber (https://www.youtube.com/watch?v=U2jFLx8NNro, https://schd.ws/hosted_files/mesosconna2016/60/mesoscon-uber...).


Thanks for sharing this. I was wondering - the limitations they have listed could be easily overcome with cassandra


I had that same thought, that the time spent rolling their own system could be better spent just learning some existing good-enough thing.

A great way to get familiar with something is to be the folks who write it. It's also much more fun to design and implement something new than to just learn some other fella's software. I'm guilty of this myself.

But I've started to remind myself that "somebody else has had this problem" and there's probably a good enough solution out there already.

Put another way, is what you are trying to do really so novel? In the case of Uber's infrastructure, you would have to talk for awhile to convince me that they really really need something not-off-the-shelf.


So arguably, they are using mysql as a storage engine rather than as a database.

They don't explicitly answer the question "Why didn't you use InnoDB/WiredTiger/etc. for your dataplane?", but you get the idea that they were very happy with the specific characteristics of MySQL for their use case and so they built on top of it. It also sounds like they had some deadlines (specifically, the death of their datastore) that they had to meet :).


IMHO some possible reasons of not using Cassandra could be the following.

You can't use Cassandra if you need atomic increments (yes, they're included but painfully slow due to several trips required to satisfy PAXOS).

Also there are no transaction rollbacks (atomic batches always go one way - forward).

You may hit GC pauses if the JVM is not tuned properly.

If the use case involves its of deletes then tombstone related issues need to be considered.


I wouldn't have trusted Cassandra back then either. 0.9, 1.0 or maybe 1.2 was reaching sufficient maturity to actually be recommended. Modern Cassandra has come leaps and bounds, with the 2.x series finally becoming stable this year and just recently 3.0.x finally getting blessed by the community as stable enough for production. And ScyllaDB hot on their heels.


Tools such as the percona toolkit (https://www.percona.com/software/mysql-tools/percona-toolkit) provide the ability to perform safe and performant online alters for MySQL. Behind the scenes it actually creates a new table with the new schema and utilizes triggers to apply writes to the new table as the original data is being copied. Once it completes the table is renamed and the triggers are removed.

Percona also recommends using this tool to safely perform alters for any Galera-based replication product (Percona XtraDB Cluster, MariaDB Galera Cluster, etc.)


I have used this tool many times in production, and in accordance with advice about avoiding foreign keys.

It only failed catastrophically, causing a critical production incident, twice.

Each time took dozens of engineer hours to vet in advance, thus costing thousands of dollars.

Online DDL changes and indexing with pg cost us... basically nothing, and never caused downtime.

My and Pg each have their place... but if you want to modify large tables, MySQL is almost certainly the wrong tool for the job.


While this is true, i don't personally think Percona tools alone makes these changes 'safe'. You can create scenarios where performance is so degraded that you introduce replication delay or service interruptions (and yes the tools have ways to work around that, but it's a significant challenge to tune this correctly on production), data can be lost during the table switch over, and the complication of managing a product / database with a table in an extended migration is very difficult.

This space is far from done, until we can change data structures on the fly...we're going to find there to be a constant struggle between schema and schema-less and neither side will be right or wrong.

Solution: Quantum states; We suspect every possible scenario is happening already, so we should just migrate our database to the appropriate quantum state, where the data is in the structure we desire. Sounds insane, it probably is.


Your absolutely correct that it is not risk free, however, if you do not use foreign keys (which no huge scale mysql install like schemaless does) and you know what your doing (read: calculating the expected load, adding throttles and sanity checks, backups, etc) online alters are extremely feasible with mysql and folks have been doing them for years.

I know first hand shops like FB use a similar method in production.

Also, big note, the whole point of "schemaless" (And other things, like fb's friend feed architecture) is that you don't make schema changes.


online schema change is absolutely riddled with crippling bugs if you use foreign keys. Just a heads up, we've had to basically take everything it does and make an in-house version. The idea and execution are great, when it actually works.


Foreign keys are a major problem in MySQL if you do lots of writes, especially bulk writes. As is auto-increment (table-lock for load data infile, yay!).

We allocate our own IDs, and disable foreign keys for bulk loading.


Support for online schema changes has been getting a lot better in recent releases of MySQL. No idea if it is totally bulletproof right now though, we are still in the "stone ages" of doing failover based schema changes, not even pt-online-schema-change.

It seems that the replication arguments in the post are pretty weak (migrating to 9.4+ to allow heterogeneous versions would still have less downtime than migrating to MySQL), but clustered indexes seem like a huge win for their use case of updating only a subset of each row. Whether that is reason enough to totally ditch postgres, I don't know.


Well, for one thing, if they say schemaless, then you are using the wrong tool for the job.

Sure postgres has JSONB (comparing to mongo db), Key value store such as HStore, but they do well if they fit on one machine. The moment you hit that scale you have to realize that there are tools specifically built for this.

There is the phoenix project https://phoenix.apache.org/ that salesforce is using for scaling. Definitely worth a try.

But again, the title of the article sounded nothing more than a rant to me.


The term schemaless within Uber is more of a product name. It's kind of a joke with many of the engineers, that it's anything but devoid of schema.


I just want to add that newer versions of MySQL can add indexes online. In fact, it looks like all DDL can be done online, even though some if it (like adding a column) may require the table to be rebuilt, which can take time.

Note that the master is still available for read and write during this. Replicas will lag though.


From the MySQL docs: http://dev.mysql.com/doc/refman/5.7/en/alter-table.html

> Updates and writes to the table that begin after the ALTER TABLE operation begins are stalled until the new table is ready, then are automatically redirected to the new table without any failed updates.

While the master may be technically up for writes during this period, it's not much a goer if your table is large and has any write traffic at all, as anything writing will stall for what may be an extended period.


It should never be an extended period. Everyone everywhere will advise to keep your transactions as short as possible.

Sure this is sometimes boring additional work - eg you don't delete 1M records with one statement, you break it into 1,000 statements each deleting 1,000 records.

Sucks, but keeps your db and your users happy.

BTW this is true for PostgreSQL and MySQL and Oracle and every db that allows concurrent DML.


No, the migration itself will cause transactions to stall.

For example, if you have a table with 1M user records, and you run a migration to add a column in MySQL, then any updates to the table will be stalled while the table is rewritten to add the extra column (which may take a while). This is independent of how many records it touches - even if the transaction only touched 1 record and would take 10ms to execute, if the migration takes 10 minutes it may be stalled for up to 10 minutes.

In Postgres you can add a nullable column, and the table will only be locked for a very short amount of time, independent of the size of the table.


>can have open read transactions for a VERY long time without impacting other parts of the system

Unless you're on a replica.


> the very specific Uber use-case

Other than the "schemaless" layer, the issues mentioned are fairly common.


We did something very similar at EA Playfish, at least one alumni of which is part of the Uber engineering team.

We used a 2 column InnoDB-backed table for all of our data storage, massively sharded, and run in a 3-host master-slave-slave configuration.

At that time EC2 would routinely kill hosts without the courtesy of a poke via ACPI and as such we became very good at quickly recovering shards. In a nutshell this mechanism was to have the new host contact a backup slave, perform an lvm snap, pipe the compressed snap over a TCP connection, unroll it and carry on, letting replication take up the delta.

That enabled us to not only manage the 10 million or so daily active users of that title, but was also the platform under the 12 or so additional titles that studio had.

We had lots and lots of very simple things and failures were contained.

I think at the time we were the 3rd-largest consumer of EC2 after Netflix and "another" outfit I never learned the name of. EA being what it was, however, we were never permitted to open source a lot of the cool stuff Netflix and ourselves seemed to develop in parallel.


this is frikking awesome! do you have any of the lvm and pipe scripts publicly available ? i'm kinda struggling with building and setting up lvm on ec2 automatically and was wondering if there is any tidbits you can pass along.


Unfortunately not. That company and codebase is very much dead, and I don't own any of it. Even if I did it would be a bunch of context-less shell (in the first iteration) and then a bunch of context-less Chef (in the second platform iteration).

Things I do remember:

- We used ephemeral volumes for all data stores. This was pre-provisioned IOPs and EBS was flaky as heck back then. I can't remember the disk layout, although we did experiment a great deal.

- We took great pains to ensure there was enough space to make the snapshot (IIRC is was a telemetry/monitoring item)

- The pipe scripts were essentially "netcat".

The best I can offer is this talk: http://vimeo.com/57861199


The article could be summed up as "Postgres is not a distributed database." MySQL isn't either, although it certainly has more friendly replication technology. I think it's a lot more likely that what's really happening here is that they've designed their "schemaless" schema or its supporting software to handle the kind of soft errors that MySQL is permitting and Postgres was not.

We have MySQL replication across the country where I work and I certainly wouldn't characterize it as robust; it fails every 3-6 months. MySQL replication is certainly a lot older and easier to use than Postgres's, but SQL databases are fundamentally CP systems. When you say "This design means that replicas can routinely lag seconds behind master, and therefore it is easy to write code that results in killed transactions" it sounds like you're blaming the way replication was implemented for a physical problem. There is no way to design a replication system such that two highly-consistent databases can achieve perfect availability in the face of real-world networks. A worse protocol can exacerbate the problem, but a better one can't make it go away.

I have never seen corruption with Postgres (unlike MySQL), but I have never tried cross-datacenter replication with it. Apart from that, Postgres generally seems to do much better with consistency than MySQL does, where DDL statements are not transactional, etc. So I am not surprised to hear that their system trips harder on Postgres's more aggressive consistency.

In short, I suspect a more robust solution to their problem is a NoSQL database. On the other hand, it sounds like they want a combination of availability and consistency that will be difficult to get off-the-shelf. I'm glad they found a way to make it work. I wouldn't generally choose Postgres for a scalable system with an aggressive availability constraint--but then again, I wouldn't choose MySQL either, and I generally avoid problems that demand highly scalable, highly available solutions.


Roadmaps ( PostgreSQL ) 2016-2017-...

* Postgres Professional roadmap ( Pluggable storages, Multimaster cluster with sharding, Effective partitioning, Adaptive query planning, Page-level data compression, Connection pooling, Native querying for jsonb with indexing support, ....) https://wiki.postgresql.org/wiki/Postgres_Professional_roadm...

* EnterpriseDB database server roadmap ( Parallelism, Replication, Vertical Scalability, Performance ) https://wiki.postgresql.org/wiki/EnterpriseDB_database_serve...

====

And "Scalable PostgreSQL for real-time workloads https://www.citusdata.com " --> https://github.com/citusdata/citus


My big question here is why they decided to move over to MySQL instead of using the Citus (also open source) Postgres extension. They don't mention it, so we don't know whether it was considered, and if so, why it was not selected. Postgres' rapid feature growth in the past years is interesting in itself.


Citus is Open Sourced since "Mar 2016", so it is not so old.

https://github.com/citusdata/citus/blob/release-5.1/CHANGELO...

"citus v5.0.0 (March 24, 2016)

- Public release under AGPLv3

- PostgreSQL extension compatible with PostgreSQL 9.5 and 9.4"


Citus currently has major problems joining non-distributed tables with distributed tables, and it doesn't fully support schemas yet. It's almost there, but not quite.


Thank you!


Why is there a Postgre Pro and Enterprise DB? Whats the differnce? ( Never heard of PostgrePro till today )


Postgres Professional and Enterprise DB are two companies that provide Prostgres products and services. The linked roadmaps are their respective plans on what they want to add to the open source Postgres database.


We've hit a lot of the same fundamental limits scaling PostgreSQL at Heap. Ultimately, I think a lot of the cases cited here in which PostgreSQL is "slower" are actually cases in which it does the Right Thing to protect your data and MySQL takes a shortcut.

Our solution has been to build a distribution layer that makes our product performant at scale, rather than sacrificing data quality. We use CitusDB for the reads and an in-house system for the writes and distributed systems operations. We have never had a problem with data corruption in PostgreSQL, aside from one or two cases early on in which we made operational mistakes.

With proper tuning and some amount of durability-via-replication, we've been able to get great results, and that's supporting ad hoc analytical reads. (For example, you can blunt a lot of the WAL headaches listed here with asynchronous commit.)


Posts like this are important.

We too often rely on a buzz-word heuristic and that's how you end up with dozens of random technologies that are harder to maintain and don't necessarily solve any of your problems. This method is good, because it shows that when you understand the problem the right way, you can find the right solution, even if by popularity it looks like a "step backwards"

Massive Kudos.


I strongly disagree. it would have been useful f they'd stuck to problems without well-known solutions.

Sadly, they also mixed in issues which are easily solved, or in a particularly egregious case, where they just complain about a bug. As though MySQL never had a bug. That was silly.

My read of it was: Postgres annoyed us a few times, and we got fed up with its, so now something different will annoy us. Please look forward to our blog post in 4 years about how we're using X instead of MySQL/Schemaless because those were also imperfect.


I got a similar impression... though with Uber's scale, funding and resources, they probably could have worked with and through their issues with Postgres. I'm actually surprised they didn't take a multi-pronged approach to their issues. Since they're using Schemaless, I'm curious why they didn't go for one of the many non-sql databases that may well be a much closer match to their use case.

It seems to me that Cassandra (C*) may have been a better match to their needs... yes it means more administrative and application tuning, but would definitely scale to meet their needs. RethinkDB would also likely be a better match to what they are wanting to do.

That said, I've been holding out for some time on PostgreSQL's in the box replication story to take root and mature. There are definitely more mature features and solutions to sharding and replication around MySQL, I just tend to find MySQL to be brittle and every time I've ever worked with it, I have at least a half dozen WTF moments... from binary data handling/indexing, foreign key syntax, ANSI out of spec, and others. PostgreSQL has some great features, and once the replication issues settle in, it will become the default choice for a lot of projects in a lot of organizations. Though, mySQL/maria and even MS-SQL are currently better options for many SQL use cases.


I'll preface what I'm about to say with "I have never worked for Uber and I don't know terribly much about their internal structure", but from my interviews with Uber and a few of their hires I know, it seems that they tend towards hiring totally independent teams from the existing staff when tackling big projects...including hiring an outsider manager to hire a whole team. I won't speculate as to the reasons for this publicly but I've drawn some interesting conclusions from this.

A multi-pronged approach that might involve multiple stakeholders just doesn't seem like their way of doing things.


What replication issues are you referring too? I never once had a problem with pgsql's replication across 8.1->9.5.

It would randomly die, but that was always either my fault or the applications fault, never pgsql itself.

The lack of master-master seems to be the big thing everyone mentions, but PostgresXL is currently in a usable-in-production state.


But, what is the current replication setup that comes with postgres that is well documented with the PostgreSQL (current-version) documentation... the past, when I've looked there's mention of 2-3 solutions (none in-the-box) and others that require at least a 5-figure support contract.

Compare to MongoDB, RethinkDB, MS-SQL and others where the tooling for replication comes in the box. Yes, to of the examples are "no-sql" but even the mysql replication is in the box and supported as such.


Did you read this?

https://www.postgresql.org/docs/current/static/high-availabi...

I'm not sure what more you can ask from documentation.

Does MySQL document Vitess, Galera, MaxScale, etc...?


> though with Uber's scale, funding and resources, they probably could have worked with and through their issues with Postgres

Then they wouldn't get to build cool new stuff and write blog posts about how they had to build cool new stuff because OMG UBER SCALE.


The issue is all solutions are imperfect. You just make tradeoffs and shuffle the problems around.

Doing so deliberately is whats commendable.


thank you for more accurately stating what i was trying to say earlier.


Almost all problems are already solved. "Solving" a problem today is mostly ego-stroking.


Sweet, I have a few PhD thesis and NSF grant proposals I'd like your help on.


I'd love to help stroke your ego, but I see you're busy doing it yourself.


Making a lot of "buzz-word" errors like that, I now use the simple heuristic: choose the oldest available technology that solves your problem. Also known as Lindy effect https://en.wikipedia.org/wiki/Lindy_effect


The problem is that people might get the wrong impression that this is relevant to them when for 99% of people it is not.


Well, this is heresy. Does that mean we are now officially boycotting Uber?

Joke asides, one thing I've been trying to figure out for awhile is the limitation at which certain components/ systems broke down. Basically, something along the line of "given X records, this operations would take Y time, or would cause Z A B C problems". I've actually got developers friends asking me how fast a simple "SELECT * FROM X WHERE index=?" would take on a million row table, since they were surprised that some NoSQL DB could do a query on hundred million rows in a few seconds.

I guess that's part of why you only learned how to scale after having done it once.


"surprised that some NoSQL DB could do a query on hundred million rows in a few seconds"

The tone seems to suggest this is fast, a simple index query in an RDBMS of even a hundred million rows will take milliseconds on even a weak computer.


I was testing some JSONB indexing and did a date-range query on a json doc for a table with 100m records to test with, it returned top 10 records that met the criteria in ~4s, figured out how to index the json doc for dates and this dropped to ~25ms.

This was running on an old Core i5 / 16gb of ram with a lot of junk running on Windows.

It would probably have been faster on a production environment.


Justify that with actual math, please? I don't think you know how computers (processors, bus, memory access, etc) work. How exactly do you think an RDBMS can query an index at a hundred billion rows per second?


The parent said a query of a hundred million rows, i.e.: `select * from users where id = ?` when there are a hundred million users. That's obviously not scanning a hundred million rows, since the index will be a b-tree which by the power of logarithms is not so many levels deep.

Like Illniyar, that's also what I interpreted GP to mean when they said:

> "surprised that some NoSQL DB could do a query on hundred million rows in a few seconds"

which is not fast at all. However, if it actually did something on a hundred million rows that would be impressive. Not sure exactly sure which they meant, though.


Illniyar is right, a b-tree with a branching factor of 100 can index 100M rows in 4 levels. Even with the index on a spinning disk, should have no trouble coming in well under 100ms.


Indexes like this are also typically in memory to begin with, making it even faster, since you only have to find the actual record on disk.

Even better, if you only need one field from the record, and it's part of a compound index, you can frequently return the data from just the indexes; no disk seeks required. Small tip with InnoDB on MySQL - any non-primary key index is automatically a compound index with the primary key.


Great write up. A couple points -

I'm not sure this post is illustrative of any generally applicable considerations (re: the title) in the choice of Postgresql vs MySQL, since Uber seems to no longer be using a relational model for most of their data and is using MySQL effectively as a key-value store.

> say a developer has some code that has to email a receipt to a user. Depending on how it’s written, the code may implicitly have a database transaction that’s held open until after the email finishes sending. While it’s always bad form to let your code hold open database transactions while performing unrelated blocking I/O, the reality is that most engineers are not database experts and may not always understand this problem, especially when using an ORM that obscures low-level details like open transactions.

I have to very seriously disagree here, ORMs make a lot of things easy - and you can get away with building stuff for a while without understanding the underlying databases or SQL but only to a certain scale (I'd say more like medium-scale, definitely not large or Uber level). If you have engineers writing code that interacts with a database without understanding transactional semantics, the engineer in question not the database is the problem.

> We started out with Postgres 9.1 and successfully completed the upgrade process to move to Postgres 9.2. However, the process took so many hours that we couldn’t afford to do the process again.

There seem to be ways [0][1] to do online upgrades with Postgres (before logical decoding in 9.4), although I haven't personally used them. Not sure if they explored these options at Uber or not?

[0] https://github.com/markokr/skytools [1] http://slony.info/


> I have to very seriously disagree here, ORMs

In spirit I agree with you, its the engineer's fault for not reading the documentation of their ORM or equivalent.

But in these big ships with hundreds of programmers, leaving transactions open in hibernate is a daily occurrence somewhere.

Usually caught before production, but happens oh-so-frequently, that anything that exacerbates the pain from this would be seen in a negative light for sure.


I've heard from technical leaders at multiple now well established unicorns how they'd never use postgres or switched from postgres simply because MySQL has a lot more tooling built it and many more people are exposed to its shortcomings at "web scale" so that it's very well known where and when things will break.

Disclaimer, I'm a hardcore Postgres user myself, but I also keep tabs on the other tools.


I've been a hardcore and exclusive Postgres user since 2004. Always rolled my eyes when folks say they use MySQL or moved to it from Postgres. Reading about that bug sent shiver's down my spine. I know no system is bug free, but Postgres is one of those systems that have held up very well for me, and now I'm super paranoid. :-( I guess I understand tho, I moved from BSD to Linux for more tooling.


OTOH, it's important to consider that most places are not and will never be "unicorns", so analyzing things at that scale may not make sense. Just because Facebook managed to make something work doesn't mean that you should follow the same path.


> Each of these system calls incurs a context switch

System calls are not context switches. I wish people would distinguish between them. A system call is just a change of privilege level and is efficient --- there's no cache invalidation required on almost any system.

A context switch, on the other hand, involves a call to the scheduler, saving and restoring of much more CPU register state, and various kinds of cache invalidation. (It's even more expensive if you're switching between different processes instead of different threads in the same process.)

The kernel may perform a context switch while executing a system call --- this context switch is what makes blocking calls blocking. But even IO system calls do not necessarily cause context switches, especially in the case where an operation can be satisfied by accessing only the page cache.

tl;dr A system call is not necessarily a context switch


So the major issue detailed here is that postgres basically uses immutables rows which creates performance issues with writes.

Just read about their new schemaless db in their blog an the first paragraph contains this:

"The basic entity of data is called a cell. It is immutable, and once written, it cannot be overwritten. (In special cases, we can delete old records.) A cell is referenced by a row key, column name, and ref key. A cell’s contents are updated by writing a new version with a higher ref key but same row key and column name."

So, mmm..., not saying that postgres didn't pose a problem for them but I think postgres' db model fits better to their new db then mysql. They probably had to work really hard to get mysql to work like postgres.

Without this issue, it looks like two things needed.to be done with postgres that would have solved their problems have indexes that point to primary id and do logical replication (which they say a plugin solved in 9.4).

Is this a case of "I got burned by something so I won't use it again"


So more from here: https://eng.uber.com/schemaless-part-two/

"Each Schemaless shard is a separate MySQL database, and each MySQL database server contains a set of MySQL databases. Each database contains a MySQL table for the cells (called the entity table) and a MySQL table for each secondary index, along with a set of auxiliary tables."

So... 1 table, with 1 index and manually created and updated secondary index tables.

With this scheme I can only assume postgres will work just as well or better.


Facebook maintains it's own fork [0] of MySQL. A couple of interesting talks are also available: MySQL at Facebook, Current and Future [1] and Massively Distributed Backup at Facebook Scale [2].

[0] https://github.com/facebook/mysql-5.6

[1] https://www.youtube.com/watch?v=jqwegP9xwVE

[2] https://www.youtube.com/watch?v=UBHcmP2TSvk


"its own fork"

And yes, MySQL has a lot of problems, but it seems to be better for "seat of the pants" usage (that is, when you don't need too many SQL capabilities and wants to use it mostly as a data store)


from what I understand FB uses Mysql as permanent storage, not as relational database


I believe they use it for almost everything. They built all the graphs on top of MySQL data. I don't know what permanent storage is, but if you are referring to storing images and videos, I doubt. Highly doubt that. They may be storing pointers, but as a file system, I doubt. But I wouldn't know for sure, I don't work there. Oh, they do have Cassandra (well they built Cassandra)...


They use a system called haystack for binary large object storage. They very much don't use mysql for those.


That's my feeling they wouldn't use MySQL as a blob store.


technically, it is a patch set, it gets frequently rebased against the upstream


Technically they use it as a dumb key-value storage.


There was a great talk from a FB engineer who talked about using MySQL as a key-value store, then another engineer posted a blog about how they use MySQL and queries and joins etc, I asked about the video and he got super defensive and said it was wrong and the guy had no idea what hes talking about.

It really put me off ever wanting to even consider working at Facebook.

Digging around I can't find either the blog or video :(


I've talked to several Facebook devs who confirmed to me that FB primarily uses MySQL as a key-value store (e.g. this is why GraphQL makes so much sense for them). They were quick to point out however that of course they don't know everything FB does.

The problem with making absolute statements about what FB does is that it seems to be structured as a large number of mostly independent teams that are free to choose and develop whatever technology they need to solve their problems. There's actually a talk about scaling by a FB dev (don't have the link right now) that uses this as an example for what it means when they talk about "Facebook scale" (another example is that their monorepo outgrew git so they created their own extension to mercurial instead).

It's entirely possible that the main use of MySQL at FB is as a key-value store while at the same time there are small parts of applications using it with plain old queries and joins.


Ahh yeah good point. FB's scale is very interesting, it's like we solved scalability then FB came along and was like, ah crap we need to resolve scalability. And different teams are solving their own problems different ways.


And so does Yahoo, and Google, and...


I wonder what the design decisions are behind (or what it would take) to make Postgres store secondary indexes on disk like InnoDB does. Sure, the extra index lookup through the primary index is a cost, but it seems like write-amplification can sure be a greater concern too. Ultimately, it would be nice if Postgres gave the DBA a choice of — if not move outright to — secondary-index indirection through the primary index like InnoDB does.


I would definitely not want PostgreSQL to "move to" the mechanism used by InnoDB as that mechanism is slower for reads (which this article even admits). This disk layout is one of the reasons I continue to use PostgreSQL. If you care more about write performance than reads you might even want to look into something entirely different than a traditional database.

FWIW, PostgreSQL's mitigation to the write amplification problem mentioned here are "heap only tuples". It is highly likely that Uber could have prevented a lot of this pain by learning more about this feature--which is notably not mentioned even once in this entire article, which to me completely undermines the feel they are trying to achieve of "we really really really know what we are doing"--and tuning their table density parameters to take maximal advantage.

Instead of "why we moved off of X" it would be much better to see "we are considering moving off of X: anyone know what we are doing wrong?". Sure, maybe Uber knows about HOT, and did extensive analysis to determine it wasn't a solution to their problem; but it frankly does not seem at all to be the case. Anyone who knows a lot about PostgreSQL would have explained HOT to them, so they probably didn't even consult with PostgreSQL experts behind the scenes.

Sadly, "we are looking for help with a complex technical challenge" is something the market punishes under the premise that everyone has to be entirely self-sufficient gods of their technology stack :(. The only time I remember ever having seen a company reach out to the community for help was reddit (with respect to something involving PostgreSQL or Cassandra... I don't remembee the specific issue).


Thank you for introducing me to "heap only tuples", I did not know about it. From what I read about it from Postgres's documentation[0], it would have helped in the updates where no indexed column was updated.

0: https://wiki.postgresql.org/wiki/Index-only_scans#Interactio...


You can find more detailed explanation from Bruce Momjian's site. (From page 64: https://momjian.us/main/writings/pgsql/mvcc.pdf)


They migrated from MySQL to Postgres (https://www.yumpu.com/en/document/view/53683323/migrating-ub...) with almost same reasons and now they are counter argumenting for their inability to use software and lack of skills to upgrade!


Funny how the article is just:

- we used X in a fashion that suited us best

- it caused us problems Y because of some technicalities of X

- so we switched to Z and we could avoid Y thanks to how Z handles the technicalities differently than Y

and the top rated HN comments are:

- you used the X wrong

- all the technicalities of X that caused you problems Y are actually superior features of X


But think how funny that'd be if we assume that "all the technicalities of X that caused you problems Y are actually superior features of X" is actually true?


This was a great overview and write-up.

Anyone know why they are using MySQL over MariaDB[1]?

1. https://mariadb.org/


I'm not saying this is the case at Uber, but it's quite common for MariaDB users to refer to it as MySQL, or to use the two names interchangeably. It's difficult to get out of the habit as all of the CLI tools are still mysql*, clients (i.e. the protocol) are all named "mysql", CM modules (like Ansible's) are all named "mysql", etc. And sometimes it's just simpler (if lazier) to say you're using MySQL rather than explain what MariaDB is and why it's preferable.


I still occasionally call LibreOffice OpenOffice. Besides mysql has less syllables than mariadb.


It's hard to say, but one can often find people switching from Postgres to MySQL (and vice-versa). For example, see this [0] and the related HN discussion [1].

[0] http://insights.dice.com/2015/03/19/why-i-choose-postgresql-...

[1] https://news.ycombinator.com/item?id=9231751


Well there new database is basically a key value store built on MySQL, so I think there isn't much to be gained by using MariaDB and it will add needless additional complexity.


anno 2013 "MIGRATING UBER FROM MYSQL TO POSTGRESQL"

https://www.yumpu.com/en/document/view/53683323/migrating-ub...


Great write-up. A few observations:

1. The encoding and translation schemes of Postgres and mySQL/InnoDB are well described in the blog post, and I would also agree that InnoDB’s design is, all things considered, better for all the reasons outlined in the post.

2. I don’t understand why anyone still uses lseek() followed by read()/write() and not pread()/pwrite() syscalls. It’s trivial to replace the pair of calls with one. Aerospike is another datastore that resorts to pairs of seek/red-write instead of pread/pwrite calls.

3. Process/connection model makes no real sense nowadays - although to be fair, there is, today, practically almost no difference in terms of footprint between OS threads and OS processes (other than memory and FDs sharing semantics, they are practically the same). It’s still more appropriate to use threads (although I ‘d argue maintaining a pool of threads for processing requests and one/few threads for multiplexing network I/O is the better choice).

4. ALTER TABLE is obviously a pain point with mySQL, although I am not really sure many users with large datasets care; they probably figured out long ago it’s going to be an issue and they designed and expanded accordingly. It’s also a relatively rare operation. That said, other than using mySQL (or any other RDBMS) to build the data plane for an elaborate, distributed KV store, one should consider Salesforce’s approach too. Their tables have some 50 or so columns, and the column names are generic (e.g column_0, column_1, … ). They have a registry where they assign column indices (e.g column_0) to a specific high-level entity type (e.g customer title, or price), and whenever they need to query, they just translate from the high level entity to the actual column names and it works. They also, IIRC, use other tables to index those columns (e.g such an index table can have just 3 columns, table id, column index, value) and they consult that index when needed (FriendFeed did something similar).

5. Cassandra should have no problem supporting the operations and semantics of Shemaless ass described in their blog posts. However, given they already operate it in production, they probably considered it and decided against it.


w.r.t. 2:

https://www.postgresql.org/message-id/6248.1046130083%40sss....

     Manfred Spraul <manfred(at)colorfullife(dot)com> writes:
     > Tom Lane wrote:
     >> It seems unlikely to me that eliminating lseek on some platforms would
     >> be worth the hassle of maintaining two code paths.  lseek is mighty
     >> cheap as system calls go.
     >> 
     > It was considered expensive enough to write a syscall avoidance layer 
     > that caches the file pointer and skips lseek if fpos==offset.
     
     You're missing the point: that layer is mostly there to ensure that we
     don't foul up the kernel's readahead recognition for sequential fetches.
     It's nice that Linux doesn't care, but Linux is not the only platform
     we worry about.
     
     			regards, tom lane


This ML thread's only real argument is that some OS/Kernels may not support pread/pwrite. The readahead argument makes little to no sense IMO. Unless there are too many uses of random access IO in the codebase, they should use pread and friends if available there. Especially considering most people run it on Linux not some exotic OS nowadays.


Worth quoting from the article:

  Accordingly, using pgbouncer to do connection pooling with
  Postgres has been generally successful for us. However, we have
  had occasional application bugs in our backend services that
  caused them to open more active connections (usually “idle in
  transaction” connections) than the services ought to be using,
  and these bugs have caused extended downtimes for us.


What I don't understand: no system gets to maintain open transactions for free. MySQL keeps UNDO logs, so the effect there, much like Oracle, is possibly running out of UNDO space, as well as slowing down reads that have to apply UNDO to get the last-committed row version. So what gives? Did Uber Engineering fix the dangling transaction issues while migrating off, or are they relying on some other property of MySQL vs. Postgres?


They seem to be relying on the database take care of those :)


It really is sad that we all just can't appreciate a company like Uber giving us insight into their internal engineering choices. So many people on here think Postgres is so perfect that if you don't like it you must be using it wrong. Postgres is a tool like anything else. It has good use cases, and bad use cases.


Nice to see how Postgres (a relative of Informix) follows an old-school maxim to focus on Consistency and Durability by being "append-only" and never over-writing the data. Sticking to the right principles is better than over-optimization.

The Uber engineers should, perhaps, take a look at Changelogs of last 5 or so releases of MySQL to see how many bugs in InnoDB has been found in each release and read stories about data loses due to inability to repair storages.

According to old-school DBA tradition, it is much better to have an straightforward storage engine based on right principles written in C than fancy storage written in C++. At least if one values ones data.

Well, in the age of in-memory "databases", "durability through replication" and "eventual consistency" old school focus on disk commits might sound a bit outdated, until one gets that moment when shards got messed up and there is no way to know how many writes are missing and where.

Database is a durable storage which guarantees data consistency and ability to roll-back to a clean state through direct-access (by passing all caches) writes. At lest this is what we had in glorious times of IDS 7.3


One major advantage of MySQL's clustered indexes the article doesn't mention is that, although secondary key reads may be a little slower, primary key reads will be faster. The row data lives in the primary key index, so there is no need for referencing an additional database page (possibly causing random I/O).

This is especially relevant when doing range queries over the primary key. Imagine a table containing billions of chat messages, from which you want to retrieve a single conversation history. With a clustered primary key on (conversation id, message id), MySQL would need to process just a couple of database pages. Postgres, on the other hand, would need to reference a semi-random page for each of the messages.

Now imagine a 10k message chat conversation, a table too large to fit into RAM, and storage by means of spinning rust (yeah, yeah, I know what year it is :-)). The difference would be somewhere between 2 and 3 orders of magnitude.


Uhh ... Postgres supports Index-only scans; as long as the data you're asking for is in the index, that is.

So if you have an index on (conversation_id, message_id), and you try to retrieve message ids of a specific conversation, only the index will be touched.


Doh, you're right of course! We were having this problem in the Postgres 9.1 era, before index-only scans were a thing.

Still, it's quite inefficient maintaining an extra copy of the data that is never actually used. Though no longer multiple orders of magnitude less efficient.

However, I'd guess that programmers don't often think to add these seemingly useless fields to an index, as it feels inefficient and just wrong. But at least this offers an out in pathetic cases.


Fyi... a related (not duplicate) discussion of a previous Uber story: https://news.ycombinator.com/item?id=10923848


Wow, this is such a detailed analysis. Having used Postgres and suffered issues with data replication as well as database crashes, this post was really helpful.


The fact that Uber scaled to so many users with Postgress gives me such a relief. For now, I am good !!


So basically, if you don't intend to use it as a relational database, and you have enough scale to run cross-data-center (and across-the-world) master-master replication, then you should maybe switch from PostgreSQL to MySQL?


Why would anyone run hundreds of connections? A server can only process number_of_processor_cores connections at once. Sure, few connections might wait for I/O, but not hundreds, unless database is very untypical.


For example: You are using Python and you have 10 web servers and 20 background servers connected to a common DB. Each server has 10 threads, and each thread holds 1 connection open. That is 300 open connections.

Opening and closing connections is very slow and expensive, so almost always better to keep these 300 connections open than to try to be fancy.

You COULD try to say give each server only 3 connections and make them share, which cuts you to 90 connections.. but then you have to try to share state between different python processes (not easy), and will often end up with deadlocks and sync overhead.


Agreed, which is why many clients create a pool of connections that gets reused. Connection cost is expensive, and the rdbms already handles concurrency and even a couple thousand connections shouldn't be a significant overhead.


I'm trying to find evidence on what memory usage is for MySQL for 1k connections vs postgres with 1k connections. I am finding a lot of people saying postgres has heavier connections but for 1k connections what's the difference. 1MB of memory? 1GB?


https://wiki.postgresql.org/wiki/Number_Of_Database_Connecti...

Its a little outdated but I've found it largely holds. That is, you'd want a mighty box for 1000 concurrent connections.

That said, connection queuing works really well with postgres such that throughput is frequently better at lower connection counts than at higher ones.

I have no experience with mysql that is less than 15 years out of date.


the connections are idle, are routed to dozens or hundreds of separate processes themselves, and are held ready for fast response when that process needs to serve a request.

Unfortuantely, MySQL has no good replacement for Postgresql's PGBouncer which greatly mitigates the issue of cross-process connection pooling. (I'm actually working on one, but for Uber to use it they'd have to switch back to SQLAlchemy >:) )


In my experience DB connections are overwhelmingly idle. Even if you omit the obvious dead times between requests and consider only actively executing connections, a request will spend most of the time waiting for a resource or another (read IO, commit flush, latches, locks). It takes an extraordinarily well tuned app to be able to drive all connections to be anywhere near 100% CPU bound and never conflict or wait.


Like you said, the connection is still in use while waiting for I/O, at which point, another process or thread runs a bit and can also kick off some I/O on yet another connection.

Numbers vary, but that's the principal that uses more connections than cores.


A common solution to conserve bandwidth is to use compression. This can be done easily in PostgreSQL by using ssh tunnels and turning on compression. I wonder why they didn't try that.


Reading this, I wondered if they deeply understood the difference between bandwidth and latency. I doubt compression would be the thing, as I'd guess they had a latency problem, not a bandwidth problem.

You see this kind of misunderstanding commonly pooled with other "a computer has physically moving parts" misunderstandings like the ORM or connection pooling concerns outlined.

After tyranny of abstractions, nobody knows how the moving parts really work.

// As alternatives given you'd like to keep the immutable data approach which brings a lot of goodness, consider a log-structured file system for the disk concerns, and geo-sensible replication for the latency concerns. At this scale, for near real-time app, bi-coastal DB is a bad model. You shouldn't have all users in SF querying a database in DC. Given the nature of the business model, they can share geographically at one time scale, and roll up and replicate geo diverse data at a high latency leisure.


I'd like to see their migration strategy as well. I mean, they say moving from pgsql 9.2 to higher version (which then allows online upgrades) is too much work. Yet they'll have to migrate to mysql, which will take much more engineering effort. For anything close to realtime, they'll need to copy the old data, while at the same time forking the new writes into both pgsql slaves and new mysql servers. And they cannot use WAL for that without some advanced processing.

I hope this follows in the next blog post.


Does anyone know if citusdb or enterprisedb improve on the postgresql issues mentioned in the post vs last postgresql version?


CitusDB is essentially vanilla Postgres now that they've moved their implementation to an extension.

It won't specifically address the problems posted here, but will solve other ones like trying to scale a system beyond a single node.


I think a fair summary is "We had a few problems with PostgreSQL mostly due to our rapid growth. We rethought the problem and changed the way we use a relational database for large-scale storage and are now using MySQL as a dumb key-value store."

The conclusion reinforces this: "Postgres served us well in the early days of Uber, but we ran into significant problems scaling Postgres with our growth."

I read this a both endorsement of PostgreSQL as well as highlighting some of the problems that any large-scale use of it would run into.


Interesting post! While I suspect that a MySQL installation is just as likely to have its own problems in the long run, I'm not smart enough to provide any kind of compelling point-by-point refutation. However, a number of the points made strike me as having possible trade-offs that were not really addressed in-depth.

My summary of the arguments against Postgres and some basic thoughts on each:

1. Writes are more expensive because all secondary indexes must be updated with a new physical location.

This may be true, but the MySQL model of using primary keys from secondary indexes will mean that reads are inherently expensive. They even mention this:

> This design means that InnoDB is at a slight disadvantage to Postgres when doing a secondary key lookup, since two indexes must be searched with InnoDB compared to just one for Postgres.

So it seems like a classic read vs. write trade-off.

I'm also a little skeptical of any performance claims that don't include any numbers. It's possible that efficient coding in Postgres makes this much more of a wash in terms of performance than claimed here.

2. Replication is less efficient because it's sending a lot of physical information out along the stream.

This is quite true, but IMO unlikely to be a major issues for most users unless they're dealing with a huge amount of data and streaming it over a slow connection (i.e. across the continent like Uber's disaster recovery center).

3. Data corruption from a bug found in 9.2.

Certainly a bad situation, but IMO not really a valid claim for situation. 9.2 is way behind at this point, and there's not much to say that they wouldn't have encountered a similar bug or something worse in MySQL in all that time, especially operating at scale.

To give a counter-anecdote, I operated Postgres at scale for a long time across many versions starting at 9.1 and was lucky enough to have never once encountered a bug with data corruption.

4. Postgres' MVCC model makes it easy for replicas to accidentally fall behind their master.

This one is valid (and annoying), but there are very good reasons for it, and you have some switches to control the behavior based on value transactions finishing on followers or prompt replication more highly.

5. Upgrades are difficult because the WAL stream works at a physical level and is not compatible between database versions.

Again, this is valid, but the statement-based replication is a scary idea. Row-level replication is more interesting and probably something that Postgres should have though.

Some good news is that Postgres is getting closer to logical WAL streaming, which should make in-place upgrades possible.


> This may be true, but the MySQL model of using primary keys from secondary indexes will mean that reads are inherently expensive.

With MySQL the indexes are usually kept in memory so there should not be noticeable overhead.


with respect to "Difficulty upgrading to newer releases":

pg_upgade has a --link option which uses hard links in the new cluster to reference files from the old cluster. This can be a very fast way to do upgrades even for large databases (most of the data between major versions will look the same; perhaps only some mucking with system catalogs is required in the new cluster). Furthermore, you can use rsync with --hard-links to very quickly upgrade your standby instances (creating hard links on the remote server rather than transferring the full data).

that is all referenced in the current documentation: https://www.postgresql.org/docs/current/static/pgupgrade.htm...


I've tried to produce a full reply to most of the technical points raised there. http://blog.2ndquadrant.com/thoughts-on-ubers-list-of-postgr...


Also check out highscalability.com for more stories that value MySQL and its great InnoDB engine: http://highscalability.com/blog/category/mysql


I wonder how much of this could have been solved by using a different file system. There is all of this talk about the physical layer but no mention of the file system used.

> Typically, write amplification refers to a problem with writing data to SSD disks: a small logical update (say, writing a few bytes) becomes a much larger, costlier update when translated to the physical layer.

This is exactly the type of problem solved by the file system layer.


How would any filesystem help with that? SSDs typically write entire blocks, even if the OS asks them to only write a few bytes. That's just how SSDs work.


Rethink the file system. Think log-structured.


Doesn't matter. Even a log-structured fs, when told to write 4 bytes of new data to disk will have to write those 4 bytes immediately and return. If an application asks the OS (via fsync), and the OS asks the fs and the fs doesn't write to disk but tells the OS it did, then the fs just lied and risked data loss.

If power is lost between the fs lying to the OS and its subsequently actually writing to disk, the data that the fs lied about is lost.

You don't want that with a DB on top of it.


Why did they not consider Oracle or MS SQL Server? They can afford the licensing and both have numerous replication technologies to choose from.


Oracle ??? Let's start:

- No transactions for DDL changes.

- Oldschool commandline client. auto commit disabled by default. no history.

- Weird sql syntax + semantics. f.e. null == empty string.


I'd add

- Oracle InstantClient SDK needs an Oracle account to download and is closed source, and is not available as a package for anything except rpm-based distros

- getting it to run with PHP is a major PITA, once again due to the above-mentioned issues

- Holy f..ing cow, why does it translate everything down to ASCII by default instead of returning raw bytes?

- It's expensive as f..k

- Try to run a query using InstantClient with a ; at the end, it will barf

- DID I MENTION IT CANNOT DO A SIMPLE LIMIT AFTER AN ORDER BY?! (at least not until 12.1, which was released in 2013; due to various issues, one including the pricing, I have seen multiple orgs running way older versions. But come on, over 30 years with only ROWNUM?!)


I'm with you on all of your points except auto-commit: having it off by default is much better, it forces you to explicitly commit when you change data and thus think if you really want to persist the changes.


yes. except that all other clients have auto commit on. thus if you transition from that to auto-commit off. You face two issues: 1. you don't know. and you thing you commited something but actually you didn't 2. you forgot and you committed something. but transaction keeps an row lock open and you bring down the server... :)


They should use SQL Server (which has great replication abilities, although horizontal scale out is still difficult) or MemSQL (which is distributed, scalable, and can do everything they need).

Or use Cassandra which is a perfect fit (or ScyllaDB which is a better version of it).

This all sounds like an aversion to just paying for or using better products when the problem is easily solved.


SQL Server needs a couple hits with a cluebat to even satisfy the basic three requirements for a database system they listed (I'm talking about the MVCC line of course).

Honestly, if a commercial database provides what you require and you have the budget? Sure, why not. But I'd always try to avoid that myself, because the serious players (MS, Oracle, is anyone still using DB2?) are really, really expensive.


> basic three requirements

SQL Server works just fine and has lots of concurrency control to do whatever they need. And the way they're using the database doesn't seem to really make this an issue outside of their replication.

> are really, really expensive

This whole janky setup they have sounds even worse. None of the commercial relational databases are really that expensive considering what they offer, and we're talking about Uber here. We're a small startup that pays for both.

Money for (a better) working product with support is the right call, not build it yourself. This is just poor tech decision (outside of using a RDBMS in the first place).


Maybe you should call up Google, Facebook,linked in, Twitter, and tell them they all made mistakes and should use MSSQL.


Those are all massive scale tech companies that actually needed to invent many of the datastore technologies used for big data today. Uber is not among them.

But I'm sure you know all this as it seems you work for an enterprise database company that actually makes the exact product Uber should use.


My big think with MySQL is that the last time I needed to make this decision (about 3yr ago), the support tooling, community knowledge, and documentation around MySQL was light years ahead of Postgres. There were literally hundreds of MySQL clients and utilities and Postgres was "just a database".

Competition breeds excellence.


From this article, Can i assume the below point?

why postgres is designed such a way of "physical replication" rather than the design of mysql's ONLY "logical replication"?

Because postgres empowering data integrity with help of forceful constraints.


This is a fantastic read. I hope the pg folks can turn as many of the issues brought up here into bug reports as possible (I think many of the issues, especially re: replication, are known), this kind of feedback is invaluable.


Some of these may be tradeoffs rather than bugs. I'd love to read a reply from someone knowledgeable about Postgres internals - it'd be very interesting.


Yep! saurik's reply above sheds some light on the issue (read more important than write). Also, to the downvoter, sometimes the solution to a bug report is improved and more visible documentation of existing features.


There is this interesting talk on MySQL vs Postgres.

https://www.youtube.com/watch?v=emgJtr9tIME


while Postgress might be better if you use it 'as-is'.... community of MySQL is much better and the tools available are more mature... just goes on to prove that even if something is not-that-good.. it still might be successful,scalable and popular if there is a strong community behind it..


>community of MySQL is much better and the tools available are more mature

Can you backup your statements with some facts. I have seen the postgres community to be much better particularly now with Oracle taking ownership.


Perhaps their engineers can design a web page that allows the scroll wheel to work.


So let me try to summarise this.

Poor replica MVCC support

They are actually pointing to a blog article written in 2010 -> http://blog.2ndquadrant.com/tradeoffs_in_hot_standby_deplo/

Do they realise that it is 2016 ?

Guess they did't bother to understand the hot standby feedback system.

> Postgres’s design resulted in inefficiencies and difficulties for our data at Uber.

What kind of inefficiency ? The explain what is purpose of WAL and replication which every database person knows about but didn't care to explain the actual problem at hand ?

Data corruption

> During a routine master database promotion to increase database capacity, we ran into a Postgres 9.2 bug

Why the heck didn't they upgrade to a newer version ? Did you report this bug to pg dev , did they take so much time to fix this, or were you just assuming that the bug could fix itself ?

> The bug we ran into only affected certain releases of Postgres 9.2 and has been fixed for a long time now. However, we still find it worrisome that this class of bug can happen at all.

Postgres 9.2 is pretty old and there has been 3 major releases after that. WTF ?

I can say countless instances where MySQL data corruption was a constant nuisance with version 5.5 and they have fixed it with newer releases.

Replication

> During peak traffic early on, our bandwidth to the storage web service simply wasn’t fast enough to keep up with the rate at which WALs were being written to it

So you have run into a hardware limitation and then blame postgres. What was limit that you hit ? I don't understand this point at all.

Concept of context switching

I am surprised that this is actually an issue, in a database the slowest part is always the disk and not the CPU. Confused on how did they hit this limitation first without actually touching others.

Time taken by a context switch : http://stackoverflow.com/questions/21887797/what-is-the-over...

Which is in microseconds.

InnoDB buffer pool

> By comparison, the InnoDB storage engine implements its own LRU in something it calls the InnoDB buffer pool

Postgres has something similar called shared_buffer. They are speaking as if postgres relies entirely on the operating system which is false.

> It makes it possible to implement a custom LRU design. For instance, it’s possible to detect pathological access patterns that would blow out the LRU and prevent them from doing too much damage

Not sure what kind of damage they are speaking. In a postgres sequential scan (full table scan), a ring buffer is used instead and does not result in the shared buffers being blown away.

If you need a custom LRU design, there is definitely something wrong in the way that you are using an OLTP database.

Connection Handling

This is complete BS. Nobody uses databases without connection pools. Agree that a thread is more lightweight than a process, but you would never hit this limit at all in real time which is in the order of microseconds again. In a production system, one would open connections immediately and then hold them in the connection pool. This overhead is almost not visible at all. If you are constantly opening and closing connections then there is something seriously wrong with your design.

> However, we have had occasional application bugs in our backend services that caused them to open more active connections (usually “idle in transaction” connections) than the services ought to be using, and these bugs have caused extended downtimes for us

So they are blaming the database for a bug in their design/system. Computers are no match for human stupidity.

> Accordingly, using pgbouncer to do connection pooling with Postgres has been generally successful for us.

Again what is the problem, the whole article smells more and more like a useless rant, just because you dont know how to use them ?

Conclusion

Another thing is that they have not given any kind of query/access pattern in which they use postgres/mysql. They put in a couple of low level things and then say that postgres is badly designed.

I can think of only two logical explanations

1) The article writer was already familiar with MySQL and they didn't bother to even dig into postgres deeper

2) They have been paid by oracle :P


I completely agree with you. The blog post seems to be seriously biased to justify their choices due to other reasons.



Why not PostgreSQL? (Sorry, someone had to say it.)


I like the sample data they have used:

  id  first         last            birth_year
  1   Blaise        Pascal          1623
  2   Gottfried     Leibniz         1646
  3   Emmy          Noether         1882
  4   Muhammad      al-Khwārizmī    780
  5   Alan          Turing          1912
  6   Srinivasa     Ramanujan       1887
  7   Ada           Lovelace        1815
  8   Henri         Poincaré        1854


Except it's MySQL, so by the time you read it out it says:

    id  first         last            birth_year
    1   Blaise        Pascal          1623
    2   Gottfried     Leibniz         1646
    3   Emmy          Noether         1882
    4   Muhammad      al-Khw�rizmī  780
    5   Alan          Turing          1912
    6   Srinivasa     Ramanujan       1887
    7   Ada           Lovelace        1815
    8   Henri         Poincaré       1854
(I know, I know. It is possible to configure MySQL encodings correctly. And given that they've put a lot of engineering thought into choosing MySQL, they certainly have.)


Don't know if it's still an issue, but I once setup mysql, with an id of binary(16) to store UUIDs, and used 0-padded conversions for legacy id's... when trying to port over existing records I discovered the index on a binary field was a "case-insensitive" textual index by default, so started getting conflicts when the ascii equivalent of "a" and "A" as a byte was inserted, etc. This was about 15 or so years ago, but the irksome nature still sticks with me to this day.


I wouldn't be surprised if that behavior remains the same. MySQL seems to be fine with long-standing bugs and other sub-optimal behavior hanging around, eg. https://bugs.mysql.com/bug.php?id=20786 .


I used both Postgres and MySQL and in my opinion encodings are more difficult to setup with Postgres. You need to change some "template" when creating a database to use unicode. Otherwise it will use latin1. I do not even understand what a database "template" is, how it is related to encodings, and why it is so overcomplicated.

In MySQL there is no templates and you can change the encoding of a table at any time using ALTER TABLE statement.

And in cloud IDEs like c9.io you cannot use unicode collations in Postgres because they need to be installed separately and they are not installed.


You certainly don't need to change any templates when creating a database, CREATE DATABASE and createdb both support specifying the encoding.

Changing encoding of a table (or a database after creation) is usually not a wise thing to do; since the db engine is not going to go over all the data and convert it anyway, so if your data is corrupted now it will remain corrupted. If one simply needs to tell the DB to treat the data differently without worrying about data conversion, the encoding of a database is stored in `pg_database` and can be changed with an UPDATE query.

And this is all from the official documentation. I'm not a Postgres expert.


I can confirm. I recently set up a private mattermost server. Because the host I'm running it on didn't have the UTF-8 locale on by default (I think), Postgres defaulted to Latin-1, and mattermost ran fine, right up until someone sent a UTf-8 character which is unrepresentable in Latin-1.

That was No Fun™ to solve.


So ... you'd rather have a database with mixed encodings, on a table-by-table basis?

Ugh.


Mind explaining the significance? I didn't pick up on it.


They're all famous scientists/mathematicians.


My understanding is that it is very diversified in terms of race, gender, ethnicity, sexuality identification, etc.

Make no mistake, they are all great mathematicians and deserve a place in math pantheon.


They're all famous mathematicians


It caught my eyes as well. Like it very much.


old mathematicians


No, dead mathematicians


No love for Kurt Godel I see.


Uber """"""""""engineering""""""""""


I like this transparency, I know I never want to work at uber.


The connection handling section was surprising to me, reading that Postgres uses a process per connection! This is pretty shocking to me, in a bad way.


One could use connection pooling. Quoting from the tuning guide [0]:

  max_connections sets exactly that: the maximum number of client
  connections allowed. This is very important to some of the
  below parameters (particularly work_mem) because there are some
  memory resources that are or can be allocated on a per-client
  basis, so the maximum number of clients suggests the maximum
  possible memory use. Generally, PostgreSQL on good hardware can
  support a few hundred connections. If you want to have
  thousands instead, you should consider using connection pooling
  software to reduce the connection overhead.
Replication, Clustering, and Connection Pooling [1]

[0] https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv...

[1] https://wiki.postgresql.org/wiki/Replication,_Clustering,_an...


pgbouncer works pretty well for that. I kinda wish it was part of the default Postgres install.


How does pgbouncer work with Amazon RDS (where you can't install separate software)?


you install it on the clients of the servers themselves and "connect locally" to the bouncer or have a bouncer step (server pair) in front of the database connections themselves.


pgbouncer does not need to run on the same machine as the DB. It can run on an EC2 instance in front of your RDS instance.


Different design goals, Mysql lot of connections with shared memory model and fast running small queries, ex: PHP one connection per page view. Postgres dedicated memory model long running complex queries.


So PG has a process pool instead of a thread pool. (I doubt either database is spawning procs/thds anew willy nilly for each request).

This means the PG has explicit IPC overhead, vs the quick and seductive path (to the dark side?) of simply sharing memory between threads. Safety vs speed.


It actually forks a new process per connection.


this reads like a laundry list of buzzwords that were designed to justify not throwing any effort into postgresql and just going with a new shiny toy (not mysql. yes. i know it's been around for a while).

it happens everywhere.


From the post:

> [...] This design difference means that the MySQL replication binary log is significantly more compact than the PostgreSQL WAL stream.

Doesn't sound like what you described at all.


so you're pulling one line from the article to tell me that i'm wrong?

on-disk format/write amplification: > For tables with a large number of secondary indexes, these superfluous steps can cause enormous inefficiencies. For instance, if we have a table with a dozen indexes defined on it, an update to a field that is only covered by a single index must be propagated into all 12 indexes to reflect the ctid for the new row.

How wide is their data? Depending on the answer to this, it could be that they've over-indexed, have a poor indexing strategy, or are reacting to the poor queries generated by an ORM (not sure if they use one, or if they hand-code their own SQL).

data corruption: everyone has bugs. https://bugs.mysql.com/search.php?search_for=&status=Active&... add in mysql's tendency to loosely-adhere to the SQL standard and there are many ways that you can actually corrupt your own data.

i'm not here to debate whether postgres is better than mysql. i'm just saying that it seems like a lot of research went into justifying a switch. who knows, maybe that research could have been spent optimizing their current environment.


>so you're pulling one line from the article to tell me that i'm wrong?

Better than telling them that they are wrong while not only not pulling even one line from their article, but misattributing it to be something very different from what it is.

In what world does the response to the concerns and analysis in the article can ever be: "data corruption: everyone has bugs"...


the same world where mysql lets you corrupt your own data. did you just stop reading? in what world does a data corruption event prompt you to change platforms to another platform that has a history of data corruption?


They do address other stuff. They have huge writes and needs better writes performance. Maybe not what you and I need, but hey, I suppose they know Uber needs better?

MySQL handles it differently than Postgres, and gives them better performance for their purpose (based on their experience/test). They were explaining it in the parlance of the terms that MySQL and Postgres. If those are buzzwords, then MySQL and Postgres are both created using buzzwords?

Their explanation is not perfect (for me, why do their datamodel needs massive updates?). But I wouldn't write it off as buzzwords and dismissing Postgres because of data corruption. There are a lot of other things they were trying to explain there.


Isn't WAL-based replication more reliable?


It is. But at least MySQL provides both ways of shipping changes, WAL-shipping like row-based replication and the less reliable statement-based replication, and the DBA can choose which to use when.

Combining that fact with the way Postgres and InnoDB handle secondary indexes means WAL-shipping not only ships the entire new row, but also all the disk blocks of all the secondary index updates, unlike MySQL's row-based replication.

This is actually something I greatly like and want to see in Postgres. Perhaps the decision to choose secondary indexes with direct links to rows was taken because at that time (before replication) it was less read-heavy and the write-amplification wasn't such a concern. But now, when replication is a common requirement (and network IO is not as fast as disk IO) it makes a lot of sense to switch to a single-point-of-update allowing way of storing secondary indexes


Yes, MySQL, the shiniest of new toys...


now sure MySQL qualifies as "new" or "shiny" or "toy"


but getting to use "Schemaless" and "microservices" so that they could "shard"


Because sharding and schemaless are some recent novelties, huh?

(And even microservices are established practices -- heck Amazon was built on them more than a decade ago).


and so they could "get it working"


it doesnt sound like it wasnt working, though.


MySQL can hardly be described as shiny new toy. The peak of its hype cycle was around 2001


"Schemaless" on top of it and a shiny new "microservices" layer. Please, by all means, continue to take one aspect of the stack in an attempt to misrepresent me.


>this reads like a laundry list of buzzwords

The "laundry list of buzzwords" is a detailed, up to engineering standards, analysis of the issues they faced, with example cases and explanatory follow-up.

As far away from a "laundry list of buzzwords" as you could possibly get. Not to mention they did the very opposite of switching willy nilly to some new "shiny toy".

In other words, the comment is not even wrong.


> not throwing any effort into postgresql

The amount of research about the on-disk internals of both PostgreSQL and MySQL are a lot more effort than I would have probably spent (granted, I don't have a team of highly paid devs at my disposal, but still, I've seen technical decisions made on the basis of Google Trends...).


In my personal opinion, those bits of research aren't much, and they would have found it when they looked around for the cause of the write-amplification. A truly good amount of research would also have popped up heap-only-tuples, as mentioned here[0], or other ways to mitigate too many secondary-index updates.

0: https://news.ycombinator.com/item?id=12167161


It is not much effort to understanding the storage of each database. It is like a common sense for database guys.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: