Hacker News new | past | comments | ask | show | jobs | submit login
Why we lost Uber as a user (postgresql.org)
1234 points by areski on Aug 1, 2016 | hide | past | favorite | 285 comments

The PoststgreSQL project never fails to impress me. I know that there are some use cases which are not currently covered by it versus alternatives, but I have consistently got the feeling that everybody involved in the project is supremely professional and interested in building an excellent database – and the focus is on how to work to fix these use cases, instead of pointless mudslinging. Class act.

True, they're admitting there's a problem (which exists for this very specific use case) and not just deflecting the criticism.

I'm impressed; I'd say a lot of projects couldn't have handled this so nicely.

They're trying to "Do One Thing And Do It Well". It's always interesting how core UNIX philosophy often gets lost in Startup Culture's rush to "Disrupt The World".

Databases do lots of stuff: authentication, data persistence, query optimization, process management, even a file system.

They are not the prime example I would use for DOTADIW.

Provide one abstraction and provide it well, maybe?

Puts me in mind of something I've seen done in Redis, and in the OSX (really NeXT) CoreFoundation container classes. In both, you instantiate an abstract data-structure based on worst-case time-space guarantees you want from its API. Then, the implementation actually instantiates it as one of several underlying concrete classes, usually starting with one that wins for low content sizes just because of fewer dereferences needed (e.g. a "ziplist" in Redis); and then migrates the content to a different underlying concrete container transparently when content reaches a certain size.

In both cases, you're giving the user a cleaner abstraction that they can think less about, but you're accomplishing it by making the computer itself do far more internal book-keeping, and creating a much more heavily-engineered design.
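A minimal sketch of that pattern in Python (the class name and cut-over threshold are invented for illustration; Redis's real heuristics also weigh element sizes, not just entry count):

```python
class AdaptiveMap:
    """Mapping that starts as a compact list of pairs (cheap for small
    sizes, like Redis's ziplist) and transparently migrates to a dict
    once it grows past a threshold."""

    THRESHOLD = 8  # hypothetical cut-over point

    def __init__(self):
        self._impl = []  # small: plain list of (key, value) pairs

    def set(self, key, value):
        if isinstance(self._impl, list):
            for i, (k, _) in enumerate(self._impl):
                if k == key:
                    self._impl[i] = (key, value)
                    return
            self._impl.append((key, value))
            if len(self._impl) > self.THRESHOLD:
                # migrate to the "large" representation; callers never notice
                self._impl = dict(self._impl)
        else:
            self._impl[key] = value

    def get(self, key):
        if isinstance(self._impl, list):
            for k, v in self._impl:
                if k == key:
                    return v
            return None
        return self._impl.get(key)
```

The caller only ever sees set/get; the representation switch is pure internal book-keeping, which is exactly the trade described above.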

How does adding support for JSON types work with the Unix idea?

True, this is certainly a commendable move on their part; beyond making the right decision here, I feel they've shown real professionalism and class.

Color me impressed as well.

> The PoststgreSQL project never fails to impress me.

Except the name is a bit clunky and hard to write ;)

A small bit of history from Tom Lane, one of the biggest contributors to the project: "Arguably, the 1996 decision to call it PostgreSQL instead of reverting to plain Postgres was the single worst mistake this project ever made."

- https://www.postgresql.org/message-id/2693.1152762174@sss.pg...

> It seems too late to change it now

That's what they said in 1996. Then they said it again in 2006. In 2026, when it's interwoven in all sorts of AI systems and still causing confusion, they will look back wondering why they didn't fix it back in 2017 when they still could. "Now" is always the right time; it's never too late.

“The best time to plant a tree was 20 years ago. The second best time is now.” – Chinese Proverb

And the name "Postgres" was an extension of "Ingres":


Or pg if the context is clear enough.

That might be more than slightly confusing here on Hacker News.

psql? pgsql?

Now we wait for pg to release his Arc lisp based database system called pgsql

Neither works: psql is the bundled postgres shell and (pl/)pgsql is postgres's procedural SQL extension (inspired by Oracle's PL/SQL), so both are pretty ambiguous.

pgsql is exclusive to postgresql, and from the context it should be very easy to know if you're talking about the programming language or postgresql

pg sequel or pg ess-queue-el

are both very good, short, and difficult to confuse with another project or product

i think psql refers specifically to the command line tool of pgsql

It's called Postgres because it was the second DBMS that Stonebraker started after Ingres. So it's "post Ingres", or "post gres"

Then they made it speak SQL, so it was called PostgreSQL, which some might confuse as "Postgre SQL".

If only a clunky name was the worst problem with some software (even closed source) we have to deal with.

Slightly clunky names are fine as long as they're somewhat cute or clever. Names are important because it's how we identify and remember things, and a catchy name (even if it seems a little ridiculous or clunky) is better than a boring, banal name if the catchy name is memorable.

I'll take PostgreSQL's name (which I consider clever) any day over some name completely devoid of originality or thoughtfulness, such as most of the names Microsoft uses for its software.

Yes, at least it's searchable. The current trend of using common words as names was ridiculous as of a few years ago. Now it's just painful. Someone's going to name their project "The" and I'm not sure I'll be able to remain civil at that point.

There used to be a project named “THE”, but its name was changed to “Archy”:


Microsoft's naming (or lack of) is a mixed blessing.. usually combining "Microsoft" + term works... what sucks is when you search for anything "SQL" and you get a bunch of MS-SQL related links... really, T-SQL and MS-SQL should always be prefixed when referring to the vendor's product.

> The PostststgreSQL project never fails to impress me.

Let's see how many iterations we'll need for people to actually notice what was pointed out here ;)

I see clunky names as neutral, as the reduction in search engine collisions usually balances the difficulty in typing.

Maybe, but it's a much better name than mysql. I feel like a little kid every time I say "my" anything.

"My" of MySQL is Monty's daughter.

"He lives in Helsinki with his second wife Anna and daughter Maria (after whom MariaDB was named), and has a daughter My (after whom MySQL was named)[14] and a son Max (giving the name for MaxDB) from his first marriage" https://en.wikipedia.org/wiki/Michael_Widenius

Also, I'm pretty sure most of us are mispronouncing MySQL: IIRC, "My" (the Finnish name of his daughter) is pronounced like English "Me".

The y is pronounced like the ü in über.

A tip that works for some: lips to say "oooo" (like an owl, or someone seeing something shiny for the first time), tongue to say "eeee". Well, worth a try.

Actually it's more like the 'j' in Eyjafjallajökull.

Wait, I think you are all just fucking with me...


Not the American pronunciation!

The ü in über is pronounced [y]. I wonder if IPA originally got this symbol from Finnish orthography, then.

I think you will find [y] in most Germanic languages, including German, Swedish, Danish, Norwegian, (most likely) Icelandic, and Dutch, so probably not. But it is a very "Finnish" vowel though... Hyyvää!

Besides, Widenius is Swedish-speaking (just like Linus Torvalds and about 10% of the people of Finland) and the name "My" was more or less invented by the also Swedish-speaking Finnish author Tove Jansson in her books about the Moomins.

Although wikipedia claims it to be a short form of Mary or Maria, which I seriously doubt.

Anyway, Finnish has not that much to do with the name. I think it allegedly was Tove Jansson's uncle, a professor of mathematics, who suggested the name based on the mathematical symbol and Greek letter μ as pronounced in Swedish. It might even be vaguely similar to how it was pronounced in ancient Greek :-)

Besides, I think that "My" in Finnish would be spelled "Myy" since just about the only simple thing with the Finnish language is that the vowel length is indicated by the number of characters.

Oh, thanks! I was led astray by people earlier in the thread mentioning Finnish and so I simply assumed Monty's first language was Finnish, which, as you point out, it isn't.

Maybe Monty should upload an .au file somewhere saying "Hello, this is Monty Widenius, and I pronounce MySQL as MySQL"!

I don't speak Finnish. But I did once hear David Axmark pronounce My's name, and to my ear it sounded like "Mih" - an "M" followed by a short vowel similar to the vowel in "bit". In other words, My's name is the same as Mitt Romney's first name, just without the "t" at the end.

You might want IPA [ɪ]. (But another comment in this thread suggested [y], a sound that English doesn't have, and which is close in several ways to English [i] in "me" or [ɪ] in "bit".)

My is a Swedish name, not a Finnish one, as Monty is a Swedish-speaking Finn. It's pronounced similar to "Myeuh", so "Myeuh Ess Kuh Ell".

The variations on the project name I've read in this thread alone are amusing.

Three of them being: Postgre, Postgreql, PoststgreSQL

On the other hand it makes for more relevant search results!

Edit: did not see the typo at first though the word seemed longer than usual...

Too bad they didn't name it something non-controversial like cockroach.

Given a choice, I would name it PostSQL.

>> "focus is on how to work to fix these use cases, instead of pointless mudslinging"

More than that, within PoststgreSQL I get the sense that while there are serious discussions, all parties try to understand alternative positions and resolve conflicts instead of letting them brew.

To be fair though... It DID take a user on the scale of Uber to find and report and migrate away because of the problem.

I agree, I opened that link expecting a "you're doing it wrong" style post. I was pleasantly surprised by what I saw.

Wow, nice explanation. I was expecting a lengthy post refuting everything Uber said and explaining why pgsql was a better choice and Uber was wrong.

Nice and clean post showing that acknowledging a weakness isn't a terrible choice.

I do wonder whether a different data structure would have mitigated the issue instead of transitioning to a different storage engine.

Yes, I have a lot more respect for this than say MongoDb which claims to be great at everything.

For the kind of stuff Uber stores, they may actually be doing it wrong (given what that Postgre mailing list post says) because that is one hell of an ugly use case for any DB.

I would have tried solving it by loading a dual E5v4 server full of 3TB and a slew of SSDs for L2ARC+ZIL under ZFS: more SSDs > bigger SSDs, because the absolute worst-case SSD performance that any and all SSDs suffer from is random reads (not writes), which can slow to 200MB/sec even on those insanely fast "enterprise" PCI-E SSDs that do 2-4GB/sec continuous reads.

Given that, there are only four dbs worth using: Postgre, Oracle, DB2, and MS SQL. Unless you're doing something that is truly not suited for SQL (or SQL is insanely overkill), or you're Google and have a database in the hundreds or thousands of TB and literally have to write your own database (they went from almost inventing mapreduce, to going full circle back to full scale distributed RDBMS SQL with F1, which I wish they'd open source), any other database is just going to be a pain in the ass, buggy, and full of gotchas and undiscovered corner cases simply because it doesn't have over a decade of development and millions of users behind it.

Also, I have not included MySQL for obvious reasons: fun for toy SQL DBs where sqlite isn't a good fit, but not for enterprise use by any means. Uber switching to MySQL over Postgre is rather scary, I wouldn't want to be a Uber investor right now.

You wouldn't wanna be an Uber investor because of their choice of DBMS? As a programmer, I think this is the problem with developers: thinking that technical problems are a bigger deal than they are.

Uber relies entirely on their database. If the database is slow, or cannot be considered reliable under extreme load, or can lose data in ways that are hard to recover, then Uber can potentially lose a lot of money especially during peak hours.

Yes, I'm aware there are systems for MySQL to handle failure, but I'm also aware of the systems for Postgre, and Postgre's failure handling seems to be far saner and easier to recover from. Defense in depth against failure is easier in Postgre, from my experience.

This would be like Elon Musk blogging about how "oh yeah, sometimes the brand new Tesla factory shuts down completely because sometimes the power goes out; but we're using really popular well known power distribution systems, so we're industry compliant, so everything is okay."

If Elon Musk blogged that, HN would go apeshit, and rightfully so.

A better standard HN car analogy:

Postgreql is trustworthy and predictable and engineered and engineerable. It's like a German union automobile plant press operator sitting down on the job and crossing his arms until the broken safety switch is fixed, which will take precisely 3.25 hours and cost $X while the resulting assembly line shutdown costs 1000 x $X. But it'll be safe and nobody's gonna lose an arm. Your downtime and related costs are more or less predictable. Maybe not the highest productivity plant in the division, but a nice safety record.

Mysql is best effort. The safety switch on the press breaks, redlining that machine and shutting down the entire plant. Hmm, if I stick my arm in that 50 ton press while it's operating, that'll hurt a bit, so let's just not do that. Dude's a real tryhard, which always ends like you'd expect. Of course safety regulations were literally written in blood, so at some unpredictable time in the future you'll get a $1M personal injury lawsuit for loss of an arm and a $10M OSHA fine, and the plant will be shut down for the criminal investigation for a random indeterminate amount of time plus the interval required to remove arm from press. Your downtime and costs are completely unpredictable, but probably, averaged over a very long term and many people, lower than postgresql's. The plant will have a higher productivity metric result, and also a worse safety record.

I applaud you. That is exactly MySQL vs Postgresql in a nutshell.

Can you cite any of this? Facebook, Twitter, Google (for a long while), Uber, and Yahoo all run critical systems on MySQL.

No I won't.

However, there is an important point: philosophy doesn't matter when times are good. It's only when the tool is misused or there's a malfunction that the underlying philosophy even shows up.

It doesn't matter which system you use when you try to store Aug 1 2016 into a date column, but (at least in the old days) it was very interesting trying to store February 30th into the databases. Insert anyway with a warning? Round up, down, or stick in a null? Normalize it to being March 2nd-ish? Insert fails completely with an error? This has varied with time and configuration, but in a "general sweep of history" manner you can guess correctly most of the time what each DB does.
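For a feel of the two extremes, here's a hedged Python sketch: the standard library's date takes the "fail with an error" route, while adding a day offset mimics the "normalize forward" behavior some databases chose:

```python
from datetime import date, timedelta

# Strict behavior (like an INSERT failing completely with an error):
try:
    date(2016, 2, 30)
except ValueError as exc:
    print("rejected:", exc)

# Lenient normalization (like rolling Feb 30 forward into March):
normalized = date(2016, 2, 1) + timedelta(days=29)  # "day 30" of February
print(normalized)  # 2016-03-01 (2016 is a leap year, so Feb has 29 days)
```

Whether a database errors, warns, nulls, or normalizes is exactly the philosophical difference the comment above describes.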

Also there's nothing wrong in any way with a critical system that drops into philosophical best effort mode during a crisis rather than paralytic halt mode. Well, there's nothing wrong with it as long as the system was engineered with that in mind and neither the dev nor ops people are surprised by that behavior. Sometimes that is the right thing to do.

Here is an Uber engineer's talk about their worst outage ever. 16 hours of downtime for their API as they repeatedly try and fail to promote a new postgresql master and reparent slaves to it.


https://surge.omniti.com/2015/images/presentations/MattRanne... (Slides 39-63)

The problem mostly happened because people just ignored the "running out of disk space" message. Would happen on MySQL, too. And I really wouldn't want that to happen on Galera; I guess that would be a way bigger disaster.

Running out of disk space on any RDBMS is a bad day.

On the other hand, it's an internal thing. If it is a poor choice in the end, they'll change it -- that's what investors have confidence in.

Please don't call it Postgre...

Mysql eats and corrupts data by design. For a company responding to real time events in physical world, that can be a big issue. I know they're trying to improve their defaults lately, but a lot of weird behaviour remains. And you don't have to be an expert DBA to know that choosing a technology known for silent data corruption is risky.

>Mysql eats and corrupts data by design.

I'd like to have a source on that; it would help shut down a lot of MySQL discussions if true.

The design of MySQL has a lot more silent failures, silent coercing of data, and other ways that it attempts to do what it thinks you might mean (because you're an incompetent PHP programmer) instead of what you ask. The obvious example is that SELECT 0 = 'banana' returns 1.
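MySQL reaches that result by implicitly casting the string to a number before comparing. A rough Python imitation of that cast (the regex is a simplification of MySQL's actual parsing rules, and the function name is made up):

```python
import re

def mysql_numeric_cast(s):
    # MySQL-style coercion: parse the longest numeric prefix of the
    # string; a string with no numeric prefix silently becomes 0
    m = re.match(r"[-+]?\d*\.?\d+", s.strip())
    return float(m.group()) if m else 0.0

# 'banana' coerces to 0, so comparing it to 0 comes out "equal" -
# which is why MySQL's  SELECT 0 = 'banana'  returns 1
print(mysql_numeric_cast("banana"))    # 0.0
print(mysql_numeric_cast("3 apples"))  # 3.0
```

Postgres, by contrast, refuses the comparison outright with a type error rather than guessing.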

A typical takedown would be the likes of: http://grimoire.ca/mysql/choose-something-else (which also touches storage engine configuration things that are easier to defend against by an experienced organization). Unfortunately this sort of takedown solves very few discussions.

It's the On Error Resume Next of databases.

I'm pretty sure Uber has enough money and enough developers to configure MySQL to not do any of that.

You can't configure MySQL to not do "any" of that. You can certainly make it better, but there simply aren't options to configure away all of the boneheadedness.

There are also tons of hidden gotchas that exist in, for example, the query planner. It can be extremely fickle and suddenly switch from a performant query plan to a terrible one that creates unindexed temporary tables and sorts them or joins against them. Or just ignores relevant indexes in the tables whatsoever.

Everything hums along fine until a random INSERT or UPDATE causes the query plan to change, bringing down your entire site. To be fair, such a problem can happen in any DBMS but I've never experienced it with Postgres to the extent that I have with MySQL.

You're thinking there are databases out there that are flawless, that never corrupt data, but that's garbage. They all do to a degree. They're also subject to corruption from hardware failures that aren't related to software.

Anyone with a huge production database running under load is going to have ways of mitigating these problems. Tumblr manages with MySQL, they open-sourced some of their tools like JetPants (https://github.com/tumblr/jetpants) to help build huge datasets.

So maybe Uber made a call and said "we can deal with intermittent corruption problems, we can recover from those, so long as the performance is better because a reputation for being slow is something we can't recover from". Life is all about trade-offs.

Nothing in my comment implies anything close to thinking that there are databases that are flawless or don't corrupt data. I'm not sure how a comment about poor query optimization could possibly be interpreted that way.

Regardless, MySQL by default silently eats data in common situations (truncation of VARCHAR) and returns flat-out incorrect results due to PHP-style "helpful" coercions (SELECT 0 = 'banana'). It implements UTF-8 incorrectly, but fixing it would break existing apps, so we're forever stuck with a "utf8" encoding that isn't.

There are a million more of these, and while some of them have workarounds (strict tables, utf8mb4), many of them don't (automatic coercion, boneheaded query planner, creating implicit temporary tables without indexes even when present, etc.).
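The "utf8 that isn't" issue boils down to MySQL's legacy utf8 (utf8mb3) storing at most 3 bytes per character, while real UTF-8 needs up to 4. A quick Python check illustrates which characters are affected (the helper name is made up for illustration):

```python
def fits_mysql_utf8(ch):
    # MySQL's legacy "utf8" (utf8mb3) stores at most 3 bytes per
    # character; real UTF-8 code points can need 4 (e.g. emoji,
    # many CJK extension characters)
    return len(ch.encode("utf-8")) <= 3

print(fits_mysql_utf8("é"))   # True  - 2 bytes, fine
print(fits_mysql_utf8("中"))  # True  - 3 bytes, fine
print(fits_mysql_utf8("😀"))  # False - 4 bytes, needs utf8mb4
```

Hence the utf8mb4 workaround mentioned above: it's the encoding "utf8" should have been all along.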

A comparison of MySQL to PHP is apt, honestly. The fact that PHP is a blight doesn't mean other languages don't have their own problems. But PHP (like MySQL) is in a league of its own here.

You can add in checks at the application level - like you would need to with a number of NoSQL databases anyway.

You can Google for silent truncation for a quick example. To be fair, MySQL > 5.6 has fixed some of these issues and it also has some flags that can be set to help prevent them.

The "by design" part refers to early versions of MySQL, where the project purposely did not care about ACID. Speed was the number one driver.

I was recently advised by a DB consultant whose area of expertise is MySQL that 5.7 is still too new and risky and that he would advise against upgrading for at least another 6 months or more. He feels that the releases come out much, much too unstable and unpolished and that it typically takes at least a year since release before he's comfortable running it in production. I don't know enough about MySQL to know if that's true or not.

We are now investigating switching to MariaDB instead. (I'd personally love to move to Postgres, but that's not likely to happen any time soon)

I can confirm that the query optimizer introduced a rather serious bug (significantly suboptimal plan for queries involving low cardinality indices), which caused serious issues in our system.

This, in addition to the fact that index merging has been broken in MySQL 5.6 for more than a year now (in some cases it will cause empty resultsets to be returned), and that it is still broken on MySQL 5.7.

Do you happen to have a link to the bug report for the first issue you described? I'm wondering whether I saw a similar thing in a benchmark I tested.

Sorry, don't have it. We experienced three separate instances of it, and have yet to open a report.

Beware slight ddl incompatibilities. For example Maria will dump timestamp field size, which mysql doesn't understand (or was it the other way around?...)


The docs describe a lot of ways data can be corrupted if you don't have the right configuration and database engine:


I don't think any of those apply in strict mode, which is the default in recent versions of MySQL.
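For reference, strict behavior is governed by the sql_mode server variable; a hardened my.cnf fragment might look something like this (a sketch only - flag availability and defaults vary by MySQL version, so check your server's docs):

```
[mysqld]
sql_mode = "STRICT_ALL_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_ZERO_DATE,NO_ZERO_IN_DATE,ONLY_FULL_GROUP_BY"
character-set-server = utf8mb4
```

Note that strict mode addresses the silent truncation and zero-date issues, but not the implicit type coercions discussed elsewhere in this thread.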



> Also, I have not included MySQL for obvious reasons: fun for toy SQL DBs where sqlite isn't a good fit, but not for enterprise use by any means. Uber switching to MySQL over Postgre is rather scary, I wouldn't want to be a Uber investor right now.

I've been running a few MySQL setups for a decade, why do you think it is not a good fit for enterprise apps? InnoDB solves most of the previous limitations. I also think MySQL is slightly easier to deploy/scale than Postgres. I believe Facebook is still running MySQL for instance.

> ... or you're Google and have a database in the hundreds or thousands of TB and literally have to write your own database (they went from almost inventing mapreduce, to going full circle back to full scale distributed RDBMS SQL with F1, which I wish they'd open source)

You may be interested in CockroachDB, which was inspired by Google's F1 and Spanner implementations.


Another contender was FoundationDB, but Apple bought them and stopped selling it as a product.

> Another contender was FoundationDB,

No, it was not. FoundationDB's SQL layer was a joke. The system was horribly slow. Every DBA that I talked to that tried it said that it did not deserve the hype that it got.

> Uber switching to MySQL over Postgre is rather scary, I wouldn't want to be a Uber investor right now.

I guess you wouldn't have wanted to be a GOOG investor either? Most of their early revenue was from AdWords on MySQL.

I actually know engineers who work for Google on exactly that. Google maintains their own MySQL fork internally, knows exactly how MySQL fucks up, and has planned for MySQL failure in depth.

Google is maybe the only company I trust to know what they're doing when it comes to databases.

What about Facebook? I'm sure I've read they've forked MySQL too.

You realize that Google, Facebook, Yahoo, and many, many other shops have successfully run on MySQL for well over a decade, right?

You can armchair engineer all you like, but at the end of the day, the proof of the engineering is in the working.

And if you have Uber stock you'd like to sell just because you think they're making a questionable engineering decision around databases (despite all the real world evidence that it is adequate to serve large scale businesses), I know a few investors who would gladly take it off your hands.

Github works on MySQL and they got pretty good at it. Uber's engineers made a respectable decision by switching to a lower-hyped database based purely on excellent engineering.

Mitigating against your database going mental is scary and should not be encouraged, but it's the default in most MySQL shops.

Whether you /can/ coerce it into doing the right thing is not the issue; Uber are doing their own thing regardless of whether it's right anyway, so they're unlikely to change in that regard. What makes you sure they will do MySQL right at all?

Maybe Facebook investors should bail out too by that reasoning?

Facebook are using MySQL as an object store with no transactions, complex queries or even joins.

I could be wrong but most databases are pretty competent with this style of workload.

That isn't fully accurate.

It's true that the large majority of FB's mysql fleet is dedicated to the main product data, essentially an object-graph store with a restricted access pattern. The queries are indeed relatively simple. I can't remember for certain but I think that simple joins and transactions are actually used there.

However, the rest of the fleet supports an extremely diverse set of workloads -- keep in mind that MySQL is the primary data store of Facebook, and this includes product, ad serving, payments, async task persistence, internal tooling, many many other things. Countless different query patterns are in use. And although this is a minority of FB's mysql fleet, it's still many many thousands of machines supporting these other workloads, substantially larger than the vast majority of Postgres installations in the world.

(Source: I worked on MySQL automation at FB, and was lead dev on their RDS-like DBaaS, used by a large portion of the company's engineering teams in one way or another.)

I will readily admit that Postgres is a very good database, and definitely a better choice for OLAP workloads than MySQL (which has problems with complex queries) / InnoDB (problems with long-running read queries impacting old-row purge behavior). However, for extremely high-volume, large-scale OLTP workloads, MySQL/InnoDB is absolutely an excellent choice, imo better than Postgres for reasons of performance, ecosystem, and the number of experienced engineers who have worked at extreme scale.

Google, too.

Google's hosted CloudSQL (a hosted & customized MySQL solution) had tons of connection-drop issues. In the last incident it took them 72+ hours to resolve, and they had no idea what caused it or why (the resolution was a side effect of another issue that was fixed for some other customer complaints).

That was 1st gen CloudSQL right? 2nd Gen CloudSQL seems much more similar to running your own MySQL server. (Both the pros and cons of that)

Those were 2nd Gen CloudSQL instances. :(

We jumped onto the 2nd gen instances even when they were in beta.

This was my first thought, and generally what I always think when I hear about update heavy use cases. Typically they represent some issue in the object model, and often driven by an ORM layer.

I always ask is there a need for absolute correctness, and would this be better handled by a batch processing method.

I'm not sure any hardware would have solved their issue though. It may not have actually been disk speed, but issues in the mvcc design itself.

Facebook is a massive MySQL user, so you wouldn't be a Facebook investor either?

Maybe I've been in big companies for too long, but why not Teradata? Yeah it's expensive for the amount of storage Uber would need, but it's going to scale for them and perform well.

It is like saying I would not invest in Facebook as it is using PHP.

> Given that, there are only four dbs worth using: Postgre, Oracle, DB2, and MS SQL

This is total bullshit. Even a modest size database (3-5 terabyte) for a small company like mine, chokes a relational database bigtime on time series. Especially the modern use cases where you are ingesting fast. You need Cassandra or Hbase which are very serious pieces of technology and are definitely "worth using", and will crush any of the four relational dinosaurs you mention in that space at comparable cost, and with embarrassingly better scalability. And by the way, that space is not some corner case, it is probably the major growth area of the next computing era.

There are column-oriented relational databases for that. DB2 and SQL Server both implement columnar systems for OLAP-esque workloads pretty well, and I think Oracle has something like that as well. In the Free software space, MonetDB is pretty good and I believe there's some Postgres extension to store data column-wise (which is probably not competitive with ground-up column stores like MonetDB, but beats row-wise storage for certain workloads).

You don't need to give up proven relational databases for this web-scale bullshit. Financial companies have been doing “modern use cases where you are ingesting fast” at hundreds of times the scale for decades. KDB+ is the darling of that industry, but DB2 is pretty popular (IIRC) as well, as are some other niche options.

Right tool for the right job, basically. I have nothing against Cassandra or HBase, and they're both quite impressive pieces of engineering. But they're not an alternative to relational databases; rather, a complement for very niche use-cases. I see NoSQL stuff get used for “performance and scalability” in places where a columnar relational DB would scale just fine and bring the benefits of a relational database. I'm all for HBase for non-relational use-cases, but those are rather more rare than people seem to think (and end up with an ad hoc relational model because of it).

For columnar and massively-parallel workloads, Greenplum was opensourced this year[0]. It was originally forked from PostgreSQL 8.2.

There's also Apache HAWQ (incubating), which takes Greenplum's SQL parser and distributed query planner to use as front-ends for Hadoop[1].

Disclaimer: I work for Pivotal, which opensourced these systems.

[0] http://greenplum.org/

[1] http://hawq.incubator.apache.org/

Whoa, that looks quite slick. Thanks for your work!

All credit belongs to my peers in the Big Data division, and the engineering directors who successfully argued to opensource our entire portfolio of data tech (Greenplum, HAWQ, Gemfire and MADLib).

who said they were an alternative to relational databases? The problem with the post was not that the 4 mentioned were bad, only that they were the "only four worth using" which is an eye-rollingly ridiculous statement. Anybody who asserts that relational databases are the only serious databases is plain wrong, or has an ulterior motive. And this is not for "web scale bullshit" by the way. It is about large amounts of financial timeseries ingest for machine learning optimizations that I can promise you, has nothing to do with the web.

Suggested edit: s/This is total bullshit/Perhaps you overlooked something:/.

Otherwise, this would have been a good comment, I think.

This is one thing I like about HN's system: it lets me edit posts for a limited period. I'll sometimes bang out a post like the above, with a rather negative opening line like that and hit "submit". Then a minute later I'll realize I need to tone that down some, so I'll go edit it, as you suggest here. Unfortunately some shitty web forums don't allow you to edit posts at all, but I do appreciate ones which do allow editing stuff like this before anyone has a chance to read it.

HN also lets you set a delay before a comment appears, if you're like me and tend to catch mistakes more easily on the page than in the composition box. It's called "delay" on your settings page.

I think there's something to be said for the Linus style

Yeah, but generally only when you're actually correct. If you're not, you look like an idiot. Apparently the parent hasn't heard of column-store relational databases (which are kinda great for time series).

(I'm most familiar with SQL Server; DB2 and Oracle apparently have similar functionality.)

Apparently their 3-5TB time series table chokes ‘dinosaur’ relational databases. For a time series table you should almost certainly be using a column-store index with CREATE CLUSTERED COLUMNSTORE INDEX. Depending on how much they're querying vs inserting, that alone could prevent SQL Server from choking. If it's more insertion-heavy, but they still need to run a lot of ad-hoc queries, SQL Server 2016 supports using a nonclustered column-store index alongside a row-store. You can insert with the row-store and query with the column-store. If it's still choking with that, it might be time to check out the database structure and see what's up. One of the more common killers is putting a lot of extra data in the time series table. Generally it's more efficient to have a clustered column-store index table that contains the timestamp and ids to metadata, and keep the metadata itself in row-store tables with appropriate indexes.
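A hedged T-SQL sketch of both setups (table, column, and index names are invented for illustration; the inline columnstore syntax is per SQL Server 2016):

```sql
-- Scan-heavy analytics: the whole table stored column-wise
CREATE TABLE readings (
    ts        datetime2 NOT NULL,
    sensor_id int       NOT NULL,
    value     float     NOT NULL
);
CREATE CLUSTERED COLUMNSTORE INDEX ccs_readings ON readings;

-- Insert-heavy with ad-hoc queries (SQL Server 2016+): row-store table
-- plus an updatable nonclustered columnstore index for analytics
CREATE TABLE readings_hot (
    ts        datetime2 NOT NULL,
    sensor_id int       NOT NULL,
    value     float     NOT NULL,
    INDEX ncs_readings_hot NONCLUSTERED COLUMNSTORE (ts, sensor_id, value)
);
```

Inserts hit the row-store b-tree; analytic scans read the compressed column segments, which is the best-of-both arrangement described above.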

Of course I've heard of column-store relational databases, but they're not nearly as efficient or as scalable as Cassandra, which has been built from the ground up for this. Fact is, if you don't need referential integrity, there is absolutely no need to pay the massive complexity overhead and horizontal scalability penalties that relational databases incur.

Separately, it is absolutely correct that a comment asserting that 4 relational databases are the only serious ones is completely misguided and deserves a robust rebuke.

I consider your argument misinformed.

Technology moves fast. Very fast. If a piece of tech gets old enough to be called a dinosaur, and is still run at brand-new companies, it's doing something right, even if it's not for your usecase.

And the fact is, not all of us have 3-5TB time series databases. And time series data can cause a lot of DBs to choke.

The point is, I'm glad Cassandra and HBase, or a dedicated TSDB, or whatever, work for you. Your use case is not the same as the rest of ours, and assuming that it is leads you to some incorrect conclusions.

Clearly the relational databases are very good and very useful in more cases than time series. However timeseries is huge, and growing - it's not a niche case anymore. Anyhow my beef was with the assertion that the 4 relational databases listed were the "only serious ones". This is completely untrue.

You may have missed the line "In that case" :-)

> MongoDb which claims to be great at everything

lol right, with MongoDb, Map Reduce is a joke, GridFS is slow and barely usable, the storage is extremely inefficient, the "query engine" slow, and don't get me started on their "full text search" engine. MongoDb is a successful marketing stunt in the "Nodejs era".

Also, wasn't there a ridiculous issue that they had where the db can't be bigger than 4gb on a 32 bit file system because that's the largest size a file can have...?

Yes, apparently the limit on 32-bit is actually 2GB. MongoDB has always stated up front that 32-bit architectures are not recommended for production use for precisely this reason: http://blog.mongodb.org/post/137788967/32-bit-limitations

This has also been stated on their download page for the 32-bit binaries.

It hasn't always been stated there, and it was a warning on a README somewhere. The server would run happily on 32 bits, and never crash or produce any user-visible errors, it would just silently corrupt data while pretending the insertion went great.

And that's the real issue, I would think. If it shut itself down or went into read-only mode and generated errors it wouldn't be so bad. The limitations of 32-bit architectures are why we have 64-bit architectures. I think it's fair as a developer to require x64 instead of jumping through hoops to support both architectures. But if you're going to release 32-bit binaries for something like development or testing, you should be explicit about the use cases and limitations, and avoid failing silently when they're exceeded.

This stems from their choice of "mmap and done" as their entire IO strategy, for earlier versions. The data loss is silly though and shows their attitude.

Oracle 9 on 32-bit can handle 32 GiB databases without any issue, and bigger ones if you set it up to use more than a single file per tablespace. And Oracle 9 is old.

Note: I hate Oracle DB, but I must work with Oracle 9/10/11/2 DBs because that's what our clients have.

> When running a 32-bit build of MongoDB, the total storage size for the server, including data and indexes, is 2 gigabytes. For this reason, do not deploy MongoDB to production on 32-bit machines.

From https://docs.mongodb.com/manual/installation/#faq-32-bit-lim...

There was also the issue where errors were either logged or not, based on the result of Math.random(): http://stackoverflow.com/q/16833100/1233508

While I don't care about mongoDB, who's running 32bit anymore? And for a production DB?

DBs exist on more than servers, e.g. in an embedded environment... You find sqlite in the oddest places... One could imagine finding mongodb there too, with bonus data corruption...

For example, the 32 bit mongoDB instance that's managing a grand total of 2 wifi access points here.

Spain public administrations.

Umm... Wow, mongo. Boy, am I glad I didn't decide to pick you.

Let's get some statistics. MySQL and Postgres both have an undefined max DB size, and max table sizes upwards of 16 and 32 TB, respectively.

SQLite has no max table size I could find, but has a max DB size of 140 TB. 140 TB! And this is from a database that recommends not using it if your data grows too large.

> Nice and clean post showing that acknowledging a weakness isn't a terrible choice.

Reading between the lines: "let's ignore the obviously suboptimal choice of architecture and concentrate on the DB specific issues addressed".

OTOH, what would be a nice way of doing CQRS completely within postgres?

CQRS doesn't imply event sourcing. CQRS works fine with the standard RDBMS workflow.
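
To make that concrete, here's a toy sketch of CQRS over a single relational store, using Python with SQLite (the orders table and function names are made up for illustration): commands go through one write path, and queries go through a separate read model, here just a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
-- Read side: a query model kept separate from the write model.
CREATE VIEW order_totals AS
  SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;
""")

# Command side: all writes go through this one path.
def place_order(customer, amount):
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                 (customer, amount))

# Query side: reads only ever touch the read model.
def customer_total(customer):
    row = conn.execute("SELECT total FROM order_totals WHERE customer = ?",
                       (customer,)).fetchone()
    return row[0] if row else 0

place_order("alice", 10.0)
place_order("alice", 5.0)
print(customer_total("alice"))  # 15.0
```

No event log anywhere: the segregation is purely in the code paths, which is the point being made above. In Postgres you'd get the same shape with plain tables and views (or materialized views refreshed on a schedule).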

Ah, very nice!

I think if this was a giant corporation or someone like Mark Zuckerberg, there would be some nice PR post refuting Uber's claims.

> "In some ways, people worry about the bugs they have seen, not the bugs they haven't seen." [0]

This is a key take-away for me. I used mySQL extensively and switched to postgres for everything years back. I would need extremely good reasons to use mySQL again. What you get with postgres is a lot more consistency and peace-of-mind, in my opinion.

[0] https://www.postgresql.org/message-id/20160727160408.GA23585...

Does anyone else think the scenario in the explanation is an unreasonable request to make of a relational database? I think that if you've created a design that requires you to update a 50K row table 500 times a second that itself is heavily indexed and used heavily in joins, you have a software design problem more than a database problem. I wouldn't expect any database to handle that and am surprised that mysql does. One has to ask: for how long will it work? Surely the clock is running out on such a design.

You're not wrong. But mostly we haven't collectively agreed that relational databases aren't great for highly-indexed rapid-update join tables.

I think we will at some point. That's the primary original use case for a lot of NoSQL, and the reason Twitter had so much trouble with relational databases.

But these are cultural understandings, and those move slowly. Also, we're poorly (collectively) equipped to handle subtlety in these discussions, so mostly we're trying to move from "relational databases are perfect for all use cases" to "NoSQL databases are perfect for all use cases" -- which is even less true, not more true.

Culturally, this is a hard thing to keep in our collective brain.

That was an elegantly nuanced answer with just the right amount of context. Thank you.

How has NoSQL addressed join tables? All the approaches I've seen are much, much slower than a traditionally vertically scaled relational database, or they move away from joining altogether. It's traditionally been the twin pressures of scale and replication that force people to move to a distributed database.

Non-relational DBs (the Not Only SQL sort) usually pass the buck up to the application layer for several things. It is in a way a good thing to do if the DB wants to focus on scale-out issues such as consistency of replicas or high availability. I personally find it pretty easy to work with alternate data models, thus not relying on the possibility of a join or a transaction, as long as I am not writing a banking application. Without a relational data model, the biggest problem becomes verification IMHO; verification of the DB, the application logic, and whether the application logic makes the right assumptions about the DB it is using.

Cassandra, for instance, makes it possible to query your join-type tables in relatively manual chunks using slice queries. You have to do a lot more of it in the application so you wind up with weird partial failures...

But at scale, weird partial failures, results as they are found and eventual consistency are usually preferable to a really long DB query that never returns.

Above a certain scale, big joins are simply not usable. NoSQL's manage-it-yourself approach is good for getting partial results where full results are too expensive.

You can think of it as applying a heuristic approach where an exact approach is too expensive, if it makes you feel less like NoSQL sullies the purity of databases :-)

NoSQL doesn't do table joins.

Well, F1 does. I'm fairly sure FoundationDB did too. It isn't incompatible--I'd argue joins are natural for certain tasks, and certainly reduce developer load, especially if the latency isn't a driving priority.

Yes, that's SQL, but the term nosql was always orthogonal to the priorities of the movement (horizontal scalability).

Absolutely - going from Postgres to MySQL is only trading one set of problems for another. It's a longer runway, but not infinite.

Their exact use-case is what things like Cassandra were built for - insanely high writes / updates. They're also built to split your load across N systems, as long as you're still using a monolithic database (even with read replicas) you physically can't get the same performance that you could with one of the NOSQL distributed systems. YMMV for specific performance, I've seen a huge MySQL burn through queries like butter and a Cassandra cluster crawl on a tiny data set.

The second part of your question is the real crux of the problem - "heavily indexed and used heavily in joins". Neither RDBMS nor NOSQL works well in that scenario. This article indicates why PG isn't great; the Uber article indicates some of the (minor) downsides of MySQL. They're already using Schemaless to bend their RDBMS into a fancy key-value store, so they're halfway to using a real one with their home-built indexes already. For NOSQL you generally don't get joins, unless you write them yourself. You also end up with manual "write amplification" since you denormalize: write the data 10 times to index it 10 different ways. You can be smart about it so it's not exactly 10x, but you'll end up with more than your original problematic throughput, albeit spread across more systems.

An Uber engineer at a conference said that none of the open-source NoSQL systems could handle their load, and that they had to heavily hack one of them (which I think was Cassandra, but the memory is vague) to get the last bit of performance out of it while they were building Schemaless.

They seem to be running Cassandra still, at least as of last month - "Running Cassandra on Apache Mesos Across Multiple Datacenters at Uber" https://www.youtube.com/watch?v=U2jFLx8NNro

I believe that was the subtext that the writer was trying to convey. Despite it being a bad pattern, it's one that their users still encounter and people are asking a solution for.

> I think that if you've created a design that requires you to update a 50K row table 500 times a second that itself is heavily indexed and used heavily in joins, you have a software design problem more than a database problem.

This would almost certainly be true if the design were widespread (which it's not as far as I know), but it isn't necessarily true for all cases.

I think it would be better framed as an optimization problem. If you design for a domain that actually models 500 events per second in a dataset of 50K items, the simplest correct implementation will implement exactly that. If that domain also involves reads which benefit from joins and indices that make those writes prohibitively slow, you have a conflicting set of optimization paths. The fact that some tools don't accommodate that well is an implementation detail, and addressing that fact is optimization, not necessarily a primary design consideration.

Yeah it feels like the kind of thing where regardless of the database system they use they're going to eventually hit some performance issues, but undoing that DB mess is likely no easy task and they're betting that they can keep kicking the can there.

Agreed, I got stuck on that point too. Leaning on UPDATE like that is a (schema, app) design problem, not a database problem.

500 updates is nothing for a modern database. 10k's of updates is no sweat on modern hardware. What is so scary about this load?

Multiply the number of updates times the number of indexes. Then consider that with such a small table, there's going to be contention for the same rows. Updates actually create new rows, so vacuum needs to be able to remove older, no-longer-visible versions of the row. But vacuum can't keep up, because at any given time many versions of the row are potentially visible.

This only talks about a particular write amplification issue. A far better message IMO is just a few clicks upthread: https://www.postgresql.org/message-id/579795DF.10502%40comma...

Provides a link to the rationale post from uber (https://eng.uber.com/mysql-migration/), and a tl;dr of it. Prior discussion here: https://news.ycombinator.com/item?id=12166585

It is nice to see that while the thread does question some of Uber's motivation for the change, where there is a genuine problem with their product there is a frank admission (with quotes like "this is a common problem case we don't have an answer for yet" and "limitations of our current replication system are real, or we wouldn't have so many people working on alternatives") which people discuss seriously (asking for clarification and/or suggesting ways forward) rather than reacting with a knee-jerk defensive posture.

Yes, and an even better message is the reply to it, pointing out how almost all of their issues aren't really issues: https://www.postgresql.org/message-id/20160727002711.GI4028%...

A huge respect for PostgreSQL team on (always) being so transparent about their shortcomings and giving credits where its due (InnoDB). Posts like these makes me more confident on PostgreSQL as a product.

Also I'd like to take this moment to address those people who will start trotting out this "Uber switched to MySQL from PostgreSQL" argument without considering that not every app is "Uber". You can't simply take their use case and start throwing stuff against PostgreSQL.

The attitude with which the article is discussed in the mailing list is admirable.

Couldn't help contrasting it to some popular programming language mailing lists, especially a CSP inspired language by a major search company.

Couldn't help contrasting it to the comments on this very site.

In contrast, there is the community of a "useless language" (to quote one of its leading lights) that has never adopted [the] acceptance of paternalism as a requirement of esprit de corps.

imho some of the glaring shortcomings of the technology you refer to can be traced back to the very issue that you are highlighting.

In reading that, my first thought was "why in the world would Uber do that?"

In every single performance-tuning-at-scale story that I've read over the past decade, the very first order of business is: remove joins from high-traffic queries. It seems like Uber has gone in the complete opposite direction.

It's unrealistic to remove all joins (usually), and most joins aren't very expensive. In normal cases a join is strictly faster than running an extra query to get that extra data.

Joins start getting particularly tricky when you do expensive stuff like filter over the correlated results of many tables. But in those cases, there's also no easy alternative: you'll need to redesign the way you store your data, if possible.

You can shape yourself to fit a problem or you can shape a problem to fit you. Wisdom is knowing when to do which.

Sometimes this is very hard to do. This is in effect Twitter's core use case as well, with the list of accounts you follow.

And it's a hard problem. Twitter spawned a huge wave of NoSQL to deal with this, and it's still not really dealt with.

But that is largely incorrect advice, based on the fact that MySQL never implemented any reasonable join algorithms, just nested loops. It didn't apply to real databases.

The original Uber blog post that led to the discussion - https://eng.uber.com/mysql-migration/

I think it's very cool that Postgres didn't just post some long thing about Uber saying "they're using it wrong omg!!!" and instead address the problems, understand their decision, and move on. I think that's very cool, there's clearly a good team working on this DB.

Actually, after watching their video, I think that even MySQL wouldn't have helped them. Ignoring a disk-space-full condition is really not a good idea.

I see some comments call this a "very specific user case". This is not. Pretty much every major web project is going to have tables like that. User sessions are just one example. Sure, you can design around this, but it is a problem and no design is going to make it completely go away.

Storing sessions in an RDBMS is a rather poor choice, especially at scale.

On the other hand, sessions are important and if you're only using redis/etc for caching and not as a primary data store, it may not be a very good idea to treat it as a primary store of session data.

Sessions are vital for any product that works better with logged-in users.

User sessions are not an example of that; why would your user session table have lots of indexes? And sessions don't belong in a database to begin with. I've never made a table like this in 15 years, and can't even think of a reason I would want to.

I agree it's a nice thing. But some kind of answer may also be good. Like restructuring your data to become faster.

I also wonder what happened over the last few (10) years. When I was in university, I'm pretty sure I learned that JOIN was Satan's mother and that if you have a big DB you need to avoid JOINs as much as possible. That doesn't seem to be a big deal today anymore.

That kind of thinking is probably what spawned the whole "do the join in the app, not the database" anti-pattern. The truth is, the database is going to be much faster at performing a join than loading the contents of two tables into your app and iterating. If you need the data that results from doing a join, doing a join is the best way to get it.

Unless you already have your entire database in-memory in your app, that is. In that case, why do you have a database?

What you are missing here is "denormalization" -- e.g. many-to-many relationships. You can either use a JOIN with a table on a "normalized" database, or keep managing the result of the join in application code. Loading the entire tables into application code very seldom has anything to do with it...

A more realistic example is, do you get Alice's pets by doing a JOIN on tables Person, Pet, PetOwnedByPerson ("SQL") -- or by having an array column "pets" in Person? ("NoSQL")
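
Here's that exact example spelled out in Python with SQLite (schema and data are made up to match the comment above): the normalized three-table JOIN versus the denormalized array-on-the-row approach.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Pet (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE PetOwnedByPerson (person_id INTEGER, pet_id INTEGER);
INSERT INTO Person VALUES (1, 'Alice');
INSERT INTO Pet VALUES (1, 'Rex'), (2, 'Whiskers');
INSERT INTO PetOwnedByPerson VALUES (1, 1), (1, 2);
""")

# The "SQL" way: one JOIN answers the question inside the database.
pets = [row[0] for row in conn.execute("""
    SELECT Pet.name
    FROM Person
    JOIN PetOwnedByPerson ON PetOwnedByPerson.person_id = Person.id
    JOIN Pet ON Pet.id = PetOwnedByPerson.pet_id
    WHERE Person.name = 'Alice'
    ORDER BY Pet.id
""")]
print(pets)  # ['Rex', 'Whiskers']

# The "NoSQL" way: denormalize the join result onto the Person row itself
# (here as a JSON text column) and make the application keep it in sync
# on every pet change -- that's the managed-by-hand trade-off.
conn.execute("ALTER TABLE Person ADD COLUMN pets TEXT")
conn.execute("UPDATE Person SET pets = ? WHERE id = 1", (json.dumps(pets),))
```

Reading Alice's pets is now a single-row fetch with no join, at the cost of application-maintained write amplification whenever the underlying data changes.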

I don't think materialized views are an answer to this problem, precisely because postgres materialized views do not automatically update when the underlying data has changed.

Sure, you have to write the code to do updates; but "materialized views" are a technique long used even in SQL databases that had no automated support for them; denormalized derived tables that are used for regular, recurring queries while the DB retains normalized base tables are a common thing (heck, in Enterprise environments, I've more than once run into a setup where custom maintenance of jury-rigged materialized views are used in a database system with strong materialized view support simply because the system has been around and maintained longer than the server software has had that support.)
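
A minimal sketch of that jury-rigged pattern, in Python with SQLite (table names are invented): the base table stays normalized, and a plain derived table plays the role of the materialized view, rebuilt by a maintenance job you write yourself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
-- The hand-maintained "materialized view": just a derived table.
CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL);
INSERT INTO sales (region, amount) VALUES ('east', 10), ('east', 5), ('west', 7);
""")

def refresh_sales_by_region():
    # The custom maintenance job: rebuild the denormalized table from the
    # normalized base table, on whatever schedule the queries can tolerate.
    conn.executescript("""
        DELETE FROM sales_by_region;
        INSERT INTO sales_by_region
          SELECT region, SUM(amount) FROM sales GROUP BY region;
    """)

refresh_sales_by_region()
print(dict(conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region")))
# {'east': 15.0, 'west': 7.0}
```

The staleness window is exactly the refresh interval, which is the same trade-off Postgres's built-in REFRESH MATERIALIZED VIEW gives you.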

Yes "denormalization" was the word for that. Is that still a thing?

Can verify, still a thing. There is even a fun saying that goes with it.

"Normalize until it hurts, denormalize until it works."

Yes, especially in business intelligence / data warehousing. Here is an excellent resource if you're interested in this sort of thing: https://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimen...

It's a term used frequently in NoSQL land, to explain a key difference to people coming from SQL. In SQL land, normalizing your data is still the canonical thing to do, and I don't recall anyone in academia officially talking about denormalizing ever having its place...but in industry, the realities of use cases and performance have meant its usage. But I don't know that it's a standard tool given out to graduating software devs and DBAs.

The link I posted above is a common SQL land usage. Dimensional modeling in a star schema is very widely used in BI projects.

There's actually two competing philosophies on data warehousing (Inmon and Kimball), but I've only ever used Kimball's method, which favors denormalization.


Of course they do, that's why there are the Normal Forms. Since like the 1970s people have been talking about this stuff. It's actual maths.

Yes, and almost always used as pre-emptive optimization that didn't need to be done. Often times even making things slower because the person doing it didn't understand what they were doing, they just heard "denormalizing makes it fast".

It really depends. Sometimes it's possible to slurp relatively small subsets in. Sometimes you've already got the data loaded in memory for other reasons. Sometimes (especially in long-running ETL stuff) you can come up with a way better query plan than a query-by-query approach could achieve.

But yeah, usually, don't be clever and don't spaff the contents of the database across a network just to do a join.

I think the "do the join in the app" anti-pattern was developed by this group of people who think that in app the programming environment they know and that it's working in their dev environment and that is all that needs to be considered to make a decision.

I would generalize what you say even further: The database is faster at most of the data crunching you need.

I think he's referring to denormalizing.

I have worked a lot with systems like Teradata, and I am trying to explain to the want-NoSQL-on-my-CV-crowd that large/complex JOINs are _really_ not a problem. Also trying to explain that relations in RDBMS' have nothing to do with relationships, and that normalization is just a tool that provides certain guarantees on the data you store is simply ignored :)

How well will InnoDB scale for this particular use case? It seems that the architecture is a bit too reliant on the complex performance characteristics of one subsystem.

It's weird seeing a post-mortem for losing a user (I really want to say customer) from a piece of FOSS. It's also weird (still!) to consider Uber a tech company, rather than a company that happens to use tech.

Just curious, what's your definition of "Tech Company"? All services provided by Uber are purely technical. Drivers and Riders are customers of Uber's technology. The full name of the company is "Uber Technologies Inc."

I wasn't sure what to think of that comment either when I first read it, but I sort of see where he is coming from. One side you have companies like Oracle, Microsoft, and IBM types that actually develop new forms of technology and sell the technology to people. Then there are companies that leverage technology in other industries to "disrupt" like OpenTable, Uber, and AirBnB. Then there are companies like Google and Facebook that straddle that line. They've developed and contributed back huge advances in technology, they mostly make their money from another industry, but also sell some tech in certain areas (mostly Google).

Interesting perspective, but I don't think selling technology directly should be the requirement. I think a company is a "Tech Company" when the core product is technology produced by the company. I consider Facebook and Twitter to be tech companies, even though they don't sell technology directly. AirBnB and Uber have some decent open source contributions and have the ability to contribute back huge advances but probably won't focus as much on that until they're profitable.

According to UK's Companies House search, Uber's nature of business (SIC) is 74990 - Non-trading company. This is quite vague, but it doesn't mention technology.

Interestingly, they were also called UBER TECHNOLOGIES LTD and UBERTECHNOLOGY LIMITED at some point in time, but these companies are now dissolved.

That's because Uber UK is a pure support office, at least that's what their lawyers and accountants argue.

The real company is in Luxembourg.

If that were true the nature of their company would be 'tax dodge'

Well if that's what their name says it must be true!

They don't "happen to use tech", their whole business is based on tech, and not in the way a bank would use tech say, but especially on tech they create (the client, the reservations system, extra services, etc).

It's like saying Google is just an ad company that "happens to use tech".

Why isn't the tech that a bank creates in house categorized similarly to the tech Uber creates?

Because, unless it's a really high-tech bank whose business is based on that, the tech a bank creates in house is irrelevant to the public, and they could even do business without it, or even without computers, albeit much less effectively.

Whereas Uber is all about their client app, GPS tracking, etc. Without that they wouldn't be Uber but a large taxi company.

Without banking technology, a bank can still be a bank and provide its distinctively unique services, just slower. It worked for the Templars, and they didn't even have double-entry bookkeeping.

Without Uber technology, Uber would not exist - it would just be Uber Taxi Inc., a perfectly ordinary taxi company indistinguishable from all others.

Well, not really. Their product is really a two-sided market with prices controlled by themselves - matching drivers, unregulated and not employed by Uber, to passengers. They could do that with nothing more than a call center and a spreadsheet of roughly where drivers are if the tech didn't exist, and it'd still be cheaper and avoid the monopoly of the taxi companies.

Whether it'd largely avoid the ire of law enforcement if they couldn't hide behind their tech is a different matter.

Wouldn't that more or less just make them into a taxi company? That approach just wasn't working well.

The main difference between Uber and taxi companies is that Uber fully utilized modern technology from the ground up to build the platform, and as a result they were able to realize major improvements in terms of speed, reliability, and cost-effectiveness.

I think Uber is, in practice, an unregulated taxi company, allowing it to cut costs and pay its drivers less, and it would be just as popular if it had no tech at all.

Well, the main difference I'd expect from Uber vs a bank is that the majority of Uber employees are (software) engineers while the majority of bank employees are economists.

The majority of Uber employees... are drivers. Clearly proportion of software engineers cannot be a good metric for what makes a "tech" company.

Wouldn't you call drivers (one half of) the customers of Uber? Uber don't hire the drivers (as far as I'm aware), the drivers use Uber to find fares.

Drivers are independent contractors, not employees.

That's the company line, sure.

The IRS sees it that way too.. Just because you consider them employees doesn't make them employees.

Only in the USA. And this only after paying $100M to settle a lawsuit (http://www.latimes.com/business/technology/la-fi-tn-uber-law...) - they literally paid for drivers not being considered employees in LA. Until the next case.

In the EU, they're considered what they are: a nice gimmick to shift company costs onto workers, and to tax workers' salaries at 20%, while keeping fixed costs (on the company side): https://www.jacobinmag.com/2016/07/uber-drivers-app-rideshar...

As a casual observer, their whole business actually seems based on lobbying and lawyers.

Here we go down the etymology hole, where original intent is forgotten, and argumentative minutiae reign supreme.

Also known as the "precision" hole, and getting meanings right hole, with profound impact on what we think and do.

I am guessing that VACUUM only kicks in for row deletions? With that busy a table, can't you just bypass VACUUM and reuse dead keys? I am assuming because you are talking indexes here, you are talking fixed length records and not dynamically allocated VARCHARs etc.??

Postgres uses MVCC, which is basically copy-on-write: every UPDATE to a row actually creates a new row. Once all transactions that could see the old row have finished, the old row can be pruned.

Ah, thanks for the clarification - the bloat issue makes more sense now.

It would be more helpful if this linked to the entire thread:


Here is the original message in the thread, and it gives some additional reasons as well.


Are there any recommended resources to extensively learn about databases?

great explanation by the Postgres crew.

it makes me wonder, though: if I had a table with 50k rows, updated hundreds of times per second and used in joins throughout the database, is there any way I can just stick that whole table into memcached or redis? I know there are some cases where this works, some where it doesn't. curious if this option was explored.

Completely agree. Highly mutable => in memory (then add a secondary stream to colder storage for backups, just in case).

A relational DB is originally meant for fetching data from slow storage into fast storage, and the other way around, in a smart way. Doing that 500 times a second isn't a scenario for any DB.

Build your own in-memory data + process structure, maybe using something like an agent network, with e.g. Akka or Erlang for failover.

I'm curious as to why uber had to rely on a relational db for this case...

It's often simpler to keep data in a relational DB for reporting, especially if you've got data lake tooling -- it's just easier to get wider insights if your data is stored relationally. I know it's possible to do this without relational DB, but you get it for free with SQL.

As you say though, put it in memory first and write it out to a DB every now and then.

Just to keep it in memory? Postgres does this already-- it will put tables or parts of tables in memory to speed up reads [1].

[1] Source: https://momjian.us/main/writings/pgsql/hw_performance/

It's not about "just" keeping it in memory, it's about handling hundreds of updates per second without creating un-vacuumable bloat (per description in OP's link).

Is this not what Redis was designed for? In this situation, I would value Redis over memcached just based on its performance values and the cheapness of RAM in the cloud right now.

I'd really like to know why they DIDN'T consider something like Redis for this - if I'm thinking of my own apps where I just take redis==sessions for granted, why would they make a technology change of this magnitude in order to cover a usecase like that? Maybe there was a lot more?

If only politicians gave answers like that, the country would be a better place. Kudos for the answer. I use both MySQL and PostgreSQL depending on my use case.

The whole discussion (this and [1] and [2]) is very interesting. I find that Uber is holding on to the ACID guarantees of a relational DB so tightly that it is clearly hurting them. IMHO it would do them a big service to break down the responsibility of such a DB into multiple distributed systems that can work on a global scale. For example, a distributed lock system can help them when they need transactions. If they keep moving from one relational DB to another, they are bound to hit problems of Availability because they are choosing very strong Consistency, and the CAP theorem says we can't have it all (Partition tolerance is required).

[1] https://news.ycombinator.com/item?id=12166585 [2] https://news.ycombinator.com/item?id=12179222

Tom Lane's response:

this seems like an annoyance, not a time-for-a-new-database kind of problem.


Question about Postgres architecture: Why were secondary indexes designed to refer to ctids instead of primary keys?

In Postgres, there's no concept of a secondary index. An index is an index. Some are unique indexes, which force each value to be unique; some are not.

A primary key is essentially syntactic sugar for a NOT NULL constraint plus a unique index.
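You can see the equivalence with any SQL engine; here's a sketch using SQLite (stdlib, so it runs anywhere -- SQLite's `INTEGER PRIMARY KEY` has extra rowid semantics that Postgres lacks, but the constraint behavior is the same): a column declared `NOT NULL UNIQUE` rejects duplicates exactly like a primary key would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The two definitions enforce the same constraints: non-null, no duplicates.
conn.execute("CREATE TABLE with_pk (id INTEGER NOT NULL PRIMARY KEY)")
conn.execute("CREATE TABLE with_unique (id INTEGER NOT NULL UNIQUE)")

conn.execute("INSERT INTO with_unique VALUES (1)")
try:
    conn.execute("INSERT INTO with_unique VALUES (1)")  # duplicate
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```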

Among other things, there's a substantial performance penalty for secondary index lookups with clustered indices (since they need to traverse two index structures).

I'm not sure about other databases, but in MS SQL you can include columns in the index leaves. This avoids the second lookup and can be even faster than the Postgres approach.
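The same "covering index" effect can be demonstrated with SQLite from the Python stdlib (SQLite has no `INCLUDE` clause like MS SQL, but a composite index covering every referenced column achieves the same thing): the query plan shows the lookup is satisfied entirely from the index, with no second trip to the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
# Composite index: 'email' is the search key, 'name' rides along in the leaves,
# so SELECT name ... WHERE email = ? never has to touch the table itself.
conn.execute("CREATE INDEX idx_email_name ON users (email, name)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE email = ?", ("a@example.com",)
).fetchall()
plan_text = " ".join(row[3] for row in plan)  # the 'detail' column of the plan
```

On current SQLite versions `plan_text` reads something like `SEARCH users USING COVERING INDEX idx_email_name (email=?)` -- the words "COVERING INDEX" are the tell.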

Keep in mind that the complaint Uber had was with writes in Postgres affecting secondary indices that didn't cover a particular column. As far as I know, in all databases, if a covered column is modified the secondary index must be, too. So that does not really resolve Uber's problem.
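The write amplification at issue can be shown with a toy model -- emphatically not Postgres internals, and it deliberately ignores Postgres's HOT-update optimization (which skips index maintenance when no indexed column changes and the page has free space; Uber's write pattern often defeated it). Because every index entry points at a physical tuple location (the ctid), a non-HOT update of even an unindexed column appends a new tuple version and a new entry in every index:

```python
class ToyTable:
    """Toy MVCC heap: indexes store (value, ctid) pairs, ctid = heap position."""

    def __init__(self, indexed_columns):
        self.heap = []  # list of row versions; position in list plays the ctid role
        self.indexes = {c: [] for c in indexed_columns}

    def insert(self, row):
        ctid = len(self.heap)
        self.heap.append(row)
        for col, idx in self.indexes.items():
            idx.append((row[col], ctid))  # every index learns the new ctid
        return ctid

    def update(self, ctid, changes):
        # MVCC: never overwrite in place; append a new version at a new ctid.
        # Since the ctid changed, ALL indexes need new entries -- even for
        # columns whose values did not change.
        new_row = {**self.heap[ctid], **changes}
        return self.insert(new_row)

t = ToyTable(indexed_columns=["user_id", "email"])
t.insert({"user_id": 1, "email": "a@example.com", "last_seen": 0})
for ts in range(1, 101):
    t.update(ctid=ts - 1, changes={"last_seen": ts})  # only the UNINDEXED column changes

heap_versions = len(t.heap)  # 1 insert + 100 update versions
index_entries = len(t.indexes["user_id"]) + len(t.indexes["email"])
```

One hundred updates to a column no index covers still produced a hundred new entries per index -- the amplification (and the resulting vacuum pressure) Uber's post complains about.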

Additionally, I can't quite recall whether this is the case in MS SQL (since it uses pessimistic locking by default, not snapshot isolation, which has different performance characteristics), but in most MVCC architectures there's the additional problem that the underlying row value could have been changed concurrently with your query (e.g., deleted or modified). While this might not seem so bad for simple row-level lookups, it gets much more problematic if you are doing something like a range query, where the index might contain rows that weren't in the database at the beginning of your snapshot. There are a variety of ways of dealing with that problem, but most of them involve increased write traffic (index sizes get even more bloated because now they need versioning information), increased read traffic (index reads that touch out-of-date rows may have to follow an undo pointer, requiring the extra seeks you were trying to avoid in the first place), more locking (for instance, you could lock the entire index when a transaction modifies it to keep it consistent--but that would decrease concurrency--or you could try range locking--which can lead to an increased probability of deadlocks and also decreases concurrency and increases contention on the lock manager), opportunistic optimizations that only work on mostly-immutable data (Postgres's and HyPer's solution is to maintain a much smaller visibility map with bits indicating whether it's safe to assume the index is unchanged), or giving up multi-key read consistency (I'm assuming if this were an option you would be using another storage engine, because lots of them can perform unbelievably well if that restriction is relaxed!).

My point being, there's no such thing as a free lunch. Personally, I'm usually extremely happy to give up on pessimistic locking for the concurrency benefits of MVCC, and index utility decreases sharply as it gets larger, but as with many other things it entirely depends on your workload. Two-phase locking actually works far better than MVCC of any sort under heavy contention with short transactions (what Uber is apparently doing), so they really probably should have investigated SQL Server or another database optimized for pessimistic concurrency control.
