Hacker News new | past | comments | ask | show | jobs | submit login
Bye Bye Mongo, Hello Postgres (theguardian.com)
1562 points by philliphaydon on Dec 19, 2018 | hide | past | favorite | 417 comments

Specific technology choices aside, this was an incredible write-up of their migration process: thorough, organized, readable prose about a technical topic. It is helpful to read how other teams handle these types of processes in real production systems. Perhaps most refreshing is the description of choices made for the various infrastructure pieces, because it is reasonable and real-world. Blog posts so often describe building a system from scratch where all latest-and-greatest software can be used at each layer. However, here is a more realistic mix of, on the one hand, swapping out DBs for an entirely new (and better) one, but on the other hand finding new tools within their existing primary language to extend the API and proxy.

Great read. Well done.

This was what was particularly interesting to me - that they went to the effort of writing a purely technical article on the particulars of how parts of their environment operate, and to publish it on their platform even when that's not the sort of content they're known for.

It's not their main platform though:

> Digital Blog

> A blog by the Guardian's internal Digital team. We build the Guardian website, mobile apps, Editorial tools, revenue products, support our infrastructure and manage all things tech around the Guardian

Hi! Thanks for your comments. I'm one of the authors of this post. It is the same platform at the moment (just not tagged with editorial tags so it stays away from the fronts), though sometimes the team that approves non-editorial posts to the site can be concerned about us writing about outages and things as it might carry a 'reputational risk', so we may end up migrating to a different platform in the future so we can publish more quickly, we'll see!

In our era of deplatforming a publisher publishing on something like Medium seems antithetical, even for a dev team that just wants to get words out. Should you spend some cycles on the dev blog? Probably, but you should also split test the donation copy and get the data warehouse moved forward for 2019 initiatives and fix 1200 other issues. Thanks for sharing a great post. I shared it with my team and we all learned a lot.

Thanks to hacker news and reddit this piece got over 100,000 page views which should be enough to justify the blog staying on platform!

Does SecureDrop run on your AWS infrastructure?

No, I think it's on our own infrastructure.

Thanks for the info, here and below.

It’s probably better if they didn’t answer this question...

It's a bit disturbing to me that they seem to be using AWS for confidential editorial work.

> Due to editorial requirements, we needed to run the database cluster and OpsManager on our own infrastructure in AWS rather than using Mongo’s managed database offering.

In a happy world the guardian wouldn't rely on a company we spend a lot of time reporting on for unethical practices (tax avoidance, worker exploitation etc.) - but we decided it was the only way to compete. One of the big drivers was a denial of service attack on our datacentre in 2014 on boxing day - not an experience any of us want to have to deal with again.

>Since all our other services are running in AWS, the obvious choice was DynamoDB – Amazon’s NoSQL database offering. Unfortunately at the time Dynamo didn’t support encryption at rest. After waiting around nine months for this feature to be added, we ended up giving up and looking for something else, ultimately choosing to use Postgres on AWS RDS.

Anyone who gets control of the live server can still read a database, even if it encrypts its storage.

Exactly. As I read the original article, which mentions "encryption-at-rest", there was a voice in my head crying: "No, what they need is E2EE". That would enable the authors to write confidential drafts of the articles, no matter where the data is stored (and AWS would be perfectly fine of course).

Disclaimer: The voice is my head does not come out of nowhere. I am building a product which addresses this: https://github.com/wallix/datapeps-sdk-js is a API/SDK solution for E2EE. Sample app integration is available at: https://github.com/wallix/notes (you can switch between master and datapeps branches to see the changes of the E2EE integration)

In which case they could've just used a separate encryption layer with any database, including DynamoDB. The HSM security keys available from all the clouds makes this rather simple.

Yes, any db including Dynamo would have been fine.

Our software E2EE solution has advantages over HSM though: Cost obviously, and more features and extensibility.

Great idea.

Encryption at rest is still important as it closes off a few attack/loss vectors: mis-disposed hard drives, re-allocated hosts. I'm probably missing a few others.

Yeah, but it doesn't really address my concern.

Anyone who can control a server in any environment can potentially interact with the database powering applications running on that server.

How is running on AWS different than Guardian Cloud in their basement?

The level of control over who has physical access, of course.

Did reports of the Snowden revelations reside on the CMS?

Sadly we don't trust our security practices anywhere enough for that! Secret investigations happen in an air gapped room on computers with their network cards removed then get moved across to the main CMS when they're ready to publish.

Probably not, no, until they were about to be published. I imagine that the choice between "run an entire data centre ourselves, store everything there" and "use AWS, but keep high sensitivity stories on local machines" is an easy one.

After all, the client computer that connects to the CMS is just as, or more likely to be compromised. I wouldn't be surprised if the coverage (or at least parts of it) were edited on airgapped laptops.

> the choice between "run an entire data centre ourselves, store everything there"

If those were the only two choices, you might be right. But the resources needed for the actual CMS functionality sound modest enough to run independently of the main website.

> the client computer that connects to the CMS is just as, or more likely to be compromised

That's faulty reasoning.

> That's faulty reasoning.

Why? It's an obvious potential point of compromise.

Sorry, I misunderstood. I read it as saying "We're going to get hacked via this other vector, anyway, so why bother?" I see your point, now.

They're using AWS VPC (Virtual Private Cloud) which isn't open to the world (you use a VPN to bridge the VPC into your internal network) and which you can spin up dedicated instances that don't share underlying hardware with other AWS customers.

Thanks for writing the blog post. Insights like these and of such high quality are rare.

Can I ask what was the total cost of the migration?

If there was software that could do this database migration without downtime, how much would you/Guardian be willing to pay?

This is pretty much how all Guardian articles are formatted. Some of their regular pieces could be called "blog posts" - Felicity Cloake's cooking series comes to mind.

Guess it makes sense to reuse the platform that already has the templates than use another platform and reimplement the design.

Given their employers, one would hope that they get good editorial support though!

Ish. From what I can tell, the “Digital Blog” seems to be set up as just another column on the platform.

Totally agreed. This is pretty much the definitive guide on how to perform a high stakes migration where downtime is absolutely unacceptable. It's extremely tempting, particularly for startups, to simply have a big-bang migration where an old system gets replaced by something else in one shot. I have never, ever seen that approach work out well. The Guardian approach is certainly conservative but it's hard to read that article and conclude anything other than that they did the right thing at every step along the way.

Well done and congratulations to everyone on the team.

I agree, but it looked them a year if I am reading the article right.

In most early stage startups, that would be an unacceptable loss of time.

So I don't judge them for doing a one-shot migration even if it causes an hour of downtime.

It all depends on the business.

Yeah it did take a long time! Part of this though was due to people moving on/off the project a fair bit as other more pressing business needs took priority. We sort of justified the cost due to the expected cost savings from not paying for OpsManager/Mongo support (as in the RDS world support became 'free' as we were already paying for AWS support) - which took the pressure off a bit.

Another team at the guardian did a similar migration but went for a 'bit by bit' approach - so migrating a few bits of the API at a time - which worked out faster, in part because stuff was tested in production more quickly, rather than our approach with the proxy which, whilst imitating production traffic, didn't actually serve Postgres data to the users until 'the big switch' - so not really a continuous delivery migration!

The article mentions several corner cases that weren’t well covered by testing and caused issues later. What sort of test tooling did you use, Scalacheck?

Agreed! I don't think enough engineering orgs appreciate the value of a great narrative on any technical topic.

Part of my duties at work require me to deal with "large" issues. While a solution to them is usually necessary and quick and high quality, I've seen the analyses that come after them vary in quality.

Good writeups tend to stick around in people's memories and become company culture, and drive everyone to do better. Bad writeups are forgotten, and thus the lessons learned from them are forgotten as well.

This particular article stands out for me. English is not my first language, and I've spent most of my life dealing with very fundamental technical details, so most of my writeups aren't the best. I'm going to bookmark this one and come back to it to learn how to write accessible technical narratives.

Writing is their Core business after all :) I agree though, I read it like a fascinating breaking news story

The BBC's Online and R&D departments have very interesting blogs, if you like this sort of thing.



Has world class platform for delivering articles. Uses Medium.

I don't even.

I imagine this is a nice benefit of working at an organization that is primarily about writing.

I was actually a little confused by the article - it seems to go up and down in terms of technical depth. It feels like it was written by several people. The hyperlink to “a screen session” was odd as well ... ammonite hyperlink I get but... screen is a pretty ancient tool... people either know it or can find out about it. Like you link to screen but not “ELK” stack?

I like the article but it was a bit hard for me to consume with multiple voices in different parts.

screen is hard to google.

This is the first thing I noticed too. This was an excellent read.

Old versions of mongo were very bad.

We accured lots of downtime due to mongo.

But later versions were rock solid and I've matainer mongo installations at many startup and SMEs once you setup alertd for disk/memory usage, off you go. Works like charm 99% of the times.

MogoDB is proof that with the right strategy, marketing and luck, you really can fake it until you make it.

Not that that's really a surprise or was unknown, it's just fairly new to see in the open source ecosystem instead of the enterprise one.

I'd posit its more a matter of maturing a new paradigm. There's a lot more edge cases you have to cover as NoSQL became more popular for production-at-scale.

SQL has decades of production maturation, and has wider domain knowledge.

I'm sure there's some of that. But a lot of the early problems were a bit more weighted towards poor engineering in general, IIRC. For example, I seem to recall an early problem was truncating large amounts of data on crash occasionally.

That's probably true.

Unfortunately the number of my customers who would sign off on just "two nines" is approximately zero...

The Guardian example was heavily used by MongoDB as a case study to pitch their database to others in 2011:




And reupping my previous, three-part series on MongoDB:

On MongoDB

NoSQL databases were the future. MongoDB was the database for "modern" web engineers and used by countless startups. What happened?


I think you're asking the wrong question. The question should be: How did MongoDB become so successful?

IMO, the reason is that newer developers faced the choice of learning SQL or learning to use something with a Javascript API. MongoDB was the natural choice because they excelled at being accessible to devs who were already familiar with Javascript and JSON.

Not only that, their marketing/outreach efforts were also aimed at younger developers. When was the last time you saw a Postgres rep at a college tech event?

I think you'll enjoy the series then, I spent several months investigating and made the same point about JSON and the Javascript-like CLI (plus great Node support, plus savvy marketing). For example:

> 10gen's key contributions to databases — and to our industry — was their laser focus on four critical things: onboarding, usability, libraries and support. For startup teams, these were important factors in choosing MongoDB — and a key reason for its powerful word of mouth.

Startup Engineers and Our Mistakes with MongoDB


The Marketing Behind MongoDB


Halfway into part two, this is very good so far. Thank you for the effort you put in (it really does show throughout).

> IMO, the reason is that newer developers faced the choice of learning SQL or learning to use something with a Javascript API.

The thing I dislike about this type of comment – although I now notice yours doesn't explicitly say this – is the implication that devs don't like SQL because they're lazy or stupid. Well, sometimes that is probably true! But there are some tasks where you need to build the query dynamically at run time, and for those tasks MongoDB's usual query API, or especially its aggregation pipeline API, are genuinely better than stitching together fragments of SQL in the form of text strings. Injection attacks and inserting commas (but not trailing commas) come to mind as obvious difficulties. For anyone not familiar, just look at how close to being a native Python API pymongo is:

    pipeline = [
        {"$unwind": "$tags"},
        {"$group": {"_id": "$tags", "count": {"$sum": 1}}},
        {"$sort": SON([("count", -1), ("_id", -1)])}
    result_cursor = db.things.aggregate(pipeline)
Of course you could write an SQL query that does this particular job and is probably clearer. But if you need to compose a bunch of operations arbitrarily at runtime then using dicts and lists like this is clearly better.

Of course pipelines like this will typically be slow as hell because arbitrary queries, by their nature, cannot take advantage of indices. But sometimes that's OK. We do this in one of our products and it works great.

With JSONB and replication enhancements, Postgres is close to wiping out all of MongoDB's advantages. I would love to see a more native-like API like Mongo's aggregation pipeline, even if it's just a wrapper for composing SQL strings. I think that would finish off the job.

Elixir's primary database wrapper, Ecto [0], lets you dynamically build queries at runtime, and also isn't an ORM. Here's two examples directly from the docs:

  # Query all rows in the "users" table, filtering for users whose age is > 18, and selecting their name
  |> where([u], u.age > 18)
  |> select([u], u.name)

  # Build a dynamic query fragment based on some parameters
  dynamic = false
  dynamic =
    if params["is_public"] do
      dynamic([p], p.is_public or ^dynamic)
  dynamic =
    if params["allow_reviewers"] do
      dynamic([p, a], a.reviewer == true or ^dynamic)
  from "posts", where: ^dynamic
Across all the different means of interacting with a database I have experience with (from full-fledged ORMs like ActiveRecord, to sprocs in ASP.NET), I've found that it offers the best compromise between providing an ergonomic abstraction over the database, and not hiding all of the nitty-gritty details you need to worry about in order to write performant queries or use database-specific features like triggers or window functions.

My main point, though, is that you don't need to reach for NoSQL if all you need is a way to compose queries without string interpolation.

[0] https://github.com/elixir-ecto/ecto

As I said to a sibling response, this is not a substitute for Mongo's aggregation pipeline unless it can do analogous things to Postgres's JSONB fields. For example, can it unwind an array field, match those subrecords where one field (like a "key") matches a value and another field (like a "value") exceeds an overall value, and then apply this condition to filter the overall rows in the table?

Also, one of the benefits of Mongo's API is that it has excellent native implementations in numerous languages (we already use C++ and Python), so a suggestion to switch language entirely is not really equivalent.

> As I said to a sibling response, this is not a substitute for Mongo's aggregation pipeline

Huh? The aggregation framework is a solution to a mongo-only problem. Most other databases are performant, but Mongo suffers wildly from coarse locking and slow performance putting things into and retrieving things from the javascript VM.

> For example, can it unwind an array field, match those subrecords where one field (like a "key") matches a value and another field (like a "value") exceeds an overall value, and then apply this condition to filter the overall rows in the table?

This sounds suspiciously like a SQL view.

Edit: But if you actually need an array in a cell, Postgres has an array type that's also a first-class citizen with plenty of tooling around it.

The "this" was referring to dynamically building queries (the GP comment by me) in Ecto (the parent comment by QuinnWilton). What you've said is a non-sequitur in the context of this little discussion. My whole original point is that raw SQL isn't right in all situations, and you appear to be arguing that I just use SQL instead.

I can't speak to every ORM or database interface in existence but ActiveRecord will happily handle Postgres arrays and let you use the built-in array functions just handily without having to write queries by hand. Ecto is less elegant, but you can still finangle some arrays with it.

As far as views are concerned, I don't know what to tell you. Sure, you'll probably have to craft the view itself by hand. The result is that you can then use most abstractions of your choosing on top of it though.

There's also the possibility of using automation to create, update, and manage views. That lets your app be 'dynamic' with regards to new data and new datatypes, but also preserves the performance, debugging, segregation, and maintenance benefits of the underlying DB.

>For example, can it...

Yes. There will be a subquery and jsonb indexes need to be thought out in order to make it fast

> Across all the different means of interacting with a database I have experience with (from full-fledged ORMs like ActiveRecord, to sprocs in ASP.NET), I've found that it offers the best compromise between providing an ergonomic abstraction over the database, and not hiding all of the nitty-gritty details you need to worry about in order to write performant queries or use database-specific features like triggers or window functions.

Ahh Elixir. My favorite language that really just tries so hard to shoot itself in the foot. I'm currently in the protracted process of trying to upgrade a Phoenix app to the current versions. Currently I'm at the rewrite it in Rust and try out Rocket + Diesel stage.

Diesel is... interesting and makes me long for Ecto (which is often used as an ORM although the model bits got split off into a different project).

Love the downvotes instead of comments. I've walked away from Elixir as the best practice deployment methodology (Distillery) is non-op on FreeBSD[1] and has been for a few months while the Distillery author is mum. All of this despite the vast love that the Elixir community seems to heap on FreeBSD.

Erlang and Elixir have plenty of promise but there simply is no good story for production deployments. Distillery and edeliver approximate capistrano, and that sounds great when it works (although I'd just as soon skip edeliver). But when it doesn't I'd much rather dig into the mess of ruby that is Capistrano than the mess of shell scripts, erlang, and god knows what else goes into a Distillery release.

Elixir is a really interesting language, but Phoenix seems to still be pretty wet behind the ears and very much in flux. Ecto too to a much smaller extent.

1: Some of the distillery scripts can communicate with epmd, some just give up.

Well... you can also use a modern ORM. I think "stitching ... text strings" is definitively not the way to go when interfacing a SQL database. My go-to ORM is Sequel[1]. I think their API is one of the best I've seen: you can choose to use models, but you can also work directly with "datasets" (tables or views, or queries) and compose them as you like. It's really powerful and simple.

[1]: http://sequel.jeremyevans.net/

> genuinely better than stitching together fragments of SQL in the form of text strings. Injection attacks and inserting commas (but not trailing commas) come to mind as obvious difficulties.

You're using the Pymongo library as an example. Someone can just as easily use SQLAlchemy and not have to worry about those things.

I'm confused by the implication that someone doing things like the above would be writing in SQL. SQL is a little like assembly language in a game: You may need to drop down to it for some key highly-optimized areas, but you rarely need to directly use it for most tasks. While it's true that you should understand how it works so you don't generate queries that suck performance-wise, the same goes for Mongo's intricacies too.

Every language I know of has great ORMs which do this for whatever SQL flavors people tend to use on that platform. I write things like this all the time, and it gets turned into SQL for Postgres:

```` Article.where(author_id: 37).order(:modified_date, :desc).where.not(published: false) ````

When using an ORM correctly (and indeed, the less I'm using any of my own bits of SQL the more this is true) I am also protected against injection attacks.

I'm not saying NoSQL has no value, but I believe it to be the wrong tool for data that lends itself to an RDBMS. If you have a bunch of documents who have deeply nested or inconsistent structures and where it makes no sense that you'd want to query by something other than the primary key, sure, it's a no-brainer to use a NoSQL system. For a CMS, which has been implemented thousands of times in RDBMSs, it is madness though. I cringe at realizing that apaprently there are developers out there who have avoided learning SQL entirely in their career out of fear, and as a result have to use Mongo for every application because that's the only thing they know how to do. I'm sure they're out there, but I wouldn't hire one.

I’ll try to avoid a flame war, but since you’re using python, SQLAlchemy allows for composing sql strings.

Yes, an ORM, not SQL itself.

You can compose queries, using the queryAPI in SQLAlchemy, without touching the ORM.

SQLAlchemy is not an ORM. There’s a companion ORM project if you want it, but it’s not necessary.

Yes? What's the qualitative difference between using a best-of-breed SQL ORM and the Mongo API?

For it to replace MongoDB's aggregation pipeline, it would need to play nicely with JSONB. Does it do that? This is the thing I'm really missing.

For example, if documents in the JSONB column all look roughly like this:

        "someArrayField": [
            { "key": "steve", "value": 7 },
            { "key": "bob", "value": 15 },
        "someOtherField": [ "whatever" ]

* Can I count the number of entries in someArrayField, summed across all records?

* Can I get the per-record mean of the "value" sub-field, summed across all records?

* Can I filter by records that have a "someArrayField" entry where "key" is "steve" and "value" is at least 10 (the above record should NOT match)?

Yes, you can!

The jsonb_array_elements function is roughly similar to Mongo’s $unwind pipeline op. It explodes a JSON array into a set of rows. From there it’s pretty simple aggregates to achieve what you’re looking for.

I was evaluating Mongo a couple months back to solve roughly the same problems. Eventually discovered Postgres already had what I was looking for.

More the point Postgres has an actual array data type (and has for a while). You don't need to shove everything into a JSON/JSONB blob unless you absolutely cannot have any sort of schema.

Not only arrays, you can, with some limitations, create proper types with field names, if your ORM supports that you should use that over JSONB if it fits.

See also: PostGIS.

Allow me to restate this question:

> Does it do that?

It was supposed to be clear from the context that this meant:

> Does building queries programmatically with SQLAlchemy do that?

Maybe I'm misreading your comment, but you seem to just be talking about writing queries directly in SQL.

If not, could you give an example/link of how to programmically build a query in SQLAlchemy that dynamically makes use of jsonb_array_elements? It would be hugely useful if I could do that.

There are some old examples of how to use jsonb_array_elemens in SQLAlchemy here: https://github.com/sqlalchemy/sqlalchemy/issues/3566#issueco...

I was speaking of SQL, but if you can write it in SQL you can usually map it to SQLAlchemy. If worse comes to worse, you can use text() to drop down to raw SQL for just a portion of the query.

SQLAlchemy’s Postgres JSONB type allows subscription, so you can do Model.col[‘arrayfield’]. You can also manually invoke the operator with Model.col.op(‘->’)(‘arrayfield’).

So you should be able to do something like:


(Writing on my mobile without reference, so may not be fully accurate)

It absolutely can, but in my experience, 99% of the time, choosing to make a data field JSON/JSONB ends up being a mistake.

To add to the pile of responses: in Scala, Slick is great library that lets you compose sql queries and fragments of queries quite effectively. (http://slick.lightbend.com/)

At my company we built a UI on top of Slick that lets users of our web app define complex triggers based on dynamic fields and conditions which are translated to type-safe SQL queries.

QueryDSL and jOOq as well for java.

> I think you're asking the wrong question. The question should be: How did MongoDB become so successful?

Marketing, marketing, and more marketing. Mongo was written by a couple of adtech guys.

> Not only that, their marketing/outreach efforts were also aimed at younger developers. When was the last time you saw a Postgres rep at a college tech event?

I remember being underwhelmed by two things at the one MongoConf I went to earlier this decade:

1.) My immediate boss was an unfathomable creep who was there mostly to pick up women

2.) Mongo was focused on how to work around the problems (e.g. aggregate framework) rather than how to solve them.

I can't recall ever seeing a Postgres rep, but I can recall having worked out a PostGIS bug with a fantastically tight feedback loop. The Postgres documentation and community are nothing short of amazing.

Meanwhile with Mongo I watched as jawdropping bugs languished. IDGAF what the reps say, anyone with even a few years experience should've been able to see through the bullshit that Mongo/10gen was/is selling.

From my POV the rise of 'NoSQL' some years back was tied into a number of things:

- Misunderstanding by most developers of the relational model (I heard a lot of blathering about 'tabular data', which is missing the point entirely).

- The awkwardness and mismatchiness of object-relational mappers -- and the insistence of most web frameworks on object-oriented modeling.

- The fact that Amazon & Google etc. make/made heavy use of distributed key-value stores with relatively unstructured data in order to scale -- and everyone seemed to think they needed to scale at that level. (Worth pointing out that since then Google & Amazon have been able to roll out data stores that scale but use something closer to the relational model). This despite the fact that many of the hip NoSQL solutions didn't even have a reasonable distribution story.

- Simple trending. NoSQL was cool. Mongo had a 'cool' sheen by nature of the demographic that was working there, the marketing of the company itself.

I remember going to a Mongo meet-up in NYC back in 2010 or so, because some people in the company I was at at the time (ad-tech) were interested in it. We walked away skeptical and convinced it was more cargo-cult than solution.

I'm _very_ glad the pendulum is swinging back and that Postgres (which I've pretty much always been an advocate of in my 15-20 year career) is now seeing something of a surge of use.

> When was the last time you saw a Postgres rep at a college tech event?

Is a Postgres rep a thing?

I remember a hyperbolic readme or other such txt file for Postgres in the far-away long-ago time when everyone was on Slashdot. The author had written one of the most enthusiastic lovenotes to software I'd ever read, and that includes Stephenson's "In The Beginning Was The Commandline." It was a Thomas Wolfe level of ejaculatory keenness. I'd love to read it again if anyone else knows where I can find the file. So, even if there aren't actual Postgres reps, there are most assuredly evangelists.

That's the point.

Absolutely - if you don't know SQL and you do know JSON, postgres looks scary and Mongo looks familiar.

Saying "I don't know SQL so I will just use JSON" really misses the point though. SQL is easy. Data is hard. NoSQL products offer to get rid of SQL which includes an implication that SQL itself was the challenge in the first place. The problem then is that you have lost one of the best tools for working with data.

I dunno that SQL is exactly easy, though. It's one thing to say "select statements are essentially identical to Python list comprehensions", but in practice I still have to look up the Venn diagram chart every time I need to join anything, and performance optimization is still a dark art. I'd say SQL is easy in the same way that Git is easy: you can get away with using just 5% of it, but you'll still need to consult an expert to sort things out when things go sideways.

You could solve that by altogether dropping the Venn diagram metaphor when reasoning about joins. This is the number one problem I see with junior devs who have a hard time grokking SQL. If you think about a join as a cartesian product with a filter, where the type of join defines the type of filter, the reasoning is extremely easy.

Here's a good article about that: https://blog.jooq.org/2016/07/05/say-no-to-venn-diagrams-whe...

The hard parts of "SQL" are the hard parts of data. Joins aren't easier in Mongo. The performance optimizations you reference are tuning of a relational database, not SQL itself.

If you want to work with databases a domain specific language like SQL really provides a lot of value in solving these hard data problems.

> performance optimization is still a dark art

The idea is, in relational databases, that the vast majority of the time you shouldn't have to do it. Because you're writing your queries in a higher level (nay, functional) language, the query planner can understand a lot more about what you're trying to do and actually choose algorithms and implementations that are appropriate for the shape and size of your data. And in 6 months time when your tables are ten times the size, it is able to automatically make new decisions.

More explicit forms of expressing queries have no hope of being able to do this and any performance optimization you do is appropriate only for now and this current dataset.

> I'd say SQL is easy in the same way that Git is easy: you can get away with using just 5% of it, but you'll still need to consult an expert to sort things out when things go sideways.

Mongo and Javascript don't solve that either. In fact you get additional problems by virtue of not being able to do a variety of joins. For extra points, you're going to need to go well beyond javascript with mongo if you want performance. 10gen invented this whole "aggregation framework" to sidestep the performance penalty that javascript brings to the table.

On the other side, the postgresql documentation is second to none. SQL isn't necessarily easy but the postgres documentation gives you an excellent starting point.

Here is the dirty secret of Mongo: you always have a schema, you just don’t have any tools for validating it or enforcing it or manipulating it.

You make it sound like learning SQL is like learning Assembler. It's not that hard. And ORMs exist in every language to abstract it all away.

PostgreSQL looks scary because it is a swiss army knife. It has a million different features and data structures. MongoDB does only one thing.

> You make it sound like learning SQL is like learning Assembler

It's not that learning SQL is hard. It's that people are inherently lazy. "Learn another thing on top of the thing it already took me a couple of years to learn? No thanks."

You seem like the kind of person ready and willing to learn the right tool for the job. From my experience a few years ago on an accredit computing course that covered database admin and programming, this attitude is not representative of most of the software engineering students //unless// there's a specific assignment that requires particular knowledge.

Cs get degrees. And for plenty of developers out there, knowing one language (not even particularly well) gets jobs.

> It's not that learning SQL is hard. It's that people are inherently lazy. "Learn another thing on top of the thing it already took me a couple of years to learn? No thanks."

And that's a big fat mistake. There are so many ways to shoot yourself in the foot with mongo such that simply knowing the language mongo uses for most of its queries while not actually knowing the particulars of how mongo uses that language… well that's just a road to a world of hurt.

For example, when I first inherited a mongo deployment I noticed the queries were painfully slow. Ah hah says me, let's index some shit. Guess what? Creating an index on a running system with that version of mongo = segfault.

After a bunch of hair pulling I got mongo up and running and got the data indexed. But the map reduce job was STILL running so slowly that we couldn't ingest data from a few tens of sensors in real time. So I made sure to set up queues locally on the sensors to buy myself some time.

Even in my little test environment with nothing else hitting the mongo server, mongod was still completely unable to run its map reduce nonsense in a performant manner. Mongo wisdom was: shard it! wait for our magical aggregation framework! Here's the thing: working at a dinky startup we can't afford to throw hardware at it especially that early in the game. Sharding the damn thing would also bring in mongo's inflexible and somewhat magical and unreliable sharding doohickey.

So I thought back to previous experience with time series data. BTDT with MySQL, you're just trading one awful lock (javascript vm) for another (auto increment). So I set up a test rig with postgres. Bam. I was able to ingest the data around 18x faster.

And that's the thing. Mongo appeals to people who are comfortable with javascript and resistant to learning domain specific knowledge. All that appealing javascript goodness comes with a gigantic cost. If you're blindly following the path of least resistance you're in for a bad time.

P.S. plv8 is a thing, and you can script postgres in javascript if you really wanted to.

I think what happens (and I have this attitude too) is that "learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language, and the nuances of the specific dialect, and which of the integration tools will work well with your workflow and pipeline. So while "sure I'll just learn SQL" is great for a personal or school project, when you've got to get something done next week, it's better to take maximal advantage of the tools/skills/workflow that you already have.

IOW, it's not just laziness, it's a kind of professional conservatism. which is partly what gets older engineers stuck in a particular mindset, but it's also a very effective learned skill. The opposite is being a magpie developer, which results in things like MongoDB taking off :)

> I think what happens (and I have this attitude too) is that "learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language, and the nuances of the specific dialect, and which of the integration tools will work well with your workflow and pipeline.

You have to do the exact same things with Mongo+JS (e.g. learning when to avoid the JS bits like the plague).

learning" SQL takes a weekend...but then you know you'll wind up having to spend a lot longer learning the patterns of the language,

SQL is a skill that rewards investment in it 1000x over, in terms of longevity. It has spanned people’s entire careers! What’s the shelf life of the latest JS framework, 18 months at most...

Yes, I know that, and that's why I know and use SQL instead of MongoDB. But that's a very similar reason to why I've resisted learning Rust, and Ruby, and React, and Docker, and Scala, and many more. I know I could learn the utter basics in a weekend, but I also know that those basics are utterly useless in a real-world context, and I would prefer to spend the weekend hacking on my open-source project in Python or C, which I've already invested the years into. And that's how engineers age into irrelevance..

Well, that and SQL has a somewhat undeserved reputation for being easy to learn, but also easy to screw up. Like you write a simple looking query and it turns out to have O^2 complexity and your system ends up bogged down in the database forever.

In practice people who fall into complexity traps are usually asking a lot more of their database engine than any beginner. It's usually not that hard to figure out the approximate cost of a particular query.

> Like you write a simple looking query and it turns out to have O^2 complexity

Or you have a simple fast query with a lovely plan until the database engine decides that because you now have 75 records in the middle table instead of 74, the indexes are suddenly made of lava and now you're table-scanning the big tables and your plan looks like an eldritch horror.

[Not looking at MySQL in particular, oh no.]

Learning SQL syntax isn't hard, but learning how to properly design relational databases is not something you can pick up from skimming blog posts.

I'd say that this goes for databases in general, not just relational ones.

Which brings us back to the original point: data is hard.

Assembly language isn’t hard. It’s actually quite simple. The issue with learning assembly is the absolutely useless otherwise domain knowledge.

Document-based storage definitely fits the generalised use-case better than tabular storage.

> Not only that, their marketing/outreach efforts were also aimed at younger developers.

I do remember a lot of MongoDB t-shirts, cups and pens around every office I was in around 2011-2013. When I would ask they would tell me that a MongoDB developer flew halfway across the world to give them all a workshop on it.

> The question should be: How did MongoDB become so successful?

Ability to store Algebraic Data Types and values with lists without a hassle of creating a ton of tables and JOINs. Postgres added JSON support since, plus there are now things like TimescaleDB, which didn't exist previously.

None of this makes sense.

ORMs have existed for decades so developers can use a SQL database just fine without knowing the language. So it's definitely not this.

It's more likely because Mongo is (a) is extremely fast, (b) the easiest database to manage and (c) has a flexible schema which aligns better with dynamic languages which are more popular amongst younger developers.

Postgres is faster at json than mongo. Also the pipeline query strategy of mongo is terrible to deal with. A schema should not be flexible. Now I have to write a bunch of code to handle things that should have been enforced by the database. Postgres is incredibly easy to manage with actual default security. I know the mongo tutorial says to not run the default configuration, then why is it the default configuration. It's so easy to manage anyone can take it over for ransom.

Mongo literally has no upside vs postgres.

To use an ORM and not get crap performance you still need to understand sql, and what is happening under the hood.

Fashion driven development and not understanding how to actually make use of SQL.

At "large financial news company" we had a "designed for the CV" tag that applied to stupid architectural decisions (of which there were many)

One of the biggest and most expensive was using Cassandra to store membership details. Something like 4 years of work, by a team of 40, wasted by stupid decisions.

They included: o Using Cassandra to store 6 million rows of highly structured, mostly readonly data

  o hosting it on real tin, expensive tin, in multiple continents (looking at >million quid in costs)

  o writing the stack in java, getting bored, re-writing it as a micro service, before actually finishing the original system

  o getting bored of writing micro services in java, switching to scala, which only 2/15 devs knew.

  o writing mission critical services in elixir, of which only 1 dev knew. 

  o refusing to use other teams tools

  o refusing to use the company wiki, opting for thier own confluence instance, which barred access to anyone else, including the teams they supported

I think the worst place I ever worked was like that. Was going back quite a number of years now but it was a startup fired up by a one of the lesser MBAs to utilise a legal loophole to slice cash off the public via a web app and a metric ton of marketing to desperate people.

Step one was hire everyone the guy had worked with at his previous company. They were all winforms / excel / SQL / sharepoint / office developers from big finance and had no idea where to go really. None of them had even touched asp.net.

Cue "what's popular". Well that was Ruby on Rails back then on top of MySQL and Linux. 4 people with zero experience pulled this stack in and basically wrote winforms on top of it. Page hits were 5-8 seconds each. Infrastructure was owned by SSH worms and they hadn't even noticed.

I think I lasted two days there before I said "I'm done".

I once saw a CEO who wrote his own Web Framework and forced the entire company to use it.

At the time, under the influence of React, the idea was to "build web application sorely based on Functional Programming". Since after years of trying no one could figure out what that meant, the company ditched the CEO and ended up wasting a couple years of work.

RDD, resume driven development....

well Elixir might have being a good decisions :)

only if more than 0.3% of your workforce know it....

And thus, nothing new was ever developed again.

It's not that you don't write anything in a new thing. It's that you start with small, less critical projects. Get your feet wet, give people a chance to get a feel for the pros and cons, that sort of thing.

Jumping right into writing "mission critical services" in the brand new language that few people at the company know well is asking for trouble.

There is a time and place for new. This isn't it.

there were no problems of scale, speed or latency. It was migrating from one terrible system to something that should be smaller, simpler, cheaper and easier to run.

The API is/was supposed to do precisely four things:

  o provide an authentication flow for customers
  0 provide an authentication flow for enterprises to allow SSO
  o handle payment info
  o count the number of articles read for billing/freemium/premium
That is all. Its a solved problem. Don't innovate, do.

Spend the time instead innovating on the bits that are useful to the business and provide an edge: CRM, pattern recognition and data analytics.

"real tin"?

Actual physical hardware, bought by the company and installed in a rack somewhere.

Not ephemeral Cloud Stuff rented from some company.

Reminds me of

“The computer industry is the only industry that is more fashion-driven than women’s fashion.” — Larry Ellison

This actually resonates with me, maybe not in the way Ellison intended as I’m not familiar with the context he said it in. A bit off the main topic but the more I revert to just using emacs for some task I previously used a .app bundle or web page for, the more I question how much we the computer industry has just been spinning its wheels for the last 30+ years. I honestly can’t really tell what value WIMP-centric GUIs have brought to the table besides fashion, let alone the endless debates in the form of actual implementations about the best way to build one. Possibly the best argument ever made died with Mac OS 9.

I suspect that in a year I’ll be using what is effectively an Elisp machine with an Apple logo paired with an iPhone.

Discoverability and self documentation is generally the advantage of GUI systems. Well built systems can be understood in a few moments without needing to consult a manual. That's almost impossible in a pure CLI environment.

Of course it's entirely possible to screw this up, modern phone-centric design standards are really bad about it for example, but in general you need to consult the manual (or Google) far less often.

I’m not convinced these are inherent properties of GUIness. We have had GUIs in use for long enough that many elements even across slightly different GUIs are familiar enough, this had allowed conventions and then conventional wisdom to develop.

Early GUI developers also put effort into making their GUIs at least somewhat intuitive, but they also bundled thick books of documentation on how to use their systems.

To a degree, they are self-documenting, but not because of their GUIness so much as their design choices; menus helped a lot with this. Menus actually are a good subject to touch upon, they are definitely one of the better conventions we developed, and they were based upon and analogous to a restaurant menu. However their very nature as a list of commands you can issue a GUI does also typically but not always, limit the application. If the only options available to you are what is on the menu and there is no other interface, then an application is much more limited. This isn’t even true of most restaurants which will often allow you to issue an order for something not on the menu if they have the ingredients, equipment and expertise to make it.

But as a convention, it is not limited solely to GUIs, you can incorporate menus into any interactive interface.

I think the real innovation wasn’t GUIs, it was interactive software. The innovation beyond that is scriptable software.

What is important isn’t GUIness, but the developer’s intent. If you develop software with the intent to be self-documenting and discoverable, you will end up with an interface that is both of these things provided you did a competent job of it. A GUI might help, relying on platform conventions might help, and using common cultural conventions might help, but these aren’t the necessary ingredients for those qualities.

Emacs has the quality of self-documentation, but it is emphatically not a GUI even though it is interactive.

I would say well built systems, provided you intend to do anything productive and even slightly complex, should have a manual included, or else it isn’t a well built system.

> I suspect that in a year I’ll be using what is effectively an Elisp machine with an Apple logo paired with an iPhone.

It's been working for me for quite a while now. And it's depressing, but here we are.

It isn’t all bad. My laptop is 6, almost 7 years old at this point and it has received a few upgrades in that time. This change in habits does lower the minimum system requirements for its eventual replacement from “runs Mac OS X” to “runs emacs”, but I might be able to stretch its life out a bit longer now.

vi, cvs|svn, C, bash, awk,sed, python and tcl. Linux (rpm based) or FreeBSD. Firefox or lynx. Haven't changed much in 20 years.

> I honestly can’t really tell what value WIMP-centric GUIs have brought to the table besides fashion […]

Literal billions of users.

Billions of users of which things that are worthwhile?

IMO there is still a place for schema-less document databases. It's just that Postgres's JSON columns mean you can get the best of both worlds, which makes Mongo look weak by comparison.

I would rather say, "Postgres's JSONB provides a hybrid compromise that may meet the needs of many users." JSONB feels like it's closer to creating more complex data types that are less primitive than those that SQL currently allows.

The real driver of the NoSQL movement, I believe, was that everybody wanted to be the next big social network or content aggregation site. Everybody wanted to be the next Facebook, Instagram, Twitter, etc. and that's what people were trying to build. Ginormous sites like these are are one of the applications that strongly favors availability/eventual consistency over guaranteed consistency, whereas most other applications are quite the opposite.

Nobody really cares if your Instagram post shows up 10 minutes later in New York than it does in LA, and certainly not if the comments appear in similarly inconsistent order. It's one step above best-effort delivery. However, your bank, hospital, etc. often care quite a bit that their systems always represent reality as of right now and not as of half an hour ago because there's a network problem in Wichita.

The question is, "If my data store isn't sure about the answer it has, what should it do?" RDBMS says, "Error." NoSQL says, "Meh, just return what you have."

> The question is, "If my data store isn't sure about the answer it has, what should it do?" RDBMS says, "Error." NoSQL says, "Meh, just return what you have."

Even that's too simplistic. For most RDBMSes, the answer depends on how you have it configured, and usually isn't "Error". If you're using a serializable transaction isolation level, it usually means, "you might have to wait an extra few milliseconds for your answer, but we'll make sure we get you a good one." Other isolation levels allow varying levels of dirty reads and race conditions, but typically won't flat out fail the query. This is probably the situation most people are working under, since, in the name of performance, very few RDBMSes' default configurations give you full ACID guarantees.

To the "DB in NY knows something different from DB in LA" example, there are RDBMSes such as the nicer versions of MSSQL that allow you to have a geographically distributed database with eventual consistency among nodes. They're admittedly quite expensive, but, given some of the stories I've heard about trying to use many NoSQL offerings at that kind of scale, I wouldn't be surprised if they're still cheaper if you're looking at TCO instead of sticker price.

This likely isn't true for banking systems.

Many ATMs will still give you money when they're offline, and things become eventually consistent by comparing the ledger.

Shops also generally want to take orders and payments irregardless of the network availability, so whilst they might generally act as CP systems, they'll be AP in the event of network downtime, but will likely lose access to fraud checks, so may put limits of purchases, etc.

They're probably all CP locally and AP (or switchable from CP) from an entire system perspective.

I like the JSONB support. Not for storing data but to query it with shitty drivers.

Some array magic and jsonb_agg and suddenly you can get easy to json decode results instead of having to play with rows in your app. Yes you can also do it with xml_agg but these days people tend to consider anything xml as evil (they're wrong).

I would say that development of JSON fields in Postgres and MySQL was accelerated by the adoption of Mongo.

Speed to market/first version using JSON stores is attractive, especially when you're still prototyping your product and won't have an idea of exact data structures until there's been some real world usage.

It's amazing how many insurmountable SQL performance problems can be surmounted by putting things in 3rd normal form.

That, and simply reading the documentation. N+1 problems and denormalization-by-design are a failure to understand SQL.

But denormalization is one strategy to improve the performance of a rdbms.

Standard rules about performance optimization apply. Denormalization can improve performance, but it can just as easily harm it.

For code, I think by now we all understand that you should always start with clean, well-factored code, and then optimize only as much as is necessary, which is usually not at all, and always under profiler guidance. It's the same with DBs: You start with a clean, well-normalized schema, and then de-normalize only as much as is necessary, which is usually not at all, and always under profiler guidance.

Also, keep in mind that improvements in compiler technology over time mean that the performance tricks of old can be useless or even actively harmful nowadays. This is true of SQL every bit as much as C.

Denormalization as way of dealing with performance issues is like guilottin to cure headache. And more frequently than not the headache is still there even after that.

It is true that for some narrow class of analytical workloads 20-25 years ago (behold BW of 199x) the denormalized case performance was better compare with straight non-optimized running of the same queries over normalized schema. Since then, the exponential availability of RAM and huge increase in streaming speed of HDD with stagnating IOPS (the main mistake in analytical workloads on HDD in the last 10-15 years - using nested loop join with indexed lookup into the large facts tables :) have made denormalization obsolete and harmful. If anything, the emergence of SSD and the huge RAMs moved things even further toward and beyond normalization, by making the "super-normalization", i.e. columnar tables, a viable everyday thing.

What do you suggest in place?

I would suggest not denormalizing at all.

The idea that joins are slow is a holdover from the bad old days when everyone used MySQL and MySQL sucked at joins. On a more robust DBMS, a normalized schema will often yield better performance than the denormalized one. Less time will be lost to disk I/O thanks to the more compact format, the working set will be smaller so the DBMS can make more efficient use of its cache, and the optimizer will have more degrees of freedom to work with when trying to work out the most efficient way to execute the query.

(edited to add: If you're having performance problems with a normalized schema, the first place to look isn't denormalization, it's the indexing strategy. And also making sure the queries aren't doing anything to prevent proper index usage. Use the Index, Luke! is a great, DB-agnostic resource for understanding how to go about this stuff: https://use-the-index-luke.com )

It's not, it's a strategy to improve the performance of a particular query or access pattern, and is usually the last resort after things like proper indexing, aggregations and materialized views.

JOINS are fast and it all comes down to how much data you're moving. If it's a large table joining to a small set of values, then the joined data is quickly retrieved and applied to the bigger table, with great performance.

If the join is between two large tables where every combination is unique then that's the unique case where the joined table is just adding another hop to get the full set of data for each row and is a perfect candidate for denormalization, although in that case it probably should've been a single table to begin with. Of course there's a spectrum between these two scenarios but it takes a lot before denormalization makes sense on any modern RDBMS.

True, if

1. You start from a normalized schema.

2. You denormalize in a structured way (eg dimensional modelling), rather than any old how.

3. You test the change.

Database query planners work better when they can take logically-safe shortcuts in their work. In large part that comes down to a properly-constructed schema.

Denormalizing makes it harder for the query planner. It also means you will probably lose out on future query planner enhancements.

Generally speaking, if someone wants to denormalize, I want to know the actual business value created and that the business risk is properly understood.

Sure. But before one makes that argument, one must measure. It is not in any sense a panacea.

It turns out that not defining a structure to your model makes it easy to prototype but difficult to query the data afterwards.

Which is why MongoDB is awesome for teaching and building new stuff but horrible for reports, metrics, and scaling.

Who said that they moved to relational model? They chose PostgreSQL because it's JSON support makes it also a superb document database.

What really happened is people wanted something new but did not want to change the ways they DESIGN their application and processes. There is place for NoSQL but if you are going to use it as if it was SQL then you would be better served by an SQL database.

Also, I think MongoDB tried to be everything and failed to be good at anything. It offers neither stellar performance nor scalability and I guess for most projects there is not much advantage over regular SQL database. Certainly nothing to fight over when there is much more technology choices to make.

Just a few thoughts. A lot of little companies like picking hyped tools because they think it will differentiate them and server them well later, but most companies never get to that stage and don't really need what is being offered. Seems much safer to take the tried and true standard solution that has worked for decades and just make your UX outstanding rather than try to put together a "dream team" of new tech...

Another thought. Most companies have absolutely no idea how to select the technology. A lot of people that are in position to decide don't have all that much hands on experience with the new technology and so they will decide based on other factors like whether his manager would like or would like not see new technology, what other companies are doing, etc.

Another example is "Agile". Everybody is doing it yet I still wait to see a single company that understands what the term means. My current boss is big promoter of "Agile" which in his language is synonym to "Scrum". Yet when asked he has never heard of Pheonix Project, The Goal, theory of constrains or basically any theory at all. So what the people are doing is fighting fires almost 100% of the time with not much project work done for the effort and absolutely no improvements. Yet, because everybody complies to do daily standups and Jira updates we are 100% agile.

Don't even get me started on devops...

Nice writeup. I love this quote: "The first few times an engineer sees this kind of hype, they often think it's a structural shift. For engineers later in our career, we’ll often dismiss structural shifts as misplaced hype after getting burned too many times"

> What happened?

Text and still imagery based media companies aren't big data providers. NoSql was a bad choice from the start.

People realized a lot of the claimed not only SQL offerings were actually no SQL at all. It turns out it is nice to have those extra features of NoSQL as well as a more traditional RMDS instead of just the NoSQL parts.

Also, the RDBMS world is full of some of the oldest scaling experts in the yellow pages. It's interesting how surprised people were that many of the "traditional" RDBMS were able to catch up on some of the scaling support that were the biggest advertised advantages of NoSQL.

It's interesting because I feel like the NoSQL world spent a lot of time reinventing the RDBMS from the "opposite direction". For all its faults, SQL is a fascinating language because it mostly ignores low level details of how the database scales, how it operates under the hood. It's not that SQL is intrinsically hard to scale (certainly at the relational algebra roots it shouldn't be hard, in theory), but it certainly leaves a lot of work to anyone building a database engine to figure out what/when/why/how to scale. I feel like a lot of RDBMS' query analyzers/planners resemble things like HBase a lot more than folks realize.

It's great that NoSQL realized that sometimes those "low level details" in an SQL engine are useful in their own ways, and have increased the spectrum of performance versus power/flexibility trade-off options. But it shouldn't be that big of a surprise that SQL databases remain competitive in that trade-off space, given under the hood they've had to think about a lot of that stuff over many decades.

I found it a little funny that NoSQL started becoming popular during at least some of the same years that static typing starting becoming popular (again).

I don't see it that way -- I feel like NoSQL's rise was more or less coincident with the adoption of Rails, Django, and Node over Java. The surge in interest in new static languages has mirrored the resurgence of Postgres, right around version 9.4 (and JSONB).

Could be. I haven't thought too deeply about what happened when.

I think they're both responses to the same challenges: distributed web enabled applications.

On the server front that means ever increasing complexity with decoupled microservices and latency issues that play nicely with the classic approaches to those domains (static typing, functional programming).

On the data front sites like HN, Reddit, or Facebook need scability more than consistency, and have oodles of 'uninteresting' data that jives nicely with a schemaless document store.

I've noticed that, too. There's an interesting ping-pong effect where a fair number people have flipped from strong typing at the database layer & dynamic typing at the view layer to the reverse — it seems like someone could write an interesting group psychology paper about how that cycle has repeated over the years.

I don't think I'm a Luddite but I always had the feeling I just had to sit out the noSQL / JS everywhere storm.

A great series - thanks! Required reading for anyone preparing to "take a punt" at a technology they've not really done much actual work with.

Even large organization like Guardian can have some cheep tech team who rush for such a poor choice? Also dumb for management to let it happen.

I really don't get this as an indictment of MongoDB, or their OpsManager product really.

They used the version of OpsManager that doesn't manage the deployment - is specifically not a deployment manager. Mongo does offer a managed version of this software, which the author mentions - with a justification for why they couldn't use that offering. However, I think this was the main mistake that The Guardian made. As the author notes: "Database management is important and hard – and we’d rather not be doing it ourselves." They underestimated the complexity of managing database infrastructure. If they had been attempting to set up and manage a large scale redundant PostgreSQL system, they would have spent an enormous engineering effort to do so as well. Using a fully managed solution - like PostgreSQL on RDS from the beginning would have saved them time. Comparing such a fully managed solution to an unmanaged one is an inappropriate comparison.

Full disclosure - I used to work at MongoDB. I have my biases and feelings w.r.t the product & company. In this case I felt that this article didn't actually represent the problem or it's source very accurately.

Fair criticisms! It's true if we'd used Mongo Atlas or something similar it would likely have been a different story - often the MongoDB support spent half the time on the phone trying to work out what version of mongo, opsmanager etc. we were running.

Re criticism of OpsManager - I think this is fair, given the sheer number of hoops we had to jump through to get a functioning OpsManager system running in AWS - no provided cloudformation, AMIs etc. £40,000 a year felt like a lot for a system that took 2 or more weeks of dev time to install/upgrade. The authentication schema thing was a bit of a pain as well, though we were going from a very nearly EOL version of Mongo (2.4 I think).

> often the MongoDB support spent half the time on the phone trying to work out what version of mongo, opsmanager etc. we were running.

That sounds awful. Reading these stories I'm happy I work with small companies without such a huge infrastructure.

Support at this scale is a hard game. I know some of the guys who work over there, and they make a valiant effort.

Training is exceptionally hard. Databases are hard to manage, and it takes years to learn to diagnose their function on unknown hardware / software.

This all being said, "no provided cloudformation, AMIs etc." is no bueno - not a good experience for the user.

If you haven't used Mongo with WiredTiger, you really haven't used it at it's best.

I do like the article but it sounds like they’re in over their heads, this whole (very risky) project could have been avoided if they just brought in someone that knew what they were doing.

> Clocks are important – don’t lock down your VPC so much that NTP stops working.

> Automatically generating database indexes on application startup is probably a bad idea.

> Database management is important and hard – and we’d rather not be doing it ourselves.

This is true of any database infrastructure with redundancy/scalability requirements.

What they did was take a technical problem and solve it by buying an off the shelf solution. Which is fine, of course, but I’m a bit surprised by the reaction here on HN.

The pricing of hosted MongoDB solutions is high especially as volumes increase. The last I checked, disk space doesn't free up when you drop documents or collections and that adds to hosted cost unless you find time to fix the issue through manual intervention. This costing is moving us away from MongoDB in the future.

MongoDB will reuse free space when documents or collections are deleted (even though the storageSize stat will remain the same). You can compact the database to release free space to the OS. You can read more about how MongoDB's WiredTiger storage engine reclaims space here: https://docs.mongodb.com/manual/faq/storage/#how-do-i-reclai...

I work for MongoDB so if you have any questions about storage, feel free to reach out.

Document DBs are like blockchain projects - overhyped and worse than existing solutions for nearly every use case. Why do ostensibly smart engineers keep falling for this stuff?

Well, we're kind of comparing MongoDB of ~2011 (when The Guardian started using them) with Postgres of today.

One major change is that in 2010 you couldn't always run your whole DB in RAM, so there were some real performance benefits with MongoDB.

Another difference is that MongoDB was early with great JSON-support. Something that Postgres has since gained.

I think there are pros and cons with both. If I'd chose one today, for most tasks I'd probably chose Postgres.

That said, to understand why people made those choices, you need to 'teleport back' to that time and compare them in that year (given the tradeoffs at that time).

If somebody is teleporting back then, the fact Mongo had a _global_ Mongo instance level lock for writes should be more than enough for people to run screaming.

I worked with databases that only had table level locks(not row level) and there were more than enough occasions I cursed the creators.

Instance level(& indeed DB level) is insanity unless your DB is a read only DB.

Not saying you're wrong, but a news site and CMS has different database requirements than many other products.

They have very few write actions; in the Guardian's case hundreds of authors publishing a few articles a day, versus millions of users reading data. Writes would be sporadic and batched.

News products also changes more rapidly than you might expect. A modern news team may be writing and shipping code to better cover a breaking news event. I can imagine why a document store would be appealing to them.

I don't think Mongo was used much by people who would know why instance level lock is bad, or even what it is. And if someone who knew came later, it was too late - Mongo's proprietary querying language is a vendor lock-in.

Oh! From The Guardian write up I had the impression they were using it until mid 2018..

It is not easy to switch, so until the pain is too big, one doesn't do it.

For one thing, a good chunk of what happens in the industry isn't particularly well described by the term "engineering," which in my mind describes someone who is well-trained and experience-versed in relevant scientific models for the application domain and the dynamics of relevant "materials" and their interaction that can be composed into a solution. It might be the term "craft" better describes a lot of software development activity.

For another, it may be everyone is susceptible to the influence of trends. Even engineers. And the more complex the details of a subdomain of software engineering are, the bigger the tradeoff between becoming someone who can make a true engineering assessment in that niche and developing expertise elsewhere...

It's also largely a fact that we're paid better for following hypes, at least in the UK. Thus a smart "engineer" who also cares about their bank balance pretty much has to use everything new to stay relevant.

Further, most of the use cases are fairly shallow, so it doesn't actually matter that much, and you'll be paid more to paper over the cracks.

We need to change the market, but it doesn't like the strong and stable engineers are getting lead roles to pay more strong and stable engineers to build boring tech, despite that actually being better for most businesses.

Probably for the same reason people have a build process to load in 100s of KBs of Javascript in order to render an <h1> tag, and the compile it all down to HTML and declare static sites are the wave of the future.

You're going to need some supporting arguments if you're actually trying to say that every document database is not worth using.

Sometimes the choice isn't in the hands of the person most qualified to make that decision.

They are not overhyped, definitely not as much as old school RDBMS stuff. It's just most engineers can't make good database choices no matter what database they choose. Or more generally they can't make good infrastructure choices, as those are out of their competence and are mostly about things like operations and distributed systems, that take a long time and a lot of experience to get to a level of good decisions.

You cannot be old-school and hype. That's the reason why it is old school. And the people who can't make good database design choices are exactly the kind of people who should be using SQL. Postgres knows how to optimize and plan queries efficiently based on the actual distributions of values in your dataset. These poor choosers should be doing that... by hand?


hipsters are allegedly the cool old-schoolers.

the whole promise of mongo is/was distributed (HA+LB), which was all the rage back then, when AWS AZs dropped like flies every few weeks and scaling was seen as the problem. go fast, break things was the mantra.

and it's still not trivial to do pgsql maintenance without downtime, whereas in a clustered/distributed "solution", you can enjoy certain additional freedoms. this includes the freedom to shoot yourself in the foot (data inconsistency, but you had to fight a bit for that state by promoting a non latest slave to master).

> the whole promise of mongo is/was distributed (HA+LB)

Yes. It is so easy to bring up a mongo cluster and feel like you have HA. Don't worry it won't be proven wrong until you have writes during your cluster degradation and are unlucky.

it works. 4.0 gives you transactions.

it's pretty great.

and, most importantly, as far as I know the default replica set config is safe. (durable, consistent, atomic) it handles a lot of failover scenarios for you automatically. and you have to manually intervene to get into a bad state - which might be okay for that particular business case (better than full downtime)

This is what I'm talking about. You think that some feature that makes performance unpredictable is that important. While it's the last thing you should care about.

What people should care about is the integrity and consistency of the data they're storing. RDBMS do an extremely good job of generalising this problem with reasonable performance, which suits the majority of real world problems. The ideas behind it are understood by most of the industry.

It should be the default choice (or plain filesystem storage), unless you have a specific requirement for something different. In the latter case, this should be people who are well informed about the database choices they're making.

Unpredictable performance would mean, you can't predict how long a specific query will take. This is the opposite of how postgres works. If a query is slow, it's predictably slow if the data doesn't change. So what you must mean is, when the data changes, it doesn't behave like you expect.

So again, how do these poor choosers make better decisions in the face of changing data when they have to write all the algorithms by hand, and understand how to scale the data correctly? Seems to me that pretty soon they'd just be writing a very crappy database on top of some keystore.

The only way for query performance to be truly predictable is for your data to be static. If that's not your situation then a query optimizer is incredibly helpful because it means that you don't have to rewrite your queries every time you add an index.

It also means that as your data grows or shrinks, the optimizer will notice and change the plan accordingly if it makes sense.

That means that 18 months after you implemented a feature you've long since forgotten about, you don't have to come back and figure out what the new plan should be. And that's huuuuge.

You can find accounts of the db changing the plan from a good one to a bad one, but I'll go out on not that far of a limb and say that those are < 1% of the cases. Nobody complains about the queries they didn't have to come back and change. And the better the optimizer, the better that trade off will be.

> Why do ostensibly smart engineers keep falling for this stuff?

Review driven design is likely a large contributer. If you know you will be looking for a new job in 2 years, you better get some experience in $latestTrend instead of $provenTechnology.

perhaps they don't agree with your thesis.

I never got around to using Mongo as my main doc db because it was incredibly hard to find a management tool.

I now use json supported functions in SQL Server and do not have the need for a different type of database. SQL Server handles my small 'documents db' implementation with the infrastructure of a RDMS. Win win for me.

To me Mongo just got popular by mistake way too early. It's like having a celebrity retweet your post because they liked what they saw at the time, exposing you to the world where everyone now thinks you have something important to say. Not surprisingly, you don't!

I don't have any links to prove my point but I suspect we need to thank Mongo for pushing the traditional SQL databases to support JSON.

I think the fact that the Cosmos DB "document db" API clones the MongoDB API, of all the available options, is a link in that chain.

Why? Seriously, JSON is lame, it's an idiotic trend, a fashion from the UI that's infected the back end for no gain whatsoever for anyone not using JavaScript. XML is better and we've just watch JSON slowly reinvent everything XML already had.

I think he meant json as a data model, ie nested objects, arrays and PODs, rather than json as a textual representation..

FWIW and insofar as I can recollect, JSON was supported by Postgres before Mongo was a thing, and Mongo was built based on Postgres.

No, that's not correct: Initial json support in PG was added in early 2012 https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit... whereas mongodb was made public in 2009, according to wikipedia: https://en.wikipedia.org/wiki/MongoDB

I don't see how Mongo was built based on Postgres, either?

My guess is the grandparent is (mis-)remembering the kerfuffle[0] around mongo shipping a copy of PG as a "BI Connector". But yeah, the timeline is off, that was in 2015.

0: https://news.ycombinator.com/item?id=10697692

Actually, no. I'm remembering asking myself how Mongo worked back when it launched, and reading somewhere that it was a heavily modded Postgres under the hood (like Amazon Redshift, which makes no secret about it, since).

I might be misremembering, though. And there's a possibility that what I read then was inaccurate.

I might admittedly be remembering this wrong, but I seem to recollect that Postgres had a couple of (unofficial) contrib modules to support JSON back in 2008 or so, and that (unfortunately buried in Postgres vs Mongo articles) the initial release of Mongo was actually a fork of Postgres for all practical intents.

The relevant data structures and indexing mechanisms (hstore, tsvector, gin, gist) were all available -- if only experimentally -- before Mongo launched.

I have seen so many critical articles of Mongodb. Even if I did have a use-case particularly well-suited for it, I would never consider using Mongo.

As an aside, I did consider using SQL Server until I looked at the licensing fees. Why would someone choose SQL Server when options like Postgres or MySQL/MariaDB exist? Is there a specific SQL feature (MSSQL or Oracle SQL) provides which is not available elsewhere and would be a core feature which the companies data storage is built upon? I.e. that feature is so important that the companies product architecture would be fundamentally different without using the proprietary database.

SSRS is a big draw. There are other and probably better reporting tools, but the integration and support from MS is worth something to most companies. There's also simply the network effect at play and marketing, that MS has always backed that up with results. The money involved really isn't prohibitive for most small to mid-size companies that I've worked with. I've only seen the opposite, companies leaving their expensive UNIX licenses to move to an all-MS stack to save money. I have never seen large companies try to reduce their technology costs to zero with pure open source, because I think they intuitively know that you either pay with your licenses or you pay with maintaining a lot of expensive expertise, that also needs properly managed and effective. In the discussion of free as in freedom vs free as in beer, most companies core business is not technology, and they know there's no truly free beer and don't really care about the freedom. They want everything I've already mentioned, and the short and longterm support the license buys.

Outside of my professional career, with my personal projects I use SQLite, but if I were to build out something that intended to be larger scale business venture, I'd probably go with Azure SQL Database for multiple reasons. A large one being MS's overall integration, including their full control of the stack and CI/CD with .Net->Azure DevOps->github->Azure Pipelines->Azure PaaS/SQL Database. I admire what they're building, but a lot of people work outside of the MS realm. Companies don't tend to have a tech stack bone to pick as we do on HN, and many are already using part of the MS stack.

In sum, there's just a lot of business realities at play. I wouldn't personally run out and buy a SQL Server license myself, for what I do at home or with my (very) small side business, either.

Integration. It is all about the integration. M$ platform is so tightly coupled that many companies see the value in spending on this tight integration than spending it on 25 different Engineers to manage their 40 vendors that may be needed to run some of these applications and platforms.

Dealing with one vendor, M$, is better than dealing with 20. You pay for one support package and get all the benefits that come with one platform.

As much as I love open source, I do see the benefit of working with one vendor on all of your stack needs.

M$ reporting systems of SSRS SSIS and other is probably their bread and butter when it comes to DB space. Very few want to spend 60-70 hours a week building complex command line reports where with SSRS and SSIS most of these tools come with a nice GUI that helps you build these reports. Sure there are others that do the same but most required a third party vendor to add to the functionality. For example Apache. With Java and Apache I've had to deal with literally 15 to 20 different vendors just to do the same thing I can do with one vendor. Jenkins, Zookeeper, Camel, Cassandra, Tomcat all under the banner of Apache, but in reality governed by their own set of standards. I mean take a look for your self: https://www.apache.org/index.html#projects-list

>Why would someone choose SQL Server when options like Postgres or MySQL/MariaDB exist?

Because they're in an organization where they're already invested in Microsoft technologies, so it's much easier to just use MS's DB instead of something entirely different that doesn't integrate as well.

It's similar to why many people use Apple software products that aren't as good as alternatives: if you already have an Apple platform/device, it's easier and better integrated and likely already installed.

Understood, but I should have prefaced my question with “for greenfield / new development”.

My answer is still the same, and actually assumed that. No one is really "greenfield": everyone is already invested in some technology to some extent. If your organization is already invested in MS technologies, your managers only know MS, and your developers only know MS/.NET/etc., then the choice is pretty obvious: MS SQL Server. Switching to Linux/PostgreSQL would be a sea change for an organization like that (even though, IMO, it would be better in the long run).

Similarly, if your organization is already a Linux-based one, running Linux servers for everything, adopting SQL Server would be a huge PITA and would likely be laughed out of the room if suggested.

I really like SQL Server, especially the dev environment, 1st and 3rd party management tools. Not sure if they have improved this, but replication was a limiting factor and the pricing changes a few years back pushed it down my preference list.

Disappointing article. There are no clear reasons mentioned for migrating away from MongoDB. "All these problems" and some issues with Ops Manager are mentioned, where "all these problems" is a couple of outages. As if other database technologies prevent outages. As soon as they experience a couple of outages with their new stack they will migrate to something else, presumably.

Thank you! I was reading through the comments wondering how no one has seen through the bullshit veil purported by the title, followed by a detailed outline of their migration which I think is the "shiny object" to distract people from an unfounded argument.

The main advantage of the migration was getting onto a fully managed DB. What we didn't mention was that there were also huge cost implications - savings of ~£40k - through switching to RDS - compared to paying for a mongo support contract.

My takeaways:

1) MongoDB Atlas would work well here: it’s Mongo hosted on a cloud provider of your choosing managed by the people that make it.

2) a DBA, even a noSQL one*, is worth every penny. (I don’t mean this to put down noSQL DBAs just that it’s good to have someone dedicated to managing them and that a DBA versed in the administration and upkeep of a relational system could retain pretty quickly to help get the most out of a noSQL one)

Fantastic write up article.

A few things stood out at me. In no particular order:

- Going with Mongo in the first place cost them dearly. CMS are a weird application for schemaless. Not necessarily wrong, but definitely weird. I wonder if they would benefit from moving to more structured schema and I'm willing to bet a lot of the migration complexity comes from that in the first place.

- God damn that is a long migration. Holy shit. I know tech isn't their core competency but they do seem to have very competent staff. I've worked in places this glacially slow and I get how it can get this bad but it really strikes me as having no right to be. 10 months to migrate .... from the point where they were ready, until they were done. And somehow integration tests got overlooked during all that.

- Close call with dynamodb. Bit of a wtf on waiting for the feature to be implemented for nine months though. I'm sure they have an account manager with aws... I definitely think blocking an internal process on such a fragile externality as a closed process upstream publishing a feature is the wrong move and a red flag. Their migration path would have probably been harder with dynamodb too.

- I feel the pain of their troubleshooting issues on the load testing step. I can completely see how this can happen. That said it also raises a few red flags to me. Letting something as simple as the load testing step get complex enough to require weeks of engineering is ... Eh.

Have more thoughts but I hate typing on mobile. This is a fantastic write up, I love when non-tech companies publish this sort of stuff. And hurrah for postgres.

> CMS are a weird application for schemaless.

That depends. If the CMS is managing documents that have flexible structures that are tree-like, it might not be such a horrible idea to model that structure in a document instead of relationally.

Document stores might be great for the C but not so much for the MS.

Or to put that in context of a basic data structure:

C = Actual content to be published

MS = User accounts, user permissions, user authentication options, user authentication logging, meta change logs (user X removed tag Y at time Z), behaviour logs (user X viewed revision Y of article Z at time A)... and that's just scratching the surface of a very, very basic CMS.

I'm being generous and assuming articles, tags, bylines, and attached media (with full change history) is all "C".

I enjoyed the write up. I love these kinds of semi-technical, semi-story-telling, pieces.

I am saddened by many of the comments here though which equate to: "never try anything until you know everything" - sorry but that's just not realistic and it's unfair to the people who - commendably - contribute to these write ups and hold their hands up to mistakes-made, decisions that went badly with hindsight, etc.

Bigger picture: the Guardian appears to be thriving, and is succeeding based on the efforts of the tech team here. So if you read this and come away with a sense of "failure" you're probably missing something important.

It's OK to try new things and fail - just get back up and keep on trying, and use the new wisdom you build!

The creator of Envoy had some choice words about Mongo a few days ago. https://twitter.com/mattklein123/status/1074717204224999427

Plenty of people here like to dish on Mongo and the product seems to have been re-architected a few times since I used it seven years ago. By what metrics can we say the product is one worthy of passing a HN smell test? Passing Jepsen was seemingly not enough. https://www.mongodb.com/jepsen

> By what metrics can we say the product is one worthy of passing a HN smell test?

Common sense and formal education? I'm sorry, I'm aware of how incredibly snarky and arrogant that sounds, but in this case I always struggled to comprehend how MongoDB, or most of "NoSQL" in general, was considered viable to begin with.

"Schemaless" just immediately means that instead of the database keeping consistency, you now essentially have to do all your type and constraint checking in the application; I never understood how that's favorable, and how that remains maintainable in any way. On top of that, NoSQL (and maybe Mongo in particular?) also decided to throw away all guarantees that classical SQL databases have been offering for multiple decades for very good reason.

I think everyone who's had a somewhat theoretical class on databases and learned about, for example, what ACID really means, will have smelled that something doesn't quite add up here.

> "Schemaless" just immediately means that instead of the database keeping consistency, you now essentially have to do all your type and constraint checking in the application...

This is so true. Some component has to maintain integrity of the data. Should it be the application developer? Or the database software team?

I know which one I'd choose, and which one is focused on data integrity and not business logic.

And it’s not just who you trust, it’s also at what layer you want to enforce your integrity and consistency. Not letting the db do that means that every part of the application that touches the database needs to do that enforcement. Even if you abstract that away into a strict layer directly above the database—which essentially means you’ve just implemented parts of a database on top of a database—you have to essentially invent and implement the methods and languages to do so yourself. At which point you’re pretty much guaranteed to be much worse of then if you had just let the DB do it. This might mean handling unexpected errors if you have a bug in your application logic, but that’s usually orders of magnitude better than introducing hard to fight inconsistencies.

> This is so true. Some component has to maintain integrity of the data. Should it be the application developer? Or the database software team?

if you're using partitioned (inherited) tables in Postgres, then it's the application developer, as foreign keys on inherited tables don't really work.

the workaround is triggers.

I also don’t see the benefit of schemaless. Schemes have been way too useful in my programming career.

But eventual consistency may be a viable trade off to help you scale, but the thing is you probably need to be at a massive scale before it is worth moving away from traditional SQL, which can have read replication and the app can have caching too. But there is probably some point at which eventually consistent nodes make sense.

In a sense that is what a CDN is, and we like to use those.

Are you reading the detailed reports prepared by the Jepsen audit team, or the press release? Jepsen audits are public.

The summary of the latest MongoDB report [1] follows,

"In this Jepsen report, we will verify that MongoDB 3.6.4’s sharded clusters offer comparable safety to non-sharded deployments. We’ll also discuss MongoDB’s new support for causal consistency (CC) in version 3.6.4 and 4.0.0-rc1, and show that sessions prevent anomalies so long as user stick to majority reads and writes. However, with MongoDB’s default consistency levels, CC sessions fail to provide the claimed invariants."

[1] http://jepsen.io/analyses/mongodb-3-6-4

> the product seems to have been re-architected a few times since I used it seven years ago

That's really not what I want to hear about a place I'd be storing data.

Was it just a prototype back then? (When did it stop being a prototype?) Would the creators be comfortable using version 1 still?

Actually, I think that could be a good test of database stability: would you be willing to run a 5 year old version of your own software?

Passing might not be the right way to look at it?

> This interpretation hinges on interpreting successful sub-majority writes as not necessarily successful: rather, a successful response is merely a suggestion that the write has probably occurred, or might later occur, or perhaps will occur, be visible to some clients, then un-occur, or perhaps nothing will happen whatsoever.

> We note that this remains MongoDB's default level of write safety.

- http://jepsen.io/analyses/mongodb-3-6-4 2018-10-23

In my previous startup we used mongo as the main datastore (this was pre Postgres 9.4, before indexable JSONB). It worked for us, and it worked fine. I saw the mob rising up against it, but we have not had any of the problems (maybe we just read the manual and had it configured correctly?). Today my go-to is postgres of course, but I had to wait until JSONB was out to make the switch.

> Since all our other services are running in AWS, the obvious choice was DynamoDB – Amazon’s NoSQL database offering. Unfortunately at the time Dynamo didn’t support encryption at rest.

Whoa, that was close. I really don't see why anyone would choose DynamoDB as a general purpose data store, unless they enjoy wasting countless hours finding ways around the limitations it imposes about how data should be stored and accessed. At least that was my (admittedly limited) experience. Postgres is a much better choice.

NoSQL like DynamoDB or Cassandra does make for good general purpose data stores, but you have to work within the concepts of NoSQL: typically, use 1 table with careful choices in partition and sort keys.

It's an epic shift in thinking to go from a schema of 50-60 tables to 1. Almost every dev I've worked with is very new to this.

The #1 sign a dev has no business using NoSQL: They chat about how flexible NoSQL is. Yeah, you can add attributes on the fly but I've found Dynamo to require much more careful planning then RDBMSes. Most devs I've met can understand when they need to use an index. But, almost all of them have issues predicting if their changes will lead to unbalanced requests against partitions.

Anyhow, since you can even connect the PostgreSQL WAL easily to a log stream like Kafka or Kinesis, I'm not sure why you'd ever start with a NoSQL DB, unless you just had master NoSQL data wonks.

It's not a great article tbh, it's well written but it shows the clear lack of knowledge running a backend. The title should be "we didn't know what we were doing so we switched to a managed DB"

I mean yeah who knew that blocking NTP therefore time drifting would break everything...

For those criticizing MongoDB, Fortnite generates $3B/year and runs on MongoDB, you should tell them it's a mistake and that they should use PG instead.

It was not because of MongoDB itself if you've read the article.

I don't usually bite for these "X uses Y, so Y must be good", but I didn't know about Fortnite and MongoDB. A quick google suggest they've had downtime due to issues with Mongo and have had problems scaling it though.

Their level of scale is .. amazingly higher up there than most companies using mongo I imagine, and they seem to be doing a good job at it now.

I thought the article was well written but I have to agree about the MongoDB use case by Epic. Mongo has its place and it is mature enough that it can handle itself in a production setting.

That being said they had a massive outage due to a MongoDB issue.

Yeah except not every use case is the same. Treating every use case as the same shows, let's see what was it, 'clear lack of knowledge running a backend'.

Having a successful product doesn't mean that all technical decisions that were involved in making that product were successful. It raises the bayesian estimate that they were, of course - but not to an absolute boolean value.

I'm sure mongodb is fine and powers a lot of high-profile sites and services. But I've never had a good experience running it or administering it.

But I'm also one of the the people that still prefers to run MySQL over PostreSQL just because the tooling is still far superior

Can you give an example of superior tooling or tooling that isn’t good in PostgreSQL compared to MySQL. Serious question. I’m curious cos I mostly use PostgreSQL.

It looks like Percona has partial support for PostgreSQL. It's not as detailed as MySQL.


My takeaways:

1. Stop. Trying. To. Build. Your. Own. Cloud. A pizza shop doesn't build their own cars to deliver pizzas.

2. There's no such thing as hassle-free anything, unless you are paying someone else to deal with the hassle. Sales teams lie.

3. Justifying an untested idea with "but it's modern technology" is going to backfire. Follow established patterns with good track records.

4. Writing your own in-house behemoth product, of which there are already many kinds available, results in long-term expensive engineering projects necessary to to get around the high costs you didn't know were coming.

5. Don't write business logic, or your primary software product, in a way that talks directly to a database. Just... No.

6. Magic new technologies that remove the problems of old technology also introduce the problems of new technology.

Some issues with your statements:

1. Well, they probably own the car though which is a better parable. You don't need to rent your car, you can simply purchase it just as you can purchase servers. People build their own garages.

4. So you should only used already written software? Sometimes it is just faster and better to write it yourself. You get less dependencies, you know how the whole thing works etc.

5. Why not? That is exactly what one should do.

Point 5. I'm not a software/application person... can you expound on how else you would interact with a DB if not directly?

I'm pretty sure he wants to say that you should abstract your database away so that your business code / domain model doesn't depend on it. It becomes a "plugin" to your application and you can easily switch it just by writing another implementation.

I couldn't disagree more, at least in the context of this article. If you have an abstraction layer so high level that app developers can't tell whether they're using MongoDB or PostgreSQL, then they're not able to use any of the advantages of either of those systems.

Sure, use an ORM to abstract away the differences between PostgreSQL and MySQL (up until you need to care about them). That's reasonable. But maintaining a magical MaybeSQL layer that's powerful enough to not totally suck is going to totally suck.

Why wouldn't you be able to take advantage of those systems? I think it's quite the opposite where you can take the advantages without even knowing about it / affecting other parts of the system.

Because of how different they are, taking advantage of them requires using them in specific ways that are also very different. If you have an abstraction layer that hides those differences, it's practically a given that it does so with the lowest common denominator approach, where you get all the flaws of both and none of their unique benefits.

(Alternatively, people find ways to use the abstraction layer such that it produces the desired usage pattern. Of course, then the code is no longer truly portable to another DB, because that same pattern is likely to be a perf issue there.)

I'm pretty sure you can have one interface for YeSQL an NoSQL.

* getUsers() * saveUser() * listUsers() * deleteUser()

and implementation can implement it in specific way to take advantage of chosen technology. If you have some esoteric use cases they could be handled in special way separate from business code.

For that specific use case, maybe. Where it falls apart is in `searchUser`, where the methods (and performance characteristics) of digging through the respective databases are going to be radically different. In a newspaper's implementation, you're going to have to search by date, subject, keyword, body text string, reporter, etc. etc. In MongoDB, that generally involves creating an index on the combination of fields you'll be searching together. In SQL, that generally looks like adding a `where reporter = "Jane Smith"` predicate. The MongoDB version may be faster if you have an enormous amount of data spread across a cluster. The PostgreSQL version will be more flexible when your boss wants to know how many reporters wrote stories about candied yams within three days of the publication of any story containing the word "Thanksgiving".

Being tasked to come up with an abstraction layer that supports the speed of precomputed, clustered indexes with the flexibility of SQL - if I were in a content creation business and not a database engine writing business - sounds like the kind of project that would make me quit my job and go on a one year silent meditation retreat.

That objection doesn't make sense. Queries are nothing more than a tree of predicates, how the backend end uses those predicates is not relevant to the API of specifying the predicates. Things like indexes whether in Mongo or in SQL are implementation details that can easily be hidden and not infect the API. You can interpret the tree of predicates into a SQL where clause or into a Mongo index search.

The OP is correct, your app can speak to an internal API without the underlying database infecting your domain code. That in no way implies you can't take advantage of the best of each database.

This is all rosy in theory. In practice, the way you write the query matters quite a bit. Often even between different SQL implementations.

And it's not just queries. Transactions often have important semantic differences that will be visible on application layer - again, even between different SQL implementations (e.g. MVCC vs locks).

> In practice, the way you write the query matters quite a bit.

Which is hidden in the query interpreter for said db implementation. Each implementation can break down that abstract query into whatever implementation specific query works best in that database.

There's always some abstract way to represent it that doesn't require vendor specific knowledge nor does it remove the ability to apply vendor specific abilities.

Look, I just don't agree with you, I agree with OP. Db specific stuff should be hidden from the domain layer by an abstract query representation and an abstract transaction representation to be plugged in at a later time.

Have you ever implemented an ORM? I did.

Stuff like "each implementation can break down that abstract query into whatever implementation specific query works best in that database" is wishful thinking. It's like saying that Java is faster than C++, in theory, because JIT can produce better code. And in theory, it can. In practice, we're not there yet. Same thing with high-level database abstractions - they're all either leaky in subtle ways, or they constrain you to extremely basic operations that can be automatically implemented efficiently on everything (but e.g. forget joins).

> Have you ever implemented an ORM? I did.

Several actually, which is why I know what I'm talking about; I've explored this area extensively. When Fowler first released PEAA I dug and went nuts and spent years coding up and exploring all the possible approaches and figuring out which ones I liked and why and which ones I didn't and why.

> Same thing with high-level database abstractions - they're all either leaky in subtle ways, or they constrain you to extremely basic operations that can be automatically implemented efficiently on everything (but e.g. forget joins).

If you're doing joins in your ORM, frankly, you're doing it wrong. Most ORM's do it wrong, they try and replace what a db does best; the right way to do it is to keep joins in the db. The role of an ORM when used properly is to map tables and views into objects and allow querying over those tables and views with an abstract query syntax. Joins belong in a view, not in code. It's called the object relational impedance mismatch for a reason, you have to draw a line in a reasonable place to get anything reasonable to work well and putting joins into the ORM is crossing that line and is why most ORM's utterly suck. Joins aren't queries, they're projections; put the queries in the code and the projections into the database, this works perfectly and lets each side do what it does best. Queries are easily abstracted, projections are not, projections don't belong in the ORM.

Any language with named tuples has a type system that is sufficiently expressive to handle joins without any sort of impedance mismatch. So, the only reason to avoid them is exactly the one that I cited earlier - the underlying implementation difference between databases.

> Any language with named tuples has a type system that is sufficiently expressive to handle joins without any sort of impedance mismatch

Incorrect. Named tuples will give you nothing back but a result set; the impedance mismatch refers to the mismatch between result sets and a domain model; getting tuples back doesn't remotely address this problem. I'd suggest you don't understand what the objection relational impedance mismatch problem actually is.

It's not a type system problem, it's fundamental mismatch between the relational paradigm and the object oriented paradigm. If a domain model has customers and addresses, and you do a relational query that joins the customer table and address table to return only the customer name and address city, the resulting set (name, city) doesn't map to the domain objects and isn't enough data for the domain model to load either of those objects which may contain various business rules. This is what the impedance mismatch refers to, relational projections of new result sets simply do not map to the OO way of doing things. Joins that create new projections are a relational concept that have no place in the object oriented world view: objects don't do joins, and object queries don't return differently shaped objects.

Hacks like partial loading of domain objects are attempts to mitigate the impedance mistmatch, but they do not solve it; they cannot solve what is a fundamental difference between two different ways of seeing data. Data is primary in the relational model and its shape can change on a per query basis, this is incompatible with the object oriented view of the world in which whole objects are primary and data is encapsulated and thus hidden.

The object relational impedance mismatch does not refer to a language problem, it refers to a difference in paradigm between OO and relational. It exists in all language regardless of the languages abilities and it's not a problem that can be solved, only mitigated, if you want to use both paradigms. You can solve the problem by avoiding using two paradigms, by either bringing the relational model into the application and not using OO or by using an object database.

I don't think there's a point in arguing this further. The other poster seems unwilling to understand that there are differences it databases within a category, let alone that there are multiple categories of databases with entirely different characteristics.

It's not a common denominator. If I need a Customer aggregate or a Payment aggregate it doesn't matter if I get it from SQL or Document database as long as I get that aggregate. My code doesn't care about query implementation.

FedEx needs both big, slow, high-volume trucks for moving stuff between cities, and small, nimble tricks for delivering to the doorstep. Someone decides that it’s inefficient to maintain two separate standards: any driver should be able to get into any available vehicle and have it Just Work for whatever job they have at hand, right? So they decide to make a single vehicle that can fulfill all roles.

Well, a vehicle that can squeeze down an alley won’t have the cargo room of a giant highway truck. The inter-city drivers will hate its poor capacity. Likewise, one that has a big 20-speed transmission for for hauling heavy loads is going to drive the city drivers nuts. They’re going to end up with one single interface to all possible roadways that everyone can come together and agree to hate.

If the database API is so free-form that you can store anything in it, you won’t get the advantages of PostgreSQL’s strict typing and lightning fast joins. If you make it so regimented that your data model ends up looking like a set of tables with foreign keys, then it won’t be able to make full use of MongoDB’s... whatever it does well.

They’re different animals. Choosing one highly affects the rest of your system design, from how you arrange your data to how you add new data to how you search for it. PostgreSQL and MongoDB have fundamentally different strengths and weaknesses, and if you make something that works equally well with both, it’s inherently going to suck equally on either.

It could be an application frontend interfacing with a webservice or network service. That component handles the database. That could make implementations clean for migration from on-prem to cloud, or if you're ripping out the frontend stack semi-regularly as more web-oriented devs may do more often. I can't say I disagree, and moving an application to this model myself so that I can easily migrate it from a WinForms frontend to Blazor SPA at some point.

RE 5: I'd argue that with a relational database (barring extreme scale cases) is the perfect place for your business logic. Constraints, unique indexes, etc, all ensure that your data is always consistent with itself and with the business logic you've encoded.

Re 5:

Adding an abstraction on top of DB query won't help much if you are moving from one type of DB to another, say SQL to document, or document to graph.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact