A one-size-fits-all database like PostgreSQL fits most use cases. Moving to a specialised solution is always an option later, if it turns out to be necessary. Starting with specialised solutions is a bad idea that limits flexibility needlessly.
I mean, MongoDB is schema-flexible; there are some cases where data is so dynamic it just makes sense to use something like it. But I do agree that PostgreSQL (or MySQL if someone's more familiar with that; use what's most effective for you and your team) is a good route to start with.
This sort of post was most useful in the days when the term 'NoSQL' was being thrown around as the silver bullet that would revolutionize one's business, and there was a significant knowledge gap between the domain experts -- who quite often weren't the promoters -- and everyone else. While it's still plausible that someone encounters this information for the first time right now, today the mystique around the term 'NoSQL' has become more muted, the marketing has become more factual, and a fair number of people understand the different data paradigms (KV, document, graph, timeseries) that the newer offerings provide.
As it stands, this post is a brief comparison sheet of AWS's datastore offerings, with bite-sized anecdotes that hint at a valuable use case. It doesn't make a particularly strong case for the necessities, justifications, and tradeoffs of each particular paradigm or concrete offering, but is instead a brief content marketing piece with just enough practicality to justify its existence and make its point.
What's weird is that we create a tech (say, NoSQL) to address the shortcomings of another tech (say, SQL). We then call it a win and give it this strange grace period, during which we don't honestly assess the shortcomings of the new tech.
Worse, we actually declare the good parts (say, ACID transactions) of the old tech unnecessary, and even dub their absence in the new tech a feature.
Finally, the grace period expires and everyone seems to get the memo at the same time. The response is then "This is nuts. What the hell are we doing?"
The old tech is then restored to its rightful place and it's on to the next tech.
SQL is just an abstraction layer. What makes relational databases "slow" is actually the overhead of implementing ACID and implicitly guaranteeing consistency and integrity at all times.
When Facebook loses a dozen status updates, that will piss off a few users. When a bank loses a single transaction, then depending on the value and importance of that transaction (e.g. maintaining a position in FX, with deals that can go into the billions) it can literally kill the bank.
Besides: relational databases (which I'd wager are still the backbone of > 90% of businesses) are definitely not slow. And very few companies have such insane requirements as Google or Facebook.
Banks do lose transactions and make plenty of other mistakes that result in sudden appearance or disappearance of money from people's accounts. They are also eventually consistent, somewhat strongly. And are the ones who can in fact drop ACID pretty much completely in favor of proper strong eventual consistency with CRDTs and stuff, with all that "consistency and integrity at all times".
What makes things slow with interactive transactions in distributed environments is the explicit tradeoff of latency and availability for consistency. It just can't be done in bounded time [1]. But that's not the reason traditional databases are slow: in most deployments they do not actually make this tradeoff, and they assume you are ok losing some consistency due to network problems (yeah, so much for those ACID guarantees; you really need distributed algorithms to guarantee consistency if you talk to a database over an asynchronous network, like ethernet).
They are slow because, for performance, SQL is just a bad, leaky abstraction, and arguing that it can be fast is like arguing for a sufficiently smart compiler [2] that can turn your high-level abstraction into the fastest possible opcode.
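To make the leaky-abstraction point concrete, here is a minimal sketch using Python's built-in sqlite3 (the table and index names are made up): the declarative query never changes, but the physical plan underneath it does, and that's where the performance lives.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")

    # Without an index the planner has no choice but a full table scan.
    print(con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
    ).fetchall())

    con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

    # Same SQL, completely different physical plan (an index search).
    print(con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
    ).fetchall())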
Having so many database choices is something of an embarrassment of riches. So many options, but also so much stress for the software architect who hopes to make the right choices. It's impossible to be familiar enough with each option to anticipate future issues that may arise because of a missing piece of functionality or some other strange quirk. Something made worse when you go down the proprietary route.
I prefer to stick with one single reference data store, be it MySQL, Postgres, SQL Server or whatever. If we need to add some kind of search optimised database or big data analysis database they should replicate from the reference data store and be treated as non-mission critical add-ons.
There are cases where that strategy doesn't work effectively (IMO). Consider a social graph or purchase history relationships in a graph database. You could restrict your engineering team to only make entries in a (relational) reference data store and then express those relationships in a graph DB for querying.
That is not likely to give you the same pace of development as a competitor who chooses to use a graph DB natively while your team goes through a labor-intensive translation or impedance-matching process for each change. If they're able to outpace you in development/innovation, you are more likely to lose in the market.
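For what that translation layer might look like in practice, here is a minimal sketch in Python (all table and variable names are made up): the relational store stays the source of truth, and a separate job projects relationship rows into whatever shape the graph side wants, which is exactly the extra moving part a team using a graph DB natively never has to maintain.

    import sqlite3
    from collections import defaultdict

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE follows (follower_id INT, followee_id INT)")
    con.executemany("INSERT INTO follows VALUES (?, ?)",
                    [(1, 2), (1, 3), (2, 3)])

    # Every schema change on the relational side means revisiting this projection.
    adjacency = defaultdict(list)
    for follower, followee in con.execute(
            "SELECT follower_id, followee_id FROM follows"):
        adjacency[follower].append(followee)

    # The adjacency structure would then be loaded into the graph DB
    # (or queried directly) as a read-only, non-mission-critical add-on.
    print(dict(adjacency))  # {1: [2, 3], 2: [3]}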
This is a possible but not very likely scenario. It depends on your data serving a single purpose and continuing to serve a single purpose indefinitely.
In my view, the reason why so many people are returning to relational DBMS for the primary data store is that data tends to live longer than you think and serve more than one purpose over time.
Choosing the "right tool for the job" is easier said than done if the job keeps changing.
> If they're able to outpace you in development/innovation, you are more likely to lose in the market.
These things always depend on business requirements. What a relational database offers, above all others, is data consistency: used properly, it's an incredibly useful safety net that protects you from errors that corrupt users' data. Transactions and especially a proper schema can catch a lot of bugs that could otherwise wreak havoc.
So, it's a trade-off: how bad, from a business perspective, would some data corruption be? If you're Tinder, mixing up profile information or deleting a match is probably not that bad. If you're dealing with other people's money, on the other hand, it could destroy your entire business.
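As a minimal sketch of that safety net (using Python's sqlite3 purely for illustration; the table and column names are made up), the schema rejects bad data outright and a failed transaction leaves nothing half-written:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    )""")
    with con:
        con.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")

    try:
        with con:  # one transaction: both updates commit, or neither does
            con.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
            con.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
    except sqlite3.IntegrityError:
        pass  # the CHECK constraint fired and the whole transfer rolled back

    print(con.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
    # [(100,), (0,)] -- no money invented or destroyed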
Do relational databases really give you consistency? They can do, if you are careful how you write your app. And many people don't use serialisable transactions because of the performance and retry implications.
Also, relational databases enforce their version of consistency. What if I only want consistency within one customer's data but not across all customers?
I could have two inserts for operations on two different customers fail because of a deadlock on page splits in a secondary index. I didn't need that consistency, but the database didn't know that.
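And opting into serialisable isolation has its own cost: the database may abort a transaction rather than let an anomaly through, so the application has to retry. A rough sketch of that pattern (assuming psycopg2 and a Postgres instance at a made-up DSN, with made-up table names):

    import psycopg2
    from psycopg2 import extensions

    def transfer_with_retry(dsn, attempts=5):
        conn = psycopg2.connect(dsn)
        conn.set_session(isolation_level=extensions.ISOLATION_LEVEL_SERIALIZABLE)
        for _ in range(attempts):
            try:
                with conn, conn.cursor() as cur:
                    cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
                    cur.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 2")
                return True
            except extensions.TransactionRollbackError:
                continue  # serialization failure or deadlock: safe to retry
        return False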
I mostly used MySQL. But for my app, I now use CouchDB. It has its limitations, but the automatic sync between devices is pure gold. I haven't written a single line of sync code and it just works.
It baffles me a bit that CouchDB isn't more popular and that people rather use MongoDB for NoSQL, which doesn't come with the same syncing capabilities.
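For anyone curious what "no sync code" means in practice: replication is a first-class HTTP API in CouchDB. A rough sketch (the URLs, database names, and credentials here are made up):

    import requests

    # One document in the _replicator database keeps two databases in sync.
    resp = requests.post(
        "http://admin:secret@localhost:5984/_replicator",
        json={
            "source": "http://admin:secret@localhost:5984/notes",
            "target": "https://user:pass@couch.example.com/notes",
            "continuous": True,   # keep syncing as new writes arrive
            "create_target": True,
        },
    )
    print(resp.status_code, resp.json())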
No one ever got fired for choosing Postgres, but I know plenty of people who were out on their ass when they built a client's application on MongoDB and, as it matured, it became too cumbersome for some of the more esoteric features the client needed.
I once undertook a contract at a place that opted for MongoDB based purely on 'speed' and then effectively used it to build a relational database once they realised they needed to apply different permissions to different parts of a document.
Some hilarious bugs included a "relationships" collection that stored a record's identifier and an array of all the other documents it was related to. It was intended to be used for "joins" (yeah, I know). As the system grew, some customers created tens of thousands of records, all of which belonged to them and so were represented in this collection. Eventually the maximum array size in MongoDB was hit, which failed silently, rendering all of that data worthless.
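Roughly what that anti-pattern looked like, as a hedged reconstruction (assuming pymongo and a local mongod; the collection and field names are made up): with an unacknowledged write concern, pushing onto an ever-growing array keeps "succeeding" from the application's point of view even after the server starts rejecting the writes.

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017")
    relationships = client.appdb.get_collection(
        "relationships", write_concern=WriteConcern(w=0)  # fire and forget
    )

    def link(record_id, related_id):
        # One document per record, with every related id crammed into one array.
        relationships.update_one(
            {"_id": record_id},
            {"$push": {"related": related_id}},
            upsert=True,
        )
        # No acknowledgement is requested, so an oversized document is never reported.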
I spent much of my time creating a roadmap for them to migrate to PostgreSQL.
I'm not blaming MongoDB here, it was just the wrong tool for that job. But some companies ate up the marketing material and thought it was the solution to all their problems.
We fell into the same pattern. I do not know why MongoDB was used for the project I am currently maintaining, but to me it seems a bad idea to use a relation-less database for inherently related data.
I think that the original thinking was along the lines of "well, we do not know what structure the data will have yet".
That contract took place in 2011. As I recall the Perl MongoDB driver had a setting called WriteConcernResult that used to default to false. That meant that it operated in 'fire and forget' mode and assumed that everything worked. It was one of the main reasons it was so fast! I think over the years the defaults have improved and favour safety over speed.
Like I said in another thread, as I remember Perl's MongoDB driver used to default WriteConcernResult to false, essentially making it assume that everything was successful. It's been years since I touched it so my memory is a little hazy.
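For comparison, the current drivers make that choice explicit. A small sketch with pymongo (connection strings are made up): w=0 reproduces the old fire-and-forget behaviour, while a majority write concern waits for acknowledgement and surfaces errors.

    from pymongo import MongoClient

    fire_and_forget = MongoClient("mongodb://localhost:27017", w=0)
    safer = MongoClient("mongodb://localhost:27017", w="majority", journal=True)

    # The same insert either returns immediately with no error reporting...
    fire_and_forget.appdb.events.insert_one({"type": "click"})
    # ...or raises if the write could not be acknowledged.
    safer.appdb.events.insert_one({"type": "click"})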
I've seen my fair share of MongoDB dumpster fire codebases as well and one thing that makes it so much worse is it's almost always MongoDB + noob developers who picked MongoDB because a blog post told them it's all cool and it's a part of a "hip stack".
That's the deadly combination.
If it was MongoDB + experienced developers who picked it out of careful consideration things would be different. But you never see that because experienced developers are smart enough to pick Postgres instead.
What's new is old again. Michael Stonebraker had a paper about this in 2007 [0]. Over that time a slew of products came into the market that claimed to do it all: JSON, SQL, ACID transactions, streaming, time-series, full-text search, batch processing, ETL, etc. Things that don't even fit in the same bucket. For a while companies were positioning Hadoop to do all of it.
Seems like through every technology hype cycle, the same reminders need to be written and re-circulated again.
> a slew of products came into the market that claimed to do it all. JSON, SQL, ACID transactions, Streaming, time-series, Full-text search, batch processing, ETL etc
Stonebraker’s own Postgres does all of that.
The article is just an infomercial for Amazon’s (expensive, proprietary) offerings.
I agree. But we are still quite a young industry, certainly as far as relatively mature tooling goes. Perhaps one day soon the technology cycle will be taught to programmers in the same way that MBAs learn about the business cycle.
Sadly it talks against "Relational databases" instead of "RDBMS implementations".
The relational model is incredibly powerful. Maybe graphs need something seriously specialized (and indexes, of course), but otherwise the relational model does fine.
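As a small illustration of how far it stretches (using Python's sqlite3 with a made-up edge table), an adjacency list plus a recursive CTE covers a lot of "graph" querying before a dedicated graph database is needed:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
    con.executemany("INSERT INTO edges VALUES (?, ?)",
                    [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")])

    # Everything reachable from 'a': a transitive closure from one vertex.
    reachable = con.execute("""
        WITH RECURSIVE reach(node) AS (
            SELECT 'a'
            UNION
            SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
        )
        SELECT node FROM reach ORDER BY node
    """).fetchall()
    print(reachable)  # [('a',), ('b',), ('c',), ('d',), ('e',)]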
I read this article I found on HN a few years ago, and for some reason it came to my mind just now.
The neat thing about the graph model of course is being more immediately expressive to the query-er, and that takes cognitive load away from that person.
I'll never say a bad thing about SQL or the relational model. I completely agree that it's superior in terms of flexibility, performance (almost always), modelling (usually), etc., but there are times when you have a weird project and it's just more convenient and easier to use Neo4j and a few Cypher queries, and then it's over.
In the case of the article I linked, there is also the benefit of having a convenient way to see how changes echo throughout the graph, something that is more involved to do with tables and joins and all that. I'm not a DBA, I've really only ever dabbled, but if I got dropped into that situation and DIDN'T have a graph model, I'd have no clue where to even start. Maybe I'd start trying to walk tables and build graphs out from each item type to each other item type. That'd be gross.
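For a rough idea of the convenience (assuming the neo4j Python driver and a running Neo4j instance; the labels, relationship types, and credentials are made up), a variable-length pattern does the "walk the tables and build graphs" work in one query:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    with driver.session() as session:
        result = session.run(
            """
            MATCH (changed:Item {id: $item_id})<-[:DEPENDS_ON*1..]-(affected)
            RETURN DISTINCT affected.id AS id
            """,
            item_id="widget-42",
        )
        # Everything the change to widget-42 echoes into, however many hops away.
        print([record["id"] for record in result])

    driver.close()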
Of course, that's probably not a problem scenario that comes up very often. Most of the times I've interacted with databases I've just been dealing with sanitizing params, getting HTTPS configurations sorted, just boring infosec stuff. That problem is probably very rare in the realm of "problems where graph seems to have an advantage".
Anyways, yeah. Relational, awesome. Graphs, sometimes convenient.
True, a graph is a relationship between vertices, a key/value store is a relationship between keys and values, etc. Relations (sets of ordered n-tuples) remain the theoretical way we reason about data.
I was expecting some sort of interesting discussion of different databases. What I got was an ad for Amazon's products. And I found it quite off-putting that the post seemed to go way out of its way to avoid even acknowledging databases not being sold as Amazon services.
High availability would be the most common case when you need to look somewhere else. And it's a serious distributed systems problem that requires pretty big architectural and cultural changes, you can't just optimize for it later.
Yep, but you can "buy" HA from any DB provider (Amazon RDS, compose, etc.) until you're big enough to dedicate resources to solve that particular problem.
My comment doesn't necessarily contradict the article. I only mean to point out that a general purpose "non-optimized" thing still fits a whole lot of companies out there, depending on their scale.
EDIT: Although I concede that yeah some DB tech makes it easier to configure HA systems, if you can live with the downsides.
Is there a matrix out there with the various needs that one might have of a DB/storage solution (consistency, throughput, sharding, replication, etc) with products associated with each of those sets of requirements? I feel like for a lot of what I do on a daily basis I do end up using the same old one-size-fits-all approaches, and as those scale we feel the pain -- but the product search and vetting process is tough and time consuming to even begin. Would be nice to be able to narrow options down to a smaller search space at least.
One thing that strikes me about this article is the examples of companies that move from one system to another (whatever that move entails) and achieve big performance wins.
We see long articles about this very often here. Sometimes it's moving from relational to non, sometimes it's moving back. But what really throws me for a loop is that so many of these sites are so mind-bogglingly, face-punchingly slow. All of Atlassian's web properties are stupidly slow. AirBNB (mentioned in this particular article) also--painfully slow. Github--slow. Reddit--somewhat slow almost always. Twitter--slow. Facebook is usually the exception.
I realize that sounds like maybe I'm just on a slow connection, but I'm not. I don't know where the time is being eaten up if it's not in the database. But I feel bad for all the people working on db performance when the end result is so bad.
I don't understand how websites whose only purpose in life is to serve text and images as fast as possible can be so slow and that the companies that make them can find this acceptable. No database technology in the world can make up for bad product decisions and companies that don't consider speed a feature.
Every use case listed in this article followed the same nominal pattern:
(1) Start with a relational database.
(2) Build product until you have a deep understanding of your market and their needs.
(3) Move to a specialized data store.
For established products, or well-researched use-cases, sure, pick a specialized data store. It will probably serve your needs better.
If you don't yet have product-market fit, use a relational database. It will give you far greater flexibility when you discover that "what we thought the market wanted" is different from "what actually made the right numbers go in the right direction".
Perfect example is that article about the Facebook messenger database migration earlier today. Messenger transitioned from an email-like system to an instant messenger.
> Ad hominem (Latin for "to the man" or "to the person"), short for argumentum ad hominem, is a fallacious argumentative strategy whereby genuine discussion of the topic at hand is avoided by instead attacking the character, motive, or other attribute of the person making the argument, or persons associated with the argument, rather than attacking the substance of the argument itself.
Not applicable as a fallacy here; questioning the style and the motive is reasonable for a tech blog post mixed with advertising.
I might have added "for me" at the very end, so it would not sound general and was only describing my personal vibe.
If this was a very interesting article about the importance of hydration and was full of references to Coca-Cola products, I would feel the same.