I think the NoSQL trend is something similar to what is going on with programming language trends (in industry, not in academia). Many commonly used languages were statically typed, such as C, C++, and Java. Then, dynamically typed languages became popular because they are easy to use, although they give up a lot of the safety and static analysis capability of the older languages. Let's call that the second phase; it is exemplified by Ruby and Python. Now, newer languages with type inference are coming out that combine the safety of the older languages with the ease of use of the dynamic languages -- Scala, Go, etc. Let's call that the third phase.
In the database world, we are still in the second phase, which is exemplified by databases such as Mongo, which have started from scratch and thrown out many of the good ideas embodied in older databases. I suppose that the new JSON capabilities of Postgres are an attempt to move database design into the third phase, which combines the ease of use of NoSQL with the many superior qualities of traditional relational databases.
There's definitely a lot of throwing the baby out with the bath water. ACID is hard on a distributed system. Let me rephrase that. Performant ACID is hard on a distributed system.
ACID and schemas are not actually coupled. You can have strict schemas without ACID and vice versa.
Relational stores without ACID are very hard to reason about: any associations created along with the primary entity may or may not be there at lookup time. Documents solve this by keeping the associated data embedded in a single record.
I agree with your prediction that we will have document schemas enforced in the DB.
> There's definitely a lot of throwing the baby out with the bath water. ACID is hard on a distributed system. Let me rephrase that. Performant ACID is hard on a distributed system.
> Relational stores without ACID are very hard to reason about. Any associations that are created with the primary entry may or may not be there at lookup. This is solved via documents.
This is only solved via documents for applications which don't need relational consistency or transactions, which is a minority of web applications.
Most applications of MongoDB that I've seen do not fall into that category; further, MongoDB is marketed as a generic solution, not as a very application-specific one.
As soon as you find your application does need transactional consistency between entries, you'll be up the creek making your own paddle, since unlike SQL databases, where ACID is at least supported, MongoDB just gives you:
> $isolated does not work with sharded clusters. [0]
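For reference, $isolated is the most you get, and it isn't a transaction. A rough mongo-shell sketch (the collection and fields are made up):

    // Block other writers from interleaving with this multi-document
    // update. There is no rollback on error, and per the quote above
    // it does not work on sharded clusters.
    db.accounts.update(
      { status: "active", $isolated: 1 },  // query plus the $isolated flag
      { $set: { flagged: true } },
      { multi: true }                      // touch every matching document
    );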
Well, somewhat. The C in ACID (consistency) is impossible without schemas, because they contain the data validation rules. Without schemas you'll have to do all data validation in the application layer. Which is probably what you want, given that you're going schemaless, but it's AID at the database layer, not ACID.
There are also additional considerations that weren't touched on in the article, and they're not functional but architectural.
Sometimes you have payloads that can't fit in a traditional RDBMS, and SQL/relational databases don't cluster/shard well.
NoSQL brings convenient clustering/sharding to your data. Anecdote: an old team chose Cassandra solely for its admin tools (which are really cool), even though we didn't actually need a NoSQL database; I lost that battle, but at least we all learned something.
I don't remember much automated testing being done in those early days of PHP, and I don't see how your theory would explain the current trend back towards typed languages.
Perl was one of the first hugely successful dynamic languages, though I will politely defer to Shell and TCL folks for a rigorous discussion of that claim. (:
Perl has been really serious about testing since early on, and, to my understanding, remains one of the most test-centric major languages out there.
On the other hand, you could use static typing to avoid writing a lot of tests. Maybe it's a bit of a preference thing; I don't mind either approach personally.
I think automated testing is more about integration and regression than code correctness. If you have a web app that connects to a Java backend, for example, static typing won't prevent issues. You need to test the API calls and the UI, and make sure the right requests are made, and the expected outcomes occur. This was done manually by QA people in the past, but it's obviously more scalable to automate it.
The third phase already exists. It's SQL engines, e.g. Hive, Presto, SparkSQL, SAP Vora, etc., on top of NoSQL, and something almost every big company is using today.
SparkSQL in particular can "run all 99 TPC-DS queries, which require many of the SQL:2003 features".
It may or may not be apt, but it just felt like I was dumping JSON objects wherever, with a loosely applied path of db.collection to keep things together.
I wish the OP had also drawn a comparison between Postgres and Cassandra (which he mentions early on).
Based on the schema he writes at the beginning, it looks as though he works mostly with time-series. My understanding is that Cassandra is particularly well-suited for these types of workloads (definitely more so than MongoDB?). If the write throughput is such that a single server can handle it, how would Postgres compare with Cassandra? What are the distinctions in read-latencies? Storage space?
So true. I often see people choose a database without considering the pros and cons; they often consider only what they as developers want to work with or try, and don't consider the nature of the data. Time-series data fits best in Cassandra and some other column-oriented DBs, but definitely not in MongoDB.
Time series data is actually a great fit for MongoDB.
You create a collection for each measurement, JSON document for each day and then preallocate the fields for each hour/minute/second. With batching you can update 100K documents a second pretty easily off one node. For example:
https://blog.serverdensity.com/using-mongodb-as-a-time-serie...
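The pattern looks roughly like this (a sketch using the Node driver; the metric and field names are made up, and the per-second level is omitted for brevity):

    // One document per metric per day, with every hour/minute slot
    // created up front so later updates happen in place.
    function preallocateDay(db, metric, day) {
      const values = {};
      for (let h = 0; h < 24; h++) {
        values[h] = {};
        for (let m = 0; m < 60; m++) values[h][m] = null;  // reserve the slot
      }
      return db.collection(metric).insertOne({ day: day, values: values });
    }

    // Recording a sample is then a single in-place field update:
    // db.collection('cpu_load').updateOne(
    //   { day: '2016-03-01' },
    //   { $set: { 'values.13.37': 0.42 } }  // hour 13, minute 37
    // );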
I read that more as what he normally uses for tests and demos since it can be really easily generated using native SQL functions, and not necessarily his day job. Not that it matters much, I guess, and I agree it'd be neat to see that compared to Cassandra.
That's pretty much it. No time series just yet, just an existing Mongo set that I'm trying to wrap my head around, and that means picking it up ASAP.
I've messed around with Cassandra, but not in much depth. I ran into the way it treats NULL, and that all searchable fields must be indexed, and put it on the backburner for later.
From my experience building and deploying production systems with MongoDB, the primary benefit is not needing a migration or ORM framework.
- Deployments get much simpler when migrations for additive changes are unnecessary. Hundreds of deployments a week!
- JSON queries are easy to pass around and decorate in application code. Hydration from a dictionary is straightforward. Combined, an ORM isn't really necessary.
> primary benefit is not needing a migration or ORM framework
You don't need an ORM, but they are handy. I'm pretty sure it wouldn't be an issue to load rows from an RDBMS, using a language like Python, as a dictionary, and just decorate the snot out of that dictionary to make it look like an object.
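The same idea in Node, since node-pg (mentioned elsewhere in the thread) already hands each row back as a plain object; the table and column names here are made up:

    const { Client } = require('pg');

    async function loadUsers() {
      const client = new Client();  // connection details come from env vars
      await client.connect();
      const { rows } = await client.query('SELECT id, name, email FROM users');
      await client.end();
      // Decorate each plain row object -- no ORM required.
      return rows.map(row => ({
        ...row,
        displayName() { return this.name + ' <' + this.email + '>'; },
      }));
    }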
Anyway, regarding migrations, I sort of agree. Migrations can be terrifying and slow, but at least you're allowed to make certain assumptions about your data afterwards. Assumptions like "this field will always be there".
With MongoDB, and any other NoSQL database I've used, you need to handle your "migrations" in the application code. So now you have a block of code that checks if a given attribute exists in your JSON document.
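Something like this tends to accumulate on every read path (a sketch; the fields and versions are invented):

    // Every reader has to tolerate documents written under older shapes.
    function normalizeUser(doc) {
      // v1 documents had a single `name`; v2 split it into first/last.
      if (doc.name && !doc.firstName) {
        const parts = doc.name.split(' ');
        doc.firstName = parts.shift();
        doc.lastName = parts.join(' ');
      }
      // v3 added `createdAt`; default it for anything older.
      if (!doc.createdAt) doc.createdAt = new Date(0);
      return doc;
    }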
It's not necessarily a problem, but it is something you need to consider. Personally I much prefer RDBMS, because I actually like SQL and the ability to make ad hoc queries easily. If JavaScript is advertised as the query language for a database I'm not terribly interested, unless the pay off is extremely high.
Is NoSQL really all that much different than relational? At the end of the day, isn't a collection pretty much like a table? And don't you have to tie two collections together with a common ID (i.e., a join)?
NoSQL is less rigid but your code pretty much has to be as rigid, right?
This is often what happens when Mongo is used, except now the task of joining and aggregating is pushed up to the application, instead of getting it for [mostly] free from the DB itself.
In the process of pushing it up to the app, you lose ACID guarantees. Without ACID guarantees, your ability to view/update a snapshot in time of all your data becomes a very difficult problem to solve. As such, it's hard to say the two methodologies are equivalent.
When you start putting relational data into a non-relational DB, you are going to have a bad time.
In addition to losing ACID guarantees, you lose single-query joins. They're performed in memory, sequentially, with all the incurred costs of a waterfall series of requests and whatever network communication is involved.
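Concretely, the application-level "join" ends up looking something like this (a sketch with made-up collection and field names):

    // Two round trips instead of one query, and nothing guarantees the
    // data didn't change between them.
    async function ordersWithCustomers(db) {
      const orders = await db.collection('orders')
        .find({ status: 'open' }).toArray();
      const ids = orders.map(o => o.customerId);
      const customers = await db.collection('customers')
        .find({ _id: { $in: ids } }).toArray();
      // Stitch the two result sets together in memory.
      const byId = new Map(customers.map(c => [String(c._id), c]));
      return orders.map(o => ({ ...o, customer: byId.get(String(o.customerId)) }));
    }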
The entire premise of NoSQL is based on two (perceived) shortcomings of SQL databases: its rigid structure limits flexibility and its transactional safety limits speed and scale. So it really depends on what you mean. No, NoSQL isn't that different from regular SQL in that it still provides an object-based data storage and retrieval facility. Yes, NoSQL is different from regular SQL because it does away with the most fundamental properties of SQL.
> isn't a collection pretty much like a table
No, it isn't. A table (or relation) has predefined columns, and each row adheres to the data type and constraints of each column. If you were to model a NoSQL "collection" as a table, you'd end up with a sparse matrix where, in the worst case, the number of columns approaches number_of_rows * number_of_distinct_attributes of the original collection.
> you have to tie two collections together with a common ID
Nothing in SQL requires you to join collections based on a common ID (though it is the most useful). But still, an SQL join has a predefined result structure regardless of the actual table contents or join condition. There is no predicting the data structure of a NoSQL join.
My experience says that the issue is not differences, but bad architectural decisions. Most applications out there are glorified CRUDs. They are perfectly serviced by SQL databases. NoSQL fits a smaller set of systems whose scope is mostly passing data around. PostgreSQL now supports JSON as a datatype and allows querying it. Works beautifully. No need for MongoDb at all.
My understanding, which is admittedly minimal, is that MongoDB provides some conveniences around partitioning and distributing your dataset that Postgres does not without additional work. Simply supporting JSON accommodates MongoDB's approach to schemas (not having them; er, pushing constraints to application code), but doesn't address the distribution concern.
How well does MongoDB support this need, and how hard is it to achieve equivalent results with Postgres? What are the drawbacks?
Side note: you can have constraints, `check`s, indexes and triggers on `json` and `jsonb` columns in Postgres. In fact, you can have everything other data types have. So "no schema" or "moving schema to app level" doesn't have to be the case, when using PG.
Exactly. With Postgres, you can add CHECK constraints to enforce specific keys, with any level of nesting desired. So an INSERT could reject a document for having missing fields, fields of improper type, etc. It all depends on how strict you want to be.
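A sketch of what that looks like through node-pg (the table and key names are invented):

    const { Client } = require('pg');

    async function createEventsTable() {
      const client = new Client();
      await client.connect();
      await client.query(`
        CREATE TABLE events (
          id  serial PRIMARY KEY,
          doc jsonb NOT NULL,
          -- the document must contain a user_id key...
          CONSTRAINT has_user_id CHECK (doc ? 'user_id'),
          -- ...and a created_at that casts cleanly to a timestamp
          CONSTRAINT valid_ts CHECK ((doc->>'created_at')::timestamptz IS NOT NULL)
        )`);
      // An INSERT with a missing or malformed field is now rejected
      // by the database, not by application code.
      await client.end();
    }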
Mongo's partitioning/distribution/replication does not work as well as it should. In particular, my team has seen several serious bugs pertaining to replication in release versions that were so bad, they should have blocked the release.
I will argue _against_ MongoDB all day, but there is most certainly a time and place for it. If you have a real firehose of data, don't care too much about accuracy, and really need good speed... MongoDB is not an awful choice.
How well does it stand up against other "NoSQL" solutions that compete in the same space? My impression has been that it tends to compete poorly with a number of other solutions on data integrity, and my intuition is that its appeal is correlated with the proliferation of JavaScript and the convenience of interoperating with it.
From my experience, the problem is not how does any NoSQL solution stand up vs. any other NoSQL solution. The question to ask is "What caveats affect me the least?" and then go with that NoSQL solution. In the case of MongoDB, the biggest issue (at least in the past) is that the caveats simply aren't known well, or at all!
Compare this to something like Elasticsearch, which is well-known (and well-documented) to have data loss in certain situations.
Is a relational database really all that much different from a file system? At the end of the day isn't a directory pretty much like a table? And you tie data together with their common id, or file name?
I think there is a difference, especially in practice. Relations are a first class citizen in relational databases. Integrity is enforced. Most types of operations can be bundled up in transactions.
That abstraction differs from non relational data stores and you have to design your application to fit it.
Well, the predecessor to DB2, IMS, is basically a database in a file system. Sure, IMS is hierarchical and not relational, but it is a full-on database with fields and indexes. In fact, Mongo (and other NoSQL solutions) seem to have a lot in common with IMS, including the lack of a proper schema and referential integrity.
Well, there are a few other massive differences: an RDBMS basically has its own version control built in (which is what allows ACID), and by containing the logic for joins, it can plan them based on statistics (which can't be done in an upper layer).
But you can still insert inconsistent data into the database with the CLI, or otherwise. And `default` values defined inside Mongoose are no good when filtering or searching documents.
There are plenty more problems with emulating a schema in app code (Mongoose or anything else), so it's not that trivial.
I'm building a Node.js app where I was thinking of using MongoDB but I'm reconsidering it after reading the comments here.
The database's only job is to keep persistence between server restarts. On server start I fetch a collection from the database and populate a variable with it, which I use for any direct access.
When a user updates something it will update the variable as well as updating the collection in the database.
Basically any database should do for this use case; a simple key-value store is probably best. I'd aim for keeping the technological complexity as low as possible.
What is the simplest tool to do the job?
Or maybe better yet: use something that's already (or will be) in your technology stack, to avoid having one more thing in it.
For me it's definitely MongoDB but I'm afraid of data loss etc. My example was simplified. In reality I want to update a sub-array within the collection, like this: collection['some_identifier']['items'][3] = 123. Overwriting the whole 'some_identifier' document seems wasteful. With MongoDB it's easy though to update nested sub-data.
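Something like the following, mirroring my pseudocode above (a rough sketch with the Node driver):

    db.collection('some_collection').updateOne(
      { _id: 'some_identifier' },
      { $set: { 'items.3': 123 } }  // only the element at index 3 is rewritten
    );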
Poor wording on my part. My point isn't about performance, it's about the API. Other than selecting documents from a single collection (.find), Mongo's queries are shockingly weird and inconsistent to me.
Compare the aggregation framework to what SQL has, for example:
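Roughly, the same grouped sum in both (a sketch; the collection, table, and field names are made up):

    // SQL: one short, familiar clause:
    //   SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id;
    //
    // The MongoDB aggregation pipeline for the same question:
    db.collection('orders').aggregate([
      { $group: { _id: '$customerId', total: { $sum: '$total' } } }
    ]).toArray();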
If you ever want to scale your app to more than one process (CPU core), or more than one server, you'll have a problem...
Using Redis could be a solution (if Redis can accommodate your read pattern), or you'll need a pub/sub system (like that of Redis) to propagate DB changes from one process to the others...
It can also read from and write to Redis using Foreign Data Wrappers. Pretty neat, if you want your DB to be the single place you store data, but need something faster for reads or caching.
If you want to do updates you want a traditional transactional database. If you really need massive scalability (which seems unlikely) you should figure out a way to store your data in a log-structured/append-only way, and then use Cassandra or Riak.
Thanks! In my real application I want to be able to update a sub-array of the document, like this: collection['some_identifier']['items'][3] = 123. Just setting the value for the key (e.g. overwriting the whole document) seems wasteful since I only want to update a portion of it. With MongoDB it's easy to set sub-data so I guess I'm looking for something where it's as easy but without the downsides.
We've been running node+postgres in production for a couple of years and it's been great so far. If you have advanced questions, #postgresql on freenode is always there for help, and the node-pg libraries are very mature.
It would probably work but in my real application it's a bit more complicated since I only want to update a sub-array of the document, like this: collection['some_identifier']['items'][3] = 123.
I guess using a plain file would mean I have to overwrite everything on each write, which seems wasteful.
Sure, it depends on your use case though. For infrequent updates I don't think it's wasteful to just overwrite the file, it seems less wasteful to me than running a whole database. But it all depends on your setup :)
Also, I think updating a sub-array triggers a rewrite of the whole document in MongoDB. In any case I would avoid MongoDB if you can; it seems Redis might be a better fit for you.
I note the smiley, but could you actually expand on that comment for someone who is only really familiar with relational databases?
I've recently become interested in using PouchDB (a CouchDB implementation) for syncing offline content between a user's browser and the server, where the content is based on a set of fixed documents. I believe that MongoDB can provide similar replication tricks. Once on the server, I would expect to extract the data from the documents into a relational warehouse database (probably PostgreSQL, as it's one of the databases I know best).
A document store _seems_ like the right tool for the job, but could you expand on why something like Mongo or Couch is a bad idea for situations where you're just storing/syncing documents?
Couch is fine, other NoSQL is fine. It's just Mongo I don't like:
- weak, if any, guarantees on data consistency and safety;
- poor API. If you need anything more complicated than `findOne`, there are at least 3 unrelated, incompatible syntaxes with major differences (see the sketch after this list);
- performance is not what the marketing team says it is. Even on Mongo's home JSON turf, Postgres is pretty comparable.
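The three syntaxes, asking roughly the same question of a made-up collection (mongo-shell sketch):

    // 1. The query language proper:
    db.orders.find({ status: 'open' });

    // 2. The aggregation pipeline:
    db.orders.aggregate([
      { $match: { status: 'open' } },
      { $group: { _id: null, n: { $sum: 1 } } }
    ]);

    // 3. Map/reduce, with inline JavaScript:
    db.orders.mapReduce(
      function () { emit(this.status, 1); },
      function (key, values) { return Array.sum(values); },
      { out: { inline: 1 } }
    );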
If you have data and queries that will benefit from document store, go for it. But don't choose only based on marketing.
There are many more rants on the web; I like this one:
Look up CouchDB's problems; several of its customers have given up on it (more than MongoDB's, it seems).
It's not a bad idea; MongoDB gets a lot of undeserved hate (some of it is deserved, but that's the minority). However, I'm not sure it would do syncing as you want automatically (CouchDB is probably closer out of the box to what you want).
I started with Mongo and used it for a little over 2 years. Then a new project used Postgres (2 more years). After that I got a chance to work with Couch.
Whenever I encounter Mongo nowadays, it's either simulating multi table joins on app level, or inventing locking with Redis.
Part of the issue is people storing relations in NoSQL just because, but I can't think of any use cases where Mongo wins over other NoSQL solutions. Perhaps you have some?
Image storage.* When working on any type of image-based data science, it's incredibly easy to manage, it's fast, and the metadata handling is incredibly convenient. The drivers are pretty consistent and readable across languages. Assuming that you know JavaScript, it's a pretty quick and easy setup to add a lightweight Node server with an image cropper on the front end, then set up a Python app to do your data processing once you're done munging. I set that up any time there is an image-related competition on Kaggle, or when setting up HAL-like robots on my Raspberry Pi to mess with my roommates: "I'm sorry Dave, I can't let you cook that".
* And all you "OMGZORS USE A FILE SYSTEM!" folks can keep your mouths shut. This is drastically faster and much more manageable for the use case.
Very few organizations in the world truly need to scale out their database. Their problem is usually one of bad design. For those that do need to scale out, NoSQL is not a magic solution either. The scaling-out solution for SQL databases these days is pretty much replication, sharding, and partitioning tables. Postgres does these very well.
> The scaling-out solution for SQL databases these days is pretty much replication, sharding, and partitioning tables. Postgres does these very well.
I tend to disagree that Postgres does partitioning well; partition table resolution is rather dumb, there is no hinting (e.g. there is no way to tell Postgres that each query maps to a unique table), and no caching of CHECK results.
> Very few organizations in the world truly need to scale out their database.
Every organization that has a database wants to scale out to at least a 2-node cluster, just to avoid having a single point of failure. Scaling is not always about performance; at first, it's more about availability.
Almost every major organisation will have a big data analytics program doing feature extraction, modelling, machine learning etc. Usually this starts with Hadoop/Spark with data in HDFS but then a point comes when you want this in a database.
PostgreSQL is awful at this, which is why almost no one uses it in this space. Apart from the lack of native drivers, it is simply too difficult to do basic clustering. And the fact that it isn't built into the product doesn't give you a lot of confidence that it is (a) going to work and (b) going to be supported.
And yes, for those that scale out, NoSQL is a magic solution. That's why they are so popular. I can deploy a 40-node Cassandra cluster in 30 mins and guarantee it works. Likewise, MongoDB replica sets are ridiculously easy.
Well, we ran a billion queries per day on a 1.4TB system I was in charge of a couple years ago. Now we're building a 500+TB reporting system and probably putting it on Postgres-XL. Seems to scale fine to me. ;)
Only loosely, insofar as bad performance can make scaling more important; however, scaling is not about performance. A super-fast DB could scale poorly, while one that scales well could perform worse than the fast one but scale better. Scaling is about growing across machine boundaries, so that adding more machines increases the ability to handle more load. That doesn't require good performance; it requires good architecture.
Agreed. Scaling out databases is never something you want to do, particularly for Postgres.
But it's something you're less likely to need if you structure your data in an RDBMS. If you know your SQL, then optimizations should be preferred before scaling out. Not just because you need to, but because they can improve user experience (server response latency).
1. In many cases, there is a ~30ms floor on operations if a server is non-local. That's true of virtually all services on AWS -- whether SQS, SNS, SES, RDS (MySQL/PostgreSQL), etc.
2. In many cases, rows are missing. A table has a million rows. Ten are deleted. It's not compacted. So it's not a simple size lookup. A lot of this depends on how the data is stored/indexed. That's something which is naively pretty slow, but easily optimized.
In Postgres, because it uses MVCC, every count requires verifying each table row in the data segment. This means a sequential scan over the contents of the table, to check that the rows exist and are visible to the current session.
Few operations have this particular limitation, and it's been a sticking point in the Postgres world for years. Unfortunately there's no easy way around it.
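If an estimate is acceptable, the common compromise is to read the planner's statistics instead of counting (a node-pg sketch; the table name is a placeholder):

    const { Client } = require('pg');

    async function approxCount(table) {
      const client = new Client();
      await client.connect();
      // reltuples is maintained by VACUUM/ANALYZE: an estimate, but
      // reading it is O(1) instead of a full sequential scan.
      const { rows } = await client.query(
        'SELECT reltuples::bigint AS approx FROM pg_class WHERE relname = $1',
        [table]
      );
      await client.end();
      return rows[0].approx;
    }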
> I tried to scale Postgres a few months ago but I found that it is quite tedious.
If your application is big enough that you really need to scale postgres, I assure you, you will have a whole lot of tedious problems - not just sharding.