
Jsonb: Stories about performance - erthalion
http://erthalion.info/2017/12/21/advanced-json-benchmarks/
======
Thaxll
That post fails to mention something critical: Postgres doesn't scale and is
not HA out of the box, so you're comparing a single-server solution against
one that has clustering built in... What do you do when you need HA, or your
data is too big, or you need a lot of writes/sec, etc.?

The hate on MongoDB is strong on Hacker News / Reddit, but I suspect most of
you have never had to design a solution where master/slave is not enough.

Also: [https://www.mongodb.com/mongodb-3.4-passes-jepsen-test](https://www.mongodb.com/mongodb-3.4-passes-jepsen-test)

~~~
itsjustme2
I'm curious whether the parent poster knows about the most recent HA features
in Postgres, such as those introduced in v10, and if so, what is still missing
for comparable HA. Here's a link to the HA docs for out-of-the-box Postgres:
[https://www.postgresql.org/docs/10/static/high-availability.html](https://www.postgresql.org/docs/10/static/high-availability.html)

As for scaling, it certainly scales out of the box with partitioning, but
admittedly a partitioned table doesn't offer all the features that a
non-partitioned SQL table offers. A great assessment of the current state of
Postgres scaling can be found here:
[https://blog.timescale.com/scaling-partitioning-data-postgresql-10-explained-cd48a712a9a1](https://blog.timescale.com/scaling-partitioning-data-postgresql-10-explained-cd48a712a9a1)

------
zmmmmm
This is a great post, but it's sad that they addressed everything thoroughly
except the most important question for me: query performance. I don't do all
this mass updating of document-oriented entities. They are stored and they
stay stored. But we query the hell out of them, so I would love to know how
well that actually performs with various indexes on various data types within
the document.
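
For what it's worth, the usual starting point in Postgres is a GIN index on
the jsonb column for containment and key-existence queries, plus a btree
expression index on any hot scalar field. A minimal sketch with Npgsql,
assuming a hypothetical docs table with a jsonb body column:

    using System;
    using Npgsql;
    using NpgsqlTypes;

    class JsonbQueryDemo
    {
        static void Main()
        {
            using var conn = new NpgsqlConnection(
                "Host=localhost;Database=bench;Username=postgres");
            conn.Open();

            // A GIN index accelerates the jsonb containment (@>) and
            // key-existence (?) operators over the whole document.
            new NpgsqlCommand(
                "CREATE INDEX IF NOT EXISTS docs_body_gin " +
                "ON docs USING gin (body)", conn).ExecuteNonQuery();

            // A btree expression index covers one hot scalar field.
            new NpgsqlCommand(
                "CREATE INDEX IF NOT EXISTS docs_email_idx " +
                "ON docs ((body ->> 'email'))", conn).ExecuteNonQuery();

            // Containment query that can use the GIN index.
            using var cmd = new NpgsqlCommand(
                "SELECT count(*) FROM docs WHERE body @> @f", conn);
            cmd.Parameters.Add(new NpgsqlParameter("f", NpgsqlDbType.Jsonb)
            {
                Value = "{\"status\": \"active\"}"
            });
            Console.WriteLine(cmd.ExecuteScalar());
        }
    }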

~~~
nojvek
Should take a look at FlatBuffers. It's an evolution of protobuf that's super
optimized for querying subsets of data. Basically lookups of offsets and mmap
all the way.

------
pier25
Surprising that PG beat Mongo at its own game. We've recently been using
jsonb with PG as a nice-to-have, but always assumed Mongo would be much
faster with schemaless data.

~~~
gaius
Well, PG is written by people who know what they're doing.

~~~
Klathmon
Come on, can we keep this kind of stuff off HN?

If you have concrete criticisms, state them so we can have a discussion;
don't write snarky lines as if this is some kind of "us vs. them" game.

~~~
gaius
In an earlier comment yesterday I remarked that it was good to avoid platform
wars, but the MongoDB crew squandered the right to converse as peers with all
the lies they told. Remember when they compared their speed writing to memory
with Postgres's speed writing to disk, as if this made them faster? Either
they fundamentally misunderstood what a database is, or they deliberately
acted in bad faith. Which is it?

------
ris
> how to apply document-oriented approach in the world of relational databases

I've increasingly found that the situations where this is actually a good
idea are quite, quite rare.

~~~
HumanDrivenDev
I'll play devil's advocate.

In any real-world system I've worked on, there are always entities that are
purely self-contained but are represented by many different database tables.
When they're saved, the entity is decomposed into its constituent parts and
written to those tables. When they're retrieved, those tables are joined
together to re-create the entity we care about. And that's all that ever
happens: it's just serialised and de-serialised.

Isn't it premature optimisation to normalise that entity right out of the
gate? Why not stuff it in a JSONB column? When, and only when, you find
yourself needing to dig inside it for queries in other parts of the system,
then you normalise it.
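
Concretely, the "just serialise it" version can be this small (a sketch using
Npgsql and System.Text.Json; the Order entity and orders table are invented
for illustration):

    using System;
    using System.Text.Json;
    using Npgsql;
    using NpgsqlTypes;

    public record Order(Guid Id, string Customer, decimal Total);

    public static class OrderStore
    {
        // The whole aggregate goes into one jsonb column; normalise
        // later, if and when a real query need shows up.
        public static void Save(NpgsqlConnection conn, Order order)
        {
            using var cmd = new NpgsqlCommand(
                "INSERT INTO orders (id, payload) VALUES (@id, @payload) " +
                "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
                conn);
            cmd.Parameters.AddWithValue("id", order.Id);
            cmd.Parameters.Add(new NpgsqlParameter("payload", NpgsqlDbType.Jsonb)
            {
                Value = JsonSerializer.Serialize(order)
            });
            cmd.ExecuteNonQuery();
        }

        public static Order? Load(NpgsqlConnection conn, Guid id)
        {
            using var cmd = new NpgsqlCommand(
                "SELECT payload FROM orders WHERE id = @id", conn);
            cmd.Parameters.AddWithValue("id", id);
            var json = (string?)cmd.ExecuteScalar();
            return json is null ? null : JsonSerializer.Deserialize<Order>(json);
        }
    }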

~~~
clhodapp
One word: constraints. The problem with actually storing things in a document-
oriented manner is that it requires extreme vigilance and no mistakes to
actually keep the shape of your documents consistent. Further, it can be
extremely costly to discover all the shapes of data you actually have if you
initially didn't treat it as important. More often than not, it is useful to
use tools that guide you into being disciplined along the way.
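
To be fair to the jsonb route, Postgres does let you claw back some of that
discipline with CHECK constraints on the document shape. A sketch, assuming a
hypothetical docs table with a jsonb body column:

    using Npgsql;

    public static class DocConstraints
    {
        // One-time DDL: reject any document that isn't a JSON object or
        // that lacks a numeric "id" field. Existing rows are validated
        // when the constraint is added.
        public static void Apply(NpgsqlConnection conn)
        {
            using var cmd = new NpgsqlCommand(@"
                ALTER TABLE docs ADD CONSTRAINT body_shape CHECK (
                    jsonb_typeof(body) = 'object'
                    AND body ? 'id'
                    AND jsonb_typeof(body -> 'id') = 'number'
                )", conn);
            cmd.ExecuteNonQuery();
        }
    }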

~~~
scarface74
I use C# with Mongo. When I get a Collection<T>, the only thing I can put in
it is the strongly typed T, and my Linq queries are strongly typed. The
compiler enforces consistency.
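
Roughly what that looks like with the official driver (a minimal sketch; the
Customer class and the database/collection names are made up):

    using System;
    using MongoDB.Driver;

    public class Customer
    {
        public string Id { get; set; } = "";
        public string Name { get; set; } = "";
        public int Orders { get; set; }
    }

    class Demo
    {
        static void Main()
        {
            var client = new MongoClient("mongodb://localhost:27017");
            var customers = client.GetDatabase("shop")
                                  .GetCollection<Customer>("customers");

            // Only Customer instances can go in; the compiler rejects
            // anything else.
            customers.InsertOne(new Customer { Id = "c1", Name = "Ada", Orders = 3 });

            // Strongly typed filter: a typo in a property name is a
            // compile error, not a silently empty result.
            var busy = customers.Find(c => c.Orders > 1).ToList();
            Console.WriteLine(busy.Count);
        }
    }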

~~~
ris
Right, but can you _ensure_ that the only thing that ever connects to your
datastore (even to perform auxiliary/admin actions or quick data tweaks) is
using that code in strongly typed mode? Otherwise you can't be _sure_ what's
in your data.

~~~
scarface74
I can't be sure that a developer doesn't go behind the scenes and change SQL
Server either.

But programmatically, I do ensure that only one microservice (out of process)
or module (in process) is writing to a collection. When I'm using an RDBMS
and developing a system, I also ensure that only one module is writing to the
subset of tables that make up an aggregate root. I would never have a system
where a bunch of programs write to the same tables willy-nilly without going
through a common interface.

~~~
ris
> I can't be sure that a developer doesn't go behind the scenes and change
> SQL Server either.

You basically can. The difference is that it's quite obvious to a developer
of any level that changing the schema in the database is something that
should be done with some forethought and care. And when that change is made,
all applications accessing the database get the same new view of the schema,
so you're not going to have two different clients with different ideas about
the schema trying to operate on the database at once, because the schema is
global.

Making a change that suuuuubtly alters the way data is stored in an
unstructured object is the kind of thing that's really easy to overlook in a
pull request.

Your "only one module" rule is laudable, but you've also got to make sure that
module can do _everything_ you'd _ever_ want to do to the database, including
all the "one off" data mangling admin tasks anyone occasionally has to do.
Otherwise it will get bypassed. Case in point I used to use an ORM heavily,
but it wasn't possible to express everything in that ORM, so where that broke
down we had to resort to manual sql. Add a few developers to the equation and
you don't _know_ _exactly_ the format of your data.

Good schema management practice also means you end up defining all of your
schema-mutating operations as migrations, leaving you with a rather vital log
of how things have changed over time and a good pinch point to catch
inadvisable changes to the schema.
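
For instance, with a migrations framework like EF Core, each schema change is
ordinary reviewable code (a sketch; the table and column names are invented):

    using Microsoft.EntityFrameworkCore.Migrations;

    // This file lives in source control next to every earlier migration,
    // so the schema's history is a readable log and each change goes
    // through code review like anything else.
    public partial class AddCustomerEmail : Migration
    {
        protected override void Up(MigrationBuilder migrationBuilder) =>
            migrationBuilder.AddColumn<string>(
                name: "Email", table: "Customers", nullable: true);

        protected override void Down(MigrationBuilder migrationBuilder) =>
            migrationBuilder.DropColumn(name: "Email", table: "Customers");
    }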

~~~
scarface74
_You basically can. The difference is that it's quite obvious to a developer
of any level that changing the schema in the database is something that
should be done with some forethought and care._

And if they change the schema in Mongo: with the official C# driver, by
default, if any program tries to read the Collection<T> and the document in
the database doesn't match the C# class, you'll get a loud and obvious error
at runtime.

Of course, if you _want_ the flexibility of letting objects change in the
data store without affecting downstream consumers, you can decorate your
class with the [BsonIgnoreExtraElements] attribute.
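
That is, something like this (a sketch; the class is made up):

    using MongoDB.Bson.Serialization.Attributes;

    // Without this attribute, deserializing a document that contains
    // fields this class doesn't declare throws a FormatException at
    // runtime; with it, unknown fields are silently ignored on read.
    [BsonIgnoreExtraElements]
    public class Customer
    {
        public string Id { get; set; } = "";
        public string Name { get; set; } = "";
    }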

_And when that change is made, all applications accessing the database get
the same new view of the schema, so you're not going to have two different
clients with different ideas about the schema trying to operate on the
database at once, because the schema is global._

My contention is that you should not have two "clients" operate on the same
schema. All clients should be using one module. I'm not advocating that
microservices are the one true way. I'm saying that all clients should be
using the same module, microservice, or stored procedures (well, I hate
stored procs for that type of stuff, but that's another rant).

_Making a change that suuuuubtly alters the way data is stored in an
unstructured object is the kind of thing that's really easy to overlook in a
pull request._

There is nothing "unstructured" about a data class (POCO/POJO) in a strongly
typed language. Yes, you can operate on a Mongo collection in .NET by getting
a Collection<BsonDocument>, but most of the time you're going to get
something like a Collection<Customer>, and the compiler is not going to allow
you to insert, search, etc. anything but a C#-defined Customer with
C#-defined data types.

 _Your "only one module" rule is laudable, but you've also got to make sure
that module can do everything you'd ever want to do to the database,_

That's the beauty of .NET, Linq, and expression trees. You can define your
repository with a method like:

    Find(Expression<Func<Employee, bool>> query) { ... }

and the client can pass in any arbitrarily complex, strongly typed Linq
expression; your repository passes that expression to the Mongo driver, which
translates it into a Mongo query. It gives you even more flexibility, because
if later on you decide to use an RDBMS, those same client queries can be
translated to SQL by the Entity Framework provider.
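
Fleshed out slightly (a sketch; the Employee type and the repository shape
are illustrative):

    using System;
    using System.Collections.Generic;
    using System.Linq.Expressions;
    using MongoDB.Driver;

    public class Employee
    {
        public string Id { get; set; } = "";
        public string Name { get; set; } = "";
        public string Department { get; set; } = "";
    }

    public class EmployeeRepository
    {
        private readonly IMongoCollection<Employee> _employees;

        public EmployeeRepository(IMongoDatabase db) =>
            _employees = db.GetCollection<Employee>("employees");

        // Callers hand over a strongly typed predicate; the driver
        // translates the expression tree into a query that runs
        // server-side.
        public List<Employee> Find(Expression<Func<Employee, bool>> query) =>
            _employees.Find(query).ToList();
    }

    // Usage: repo.Find(e => e.Department == "R&D" && e.Name.StartsWith("A"));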

But even if you don't want to go as far as "one true module" for your data,
there is no reason not to have one module that defines your data types.
Anyone querying against or inserting into a Collection<T> will always have
the right document definition. Of course, you can put all of the definitions
into a NuGet package, and it will still be obvious when changes were made,
because your NuGet package will be versioned.

 _including all the "one off" data mangling admin tasks anyone occasionally
has to do._

A one-off admin task wouldn't be part of your production flow, and you would
probably bypass any abstraction that you had, including stored procedures,
views, etc.

_Otherwise it will get bypassed. Case in point: I used to use an ORM heavily,
but it wasn't possible to express everything in that ORM, so where it broke
down we had to resort to manual SQL. Add a few developers to the equation and
you don't know exactly the format of your data._

There should always be one team responsible for your business objects (in DDD
your aggregate roots). Why would you let any one developer change your
underlying data store without a code review and/or pull request?

_Good schema management practice also means you end up defining all of your
schema-mutating operations as migrations, leaving you with a rather vital log
of how things have changed over time and a good pinch point to catch
inadvisable changes to the schema._

If I'm using a strongly typed language like C# and defining my documents
based on a class, my Employee.cs file (and all of its child classes) is in
sync with my Mongo document store. The language ensures that at compile time.
"Knowing my schema changes" is simply a matter of doing a "git blame" on the
Employee.cs file.

Even if I'm not abstracting all of the data retrieval in a repository, I
don't see any reason not to at least have a NuGet package of all of my POCOs
that define the documents in my collection.

------
dominotw
> MongoDB has some problems with spinlocks, ... Interesting enough is that
> this situation is getting better between releases (but it's still quite
> significant), for example here is comparison of throughput for 3.2 and 3.4:

Is there an image missing or something?

~~~
erthalion
Hm... no, it's right there, below this phrase:
[https://imgur.com/a/vN0Q3](https://imgur.com/a/vN0Q3)

------
hermanradtke
A bummer that VMs were used. I have used packet.net to provision bare-metal
servers for benchmarking purposes. The server specs are close to those of the
servers TechEmpower uses as well.

------
larrykwg
There is absolutely no good reason for anyone to use MongoDB ever; it's just
such an amazingly rotten database that it's astounding how much hype it
generated.

~~~
013a
*Today. The thing you're forgetting is that Mongo was actually somewhat novel
at the time it was released. I've heard at least two stories of engineering
teams who felt that their specific product couldn't have scaled the way it
did, and that they wouldn't be here today, without Mongo, for whatever
reason. I've heard the words "it was the best engineering decision we ever
made" from one team whose company ended up selling for 8 figures.

You're being hyperbolic. Yes, there are so many better choices for databases
out there. There are also many worse choices. Mongo is very hard to recommend
nowadays, but unless you go in completely ignoring its problems, it's not
like your company is going to go up in flames.

~~~
cookiecaper
> I've heard at least two stories of engineering teams who felt that their
> specific product couldn't have scaled the way it did, and that they
> wouldn't be here today, without Mongo, for whatever reason. I've heard the
> words "it was the best engineering decision we ever made" from one team
> whose company ended up selling for 8 figures.

Do you expect to hear them say, "Yes, choosing MongoDB was a bad decision and
our production infrastructure is currently on an unwieldy, slow, dangerous
piece of crap" and then go on to sell the company for tens of millions of
dollars?

Do you expect people to admit the real reason they're using MongoDB, which, in
99% of cases, is "we don't know how to use SQL and are really perturbed that
anyone thinks we should have to learn"?

It is true that MongoDB only spontaneously combusts occasionally, but that
doesn't mean the choice is of negligible importance.

~~~
013a
These were private conversations, professional to professional, not
conversations in the context of their sale. Maybe there is confirmation bias,
but they could just as easily have said "Mongo was a piece of crap, we
replaced it, it took a lot of engineering hours."

All of the complaints I've seen against Mongo are from people who read
Aphyr's blog from their SQL High Horse and say "ha, look at those dirty Mongo
peasants, can't you see that one in every ten million reads is inconsistent
under high load? Why would anyone use this crap?" Use it in production, even
under significant load, and you realize that sure, you might hit a snag every
once in a while, but it's tolerable. It works fine. It's not trash. It's
adequate. Understand the problem you're trying to solve. Understand Mongo's
limitations.

~~~
gaius
_All of the complaints I've seen against Mongo are from people who read
Aphyr's blog from their SQL High Horse_

Nope, my dislike of MongoDB the technology and distrust of 10gen/Mongo the
company is 100% based on real-world experience.

By the way, one-in-ten-million means several times a day. And it was a LOT
more frequent than that...

------
faragon
TL;DR: Use PostgreSQL.

~~~
jack9
Tweak your tests till PostgreSQL looks like the best choice. FTFY

~~~
solidsnack9000
After reading this comment I went ahead and read the article. I am having
trouble finding a case where the tests seem biased in Postgres's favor, but I
don't know very much about MongoDB, so I could be missing something.

Large numbers of clients — really, any more than there are cores — are notably
not a Postgres strength.

~~~
jack9
It doesn't seem like you read it. It's spelled out as directly as it can be.

> I assume this one is most important questions for you now. Why PostgreSQL is
> underperforms so significantly? The answer is simple, we’re doing an unfair
> benchmark.

If PG doesn't perform well, it's unfair.

~~~
ibotty
> Why PostgreSQL is underperforms so significantly? The answer is simple,
> we’re doing an unfair benchmark. As you may notice, we were indexing only an
> ID in a document for MySQL and MongoDB, and an entire document for
> PostgreSQL.

That's hardly biased after the fix, and it was very much biased (against
Postgres) before it.

~~~
jack9
Three systems treated the same have different profiles. PG's profile was low
enough that the benchmarking had to be altered to get the results they were
expecting (or wanted the reader to expect). If it was a technical
implementation issue, why wasn't it done right the first time? Probably
hubris, in an obvious case of tweaking the testing methodology until you get
the result you wanted.

------
ris
A note to plot-drawers everywhere: please don't hide your actual data points
by drawing a Bezier curve over them.

~~~
zmmmmm
And PLEASE just label the damn axes. Why do people go to all the effort of
drawing nice plots and writing a giant blog post, but leave out the one
minute it would take to make the plots interpretable without reading giant
slabs of text to reverse-engineer what is actually in the plot?

/rant

