PostgreSQL displays an amazing and rare combination in _any_ software, much less a free (beer/libre) product: reliable, predictable performance and usability for standard workloads combined with an extensible, flexible platform for experimentation and development.
Rock-solid "classic" OLTP database? Check. Rich geospatial data platform? You betcha. Structured document store? That too. Pluggable storage engines and column stores? Why not!
Don't get me wrong: there are definitely workloads for which Postgres is a poor choice. Its replication features are far behind MySQL's, much less its pricey commercial competitors and the more reliable NoSQL options (Cassandra, HBase, etc.). There's also been little work that I know of* to optimize the storage engine for SSDs. (*- I'd love to learn that this is just my ignorance and that someone really is focused on large-scale SSD-backed Postgres deploys and tuning. Links/references appreciated.)
If you can work within those constraints, though, it's an awesome product.
Where I work we use Postgres a lot, and the performance (for our workload) can only be described as "brittle". We often see problems where a slight syntactic change to a query confuses the query planner and performance goes through the floor.
Getting around this often involves trying out a few semantically equivalent ways of phrasing a query until we find one that doesn't suck, then just remembering that for next time.
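For example (a made-up illustration, not one of our actual queries), an IN subquery and its EXISTS rewrite are semantically equivalent here, yet one of them sometimes gets a far better plan:

    -- phrasing 1: IN subquery
    SELECT * FROM orders
    WHERE customer_id IN (SELECT id FROM customers WHERE region = 'EU');

    -- phrasing 2: equivalent EXISTS form, which may be planned differently
    SELECT o.*
    FROM orders o
    WHERE EXISTS (
        SELECT 1 FROM customers c
        WHERE c.id = o.customer_id AND c.region = 'EU'
    );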
Another problem we've had is that after a while the query planner can suddenly go off the rails, making bad decisions based on accumulated stats. It's fun watching what is usually a two-second query run for multiple hours because the query planner has lost its mind.
Anywhere else I'd claim that this is down to incompetence, but we have two senior engineers with multiple decades of DB experience, and even they end up tearing their hair out over postgres performance.
Overall, I like postgres, a lot, and I'd choose it over MongoDB and other NoSQL solutions for most workloads, but its performance can be very hit-or-miss compared to other (proprietary) relational databases.
TLDR: postgres performance is usually pretty good, but in some cases it can be brittle, resulting in a lot of effort spent needling the query planner into doing the right thing - far more effort than would be needed with some other (unfortunately proprietary) databases.
BTW I wonder what you mean by "accumulated stats" ...
Regarding your senior engineers - maybe they're competent, maybe not.
But most importantly, experience from one database is often difficult to transfer to a different database. For example I often work with people coming from the Oracle world, and they often apply things that work on Oracle but fail on PostgreSQL, and they don't use the advanced cool stuff PostgreSQL offers.
The only response I have to that is "talk to developers on the mailing list" (either pgsql-performance or pgsql-hackers).
Maybe it's possible to improve the planning - maybe your queries are uncommon / difficult to estimate / running into a thinko in the planner. I don't know.
Yeah, IRC is a good place to ask too. The obvious problem is that not all developers hang there all the time, so for longer discussions the mailing lists are better (and you can just attach test cases, for example).
Where I work we have the same problem with MSSQL. Unfortunately the dataset I work with is not very consistent, leading to the main key not being very useful to the query planner. I'm sure Postgres would have similar problems on the dataset. Unfortunately I'll never get to try.
On the other hand, for my personal projects it's amazing. They'll never see the scale I see at work, but they do get millions of rows sometimes :)
The one big feature (imo) missing from postgres is query plan locking. I too have seen single-digit-millisecond queries suddenly get mangled by the planner and turn into multi-second queries (although in our case, somehow nightly vacuuming got turned off).
I'm not saying "plan locking" is not a useful feature in some cases, and maybe it would really solve this particular issue. But my experience is that when people say "I need it" in most cases it's "I've been working with X for a long time, and that's how you solve problems looking like this there." (where X is usually Oracle or DB2).
If you really need nightly vacuuming (and not just analyze), and you accidentally turn it off, you have bigger problems than planning.
This should be handled by autovacuum in modern versions. So either you're running a very old version (before 8.3, when autovacuum was enabled by default - i.e. older than ~5 years; sorry, not much can be done there except upgrading), or you don't have autovacuum tuned properly, or you really have a strange workload. In those cases you should really talk to people on pgsql-performance.
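For what it's worth, "tuned properly" often means per-table settings along these lines (the table name and the values are purely illustrative; the right numbers depend on table size and churn):

    -- make autovacuum/autoanalyze kick in sooner on a big, frequently updated table
    ALTER TABLE big_busy_table SET (
        autovacuum_vacuum_scale_factor  = 0.02,
        autovacuum_analyze_scale_factor = 0.01
    );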
This experience was 8 years ago; autovacuum is nice.
Having been bitten two or three times by a fast query suddenly getting very slow because a heuristic in the query planner flipped over (including once in the last couple of years), I'd simply like to pin a performant-enough query plan and not worry about it.
Just to make my opinion clear, pg along with linux and vim are the best pieces of software I use.
> There's also been little work that I know of* to optimize the storage engine for SSDs. (*- I'd love to learn that this is just my ignorance and that someone really is focused on large-scale SSD-backed Postgres deploys and tuning. Links/references appreciated.)
I'm by no means an expert in databases or SSDs, but one of the key points of SSDs is fast random access and overall speed.
Generally on SSDs you should set random_page_cost to 1.1; perhaps you could also adjust the other _cost settings to reflect the speed differences.
I can't think of many other things that would have an effect: TRIM support is handled by the filesystem and isn't too relevant, since Postgres doesn't update data in place (it's effectively copy-on-write), and block size is not specific to SSDs; it applies to HDDs as well.
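Concretely, the cost tuning mentioned above might look something like this (a sketch assuming a 9.4+ server where ALTER SYSTEM is available; on older versions you'd edit postgresql.conf instead, and the values are only illustrative):

    -- tell the planner random I/O is nearly as cheap as sequential I/O on SSDs
    ALTER SYSTEM SET random_page_cost = 1.1;
    -- SSDs also handle many concurrent requests well
    ALTER SYSTEM SET effective_io_concurrency = 200;
    SELECT pg_reload_conf();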
It doesn't have MySQL-like pluggable storage, and frankly I'm thankful for that. How many of the MySQL storage engines actually work, including transactions for queries working with tables using different storage engines and such? Or support all the features like fulltext, and so on? Whenever I want to get scared at night, I either watch the first "Alien" movie or read "MySQL restrictions and limitations". So no, thank you very much.
OTOH, the PostgreSQL code base is one of the cleanest and best-structured code bases I've seen, and adding a new storage engine is not all that complicated, assuming you know what you're doing and have the time to do it properly. If you just want something simple, managed outside the database but accessed via SQL, use a foreign data wrapper (FDW).
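For example, a minimal postgres_fdw setup looks roughly like this (the server, credentials and table definition are made up for illustration):

    CREATE EXTENSION postgres_fdw;
    CREATE SERVER other_db FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'otherhost', dbname 'otherdb');
    CREATE USER MAPPING FOR CURRENT_USER SERVER other_db
        OPTIONS (user 'app', password 'secret');
    CREATE FOREIGN TABLE remote_events (id int, payload jsonb)
        SERVER other_db;
    -- queried like any local table, but the rows live in the other database
    SELECT count(*) FROM remote_events;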
Re: storage engines, my apologies; those mean something very specific in the MySQL world and the more appropriate term for Postgres-land would be foreign data wrappers.
What's the purpose of having pluggable storage engines? From what I can see, InnoDB is the only engine to support basic features like transactions, foreign keys, MVCC, etc. I'm sure there are reasons to use the other storage engines, like MyISAM for archiving lots of data, but why not just use a different database for this? Is there a performance price for having this flexibility?
You effectively are using different databases (as far as, e.g., operational characteristics go), but with a common interface presented to client apps. The visible database engine serves as an abstraction layer, so the consuming app(s) don't need to be aware of the backend differences and are also somewhat insulated from needing to change if the storage engine used for various pieces changes.
Expression indexes are nice, and you can still do that. They do, however, require you to know in advance what you'll be looking for (and to create an index on that particular expression).
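For example, an expression index on one specific JSONB key might look like this (the index name and the 'author' key are hypothetical; the table and column names follow the example below), and only queries filtering on that exact expression can use it:

    CREATE INDEX author_expr_idx ON jsonb_table ((jsonb_column ->> 'author'));
    -- this query can use the index; a filter on any other key cannot
    SELECT * FROM jsonb_table WHERE jsonb_column ->> 'author' = 'alice';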
The stuff I find really awesome is indexing that covers arbitrary keys and values:
CREATE INDEX jsonb_index ON jsonb_table USING GIN (jsonb_column);
And then you can query on whatever key/value you want, and it'll use the index.
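For example, a containment query like the following (key and value made up) can use that GIN index:

    SELECT * FROM jsonb_table
    WHERE jsonb_column @> '{"author": "alice"}';

    -- the key-existence operator is indexable too
    SELECT * FROM jsonb_table
    WHERE jsonb_column ? 'tags';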
JSONB had an issue with compression, because compression is decided on the fly by checking whether the first part of the serialized data compresses well, and in the original format the first part was incompressible (thus no compression was used). But this was fixed in beta3, and so 9.4.0 works fine.
Regarding size, it's more complicated. For example after loading the 'delicious-1250k' dataset from
So, I love Postgres to death, but I'm a bit let-down by the json and jsonb stuff.
It's really great for document storage and retrieval, but the fact that there doesn't seem to be any story for partial updates kind of kills it for me. If it just had that, it'd cover every use case I could possibly want (except for maybe some of the Cassandra stuff).
It's annoying, though, because that's clearly something people often want; just look at Mongo.
PostgreSQL does not do in-place updates in the first place; every UPDATE effectively does a DELETE+INSERT (it writes a new row version). So a "partial update" does not make much sense, because you'll create a new copy of the row anyway.
MVCC means "Multiversion Concurrency Control" and yes, that's pretty much exactly how it works (creating copies of the rows, so that different snapshots see different rows).
You can write a function to do partial updates, and it wouldn't be any different than if it was in core. If you're using Amazon RDS they include PLV8 to make it even easier. :D
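A minimal sketch of such a helper in plain SQL, for a 9.4-era server (the function name, table and key are made up for illustration, and it ignores edge cases like nested paths or empty documents):

    -- hypothetical helper: overwrite top-level keys in 'target' with keys from 'patch'
    CREATE OR REPLACE FUNCTION jsonb_merge(target jsonb, patch jsonb)
    RETURNS jsonb LANGUAGE sql AS $$
        SELECT json_object_agg(key, value)::jsonb
        FROM (
            SELECT key, value FROM jsonb_each(target)
            WHERE key NOT IN (SELECT jsonb_object_keys(patch))
            UNION ALL
            SELECT key, value FROM jsonb_each(patch)
        ) merged(key, value);
    $$;

    -- usage: change one key without rewriting the rest of the document in the app
    UPDATE jsonb_table
       SET jsonb_column = jsonb_merge(jsonb_column, '{"status": "archived"}')
     WHERE jsonb_column @> '{"id": 42}';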
Here's someone building functions for a MongoDB API on top of Postgres 9.2. I'm glad the MongoDB API isn't in core, but they've laid the groundwork so you can easily build whatever you want.
The main problem of MongoDB is that they tried to make a general-purpose NoSQL database. The thing is, this is the problem space RDBMSs have been tackling from the beginning, with decades of research and problem-solving behind them. When you're using MongoDB you're essentially going back in evolution to the 60s-70s, when hierarchical databases were used[2], and you're running into problems that were already addressed in RDBMSs decades ago (for example locking vs. MVCC[3]).
The original NoSQL systems were created to solve a very specific type of data storage (originally in the terabytes+ range, data that you can process using highly parallelizable map/reduce operations), and of course you're trading something in return (most of the time, consistency). There's no silver-bullet solution.
The worst reason for picking MongoDB, though, is storing data as JSON. Many companies that do that (including mine) believe DBAs are just slowing them down, so by picking a schema-less database they can release new features faster. They don't realize that they will still need to enforce a schema in their applications, and that is actually much harder and can become very ugly. Otherwise the data will resemble a pig sty.
Personally I'm not too enthusiastic about JSON in PostgreSQL, but I guess it gives you flexibility over which parts of the data schema are enforced and which aren't. And there's always the option to migrate to a proper schema later.
JSONB indexing looks good performance-wise, but all the non-commercial DBs, including MonetDB, lack UPSERTs as built-in functionality, apart from MySQL.
If we could UPSERT data in JSONB based on a key, that would be awesome. The lack of UPSERTs is also the only reason for me not to use MonetDB; otherwise it's an awesome database for analytical queries.
I really wonder what you mean by UPSERTs, because the traditional meaning of UPSERT is "UPDATE if exists, INSERT otherwise", aka MERGE. And you're right, PostgreSQL is missing that, but hopefully we'll get something like it in PostgreSQL 9.5 at the end of 2015. There are ways to do it in PostgreSQL using writable CTEs, in some cases (sketched below).
But that's completely unrelated to JSONB, so maybe you mean something else?
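For reference, the writable-CTE workaround mentioned above looks roughly like this (table, column and values are made up, and it is not safe under concurrent writers without extra locking or retry logic):

    -- try the UPDATE first; only INSERT if no existing row was updated
    WITH upsert AS (
        UPDATE docs
           SET body = '{"status": "new"}'::jsonb
         WHERE id = 42
        RETURNING id
    )
    INSERT INTO docs (id, body)
    SELECT 42, '{"status": "new"}'::jsonb
    WHERE NOT EXISTS (SELECT 1 FROM upsert);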
I think that if you're using JSONB for data that can be easily stored in a proper tabular form (aka relation), you're doing it wrong. No doubt there will be people using it this way, though, and maybe there are cases where it really makes sense.
What JSONB makes much easier is storing data that are not a good fit for the relational paradigm. For example, if you don't know the exact structure (e.g. you don't know what keys to expect), you can't reasonably decompose it into a relational schema.
This is the case with mail headers, for example - they're highly variable, every client/MTA can define their own headers, and so on. You could decompose that into an EAV model (yuck), use tables with generic PARAM_1, ..., PARAM_N columns (yuck!), or you can just stuff it all into a JSONB column.
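A minimal sketch of that last option (the table and the headers are made up for illustration):

    -- one row per message; arbitrary headers all go into a single JSONB column
    CREATE TABLE messages (
        id      bigserial PRIMARY KEY,
        headers jsonb NOT NULL
    );
    CREATE INDEX messages_headers_idx ON messages USING GIN (headers);

    INSERT INTO messages (headers)
    VALUES ('{"From": "alice@example.com", "X-Custom-Header": "whatever"}');

    -- find messages with a particular header value, using the GIN index
    SELECT id FROM messages WHERE headers @> '{"From": "alice@example.com"}';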
Is there a guide somewhere for someone who has used MongoDB before but is new to Postgres, on how to map the first to the second? I would love to try some of this out, but the complexity is a little daunting.
SQL can look a bit complex at first, and advanced queries can get complex, but MongoDB is quite limited in what operations you can do with it, so the SQL queries that correspond to MongoDB operations are fairly simple.
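For example, a typical find() maps almost one-to-one onto a SELECT (the collection and field names here are just illustrative):

    -- MongoDB: db.users.find({age: {$gt: 30}}, {name: 1})
    SELECT name FROM users WHERE age > 30;

    -- MongoDB: db.users.count({city: "Berlin"})
    SELECT count(*) FROM users WHERE city = 'Berlin';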
Everything else is either used for database maintenance or DDL (Data Definition Language, defining how the data is stored) commands.
The DDL is the main reason some organizations consider MongoDB: they believe that by removing the schema (the definition of how data is stored) they will get more flexibility. The problem is that if you use MongoDB you still have a schema; it's just enforced by your application. And if you don't enforce the structure, you'll end up with quite a mess and very complex code.
Also, if you use Java (or a JVM-based language), there are many frameworks that are supposed to help with interacting with the database, each with their own strengths and weaknesses. There's one called JOOQ[1] that allows you to write SQL queries in Java in a way very similar to MongoDB.
I was in this exact situation two weeks ago. Because Mongo databases tend to have an implicit schema, I used variety.js (https://github.com/variety/variety) to first uncover it for a given collection. I then hand-created a SQL table that corresponds to said collection, using variety.js's output as a guide for what types to use.
Why? What mistake? I mean, many people won't use that, but for many it's very handy. And the features delivered with JSONB (e.g. the indexing) are way better than what was available with XML.
I don't think the XML type was all that widely used, and the implementation felt more like "looks interesting, let's develop something and people will use it".
With JSONB it's more like "people are already using JSON data type (text + validations) a lot, and want something better, and we do have hstore, and we're working on hstore 2.0, so let's use the ideas for json". And that's how JSONB was born.
My estimate is there are already more people using JSONB than XML (talking about the PostgreSQL data types), so the momentum is very different.