1) it is a better technical fit for a very specific problem
2) there is already a legacy db in place
I have been voted down at a couple of startups that wanted to run a "MEAN" stack; invariably, all of those startups moved off MongoDB or shut down.
The only time I will advocate for anything other than Postgres is when Wordpress is involved. If the data model is simple enough then MySQL is more than up for the task, and it avoids an additional database dependency.
Thankfully all the ORMs that are worth using support both MySQL and Postgres, so using both is very doable.
### Useful Postgres (or SQL in general) tools/libraries:
- Bookshelf, an ORM: http://bookshelfjs.org/
- PostgREST, automated REST APIs: https://github.com/begriffs/postgrest
- Sqitch, git-style migration management: http://sqitch.org/
It isn't a popular opinion on HN, but I will still advocate for Oracle or SQL Server in terms of tooling, cluster scaling, server-side programming and DB drivers.
Then again, we work with customers for whom Oracle and SQL Server license costs aren't an issue.
If you take the issues of open-source and licensing out of the equation (both important issues in their own right, but not related to the point at hand) then Oracle and SQL Server are both ridiculously good.
I personally try to avoid them (due to the cost, and the lock-in) but they are astoundingly performant and featureful RDBMSs with a huge amount of support and documentation behind them.
I guess some took it personally, although I mentioned I do like Postgres.
I just happen to like the other ones even more, since I was lucky to be able to use them in a few projects.
2) "server side programming" - as you have the source code with Postgresql, and there are plugins for most major programming languages, I don't buy this.
3) If you had spent the same on EnterpriseDB or CitusDB, I can guarantee you would get similar polish and support. People don't.
Something else you might be interested in:
I have always been a Postgres fan, but now I find it very difficult to imagine a problem MongoDB would be better suited for.
Inspired by PostgREST, linked above, it automatically builds a GraphQL API by reflecting on the Postgres schema.
1 - http://docs.sequelizejs.com/en/latest/
1 - http://github.com/Vincit/objection.js
2 - http://knexjs.org
For instance, at my current startup, we employ at least 7 different databases (including PG). And, I don't say that to brag - each has a specific use for the problem at hand.
You have to consider the needs and trade offs of your specific project. And, if you can, try to isolate your storage behind some interface. Because often your needs will change.
(I say this with 20+ years of experience and over a dozen commercially successful products under my belt.)
Data has more value when combined, so a wise database choice also depends on what you already have. You can combine data from different systems (e.g. FDWs, which postgres has great support for), but something is usually lost along the way.
I think postgres is a good default, because it is good for a lot of things. But sure, if you make an informed decision otherwise, there's nothing wrong with that.
It doesn't matter whether you have your needs clearly defined, Postgresql is a jack of all trades and a master of many.
An abstracted interface for your data layer is a must-have.
Many startups and projects begin with a single db and grow into new dbs as the business requirements change.
Obviously if you know your data model well enough you can foresee a lot of these requirements and pick the right tool for the job.
However, you will usually need to pick a db to start with and hope that it will accommodate as many of those unknowns as possible.
I believe that Postgres is the best choice in that scenario, as it is extremely mature and has an incredible amount of flexibility.
Somehow that makes me feel better that at least, I'm on the right path.
The reason why Postgres is so popular here is because if you want to have an open source solution, Postgres is the best available. I'm sure there are better proprietary solutions, but even though it is free, PG is pretty darn close to them.
Why? Especially after point #1 and assuming the document-store was a good fit for the data model.
Many of the NoSQL databases essentially take us back to the '60s, before Codd came up with the relational model. These ideas tend to come back once in a while, but so far nothing is better than the relational model.
NoSQL still makes sense in many cases (generally when your specific use case does not need all the guarantees of ACID); in return you can get higher performance or horizontal scalability.
MongoDB aims to be a generic database but rides on the wave of NoSQL "coolness". It was created by people who had no experience with databases and are learning as they go. As they started adding things that are essential for a database, they realized it's not that simple.
Currently MongoDB is outperformed by Postgres. In fact, there is an application called ToroDB that provides MongoDB protocol compatibility on top of Postgres, and even that emulation outperforms Mongo. Mongo also doesn't scale well horizontally, so essentially you don't get any significant advantage.
For example, they started with mmap'ed memory regions to store the data, and did not initially use fsync() to make sure data was saved on disk. In a way it reminds me of MySQL several years ago. It's much better now than it was in the past, but it has a lot of warts left over from the past.
MongoDB: for when you don't need consistency, availability, or partition tolerance.
There are some really good NoSQL products out there. I seriously think RethinkDB is on par with Postgres. I've also used Cassandra and BerkeleyDB and they're both decent. But unless some core part of your business logic is pathological to implement in SQL (like Reddit's comment trees) you should go with Postgres.
my favorite quote:
"the relational model is rooted deep in set theory, and math is hard to fool"
it was in a post on how all the NoSQL stores end up having to reinvent HAVING, GROUP BY and JOIN clauses in their APIs, because that's what apps do with data
I hate hearing absolutist dogma like this. Some things are faster in Postgres, some things are faster in Mongo. We are migrating our analytics db from Mongo to PG, but we hit a wall because SELECT COUNT(DISTINCT x) performs abysmally in Postgres. Mongo's aggregation framework is an immature PITA, but it performs this little trick well enough that we're pretty much stuck keeping Mongo around, unless we want to use MSSQL or Oracle or something else with a better (and much more expensive) query planner.
We still love (and prefer) Postgres, but it is not a Pareto improvement. There are always tradeoffs, and this kind of fanboyism just speaks to inexperience.
Also MSSQL's query planner isn't better than Postgres', I work with both. Postgres does have its quirks though, especially with the MVCC row expiry.
Meh. Postgres' planner doesn't know how to generate a skip-scan/loose index scan for DISTINCT. You can write it yourself, but it's a bit painful:
If you have a low cardinality that can be a huge efficiency difference.
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(calc)), calc FROM price_history GROUP BY calc;
GroupAggregate (cost=10713.04..11381.06 rows=10 width=5) (actual time=1010.383..1073.263 rows=11 loops=1)
Group Key: calc
-> Sort (cost=10713.04..10935.68 rows=89056 width=5) (actual time=1010.321..1049.189 rows=89041 loops=1)
Sort Key: calc
Sort Method: external merge Disk: 1392kB
-> Seq Scan on price_history (cost=0.00..3391.56 rows=89056 width=5) (actual time=0.007..20.516 rows=89041 loops=1)
Planning time: 0.074 ms
Execution time: 1076.521 ms
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(calc)), calc FROM price_history GROUP BY calc;
GroupAggregate (cost=0.29..2804.82 rows=10 width=5) (actual time=0.117..47.381 rows=11 loops=1)
Group Key: calc
-> Index Only Scan using price_history_calc_idx on price_history (cost=0.29..2359.52 rows=89041 width=5) (actual time=0.054..18.579 rows=89041 loops=1)
Heap Fetches: 83
Planning time: 0.208 ms
Execution time: 47.416 ms
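For reference, the hand-written loose index scan mentioned above is usually done with a recursive CTE that jumps from one distinct value to the next via the index, instead of reading every row. A sketch against the same (assumed) price_history table and calc index from the plans above:

```sql
-- Emulated skip-scan: each recursion step fetches the next distinct
-- calc value via the index, so work is proportional to the number of
-- distinct values, not the number of rows.
WITH RECURSIVE distinct_calc AS (
    (SELECT calc FROM price_history ORDER BY calc LIMIT 1)
    UNION ALL
    SELECT (SELECT calc FROM price_history
            WHERE calc > d.calc
            ORDER BY calc
            LIMIT 1)
    FROM distinct_calc d
    WHERE d.calc IS NOT NULL
)
SELECT COUNT(calc) FROM distinct_calc;  -- COUNT ignores the final NULL row
```

With low cardinality (11 distinct values over ~89k rows, as above) this touches only a handful of index entries per distinct value.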
An alternative is PipelineDB, which is built on Postgres (and drop-in compatible, supposedly) and should provide an efficient implementation for those kinds of queries: https://www.pipelinedb.com/
I've never used it, though.
SELECT COUNT(*) FROM (SELECT DISTINCT x FROM table) AS temp;
SELECT COUNT(DISTINCT x)
This sounds exactly like what MySQL development looked like to me early in its rise to popularity.
IMHO, unless you have a situation such that Mongo's horizontal scaling is actually required, it's better to use Postgres even if you're doing document-store stuff.
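As a sketch of what that document-store usage looks like in Postgres (table and field names here are made up for illustration):

```sql
-- Documents live in a JSONB column; a GIN index makes containment
-- queries (the Mongo-style "find by fields" pattern) fast.
CREATE TABLE events (
    id  bigserial PRIMARY KEY,
    doc jsonb NOT NULL
);
CREATE INDEX events_doc_idx ON events USING GIN (doc);

INSERT INTO events (doc)
VALUES ('{"user": "alice", "action": "login", "ip": "10.0.0.1"}');

-- Find all documents containing these key/value pairs:
SELECT doc FROM events WHERE doc @> '{"user": "alice"}';
```

You get the schemaless parts where you want them, plus joins, transactions, and the rest of SQL around them.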
The main problem someone is thinking about may be solved by a document DB, but then they also have 1000 other problems they didn't think about that are a horrible fit for a document DB.
And due to the nature of databases, where data has greater value when combined, using a new database system for every application isn't a great option.
SQL is great for business data, and passable-to-good for everything else. So the only way to beat it is by being way better at some specific thing, and finding users where that specific thing is so important that it's OK if the data is on an island.
The only time I would recommend Mongo is for storing geospatial data, as it has several built-ins that make it much easier to work with. Even then I find it a lot more convenient to keep that data in Postgres and replicate it to Mongo.
However, if the write load is very high then Mongo is better suited as the initial store. I then replicate to other dbs.
You can of course improve Mongo's speed by using weaker write concern, but you can also disable WAL in Postgres too.
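For example, two standard ways to trade durability for write speed in Postgres (whether they're appropriate obviously depends on how much data loss you can tolerate):

```sql
-- Per session/transaction: commit without waiting for the WAL flush.
-- A crash can lose the most recent commits, but never corrupts data.
SET synchronous_commit = off;

-- Per table: skip WAL entirely. Much faster writes, but the table is
-- truncated after a crash and is not replicated.
CREATE UNLOGGED TABLE raw_events (payload jsonb);
```

That puts the comparison with Mongo's weaker write concerns on more equal footing.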
However I found scaling write load past a single server's limits more difficult using Postgres than MongoDB.
Fortunately that is part of what the OP addresses in his post.
The biggest selling point of NoSQL databases really is increased performance and horizontal scalability, thanks to removing some of the guarantees that an RDBMS provides. With MongoDB you get neither of those advantages.
Generally you can't scale Postgres easily unless you only do reads, but Postgres outperforms MongoDB anyway. You can get some marginal improvement by running multiple MongoDB nodes, but a single Postgres will still outperform them.
Also, with benchmarks like this it is often overlooked that in a relational database you can often store data in a smarter way and get an extra boost of performance.
For example, at my previous job we had 3 MongoDB instances that were used for mapping IP addresses to zip codes. Those databases were 12GB each. They ran on beefy AWS instances because they wanted to make sure all the data could fit in RAM.
I did a POC and put the same data in Postgres, and it came to just 600MB of data. All queries were sub-millisecond, on the smallest instance.
How come? Well, in Postgres I stored ranges of IPs using the ip4r extension, with a GiST index, which works on ranges. Mongo did not understand IP addresses, so what they essentially did was compute every possible IPv4 address, store each one as a 64-bit integer, and put an index on top, which was probably bigger than the data itself.
Their solution also won't scale with IPv6.
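A sketch of the range-based layout described above, assuming the ip4r extension is installed (table and column names are illustrative):

```sql
CREATE EXTENSION ip4r;

-- One row per IP range, instead of one row per address.
CREATE TABLE ip_to_zip (
    range ip4r NOT NULL,   -- e.g. '10.0.0.0/8' or '1.2.3.0-1.2.3.255'
    zip   text NOT NULL
);

-- GiST index supports fast containment lookups over the ranges.
CREATE INDEX ip_to_zip_range_idx ON ip_to_zip USING gist (range);

-- Which zip code covers this address? (>>= is ip4r's "contains" operator)
SELECT zip FROM ip_to_zip WHERE range >>= '93.184.216.34';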
Our framework only has a mysql adapter for now, but it should be pretty easy to add a postgres one.
Generally, if you can, you should consider supporting Postgres in your framework. It's a much saner and more robust database from an Ops point of view (replication doesn't fail as often) and more flexible from a Dev view (performance is generally more predictable, much wider feature set with document storage / PostGIS / etc., …).
I had never touched PostgreSQL before, nor any Linux performance tools, but I noticed that replacing certain buffer eviction locks with atomic implementations could drastically help this particular case. I emailed the list about it and Andres was someone who chimed in with helpful advice. I wrote up what I'd discovered in my deep dive here: http://tiny.cc/postgres-concurrency
Turns out Andres was already working on a "better atomics" patch to provide easier methods of using atomic operations within PostgreSQL's code base (my patch was a quick hack probably only valid on x86, if that). It's been useful in removing several performance bottlenecks and—two years in—it looks like it's still paying off.
I hope to have time to play around with it once I have new hardware (I refuse to do performance development on virtual machines).
But honestly, most remaining performance/scalability problems in pg are more algorithmic in nature. So micro-optimizations, and that's what I'd call TSX/HLE, aren't likely to be the biggest bottleneck.
However, if the data grows beyond what a single scaled-up machine can handle, take a look at druid (druid.io). It is a bit more involved to set up than influx, but was built from the very beginning to scale out horizontally. As a result, I can do realtime analysis (using grafana) of over 10 billion data points and perform aggregations over said data. It is an incredibly useful tool, and the newly released 0.9 looks even better.
It can also count Alibaba, eBay, Cisco, PayPal, Yahoo, and Netflix as users (amongst many others): http://druid.io/druid-powered.html
It is really impressive tech. Bonus points that some of the original founders of Druid from Metamarkets just founded a company to do enterprise support around it:
You also have to spend some time to tune the query cost settings to avoid sequential scans if you're only gonna work with a subset of the data. Another optimization could be implementing table inheritance so you have a table for every year. If you work with data sets for a specific year you would get a big performance boost with sequential scans.
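The per-year inheritance scheme mentioned above might look like this (table and column names are illustrative; on 9.x this is done with inheritance plus CHECK constraints, and constraint_exclusion lets the planner skip children that can't match):

```sql
CREATE TABLE measurements (
    ts    timestamptz NOT NULL,
    value double precision
);

-- One child table per year; the CHECK constraint tells the planner
-- which years each child can possibly contain.
CREATE TABLE measurements_2015 (
    CHECK (ts >= '2015-01-01' AND ts < '2016-01-01')
) INHERITS (measurements);

CREATE TABLE measurements_2016 (
    CHECK (ts >= '2016-01-01' AND ts < '2017-01-01')
) INHERITS (measurements);

-- With constraint_exclusion on (the default 'partition' setting),
-- this only scans measurements_2016:
SELECT avg(value) FROM measurements
WHERE ts >= '2016-01-01' AND ts < '2017-01-01';
```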
PostgreSQL's biggest weakness at the moment is aggregating data using several CPU workers/cores. This is coming in PostgreSQL 9.6.
Oh, and I run PostgreSQL on ZFS with LZ4 compression.
I'm generally a big advocate of ZFS, but I've heard that COW file systems (ZFS and btrfs) are generally not a good choice for a database workload.
How does it perform for you?
I have also tested ZFS with Microsoft SQL Server by exporting a ZVOL over iSCSI (FreeBSD) over 10G ethernet, but without compression, as it has no benefit on 4k blocks. Performance was similar to what you would get with the same drives striped on Windows Server 2012. The big win here is of course ZFS's data checksumming. Not sure about snapshots as backup, though; I need to figure out how to talk to the Windows SQL Writer Service so it can tell SQL Server to flush and lock, letting me take a consistent snapshot. Microsoft really needs to improve their documentation, because this would be really helpful for several enterprises when it comes to backup speed.
Curious since I'm currently researching how PostgreSQL could do better in this space :)
There doesn't seem to be any silver bullet yet. And it is also hard to even see how relational database compares to the existing solutions, since most people dismiss it immediately.
Unless PG has some timeseries-specific extensions, I had assumed it wouldn't be as appropriate as a TS-specific database. Also curious to try Riak TS.
Disclaimer: I work for Basho, makers of Riak TS.
In partnership with IBM we researched PostgreSQL scalability on modern Power8 servers.
Also, the chart indicates that the benchmarks shown were run on Xeon chips.
The final straw came when I stood up a MySQL 5.6 instance to use as a data warehouse for about 5TB of data (15 billion rows). To my horror, after spending a few weeks on this project, I discovered that MySQL only supported a small subset of the SQL language. Most of the SQL features needed for data warehouse queries were simply not there (no WITH clause, no ROW_NUMBER(), only one sequence per table, no schema support, and many others that SQL report ninjas love to use).
With that, I had to go back to management and convince them to put more time into the project so we could try again with Postgresql. There was a large learning curve to go to Postgres. It was just different in many areas, but at the same time it reminded me a lot of IBM DB2 (both were products of the 1980s, if you look at the history). Postgres fit into our plans nicely; it actually had a full-blown intelligent query parser and optimizer, unlike MySQL. From what I learned, MySQL's purpose was to stop people from using text files as storage in the early 2000s. But once you graduate from that, it's onward to Postgres.
Well shit, SQLite has that market sewn up these days.
Did you not research this beforehand?
No there hasn't:
Which makes sense since they aren't really playing in the same space. The benefits of PostgreSQL are largely lost on typical MySQL use cases (light load, simple CRUD access patterns, limited use of JSON/BSON etc).
What benefits do you see that MySQL has over PostgreSQL? What defines its niche?
You'll see the same phenomenon on Slashdot.
MySQL is popular among a subset of programmers: web developers. In corporations, Microsoft SQL and Oracle are more popular. Further, MySQL is popular among a subset of web developers: those who use PHP. Among web developers who use Python, Postgres seems more popular.
My suspicion is that MySQL's popularity is tied to those web-hosting plans (1 GB of storage! 1 TB of bandwidth! 10 databases! $10/month!). They were almost always MySQL databases. Web hosts that offered Postgres databases were few and far between. This article corroborates my theory: http://rodner.blogspot.com/2008/01/what-makes-mysql-so-popul.... MySQL's company (which at the time was "MySQL" not Oracle) pushed it in at a critical time in the development of the web.
I have written applications that use: Postgres, MySQL, SQLite, Microsoft SQL, and Oracle. My favorite by far is Postgres. The strangest thing to me is not that MySQL is more popular than Postgres, but that anyone uses Microsoft or Oracle at all. Not only do they cost a lot, but from a purely technical standpoint they are worse.
Based on what? Both commercial databases have incredible features, tooling and extensions that leave postgres behind. Postgres today doesn't even have a solid scale-up or scale-out strategy.
It does have nice SQL support and makes developers' lives a little easier, but this isn't anywhere close to making it the absolute winner technically.
MS SQL, Oracle and pretty much all the commercial databases are much farther along in scaling up and making the most of a single machine.
Could you elaborate a bit more on why you view Microsoft SQL Server as worse?
If you use MSSQL all the time, I have no doubt that you get along fine. You can get used to its ways, and most applications don't need features that aren't shared by almost all databases.
I didn't mean that if you use MSSQL then the app will always noticeably work worse than if you based it on Postgres. It's just that for someone who switches back and forth, MSSQL makes my life less enjoyable, provides no benefit, and costs a lot more money.
A lot of decisions are made holistically instead of a "purely technical standpoint". In large companies where there are still IT departments (e.g. use of Active Directory), MS is still very much the de facto platform, and SQL Server is the "logical" choice.
However, the trend seems to be from MySQL to MariaDB. Red Hat and many other Linux distros, for example, have moved to MariaDB. It doesn't feature Larry Ellison. So why not address it?
Slapdash web developers.
Most web developers I know now consider postgres the base standard.
Also, MySQL is run by Oracle... and Oracle is a horrible company run by horrible people that behave horribly, and should never ever be trusted in any form. Any developer relying on Oracle code should not be trusted ever again, and Oracle consultants are the vilest form of life on earth.
Did I mention I don't like Oracle?
Say what you like about oracle, they have been a great steward of MySQL.
I'm not so sure everyone will agree, given the reports of the regression tests and source history going closed source (I know these stories are old).
So yeah, Oracle, evil to the core.
IME there isn't really a compelling case to use MySQL over Postgres, unless there's a specific environmental constraint.
When a new version of MySQL comes out it's big news, but it seems like lots of the little day-to-day as it gets developed updates in Postgres get posted.
- a couple of weeks ago we released a document store (new protocol + connectors) based on protobufs
- last year 5.7 was released. Many new features, I have a list here: http://www.thecompletelistoffeatures.com
Surprised me as well.
One thing that's always bemused me is that YC doesn't insist on its companies using Lisp :-)
I'm trying to get a handle on the different databases, and VoltDB sounds exciting, but everyone's talking about PostgreSQL. Then there's Mnesia which I hear is, as all things Erlang, excellent, though it's kinda tied to Erlang.
I know it's hard to say what's best, but what would you say is the best DB for a completely new multilingual project that needs throughput but prioritizes low latency, for example?
Also, VoltDB is licensed under AGPL. Does this mean that it can't be used in commercial projects? Or is it OK as long as the other components are on different servers or similar?
VoltDB, as an example, is very different: it seems to be designed for simple OLTP workloads. It's an in-memory database, which offers opportunities for impressive performance, but if you have a large amount of data, you'll need a large amount of memory. And horizontal scaling is cool, but cross-partition operations will incur significant overhead.
(IANAL, but the AGPL requires network users of software be able to download the source. Since a presumably proprietary application is the client in this case, this isn't likely to be an issue.)
If you don't know the difference, you probably want Postgres.
VoltDB is a specialty database for things like high frequency trading. It wouldn't make sense to use for, say, a consumer app or web startup.
I'd consider something else if I'm really, really sure that it's better suited to whatever niche problem than Postgres, but it'd take a lot of thinking and convincing myself.
This isn't necessarily a bad thing; it's a very easy model to reason about, it rarely has an issue if your cluster doesn't span more than one data center, and it lets you know when it happens via events you can subscribe to. But doing anything different, up to and including healing the partition automatically, is left to the user.
There's also a few warts due to its history; disc only tables have a rather small max size (so if you're not expecting everything to be stored in RAM as well, you don't want to use Mnesia), indexes are shockingly inefficient for writes, and a few other odds and ends that I don't really remember. For persisted, but fast, storage in Erlang, where partitions aren't common, it's great, but outside of those sorts of use cases there's probably something better.
11.5 How much data can be stored in Mnesia?
Dets uses 32 bit integers for file offsets, so the largest possible mnesia table (for now) is 4Gb.
Ehhhhhhhh, kinda, but not really. A couple of things: 
* If you use a disc_only_copies table, your max table size is ~2GB, because that's apparently the largest file that DETS will work with. What's more, if your table grows to ~2GB, future writes (to that table) will silently fail until the table size is reduced.(!!!) You can kinda get around this limitation by sharding your table, but the hash used to determine which key goes to which shard isn't very balanced... sometimes you luck out and your shards are pretty much all the same size; other times you end up with a few outsized shards.
* However! Everyone on the erlang-questions mailing list says that disc_copies and ram_copies tables are _NOT_ subject to this limit, so they can grow arbitrarily large.
* And, the general consensus seems to be that you really don't want to be using Mnesia in disc_only_copies mode... if you have too much data to fit into RAM, you should probably consider using something else to store your data. This isn't to say that using Mnesia in disc_only mode is bad or hazardous, but it's substantially slower than disc_copies or ram_copies, and you run the risk of bumping up against the DETS file size limitation.
 All observations valid for Erlang 17->18 only. Check Erlang release notes to see if major changes have occurred.
 Mnesia handles sharded tables really well, even if the documentation for the feature isn't the best.
 I use it in disc_only mode for one of my projects... it's how I learned about that mode's limitations. :p
Do they mean 4GB? Surely 0.5GB/4Gb is a bit small even for 32bit?
"No one" measures sizes that aren't network throughput numbers in bits. "Everyone" uses bytes. :)
And I mean -honestly- if you were shooting for the Pedant badge, you should have also quibbled about GB vs GiB. ;)
Given that we're in code freeze anyway, I've not spent a lot of thought on that yet; but I suspect that rearchitecting things so the lines are dirtied fewer times is the better fix, with less potential for regressions at lower client counts.
I doubt it. Allowing the compiler to generate accesses with lea et al. is quite beneficial; and that'd likely be gone by making this not be a compile time constant.
It'd also end up being a tuning knob very very few knew how to tune...
> Given that cache line sizes vary that seems like it might be a prudent choice.
They usually only vary between architectures. Netburst IIRC was the last time x86 cache line sizes varied. If there's any doubt it's usually ok to just use the higher (128 byte) line size, the "unused" padding cache-line will never be accessed and thus not occupy cache space.
Compatible up to: 3.4.2
Last Updated: 2 years ago
Active Installs: 400+
It's also probably easier to add indexes or rewrite the logic. Mostly it's bad database code that is slow on MySQL and would likely be slow in the same way on PostgreSQL.
Besides that, if you use an object cache like memcached or Redis you can avoid a lot of database accesses, and InnoDB seems to be able to deal with concurrent access to tables like wp_comments.
It would be cool to see this in WordPress itself, but I doubt it will happen.
Sounds like a situation I'm facing: assisting a non-profit org that wants a new website. Some members push for using Wordpress as the CMS. However the org has an established pgsql db containing lots of data that optimally could be employed on the site, e.g., lists of members/events, searchable documents, etc.
Also, there's a separate web app for accessing the database using SQL features mysql doesn't support. Lack of Wordpress/pgsql compatibility makes integrating the existing db and folding in the webapp functionality just about impossible, at least I'm not seeing how that could be done.
Since Wordpress isn't going to embrace pgsql, I've recommended considering alternative CMSs. Decisions are still pending.
You can also do some pretty nice data syncing/access between the two using Postgres' Foreign Data Wrappers to access MySQL data within Postgres - unfortunately, it doesn't look like MySQL has an equivalent for accessing Postgres data from within MySQL.
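As a sketch, the MySQL-to-Postgres direction with the mysql_fdw extension (the server name, credentials, and database names here are made-up placeholders):

```sql
CREATE EXTENSION mysql_fdw;

-- Point Postgres at the MySQL instance holding the WordPress data.
CREATE SERVER wp_mysql
    FOREIGN DATA WRAPPER mysql_fdw
    OPTIONS (host '127.0.0.1', port '3306');

CREATE USER MAPPING FOR CURRENT_USER SERVER wp_mysql
    OPTIONS (username 'wp', password 'secret');

-- Expose the WordPress posts table inside Postgres.
CREATE FOREIGN TABLE wp_posts (
    ID          bigint,
    post_title  text,
    post_status text
) SERVER wp_mysql OPTIONS (dbname 'wordpress', table_name 'wp_posts');

-- Now it can be queried (and joined against native Postgres tables):
SELECT post_title FROM wp_posts WHERE post_status = 'publish';
```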
How do these changes affect more heterogeneous workflows, of mixed reads and writes? A little better? A lot better? A little worse?
T<what?> Per Second?