I love PG, and don't plan to ever move away from it.
But I'm starting to wish they had in-place upgrades. The application I develop is getting to the point of being "too large to make a copy of the whole db whenever I want to upgrade Postgres".
I can postpone things for a little while by replacing all our large objects with external file storage -- but the tables themselves are growing quickly.
You can run pg_upgrade with the --link option to have it hardlink most of the tables into the new cluster's location. It'll still re-write some things, but I believe that most of the actual table data remains unaltered.
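For reference, the invocation looks roughly like this (the paths are made up for illustration, and both clusters need to be shut down first):

    pg_upgrade \
      --old-bindir=/usr/lib/postgresql/9.4/bin \
      --new-bindir=/usr/lib/postgresql/9.5/bin \
      --old-datadir=/var/lib/postgresql/9.4/main \
      --new-datadir=/var/lib/postgresql/9.5/main \
      --link

With --link the data files are hardlinked instead of copied, so you don't need a second copy's worth of disk space; the trade-off is that you can't go back to the old cluster once the new one has been started.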
"Big" is apparently still limited by the 32TB table size limit in PostegreSQL.
My last job was running into this. Sadly we didn't gather rigorous metrics, but the database 'felt' slower once it got over 0.5 TB: the process pumping data into it was several percent slower (in items per second) than when it started.
Still, PostgreSQL is great and I'm using it right now on the new project and I hope that someday I am blessed with tables too big for it.
"Unfortunately, due to the technical challenges being more difficult than expected by the team led by EnterpriseDB programmer Robert Haas, no parallel query features have yet been committed."
1. My understanding is that the parallel query stuff they want to add into the core of postgres is about introducing parallel algorithms for scans, sorts, etc. The advantage is that once it's working it will help everyone, a) regardless of how you use postgres (psql, API, etc.) and b) transparently from the user's perspective.
2. par_psql on the other hand is two things. Mainly it's a cute piece of syntactic sugar for psql users that makes it trivial to run multiple queries in parallel (as you observe) but also synchronises them automatically as they end, which is important.
I've also provided some guides about how to use this feature to substantially accelerate single queries without much work or refactoring. It's generally a good tool for SQL workflows that might otherwise be managed by a combination of BASH and SQL, or for situations where you have one epic-sized query that is naturally easy to parallelise.
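Roughly, the pattern looks like this (table and column names are made up; see the README for the exact marker semantics). Statements marked with --& run concurrently, and a plain statement acts as a synchronisation point:

    -- each chunk of the big UPDATE runs in its own backend, in parallel
    UPDATE measurements SET total = qty * price WHERE id % 4 = 0; --&
    UPDATE measurements SET total = qty * price WHERE id % 4 = 1; --&
    UPDATE measurements SET total = qty * price WHERE id % 4 = 2; --&
    UPDATE measurements SET total = qty * price WHERE id % 4 = 3; --&
    -- unmarked statement: runs only after the four chunks above have finished
    VACUUM ANALYZE measurements;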
This talk I gave at FOSS4G Europe (a GIS conference) offers some hints about identifying situations where queries or workflows are trivially decomposable into parallel small queries, allowing a huge speedup:
So, reading between the lines, the SYSTEM TABLESAMPLE algorithm appears to be biased toward small values (in bytes, or whatever the on-disk encoding of Postgres values is).
If you choose random data pages (which are fixed-size), each page holds more small rows than big ones, so small values end up over-represented in the sample.
"In the query above, SYSTEM is the name of the chosen sampling algorithm. The SYSTEM algorithm chooses a set of pseudo-random data pages, and then returns all rows on those pages, and has the advantage in running in constant time regardless of the size of the table. PostgreSQL 9.5 will also ship with the BERNOULLI sampling method, which is more rigorously random, but will take longer the larger the table is."
I would love to go back to relational, but I can't get over the linear scaling and HA ease of Cassandra and I'm willing to write my applications to work around the data model limitations. Is there any open source SQL database (i.e. "newsql") that can linearly scale with Cassandra's ease right now? I look forward to CockroachDB but it isn't done yet.
pg_shard v2.0 will target linear scaling and higher HA through metadata replication. If you have any questions or feedback for the project, we'd be happy to hear from you.
If you're on AWS, you can already get linear scale and full HA. This post shows a proof-of-concept: https://goo.gl/3c2GYc
There are some SQL databases built on top of HBase. HBase and the whole Hadoop ecosystem is not what I'd call easy, but once you are familiar with it, scaling it is relatively easy (probably still not as easy as Cassandra, which keeps things simple by always hashing keys and not allowing sorted range scans).
I'm currently experimenting with Phoenix, because it's a relatively thin layer above HBase, but it doesn't provide many of the features you'd expect from a relational database, such as transactions. SELECT support is quite complete, though: it can do JOINs, subqueries, etc.
I have recently discovered Trafodion, which looks like a much more complete relational database, but at the expense of no longer being just an HBase client.
It's certainly possible to use Postgres productively without ever doing anything advanced.
My use case can pretty well be summed up as "things an ORM can do for you out of the box." As a developer living in the shallow end of the RDBMS pool, switching a hobby project over is a matter of hours. If your usage is similar, learning it well enough to get by could just be a matter of:
For longtime MySQL users (like myself, ahem), I've found the biggest initial hurdle to be the different syntax the stock MySQL and Postgres clients use. When dropping into a Postgres environment I have to constantly look up what the Postgres commands are to do simple stuff like list all tables or describe a table's structure. It's not that the Postgres commands are particularly abstruse or anything, just that they're non-obvious if you're unfortunate enough to have spent 99% of your life marinating in MySQL.
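For anyone in the same boat, here's the handful I keep looking up (mydb/mytable are placeholders):

    MySQL                      psql
    SHOW DATABASES;            \l
    USE mydb;                  \c mydb
    SHOW TABLES;               \dt
    DESCRIBE mytable;          \d mytable
    (list all such commands)   \?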
Postgres is often closer to the SQL standard. For example, quoting identifiers such as column names is specified by SQL-99 with double quotes ("), which is also what Oracle uses, whereas MySQL by default refuses this and requires backticks (`).
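A quick illustration, with made-up names:

    -- standard SQL (Postgres, Oracle, etc.):
    SELECT "order", "from" FROM "line items";

    -- MySQL's default mode wants backticks instead:
    -- SELECT `order`, `from` FROM `line items`;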
Compared to MySQL, the presence of schemas is a big win for me (similar to a 'user' in Oracle). Also, I can't move away from Postgres because it's one of the few DBMSs that supports transactional DDL, i.e. the ability to roll back an "add column" as part of a transaction.
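For example (table and column names made up), the schema change and the data change here succeed or roll back together:

    BEGIN;
    ALTER TABLE orders ADD COLUMN shipped_at timestamptz;
    UPDATE orders SET shipped_at = now() WHERE status = 'shipped';
    -- ROLLBACK here and the new column is gone again, as if nothing happened
    COMMIT;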
Parent seems to be discussing the CLI toolset, where many commands are composed of a backslash and one or two letters, whereas the MySQL CLI expects verbose commands, like SHOW TABLES and DESCRIBE tablename.
I don't think people have a concern that Postgres is the one with the wacky syntax. And it's been my experience that unless the team is very disciplined, if a language offers a thing, you'll find it in the codebase. Thus, ripping out all the backticks where you've used "select" or "from" as a column name isn't a postgres problem.
To really go out on a limb, I would posit that MySQL's view on RDBMS functionality influences its use, which I bet does not square with more traditional systems. Anytime one finds a shift in world view, the code becomes hard to rewrite and still behave the same way.
It's nice to see PostgreSQL adding all of these analytics and business intelligence features. BI platforms have had these sorts of features for at least a decade.
These aren't "big data" features but rather EDW/BI features.
If PostgreSQL can't trivially scale horizontally like Cassandra or Riak or integrate nicely with Spark/Hadoop like MongoDB then it isn't particularly useful in typical big data roles.
Well, something not mentioned in this article is that 9.5 should also introduce inheritance from/to Foreign Tables, which is one step closer to nice and easy horizontal scaling.
Maybe that's not the trendy NoSQL way of doing things, but to someone more used to PostgreSQL it looks pretty neat. I think that, being the old and reliable project it is, Postgres will come to big data at its own pace and quietly... but it's already moving in that direction, and FDW was the first step.
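A rough sketch of what that enables with postgres_fdw (server name, host, credentials and table names are all made up):

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER shard1
      FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'shard1.internal', dbname 'app');

    CREATE USER MAPPING FOR CURRENT_USER
      SERVER shard1
      OPTIONS (user 'app', password 'secret');

    -- local parent table: queries against it see all children
    CREATE TABLE measurements (
        id         bigint,
        created_at timestamptz,
        payload    jsonb
    );

    -- new in 9.5: a foreign table can be a child in the inheritance tree
    CREATE FOREIGN TABLE measurements_shard1 ()
      INHERITS (measurements)
      SERVER shard1
      OPTIONS (table_name 'measurements');

    -- this now scans local children and the remote shard transparently
    SELECT count(*) FROM measurements;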
"But I'm starting to wish they had in-place upgrades. The application I develop is getting to the point of being 'too large to make a copy of the whole db whenever I want to upgrade Postgres'. I can postpone things for a little while by replacing all our large objects with external file storage -- but the tables themselves are growing quickly."
That'd be my favorite "big data" feature.