“Big data” features coming in PostgreSQL 9.5

warbiscuit · on Aug 15, 2015

I love PG, and don't plan to ever move away from it.

But I'm starting to wish they had in-place upgrades. Application I develop is getting to the point of "too large to make a copy of the whole db whenever I want to upgrade Postgres".

I can postpone things for a little while by replacing all our large objects with a external file storage -- but the tables themselves are growing quickly.

That'd be my favorite "big data" feature.

tsuraan · on Aug 15, 2015

You can run pg_upgrade with the --link option to have it hardlink most of the tables into the new cluster's location. It'll still re-write some things, but I believe that most of the actual table data remains unaltered.

jeltz · on Aug 15, 2015

Yeah, unless you have huge system catalogs (where the table definitions are stored) pg_upgrade should be almost instant if you sue hardlinks.

warbiscuit · on Aug 16, 2015

That's beautiful. Had no idea about that option.

brianolson · on Aug 15, 2015

"Big" is apparently still limited by the 32TB table size limit in PostegreSQL. My last job was running into this. Sadly we didn't get rigorous metrics, but the database 'felt' slower when it got over .5 TB. The process pumping data into the database was several percent slower in items per second than when it started. Still, PostgreSQL is great and I'm using it right now on the new project and I hope that someday I am blessed with tables too big for it.

nabla9 · on Aug 16, 2015

If 32TB table size limit and degrading performance worries you, I suggest that you delve into Postgress documentation and discover table partitioning.

(you can partition table into chunks that are together larger than 32TB)

Cieplak · on Aug 16, 2015

Insert performance degrades as indexes grow large. Indexing in batch is faster than indexing with each insert.

http://stackoverflow.com/questions/12206600/how-to-speed-up-...

Depending on the nature of your project, you might consider a column store engine for postgres, such as

https://github.com/citusdata/cstore_fdw

collyw · on Aug 17, 2015

That's orders of magnitude bigger than what most people who claim to be working with "big data" need.

moreoverlords · on Aug 16, 2015

"Unfortunately, due to the technical challenges being more difficult than expected by the team led by EnterpriseDB programmer Robert Haas, no parallel query features have yet been committed."

In the meantime... self-plug: http://parpsql.com (free, open source)

ddorian43 · on Aug 16, 2015

So the parallel-query-features help to run 1 query in parallel, while your program executes multiple queries in parallel.

moreoverlords · on Aug 16, 2015

Hi,

Yes, but there's a little more to it.

1. My understanding is that the parallel query stuff they want to add into the core of postgres is about introducing parallel algorithms for scans, sorts, etc. The advantage is that when it is working there will be an opportunity to help everyone, a) regardless of how you use postgres (psql, API, etc) and b) transparently from the user perspective.

2. par_psql on the other hand is two things. Mainly it's a cute piece of syntactic sugar for psql users that makes it trivial to run multiple queries in parallel (as you observe) but also synchronises them automatically as they end, which is important.

I've also provided some guides about how to use this feature to substantially accelerate single queries without much work or refactoring. It's generally a good tool for SQL workflows that might otherwise be managed by a combination of BASH and SQL, or for situations where you have one epic-sized query that is naturally easy to parallelise.

This talk I gave at FOSS4G Europe (a GIS conference) offers some hints about identifying situations where queries or workflows are trivially decomposable into parallel small queries, allowing a huge speedup:

http://graemebell.net/foss4gcomo.pdf

Hope this is of interest and use to you.

chubot · on Aug 15, 2015

So, reading between the lines, the SYSTEM TABLESAMPLE algorithm appears to be biased toward small values (in bytes, or whatever the on-disk encoding of Postgres values is).

If you choose random data pages (of fixed size), then it can fit more rows of small size than of big size.

"In the query above, SYSTEM is the name of the chosen sampling algorithm. The SYSTEM algorithm chooses a set of pseudo-random data pages, and then returns all rows on those pages, and has the advantage in running in constant time regardless of the size of the table. PostgreSQL 9.5 will also ship with the BERNOULLI sampling method, which is more rigorously random, but will take longer the larger the table is."

no1youknowz · on Aug 15, 2015

Cannot wait for 9.5. Migrated from MySQL. So happy I did.

kodablah · on Aug 15, 2015

I would love to go back to relational, but I can't get over the linear scaling and HA ease of Cassandra and I'm willing to write my applications to work around the data model limitations. Is there any open source SQL database (i.e. "newsql") that can linearly scale with Cassandra's ease right now? I look forward to CockroachDB but it isn't done yet.

ozgune · on Aug 15, 2015

(Disclaimer: I'm one of the authors of pg_shard.)

pg_shard extends PostgreSQL without forking it, and it's worth a look: https://github.com/citusdata/pg_shard

pg_shard v2.0 will target linear scaling and higher HA through metadata replication. If you have any questions or feedback for the project, we'd be happy to hear from you.

If you're on AWS, you can already get linear scale and full HA. This post shows a proof-of-concept: https://goo.gl/3c2GYc

lukaslalinsky · on Aug 15, 2015

There are some SQL databases build on top of HBase. HBase and the whole Hadoop ecosystem is not what I'd call easy, but once you are familiar with it, scaling it is relatively easy (probably still not as easy as Cassandra, thanks to the fact that Cassandra always hashes things and does not allow sorted range scans).

I'm currently experimenting with Phoenix, because it's a relatively thin layer above HBase, but it doesn't provide many features that you expect in relational databases like transactions. SELECT support is quite complete, it can do JOINs, subqueries, etc.

I have recently discovered Trafodion which looks like a much more complete relational database, but at the expense of no longer being just a HBase client.

http://phoenix.apache.org/ https://wiki.trafodion.org/wiki/index.php/Main_Page https://vimeo.com/128190098

ericfrenkiel · on Aug 15, 2015

It's commercial, but MemSQL Community Edition scales horizontally. It's free and there are no capacity or scaling limits.

ddorian43 · on Aug 16, 2015

Community edition doesn't have HA

Joyfield · on Aug 15, 2015

I should really learn some PG and move away from MySQL. But at 40 years old i have a problem learning how to sit properly. sigh

chucksmash · on Aug 15, 2015

It's certainly possible to use Postgres productively without ever doing anything advanced.

My use case can pretty well be summed up as "things an ORM can do for you out of the box." As a developer living in the shallow end of the RDBMS pool, switching a hobby project over is a matter of hours. If your usage is similar, learning it well enough to get by could just be a matter of:

  apt-get install postgres

smacktoward · on Aug 16, 2015

For longtime MySQL users (like myself, ahem), I've found the biggest initial hurdle to be the different syntax the stock MySQL and Postgres clients use. When dropping into a Postgres environment I have to constantly look up what the Postgres commands are to do simple stuff like list all tables or describe a table's structure. It's not that the Postgres commands are particularly abstruse or anything, just that they're non-obvious if you're unfortunate enough to have spent 99% of your life marinating in MySQL.

tajen · on Aug 16, 2015

Postgres is often closer to the SQL standard. For example escaping column names is specified by SQL-99 with double-quotes (") whereas MySQL by default refuses this and requires backticks (`, same as Oracle).

Compared to MySQL, the presence of schemas is a big win for me (same as 'user' in Oracle). Also, I can't move away from Postgres because it's the only DBMS which supports transactional DDL, which the ability to roll back an "add column" as part of a transaction.

http://stackoverflow.com/questions/2901453/sql-standard-to-e...

McGlockenshire · on Aug 16, 2015

Parent seems to be discussing the cli toolset, where many commands are composed of a backslash and one or two letters, whereas the mysql cli expects verbose commands, like SHOW TABLES and DESCRIBE tablename.

tajen · on Aug 16, 2015

It's true that the lack of SHOW and DESCRIBE in Postgres is annoying. Makes it difficult to get the same result from a SQL client.

minimaxir · on Aug 15, 2015

Syntax-wise, Postgres is not much different from MySQL. You just have to learn how to use window functions which takes a bit.

mdaniel · on Aug 15, 2015

I don't think people have a concern that Postgres is the one with the wacky syntax. And it's been my experience that unless the team is very disciplined, if a language offers a thing, you'll find it in the codebase. Thus, ripping out all the backticks where you've used "select" or "from" as a column name isn't a postgres problem.

To really go out on a limb, I would posit that MySQL's view on RDBMS functionality influences its use, which I bet does not square with more traditional systems. Anytime one finds a shift in world view, the code becomes hard to rewrite and still behave the same way.

Qantourisc · on Aug 15, 2015

I just hope they retrain the option/default to strcoll for the final sort. For internal parts it doesn't matter.

polskibus · on Aug 15, 2015

Does anyone know about TABLESAMPLE equivalents in other RDBMS, for example in MS SQL?

pfg · on Aug 15, 2015

MS SQL has TABLESAMPLE support: https://technet.microsoft.com/en-us/library/ms189108(v=sql.1...

I don't think it has something like TABLESAMPLE BERNOULLI.

bmh100 · on Aug 15, 2015

It's nice to PostgreSQL adding all of these analytics and business intelligence features. BI platforms have had these sorts of features for a least a decade.

curiousjorge · on Aug 15, 2015

just seems like postgresql is constantly coming up with new things that just sort of works on top of a solid and reliable platform.

threeseed · on Aug 15, 2015

These aren't "big data" features but rather EDW/BI features.

If PostgreSQL can't trivially scale horizontally like Cassandra or Riak or integrate nicely with Spark/Hadoop like MongoDB then it isn't particularly useful in typical big data roles.

Twisell · on Aug 16, 2015

Well something not mentioned in this article is the that 9.5 should also introduce inheritance from/to Foreign Tables which is one step closer to nice and easy horizontal scaling.

Found this detailed example : http://michael.otacoo.com/postgresql-2/postgres-9-5-feature-...

Maybe thats not a No-SQL trendy way of doing things, but for I am more used to PostgreSQL this look pretty neat. I think as the old and reliable project it is it will come to BigData at its own pace and quietly... but its already moving toward this and FDW was the first step.