
Making PostgreSQL Scale Hadoop-style: Benchmark Numbers - ranvir
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
======
joshhart
I wish this were open-source. Citus could certainly still make money hosting
or supporting the code.

But lack of sharing is what we get when major open-source projects do not
choose the GPL.

~~~
norkakn
It's Apache V2, which is a lot better than GPL for many of us.

[https://github.com/citusdata/cstore_fdw/blob/master/LICENSE](https://github.com/citusdata/cstore_fdw/blob/master/LICENSE)

They're really nice talented people, and a great example of a company giving
back to the community.

~~~
chaostheory
It seems that only the storage portion is open source. The portion that scales
Postgresql horizontally isn't. Am I wrong?

------
berns
Pricing pages without prices. I hate them.

~~~
snarfy
"How much?"

"How much you got?"

~~~
kofejnik
"How about three-fiddy?"

~~~
umur
three-fiddy it is! (Umur from Citus here)

I hear you, and we are working on fixing that, even though it might take some time.
The challenge for us is that for an enterprise, alternatives could cost
literally in the millions (see Oracle pricing at $100k's for just a single
8-core commodity machine). For start-ups, we have offered Citus for prices
lower than $5k per node in the past, and we provide an entirely free community
version as well.

Essentially, our take is to not have pricing be what stops you from using
CitusDB. And if you are an enterprise, the value you get from using Citus
should far exceed what you'd get from any other alternative out there.

~~~
sitkack
Pricing is hard, especially on a truly high-tech product. It is always sold for
less than it is truly worth, a hit you take for the art.

~~~
justincormack
No, that's not really true; everything is sold for less than it is "truly"
worth. Price discrimination works both ways, no reason to assume the seller
will capture it all.

~~~
sitkack
Most of the time when garage based uber hackers make a product they lose
proportionally more. Not that this is bad, but higher tech doesn't mean
correspondingly more profit.

------
ddorian43
Monetdb vs citusdb vs postgresql:

[https://www.monetdb.org/content/citusdb-postgresql-column-
st...](https://www.monetdb.org/content/citusdb-postgresql-column-store-vs-
monetdb-tpc-h-shootout)

~~~
ozgune
(Ozgun from Citus Data)

This benchmark confuses CitusDB with PostgreSQL + cstore_fdw extension.
CitusDB scales out PostgreSQL to multiple machines, and cstore is a columnar
store for PostgreSQL. The author has a clarification posted at the end.

For single node Postgres + cstore numbers on TPC-H, we found that a few simple
changes notably help. 1/ Analyze on foreign tables + increasing work_mem helps
join queries by 2-4x, and 2/ Using the double precision instead of the numeric
type increases aggregate function performance by 6x.
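
As a rough sketch of the first change above: the helper below only builds the SQL statements one would run in a session before the join queries. The function name `tuning_statements`, the table names, and the `256MB` value are illustrative assumptions, not settings taken from the benchmark.

```python
# Hypothetical helper: builds the session-level statements described above.
# Table names and the work_mem value are assumptions for illustration only.

def tuning_statements(foreign_tables, work_mem="256MB"):
    """Return SQL that raises work_mem and runs ANALYZE on each foreign table."""
    statements = [f"SET work_mem = '{work_mem}';"]
    for table in foreign_tables:
        # PostgreSQL only analyzes foreign tables when they are named explicitly,
        # so each one gets its own ANALYZE statement.
        statements.append(f"ANALYZE {table};")
    return statements


if __name__ == "__main__":
    for stmt in tuning_statements(["lineitem", "orders"]):
        print(stmt)
```

The second change (double precision instead of numeric) is a column type choice made when the foreign table is defined, so it is not covered by this sketch.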

Lastly, we agree that vectorized execution can result in notable performance
wins! See
[https://github.com/citusdata/postgres_vectorization_test](https://github.com/citusdata/postgres_vectorization_test)
for some initial work. We hope to incorporate some of MonetDB's vectorized
execution features in cstore_fdw in the future.

------
mbubb
Instead of seeing this as a Hadoop alternative - this might be a better
alternative to the clunky data warehousing options like Vertica, Netezza,
Greenplum, etc.

The Citus vs Hadoop comparison feels a little apples vs oranges as presented.

I worked a bit with Netezza appliances which use an older version of postgres
which can spread queries across a Bladecenter ... I wonder how this compares.

The downside of the Netezza (beside the huge cost) is that it is not
expandable at all - to get more Netezza you need to buy another multirack
system.

Also there is a bottleneck getting data in and out as there are individual
host servers that you launch jobs through (ibm x3650s if I remember
correctly).

Hadoop does a significantly better job than something like Netezza in those 2
areas.

I guess the head-to-head comparison would be Citus vs Impala/HBase? That is
probably where a 'massively parallel' postgres setup that can scale
horizontally would outperform its hadoop counterpart.

~~~
twic
I don't know much about the practical operation of this kind of software. What
is it that makes Citus a better alternative to, say, Greenplum? Both of them
are PostgreSQL-derived parallel column-store databases, right? What is Citus's
USP?

------
flavor8
I would love to see a comparison to a cost-matched Redshift cluster,
especially since this test is running on Amazon's hardware.

------
gopalv
Neat. Postgres has always had a kick-ass I/O layer - particularly on ext4.

I think showing Q2 and Q11 numbers would've been great, because for something
like Tez, this is how those plans look in Hive (before the cost-based
optimizer work)

[http://people.apache.org/~gopalv/tpch-
plans/q2_minimum_cost_...](http://people.apache.org/~gopalv/tpch-
plans/q2_minimum_cost_supplier.svg)

[http://people.apache.org/~gopalv/tpch-
plans/q11_important_st...](http://people.apache.org/~gopalv/tpch-
plans/q11_important_stock.svg)

Postgres's query planner should shine for those.

~~~
getsat
You've seen better performance on ext4 than XFS? The opposite has been my
experience (mainly on 1tb data across 100 million rows, 20,000 queries/sec).
btrfs + compression was 5x faster than XFS, but btrfs has nasty kernel
deadlock bugs when the disk is almost full.

------
digitalzombie
I wish postgresql was easy to cluster.

I tried google'n for tutorials but there are none.

There are no books on clustering or sharding postgresql either? At least I
haven't found any.

~~~
brianwawok
It is tricky. It is also hard to make a real FT postgresql instance, as most
tutorials have a single pgpool node doing the load balancing, which shifts the
SPOF to the pgpool node. You can do it more or less with a virtual IP à la
[http://www.pgpool.net/pgpool-
web/contrib_docs/watchdog_maste...](http://www.pgpool.net/pgpool-
web/contrib_docs/watchdog_master_slave/en.html)

To add sharding on top of that is a similar tutorial, but even more
complicated.

------
chaostheory
So what's the difference between Citus and Greenplum?

------
covi
There are some basic SparkSQL configs not discussed in the blog post; see more
here: [http://apache-spark-developers-
list.1001551.n3.nabble.com/Su...](http://apache-spark-developers-
list.1001551.n3.nabble.com/Surprising-Spark-SQL-benchmark-tt9041.html#a9042)

------
arthursilva
Great results. Kudos to Citus team.

------
untitledwiz
What about Hive?

~~~
lern_too_spel
I don't think they make monitors wide enough to show Hive results on the same
graphs.

