
Pivotal Greenplum Database has been open sourced - snaga
https://github.com/greenplum-db/gpdb
======
jacques_chester
Edit: There's a shiny website too --
[http://greenplum.org/](http://greenplum.org/)

We (Pivotal, for whom I don't speak in any official capacity) have also
opensourced Apache HAWQ (incubating)[1], which is an SQL front-end for Hadoop
that was extracted from Greenplum, as well as Apache Geode (incubating)[2]
which was based on GemFire.

This was part of a general announcement we made in February that our intention
was to opensource our data products as it became possible to do so.

Incidentally, we are hiring engineers across our entire data suite, including
PostgreSQL specialists. The Data division solves seriously heavyweight
distributed problems. You can look up the listings on
[http://pivotal.io/careers](http://pivotal.io/careers), or if you like helping
a fellow engineer collect a referral, email me: jchester@pivotal.io and I'll
work out which office and division to send you to.

[1] [http://hawq.incubator.apache.org/](http://hawq.incubator.apache.org/)

[2] [http://geode.incubator.apache.org/](http://geode.incubator.apache.org/)

~~~
iheartmemcache
Oh man this is huge. Are you guys opening Chorus and the Pivotal HD stuff too?
I'd be shorting HP stock right now, because Vertica just lost all of its
appeal. Teradata and its Hadoop H-SQL or whatever must be shaking in their
boots too. Are you guys going to attempt to upstream this or is it forked to
the point of no return?

I'd love to work on a project like this but I've got no PGSQL experience
(though I have worked on HFT both with KDB+ and writing a fairly decent
functional knock-off operating in production and I bet the distribution and
data set problems were similar). How is EMC to work for as a parent company?

~~~
mhw
Chorus already is open source:
[https://github.com/Chorus/chorus](https://github.com/Chorus/chorus)

------
djokkataja
"Greenplum Database is based on PostgreSQL 8.2 with a few features added in
from the 8.3 release. To support the distributed nature and typical workload
of a Greenplum Database system, some SQL commands have been added or modified,
and there are a few PostgreSQL features that are not supported. Greenplum has
also added features not found in PostgreSQL, such as physical data
distribution, parallel query optimization, external tables, resource queues
for workload management and enhanced table partitioning."

[http://gpdb.docs.pivotal.io/4360/ref_guide/feature_summary.h...](http://gpdb.docs.pivotal.io/4360/ref_guide/feature_summary.html#topic8)

Are there plans to eventually have full feature parity with ongoing PostgreSQL
development (as in adding features from versions of PostgreSQL newer than
8.2), or is Greenplum going to mostly be taking its own trajectory?

~~~
jacques_chester
> _Are there plans to eventually have full feature parity with ongoing
> PostgreSQL development, or is Greenplum going to mostly be taking its own
> trajectory?_

I asked about this today while I was at work.

The gist was: it will depend on a lot of factors, so it'd be unwise to nail
any documents to any doors. But watch this space for further developments.

In the meantime, Greenplum speaks the PostgreSQL wire protocol. All the tools
that can speak to PostgreSQL can speak seamlessly to Greenplum.
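
To make "speaks the wire protocol" concrete, here is a minimal sketch of the first packet a PostgreSQL client sends, the protocol 3.0 StartupMessage. It assumes Greenplum accepts the same startup packet as the PostgreSQL it was forked from; the user name `gpadmin` and database name `warehouse` are placeholders, and nothing here opens a network connection.

```python
import struct

PROTOCOL_V3 = 3 << 16  # major version 3, minor version 0 (196608)


def build_startup_message(user: str, database: str) -> bytes:
    """Encode a protocol 3.0 StartupMessage (the first packet a client sends)."""
    body = struct.pack("!I", PROTOCOL_V3)
    # Parameters are NUL-terminated key/value strings.
    for key, value in (("user", user), ("database", database)):
        body += key.encode() + b"\x00" + value.encode() + b"\x00"
    body += b"\x00"  # terminator after the last key/value pair
    # The length prefix counts itself (4 bytes) plus the body.
    return struct.pack("!I", len(body) + 4) + body


# Placeholder credentials, for illustration only.
message = build_startup_message("gpadmin", "warehouse")
```

In practice nobody builds this by hand: an existing PostgreSQL driver or `psql` produces the same bytes, which is why those tools should work unchanged.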

Disclaimer: While I work for Pivotal, I work in Pivotal Labs. I'm just an
engineer and I don't have any input into product direction for our data
products, which is done in another division. If pain persists, consult your
doctor.

------
lobster_johnson
Looks like a worthy competitor to Elasticsearch for analytics: it has custom
partitioning strategies, true parallel querying, and support for columnar
table storage, and has much of Postgres' rich SQL implementation.
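
A rough sketch of what "partitioning plus parallel querying" means in an MPP design: rows are hash-distributed across segments, each segment computes a partial aggregate, and a coordinator merges the partials. This is an illustration only, not Greenplum's actual hash function or executor; `NUM_SEGMENTS`, the CRC32 hash, and the thread pool are all stand-ins chosen for the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key: str) -> int:
    """Pick a segment by hashing the distribution key (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode()) % NUM_SEGMENTS


def distribute(rows):
    """Place each (customer_id, amount) row on exactly one segment."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for customer_id, amount in rows:
        segments[segment_for(customer_id)].append((customer_id, amount))
    return segments


def partial_aggregate(segment_rows):
    """Work done locally on one segment: partial sum and row count."""
    return sum(amount for _, amount in segment_rows), len(segment_rows)


def parallel_average(segments):
    """Coordinator step: run the partial aggregates concurrently, then merge."""
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(partial_aggregate, segments))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else None


rows = [(f"cust-{i}", float(i)) for i in range(1, 101)]
segments = distribute(rows)
average = parallel_average(segments)
```

The coordinator only ever sees one small partial result per segment, which is the reason a single master is less of a bottleneck for reads than it first appears.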

The parallel loading looks good, but I'm concerned that the single master
means it's a bottleneck for writes -- is this the case, or is there a way to
distribute writes across segments without involving the master?

The other concern is that this isn't just "horizontally partitioned Postgres".
It's forked from a very old version (8.2) of Postgres, and so doesn't have
things like hot standby, streaming replication, the JSON datatype, GIN indexes
(does it have GiST?), arrays, etc. It looks like it's optimized for parallel
workloads, not for general use.

Anyone here with any experience with Greenplum who can perhaps speak about
their work and the things that Greenplum is good at?

~~~
nl
Greenplum is designed for datawarehousing.

The _analytics_ it is built for aren't the kind that ElasticSearch is
typically used for.

Greenplum/datawarehouse style analytics are things like: What is the average
spend of female customers with 2 children who live in postcode AAAA or BBBB.
Break it down by day of week and group by their marital status. Now for the
top cohort of buyers on Mondays, give me a breakdown of the 3 most popular
products and our profit margin on each one. This is often called "Business
Intelligence" (BI).
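
As a sketch of the first part of that question, the query below uses Python's built-in `sqlite3` as a stand-in for a warehouse so it runs anywhere. The `purchases` table, its columns, and the sample rows are all made up for illustration; on Greenplum the SQL would be sent over a PostgreSQL connection instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE purchases (
           gender TEXT, children INTEGER, postcode TEXT,
           marital_status TEXT, day_of_week TEXT, spend REAL)"""
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("F", 2, "AAAA", "married", "Mon", 40.0),
        ("F", 2, "BBBB", "married", "Mon", 60.0),
        ("F", 2, "AAAA", "single", "Tue", 30.0),
        ("F", 1, "AAAA", "married", "Mon", 99.0),  # excluded: one child
        ("M", 2, "BBBB", "single", "Tue", 75.0),   # excluded: not female
        ("F", 2, "CCCC", "single", "Mon", 10.0),   # excluded: other postcode
    ],
)

# Average spend of female customers with 2 children in postcode AAAA or BBBB,
# broken down by day of week and marital status.
rows = conn.execute(
    """SELECT day_of_week, marital_status, AVG(spend) AS avg_spend
       FROM purchases
       WHERE gender = 'F' AND children = 2 AND postcode IN ('AAAA', 'BBBB')
       GROUP BY day_of_week, marital_status
       ORDER BY day_of_week, marital_status"""
).fetchall()

for row in rows:
    print(row)
```

The follow-up questions (top cohort on Mondays, top 3 products, margin per product) are more filters, joins, and groupings layered on the same fact table, which is the workload these databases are tuned for.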

You could make ElasticSearch do that of course, but it wouldn't be much fun.

Given that context, the single master limitation turns out not to be a huge
problem. Typically Greenplum is set up to load data from the OLTP-style online
system, and the amount of data loaded is very predictable.

Same with things like hot standby and replication. In the Datawarehouse world
it isn't uncommon for there to be nightly periods where a batch dataload
occurs and/or nightly reporting is done, and the availability of external
interfaces to the DB during that time may be restricted.

~~~
lobster_johnson
Thanks, that makes a lot of sense. We have those types of queries, too, which
is where ES definitely breaks down. We wouldn't mind switching those parts
(the "BI") of the analytics into something with higher latency.

~~~
threeseed
One option there is to use Spark.

You can then write SQL, Scala, Python, R to interact with ElasticSearch. I
can't recall the performance but against Cassandra, HDFS, HBase, MongoDB etc.
it is very fast.

------
parasubvert
It's been a long time coming for this database niche to reach open source,
which is the tech behind the big analytical prowess of countless companies
(before and after Hadoop). Teradata, the first massively parallel database,
was released in 1984.

A brief overview on MPP databases / Greenplum here:
[https://dwarehouse.wordpress.com/2012/12/28/introduction-to-...](https://dwarehouse.wordpress.com/2012/12/28/introduction-to-massively-parallel-processing-mpp-database/)

------
mhw
Hmm; this might make it more attractive to investigate Chorus, the Rails-based
analytics application that Pivotal also open-sourced
([https://github.com/Chorus/chorus](https://github.com/Chorus/chorus)).

It was always awkward to figure out what components you needed to build a
development environment for it. It depended on having a virtual machine with a
specific version of Greenplum Database running in it, but that version of the
database wasn't easy to find. An open source Greenplum Database might make it
easier to get started.

------
polskibus
What's the story behind Greenplum? Is it an old startup that was bought by
Pivotal some time ago? I found an article about Greenplum raising $27 million
in 2008 [1].

I was wondering whether anyone used greenplum in production and with what kind
of loads. With today's in-memory fad, I am also interested in whether gp model
supports loading everything into memory in a MOLAP fashion.

[1] [http://techcrunch.com/2008/01/21/greenplum-takes-27-million-...](http://techcrunch.com/2008/01/21/greenplum-takes-27-million-series-c/)

~~~
Herald_MJ
Greenplum was a startup that was acquired by EMC in 2010. Pivotal is partially
an EMC venture, so I guess that's how it ended up in their hands.

------
fl0wenol
This is exciting news. Greenplum is one of the few instances of a
Postgres-based MPP that I've had good experiences with.

The experiences and tuning in the query plan re-writer for MPP might be
helpful in rolling in parallel operations support to the base. Not necessarily
the "how" since the GP strategy won't necessarily mesh, but the "what".

------
tlrobinson
This might pair nicely with Metabase, our open source BI tool:
[http://www.metabase.com/](http://www.metabase.com/)

~~~
kfk
I am looking for a new reporting solution for my company, but it seems you
guys have a different set of users in mind. Your dashboard building solution
looks great, though; I wish the BI suites out there were that easy.

------
__david__
What's up with just a single commit since 2006?

[https://github.com/greenplum-db/gpdb/commits/master](https://github.com/greenplum-db/gpdb/commits/master)

Why include all the really old commits, while squashing the most recent 10
years of commits into a single commit (6b0e52bead)?

~~~
tlrobinson
The old source history is Postgres itself, the latest single commit is
Greenplum.

I wonder why it's based on Postgres 8.2 rather than a newer version?

~~~
petepete
That's probably why it was open-sourced. It'd take a huge effort to 'move'
something as entrenched as Greenplum onto PostgreSQL 9.x, and the fact that so
many features and enhancements are missing from 8.2 is putting off potential
users.

------
uberneo
Looks very similar to Citus Data
[https://www.citusdata.com/](https://www.citusdata.com/), which also has an
MPP architecture and is based on PostgreSQL.

------
piggybox
would like to see some benchmark vs Redshift, though the latter is a blackbox

~~~
LittlePeter
What do you mean blackbox? It has explain analyze, it has rich query meta
info, it has a web interface to query stats. You even know on what hardware it
runs. Genuinely not sure what's blackbox about Redshift.

~~~
fidget
We had a cluster last week spend 7 hours in a DB health 'unknown' state.
Response from AWS? Spin up a new cluster from snapshot.

~~~
vgt
I suggest you check out BigQuery. Redshift is not truly "fully managed", and
it's not really HA/Durable, as your experience indicates.

------
snaga
FYI.

[HACKERS] Patent warning about the Greenplum source code
[https://www.mail-archive.com/pgsql-hackers@postgresql.org/ms...](https://www.mail-archive.com/pgsql-hackers@postgresql.org/msg272048.html)

You may have to check this message before diving into the code.

------
nickpeterson
I don't see any information on the site about limitations for enterprise use,
is this free as in beer?

~~~
lobster_johnson
Apache license, so yes, free to use and modify for whatever purpose.

~~~
james2vegas
except putting the code back into mainline postgres

