
PostgreSQL, pg_shard, and what we learned from our failures - craigkerstiens
https://www.citusdata.com/blog/19-ozgun/265-postgresql-pgshard-and-what-we-learned-our-failures
======
jabbernotty
I find the proposed solution for 'SELECT COUNT(customer_id) FROM orders'
confusing. They suggest placing the 3 billion order rows into three sets of 1
billion, to be distributed among three machines. But summing the number of
distinct customer IDs in each bucket doesn't give you the number of distinct
customer IDs in the orders table because an ID could be in more than one
bucket. What am I missing?

~~~
ozgune
This isn't clear in the blog post; I'll try to clarify.

The buckets correspond to hash token ranges. For example, the database takes
one row from orders_2013, hashes the customer_id value, and gets a hashed
value of 1.5B. It then puts it into the second bucket orders_2013_[1B-2B[.

After the distributed shuffle, you end up with orders_cust_id[1B-2B[ on
machine 2. All orders belonging to a given, disjoint set of customers
therefore live on one machine. You can then push down the count(distinct
customer_id) to each node, and then add up the results.

We use hash token ranges in pg_shard. I'll add this clarification to the
blog post.
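
A minimal Python sketch of the idea above. The token space, bucket ranges, and hash function here are invented for illustration; pg_shard's actual hash and ranges differ:

```python
# Sketch of count(distinct) pushdown over hash token ranges.
# Token space and hash are illustrative stand-ins, not pg_shard's.
import hashlib

NUM_BUCKETS = 3
TOKEN_SPACE = 3_000_000_000  # pretend tokens fall in [0, 3B)

def token(customer_id):
    # Deterministic hash of the customer_id into the token space.
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % TOKEN_SPACE

def bucket(customer_id):
    # Each bucket owns a contiguous token range, e.g. [1B, 2B) -> bucket 1.
    return token(customer_id) * NUM_BUCKETS // TOKEN_SPACE

# All orders for a given customer hash to exactly one bucket...
orders = [101, 202, 101, 303, 202, 404]  # customer_ids from the orders table
shards = {b: [] for b in range(NUM_BUCKETS)}
for cid in orders:
    shards[bucket(cid)].append(cid)

# ...so each node can count its own distinct customers, and the
# coordinator simply sums the per-node counts.
per_node = [len(set(ids)) for ids in shards.values()]
assert sum(per_node) == len(set(orders))
```

Because the hash is deterministic, the per-bucket customer sets are disjoint, which is what makes the final sum correct.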

------
mercurial
I think PG is a great database, but I don't see what makes it a good fit for a
distributed system. You're going to either give up too much performance or a
lot of the guarantees that make standard SQL databases interesting in the
first place.

~~~
ozgune
(Ozgun from Citus Data)

It's Postgres' Extension APIs. They enable you to extend, cooperate with, or
override any database submodule's behavior without forking the database:
[https://goo.gl/rr2EIm](https://goo.gl/rr2EIm)

As an example, let's say you leveraged planner hooks to parallelize PostgreSQL
queries. When a new feature such as JSONB gets checked into Postgres, it
becomes immediately available for parallel querying.

To my knowledge, no other database provides the means to extend the database
to such an extent. These APIs became official in PG 9.0, and two example links
from the PostgreSQL documentation are:

[http://www.postgresql.org/docs/9.4/static/extend.html](http://www.postgresql.org/docs/9.4/static/extend.html)
[http://www.postgresql.org/docs/9.4/static/spi.html](http://www.postgresql.org/docs/9.4/static/spi.html)

~~~
mercurial
This doesn't really answer my question. I'm not asking "how can you turn PG
into a distributed database", I'm asking "what do I get with that that I don't
get with, e.g., Cassandra?"

~~~
tracker1
Structured Query Language ... that's the biggest thing you get. Also
relational data resolved by the database itself.

It also lets you present said information from other stores through a
well-defined interface, such as via stored procedures that can be locked down
via login/identity/application and other features typical of SQL, while
acting as a gateway to data stores optimized for specific types of data. Not
to mention that JSON support in pgsql has gotten pretty good... I think with
plv8 it gets even better. Where I think pg falls short is replica sets, hot
failover, and automated recovery options, which many nosql servers give you
out of the box. With PG you pretty much have to pay for a decent HA solution.

That said, I honestly really like document stores, and RethinkDB, as an
example, offers some of the relational features of a typical SQL database
while being primarily a document store.

~~~
threeseed
You can get SQL over Cassandra and almost every other database (even
MongoDB) via Spark. And CQLSH is at least SQL-like.

And personally I have always thought putting business logic in stored
procedures is a truly dumb idea. You lock yourself to that database forever,
and there is always a point at which you exceed its capabilities and need to
implement the functionality in the application layer anyway.

~~~
ozgune
I think it's the extent of SQL coverage. As an example, PostgreSQL covers all
join types (inner, outer, anti, lateral) that I can think of.

Postgres has been building its SQL coverage for over a decade, and compares
favorably to other databases:
[https://news.ycombinator.com/item?id=9018129](https://news.ycombinator.com/item?id=9018129)

~~~
mercurial
Well, of course it does, but "join" in a distributed database is notoriously
different from "join" in a non-distributed scenario (and often not present at
all), so this isn't a strong argument for Postgres.

~~~
mslot
The problem with joins, and the reason most distributed databases don't
support them, is primarily that the local join machinery is complicated to
implement.

Given local join machinery, a join between 2 distributed tables can be
achieved by joining the partitions of the tables locally after ensuring that
tables are co-partitioned and partitions are co-located. That is, if
performing A JOIN B ON (A.X = B.Y), ensure that A is partitioned by X and B is
partitioned by Y with the same partitioning scheme such that every partition
in A has a corresponding partition in B with which it can be joined. A
distributed join can then be performed (in parallel) by joining each pair of
partitions and concatenating the results.
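
A small Python sketch of the co-located join described above. The tables, partition count, and modulo partitioning function are all invented for illustration:

```python
# Sketch of a co-located distributed join: A JOIN B ON (A.x = B.y),
# with both tables partitioned by the join key under the same scheme.
NUM_PARTITIONS = 4

def partition(key):
    # Same partitioning function for both tables (the co-partitioning part).
    return key % NUM_PARTITIONS

A = [(1, "a1"), (2, "a2"), (5, "a5")]  # rows of (x, payload)
B = [(1, "b1"), (5, "b5"), (6, "b6")]  # rows of (y, payload)

a_parts = {p: [r for r in A if partition(r[0]) == p] for p in range(NUM_PARTITIONS)}
b_parts = {p: [r for r in B if partition(r[0]) == p] for p in range(NUM_PARTITIONS)}

def local_join(a_rows, b_rows):
    # Stand-in for PostgreSQL's local join machinery on one node.
    return [(x, pa, pb) for (x, pa) in a_rows for (y, pb) in b_rows if x == y]

# Each node joins only its own pair of partitions (in parallel, in reality)
# and the coordinator concatenates the per-partition results.
result = []
for p in range(NUM_PARTITIONS):
    result.extend(local_join(a_parts[p], b_parts[p]))

assert sorted(result) == [(1, "a1", "b1"), (5, "a5", "b5")]
```

Since matching keys always hash to the same partition, no pair of partitions on different nodes can contain matching rows, which is why joining co-located pairs and concatenating is equivalent to the full join.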

Ideally, tables are co-partitioned and partitions are co-located in advance.
The article describes scenarios in which the tables need to be repartitioned
during the query, which can cause a lot of network traffic. However, the
amount of traffic is limited in practice since filters and projections can be
applied before shuffling any data.

The great thing about implementing a distributed database on top of PostgreSQL
is that the local machinery already exists, and it is very good.

------
dharbin
Should Citus reevaluate cstore_fdw in light of this extensions API? I wonder
if a foreign data wrapper is similarly ill-fitted for this application.

~~~
mslot
For basic reads and writes the FDW APIs are quite suitable and they work well
for cstore_fdw. However, to get the most out of columnar storage you also need
vectorized execution that can calculate aggregates on blocks of values. The
FDW APIs are not well-suited for this purpose, since they pull one row of data
at a time. We've done some experiments in which we used the executor hooks to
implement a vectorized executor for cstore_fdw [1] and it can give dramatic
performance gains. However, the SQL support of this solution is limited as the
APIs force you to build a new executor from scratch.
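
A toy Python contrast of the two execution styles for a SUM aggregate. This only models the call pattern (one row per call vs. one block per call); block size and data are made up, and real performance differences come from C-level per-tuple overhead:

```python
# Row-at-a-time vs. vectorized aggregation of SUM(x), sketched in Python.
values = list(range(1_000_000))
BLOCK = 8192  # rows per block of column values (illustrative size)

def sum_row_at_a_time(rows):
    # FDW-style: one "call" per row, so per-tuple overhead dominates.
    total = 0
    for row in rows:  # each iteration models one IterateForeignScan call
        total += row
    return total

def sum_vectorized(rows):
    # Vectorized: aggregate a whole block of column values per call.
    total = 0
    for i in range(0, len(rows), BLOCK):
        total += sum(rows[i:i + BLOCK])  # one tight inner loop per block
    return total

assert sum_row_at_a_time(values) == sum_vectorized(values)
```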

With sharding, the distributed executor can work on the SQL level and
translate one SQL query to many SQL queries. In that case your distributed
executor can focus on issues such as partitioning and merging data, while the
low-level work is still being done by the regular PostgreSQL executor, so you
don't have to implement it from scratch.
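
A sketch of the "one SQL query to many SQL queries" rewrite in Python. The shard names and the count(*) merge rule are illustrative, not Citus internals:

```python
# One logical query fans out to per-shard queries; results merge upstream.
# Shard names are invented for illustration.
SHARD_NAMES = {"orders": ["orders_102008", "orders_102009", "orders_102010"]}

def rewrite_count(table):
    # Map the logical table name to one query per physical shard.
    return [f"SELECT count(*) FROM {shard}" for shard in SHARD_NAMES[table]]

def merge_counts(per_shard_counts):
    # count(*) merges by summing the partial counts from each shard.
    return sum(per_shard_counts)

shard_queries = rewrite_count("orders")
# Each shard query runs on a regular PostgreSQL executor on its node;
# pretend these are the per-shard results coming back:
total = merge_counts([10, 20, 12])
assert total == 42
```

Note that each per-shard string is itself plain SQL, which is the point: the distributed executor handles fan-out and merging while PostgreSQL's executor does the low-level work.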

[1]
[https://github.com/citusdata/postgres_vectorization_test](https://github.com/citusdata/postgres_vectorization_test)

------
rch
> The problem was that we were trying to retrofit [the Foreign Data Wrapper]
> API for something that it was fundamentally not designed to do.

> The PostgreSQL Extension APIs ... enable you to extend, cooperate with, or
> override any database submodule's behavior. We could therefore change our
> design to use the planner and executor hook APIs, and we found that things
> followed nicely from there.

~~~
tinco
Unfortunately, this comment really is a good summary of the article. I guess
it makes marketing sense to split big blog posts in two, but as it stands the
article is vapid, when it could easily have been the article that makes me
consider CitusDB over MongoDB for a particular application.

~~~
rch
To be fair, the details of _why_ statement one is true are still interesting
though.

