
Scaling out Postgres, Part 1: Partitioning - craigkerstiens
http://petrohi.me/post/30848036722/scaling-out-postgres-partitioning
======
simple10
Craig, thanks for sharing Peter's article. As a side note, we started on
Heroku (on the Cedar stack, and we actually worked out of the Heroku office
-- we're big fans). We're now hosting with Liquid Web on dedicated db boxes
due to our app's large RAM and IO speed requirements.

We also initially wrote the app to use a mix of Postgres and MongoDB but ended
up moving everything into Postgres.

~~~
stoked
What were the reasons for moving away from MongoDB?

~~~
simple10
Our app does a lot of writing. We were on Mongo 1.8, which has a global write
lock, so concurrency was a big problem under load. We also have a lot of users
with networks of 200k+ people, not even counting followers. This meant we had
to store arrays of 200k+ IDs in the User collection to run a lot of our
queries. The RAM needed for those indexes wasn't really worth the cost, and
performance started to become a big concern. In the end, our data makes more
sense in SQL with a cache layer (Redis in our case) than in NoSQL.
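
To make that concrete: the Mongo version kept a 200k-element array per user,
while in Postgres it's one row per edge. A rough sketch, with made-up table
and column names:

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical DSN
    cur = conn.cursor()

    # one row per network edge instead of a giant array embedded
    # in the User document
    cur.execute("""
        CREATE TABLE IF NOT EXISTS network_members (
            user_id   bigint NOT NULL,
            member_id bigint NOT NULL,
            PRIMARY KEY (user_id, member_id)
        )
    """)

    # "is member 1337 in user 42's network?" is now a B-tree lookup
    cur.execute(
        "SELECT EXISTS (SELECT 1 FROM network_members"
        " WHERE user_id = %s AND member_id = %s)",
        (42, 1337),
    )
    in_network = cur.fetchone()[0]
    conn.commit()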

Our experience thus far in Mongo vs Postgres: go with what you know. We
probably could have made Mongo work for us (especially Mongo 2.0+), but it was
easier for us to work out the performance issues with Postgres. Plus, things
made more sense in SQL.

If you're using Mongo and need a cloud provider, I can recommend the people at
MongoLab. They went above and beyond helping us get up and running, and
they're really cool people.

~~~
stoked
Thanks for the summary.

------
rbranson
Maybe I'm not understanding something, but there seems to be an implicit
assumption that the message broker (RabbitMQ) is more reliable than the
database. This seems to be a common fallacy.

~~~
petrohi
Yes, I can see where you're coming from. The message queue in the article is
more of an abstract service that provides certain guarantees than a specific
RabbitMQ implementation. There are examples of message queue implementations
built on top of a database--the resulting tradeoffs can be an interesting
discussion.
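
A rough sketch of what that looks like on Postgres (table and function names
are made up; SKIP LOCKED needs Postgres 9.5+):

    import psycopg2

    # assumes: CREATE TABLE jobs (id bigserial PRIMARY KEY,
    #                             payload text NOT NULL)
    conn = psycopg2.connect("dbname=app")  # hypothetical DSN

    def enqueue(payload):
        # the payoff over an external broker: this can run in the
        # same transaction as the business write it announces
        with conn, conn.cursor() as cur:
            cur.execute("INSERT INTO jobs (payload) VALUES (%s)",
                        (payload,))

    def dequeue():
        # SKIP LOCKED lets concurrent workers claim different rows
        # instead of serializing on the head of the queue
        with conn, conn.cursor() as cur:
            cur.execute("""
                DELETE FROM jobs
                WHERE id = (
                    SELECT id FROM jobs
                    ORDER BY id
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED
                )
                RETURNING payload
            """)
            row = cur.fetchone()
            return row[0] if row else None

The obvious tradeoff: the queue inherits the database's durability and
transactional guarantees, but its churn lands on the very box you're trying
to protect.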

------
bbromhead
Glad to see you are having "good problems".

Didn't you like parallel queries using pgpool2?

~~~
petrohi
We are planning to use pgpool2, but only for connection pooling. In our case,
the subset of Postgres databases that execute a distributed query depends on
the input parameters, the current sharding configuration, and cached routes.
Cached routes are how we handle adding new shards without repartitioning--I
will cover this in the second part.
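
Very roughly, the routing looks something like this (a simplified sketch with
made-up names; the real scheme is what part two will cover):

    import hashlib

    SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical shard DSNs

    # routes are cached when a key is first placed, so existing keys
    # keep resolving to their original shard even after SHARDS grows;
    # in practice the cache would live somewhere persistent
    route_cache = {}

    def shard_for(key):
        if key in route_cache:
            return route_cache[key]
        # only new keys hash against the current shard count
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        shard = SHARDS[h % len(SHARDS)]
        route_cache[key] = shard
        return shard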

I guess we could use pgpool2 parallel queries, but this would likely require
some adjustments in how they are implemented.

------
taligent
Having to implement your own sharding in your application layer for a database
with 3 tables is ridiculous.

No wonder people are moving to products like Cassandra, HBase, MongoDB,
CouchDB etc which automatically and transparently shard across as many nodes
as you like.

~~~
rekoros
I don't have any experience with HBase and CouchDB, but Cassandra and MongoDB
will only shard automatically and transparently if your application does not
do much writing. :-P

Also, the number of tables has nothing to do with the problem of sharding;
it's the amount of data you're trying to store.

~~~
taligent
I suppose you would have an issue with MongoDB specifically if you were trying
to rebalance from day one. But generally you wouldn't. I know Foursquare is
using MongoDB's Auto Sharding in production and they would definitely have a
lot of writes.

And the number of tables DOES matter for sharding if you are implementing it
yourself.

~~~
rekoros
Foursquare's experience with sharding MongoDB has, as far as I understand,
been far from automatic and transparent (based on their blog posts and the
word on the street).

~~~
taligent
I've heard of Foursquare having a lot of problems with MongoDB's terrible
behaviour once your indexes grow above 95% of the machine's memory.

Never with the sharding part, though.

