
Migrating from RethinkDB to Postgres – An Experience Report - okket
https://medium.com/fuzzy-sharp/migrating-to-postgres-2dc1519a6dc7
======
firasd
Interesting quote: "we decided to compute all statistics on demand. This was
something we previously tried in RethinkDB, but the results were not good...
When we tried implementing statistics as SQL queries in Postgres, we were
amazed by the performance. We could implement complex statistics involving
data from many tables."

I think the standard line "use the right tool for the job" is still the
ultimate answer. Data in most applications is relational, and you need to
query it in ways that weren't anticipated at the beginning, hence the
longevity of SQL.

That said, I too often see HN commenters say something like "this data was
only 100 GB? Why didn't they just put it in Postgres?", which is not as clever
as the writer may think. Try doing text search on a few million SQL rows, or
generating product recommendations, or finding trending topics...
Elasticsearch and other 'big data' tools will do it much quicker than SQL
because it's a different category of problem. It's not about the data size,
it's about the type of processing required. (Edited my last line here a bit
based on replies below.)

~~~
qaq
We do exactly what you describe on an ~80TB dataset stored across a number of
PG instances. The 100GB comments are extremely clever, because running a
query against a dataset that fully fits in RAM will be blazingly fast. "Try
doing text search on a few million SQL rows" - we are doing it on billions of
rows.

~~~
firasd
Do you have FULLTEXT indexes on those rows? I find it hard to believe that
searching for, say, a fragment of text in YouTube comments can be as fast in
a SQL system (even in RAM) as in Elasticsearch.

As for the other stuff I mentioned (recommendations, etc.), I'm not just
basing it on my personal experience--here's a write-up from Pinterest about
having to dump all their MySQL data into Hadoop to drive various types of
analysis. I doubt they would do it if just putting the SQL DBs in RAM was
adequate!
[https://medium.com/@Pinterest_Engineering/tracker-ingesting-mysql-data-at-scale-part-1-424cf43fa7c3](https://medium.com/@Pinterest_Engineering/tracker-ingesting-mysql-data-at-scale-part-1-424cf43fa7c3)

~~~
brightball
PG is something of a different beast too. With MySQL the standard mode of
operation for most non-basic functionality is to dump the data in another
tool.

Full text with Postgres is pretty fantastic and configurable. Not putting the
data somewhere else keeps you from having to maintain a second system, keep
things in sync, etc.

People jump to putting data in a search tool because it's a search tool
waaaaay too quickly, IMHO. If the use case justifies it, go for it... but
don't add unnecessary complexity unless you have to.
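
For what it's worth, the built-in Postgres full-text search described above
needs little more than an expression index. A minimal sketch (the table and
column names here are hypothetical):

```sql
-- Hypothetical table of user comments.
CREATE TABLE comments (
    id   bigserial PRIMARY KEY,
    body text NOT NULL
);

-- A GIN index over a tsvector expression makes text search fast
-- without maintaining a separate search system.
CREATE INDEX comments_fts_idx
    ON comments USING GIN (to_tsvector('english', body));

-- Queries must use the same expression so the planner can use the index.
SELECT id, body
  FROM comments
 WHERE to_tsvector('english', body)
       @@ plainto_tsquery('english', 'postgres migration');
```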

~~~
ris
> Not putting the data somewhere else keeps you from having to maintain a
> second system, keep things in sync, etc.

To be fair there is some amount of "keeping things in sync" one has to do with
postgres, even if it's just setting up the right triggers to update the FTS
index on updates.
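
For illustration, that trigger-based syncing can use the
`tsvector_update_trigger` helper that ships with Postgres; the schema below
is hypothetical:

```sql
CREATE TABLE posts (
    id    bigserial PRIMARY KEY,
    title text,
    body  text,
    tsv   tsvector
);

CREATE INDEX posts_tsv_idx ON posts USING GIN (tsv);

-- Rebuilds the tsv column from title and body on every INSERT/UPDATE,
-- inside the same transaction as the write itself.
CREATE TRIGGER posts_tsv_update
    BEFORE INSERT OR UPDATE ON posts
    FOR EACH ROW
    EXECUTE PROCEDURE
        tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
```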

~~~
brightball
True, but it's transaction-safe, so it should be impossible for it to get out
of sync, which is what I was really getting at.

------
weeksie
Document databases are for data you know will be hierarchical and that you
know you won't have to do much cross-referencing on. That's not as broadly
useful as it might sound, and once requirements change you end up
reimplementing an RDBMS in your NoSQL. I don't hate doc dbs, I just don't
think they're as generally useful as they're made out to be.

Relational is well understood. The modeling problem is well understood so you
don't have to guess too much about how your data should be structured. By
default I start projects with an RDBMS and then carve out the portions that
truly are hierarchy-only.

That said, I'm starting to investigate graph databases more closely and what I
like about them is what I like about relational: queryability. It's so easy to
write a short query and extract your data. I like the model quite a bit too
because graphs are a pain to shoehorn into a relational db. I'm still not
entirely sold but I would love to find more good resources on graph dbs.

~~~
philliphaydon
But data you store in a relational database is also hierarchical...

Order / OrderLine

In a document db we store it as a single document because an order is the root
aggregate and the order line cannot exist without an order.

It relates to other objects in the database in the sense that it may relate
back to a User...

We hack together our "understood" objects to fit into a relational database.
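
For comparison, the relational version of that same aggregate is just a
foreign key, and a cascade encodes the "cannot exist without an order" rule.
A sketch with a hypothetical schema:

```sql
CREATE TABLE orders (
    id      bigserial PRIMARY KEY,
    user_id bigint NOT NULL   -- the "relates back to a User" part
);

-- The hierarchy becomes a foreign key; ON DELETE CASCADE encodes
-- "an order line cannot exist without an order".
CREATE TABLE order_lines (
    id       bigserial PRIMARY KEY,
    order_id bigint NOT NULL REFERENCES orders (id) ON DELETE CASCADE,
    sku      text   NOT NULL,
    quantity int    NOT NULL
);
```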

------
sagefy
Interesting timing... I literally just finished moving from RethinkDB to
Postgres myself
[https://github.com/heiskr/sagefy](https://github.com/heiskr/sagefy) .

When I was first building the project, I didn't totally know what schema I
would need. I started on Postgres. But having to constantly do database
migrations while I was developing was a pain. RethinkDB was great for
development while I was figuring out what schema I needed. But now that the
website [https://sagefy.org](https://sagefy.org) is live, the schema has
started to stabilize.

I'm not using perfect third normal form; in many cases I have no query needs,
so JSONB makes sense there. JSONB columns can actually remove much of the
need for NoSQL. The biggest gains from moving were not having to do foreign
key validations myself (yay!) and about a 2/3 reduction in database file
size. My database isn't large enough for performance to matter yet, but I'm
sure it would eventually.
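
The JSONB pattern described above might look something like this (the table
and column names are made up): stable, frequently queried fields get real
columns, and the rest lives in a JSONB blob that can still be queried and
indexed:

```sql
CREATE TABLE units (
    id     bigserial PRIMARY KEY,
    name   text  NOT NULL,
    extras jsonb NOT NULL DEFAULT '{}'
);

-- A GIN index supports containment queries on the JSONB column.
CREATE INDEX units_extras_idx ON units USING GIN (extras);

-- Find units whose extras contain {"language": "en"}.
SELECT id, name
  FROM units
 WHERE extras @> '{"language": "en"}';
```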

If I were to do this again, I would probably start with a NoSQL database
(maybe Redis or ES, or even just files) until I figured out the schema, and
then move to Postgres before launching. I'm sure there are smarter people out
there who can start with SQL and easily predict what the feature set is going
to be before they start, but I don't usually have that foresight. `ALTER
TABLE ...` gets old really quickly.

I'm keeping an eye on CockroachDB too... if they can really make something
similar to Postgres that scales horizontally easily... that would be amazing.

------
gilbetron
After working on a big data time-series project at work for the past several
years, and spending a big chunk of that time investigating and testing
various DBs, whenever I am asked "What database should I use for X?" I say
Postgres, unless you have a well-researched and proven reason not to. I
picked Druid for our project and am very happy with the result. However, for
some of our smaller customers, Postgres would have been a much better choice.
It's only because most of our big, important customers have data needs
starting at petabytes that I'm happy we went with Druid. Well, that and a
bunch of other reasons ;)

Much of the "reasoning" I see for people choosing Elasticsearch or InfluxDB or
MongoDB seems to come down to, "It showed up a lot on Hackernews".

~~~
Chickenosaurus
I am currently working on a thesis paper evaluating persistence technologies
regarding their fitness for a petabyte-scale sensor data analytics product. I
am interested to learn more about your experience with Druid.

I would be grateful if you contact me at 8nvyve+wj7zvh7q5e0@sharklasers.com
(it's a throwaway email as I avoid posting my private email publicly). Thank
you.

------
thecopy
I am beginning to see a trend on HN. Teams are moving away from new and young
databases to old and "boring" databases as their need for stability and
correctness increases.

~~~
dm3
Would be interesting to hear how they arrived at the RethinkDB/Elasticsearch
combination over Postgres at the start of the project. It doesn't seem they
lost any features after the transition. Was something lacking in Postgres at
the time of initial implementation?

~~~
chris_st
However, they do say:

> _leaning heavily on Haskell in order to fill in some of the gaps quickly._

So it looks like they had to do some work to cover some features from
rethinkdb/elasticsearch.

~~~
fmap
The code in the post is just used to migrate from an untyped interface to a
typed one. Unless I'm missing something, there seems to be no mention of any
missing features in postgres.

~~~
Yeroc
They did mention missing JSON schema validation in Postgres.

~~~
billrobertson42
And also many mentions of no schema validation in Rethink, so that's probably
not a gap.

------
hmottestad
"Postgres is basically going to be around forever." My favourite quote from
the article, and a really great point. Even with RethinkDB going open source,
there's only been one single release in the last year, and it was a bug fix
release.

~~~
noncoml
With RethinkDB and Basho going out of business, it has been an interesting
year for databases. I think we are seeing the dawn of a new golden era, with
the world converging on 3-4 databases: Postgres, Cassandra, Mongo, and
Elasticsearch.

~~~
paxy
MySQL still has massive usage, and isn't going anywhere.

~~~
noncoml
Agreed, I meant other than Oracle, MySQL/MariaDB and MSSQL, which are the top
3 anyway.

------
coolsunglasses
I talked to Phil about this migration; I maintain two of the libraries
mentioned in the post. I mentioned the Elasticsearch client (which I actually
wrote, unlike the SQL library) in this comment:
[https://news.ycombinator.com/item?id=15241886](https://news.ycombinator.com/item?id=15241886)

Maintaining each library has enhanced my appreciation and respect for how the
Postgres people do their thing.

------
jeyoor
In addition to the Postgres knowledge, this post also provided me with
tremendous help in building a sane data architecture for RESTful web services
in Haskell.

------
_asummers
We solved this differently. We needed to maintain a two-DB solution (a
"two-state solution", as some of my team have called it). For this, we took
advantage of RethinkDB's changefeeds to deal with the foreign key
constraints. We have some Elixir supervisors that link each changefeed to its
direct dependencies. So, e.g., a Comment can't exist without a Post, so the
Comment changefeeds don't start until the Posts insert changefeed has caught
up with at least upserting the IDs. We then have a follow-up task that makes
sure the data is also up to date. This has allowed us to keep our FKs while
allowing functionality to be moved over as we needed. As new_app became the
writer, the changefeed was removed, and the process continued. If one dies
for whatever reason, it kills the dependent changefeed listener processes and
the VM restarts the process at that point.

------
elvinyung
I really wish there was column-oriented storage native to Postgres. My
impression is that people are forced to abandon Postgres for analytical
processing way too early, at relatively small scales.

~~~
anarazel
FWIW, I think column- vs. row-store is currently not on the critical path to
making Postgres more suitable for analytical query processing. The CPU
bottlenecks are just elsewhere - and we (various Postgres devs) are working
on them. The storage density benefits of column stores can currently be
partially achieved by using a filesystem that does compression - far from
great, but not disastrous either.

~~~
elvinyung
I thought the bigger advantage was that you could do sequential-only I/O on
projections?

Or are you saying that you're only CPU-bottlenecked?

~~~
anarazel
> I thought the bigger advantage was that you could do sequential-only I/O on
> projections?

Not quite following - a column store will usually not have more sequential IO
than a row store. Often it's the contrary, because you have to combine
column[-groups] for some queries. What you get is: higher compression ratios,
better IO & cache access patterns for filter-heavy queries, and computations
that are easier to vectorize. Especially if you filter heavily or aggregate
only a few rows, you can do a lot less IO in total, but the sequential-ness
doesn't really improve.

> Or are you saying that you're only CPU-bottlenecked?

Oftentimes, yes. You might be storage _space_ constrained, but storage speeds
for individual sequential-IO type queries are usually fast enough.
Parallelism helps with that (if you can push down enough work; a lot of it
was added in 9.6 & 10), as do plain old code optimizations (better hash
tables and the new expression evaluation framework, both in 10) and JITing
parts of query processing (WIP, patches posted for 11).

------
schuyler2d
Wow, we had to do the same thing recently. Rather than replace all the model
queries across a complex codebase, we created a Rethink-to-RDBMS (knex.js in
Node) adapter layer, which gave us a nice 80/20 solution.

[https://github.com/MoveOnOrg/rethink-knex-adapter](https://github.com/MoveOnOrg/rethink-knex-adapter)

------
cyberferret
> Rethink is no longer commercially maintained.

Just wondering, seeing as I am not a Postgres user - is Postgres commercially
maintained? Or is it pure open source (like I believe RethinkDB is
currently)?

~~~
sbuttgereit
There is no single company behind PostgreSQL; there's a foundation which
manages development, but there are a number of well-established businesses
offering commercial-grade support. Some of these companies employ some of the
major contributors to PostgreSQL.

~~~
rwmj
Wasn't there in fact an agreement amongst the PostgreSQL contributors that
they'd never all go and work for the same company?

~~~
sbuttgereit
I'm not sure (others here, naturally, may be in the know). Having said that,
there do seem to be some larger concentrations of them in a few companies. I
don't keep up too much with that either, but it seems like EnterpriseDB was in
this camp (and logically so).

I do think there are enough companies now that develop PostgreSQL based
products (EnterpriseDB, Citus Data, Greenplum, etc) along with the larger
PostgreSQL consultancies which would probably raise their hand if they thought
one player or another was becoming dominant in some way hostile to the others.

All of this is outside observer speculation, but it's the way I read the tea
leaves.

~~~
anarazel
> Wasn't there in fact an agreement amongst the PostgreSQL contributors that
> they'd never all go and work for the same company?

At least some of us have that understanding - and looking at where at least
the committers work, it's fairly well distributed.

> Having said that, there do seem to be some larger concentrations of them in
> a few companies. I don't keep up too much with that either, but it seems
> like EnterpriseDB was in this camp (and logically so).

There is some concentration, but if you look at the list of committers (a
smaller number, so I don't have to look up affiliations), they're fairly well
distributed across the larger players (alphabetically 2ndQuadrant, Crunchy
Data, EnterpriseDB) and various other orgs with some.

------
fiokoden
Postgres is always the answer.

~~~
virmundi
No it's not. It's question dependent. "Would you like a burrito?" "Postgres".
See? It doesn't work.

~~~
jacquesc
Works for me. I'd probably lose weight if I started using that answer.

------
yipopov
I've had a prodigal son moment with Postgres myself. I found that even for
many tasks that seem to favour NoSQL (time series, etc.) you get a lot more
bang for the buck in terms of performance for the same resource usage with
Postgres.

