
How to Migrate 50M Records from Mongo to PostgreSQL in Less Than a Day - zenogais
http://tech.plug.dj/2014/09/24/how-to-migrate-50mm-records-from-mongo-to-postgresql-in-less-than-a-day/
======
fleitz
Umm... Isn't a 50 million row export / import usually an hour-long task...
tops? I mean, assuming you have the scripts?

Dump the first DB, import into the second? Maybe a couple ETL scripts in
between to massage the data?

This task sounds like mongodump export_db (or whatever they call their
backup tool) | etl.sh | psql -d import_db. Of course shell scripts probably
aren't 'webscale' enough to handle 150 GB of data.

Protip: When IO is saturated, make the IO sequential before ever even thinking
about threads. Exactly how are more CPUs supposed to make the disk faster? The
best you'll do is exploit the command queue more fully.

~~~
rjbrock
It does seem like it should be faster, but this isn't a vanilla export/import.
He is going from unstructured to (I assume) structured data, so he needs to
deduplicate and possibly do other operations on the data before importing.

~~~
collyw
What about importing the unstructured data into one table in Postgres and
querying that table to populate a structured schema on the same Postgres
server? It shouldn't be as IO bound, as it's copying data from one Postgres
table to another rather than from Mongo. Can anyone tell me if that would
likely bring a performance improvement?
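
A rough sketch of that staging-table approach, assuming the Mongo data has
been exported as one JSON document per line; the table and key names here are
invented for illustration:

    -- Staging table holding the raw documents as json (PostgreSQL 9.3+)
    CREATE TABLE staging_playlists (doc json);
    
    -- Bulk-load the export; assumes the file contains no characters that need
    -- COPY escaping (otherwise use csv mode or a small loader script)
    COPY staging_playlists (doc) FROM '/tmp/playlists.json';
    
    -- Populate the structured schema from the staging table
    INSERT INTO playlists (name, owner_id)
    SELECT doc->>'name', doc->>'owner_id'
    FROM staging_playlists;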

~~~
fleitz
Probably. When dealing with IO-bound problems, it's important to think of IO
as a really long, extremely fat pipe. Think oil pipeline.

The key is keeping the pipe full; requesting a piece of data based on what's
already in the pipe makes it hard to keep the pipe full. You want to request
the data long before you're ever going to need it.

If your SQL queries are written so that they don't depend on data in the pipe,
they'll be fast; if they aren't, you won't be much better off than whatever
was already in place.

------
icebraining
This requires you to wait for the whole batch to complete, instead of starting
a new process as soon as one finishes.

For these cases, I started using GNU parallel instead of plain shell. It does
the heavy concurrency lifting for you, and even allows almost seamless
parallelization across different machines.

For example, your final code could be written like this:

    parallel ./settings.sh python bin/migrate_media.py {} $LIMIT ::: $(seq 0 $LIMIT $NUMPLAYLISTS)

[http://www.gnu.org/software/parallel/](http://www.gnu.org/software/parallel/)

~~~
zenogais
This is great! Appreciate the tip

------
rektide
Seems like this would've been a completely ideal place to rock some mongo_fdw,
which would give postgres the ability to query and extract data directly from
mongo.
[https://github.com/citusdata/mongo_fdw](https://github.com/citusdata/mongo_fdw)
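
For reference, a minimal mongo_fdw setup might look roughly like this
(database, collection, and column names are placeholders):

    CREATE EXTENSION mongo_fdw;
    
    CREATE SERVER mongo_server FOREIGN DATA WRAPPER mongo_fdw
        OPTIONS (address '127.0.0.1', port '27017');
    
    -- _id plus whatever fields you want exposed from the collection
    CREATE FOREIGN TABLE mongo_playlists (
        _id      NAME,
        name     TEXT,
        owner_id TEXT
    ) SERVER mongo_server
      OPTIONS (database 'plugdj', collection 'playlists');
    
    -- Fill the real table directly from Mongo, no intermediate script
    INSERT INTO playlists (name, owner_id)
    SELECT name, owner_id FROM mongo_playlists;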

~~~
rjbrock
That adds quite a bit of latency to your database, which is already frequently
the bottleneck.

In reality, most of what Mongo does can be done just as well if not better in
PostgreSQL 9.3 (and I hear good things about 9.4). Then you can have
structured data via Postgres at the same time.

~~~
Sanddancer
I don't think the parent's talking about doing all operations via the fdw, but
rather using the foreign data wrapper to fill the table. So instead of pulling
all the data into a python script and then pumping it out into the postgresql
database, you're using postgresql itself to do the fill operation.
Additionally, by using the foreign data wrapper, you could also create a
materialized view based on the results, if you still want most activities
using the mongodb instances for whatever reason.
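
A sketch of that idea, assuming a foreign table like the mongo_playlists
example above has been defined:

    -- Snapshot the Mongo collection inside Postgres (9.3+)
    CREATE MATERIALIZED VIEW playlists_snapshot AS
    SELECT _id, name, owner_id FROM mongo_playlists;
    
    -- Re-pull from Mongo on demand while it remains the primary store
    REFRESH MATERIALIZED VIEW playlists_snapshot;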

------
jtwebman
This is a perfect example of why doing something row by row instead of in data
sets is bad for anything with over a million rows. Bulk dump / copy load, then
run the rules to clean up the data.

~~~
zenogais
This would have been great, but the lack of a 1-to-1 mapping between data
models complicated this. To migrate a single "record" required several queries
in the old database, some restructuring, and then several inserts into the new
database. We were restructuring nested records in mongo into several separate
relations in Postgres.

~~~
jtwebman
You could still copy the data raw and then run selecting inserts to load the
real tables. This could have been done in under an hour, I am sure of it.

~~~
zenogais
Would honestly be great to know about, but how do you raw copy a nested mongo
document? The big hurdle was the lack of 1-to-1 mapping between primitive
types in the databases.

~~~
zenogais
Totally agree that would've likely worked too, though I'm not convinced it
would have been significantly faster doing a dump, import, and data
processing. Dumps from mongo typically took upwards of two hours not to
mention transferring the file and then importing it then processing the data.
Just a different way to skin the cat.
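
For anyone curious what "copy raw, then run selecting inserts" might look like
for nested documents, here is a sketch assuming the dump was first loaded into
a single json column; the tables and keys are invented for illustration:

    -- raw_docs(doc json) was bulk-loaded from a mongoexport dump
    -- Top-level fields, deduplicated at insert time
    INSERT INTO playlists (mongo_id, name)
    SELECT DISTINCT doc->>'_id', doc->>'name'
    FROM raw_docs;
    
    -- Nested arrays unnest into a child relation (implicit LATERAL, 9.3+)
    INSERT INTO playlist_items (playlist_mongo_id, media_cid)
    SELECT doc->>'_id', item->>'cid'
    FROM raw_docs,
         json_array_elements(doc->'items') AS item;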

------
sokoloff
"We were mostly I/O bound so we knew we’d like to use gevent’s quick task
switching so we could write the maximal amount of data per-process. A single
process really constrained us to one vCPU. Therefore, to speed it up meant
running more than one process."

Wait. What?! The response to being I/O bound is not using more processes...

~~~
U2EF1
So he really means they were "waiting for IO" bound.

------
theforgottenone
Why use Python for real work?

For reference, I was handling 30M daily records in about an hour on hardware
from 15 years ago. This performance is absurd.

------
dr_faustus
I'd be more interested in why you moved from Mongo to Postgres.

~~~
fein
Probably because postgres can do what mongo does just as well, and is a full
RDBMS.

Take a look at postgres hstore with a gin index.
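
A minimal example of what that looks like (table and keys are made up):

    CREATE EXTENSION hstore;
    
    -- Document-ish data in a single hstore column
    CREATE TABLE media_hs (
        id   serial PRIMARY KEY,
        data hstore
    );
    
    CREATE INDEX media_hs_data_idx ON media_hs USING gin (data);
    
    -- Containment queries like this can use the GIN index
    SELECT * FROM media_hs WHERE data @> 'format=>mp3';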

~~~
bjt
hstore isn't really as flexible as a Mongo document, given that both keys and
values must be strings, and you can't query into a document nested inside an
hstore value.

Now, with the jsonb support coming in 9.4, PG gets pretty much everything I
once liked Mongo for.

[http://www.postgresql.org/message-id/E1WRpmB-0002et-MT@gemul...](http://www.postgresql.org/message-id/E1WRpmB-0002et-MT@gemulon.postgresql.org)
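
Roughly, once 9.4 lands, querying into nested documents looks like this
(names invented for illustration):

    -- jsonb column with a GIN index
    CREATE TABLE media_docs (
        id  serial PRIMARY KEY,
        doc jsonb
    );
    
    CREATE INDEX media_docs_doc_idx ON media_docs USING gin (doc);
    
    -- Query into a nested document; @> can use the index
    SELECT * FROM media_docs
    WHERE doc @> '{"author": {"name": "someuser"}}';
    
    -- Pull out a nested value by path
    SELECT doc #> '{author,name}' FROM media_docs WHERE id = 1;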

------
errorrrr
Somewhere around 2004 we pumped 5 million customers & their product data into
Oracle (basically a whole telecom company's customer portfolio, quite
complicated relational data) in around half an hour. Table partitioning and a
tiny Java program to manage the parallel PL/SQL scripts and the transitions
between migration steps.

