
Show HN: ToroDB Stampede: Automagical MongoDB to PostgreSQL, 100x Faster Queries - ahachete
https://www.torodb.com/stampede/
======
ahachete
Hi, Álvaro here, from 8Kdata, the company behind ToroDB.

Please let us know if you have any questions or comments about ToroDB. We will
be happy to answer them :)

Enjoy!

~~~
javiermaestro
Hi there!

Stampede looks nice, congrats!

Quick question about the 100x performance claim and the benchmarks in
[https://www.8kdata.com/blog/announcing-torodb-stampede-1-0-b...](https://www.8kdata.com/blog/announcing-torodb-stampede-1-0-beta/):

- I don't see any specs and/or methodology published for any of the
benchmarks. I'd like to see some specs for the servers used (especially RAM
and what HDD or storage type was used).

- For the 500GB dataset, I assume that it didn't fit in memory, but the 100GB
one could "easily" fit in memory. I'd like to see if that's the case, and how
it compares when the dataset is entirely in memory (I'm anticipating that
MongoDB still sucks big time, but it's nice to see a clear apples-to-apples
comparison).

Anyway, kudos for the awesome work!

~~~
gortiz
Hi javiermaestro!

All benchmarks were run on AWS i2.xlarge instances, which have 4 vCPUs,
30GB of RAM and an 800GB local SSD.

It is true that we are not doing an apples-to-apples comparison on the 100GB
set, but I think it's just the opposite! We are helping MongoDB by giving it
three machines (and therefore 90GB of RAM!). This is especially true when the
benchmark uses indexes, because in that case the index can always stay in RAM.

MongoDB is very fast when it has to retrieve a single document because, by
design, it has amazing spatial locality (the whole document is usually on the
same page). But this feature is a weakness for aggregation queries, as they
usually only care about a small subset of each document. ToroDB Stampede
changes that by storing your data in a relational way. Of course, as you said,
MongoDB's performance is horrible when it has to fetch documents from disk,
but even if the documents are in memory, the same effect is expected (on
aggregation queries) when data has to be moved to the CPU caches.
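
The intuition can be sketched in a few lines (using SQLite here as a stand-in, with a hypothetical schema; Stampede's actual generated layout differs):

```python
import sqlite3

# A nested document like
#   {"name": "a", "amount": 10.0, "address": {"city": "x", "zip": "1"}}
# can be split across tables, one per nesting level. An aggregation over
# "amount" then only scans the small top-level table and never touches
# the address data, which in MongoDB shares the document's page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (did INTEGER PRIMARY KEY, name TEXT, amount REAL)")
conn.execute("CREATE TABLE orders_address (did INTEGER, city TEXT, zip TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)])
conn.executemany("INSERT INTO orders_address VALUES (?, ?, ?)",
                 [(1, "x", "1"), (2, "y", "2"), (3, "z", "3")])

# The aggregation reads only the top-level table:
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 60.0
```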

~~~
javiermaestro
LOL I re-read the article and found the specs. I really read it but somehow
managed to skip the paragraph or something :-?

Anyway, my point stands. I'd use a single instance in which the dataset fits
in memory, just for completeness. Then, you can compare one mongo with the
full dataset to one stampede.

As you said, the aggregated data will still make Mongo suffer, but it will be
a better comparison. I still like the 3-shard setup, though. It's also a good
reference point.

~~~
ahachete
So far we have benchmarked situations where dataset > RAM or >> RAM. I think
it is an interesting point to also analyze the case when dataset < RAM, to see
how efficiently both systems manage the caches, query planning etc. Stay tuned
and thanks for the suggestion! :)

------
postila
It's so funny to sometimes hear that NoSQL is better than SQL databases, and
then see how much more powerful tools can be built using good old relational
engines.

Also, this story from a year ago,
[https://www.linkedin.com/pulse/mongodb-32-now-powered-postgr...](https://www.linkedin.com/pulse/mongodb-32-now-powered-postgresql-john-de-goes),
was simply epic.

~~~
ahachete
Indeed, the MongoDB BI connector v1 was based on PostgreSQL. However, it used
9.4's foreign data wrappers, and since those could not push down query
clauses, the connector itself introduced a significant performance degradation
(on top of MongoDB's current performance difference when compared to
Stampede/PostgreSQL).

~~~
eb0la
I've tried that connector and decided against using it. The problem was that
you had to define your schema by hand, and in Mongo you sometimes have legacy
data records with different semantics.

In my case it crashed Tableau's data import after 45 minutes because it
blindly sent data that should have been NULLed.

Will take a look at the way Stampede does the transformation. It might suit
us.

~~~
gortiz
Great! We are looking forward to hearing about your experience! Open a ticket
on GitHub or email us if you find problems or think there is something we can
improve!

------
stephenr
FYI: I (and, I believe, a reasonable number of other software developers)
treat the word "Automatic" as a red flag. Magic means you don't understand how
something works. Auto(matic) means you don't control something.

Something you don't control and don't understand is dangerous IMO.

~~~
ahachete
Thanks for the feedback.

I agree; sometimes I think the same way myself. In this case, though, it
conveys a non-dangerous, powerful message: rather than having to design your
DDL, and then update that DDL every single time data with a different
structure appears in your source stream (MongoDB), that DDL is designed for
you. It is automatic because you don't need to do anything. It is "magic" in
the sense that ToroDB designs the DDL for you, in real time, and this is quite
disruptive IMHO.

Now, there's no danger: no data is ever lost or mapped incorrectly. And if you
don't like the generated DDL, just create some views and you're done! All data
shaped exactly as you want :)
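
As a quick sketch of that last point (SQLite stands in for PostgreSQL here, and the generated table and column names are hypothetical, not Stampede's actual naming):

```python
import sqlite3

# Suppose the auto-generated table has names you don't like.
# A view re-shapes it without touching the underlying data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_root (did INTEGER, name_s TEXT, age_i INTEGER)")
conn.execute("INSERT INTO customers_root VALUES (1, 'Ada', 36)")
conn.execute("""
    CREATE VIEW customers AS
    SELECT did AS id, name_s AS name, age_i AS age
    FROM customers_root
""")
row = conn.execute("SELECT id, name, age FROM customers").fetchone()
print(row)  # (1, 'Ada', 36)
```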

------
kasano
One of our engineers recently floated the idea of using ToroDB Stampede to
replace our MongoDB -> PostgreSQL ETL, since it's merely a set of Python
scripts parsing JSON into tables.

Have you seen use cases of Stampede being implemented on existing
databases/schemas, rather than an entirely new DB?

~~~
ahachete
Hi kasano.

One of the most relevant use cases for ToroDB Stampede is precisely what you
describe: replacing MongoDB-to-PostgreSQL ETLs, where you have to design the
schema, solve data type conflicts, maybe flatten or discard data, etc., and
deal with other problems like real-time replication and managing HA. Stampede
addresses all of these.

While most people may want to use an empty, dedicated database for Stampede,
it is not required to do so. Stampede will generate the tables under a schema
name that matches the collection name. So as long as there are no name
conflicts, you can happily have ToroDB-generated tables alongside your own.
Needless to say, this gives you the ability to JOIN information from different
data sources.
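
A minimal sketch of such a cross-source JOIN (SQLite stands in for PostgreSQL, and all table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Imagine this table was generated by Stampede from an "orders" collection:
conn.execute("CREATE TABLE orders (did INTEGER, customer_id INTEGER, amount REAL)")
# ...and this one is your own, loaded from another data source:
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 7, 10.0), (2, 7, 5.0)])
conn.execute("INSERT INTO customers VALUES (7, 'Ada')")

# One query joins replicated MongoDB data with native relational data:
rows = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name
""").fetchall()
print(rows)  # [('Ada', 15.0)]
```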

So sure, go and give it a try! :)

------
3manuek
Impressive performance, even compared against a Mongo shard with 3 nodes!

