
How to make MongoDB not suck for analytics - ayw
https://www.scaleapi.com/blog/athena#asdf
======
stickfigure
I tried using MongoDB for the customer-facing analytics of a large e-commerce
marketplace. It didn't work very well. The problem is that at some point you
end up wanting joins.

MongoDB was actually the third try. My first two attempts were BigQuery and
Keen, neither of which worked out because they support only one index - time.
Users want to slice and dice by various axes! And there's an obvious
additional index you need - "merchant" - which column stores usually propose
setting up isolated partitions for. If you do that, you can't ask questions
across the whole system!

We ended up with Postgres. It was actually faster than MongoDB for simple
aggregations, and joins made it much better/faster for complicated queries. Of
course it only works quickly if your dataset fits in RAM, but terabyte-size
instances are pretty affordable and give you a lot of headroom.

That was a couple years ago. I don't know what they're using now, probably the
same. It was a frantic few weeks figuring out what was going to work - each of
those systems made it to production and was quickly discovered to be inadequate in
vivo. If you're in a startup, even if you're using exotic NoSQL systems like
Google Cloud Datastore or DynamoDB - just use Postgres or MySQL for analytics.
It will work long enough for you to figure out something else when you need
it.

~~~
threeseed
You are completely contradicting yourself.

On one hand you complain about using technologies before you have done a
prototype and evaluated the product. Then you blindly tell startups to just
use MySQL/PostgreSQL without having any idea of their use case or whether it
matches their query patterns.

If you are a startup the right way to go is to document your use case,
understand what queries those use cases demand and then find the right
database that satisfies it e.g. don't pick MongoDB if you are doing lots of
joins and don't pick PostgreSQL if you are doing wide-table feature
engineering type analytics.

Right tool for the right job.

~~~
lomnakkus
Without wishing to put words in their mouth, I think parent poster's point
might be that PostgreSQL will do at least a _decent_ job at most things you'll
want to throw at it.

This is not the case for most of the NoSQL databases where you'll pay for lack
of certain features either by a) having to write a lot of code, or b) bad-to-
crippling performance for use cases it wasn't meant to solve.

So, unless you're already _very_ clear on what your exact use case is going to
be, why spend the time analysing before even getting your project off the
ground?

~~~
198394549
>This is not the case for most of the NoSQL databases where you'll pay for
lack of certain features either by a) having to write a lot of code, or b)
bad-to-crippling performance for use cases it wasn't meant to solve.

Can you give a common example of these? This article is referring to issues
related to row vs column data stores, not sql vs nosql.

~~~
stickfigure
Having implemented effectively the same customer-facing analytics problem in
BQ, Keen, Mongo, and Postgres, I'll tell you specifically:

* Column stores like BQ and Keen don't let you efficiently slice and dice data by factors other than time. If you're slicing by customer or product, your queries become incredibly slow and expensive. You start writing hacky shit like figuring out when your customer's first sale was so you can narrow the time slightly, but that barely helps.

* MongoDB doesn't do joins. So you denormalize big chunks of your data, and now you have update problems because 1) you have to hunt all that down and 2) you don't have transactions that span collections. Also the aggregation language is tedious compared to SQL, requiring you to do most of the work of a query planner yourself.

* Some other person in this thread said MongoDB was faster than Postgres, but I found quite the opposite to be true. For the same real-world workload, basic aggregations on an index, we found Postgres to be much faster than Mongo. No idea what that other person is talking about.
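To make the SQL-vs-aggregation-pipeline point concrete, here's a minimal sketch of the same "revenue per merchant since a date" rollup in both systems. The `sales` table/collection and its field names are hypothetical, purely for illustration:

```python
from datetime import datetime

# Postgres: one declarative statement; the query planner picks the index.
PG_QUERY = """
    SELECT merchant_id, SUM(amount) AS revenue
    FROM sales
    WHERE created_at >= %s
    GROUP BY merchant_id;
"""

# MongoDB: you spell out the pipeline stages yourself, in order -- filter
# first (so the index can be used), then group. This is the "doing the
# query planner's work by hand" part.
def mongo_pipeline(since: datetime) -> list:
    return [
        {"$match": {"created_at": {"$gte": since}}},
        {"$group": {"_id": "$merchant_id", "revenue": {"$sum": "$amount"}}},
    ]
```

The pipeline would be run as `db.sales.aggregate(mongo_pipeline(since))`; the SQL goes straight to the database driver with one bind parameter.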

~~~
lomnakkus
Very well put... and this _was_ the point I was making about "decent"
performance. If you have super-special requirements (you don't), you'll
probably discover it along the way to SUCCESS. If you don't, any old SQL
database will probably be more than sufficient _AND_ it will be flexible
enough to allow you to evolve your schema along the way.

------
larrydag
This is a huge concern for me at my current organization. Dev has decided to
put all data into mongoDB. Yet all decisions are based on that data and the
tools we have do not allow for seamless flow (ETL) from mongoDB. That data is
important for deriving decisions that affect revenue and costs. Where are
solutions for the data analysts and scientists? Frankly I'm pretty sick of
hearing it can just be automated.

In my mind there has to be a decent "business intelligence stack". I'm not
sure whether I'm coining that term - I didn't get good search results for the
phrase. Believe me, I've been trying to find solutions. I believe there is big
opportunity in building out this sort of stack that bridges data management
and data analysis. Sure you can call IBM, Microsoft, Dell, HP but be prepared
for big costs and huge software bloat. I would like simplified solutions and
options that can fit with most industry standard tools.

I'm also willing to work with anyone on this as well.

~~~
mindingdata
Others are sharing out of the box solutions.

But I will say that many moons ago, when I did actually write stuff for
Mongo, the oplog was a godsend. You can "tail" the oplog and get every
transaction in near real time. We used this for updating Elasticsearch
indexes etc. in what is basically realtime, without having to poll or modify
existing code at all.
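The tailing pattern described here is roughly the following sketch, assuming pymongo and a replica-set member (the oplog only exists on those); `handle_op` stands in for your own sink, e.g. an Elasticsearch indexer:

```python
def tail_oplog(handle_op, uri="mongodb://localhost:27017"):
    # Imported inside the sketch so it reads top-to-bottom; normally these
    # live at module top. Requires the pymongo package.
    from pymongo import CursorType, DESCENDING, MongoClient

    oplog = MongoClient(uri).local["oplog.rs"]
    # Start after the newest existing entry so we only stream new writes.
    last = next(oplog.find().sort("$natural", DESCENDING).limit(1))
    cursor = oplog.find(
        {"ts": {"$gt": last["ts"]}},
        cursor_type=CursorType.TAILABLE_AWAIT,  # blocks awaiting new entries
    )
    while cursor.alive:
        for op in cursor:
            handle_op(op)  # op["op"] is "i"/"u"/"d" for insert/update/delete
```

In newer MongoDB versions, change streams (`collection.watch()`) give you the same near-realtime feed through a supported API instead of reading the oplog directly.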

~~~
chiaolun
The oplog is awesome! It provides an immutable record which is really useful -
we materialize the oplog directly in Athena to get a time-travelling database
for debugging purposes.

~~~
kalendos
Would you mind sharing the process? I'm curious which flow you use to
materialize the oplog in Athena/S3.

------
codingdave
I'm not aware of any analytics platform that runs directly from the source
data. There is just about always some kind of ETL process, or at the very
least, a data transformation process to shape the data as needed, to provide
data that works well for the reporting. So while information on making MongoDB
performant for such things is mildly interesting... it just isn't how
analytics are generally architected.

------
jrochkind1
What is the benefit of having it in mongo in the first place, in this
scenario?

~~~
twblalock
The people who write the business logic and the people who do the analytics
have different concerns. It's sometimes better to make different database
choices for those two systems and just copy the data into the analytics
system, rather than make a substandard choice of database to try to
accommodate both.

If the devs want to use Mongo, it's their problem -- it shouldn't matter much
to the analytics people, because they can just copy the data into a different
database that fits their needs.

~~~
dizzystar
Except that devs have to do ETL every day so analysts can do their query work.

~~~
twblalock
That can be automated.

~~~
dizzystar
In theory, yes; in practice, not really.

~~~
sdoering
I tend to disagree. Having multiple automated ETL processes running for
different projects/clients/colleagues, I see that the code does not change as
often as I had anticipated.

Automation here (in my case) is a net win on time.

------
eddd
I kind of hoped it'd end up as a joke saying "Don't use mongo". The last time
I used it was 2.4, and it was the worst db experience ever. Back then it was
more sane to craft a solution with PG and HSTORE. Now, I think Redshift does
the job - why would anyone use Mongo in production for anything today?

~~~
nickserv
It's not too far from that joke.

It's like if you ask "how do I drive my car downtown" and I answer, "Easy,
just park at the station and take the train".

To answer your other question, their marketing goes a long way. I recently
started at a new company, and the lead was proudly telling me how the project
was developed using Mongo... So I start explaining how it's basically shit
after using it professionally for a few years. His answer? But SQL doesn't
scale well enough!

~~~
198394549
Why is it basically shit? It appears to store and retrieve the data as per my
instructions.

~~~
nickserv
Except when it doesn't. As far as major problems go: we've had data corruption
issues related to the oplog, out-of-sync secondaries, and excessive resource
usage on the primary. There were also a bunch of smaller problems, but in
fairness those were on the nodejs/mongoose side of things. Would not
recommend.

------
sztanko
Just try this out: [https://github.com/EXASOL/docker-
db](https://github.com/EXASOL/docker-db) and you will be impressed. This is an
embryo of a real analytical database.

Pros:

- an 8 CPU installation with 64gb of memory will probably be a hundred times
faster than Postgres.

- it supports full SQL

- it is super stable, even as a docker container

Cons:

- it does not support nested data

- once you reach volumes of around 2Tb, you will probably have to switch to a
paid version (I mean, you can still continue running on a 200gb ram box, but
it will be suboptimal)

P.s. I am not affiliated with Exasol.

~~~
arghwhat
> an 8 CPU installation with 64gb memory will probably be hundred times faster
> then postgres.

"Probably" not.

The way this usually goes down is that a few synthetic benchmarks show a
large performance benefit over existing established databases (x2, not x100),
with any non-synthetic benchmark showing very poor performance (1/10th,
1/100th, sometimes even worse), and often very unstable performance as well.

The product is then also usually beta quality, as it is hard to compete with
the 36 years Postgres has been in development since its inception in 1982 (and
that's not counting the 9 years of Ingres development, which Postgres—"Post-
Ingres"—spawned from). Important features are usually also quite lacking.

If someone claims x10 or x100 performance improvement over established
databases, they better have published a few papers about all the computer
science research they must necessarily have done to get there.

~~~
DrummerDaveS
Full disclosure: I currently work for Exasol, but I thought I'd just clarify
that Exasol has been around for over 15 years and is far from 'beta'
(currently on version 6, with hundreds of production installations worldwide).
I've also been in the industry for > 40 years and worked with many database
products (including Ingres and Postgres) - and all I can say is download the
free community edition from the Exasol website or the Docker image as
described above and try it for yourself - you will be up and running very
quickly and I think you will be pleasantly surprised regarding both
functionality and performance.

~~~
arghwhat
My comment was more general in the sense that such a grand performance
statement needs some serious backing, and new products claiming to be several
orders of magnitude faster than established products are _usually_ unable to
deliver anything at all.

Would you mind sharing some of the differences to, say, Postgres, and what to
expect if moving from Postgres to Exasol? Porting my applications to Exasol to
benchmark would be time consuming (synthetic benchmarks are very
uninteresting), and without any information about what to expect, it simply
wouldn't be sensible.

I tried to look at the website, but I am not interested in accepting a privacy
policy just to get a white-paper, which frankly leaves me with no usable
information at all. The rest of the website is basically empty, short of
graphs without data and marketing "You want to do X? We can do that too! <no
additional info>". The only real thing I could extract was "in-memory
database".

To me, "in-memory database" would appear to be the catch that makes it an
entirely different product than Postgres, catering to an entirely different
workload with different pros and cons, rather than a faster all-round product.
None of my tables fit in RAM anyway.

------
georgewfraser
There are several companies, including mine (Fivetran) that will replicate
MongoDB into a columnar data warehouse for analytics. For most people, a
commercial replication tool + a commercial columnar data warehouse is the best
trade-off of cost/ease of use. Commercial DWHs deal with all the details of
patching columnar formats under the hood, and commercial replication tools
like ours will deal with all the complexity of things like the Mongo oplog.
For not that much $ you can have a working system in like a day.

------
jrs95
Okay, we get it, Mongo sucks. Or at least that seems to be the consensus. From
what I can tell it seems they've improved their tech _a lot_ though, and I
have to wonder if a lot of the "mongo sucks" sentiment comes from either 1.
using early versions of Mongo that really did suck, or 2. having used Mongo at
companies where nobody really knew how to use it that well.

------
riboflavin
Dremio helps with a lot of this, particularly the speed aspect - it uses
Parquet as well as Apache Arrow. (I work at Dremio.) Speeding things up:
[https://docs.dremio.com/acceleration/reflections.html](https://docs.dremio.com/acceleration/reflections.html)

~~~
nevi-me
Dremio quickly becomes useless with MongoDB given that for a while it's not
been possible to join data from two MongoDB collections by their object IDs.
Last time I checked, Dremio mangled the id into some string that can't even be
matched to the same id on a separate collection.

I had data in PG and Mongo, but couldn't join it together. I asked about this
on the forum, was told it's a known issue; and it seemed to end there.

I resorted to doing my analytics by hand in the end; MongoDB's aggregation
framework is good enough. Create views from aggregation queries, and it
becomes easier.
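A minimal sketch of that views-from-aggregation-queries approach: MongoDB views (3.4+) are defined by a pipeline and then queried like ordinary collections. The view, collection, and field names below are hypothetical:

```python
def sales_by_merchant_view():
    # Command document in the shape MongoDB's "create" command expects
    # for a view; run it with db.command(...) (assuming pymongo).
    return {
        "create": "sales_by_merchant",   # name of the read-only view
        "viewOn": "orders",              # backing collection
        "pipeline": [
            {"$group": {"_id": "$merchant_id",
                        "revenue": {"$sum": "$total"},
                        "orders": {"$sum": 1}}},
        ],
    }

# Applying and using it might look like:
#   db.command(sales_by_merchant_view())
#   db.sales_by_merchant.find({"revenue": {"$gt": 1000}})
```

Each read of the view re-runs the pipeline, so it stays current without any ETL step.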

The downside is that one needs a business license to use the BI connector.

~~~
oneweekwonder
> The downside is that one needs a business license to use the BI connector.

Have you looked at the Postgres mongo fdw[0] before?

[0]:
[https://github.com/EnterpriseDB/mongo_fdw](https://github.com/EnterpriseDB/mongo_fdw)

------
squirrelicus
Okay so... To make MongoDB not suck for analytics, ETL it into a different
format. For engineers trained in backend systems, this is pretty obvious.
After reading this, I also don't know why I'd choose Pequot things over any
other thing.

Baby's first ETL -- just scan the db with a cursor and analyze the data in a
script -- tends to cover 90% of the use cases for BI db analytics with almost
zero resource consumption anyway. Point being: don't write a query to do
analytics if your db can't answer your questions performantly, and don't build
[latent, stale, slow] Enterprise ETL unless you really need it.
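That cursor-scan pattern fits in a few lines. A minimal sketch, with made-up collection/field names; `orders` would be a pymongo cursor in practice, but any iterable of dicts works:

```python
from collections import Counter

def revenue_by_merchant(orders):
    """Stream docs and aggregate in the script, not the database.

    `orders` could be db.orders.find({}, batch_size=1000) -- the cursor
    streams in batches, so memory use stays flat regardless of db size.
    """
    totals = Counter()
    for doc in orders:
        totals[doc["merchant_id"]] += doc.get("total", 0)
    return totals
```

For a one-off BI question this avoids building any pipeline at all: run the script, dump `totals` to a CSV, and you're done.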

~~~
mac01021
As someone who grew up around the home of the Pequot tribe, I'm amused by the
choice made here by your input device's autocorrect feature.

------
kockic
I see that most of the `don't use mongodb for analytics` comments are being
down-voted; however, I tend to agree with them. For all the people out there
looking for a database for analytics, please check out ClickHouse from Yandex -
it's easy to get started with, amazingly fast, and open source.

Disclaimer: I am not affiliated with Yandex in any way, just a happy customer

------
minitoar
We use a similar technique at Interana. Our DB is a column store, but we break
things up over the time dimension to keep file sizes of individual columns
reasonable. One of these time buckets is essentially analogous to a single
parquet file. In addition we split/sort these buckets into smaller buckets as
more events are added.

------
manigandham
This is called ETL, to a data warehouse.

Regardless of the choice of primary database, this is nothing new and just
shows how a lot of startup technical talent seems to be discovering the same
things all the time, usually with needlessly convoluted approaches, and
writing blog posts about it.

------
notoriousp
Little bit offtopic but what product did you use to create those
visualizations?

------
drej
For those seeking tl;dr: The answer is not to use MongoDB.

~~~
twblalock
That doesn't get you out of having to face the problem. This is not a
challenge unique to MongoDB or other NoSQL databases. Oracle or Postgres might
be ideal for your transactional data store, and a columnar database might be
ideal for your analytics.

I suppose you could choose one of those options and sacrifice either your
customer experience or your analytics, but it's probably better to use the
best database for each use case.

------
dmitriid
> How to make MongoDB not suck for analytics

Easy: you don't use Mongo

------
endymi0n
Protip: MongoDB works absolutely best for analytics when it is replaced with a
sane and scalable column-oriented database like Redshift or BigQuery right
before serving that report.

