
Unlocking New Features in Moz Pro with a Database-Free Architecture (2016) - myth_drannon
https://moz.com/devblog/moz-analytics-db-free
======
boxcarr
I'm the person who did the original prototype in Python/Pandas. The purpose of
the prototype was to prove that with the right data representation, you could
process more data and cut processing and storage by over two orders of
magnitude. I picked search rankings because of the size of the data and
because the legacy system struggled to show more than a limited amount of
it.

Previously, the MySQL solution had many rows for each ranking spanning several
tables with all kinds of other data not related to showing the aggregate
statistics that were relevant to Moz's users.

The solution was processing the raw data in batch and applying categorization
to have an integer representation for all values. These changes led to a very
compact representation that could quickly be loaded into memory and then
filtered/aggregated.
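
A minimal sketch of what that kind of integer encoding could look like in
Pandas. The column names, sample rows, and the `encode` helper are made up for
illustration; this is not the actual Moz schema or prototype code.

```python
import pandas as pd

# Hypothetical raw ranking rows (illustrative values only).
raw = pd.DataFrame({
    "keyword": ["red shoes", "blue hats", "red shoes", "blue hats"],
    "engine": ["google", "google", "bing", "bing"],
    "rank": [3, 12, 5, 7],
})


def encode(df, columns):
    """Replace each listed column with integer codes; return the lookup tables too."""
    encoded = df.copy()
    lookups = {}
    for col in columns:
        cat = df[col].astype("category")
        # Position in this list is the integer code for that value.
        lookups[col] = list(cat.cat.categories)
        encoded[col] = cat.cat.codes.astype("int32")
    return encoded, lookups


compact, lookups = encode(raw, ["keyword", "engine"])

# Filtering and aggregating now touch integers only.
google = lookups["engine"].index("google")
mean_rank_by_keyword = (
    compact[compact["engine"] == google].groupby("keyword")["rank"].mean()
)
```

The lookup tables only need to be consulted when turning codes back into
strings for display, so the hot path stays on small integer columns.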

Storing the results as a CSV wasn't important. It just turned out that having
the static data allowed me to effortlessly scale-out serving since the data
was only updated once a day or once a week, and it was append-only.

How big of a difference did it make? All of the CSVs, individually compressed,
came to less than 20GB. The production system served all user data off of a ~60
node MySQL cluster, with rankings being the most costly in terms of processing
and 2nd in terms of disk usage (from what I remember).

Also, Pandas was blazing fast at loading several megabytes of CSV data (< 80ms
@ the time). If I had to do it again today, I'd probably use Apache Parquet
instead.
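
For a rough idea of the load-then-filter step, here is a sketch that writes a
gzip-compressed CSV and reads it back with explicit integer dtypes. The file
name, columns, and values are assumptions for the example, not the real data
layout.

```python
import os
import tempfile

import pandas as pd

# Hypothetical integer-encoded rankings (illustrative values only).
df = pd.DataFrame({
    "keyword": [1, 0, 1, 0],
    "engine": [1, 1, 0, 0],
    "rank": [3, 12, 5, 7],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rankings.csv.gz")
    # Pandas infers gzip compression from the ".gz" suffix.
    df.to_csv(path, index=False)
    # Declaring dtypes up front skips type inference and keeps columns compact.
    dtypes = {"keyword": "int32", "engine": "int32", "rank": "int32"}
    loaded = pd.read_csv(path, dtype=dtypes)

# Once in memory, filtering is a plain boolean mask.
top10 = loaded[loaded["rank"] <= 10]
```

A Parquet version would swap in `DataFrame.to_parquet` and `pd.read_parquet`,
which need an engine such as pyarrow installed.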

The most important insight that carried through the various solutions was to
pre-process the data so that it was easily consumable for the task at hand.
The languages (Python/Elixir) didn't make a difference, in my opinion. That
said, Pandas is fantastic: it made working with that data in-memory very easy,
at least for this prototype.

