
Ingesting MySQL data at scale – Part 1 - samber
https://engineering.pinterest.com/blog/tracker-ingesting-mysql-data-scale-part-1
======
netcraft
> Every day we collect more than 100 terabytes of data from MySQL databases

That is an amazing amount of data - I don't use Pinterest but had no idea they
were that large.

~~~
firasd
They have over 100 million MAUs, so even 1,000 UTF-8 characters for each
active user (email, first name, last name, location, tokens, other stuff)
would be up to 0.4 TB. Then add all the inactive users, and the pins, tags,
friend connections, and all the other data they'll process for
recommendations/discovery.
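
A rough version of that back-of-envelope calculation (the numbers are
illustrative guesses, not Pinterest's actual figures):

    # Back-of-envelope estimate of profile data for active users only.
    # All figures are illustrative guesses, not Pinterest's real numbers.
    monthly_active_users = 100 * 1000 * 1000  # "over 100 million MAUs"
    chars_per_user = 1000                     # email, names, location, tokens, ...
    bytes_per_char = 4                        # UTF-8 worst case

    profile_bytes = monthly_active_users * chars_per_user * bytes_per_char
    print(profile_bytes / 1e12, "TB")         # -> 0.4 TB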

~~~
delta1
MAU == Monthly Active User, for those who don't know the acronym (like me).

------
pbreit
Curious if Postgres would be up to the task here?

~~~
Thaxll
Why do people feel the need to mention Postgres on every post talking about
MySQL... It reminds me of the same problem with Linux and *BSD.

~~~
infamouscow
What drives me mad is when people claim they choose PostgreSQL over MySQL
because of data integrity concerns, yet they run PostgreSQL on Linux as
opposed to FreeBSD or Illumos with ZFS.

~~~
dijit
Postgres' MVCC is almost like copy-on-write by itself, so anything more than a
journal is overkill in any realistic scenario.

Personally, we tried ZFS with PostgreSQL, and with our update rate we managed
to fill the WAL drive before the vacuum could clear it out.

So we went back to XFS on CentOS, which is more commonly adopted in our
company.
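
For reference, a minimal sketch of how you might measure the WAL generation
rate being described, assuming PostgreSQL 10 or newer and psycopg2 (connection
string and measurement interval are placeholders):

    # Measure how fast WAL is being generated over a fixed interval.
    # Assumes PostgreSQL 10+ (pg_current_wal_lsn / pg_wal_lsn_diff).
    import time
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
    conn.autocommit = True
    cur = conn.cursor()

    def current_lsn():
        cur.execute("SELECT pg_current_wal_lsn()")
        return cur.fetchone()[0]

    start = current_lsn()
    time.sleep(60)
    end = current_lsn()

    cur.execute("SELECT pg_wal_lsn_diff(%s::pg_lsn, %s::pg_lsn)", (end, start))
    wal_bytes = cur.fetchone()[0]
    print("WAL generated: %.1f MB/min" % (float(wal_bytes) / 1e6))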

------
shenli3514
If there were an RDBMS with great scaling capacity, it would save the effort
of data migration: Spark/Hadoop could read data from the RDBMS directly.
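
A minimal PySpark sketch of that kind of direct read (host, database, table,
credentials, and partition bounds are all placeholders, and the MySQL JDBC
driver would need to be on the Spark classpath):

    # Sketch of Spark reading a MySQL table directly over JDBC.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mysql-direct-read").getOrCreate()

    pins = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://db-host:3306/pinterest")  # placeholder
            .option("dbtable", "pins")                             # placeholder
            .option("user", "reader")
            .option("password", "secret")
            # Split the scan across executors instead of one giant query:
            .option("partitionColumn", "id")
            .option("lowerBound", "0")
            .option("upperBound", "1000000000")
            .option("numPartitions", "64")
            .load())

    pins.write.parquet("hdfs:///warehouse/pins")  # placeholder output path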

------
spudlyo
I found this article very unsatisfying due to the lack of key details. This
reads like a blog post tailored for an audience of folks with pointy hair.

As far as I can tell, they went from Hadoop mappers pulling logical backups
from the DBs using Python and mysqldump to ... Hadoop mappers pulling logical
backups from S3, which were pushed there by scripts running on the DBs,
probably still using mysqldump. Although I have no idea, since there are no
details. Are these backups physical? Logical? How is the 12-hour big-table
problem solved by this approach? Why was there a limitation on the number of
mappers usable by the old approach?

And what of the old system's DB failover problem? Nothing that can't be solved
with a script that is _failover aware_! Nice. No reason to have dedicated
ingestion slaves that aren't, you know, _master candidates_, when you have
scripts that can restart themselves.
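
To make the guess concrete, here is a minimal sketch of the pipeline the
comment is describing: a script on the DB host taking a logical dump with
mysqldump and pushing it to S3 for the mappers to pick up later. Database,
table, paths, and bucket are made up, and the post itself never says whether
the backups are logical or physical:

    # Hypothetical push script running on a DB host: logical dump, then S3.
    import subprocess
    import boto3

    db, table = "pinterest", "pins"                       # placeholders
    dump_path = "/data/dumps/%s.%s.sql.gz" % (db, table)  # placeholder path

    # Take a consistent logical dump of one table and gzip it on the fly.
    with open(dump_path, "wb") as out:
        dump = subprocess.Popen(
            ["mysqldump", "--single-transaction", db, table],
            stdout=subprocess.PIPE)
        subprocess.check_call(["gzip", "-c"], stdin=dump.stdout, stdout=out)
        dump.stdout.close()
        if dump.wait() != 0:
            raise RuntimeError("mysqldump failed")

    # Upload so the Hadoop mappers can pull from S3 instead of the DB host.
    boto3.client("s3").upload_file(
        dump_path,
        "example-mysql-backups",                          # placeholder bucket
        "dumps/%s/%s.sql.gz" % (db, table))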

~~~
haney
As someone with pointy hair, I had no idea this was a stereotype. Should I
migrate to a more web-scale haircut?

