

Jetpants: a MySQL toolkit for managing billions of rows and hundreds of DBs - evanelias
http://engineering.tumblr.com/post/24612921290/jetpants-a-toolkit-for-huge-mysql-topologies

======
notJim
Slightly OT, but Tumblr gave a great talk about their sharding architecture
here: [http://engineering.tumblr.com/post/12652551894/slides-
from-o...](http://engineering.tumblr.com/post/12652551894/slides-from-our-
velocity-europe-talk-on-mysql-sharding).

It's really quite a good intro to the subject, I think.

------
salimmadjd
This is awesome! Also, I was at a MySQL event at Oracle two days ago, and I
overheard the MySQL guys talking with the Pinterest folks about their sharding,
and how the MySQL team was going to announce something soon and wanted to get
the Pinterest team's feedback. The MySQL team is doing their scripting in
Python, so for a Python shop, their release might be more interesting.

~~~
evanelias
Cool, looking forward to seeing that! I love Python too, and readily admit
it's perhaps a more frequent choice for this type of automation.

That said, I really grew to love Ruby over the course of this project, which
is actually my first in the language. Ruby's open classes allowed me to write
a pretty flexible plugin/callback system with very little code. Jetpants
allows you to hook arbitrary methods in before or after any method in any
Jetpants class, and these callbacks "stack" (with support for different
priorities) so multiple plugins can hook in to the same place.

Because every large site seems to tackle sharding slightly differently, I
figured a nice plugin system was pretty important in order for anyone else to
be able to use this :)
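
The stacking before/after hook mechanism described above can be sketched roughly like this -- this is an illustrative toy, not Jetpants's actual code, and all the class and method names (`Hookable`, `Shard#split`, the hook bodies) are made up for the example:

```ruby
# Minimal sketch of a stacking before/after callback system using Ruby's
# open classes and define_method. Illustrative only -- not Jetpants's code.
module Hookable
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    def before(name, priority: 0, &blk)
      hooks[[:before, name]] << [priority, blk]
      wrap(name)
    end

    def after(name, priority: 0, &blk)
      hooks[[:after, name]] << [priority, blk]
      wrap(name)
    end

    def hooks
      @hooks ||= Hash.new { |h, k| h[k] = [] }
    end

    private

    # Re-open the class and wrap the target method exactly once; the
    # wrapper runs all before-hooks, the original, then all after-hooks.
    def wrap(name)
      @wrapped ||= {}
      return if @wrapped[name]
      @wrapped[name] = true
      original = instance_method(name)
      table = hooks
      define_method(name) do |*args|
        table[[:before, name]].sort_by(&:first).each { |_, b| b.call(self) }
        result = original.bind(self).call(*args)
        table[[:after, name]].sort_by(&:first).each { |_, b| b.call(self) }
        result
      end
    end
  end
end

class Shard
  include Hookable
  attr_reader :log

  def initialize
    @log = []
  end

  def split
    @log << :split
  end
end

# Two "plugins" hook the same method; priorities control stacking order.
Shard.before(:split, priority: 10) { |s| s.log << :verify_replication }
Shard.before(:split, priority: 20) { |s| s.log << :acquire_lock }
Shard.after(:split)                { |s| s.log << :notify }

s = Shard.new
s.split
s.log  # => [:verify_replication, :acquire_lock, :split, :notify]
```

The key trick is that `instance_method` captures the original implementation before `define_method` re-opens the class and replaces it, so the wrapper is installed without the target class ever knowing about the plugin.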

------
ErrantX
This is excellent stuff; it addresses a niggling problem I have (i.e. I'm a
better programmer than sysadmin, so managing massive data shards is a pain).

It's an elegant implementation. From a ~30 minute read-through, I reckon I can
use it to replace our current "hacked up" solution in just a few hours.

Kudos Tumblr.

------
iuguy
It looks interesting, but one small thing: I looked into the "transferring
large files quickly" link[1] and saw that they were using netcat and tar to
transfer files. This is not necessarily optimal[2], and applying some
compression can go a long way, although this will depend on the use case.
Compression also has the added bonus of transferring less data across the
network and (if you don't decompress) using less space at the other end.

SSH is also an option (a slow option, but an option) that provides certain
things (encryption, authentication) that make it ideally suited for transfers
across network boundaries.

[1] - [http://engineering.tumblr.com/post/7658008285/efficiently-
co...](http://engineering.tumblr.com/post/7658008285/efficiently-copying-
files-to-multiple-destinations)

[2] - [http://www.ndchost.com/wiki/server-administration/netcat-
ove...](http://www.ndchost.com/wiki/server-administration/netcat-over-ssh)

~~~
ralph
They're using pigz in [1], so they are compressing?

I thought it wasn't a great idea. It's OK as an on-the-fly solution to the
problem, but BitTorrent or multicast would seem better; the serial route
between machines isn't very fault tolerant, requiring a start from scratch on
failure.

socat > nc BTW, and does multicast.

As for ssh, it's a shame the "no encryption" option was removed.

~~~
evanelias
re: fault tolerance, it's a fair point. Although in practice I've never had
this fail part-way on me, and I've used it a couple hundred times with >600GB
transfers.

We usually use this to copy to 2 or 3 machines at once; it's rare that we'd
need to bring up 4+ slaves simultaneously, or split a shard into 4+ pieces.
Most Linux distributions already have all the software needed except pigz,
which is tiny and available in several packaging systems.

I'll definitely give socat a look though, thanks for the tip.

~~~
ralph
Oh, OK, I agree, for 2-3 machines reliability isn't an issue, I was thinking
more dozens.

------
danmaz74
It's so good to see companies open sourcing some of their technologies. They
probably do so mostly to improve their image, but who cares? I have to say...
it works on me! Thanks, Tumblr.

~~~
mokkai
No. They mostly do it to ensure that the software is actively maintained. When
a piece of home-grown software reaches a certain level of maturity, it makes
sense to set a roadmap and release it to the public. More users + developers =
profit for both the company and the public.

~~~
danmaz74
Yes, if you get there, that's a win-win. I wonder, though, how many projects
really get community support after they are open sourced by a company.

------
Andys
My quick reading of this is that it's suited for databases that don't change
much (or at all) once the data is inserted, and not as much for apps that need
to keep strong ACID compliance with guaranteed referential integrity.

~~~
evanelias
We handle many thousands of write queries per second at Tumblr, and we use
Jetpants to manage our entire MySQL topology, so trust me when I say the data
changes quite often :) You can edit your existing posts on Tumblr, unlike on
several other prominent social sites.

Please do let me know how you got that impression, though -- I'm happy to
clear up confusing things in the docs.

As for ACID compliance: Jetpants is a toolkit for MySQL / InnoDB, and doesn't
really impact the referential integrity guarantees of those systems any more
or less than other partitioning schemes. MySQL is inherently not a distributed
system, for better or worse.

~~~
Andys
I got the impression because I didn't see any discussion about the handling of
bringing new slaves online (other than how to make it fast). Do you pause one
of the slaves to get a consistent dump?

~~~
evanelias
Slave cloning is performed by shutting down mysqld on a standby slave and
copying its raw data files. There's no dump involved. This is widely regarded
as the fastest possible way to clone a slave in MySQL.

This is explained in the deeper doc files -- didn't want to bog down the top-
level README with implementation details.

Meanwhile, data exporting (for shard splits or table defragmentation) is done
on a standby slave with replication stopped.
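
Spelled out, that cold-copy procedure is roughly the following. Hostnames, paths, and port 7000 are placeholders, and this is a sketch of the general technique rather than Jetpants's actual code; `run` only logs each step, so the sketch is safe to execute as-is (drop the logging wrapper to run the commands for real):

```shell
# Dry-run sketch of cold-copy slave cloning: stop mysqld on a standby
# slave and stream its raw data files to the new machine.
: > /tmp/clone_steps.log
run() { echo "+ $*" >> /tmp/clone_steps.log; }

# 1. Stop mysqld on the source standby so its datadir is consistent.
run ssh source-slave "service mysql stop"

# 2. Start a listener on the target that decompresses into the datadir.
run ssh target-slave "nc -l -p 7000 | pigz -d | tar xf - -C /var/lib/mysql"

# 3. Stream the raw files from source to target, compressed with pigz.
run ssh source-slave "tar cf - -C /var/lib/mysql . | pigz | nc target-slave 7000"

# 4. Bring both back up; the copy can resume replication from the
#    coordinates recorded in the copied files.
run ssh source-slave "service mysql start"
run ssh target-slave "service mysql start"

cat /tmp/clone_steps.log
```

Because the source is a standby slave rather than the master, stopping mysqld on it costs no application-facing availability, which is what makes the cold copy practical.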

------
joshjhacker
Is there a mailing list for Jetpants so we can ask questions? For example, how
should my application connect to MySQL? Should it connect directly to MySQL,
or to a Jetpants server, and how?

~~~
evanelias
Good call -- we'll definitely set one up if there's sufficient need. Until
then, feel free to email me; my email address is in the gemspec. I'll write up
an FAQ once I have enough questions answered.

re: your immediate question, you still connect to MySQL as normal. Jetpants
isn't a server, middleware, framework, or ORM. Rather, it's a toolkit.
Jetpants has a command suite that you can use to run its built-in
functionality, but it's also a Ruby gem that you can integrate into custom
scripts however you'd like.

The functionality is all geared towards managing large DBs
(importing/exporting lots of data quickly, copying files quickly, etc) and
managing large numbers of servers (promoting/demoting masters and slaves,
adding new machines to a pool, rebalancing a shard).

~~~
joshjhacker
Thanks!

Another question: we are using Django, so we are considering Postgres since
there is a Python connection pool available. Could Jetpants potentially be
used for Postgres? I.e., how much of the functionality is MySQL-specific?

~~~
evanelias
The core functionality is currently very MySQL-specific. In theory a plugin
could override a bunch of methods to target Postgres, and maybe even Redis or
other persistent data stores with replication and import/export functionality.
It would be a lot of work though.

I also made the mistake of putting "mysql" in the names of a few methods. At
some point soon I'll change those to more generic names, and alias the old
names to the new generic ones.

~~~
plasma
I'd love for this to do PgSql too.

Thanks very much Evan as well for our chats ages ago (Andrew here), was happy
to see you release this tool!

------
zerop
Sounds quite useful. I was looking for some good tools for sharding and
replication for MySQL.

