
Petabyte-Scale Data Pipelines with Docker, Luigi and Elastic Spot Instances - vtuulos
http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html
======
ngould
Cool stuff. Curious to know what systems these containerized tasks are pulling
data from. Does AdRoll let those containers access production database
instances, or are they backed by non-production systems? (EDIT: I understand
that it's S3 for intermediate steps, but I'm curious where the data comes
from initially.)

~~~
vtuulos
We have a few cases of containers pulling (meta)data from databases. We use
Postgres on RDS, so setting up a read-replica is very easy. Containers can
safely access these replicas.
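
A minimal sketch of that pattern, for illustration only: a Luigi task that
reads metadata from a read replica and stages it in S3 for downstream steps.
Hostnames, credentials and bucket names are made up, and the S3Target import
path varies between Luigi versions.

    import csv
    import io

    import luigi
    import psycopg2
    from luigi.contrib.s3 import S3Target


    class ExportCampaignMetadata(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            # Intermediate results live in S3, so re-runs skip finished tasks.
            return S3Target("s3://example-bucket/metadata/%s.csv" % self.date)

        def run(self):
            # Point the container at the read replica, never the primary.
            conn = psycopg2.connect(
                host="replica.example.rds.amazonaws.com",
                dbname="adserving", user="readonly", password="...")
            try:
                cur = conn.cursor()
                cur.execute("SELECT id, name FROM campaigns")
                buf = io.StringIO()
                csv.writer(buf).writerows(cur.fetchall())
            finally:
                conn.close()
            with self.output().open("w") as out:
                out.write(buf.getvalue())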

Raw data is collected by our bidders and adservers, which push it to S3 and
Kinesis for real-time consumers
(http://tech.adroll.com/blog/data/2015/06/26/kinesis.html).

------
samkone
Interesting, we're doing similar things with Spark, Cassandra and Mesos,
scaling out Mesos with spot instances for its agents.

------
dkroy
What are you using for scheduling your jobs?

~~~
oavdeev
We wrote our own scheduler, called Quentin, which we may open-source.
Essentially it is a simple queue that takes care of running containers and
managing the pools of instances they run on by integrating with AWS
autoscaling.
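
A rough sketch of that kind of autoscaling integration (not Quentin itself;
the group name, sizing heuristic and queue are hypothetical): the scheduler
adjusts the desired capacity of a spot-instance autoscaling group based on
how many containers are queued.

    import boto3

    def resize_pool(queue_depth, group_name="spot-worker-pool",
                    containers_per_instance=4, max_instances=100):
        # Ceiling division: enough instances to cover the queued containers,
        # capped at the pool's maximum size.
        desired = min(max_instances,
                      -(-queue_depth // containers_per_instance))
        autoscaling = boto3.client("autoscaling")
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=desired,
            HonorCooldown=False)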

It is a very straightforward service: for example, it doesn't do retries
(Luigi handles those), and it is neither highly available nor distributed.
That makes it much easier to maintain, and we don't need HA since job
inputs/outputs are in S3 anyway. Luigi makes it easy to restart parts of the
pipeline if the scheduler ever goes down.
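
To illustrate why restarts are cheap, here is a minimal Luigi sketch (the
image, bucket and task names are made up, not our actual pipeline code):
because each task's output is an S3 object, Luigi's completeness check skips
anything already finished when the pipeline is re-run after an outage.

    import subprocess

    import luigi
    from luigi.contrib.s3 import S3Target


    class ContainerizedStep(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return S3Target(
                "s3://example-bucket/reports/%s/_SUCCESS" % self.date)

        def run(self):
            # The real work happens inside a container, which writes its
            # results to S3; we drop a success marker when it exits cleanly.
            subprocess.check_call([
                "docker", "run", "--rm", "example/report-builder",
                "--date", str(self.date)])
            with self.output().open("w") as marker:
                marker.write("done")


    class DailyPipeline(luigi.WrapperTask):
        date = luigi.DateParameter()

        def requires(self):
            # Steps whose S3 marker already exists are skipped on a re-run,
            # so restarting the pipeline is safe and inexpensive.
            yield ContainerizedStep(date=self.date)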

