

Behind the Pins: Building Analytics - tongbo
http://engineering.pinterest.com/post/96104443004/behind-the-pins-building-analytics

======
tongbo
Hi bcbrown,

Thank you for reading the post! We run our processing cluster on AWS, so we
can easily scale according to our needs. The condition job here is a way to
check cross-workflow dependencies: when a job succeeds and emits its output
files, it also creates a success file within the output directory. The
condition job just checks for the existence of that success file, which is
fast and reliable.
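
To make that concrete, here's a rough sketch of what such a check can look
like. This isn't our actual code; the bucket, prefix, and helper names are
made up, and it uses Python with boto3 purely for illustration. The upstream
Hadoop job drops an empty _SUCCESS marker into its output directory, and the
condition job simply probes S3 for it:

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def partition_is_ready(bucket, prefix):
        """Return True once the upstream job has written its success marker.

        Hadoop jobs conventionally write an empty _SUCCESS file into the
        output directory after all part files have landed, so a single
        HEAD request is enough to confirm the whole output is complete.
        """
        key = prefix.rstrip("/") + "/_SUCCESS"
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError:
            return False

    # A condition job polls this check on an interval and kicks off the
    # downstream workflow as soon as the marker appears.
    if partition_is_ready("example-logs-bucket", "events/dt=2014-08-26"):
        print("upstream output complete; safe to launch downstream jobs")

Because the marker is written only after the job finishes successfully, the
condition job never has to list or count part files, which is what makes the
check fast and reliable.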

-Tongbo

------
bcbrown
A 19-hour, 100-job Hadoop pipeline! Are you running up against the limits of
your architecture, or do you think there's room to grow?

"The MapReduce pipeline starts to process data as soon as the data is
available. It’s triggered by the condition jobs which periodically check if
the data is available on S3." What do you mean by that? What's the trigger
condition, when the number of logged events reaches a threshold? I was a
little surprised to hear it's dynamically triggered, and not on a scheduled
clock.

Cool stuff, thanks for posting it.

