
Pinterest open sources Pinball – a flexible data workflow manager - llaxsll
http://engineering.pinterest.com/post/113376157699/open-sourcing-pinball
======
chatmasta
There's something amazing about how Pinterest -- a seemingly simple social
media app, roughly replicable with a few hours of CRUD framework programming
-- takes on a life of its own when infused with venture capital funding and a
strong engineering team. The core of the web product is so simple as to be
almost trivial: show a grid of images and links to each user. Indeed, when
Pinterest first started I'm sure the logic entailed little more than that. Now,
billions of page views and dozens of engineering hires later, a once-simple
app becomes a behemoth force, crunching data on the order of petabytes per
day.

How do the operations powering an app like Pinterest evolve from simple to so
complex? Do the complexities emerge from necessity, or simply from idle time
on the part of the engineers, who naturally crave hard problems to solve?

It's a fascinating meta-commentary on our industry that simple web apps grow
to become such complex operations. A business can survive on the kernel of its
core competency -- in this case, photo grids -- but to thrive, it requires
careful attention to petabytes of peripheral decisions. Indeed, it seems such
an evolutionary process is advantageous for a web startup. Friendster and
MySpace may well have failed because they mistook their problems for simple
ones. They were able to solve the core problem of a social network, but not
the many peripheral ones of operating that network at scale. It's the ability
to do the latter that sets apart the major successful startups from the also-
rans.

------
jsmeaton
I have a workflow that I'd really like to automate/rewrite. A wav file is
generated on a remote server. That server will rsync/scp it to a processing
node. The processing node will query a database, and write out a text file
with parts of that file to remove. It'll then convert it to mp3 (using sox and
lame) with those parts removed. Another job will then pick up the mp3 file,
query another database, and if it gets a hit it will sync that file to s3.

Is this a kind of workflow that would run with pinball? Can you move files
around with it, or do you use the file system and pass filepaths around?
Ideally, the workflow job would hold onto the wav/mp3 and the associated
database fields that are returned so I don't have to juggle weird directories
around (and have to sync access to them).

I'm not familiar with any other workflow engines, so I'm unsure if this is the
kind of thing that would traditionally run on one. I looked at the user guide
but it's currently barren.

~~~
maoyesf
Pinball is good for this use case. You can build a workflow that includes a few
jobs:

job1. generate a wav file, and put it somewhere, say s3://wav.file

job2 (run after job1): pick up the wav file from the location s3://wav.file

You need to define the contract between the parent and child jobs as part of
your business logic. In this example, when you implement job 1 and job 2, you
need a protocol for how they produce, store, and consume the wav file.
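One way such a contract could look is sketched below in plain Python. This is not Pinball's API: the function names, the `recording_id` parameter, and the date-based layout are all assumptions made for illustration. The idea is only that both jobs derive the location from the same inputs, so nothing has to be handed from one job to the next at runtime.

```python
from datetime import date

# Placeholder base location, taken from the s3://wav.file example above.
BASE = "s3://wav.file"


def wav_location(run_date: date, recording_id: str) -> str:
    """Hypothetical contract: the agreed location of one recording's wav file."""
    return f"{BASE}/{run_date:%Y/%m/%d}/{recording_id}.wav"


def job1_output(run_date: date, recording_id: str) -> str:
    # Job 1 (hypothetical) writes the wav file to the agreed location.
    return wav_location(run_date, recording_id)


def job2_input(run_date: date, recording_id: str) -> str:
    # Job 2 (hypothetical) reads from the same derived location.
    return wav_location(run_date, recording_id)
```

Because both jobs call the same helper, changing the layout later only means changing `wav_location`.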

~~~
jsmeaton
Thanks for the reply. I'm wondering how you would share the location of the
file between jobs though. Can job 1 output a file location that job 2 accepts
as an input?

I see there are plans to write up some documentation, but are there any
timelines that you're aiming to have those written?

Also, the README calls out mysql as being required. I assume that this, being
a django project, will work with other backends too. Is there anything, to
your knowledge, that would prevent a different backend being used (like
postgres or oracle)?

------
solve
I've badly been wanting one of these since I used a great one at my last job
in 2010. Are there more of these now that I don't know about?

~~~
unode
yep, check out Spotify's Luigi project. Probably the most widely adopted OSS
one [https://github.com/spotify/luigi](https://github.com/spotify/luigi)

~~~
andy_wrote
Are there people who have more experience with comparative workflow managers
who can quickly see the pros and cons of Pinball vs. Luigi? Perhaps someone at
Pinterest who tried out other systems, as was mentioned in the post? (Though
maybe Luigi wasn't available to the public when this comparison happened.)

~~~
maoyesf
Luigi was not publicly available when Pinball started, so I'm not sure of the
pros and cons of Pinball versus Luigi.

When we built Pinball, we aimed to build a scalable and flexible workflow
manager that satisfies the following requirements (I'll just name a few here).

1\. Easy system upgrades - when we fix bugs or add new features, there should
be no interruption to currently running workflows and jobs.

2\. Easy to add/test workflows - end users can easily add new jobs and
workflows to the Pinball system without affecting other running jobs and
workflows.

3\. Extensibility - a workflow manager should be easy to extend. As the
company and business grow, there will be a lot of new requirements and
features needed. We would love your contributions as well.

4\. Flexible workflow scheduling policy, easy failure handling.

5\. We provide a rich UI for you to easily manage your workflows:

\- failed jobs are retried automatically,

\- you can retry failed jobs, skip some jobs, or select a subset of a
workflow's jobs to run (all from the UI),

\- you can easily access the full run history of your jobs, and also get
their stderr and stdout logs,

\- you can also explore the topology of your workflow, with support for easy
search.

6\. Pinball is very generic and can support different kinds of platforms: you
can use different Hadoop clusters, e.g., a Qubole cluster or an EMR cluster.
You can write different kinds of jobs, e.g., Hadoop streaming, Cascading,
Hive, Pig, Spark, Python ...

There are a lot of interesting things built into Pinball, and you may want to
give it a try!

------
asmosoinio
Does anyone know how this compares to full Business process modeling
platforms, such as Activiti, jBPM, or Bonita?

------
ecesena
Does anyone know how this compares to celery?

~~~
maoyesf
[http://www.celeryproject.org/](http://www.celeryproject.org/) Celery is a
distributed task queue. Pinball has the concept of a workflow, and a workflow
contains many jobs. Pinball helps translate a lot of application logic, such
as workflows, schedules, and jobs, into its system, and provides a lot of
functions for end users to manage their workflow jobs.
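To illustrate the difference in general terms: a task queue runs independent tasks, while a workflow orders its jobs by their dependencies. The sketch below uses only Python's standard `graphlib` module and is not Pinball's API; the job names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "generate_wav": set(),
    "convert_to_mp3": {"generate_wav"},
    "sync_to_s3": {"convert_to_mp3"},
}


def run_order(jobs):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(jobs).static_order())


order = run_order(workflow)
```

A workflow manager adds scheduling, retries, and a UI on top of this kind of ordering.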

We did compare Pinball with Apache Oozie and Azkaban when we started this
project.

~~~
ecesena
Thanks for the details! I will look into these resources.

------
mohap
are there any open source toolkits that handle user workflows well?

------
jellyroll
Wow this is awesome

