Pinterest open sources Pinball – a flexible data workflow manager (pinterest.com)
76 points by llaxsll on March 12, 2015 | 18 comments



There's something amazing about how Pinterest -- a seemingly simple social media app, roughly replicable with a few hours of CRUD framework programming -- takes on a life of its own when infused with venture capital funding and a strong engineering team. The core of the web product is so simple as to be almost trivial: show a grid of images and links to each user. Indeed, when Pinterest first started I'm sure the logic entailed little more than that. Now, billions of page views and dozens of engineering hires later, a once-simple app has become a behemoth, crunching data on the order of petabytes per day.

How do the operations powering an app like Pinterest evolve from simple to so complex? Do the complexities emerge from necessity, or simply from idle time on the part of the engineers, who naturally crave hard problems to solve?

It's a fascinating meta-commentary on our industry that simple web apps grow to become such complex operations. A business can survive on the kernel of its core competency -- in this case, photo grids -- but to thrive, it requires careful attention to petabytes of peripheral decisions. Indeed, it seems such an evolutionary process is advantageous for a web startup. Friendster and MySpace may well have failed because they mistook their problems for simple ones. They were able to solve the core problem of a social network, but not the many peripheral ones of operating that network at scale. It's the ability to do the latter that sets apart the major successful startups from the also-rans.


I have a workflow that I'd really like to automate/rewrite. A wav file is generated on a remote server. That server will rsync/scp it to a processing node. The processing node will query a database, and write out a text file with parts of that file to remove. It'll then convert it to mp3 (using sox and lame) with those parts removed. Another job will then pick up the mp3 file, query another database, and if it gets a hit it will sync that file to s3.

Is this a kind of workflow that would run with pinball? Can you move files around with it, or do you use the file system and pass filepaths around? Ideally, the workflow job would hold onto the wav/mp3 and the associated database fields that are returned so I don't have to juggle weird directories around (and have to sync access to them).

I'm not familiar with any other workflow engines, so I'm unsure if this is the kind of thing that would traditionally run on one. I looked at the user guide but it's currently barren.
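
For reference, here is a rough standalone sketch of the trim/encode/upload steps described above in plain Python. The paths, the kept segment, and the bucket name are placeholders, and this is just shell plumbing, not Pinball code:

    # Hypothetical paths and bucket; illustrates the sox/lame/S3 steps only.
    import subprocess

    import boto3  # assumed available for the S3 upload

    WAV_IN = "/data/incoming/show.wav"   # delivered by rsync/scp
    WAV_CUT = "/data/work/show_cut.wav"
    MP3_OUT = "/data/work/show.mp3"
    BUCKET = "my-audio-bucket"           # placeholder bucket name

    # Suppose the database said "keep seconds 0-120" (a single kept segment,
    # to keep the sox invocation simple; removing several ranges would mean
    # trimming multiple segments and concatenating them).
    subprocess.check_call(["sox", WAV_IN, WAV_CUT, "trim", "0", "120"])

    # Encode to mp3; lame reads WAV input directly.
    subprocess.check_call(["lame", WAV_CUT, MP3_OUT])

    # Push the result to S3 if the second database lookup hit.
    boto3.client("s3").upload_file(MP3_OUT, BUCKET, "show.mp3")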


Pinball is good for this use case. You can build a workflow that includes a few jobs:

job1: generate a wav file and put it somewhere, say s3://wav.file

job2 (runs after job1): pick up the wav file from the location s3://wav.file

You need to define the contract between the parent and child jobs in your business logic. In this example, when you implement job1 and job2, you need a protocol for them to produce, store, and consume the wav file.
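
As a sketch of what that contract can look like (generic Python, not Pinball's actual job API): the shared S3 location is the protocol, job1 writes it and job2 reads it. Bucket and key names are made up.

    # Illustrative only: a parent/child contract expressed as plain functions.
    import boto3

    WAV_LOCATION = ("my-bucket", "audio/input.wav")  # hypothetical bucket/key

    def job1_generate_wav(local_wav_path):
        """Produce the wav file and publish it at the agreed location."""
        bucket, key = WAV_LOCATION
        boto3.client("s3").upload_file(local_wav_path, bucket, key)

    def job2_consume_wav(local_dest):
        """Runs after job1; fetches the wav file from the agreed location."""
        bucket, key = WAV_LOCATION
        boto3.client("s3").download_file(bucket, key, local_dest)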


Thanks for the reply. I'm wondering how you would share the location of the file between jobs though. Can job 1 output a file location that job 2 accepts as an input?

I see there are plans to write up some documentation, but is there a timeline for when you're aiming to have it written?

Also, the README calls out MySQL as being required. I assume that this, being a Django project, will work with other backends too. Is there anything, to your knowledge, that would prevent a different backend (like Postgres or Oracle) from being used?
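
(For what it's worth, the Django-level switch is normally just a settings change; whether Pinball's own code assumes MySQL-specific behaviour is a separate question. A hypothetical settings snippet, with placeholder names:)

    # settings.py sketch: pointing Django at Postgres instead of MySQL.
    # This only shows the standard Django-level switch; it does not answer
    # whether Pinball itself relies on MySQL-specific SQL.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql_psycopg2",
            "NAME": "pinball",
            "USER": "pinball_user",      # placeholder credentials
            "PASSWORD": "change-me",
            "HOST": "127.0.0.1",
            "PORT": "5432",
        }
    }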


I've been badly wanting one of these since I used a great one at my last job in 2010. Are there more of these now that I don't know about?


yep, check out Spotify's Luigi project. Probably the most widely adopted OSS one https://github.com/spotify/luigi


Are there people who have more experience with comparative workflow managers who can quickly see the pros and cons of Pinball vs. Luigi? Perhaps someone at Pinterest who tried out other systems, as was mentioned in the post? (Though maybe Luigi wasn't available to the public when this comparison happened.)


Luigi was not publicly available when Pinball started, so I'm not sure about the pros and cons of Pinball vs. Luigi.

When we built Pinball, we aimed to build a scalable and flexible workflow manager that satisfies the following requirements (I'll just name a few here):

1. Easy system upgrades: when we fix bugs or add new features, there should be no interruption to currently running workflows and jobs.
2. Easy to add/test workflows: end users can easily add new jobs and workflows to the Pinball system without affecting other running jobs and workflows.
3. Extensibility: a workflow manager should be easy to extend. As the company and business grow, there will be a lot of new requirements and features needed (and we'd love your contributions as well).
4. Flexible workflow scheduling policy and easy failure handling.
5. A rich UI to easily manage your workflows: failed jobs are auto-retried; from the UI you can retry failed jobs, skip jobs, or select a subset of a workflow's jobs to run; you can easily access the full run history of your jobs, along with their stderr/stdout logs; and you can explore the topology of your workflow, with easy search.
6. Pinball is very generic and can support different kinds of platforms: you can use different Hadoop clusters (e.g., Qubole or EMR clusters) and write different kinds of jobs (e.g., Hadoop streaming, Cascading, Hive, Pig, Spark, Python, ...).

There are a lot of interesting things built into Pinball, and you'll probably want to give it a try!


We are heavy users of Luigi at my company. Its central scheduler process also serves the UI, and sometimes the UI gets stuck for us.

Luigi does, though, have a lot of pipeline building blocks: it provides APIs to access HDFS and S3, read from and write to them, etc. They are very useful, but they execute in the same Python process as the rest of the job, which heavily loads the machine where the job runs (in our case, the same server where the luigid scheduler runs).
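
For readers who haven't used it, this is roughly what those building blocks look like in a Luigi task (the bucket and paths are made up; in older Luigi releases S3Target lived under luigi.s3 rather than luigi.contrib.s3). Note that run() executes inside the worker's own Python process, which is the loading issue above:

    import luigi
    from luigi.contrib.s3 import S3Target


    class GenerateReport(luigi.Task):
        """Toy upstream task that writes a local file."""
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget("/tmp/report-%s.csv" % self.date)

        def run(self):
            with self.output().open("w") as f:
                f.write("id,value\n1,42\n")


    class UploadReport(luigi.Task):
        """Copies the local report to S3; the bucket is hypothetical."""
        date = luigi.DateParameter()

        def requires(self):
            return GenerateReport(self.date)

        def output(self):
            # Luigi checks whether this target exists to decide
            # whether the task still needs to run.
            return S3Target("s3://my-bucket/reports/%s.csv" % self.date)

        def run(self):
            # This body runs in the worker's Python process.
            with self.input().open("r") as src, self.output().open("w") as dst:
                dst.write(src.read())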

I'm excited about Pinball's architecture. I'd like to try using Pinball as a scheduler to execute instances of existing Luigi task classes on multiple servers.


I've ported several reasonably complex jobs (files delivered to FTP at arbitrary times, to be run through several Hadoop jobs) to Luigi and it's been very good: much more resilient than trying to use something that can only schedule jobs at specific times of the day.

It also has few dependencies and is lightweight (i.e. it's all python, so no JVM tying up resources).
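
A rough sketch of that trigger-on-file-arrival pattern in Luigi (paths and the transformation are placeholders): the pipeline runs as soon as the delivered file exists, rather than at a fixed clock time.

    import luigi


    class InputFileDelivered(luigi.ExternalTask):
        """A file dropped off by an external system (e.g. via FTP).
        Luigi never runs this task; it only checks that the target exists."""
        path = luigi.Parameter()

        def output(self):
            return luigi.LocalTarget(self.path)


    class ProcessDelivery(luigi.Task):
        """Runs as soon as its input exists, not at a fixed time of day."""
        path = luigi.Parameter()

        def requires(self):
            return InputFileDelivered(self.path)

        def output(self):
            return luigi.LocalTarget(self.path + ".processed")

        def run(self):
            # Stand-in for the real Hadoop steps described above.
            with self.input().open("r") as src, self.output().open("w") as dst:
                dst.write(src.read().upper())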


fwiw, it is not the case that pinball can schedule jobs only at specific times of the day. In fact the scheduler is merely a special type of worker that happens to start new workflows. It is totally doable to kick off a new workflow at any point in time, bypassing the scheduler.

Also, Pinball is all Python, but it currently has a dependency on MySQL, so as a standalone tool it is definitely not as lightweight as Luigi; on the other hand, it offers much more in terms of available features.


Likewise (though I'm still at that company). It's truly amazing how many tasks can be broken down into this paradigm.

I think my favorite non-obvious aspect is how it allows you to write each component in a different language.


Does anyone know how this compares to full Business process modeling platforms, such as Activiti, jBPM, or Bonita?


Does anyone know how this compares to celery?


Celery (http://www.celeryproject.org/) is a distributed task queue. Pinball has the concept of a workflow, and within a workflow there are many jobs. Pinball helps translate a lot of application logic (workflows, schedules, jobs) into its system, and provides many functions for end users to manage their workflow jobs.
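
To make that distinction concrete: in Celery you get tasks and ways to compose them, but the workflow structure lives in your own application code. A minimal sketch, with a placeholder broker URL and stand-in task bodies:

    from celery import Celery, chain

    app = Celery("audio", broker="redis://localhost:6379/0")  # placeholder broker


    @app.task
    def fetch(url):
        return "/tmp/raw.wav"                 # stand-in for real work


    @app.task
    def transcode(path):
        return path.replace(".wav", ".mp3")   # stand-in for real work


    @app.task
    def upload(path):
        return "s3://my-bucket/" + path.rsplit("/", 1)[-1]


    # The "workflow" here is just application code composing tasks;
    # each result is passed along to the next task in the chain.
    pipeline = chain(fetch.s("http://example.com/raw"), transcode.s(), upload.s())
    # pipeline.delay()  # would enqueue the whole chain on the broker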

We did compare Pinball with Apache Oozie and Azkaban when we started this project.


Thanks for the details! I will look into these resources.


are there any open source toolkits that handle user workflows well?


Wow this is awesome



