
Dynein: Building an Open-Source Distributed Delayed Job Queueing System - prostoalex
https://medium.com/airbnb-engineering/dynein-building-a-distributed-delayed-job-queueing-system-93ab10f05f99
======
polskibus
I wish new projects that have huge reuse potential like this one, did not
start as single cloud provider-oriented.

~~~
uttpal
Check out Clockwork for a cloud-provider-independent distributed event scheduler:
[https://cynic.dev/posts/clockwork-scalable-job-scheduler/](https://cynic.dev/posts/clockwork-scalable-job-scheduler/)

------
andrewstuart
Airbnb has built a needlessly complex solution that uses more technologies and
layers than necessary. Postgres as a job queue would have met all the
requirements described in this article, without going to the extent of
building SQS-plus-DynamoDB systems and jumping through weird hoops to do
delayed job processing - SQL gets you all of that for virtually nothing.

I needed a globally distributed queueing and scheduling system.

Like Airbnb I also thought SQS looked ideal.

But I needed near-instantaneous response times, and I found that jobs given to
SQS might sit for tens of seconds before clients started reading them - put
something into SQS and there was no way to guarantee clients would start
processing the job immediately.

I also wanted sorting and priority queueing and the ability to do scheduling.
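
For what it's worth, priority and scheduling drop straight into the same SKIP
LOCKED claim query. Here's a rough sketch - the `priority` and `run_at`
columns are hypothetical additions to the queue table, not something from the
article:

```python
# hypothetical schema: message_queue(id, status, priority, run_at, created)
CLAIM_SQL = """DELETE FROM message_queue
WHERE id = (
  SELECT id
  FROM message_queue
  WHERE status = 'new'
    AND run_at <= now()       -- scheduling: skip jobs that are not yet due
  ORDER BY priority DESC,     -- priority queueing
           created ASC        -- FIFO within a priority level
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
RETURNING *;
"""
```

Same atomic claim-and-delete as before, just with a scheduling filter and a
priority sort bolted on.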

But all of the other queueing systems were heavyweight and complex, and needed
dedicated client libraries that were often patchy, out of date, and
unreliable.

I also absolutely required HTTP access to the queueing system, and I liked the
efficiency of SQS's long polling over HTTP, so I wanted that too.

So I wrote my own global queueing and scheduling system in a few handfuls of
lines of code using Python, Postgres's SKIP LOCKED, and an async web server.

I can't easily show you the async server code, but it's really trivial -
anyone can write a simple web server to sit in front of code that does
Postgres SKIP LOCKED. I've posted my Python SKIP LOCKED queueing code here on
HN before and I'll do it again now for anyone wanting to do the same thing.

Put the following code into a function in a web server that supports long
polling and you've met all of the goals that Airbnb faced, without needing
SQS or the other systems they built to make a globally distributed queueing
system. This solution is also lightning fast.

Here is a complete implementation - nothing needed but Postgres, Python and
psycopg2 driver:

    
    
        import psycopg2
        import psycopg2.extras
        import random
    
        db_params = {
            'database': 'jobs',
            'user': 'jobsuser',
            'password': 'superSecret',
            'host': '127.0.0.1',
            'port': '5432',
        }
        
        conn = psycopg2.connect(**db_params)
        cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
        
        def do_some_work(job_data):
            if random.choice([True, False]):
                print('do_some_work FAILED')
                raise Exception
            else:
                print('do_some_work SUCCESS')
        
        def process_job():
            # claim one queue item atomically; SKIP LOCKED lets concurrent
            # workers each take a different row without blocking on locks
            sql = """DELETE FROM message_queue
            WHERE id = (
              SELECT id
              FROM message_queue
              WHERE status = 'new'
              ORDER BY created ASC
              FOR UPDATE SKIP LOCKED
              LIMIT 1
            )
            RETURNING *;
            """
            cur.execute(sql)
            queue_item = cur.fetchone()
            if queue_item is None:
                # queue is empty - nothing was claimed, nothing to do
                conn.commit()
                return
            print('message_queue says to process job id: ', queue_item['target_id'])
            sql = """SELECT * FROM jobs WHERE id = %s AND status = 'new_waiting' AND attempts <= 3 FOR UPDATE;"""
            cur.execute(sql, (queue_item['target_id'],))
            job_data = cur.fetchone()
            if job_data:
                try:
                    do_some_work(job_data)
                    sql = """UPDATE jobs SET status = 'complete' WHERE id = %s;"""
                    cur.execute(sql, (queue_item['target_id'],))
                except Exception:
                    sql = """UPDATE jobs SET status = 'failed', attempts = attempts + 1 WHERE id = %s;"""
                    # to retry, insert a new message_queue row pointing at this job id
                    cur.execute(sql, (queue_item['target_id'],))
            else:
                print('no job found, did not get job id: ', queue_item['target_id'])
            conn.commit()
        
        process_job()
        cur.close()
        conn.close()
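
And since I mentioned long polling: the web-server side is just a loop that
keeps retrying the claim until a job appears or the poll window expires. A
rough sketch with plain asyncio - the `fetch_job` callable is a stand-in for
the SKIP LOCKED claim above, not code from my actual server:

```python
import asyncio
import time

async def long_poll(fetch_job, timeout=20.0, interval=0.5):
    # retry fetch_job() until it returns a job or the long-poll window
    # expires; fetch_job is a non-blocking callable returning a dict or None
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if job is not None:
            return job
        if time.monotonic() >= deadline:
            return None  # empty response; the client simply re-polls
        await asyncio.sleep(interval)

# quick demo with an in-memory list standing in for Postgres
pending = [{'id': 1, 'payload': 'send_email'}]
claimed = asyncio.run(long_poll(lambda: pending.pop(0) if pending else None,
                                timeout=1.0, interval=0.05))
print(claimed)  # → {'id': 1, 'payload': 'send_email'}
```

The client holds the HTTP request open while the server sits in this loop, so
a job inserted mid-poll is picked up within one `interval` rather than on the
client's next request.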

~~~
hartzell
[yeah, not a frequent poster, apologies for the markdown markup]

Thanks for sharing this, it's been fun to play with. I've been poking around a
bit, looks like `skip locked` was added back in v9.5.

For those who are interested in how it works, here are a few links that might
be useful:

- [What is SKIP LOCKED for in PostgreSQL 9.5?][2ndquadrant]

- [Postgres 9.5 feature highlight - SKIP LOCKED for row-level locking][otacoo]

- [The Skip Locked feature in Postgres 9.5][pgcasts]

The 2ndquadrant post points out that Oracle and SQL Server have similar
functionality. MySQL didn't at the time, but [seems to have added it in
8.0.1][mysql].

[2ndquadrant]: https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/

[mysql]: https://mysqlserverteam.com/mysql-8-0-1-using-skip-locked-and-nowait-to-handle-hot-rows/

[otacoo]: https://web.archive.org/web/20160626090321/http://michael.otacoo.com/postgresql-2/postgres-9-5-feature-highlight-skip-locked-row-level/

[pgcasts]: https://www.pgcasts.com/episodes/the-skip-locked-feature-in-postgres-9-5/

