
Show HN: 1M rows/s from Postgres to Python - 1st1
http://magic.io/blog/asyncpg-1m-rows-from-postgres-to-python/
======
userbinator
At 3.7GHz that's roughly 3700 clock cycles per row. With something that low,
I'd be really interested to see a breakdown of how many cycles were spent
doing what.

~~~
RedCrowbar
In the 1M benchmark roughly half is spent decoding rows (in [1]). The rest is
spread over interpreter overhead, GC in particular.

[1]
[https://github.com/MagicStack/asyncpg/blob/master/asyncpg/pr...](https://github.com/MagicStack/asyncpg/blob/master/asyncpg/protocol/prepared_stmt.pyx#L192)
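To give a feel for what that decoding involves: each result row arrives as a binary DataRow message (an int16 column count, then per column an int32 length, -1 meaning NULL, followed by that many data bytes). A minimal pure-Python sketch of the parsing (asyncpg's real implementation is Cython, not this code):

```python
import struct

def decode_data_row(buf: bytes):
    """Sketch of parsing a PostgreSQL DataRow message body:
    int16 column count, then per column an int32 length
    (-1 means NULL) followed by that many bytes of data."""
    (ncols,) = struct.unpack_from('!h', buf, 0)
    offset = 2
    row = []
    for _ in range(ncols):
        (length,) = struct.unpack_from('!i', buf, offset)
        offset += 4
        if length == -1:
            row.append(None)  # NULL column
        else:
            row.append(buf[offset:offset + length])
            offset += length
    return row

# A fake two-column row: b'abc' and NULL.
body = struct.pack('!h', 2) + struct.pack('!i', 3) + b'abc' + struct.pack('!i', -1)
print(decode_data_row(body))  # [b'abc', None]
```

On top of this, a fast driver also converts each column's bytes into a Python object, which is where most of those per-row cycles go.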

------
tjpaudio
This is really awesome, I'm going to give it a try in my latest project. Have
there been any apps/sites with large amounts of I/O running this yet? Curious
how reliable it is at this stage.

~~~
1st1
We open-sourced asyncpg just a couple of weeks ago, and right now it has beta
status. Not that we're aware of any bugs in it, or plan to add a lot of new
features. So I guess you should give it a try, carefully ;)

------
jargino
> _For fairness, all tests were run in a single-thread (GOMAXPROCS=1 for Go
> code) in async mode._

Isn't this just hobbling Go so you can beat it? Doesn't seem like something
done "for fairness" at all.

~~~
1st1
Python and nodejs applications are usually deployed in a multiprocess
configuration (and Go apps typically run without a GOMAXPROCS limit). So in
production, applications in all of those languages use all CPU cores.

For the benchmarks, we didn't want to complicate things with a multi-process
setup; the idea was to compare the raw performance of the drivers.

~~~
Thaxll
"Usually deployed" where? None of the Python apps I've seen in prod use
multiple processes to bypass the GIL/#cores limitation.

~~~
nhumrich
GIL only applies to multi-threaded, not multi-process. Multi-process is a
great way to overcome the GIL

~~~
mirekrusin
At a cost of memory.

------
Dowwie
Does this library address any of the issues/concerns that Mike Bayer discussed
in this earlier blog post?

[http://techspot.zzzeek.org/2015/02/15/asynchronous-python-
an...](http://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-
databases/)

~~~
zzzeek
the 1M row/sec part is easy to explain:

> asyncpg extensively uses PostgreSQL prepared statements. This is an
> important optimization feature, as it allows to avoid repeated parsing,
> analysis, and planning of queries. Additionally, asyncpg caches the data I/O
> pipeline for each prepared statement.

it would be great if psycopg2 did this, at least within the context of
executemany(). You don't need async to use prepared statements. But even if
they did this, it wouldn't be this amazing speed-everywhere kind of thing,
just for when you do a bulk INSERT/UPDATE/DELETE. The prepared statement is
linked to a cursor and a statement handle so it's not like you can have
hundreds of these just queued up in a cache without a great deal of
complexity.
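To make that bookkeeping concrete, here's a toy sketch (hypothetical code, not asyncpg's or psycopg2's actual internals) of a bounded LRU statement cache keyed by SQL text, with the eviction step where that complexity lives:

```python
from collections import OrderedDict

class StatementCache:
    """Toy LRU cache of prepared-statement handles, keyed by SQL text."""

    def __init__(self, maxsize=100):
        self.maxsize = maxsize
        self._stmts = OrderedDict()

    def get_or_prepare(self, sql, prepare):
        if sql in self._stmts:
            self._stmts.move_to_end(sql)  # mark as most recently used
            return self._stmts[sql]
        handle = prepare(sql)  # stands in for sending Parse to the server
        self._stmts[sql] = handle
        if len(self._stmts) > self.maxsize:
            old_sql, old_handle = self._stmts.popitem(last=False)
            # a real driver must also deallocate old_handle server-side,
            # and must not do so while it is still in use
        return handle

calls = []
def fake_prepare(sql):
    calls.append(sql)
    return 'handle:' + sql

cache = StatementCache(maxsize=2)
cache.get_or_prepare('SELECT 1', fake_prepare)
cache.get_or_prepare('SELECT 1', fake_prepare)  # cache hit: no second prepare
print(calls)  # ['SELECT 1']
```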

The TL;DR of the blog post you mention is that async itself usually has little
to do with speed when talking to a database. The database is typically on the
same network as your application, and you're typically using no more than a
few dozen connections per process (PostgreSQL uses a process per connection,
so using thousands of connections isn't feasible anyway), so the overhead of
async, and especially of Python 3's async IO, slows things down considerably
without offering any opportunity to regain that speed. The GIL doesn't block
on IO, so threads work just fine, and most applications are CRUD /
transactional applications, which means you can't spawn off several dozen
queries within a transaction at the same time anyway; you have to wait for
each one to complete serially regardless.

I'd be very wary of a new PG driver especially one that implements the wire
protocol itself. pg8000 does this, and also does prepared statements, and
while it doesn't have the speed approach here, it also has serious problems
supporting all the features and datatypes that psycopg2 does. It's been around
for years and reliably breaks on many Postgresql features (and Postgresql
introduces new features like a firehose). Writing and maintaining a driver for
Postgresql is _not_ an easy task.

~~~
RedCrowbar
> the 1M row/sec part is easy to explain:

Prepared statements were used in all benchmarks where supported. Please don't
dismiss the results so easily. The source of the performance increase here is
not the use of prepared statements, but the efficient implementation of
binary type I/O and of the protocol itself.

> I'd be very wary of a new PG driver especially one that implements the wire
> protocol itself. pg8000 does this, and also does prepared statements, and
> while it doesn't have the speed approach here, it also has serious problems
> supporting all the features and datatypes that psycopg2 does.

asyncpg supports 100% of pg_catalog types (we have a test for that.)
Additionally, psycopg2 doesn't _really_ support all the features and datatypes
and simply returns unparsed text in many cases. You have to write lots of
typecasters to convince it otherwise.
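For readers unfamiliar with the issue: for many types psycopg2 hands back the server's text representation, and it's on you to parse it. For example, a one-dimensional int array arrives as the string '{1,2,3}', and a typecaster is essentially a parsing function. A simplified sketch (ignoring quoting, NULL elements, and nested arrays):

```python
def cast_int_array(value):
    """Parse a Postgres text-format int array like '{1,2,3}'.
    Simplified: no quoting, NULL elements, or nested arrays."""
    if value is None:
        return None
    inner = value.strip('{}')
    return [int(x) for x in inner.split(',')] if inner else []

print(cast_int_array('{1,2,3}'))  # [1, 2, 3]
print(cast_int_array('{}'))       # []
```

A binary-protocol driver skips this text round-trip entirely and decodes the type's wire format directly.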

> It's been around for years and reliably breaks on many Postgresql features
> (and Postgresql introduces new features like a firehose). Writing and
> maintaining a driver for Postgresql is not an easy task.

Please. Simply because something has been around for years is not a reason to
stifle new development and innovation.

~~~
zzzeek
I have no intention to stifle anything; I'm only raising the point that this
seems like a really new driver and it's likely that it would take a while
before it approaches the stability of psycopg2. That is, the stability of
psycopg2 is not to be so lightly thrown in the trash.

Does your driver seamlessly handle reading and writing of three-dimensional
arrays of JSONB structures, and arrays of custom enumerated types?

edit:

> but efficient implementation of the binary type I/O and the protocol itself.

that's great too! you don't need async for that either.

I guess I am frustrated that you folks put all this work into a great driver
and then locked it in the largely useless dungeon of asyncio, where all the
libraries are totally useless to those of us who don't think node.js is the
programming model for all cases.

~~~
RedCrowbar
> I have no intention to stifle anything, I'm only raising the point that this
> seems like a really new driver and it's likely that it would take a while
> before it approaches the stability of psycopg2. That is, the stability of
> psycopg2 is not to be so lightly thrown in the trash.

We in no way question the merits and the stability of psycopg2. We created
asyncpg to solve the challenges we face in the development of EdgeDB; it's
not just a toy project for fun.

> Does your driver seamlessly handle reading and writing of three-dimensional
> arrays of JSONB structures, and arrays of custom enumerated types?

Yes it does.

> edit:

> > but efficient implementation of the binary type I/O and the protocol
> itself.

> that's great too! you don't need async for that either.

> I guess I am frustrated that you folks put all this work into a great driver
> and then locked it in the largely useless dungeon of asyncio, where all the
> libraries are totally useless to those of us who don't think node.js is the
> programming model for all cases.

I'm sorry to hear that. Nobody forces you to use asyncio or asyncpg. But
please don't try to claim that something is a "useless dungeon" simply because
you disagree.

~~~
sametmax
The tone used is indeed not the proper one, but while I'm a big asyncio
advocate, I think it would be beneficial if you would split your project in 2,
separating the IO from the rest (like hyper does
[https://github.com/Lukasa/hyper](https://github.com/Lukasa/hyper)).

The rationale being:

- Python is not just async; it is currently mostly sync. A lot of the
  community would benefit from your awesome project;
- you'd get contributions from both the async and sync communities;
- async may sometimes not be worth it for databases
  ([http://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-databases/](http://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-databases/));
- some people will prefer to use threads or gevent;
- you would see compat plugins created for current ORMs;
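For reference, the hyper-style split means a sans-IO protocol core that only consumes bytes and emits parsed messages, with thin sync and async shells around it. A toy sketch of the shape (hypothetical names, newline framing standing in for real protocol framing; nothing to do with asyncpg's internals):

```python
class ProtocolCore:
    """Sans-IO core: knows the wire format, never touches a socket."""

    def __init__(self):
        self._buffer = b''

    def feed(self, data: bytes):
        """Consume raw bytes; return the complete newline-framed
        messages, keeping any trailing partial message buffered."""
        self._buffer += data
        *messages, self._buffer = self._buffer.split(b'\n')
        return messages

# A blocking shell and an asyncio shell would both drive this same core,
# each doing its own socket reads and passing the bytes to feed().
core = ProtocolCore()
print(core.feed(b'hello\nwor'))  # [b'hello']
print(core.feed(b'ld\n'))        # [b'world']
```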

That aside, I will use your project as is, since I'm deep in asyncio these
days. Thank you.

~~~
1st1
> I think it would be beneficial if you would split your project in 2,
> separating the IO from the rest (like hyper does
> [https://github.com/Lukasa/hyper](https://github.com/Lukasa/hyper)).

I'd like to do that, but we don't have the resources. If we had a couple of
capable devs volunteering on GitHub to do this, I'd be glad to review patches
and guide the development.

~~~
sametmax
I get that. I have the same problem. And it's not just resources; what you
need are sufficiently skilled resources, and that's even harder.

------
1st1
asyncpg was developed by @redcrowbar and me. Ask us anything ;)

~~~
rdtsc
Great work, thanks for sharing. Very impressive speed.

I didn't know about uvloop, that's a cool project:

[http://magic.io/blog/uvloop-blazing-fast-python-
networking/](http://magic.io/blog/uvloop-blazing-fast-python-networking/)

~~~
1st1
Yep, it's also much more stable now, 3 months after that article was posted. I
plan to make a new blog post about it soon.

------
rambos
Fastest setup of a PG client, to date, on OSX for me. No nuances or other
package installers like Fink necessary.

------
pxlfkr
Using Cython means you don't/can't use PyPy? Was that something you
considered, e.g. with psycopg2cffi or a pure-Python driver? It seems to be
the de facto performant runtime for Python, and since performance is
something you guys obviously care a lot about, I'm surprised it hasn't been
mentioned in this thread.

Very excited to try out your code - thanks for sharing it.

~~~
RedCrowbar
A pure-Python implementation would probably never reach this level of
performance, even under PyPy. Cython really makes it possible to basically
drop down to C and write very efficient code.

That said, once PyPy supports 3.5, we will consider making a CFFI binding.

------
spriggan3
Why is node pg so slow relative to the Go and asyncpg drivers? I'm sure there
is room for improvement here.

~~~
also_on_sunday
Primary author of node-postgres (both the JavaScript & libpq versions) here.
Very good question. I'd say part of the relative speed is because it's been a
mostly 1 person (me!) project for a loooong time & I haven't had a lot of time
to do perf analysis to find where exactly to make it faster. I would
_absolutely love_ some help! My time for open source is pretty limited, and I
try to spend most of it focused on triaging issues. If you'd like to help out
please don't hesitate to open an issue, a pull request, or send me a gchat
(contact info is in node-postgres readme) and we can discuss!

------
merb
would be cool to also make jmh tests against jdbc and async pg on java.

Actually, what I also found is that Python wasn't slow in these cases, but it
was slow when you do map operations on the lists and reorder the data or
convert it to other structures.

~~~
RedCrowbar
> would be cool to also make jmh tests against jdbc and async pg on java.

Pull requests to
[https://github.com/MagicStack/pgbench](https://github.com/MagicStack/pgbench)
are welcome :-)

------
Fiahil
We've been using Python (2.7) and Postgres, where I work, and I must say I
really miss the async feature. It's really convenient to be able to
parallelize and saturate I/O very quickly. Right now, we're using a
combination of threads and processes without proper messaging, which leads to
deadlocks and other nasty things. That's one thing I've been quite pleased
with, when using Scala.

------
thang
Awesome work! Can't wait to see more about EdgeDB.

------
bedros
would this work with django?

~~~
1st1
asyncpg was designed for asyncio (and async/await). So it can be used in any
asyncio, Tornado (and soon Twisted) program.

Django doesn't support asyncio, it's built for blocking IO model. So,
unfortunately, it's not possible to use asyncpg with it.

~~~
mamcx
But could it be, if the code is made to block? I.e., as if the async part didn't matter?

~~~
1st1
Theoretically yes, but it would require us to seriously re-architect the
protocol implementation to support several IO modes.
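In principle, a blocking facade would just drive the event loop to completion for each call. A toy sketch with a stand-in coroutine (not asyncpg itself, whose protocol internals would need the changes described above):

```python
import asyncio

async def _fetch_async(query):
    # Stand-in for an async driver call.
    await asyncio.sleep(0)
    return [('row-for', query)]

def fetch_blocking(query):
    """Synchronous wrapper: run the coroutine to completion."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(_fetch_async(query))
    finally:
        loop.close()

print(fetch_blocking('SELECT 1'))  # [('row-for', 'SELECT 1')]
```

The catch with a real driver is everything this sketch glosses over: a protocol written to suspend on awaits has to be restructured to also work when driven synchronously.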

------
davidw
Wonder how this (Erlang driver) would stack up in the benchmarks (devel
branch): [https://github.com/epgsql/epgsql](https://github.com/epgsql/epgsql)

~~~
1st1
Consider making a PR to
[https://github.com/MagicStack/pgbench](https://github.com/MagicStack/pgbench)
:)

------
denfromufa
How do you develop your Cython code? Do you use PyCharm? Did you start off
from pure Python code? How easy is debugging and refactoring?

~~~
1st1
I use SublimeText with a Cython highlighter; @redcrowbar uses Atom. I started
coding uvloop & asyncpg directly in Cython, basically from scratch.

Debugging & refactoring are pretty much as easy as for pure Python code (which
is a huge win over C). Sometimes when you're writing low-level code you can
get a segfault, in which case lldb (or gdb) helps you understand what's going
on relatively quickly.

~~~
denfromufa
So do you use cygdb to debug Cython code or directly debug the generated C-API
code?

~~~
1st1
I usually debug the generated C code directly (usually it's quite easy to
understand the bug once you see where the program crashes.)

------
wampler
Cool! Wish there were a Django wrapper, but then I realised it is designed
for asyncio.

------
amelius
What about the complexity of the rows? For instance, how many columns per row?

~~~
RedCrowbar
12 in the particular case of 1M rows/s. The details are available in the
benchmark report: [http://magic.io/blog/asyncpg-1m-rows-from-postgres-to-
python...](http://magic.io/blog/asyncpg-1m-rows-from-postgres-to-
python/report.html#bench1)

------
jamiesonbecker
Would this work/be binary-compatible with Amazon Red Shift?

~~~
1st1
Should be. AFAIK psycopg2 works with Redshift just fine, so should asyncpg.

The only thing I'm not sure about is what kind of authentication mode Redshift
is using. asyncpg doesn't support all of them, but that's easy to fix.

------
xgess
i'd love to see raw, command-line psql added to these benchmarks

------
landmark3
time for a rollback, Uber

