

Massively faster DB with async batching - willvarfar
http://williamedwardscoder.tumblr.com/post/16516763725/how-i-got-massively-faster-db-with-async-batching

======
steder
It's a great idea, except: you're now eventually consistent. You've
reimplemented a ton of database logic in your application. The database
already reorders inserts and checks for conflicts. The database is also
already doing its best to be durable (write-ahead logs, for example). You
can't run more than one server, due to the custom batching, reordering logic,
and in-memory caching, and a hardware failure will result in loss of data.

You've put all your data in one basket. Anything happening to _the_
application server results in data loss. Of course hardware failures happen
and all the batching behavior here seems like it exacerbates the problem. A
single server handling all writes and reads also makes it particularly
difficult to update that server at any point in time. When is it safe to
restart this server to apply an update or fix a bug?

Basically the author has avoided using a NoSQL database by reinventing a
bunch of things that provide all of the downsides of a NoSQL database
(eventual consistency, no ACID semantics) without any of the upsides (high
availability, replication).

Alternative solutions to the "handle many requests on top of a SQL server"
problem?

Have more application servers. You don't need to service all user requests
from a single machine. You don't need to have user 10 wait on user 1's
database write if they're independent. Multiple servers have other benefits
like rolling updates...

Use memcache.

Use a (durable) message queue to offload work from the web servers and do
non-essential updates/inserts offline.

And of course: Just use a NoSQL database and design your application
appropriately.
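The message-queue suggestion can be sketched in a few lines. Here an in-process queue and thread stand in for a real durable broker and worker process; all names (`enqueue_write`, `apply_to_db`) are illustrative, not from any particular framework:

```python
# Sketch: the request handler enqueues non-essential writes and returns
# immediately; a worker applies them to the DB later. A real deployment
# would use a durable broker rather than an in-memory queue.
import queue
import threading

applied = []                 # stand-in for rows actually written to the DB
write_queue = queue.Queue()

def enqueue_write(statement, params):
    """Called from the request handler; returns without touching the DB."""
    write_queue.put((statement, params))

def apply_to_db(statement, params):
    """Stand-in for the real, possibly slow, DB call."""
    applied.append((statement, params))

def worker():
    while True:
        item = write_queue.get()
        if item is None:     # shutdown sentinel
            break
        apply_to_db(*item)

t = threading.Thread(target=worker)
t.start()
enqueue_write("INSERT INTO log (msg) VALUES (%s)", ("hello",))
write_queue.put(None)
t.join()
```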

~~~
willvarfar
Fair enough viewpoint. I think it's more nuanced than that.

The DB queue may be a buffer, but the callbacks are not triggered until
success/failure (and when caching like I did for the SELECTs, that doesn't
happen until then either). This is just the same as the twisted/tornado DB
connection pool approach to getting DB access async, but it gets much better
performance out of it by merging with client-side logic; it's still the same
semantics, and there's no risk of a request succeeding but the DB not.
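A minimal sketch of that shape, with callbacks held back until the batch actually executes. The per-table multi-row merge and all names are illustrative assumptions, not the article's actual code:

```python
# Requests enqueue (table, row, callback); a worker drains the queue,
# merges rows per table into one multi-row INSERT each, executes the
# batch, and only then fires each request's callback.
import queue

db_queue = queue.Queue()

def submit(table, row, callback):
    db_queue.put((table, row, callback))

def drain_and_commit(execute):
    """Drain pending work, merge per table, commit, then fire callbacks."""
    batches, callbacks = {}, []
    while True:
        try:
            table, row, cb = db_queue.get_nowait()
        except queue.Empty:
            break
        batches.setdefault(table, []).append(row)
        callbacks.append(cb)
    for table, rows in batches.items():
        execute(table, rows)      # one multi-row INSERT per table
    for cb in callbacks:
        cb("committed")           # callbacks fire only after success

executed = {}                     # toy "database" recording what ran
results = []
submit("orders", (1, "a"), results.append)
submit("orders", (2, "b"), results.append)
drain_and_commit(lambda t, rows: executed.setdefault(t, []).extend(rows))
```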

~~~
willvarfar
I have put some sketches into my post to describe when things hit disk and
when the client making the request continues. I hope this clarifies all the
ACID questions.

~~~
steder
Ah, you don't respond to the user until the DB interactions for a given user
have been processed as a batch. I was under the impression that on a write you
were: updating the in memory dataset so that the user could immediately read
their data after the write, queuing a write to MySQL, and then responding to
the user (200 OK) even though the data had not been committed.

So my concern regarding consistency was that during this period after
responding to the user, but before your queue committed the data, a bug or
hardware failure would result in some users thinking that their requests were
successful when none of their work had been saved.

As long as the user doesn't receive a response until data has been committed
to your DB this makes sense.

Now that I understand the flow a bit better I agree that it makes sense. This
is similar to what folks were trying to do with MySQL-Proxy. I still think
that you're asking for trouble by coupling your components together in the
same process and would really encourage you to take it to the next level by
factoring out caching and proxying/batching of db access into their own
services.

The more I think about this the more it seems like it would be reasonable to
build this as a separate twisted process, exposed through an API compatible
with twisted.enterprise.adbapi... Figuring out reasonable (and general) ways
to determine which statements can be safely batched could be quite
challenging though.

~~~
willvarfar
I happen to be using Python's new built-in multiprocessing module. It has a
suitable queue. I am using the same logic to talk to rabbitmq. I would avoid
TCP.

~~~
steder
Where's rabbitmq in your architecture? What's your concern with TCP?

~~~
willvarfar
I meant I'd rather use Queue.Queue or multiprocessing.Queue to talk to a
worker than TCP. It's just my preference.

It's generally the principle of least power: you'd use a thread (nice for the
callback handling, and for 'interactions') or a process (if you worry about
the GIL, though it complicates callbacks and interactions), and only jump to
a TCP server if you really, really could justify it.

I also meant that the approach of workers that do coalescing is useful for
non-DB things too; we're using a worker to buffer rabbitmq, for example.
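That coalescing-worker shape for a non-DB sink might look like the sketch below: a multiprocessing.Queue drained in batches, with a plain list standing in for the rabbitmq publish. A thread is used for brevity; a multiprocessing.Process works the same with this queue type. The structure is my illustration, not the commenter's code:

```python
# A worker that blocks for the first message, then coalesces whatever
# else is already waiting and forwards the lot as one batch ("publish").
import multiprocessing
import queue
import threading

q = multiprocessing.Queue()
published = []                      # stand-in for one broker publish per batch

def buffer_worker():
    done = False
    while not done:
        item = q.get()              # block for the first message
        if item is None:            # shutdown sentinel
            break
        batch = [item]
        while True:                 # coalesce anything else pending
            try:
                nxt = q.get(timeout=0.1)
            except queue.Empty:
                break
            if nxt is None:
                done = True
                break
            batch.append(nxt)
        published.append(batch)

for i in range(3):
    q.put(i)
t = threading.Thread(target=buffer_worker)
t.start()
q.put(None)
t.join()
```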

------
skrebbel
This strikes me as a ridiculously good idea. I've no clue whether it's really
novel, but to me it is.

Anyone got a clue about the implications? Also, how difficult is it to do
(beyond the author's own impressions)? Would it be possible to make a more
generic solution?

~~~
julochrobak
What do you find so ridiculously good about this? I find it complicated. If I
understood it properly, it's just making inserts asynchronous (removing
durability) and batching data into a single statement (removing atomicity).
Or did I miss something?

~~~
willvarfar
It's still ACID. It's the same callee code as you already use in, for
example, twisted to get async DB access. It just gets way better performance,
because DB speed is measured in requests/sec and not in how big those
requests are. So merging many requests into a single one, where possible,
means you get more throughput.

~~~
marshray
But it's not ACID. For example, if the server or process running his database
worker queue were to crash, records would be lost.

Of course, it's quite likely an acceptable trade-off for these records.
Relational DBs sold everyone on ACID in the 90's, but a lot of data doesn't
really deserve its cost. He hasn't found a way around Brewer's Conjecture (and
I wish I had a nickel for every time it's rediscovered :-).

~~~
willvarfar
But then the callback to the request would time out, and the callee code
waiting on the web request would be under no misapprehension about whether
the thing was committed or not.

Things in the queue get acknowledged - and sometimes resultsets or IDs
returned, in the case of SELECTs and INSERTs that return auto keys.

~~~
marshray
OK, I see. So you're batching related inserts to cut down on the
per-transaction overhead, but not actually returning anything to the client
until the commit is confirmed.

I think this is a good method, as long as you're careful about these things.
(I assume you've taken care of them, just talking about this method in
general).

1) If there were multiple transactions in the web request, previously the
second ones likely wouldn't have been run if the first were to fail. This
would likely change that and it could even necessitate having rollback for the
second transaction.

2) Now there is a potential "thundering herd" problem. When the transaction
batch commit notification is received, all 1000 web requests made over that
100 ms period become ready to execute and start competing for
network/disk/database resources _simultaneously_. A disk that was just sitting
idle now has 1000 different things to do at once. If everybody hits the
network at once it can cause temporary packet loss. When packets are lost,
connections can be dropped or delayed by retransmits. Sometimes client
retries and retransmits can end up synchronized. In the worst case, such a
system can
end up oscillating wildly between excessive load and excessive idle.
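One common mitigation for that herd is to add jitter when dispatching the batch's completions, spreading the 1000 continuations over a short window instead of firing them simultaneously. This is a generic sketch of the idea, not something from the article; the function name and window size are made up:

```python
# Assign each callback a random delay within a small window; an event
# loop would then run each callback after its delay rather than firing
# them all the instant the commit notification arrives.
import random

def dispatch_with_jitter(callbacks, window_ms=50.0, rng=random.random):
    """Return (delay_ms, callback) pairs spread across the window."""
    schedule = sorted(
        (rng() * window_ms, i, cb) for i, cb in enumerate(callbacks)
    )
    return [(delay, cb) for delay, _, cb in schedule]
```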

~~~
willvarfar
Yes, this is what twisted's "interactions" are, and it's how I talk about
throwing rollback-triggering exceptions and such in the article.

Yes, you do have to be careful about saturating links generally. The
thundering herd thing is a rich man's problem :)

------
rkalla
Anyone have a good idea of what threshold should be introduced into the
Coalescing Queue?

For example, on a slow server you could imagine a command coming in every
second; you likely don't want to hold onto those commands for a second
artificially just to find out there are no new commands to combine them with;
you just want to answer right away.

Now your server is getting busier... 4 commands a second... do you want to
artificially hold commands for 250ms+ or longer to see if there are new
commands to combine with it?

I imagine not...

In the scenario where you are getting 500+ req/sec it makes a lot of sense to
coalesce them, but at some point your queue needs to decide when to make a cut
in the queue, take that slice of commands, process/combine them THEN send them
to the server and it is that heuristic I'm curious about.

~~~
willvarfar
I put in the batching point at: 1) over a certain time limit, e.g. 100ms; 2)
over a certain number of commands, e.g. 2000; 3) encountering a specific
'flush' command, which I injected sometimes to simplify the logic.
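Those three batching points can be expressed as a single predicate. The thresholds are the examples above; the function itself is just an illustrative sketch, not the author's code:

```python
# Flush the batch when a time limit is exceeded, a command-count limit
# is reached, or an explicit flush sentinel has been injected.
FLUSH = object()  # sentinel injected into the command stream

def should_flush(batch, batch_age_ms, max_age_ms=100, max_commands=2000):
    if batch and batch[-1] is FLUSH:
        return True
    if len(batch) >= max_commands:
        return True
    if batch and batch_age_ms >= max_age_ms:
        return True
    return False
```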

Those were the numbers I, by trial and error, determined for my server. But
remember, I didn't do SELECTs; I put everything I might SELECT into memory. So
my numbers are rather application-specific perhaps.

It's important to measure the average and worst-case response times, not the
best case. Under light load, best-case times come from blocking calls, which
is why conventional ORM looks good enough in prototyping ;)

~~~
rkalla
Thanks Will; I missed all that for some reason. Appreciate you sharing your
actual threshold numbers (would have been my second question).

------
dkhenry
I think the problem being described here is that he is using a database where
he doesn't need to. This is the problem things like Cassandra were made to
solve. Effectively he is not storing things in the DB, he is logging them to
the DB in case of a program crash. To speed things up he stores his entire
data set in RAM client side. Writes are buffered and written to the DB after
they have modified the client-side data. I assume the point is that when the
program crashes or dies it just rebuilds the client-side data from the DB and
then proceeds to log to the DB as before. He could get rid of a lot of his
code and add features by using something like MongoDB, Cassandra, or Riak.

~~~
jamii
> Effectively he is not storing things in the DB, he is logging them to the
> DB in case of a program crash. To speed things up he stores his entire data
> set in RAM client side.

At this point it's not worth even using a database. I'm currently working on an
experimental framework at Smarkets where we just write events directly to disk
and do everything else in memory (with the occasional snapshot to speed
recovery). If all goes well it will be open-sourced in a few weeks.
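The shape described here (append events durably, apply them in memory, replay the log on restart) can be sketched generically. This is just the pattern, not Smarkets' actual framework, and the event format is made up:

```python
# A tiny event log: each record is fsynced to disk before we continue,
# in-memory state is the fold of all events, and recovery is a replay.
import json
import os
import tempfile

class EventLog:
    def __init__(self, path):
        self.path = path
        self.state = {}
        if os.path.exists(path):          # recovery: replay the log
            with open(path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def _apply(self, event):
        self.state[event["key"]] = event["value"]

    def record(self, key, value):
        event = {"key": key, "value": value}
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())          # durable before we continue
        self._apply(event)

path = os.path.join(tempfile.mkdtemp(), "events.log")
log = EventLog(path)
log.record("balance", 10)
log.record("balance", 25)
recovered = EventLog(path)                # simulate a crash + restart
```

A periodic snapshot of `state` would bound the replay time, as the comment mentions.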

LMAX have taken this idea to extremes and have some mind-blowing benchmarks -
<http://martinfowler.com/articles/lmax.html>

~~~
willvarfar
I think the merging and reordering stands alone as a solution; the extension
where, in my particular project, I cache to avoid some common SELECTs is a
distinct subject. They complement each other and happen to be in the same
post, is all.

------
rarrrrrr
If you're hitting the database concurrently with many sessions, you can get
much of the same benefit by just adjusting the PostgreSQL commit_delay and
commit_siblings parameters, so the database batches things into fewer flushes
and fsyncs for you.
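For reference, the two parameters are set in postgresql.conf along these lines (the values here are illustrative, not recommendations; check the documentation for your version's units and defaults):

```ini
# postgresql.conf -- illustrative values
commit_delay = 1000        # microseconds to wait before flushing WAL
commit_siblings = 5        # only delay if at least this many other
                           # transactions are concurrently active
```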

~~~
willvarfar
Don't you want to know that the statement hit disk before continuing? If
you're a web request handler that is, say, submitting an order, don't you
want that order to hit disk before the web request continues and returns the
order #?

~~~
rarrrrrr
It will hit the disk before continuing. It just causes server side batching of
more operations into fewer flushes and fsyncs, often improving overall
throughput in the case that you have several connections concurrently hitting
the DB.

Note also that if you do want to avoid atomic commit for particular
transactions (less important stuff), you can do that by issuing SET LOCAL
synchronous_commit TO OFF within the transaction. But that's not what
commit_delay does.

[http://www.postgresql.org/docs/9.1/static/wal-configuration.html](http://www.postgresql.org/docs/9.1/static/wal-configuration.html)

[http://www.postgresql.org/docs/9.1/static/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT](http://www.postgresql.org/docs/9.1/static/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT)

~~~
willvarfar
And doing the same client-side often improves overall throughput in the same
manner, with a few connections multiplexed between a large number of
requests.

------
b0b0b0b
Do you have a cluster of application servers? What transaction isolation level
do you use? Will User A mind that his update is no longer a single transaction
and that it may be commingled with txn's of Users B&C?

------
kermitthehermit
This is pretty cool.

Do you plan to continue working on hellepoll? I liked the idea and would love
to see it developed further.

~~~
willvarfar
Most of what's built on hellepoll has been proprietary.

If hellepoll had a DB layer, it might well come with a merging/reordering
worker queue though ;)

~~~
kermitthehermit
I see. It would be really nice to have a community which develops hellepoll.

After all, having a maintained alternative to node.js written in C++ would be
very nice.

I'm not saying node.js isn't OK or that those who use it are doing something
wrong. It's good to have many tools for different scenarios and different
preferences.

Would you like to start such a community for hellepoll? I'd be the first one
to join you.

Thank you for your reply and for publishing hellepoll.

~~~
willvarfar
thank you for the kind words.

two does not a crowd make, sadly

------
blibble
wouldn't it be easier (and have the same effect) to just wrap 100 statements
(or, say, 60 seconds' worth) of delay-able INSERTs into a transaction with
the correct isolation level set?

or use INSERT DELAYED and let the database do it itself.

~~~
necro
Author states that he is using InnoDB, and INSERT DELAYED is a MyISAM
mechanism. Of course, in InnoDB the buffer pool / insert buffer can be
thought of the same way. The delay the author is noting is probably due to
the flush of the log file data.

If you configure InnoDB correctly it comes down to how much IO you can do on
your log fs, to make sure you can sustain some update rate to this sequential
log. Then next is the IO to the db fs, so you can sustain the rate of the
flushes. You can tune some of this much better with some Percona MySQL
patches. InnoDB does insert batching for you to minimize IO, and if you are
doing updates to the same data it may even buffer the changes in the buffer
pool.

------
thesorrow
tldr: Group your inserts! "Together we are stronger" – Spartacus

~~~
rkalla
A bit more than that: "Group, convert and collapse your DB commands!"

    
    
      insert 1, insert 2, insert 3 == insert [1,2,3]
      insert 1, delete 1 == [no-op]
    

As the author pointed out you need to know a bit more about your domain to
make this really efficient, for example:

    
    
      insert user1.dob, delete user2.dob = [2 discrete ops]
    

while

    
    
      update user1.dob, delete user1.dob = [1 op - delete user1.dob]
    

Cool idea; lots of opportunities for enhancement. The one thing you would have
to look out for is effectively re-implementing a query optimizer inside of
your application code and eating up _too_ much CPU time on a busy system
prepping calls.

If you are not I/O bound, then this strategy isn't going to help you that
much. If you are I/O bound, it is a great way to help get yourself out of that
hole.
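The collapsing rules sketched above can be made concrete with a toy model: tuples of (verb, key) rather than real SQL, and deliberately nothing like a full query optimizer. The function and its rules are my illustration of the comment's examples:

```python
# Collapse a command stream: consecutive INSERTs merge into one batched
# insert; INSERT then DELETE of the same key cancels out entirely;
# UPDATE then DELETE of the same key collapses to just the DELETE.
def coalesce(ops):
    """ops: list of (verb, key) tuples; returns a collapsed list."""
    out = []
    for verb, key in ops:
        if verb == "delete":
            # drop any pending insert/update on the same key
            pending = [o for o in out if o[1] == key]
            out = [o for o in out if o[1] != key]
            # insert+delete cancels; update+delete keeps only the delete
            if any(v == "insert" for v, _ in pending):
                continue
            out.append((verb, key))
        else:
            out.append((verb, key))
    # merge runs of consecutive inserts into one batched op
    merged, batch = [], []
    for verb, key in out:
        if verb == "insert":
            batch.append(key)
        else:
            if batch:
                merged.append(("insert", tuple(batch)))
                batch = []
            merged.append((verb, key))
    if batch:
        merged.append(("insert", tuple(batch)))
    return merged
```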

------
est
offtopic: is there any direct SQL-to-binlog async translator?

