
Testing a Distributed System - alanfranzoni
http://queue.acm.org/detail.cfm?ref=rss&id=2800697
======
bitL
From my experience working on a distributed enterprise messaging server
(one of the top sellers, with some super advanced stuff like automatic
failover, a client opaque to errors, TX and HA), the conclusion is that
whatever can go wrong will go wrong in a distributed system. My favorites were
distributed deadlocks; losing messages with certain cluster configurations
under specific conditions was a close second. If you support multiple OSes,
they are all very different at the very bottom level, and sometimes socket
states and their transitions are just plain weird (and undocumented). Fun is
also caused by the "observer effect": if you try to maximize throughput and
something goes wrong, the problem disappears once you enable logging (and of
course throughput goes way down). Transaction processing is also yummy: 2PC
obviously can't be recovered in all possible failure cases, 3PC is sensitive
to network partitioning, checkpointing might run into local storage problems,
etc. We had our own distributed testing suite as well as standard conformance
tests, yet these were badly insufficient and were more like firing off a
shotgun and hoping it would help.

------
maramono
>> A "hybrid" solution may work well for you—group your tests into small sets,
then run the sets sequentially, doing as much of each set in parallel as
possible.

Very true. In fact, the Ortask scheduling engine takes this approach for
automatically planning test cases, but it goes a bit further: it re-orders the
test cases in each set so that they all run in a more optimal way.

Relevant details start at 1:35 here:
http://youtu.be/n_yOU11J7as
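
For anyone who wants to see the quoted "hybrid" idea in code, here is a rough
sketch: sets run one after another, and the tests inside each set run in
parallel. This is only an illustration, not how Ortask does it. The name
`run_hybrid` is made up, each test is assumed to be a plain callable, and
there is no re-ordering step.

```python
from concurrent.futures import ThreadPoolExecutor


def run_hybrid(test_sets, max_workers=4):
    """Run sets sequentially; run the tests inside each set in parallel.

    test_sets is a list of lists of zero-argument callables (hypothetical
    stand-ins for real test cases). Returns one list of results per set.
    """
    results = []
    for tests in test_sets:
        # A fresh pool per set: leaving the with-block waits for every test
        # in this set to finish before the next set starts.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # map() yields results in input order, not completion order.
            results.append(list(pool.map(lambda test: test(), tests)))
    return results
```

Tests that share state (same port, same data directory) would go into
different sets, so they never run at the same time.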

------
tlarkworthy
The iptables trick Aphyr uses is great for testing unclean networking
failures. Closing network connections uncleanly is not something that is
usually supported at the language level.
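
A rough sketch of that kind of trick, driven from a Python test harness. I am
not claiming this is exactly what Aphyr runs; the helper names `drop_rule`,
`partition` and `heal` are made up, and it assumes Linux with root. A `DROP`
rule discards packets silently, so the peer never sees a FIN or RST.

```python
import subprocess


def drop_rule(peer_ip, action="-A"):
    """Build the iptables argv that drops all inbound packets from peer_ip.

    action is "-A" to append the rule or "-D" to delete it again.
    """
    return ["iptables", action, "INPUT", "-s", peer_ip, "-j", "DROP"]


def partition(peer_ip, run=subprocess.check_call):
    """Cut this node off from peer_ip (needs root when run for real)."""
    run(drop_rule(peer_ip, "-A"))


def heal(peer_ip, run=subprocess.check_call):
    """Remove the rule added by partition()."""
    run(drop_rule(peer_ip, "-D"))
```

The `run` parameter is only there so the command can be inspected or logged
without actually touching the firewall.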

------
amelius
Let's say you have triggered an error condition. How would you
deterministically replay the events that led to that situation?

