
Sometimes Kill -9 Isn't Enough - tylertreat
http://www.bravenewgeek.com/sometimes-kill-9-isnt-enough/
======
rdtsc
Was going to make a pun on the title "... because uninterruptible sleep is a
bitch", but the article doesn't actually cover that.

Going back to the topic, there are great points there. I remember discovering
"tc qdisc" and playing with it. Really nice tool.

But another thing to learn, perhaps, is to try to avoid the gray zone by going
to either the "black zone" = dead, or the "white zone" = working fine. That is,
if a node/process/VM/disk starts showing signs of failure above a threshold,
something else should kill, disable, or restart it.

Think of it as trying to go to stable known states: "Machine is up, running,
serving data, etc.", "Machine is taken offline". If you can, try to avoid in-
between "gray states" -- "Some processes are working, some are not", "swap is
full and running out of memory, the OOM killer is going to town, some services
kinda work", and so on. There are just too many degrees of freedom, and it is
hard to test against all of them. Obviously some things like network issues
cannot be fixed with a simple restart, so those have to be tested.
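For anyone who hasn't played with it, a minimal "tc qdisc" netem sketch looks
like this (assumes an `eth0` interface and root; the device name and numbers
are just examples):

```shell
# Add 100ms latency and 1% packet loss to all egress traffic on eth0
tc qdisc add dev eth0 root netem delay 100ms loss 1%

# Inspect what's currently installed
tc qdisc show dev eth0

# Restore the default qdisc when done
tc qdisc del dev eth0 root
```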

~~~
Rapzid
I totally thought this was going to talk about uninterruptible process
states. Like the dreaded D. D is for "your reboot will fail, hope you have
iLO".

~~~
anon4
Oh man, I hate that.

I've dreamed of patching the kernel and writing two utilities - twim
(terminate without mercy) and uwep (unmount with extreme prejudice) that
simply remove a process along with all threads, or destroy a mountpoint and
drop all associated resources (all filehandles become closed, etc.). Lack of
time has mostly stopped me from attempting it, and I'm quite sure it won't be
at all trivial.

~~~
Rapzid
Yeah.. Not sure if the root cause was ever determined but at my previous job
we had issues with Xen guests shutting down but the blkback device would go D
and never quit. This would prevent the VM from starting because the LV was
busy. lvm commands would freeze. And of course the system would end up needing
a hard reboot because the lvm teardown scripts would not complete on shutdown
due to the busy device. Good times :|

------
artursapek
"Comcast" is pretty hilarious.
[https://github.com/tylertreat/Comcast](https://github.com/tylertreat/Comcast)

~~~
masklinn
Note that OS X has an Apple-provided Network Link Conditioner to configure
bandwidth/delay/drop. Even better, it's built into iOS devices set up for
development.

------
ReidZB
If you'd like to simulate network crappiness on OS X, you can use the Network
Link Conditioner from Apple themselves: [http://nshipster.com/network-link-
conditioner/](http://nshipster.com/network-link-conditioner/)

I was very impressed with its feature-set (for what it is). On our team, we
use it to see how our iOS app will react to severe network problems (via
testing in the simulator, mostly, though it's also available on iOS devices
themselves as explained in the above article).

------
peterwwillis
This is the "I don't know how my network works, so let's throw a wrench into
the works and see what happens, fix it, rinse, repeat" form of network and
systems engineering. It's certainly useful at various points in tuning
performance, but it doesn't replace actually designing your system to resist
these problems to begin with.

Even if you introduce these network performance issues, the results are
meaningless if you don't have instrumentation ready to capture metrics on the
results throughout the network/systems. Everyone wants to write about what
happened when they partitioned their network. But you notice how nobody writes
about the netflows, the taps, the service monitors, the interface stats, the
app performance stats, the query run times, host connection state stats,
miscellaneous network error stats, transaction benchmark stats, and hundreds
of other data sources that are required to analyze the resulting network
congestion.

To me it's much more vital that I can correlate events to track down an issue
in real-time. You will never be able to identify all possible failure types by
making random things fail, but you can improve the process by which you
identify a random problem and fix it quickly.

------
GuiA
kill -9, no more CPU time

[https://m.youtube.com/watch?v=Fow7iUaKrq4](https://m.youtube.com/watch?v=Fow7iUaKrq4)

------
eosis
You have to be careful using iptables drop rules in the OUTPUT chain, as
this manifests itself (at least on our systems) as failed send socket calls
(which are often retried by the application), rather than true packet loss.
Netem tends to work as expected.
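A sketch of the difference (the port, device name, and probabilities here are
placeholders; both commands require root):

```shell
# iptables in the OUTPUT chain: locally generated packets are dropped
# before they ever leave the stack, so the application can see send()
# failures rather than what looks like loss on the wire
iptables -A OUTPUT -p tcp --dport 80 \
    -m statistic --mode random --probability 0.1 -j DROP

# netem: packets are dropped after queuing, which looks like genuine
# network loss to the application (timeouts, retransmits)
tc qdisc add dev eth0 root netem loss 10%

# Clean up both
iptables -D OUTPUT -p tcp --dport 80 \
    -m statistic --mode random --probability 0.1 -j DROP
tc qdisc del dev eth0 root
```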

------
zorbo
This focuses mostly on simulating unreliable networking. Is there a tool,
perhaps some LD_PRELOAD wrapper, that can simulate unreliable everything? I'm
talking memory errors, disks going away, fake high I/O load, etc.?

I once wrote a library for python that injected itself into the main modules
(os, sys, etc) and generated random failures all over the place. It worked
very well for writing reliable applications, but it only worked for pure
python code. I don't own the code, so I can't open source it unfortunately.

~~~
groks
[http://cwrap.org/](http://cwrap.org/)

------
tlarkworthy
I recognise those commands ...

[http://stackoverflow.com/questions/614795/simulate-delayed-a...](http://stackoverflow.com/questions/614795/simulate-delayed-and-dropped-packets-on-linux)

I am still trying to work out how not to knobble my DB connection when
simulating client errors on a single dev machine.

~~~
jaggederest
You can specify rules for a single port, I believe, using ipfw or equivalent.
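On Linux, one way to scope netem to a single port is a prio qdisc plus a u32
filter, so traffic on other ports (e.g. your DB connection) is untouched. A
sketch, assuming `eth0` and port 8080 as the target (both are examples):

```shell
# Root prio qdisc with the default three bands
tc qdisc add dev eth0 root handle 1: prio

# Attach netem only to band 3 (flowid 1:3)
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 200ms loss 5%

# Steer only traffic destined for TCP/UDP port 8080 into the impaired
# band; everything else (e.g. a DB on 5432) takes the normal path
tc filter add dev eth0 parent 1:0 protocol ip u32 \
    match ip dport 8080 0xffff flowid 1:3
```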

------
noonespecial
Brings back horrible memories of writing tc scripts to simulate VSAT and rural
dsl back in the bad old days. We bundled them up on a Soekris box and called
it the "DSLow" (as in DSL-oh) box.

------
mu_killnine
I find this article offensive ;)

