
The Discovery of Apache ZooKeeper’s Poison Packet - gregone
http://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/
======
mhw
What strikes me about this investigation is that it clearly demonstrates the
value of having access to the source code for your full stack, and also
openness in the specification of how the protocols work and how everything
fits together. Count the number of different software components from separate
organisations that this investigation required information from:

* Apache ZooKeeper

* The Apache Curator ZooKeeper client library, originally from Netflix

* The Linux kernel

* Arguably the authors of RFC 3948, which specifies the protocol and expected behaviour of the networking components

* The Xen hypervisor

Now imagine that each of these components was closed, proprietary code from a
separate organisation and you had to rely on support from each supplier to get
to the bottom of the issue. It's unlikely that the customer would be able to
successfully identify the issues without access to the source code. But at the
same time it is unlikely that any individual supplier would be able to
identify the problem as none of them can see the full picture either.

~~~
masklinn
> What strikes me about this investigation is that it clearly demonstrates the
> value of having access to the source code for your full stack […] Now
> imagine that each of these components was closed, proprietary code from a
> separate organisation

Please note that "proprietary" and "access to source code" are not exclusive.
You can get source code access licenses to Windows (specifically for that
kind of situation), for instance, and that's been possible for decades.

~~~
click170
I am interested in learning more about such arrangements, but the fact remains
that for FOSS software the license and freedom to read and change the source
code are there for all from the start, no license negotiations or purchasing
required.

~~~
masklinn
> I am interested in learning more about such arrangements

[http://www.microsoft.com/en-us/sharedsource/default.aspx](http://www.microsoft.com/en-us/sharedsource/default.aspx)

Basically:

* be a big/huge company (10k seats under Enterprise Agreement or Select + Software Assurance)

* be a big OEM ("Top Volume", 5k internal seats)

* be an MVP in good standing

* be a governmental entity

In all cases there's an NDA to sign and you must be in an "eligible geographic
market".

> the fact remains that for FOSS software the license and freedom to read and
> change the source code are there for all from the start, no license
> negotiations or purchasing required.

Sure, but that's not really my point.

------
bd_at_rivenhill
All kinds of badness here. Bug #2 really reduces the level of comfort I would
have with using ZooKeeper as a tool.

First of all, the default Java policy of terminating the thread, instead of
the process, when a runtime exception is not handled is fully boneheaded, and
the first thing you should always do in a server program is set a default
uncaught exception handler which kills the process. Much better to flame out
spectacularly than to limp along with your fingers crossed hoping for the
best, as this bug amply demonstrates.
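
The fail-fast setup described above can be sketched in a few lines; the class
and method names here are mine for illustration, not ZooKeeper's actual code:

```java
// Install a default handler so an uncaught exception in ANY thread takes
// down the whole process instead of silently killing just that thread.
public class FailFast {
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> {
            System.err.println("Uncaught exception in " + t.getName() + ": " + e);
            // halt() skips shutdown hooks so a wedged hook can't block death;
            // a non-zero exit code lets the supervisor restart the process.
            Runtime.getRuntime().halt(1);
        });
    }
}
```

Calling `FailFast.install()` at startup, before spawning worker threads, is
enough; any thread that dies from an unhandled exception then kills the whole
process, which a supervisor can restart cleanly.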

On the heels of that, there's this: "Unfortunately, that means the heartbeat
mechanisms would continue to run as well, deceiving the followers into
thinking that the leader is healthy." Major rookie mistake here; the heartbeat
should be generated by the same code (e.g. polling loop) which does the actual
work, or should be conditioned on the progress of such work. There's no
indication that ZooKeeper is bad enough to have a separate thread whose only
responsibility is to periodically generate the heartbeat (a shockingly common
implementation choice), but it is clearly not monitoring the health of the
program effectively.
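
One hedged sketch of the "heartbeat conditioned on progress" idea (names are
illustrative, not ZooKeeper's actual design): the heartbeat timer only emits a
beat if the worker has advanced since the last check, so a wedged worker stops
heartbeating automatically.

```java
import java.util.concurrent.atomic.AtomicLong;

// Gate heartbeats on actual work progress: a stuck worker stops beating.
public class ProgressGatedHeartbeat {
    private final AtomicLong progress = new AtomicLong();
    private long lastSeen = 0;

    // Called by the worker's main loop each time it completes real work.
    public void recordProgress() {
        progress.incrementAndGet();
    }

    // Called by the heartbeat timer: beat only if the worker advanced
    // since the previous check.
    public synchronized boolean shouldBeat() {
        long now = progress.get();
        boolean advanced = (now != lastSeen);
        lastSeen = now;
        return advanced;
    }
}
```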

Suffering a kernel level bug is outside the control of a program, but this
demonstrates a lack of diligence or experience in applying the appropriate
safety mechanisms to construct a properly functioning component of a
distributed system.

~~~
acdha
> this demonstrates a lack of diligence or experience in applying the
> appropriate safety mechanisms to construct a properly functioning component
> of a distributed system

That's every experience I've had with ZooKeeper – gratuitous complexity,
complicated setup and deployment and too many failure modes where it silently
stops working without logging anything helpful. For a system like this,
simplicity is a necessity, not a quaint affectation.

~~~
wtbob
> gratuitous complexity, complicated setup and deployment and too many failure
> modes where it silently stops working without logging anything helpful

Well, it _is_ a Java program…

------
fragmede
I'm impressed. Not by the first 90%; that's par for the course (and why I'm a
big believer in open source). No, what impresses me is the sharing of their
solution, a solution which, in their own words, is 'not a proper fix'. The
engineer in me would be embarrassed by the fix, but you can't fault that it
works. There's a workaround, so from a business perspective any further time
spent fixing this could be spent working on new features instead.

Still, I have a bunch of unanswered questions. Why not just upgrade all the
hosts to Xen 4.4? Does recompiling the 3.0+ kernel without the bad 'if' in
/net/ipv4/esp4.c make the problem go away? Does the problem happen if there's
only one VM on a host? Of the seven AES-NI instructions, which one is
faulting? How often does it fault? The final question though, is what causes
it to fault?

~~~
DougBarth
I can answer only some of these questions. :)

Why not upgrade Xen? We run on cloud VMs, so we aren't in control of the
version of Xen that's being used. Patch the kernel? Honestly, we don't have
the expertise to take on that level of effort. We're hiring though. ;)

------
deathanatos
When I read reports like this on HN I am absolutely floored by the level of
detail and quality of the work put into not only writing them, but into
getting to the bottom (well, almost!) of the problem! Fantastic work. How do
you do it? My
server-side team is ~5 engineers (and 1 devops), and we struggle just to keep
up with the incoming feature requests, let alone do work on improving the
infrastructure, and even further, let alone have an engineering blog, or do
this kind of research or work. Is there a good way to foster the culture that
this is something that should be held as important?

~~~
fragmede
It's easy to see the finished product and wish for that kind of time, but it's
likely the culmination of several months' work by several people just to debug
the problem, followed by the writeup sitting in someone's todo list for
several weeks, with another few weeks of review.

One way to foster this culture is to write blog posts, and close with a "We're
hiring", as Pagerduty has done here.

~~~
DougBarth
It was indeed. Multiple engineers poked at this issue off and on for several
months before we figured it out.

------
hueving
>While checksumming is a great way to detect in-flight corruption, it can also
be used as a tool to detect corruption during the formation of the packet. It
is the latter point that was overlooked, and this optimization has come to
bite us. Lack of validation here let our mystery corruption through the gate –
giving ZooKeeper bad data which it reasonably believed was protected by TCP.
We claim this is a bug – intentional or not.

This would only be a false sense of security. It only tells you that there was
a bug in the formation of the packet _after_ the checksum was calculated. If
your failure case assumes a system can't put together a packet, how can you
assume that it even makes it to the checksum calculation step correctly?

Edit: There is also another downside to enabling TCP checksumming after
decryption. It eliminates the possibility of hardware TCP checksum offloading
so you would be paying the performance cost of software checksumming every
packet. This is why the RFC was written that way to begin with...

------
userbinator
AES-NI instructions need to use the XMM registers. My guess is that someone
forgot they have to be saved/restored when AES-NI has been used.

There have been a few Xen bugs around saving/restoring state which lead to
information disclosure (one VM can read register values from another), but
another manifestation of this type of bug is corrupted registers.

~~~
tedunangst
Yeah. Of the bugs identified, this sounds like the most serious, but received
the least attention.

------
twic
There are a number of alternatives to ZooKeeper (etcd, Consul, etc). There are
a number of systems which specifically require ZooKeeper (Kafka springs to
mind).

How plausible would it be to replace hard dependencies on ZooKeeper with a
dependency which could be fulfilled by any of the alternatives?

For example, could there be a standard protocol? Could someone implement the
ZooKeeper protocol on top of Consul? Could we define a local client API with
pluggable implementations for each alternative (like ODBC etc)?

Or are the semantics of the 'alternatives' just too different?

~~~
andrioni
Oh, I'd love to have something similar, running etcd (for Kubernetes) and
ZooKeeper (for the rest of the services) can be a PITA, and it feels like an
unnecessary duplication of infrastructure.

------
tmd83
A fantastic investigation. These are always fun to read, but also scary in how
bugs at so many layers of the stack combined to produce a problem. You might
end up being the only one affected by something, and if you don't have the
technical expertise to figure it out, you're in big trouble.

------
Ygor
Zombie ZooKeeper nodes that appear as healthy members of the cluster after an
OOM is something that can cause major problems.

There are two quick solutions on the operational side that can be deployed to
prevent this:

* Run each zk server node with the JVM OnOutOfMemoryError flag, e.g. like
this: -XX:OnOutOfMemoryError="kill -9 %p"

* Have your monitoring detect an OOM in the zookeeper.out log, and use your
supervisor to restart the failing ZK node.

ZooKeeper is designed to be fail-fast, and any OOM should cause an immediate
process shutdown, of course followed by an automatic start of a new process by
whatever is supervising it.

------
thaumasiotes
> Bug #3 – Obscure Behavior

> We claim this is a bug – intentional or not.

I like seeing this described as a bug. Bug #3 is documented behavior (there's
a comment right there in the source calling it out), and it does what it's
intended to do, and the relevant RFC says that it should be doing that. It's
only a "bug" in the sense that the behavior is nevertheless metaphysically
incorrect.

I once wrote Java code doing bit shifts of arbitrary length as part of a
programming contest at my school. It failed on a single test case for
mysterious reasons. I eventually discovered that my algorithm was totally
correct -- but I had been assuming that if you shift a 32-bit integer by 32 or
more bits, you'll get zero. In fact, and this is _required_ by the Java
specification, you don't. More specifically, the Java standard mandates that
"a << b" be silently compiled to "a << (b % 32)". So I had to rewrite code
like this:

    bit_pattern = bit_pattern << potentially_large_value;

into something more like this:

    // if you shift by x, Java will silently rewrite it as a shift by (x % 32)
    while( potentially_large_value > 31 ) {
      bit_pattern = bit_pattern << 31;
      potentially_large_value -= 31;
    }
    bit_pattern = bit_pattern << potentially_large_value;

I can imagine no circumstance where this would be useful or helpful in any
way. Even later, I found out that the JVM bytecode for bit shifting uses a
5-bit operand, and I surmise that the standard is written the way it is to
make it slightly (slightly!) simpler to compile a Java shift statement into a
JVM bytecode instruction. I can't call this a bug in javac -- it's doing
exactly what it's required to do! But the only way it makes sense to me to
describe this is as a bug in the Java specification. If I say to do something
45 times, that should be the same thing as doing it 15 times, and then another
15 times, and then 15 more times. It shouldn't instead be the same as doing it
13 times.

~~~
taejo
The x86 shift instructions do the same, so I'd guess that's why the JVM does
it that way.

~~~
thaumasiotes
Sure, the JVM can use a 5-bit value for doing shifts on 32-bit integers,
there's no problem there. But why would Java compile "x << 37" to "x << 5"? If
you're working raw in the JVM, there's no such concept as x << 37, because 37
requires more than 5 bits to specify. Java explicitly allows you to say x <<
37, but secretly defines it to do something insane.

~~~
aftbit
x % 32 is equivalent to x & 31 (for non-negative x).

So if they use a 5-bit number, they can just stick your second operand into
that slot after it's been ANDed with 31 (which might even be implicit in how
the x86 shift instructions work). To do otherwise would require them to branch
(just like you did in your solution).
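
The masking behaviour is easy to demonstrate with a quick check (this is just
a sketch of the rule the thread describes: int shifts use only the low five
bits of the count):

```java
public class ShiftMask {
    public static void main(String[] args) {
        int x = 1;
        // For int shifts, only the low 5 bits of the count are used,
        // so x << 37 is really x << (37 & 31), i.e. x << 5.
        System.out.println((x << 37) == (x << 5));  // prints "true"
        System.out.println(x << 37);                // prints "32", not 0
    }
}
```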

~~~
thaumasiotes
So what?

------
karmakaze
TL;DR 4 bugs: 2 in ZK, 2 kernel. IPSec/NAT-T, AES-NI, Xen PV/HVM. A fun (7
scroll) read.

~~~
hwh
I'm not sure it can be said that there was a bug _in_ IPSec/NAT-T. Caused by
IPSec/NAT-T, maybe, but by documented and specified behaviour. As for
aesni-intel (the module/implementation), I'm also not sure. Maybe that would
have to be attributed to Xen - for being a bit careless. Agreed, an
interesting read!

~~~
nl
The IPSec/NAT-T behaviour was documented and specified, true. It is also
_wrong_ for the reasons pointed out in the post.

------
jamesblonde
There are alternatives to ZooKeeper. Here's our work on using a NewSQL DB
(NDB) to do leader election. Instead of using ZAB as a replication protocol,
it uses transactions with low TransactionInactive timeouts (5 seconds),
placing a ~5 second upper bound on leader election time. The upper bound
increases dynamically if the number of nodes is >100.
[http://www.jimdowling.info/sites/default/files/leader_electi...](http://www.jimdowling.info/sites/default/files/leader_election_using_newsql_db.pdf)

~~~
nemothekid
While there are a number of ZK alternatives (although none quite as mature or
battle tested), I've found it can get pretty difficult to eschew the ZK
requirement when using open source software.

Simply put, a large number of projects depend on ZooKeeper (and, not
surprisingly, a large number of Apache projects like Hadoop, Kafka, Storm,
Spark, and Mesos).

------
noonespecial
Whenever I hear of deep mysterious bugs found in critical encryption systems,
the conspiracy theorist in me can't help but wonder...

Did they stumble upon a very well hidden back door? This might not be the last
we hear of this.

------
an_account_name
Is there a CVE for this? This seems like it probably has security
implications.

~~~
tmd83
You mean the checksum bit, right? Letting a corrupted packet pass through?
Funny that it was a corrupted length value that caused the problem, which I
seem to recall is one of the big attack vectors when it's intentional.

~~~
hueving
The checksum thing is not a vulnerability. If you can't trust the remote host
to put the correct things in a packet, a checksum will not help you.

~~~
tmd83
I was wondering for a moment if a MITM was possible at some step. But it also
happens for encrypted packets, right? So maybe MITM is not possible in that
case. I don't really know the steps involved in creating and encrypting
packets well enough to judge the possibility.

------
DonHopkins
Isn't all that addressed in Zookeeper 2: Zookeepier?

[https://www.youtube.com/watch?v=_F-RyuDLR4o](https://www.youtube.com/watch?v=_F-RyuDLR4o)

