
Night of a cascading failure - ingve
https://rachelbythebay.com/w/2019/01/20/quiet/
======
aftbit
The most interesting thing here is that this took down the bastion boxes that
allowed PROD SSH access. I would argue that was the biggest red flag by far.
As the author points out, if you can't access the machines somehow, you are
pretty screwed. Great job that you managed to get back in anyway, but really
the production access mechanism should be as dumb, simple, and standalone as
you can make it.

We use multiple bastion hosts with user accounts provisioned by Ansible. There
is one in each data center, plus a "shell" box at AWS that lives outside the
firewall and also has SSH access to everything we need. All the bastions can
access every server.

Plus, some of the mission-critical boxes actually have SSH exposed to the
public internet. I would much rather take the security risk with (key-only)
OpenSSH than the risk that I get locked out of PROD when I need to fix things
after a 2am page.

It is entirely possible to make a box so secure that you yourself cannot
access it, while not actually doing much to defend against more "reasonable"
security risks, like typo-squatting and spear-phishing. It's all about the
threat model.

~~~
RyanShook
Agreed, I wish the author had explained why the authentication system was
tied so directly to this bug. Seems like the biggest lesson would have been
to separate the two.

~~~
hnzix
TFA did explain it. The boxen were authenticating against an LDAP server in
prod, and prod was dead. This was obviously a huge design flaw and the author
identified it as such.

------
CJefferson
I feel like doing arithmetic with unsigned integers is like doing all your
coding 2 feet from a cliff, in return for more land about 2^63 miles away.
It's fine as long as you never make the tiniest step wrong, and it's very
unlikely you'll ever need that other land (and it's usually easy to design
code not to need it).

This doesn't apply if you are using unsigned for bit twiddling, but then you
shouldn't be using minus anyway.

~~~
ChrisSD
The irony here is that the CPU will set the carry flag on underflow, making
this trivial to detect in ASM with the `jb` instruction. The equivalent C++
would be something like:

    if (new_len > child.length())

As stated in the article, checking if an unsigned is less than 0 is a silly
logic error but obviously one that is understandable to make. Still, it's the
sort of thing you'd expect automated tools to be able to catch.
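
For anyone who wants to poke at this, here is a minimal C++ sketch of that
guard next to the unchecked subtraction. The names `checked_sub`,
`wrapping_sub`, and `demo` are made up for illustration and are not from the
article; plain `std::size_t` values stand in for `new_len` and
`child.length()`.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper: computes len - cut only when it cannot wrap.
// Returns false (and leaves `out` untouched) when cut > len.
bool checked_sub(std::size_t len, std::size_t cut, std::size_t& out) {
    if (cut > len) {          // same shape as `if (new_len > child.length())`
        return false;
    }
    out = len - cut;
    return true;
}

// The unchecked version: unsigned subtraction wraps modulo 2^N
// instead of going negative.
std::size_t wrapping_sub(std::size_t len, std::size_t cut) {
    return len - cut;
}

// Small self-check showing both behaviours side by side.
void demo() {
    std::size_t out = 0;
    assert(checked_sub(5, 3, out) && out == 2);
    assert(!checked_sub(3, 5, out));
    // 3 - 5 wraps to the second-largest representable value.
    assert(wrapping_sub(3, 5) == std::numeric_limits<std::size_t>::max() - 1);
}
```

The point of the sketch is only that the comparison has to happen before the
subtraction; once the value has wrapped, it just looks like a very large
length.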

~~~
AstralStorm
The problem is that it was written as <= 0. The compiler could complain "why
not just == 0, since it is unsigned?" but that's probably enabled at
-Wpedantic or such level.
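
A small sketch of why the <= 0 form hides the bug, assuming the check ran on
the result of an unsigned subtraction. The function names are hypothetical
and this is not the code from the article.

```cpp
#include <cassert>
#include <cstddef>

// Buggy shape: subtract first, then test the result.
bool exhausted_buggy(std::size_t have, std::size_t want) {
    std::size_t remaining = have - want;  // wraps when want > have
    return remaining <= 0;                // for unsigned, same as == 0
}

// Fixed shape: compare the operands before any subtraction happens.
bool exhausted_fixed(std::size_t have, std::size_t want) {
    return want >= have;
}

// Small self-check: the two versions agree except when want > have.
void demo_compare() {
    assert(exhausted_buggy(4, 4) == exhausted_fixed(4, 4));
    assert(exhausted_buggy(5, 4) == exhausted_fixed(5, 4));
    assert(exhausted_buggy(4, 5) != exhausted_fixed(4, 5));
}
```

As far as I know, GCC's -Wtype-limits (pulled in by -Wextra) flags a plain
`unsigned < 0` comparison but says nothing about `<= 0`, since that one is
not always false.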

------
arama471
I'm surprised compiler warnings haven't come up yet in this discussion.
Checking if an unsigned number is smaller than zero typically triggers one,
and code style tools can be set to treat any such check as a compilation
error/not allow code with such a check to be checked in.

------
Dzugaru
This is why I never ever use size_t over int. I’m a newbie C++ programmer tho,
wrote less than 10kloc of C++ code, but both my feet are bleeding because of
exactly this.

~~~
mikeash
Be careful, int is usually 32-bit even on 64-bit systems. Using int all over
is a good way to experience mysterious failures when dealing with chunks of
data larger than 2GB, which is definitely a real possibility these days.
Consider ssize_t, intptr_t, or int64_t any time you’re not absolutely sure
you’ll never exceed 2^31-1.
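
A rough sketch of the kind of check I mean, assuming a platform where int is
32 bits. `total_bytes`, `fits_in_int`, and `demo_sizes` are hypothetical
names for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Do the size arithmetic in a 64-bit signed type.
std::int64_t total_bytes(std::int64_t chunks, std::int64_t chunk_size) {
    return chunks * chunk_size;
}

// Hypothetical helper: would this byte count survive being stored in an int?
bool fits_in_int(std::int64_t n) {
    return n >= std::numeric_limits<int>::min() &&
           n <= std::numeric_limits<int>::max();
}

// Small self-check: three 1 GiB chunks is past 2^31-1.
void demo_sizes() {
    const std::int64_t gib = std::int64_t{1} << 30;
    assert(total_bytes(3, gib) == 3 * gib);
    assert(fits_in_int(1000));
    // Only meaningful where int is 32 bits, which is the assumption here.
    assert(sizeof(int) > 4 || !fits_in_int(total_bytes(3, gib)));
}
```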

~~~
marcosdumay
Int is only 32 bits on 64-bit Windows, which for some reason didn't go along
with the times. Every other mainstream 64-bit environment has 64-bit ints,
even on ARM.

But it's only guaranteed to be 16 bits anyway, the same guarantee you have for
size_t.

~~~
tedunangst
That's long, not int. Approximately nobody has 64 bit int.
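
If you want to check your own toolchain, something like this sketch works;
the helper names are hypothetical. The standard only guarantees minimum
widths (16 bits for int, 32 for long, 64 for long long). LP64 systems such as
64-bit Linux and macOS use a 32-bit int and a 64-bit long, while 64-bit
Windows (LLP64) keeps long at 32 bits too.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Width in bits of each type on the platform this is compiled for.
constexpr std::size_t int_bits()       { return sizeof(int) * CHAR_BIT; }
constexpr std::size_t long_bits()      { return sizeof(long) * CHAR_BIT; }
constexpr std::size_t long_long_bits() { return sizeof(long long) * CHAR_BIT; }

// These hold on every conforming implementation.
static_assert(int_bits() >= 16, "int is at least 16 bits");
static_assert(long_bits() >= 32, "long is at least 32 bits");
static_assert(long_long_bits() >= 64, "long long is at least 64 bits");

// Small self-check: widths never shrink going from int to long to long long.
void demo_widths() {
    assert(int_bits() <= long_bits());
    assert(long_bits() <= long_long_bits());
}
```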

~~~
nwmcsween
That's a good approximation :)

------
xkcd-sucks
should have done it as a postgres ltree haha

