
Docker was unavailable in Ubuntu/Debian repos - pi-squared
https://github.com/docker/docker/issues/23203
======
shykes
Hi, I work at Docker. Here is my reply on the github thread:
[https://github.com/docker/docker/issues/23203#issuecomment-2...](https://github.com/docker/docker/issues/23203#issuecomment-223326996)

I am copying it below:

<<< Hi everyone. I work at Docker.

First, my apologies for the outage. I consider our package infrastructure
to be critical infrastructure, both for the free and commercial versions of
Docker. It's true that we offer better support for the commercial version
(it's one of its features), but that should not apply to fundamental things
like being able to download your packages.

The team is working on the issue and will continue to give updates here. We
are taking this seriously.

Some of you pointed out that the response time and use of communication
channels seem inadequate; for example, the @dockerstatus bot had not
mentioned the issue when it was detected. I share that opinion, but I don't
know the full story yet; the post-mortem will tell us for sure what went
wrong. At the moment the team is focusing on fixing the issue and I don't
want to distract them from that.

Once the post-mortem identifies what went wrong, we will take appropriate
corrective action. I suspect part of it will be better coordination between
core engineers and infrastructure engineers (2 distinct groups within Docker).

Thanks and sorry again for the inconvenience. >>>

~~~
falsedan

> At the moment the team is focusing on fixing the issue
> and I don't want to distract them from that.

That might be ok for feature teams, but for infrastructure tools/services,
it's very frustrating for users (devs) to be kept in the dark on the progress
of the fix.

At work, the incident response starts with identifying Investigators (to find
and fix the problem) and a Communicator (to update channel topics, send the
outage email & periodic updates, field first-line questions about the
incident, and to contact those most affected by the incident so they don't get
surprised/try to fix it themselves). The person who starts the incident is the
Coordinator, who assigns the roles, escalates if more help is needed, tries to
unblock investigations, and turns facts from the investigators into status
updates for the communicator.

~~~
beachstartup
i will provide an opposing viewpoint which i'm sure many people do not agree
with.

if a service i use is down, all i want is an acknowledgement and that "we are
working on it right now with high priority". i want all available resources to
be fixing the problem.

my anxiety over powerlessness in relying on others during a crisis manifests
in other ways, like figuring out why i'm at the mercy of this thing in the
first place, and putting alternatives in place.

but during the crisis i'll just go do something else for an hour and then read
the post mortem when it comes out.

~~~
robryk
Updates that are communicated are not the details of what's happening in the
investigation, but things that are expected to be useful to users/clients.
They are things such as the estimated time for problem resolution, updates on
the scope of the problem (e.g. "this is an instrumentation problem" vs "this
is an actual outage") or mitigation steps that could be applied by the
clients. It is often very useful to know such things during an outage.

~~~
beachstartup
i don't think anyone, anywhere would have given you an accurate estimated
figure of nearly 5 hours to fix this problem.

furthermore, even if you somehow could divine the future, telling the customer
that you think an outage will last over half a working day is going to turn an
extremely shitty situation into something even worse.

------
tsuresh
From the GitHub issue thread, I see a lot of people angry about their
production deployments failing. If you point directly at an external repo in
your production environment deployments, you had better not be surprised
when it goes down. Because shit always happens.

If you want your deployments to be independent of the outside world, design
them that way!
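
For apt specifically, that can be as simple as pointing production hosts at
an internal mirror you control instead of at the upstream repo. A minimal
sketch (the hostname and file name are hypothetical):

    # /etc/apt/sources.list.d/docker.list
    # Point at an internal mirror under your control,
    # not at apt.dockerproject.org directly.
    deb https://apt-mirror.internal.example.com/docker ubuntu-trusty main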

~~~
karterk
Maybe this is the norm in big enterprises, but I have not actually come
across any company that hosts a local package repository for commonly
available packages.

~~~
TillE
You don't need a seamless, robust process for dealing with the occasional
remote failure (especially when there are mirrors), but you can, for
example, save snapshots of dependencies.

You should be able to do _something_ in an emergency, even if it requires
manual intervention. If you can only shrug and wait, that's bad.
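
One way to get such snapshots for an apt repo is a tool like aptly, which
can mirror an upstream repository and publish an immutable snapshot of it.
A rough sketch (the mirror and snapshot names are made up):

    # Mirror the upstream repo, then freeze what was fetched as a snapshot.
    aptly mirror create docker https://apt.dockerproject.org/repo ubuntu-trusty main
    aptly mirror update docker
    aptly snapshot create docker-2016-06-02 from mirror docker
    # Publish the snapshot; clients keep working even if upstream breaks.
    aptly publish snapshot docker-2016-06-02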

~~~
toomuchtodo
> If you can only shrug and wait, that's bad.

Welcome to cloud computing!

------
justinsaccount
Title is misleading. The 'apt.dockerproject.org' host had a broken Release
file. This is not an Ubuntu- or Debian-maintained repository.

~~~
vox_mollis
Broken, or compromised?

~~~
cjbprime
Probably broken, since a competent attacker would have been able to avoid
creating a checksum mismatch.

My company has actually done the same thing before (same error) by putting
CloudFront in front of our APT repo -- it cached the main Packages file
inappropriately, causing the checksum mismatch.
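
For the curious, the client-side symptom in that situation looks roughly
like this (URL illustrative), and clearing apt's cached indexes after the
origin is fixed forces a clean re-fetch:

    # Symptom at `apt-get update` time:
    #   W: Failed to fetch .../dists/ubuntu-trusty/main/binary-amd64/Packages
    #      Hash Sum mismatch
    # Workaround once the origin (or CDN) is serving consistent files again:
    rm -rf /var/lib/apt/lists/*
    apt-get update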

------
poooogles
> Does this mean that Docker -- a major infrastructure company -- does not
> have any on-call engineers available to fix this?

It appears to be that way. Reminds me of when all of the reddit admins were
stuck on a plane on the way back from a wedding [1].

Remember kids, improve your bus factor.

[http://highscalability.com/blog/2013/8/26/reddit-lessons-lea...](http://highscalability.com/blog/2013/8/26/reddit-lessons-learned-from-mistakes-made-scaling-to-1-billi.html)

~~~
oldmanhorton
There was a comment a bit below suggesting that those who paid for
commercial support got it 24/7, but if that's true, I'd imagine the fix that
commercial support gave to paying customers would have fixed it for everyone
else too...

~~~
shykes
Disclaimer: I work at Docker.

I believe commercial releases are downloaded from a separate infrastructure
(to be confirmed).

Either way, the availability of Docker packages, free or commercial, is
critical infrastructure and we should treat it as such. IMO our primary
infrastructure team should have been involved, and someone should be on call
for this. We'll do a post-mortem, find the root cause, and take corrective
action as needed.

Apologies for the inconvenience.

~~~
voltagex_
You may want to lock that GitHub thread soon, it's getting argumentative and
not very helpful.

~~~
shykes
> You may want to lock that GitHub thread soon, it's getting argumentative and
> not very helpful.

In open-source we call that "thursday" :)

~~~
zjaffee
Whatever happened to community over code?

~~~
shykes
Part of a healthy community is accepting that people disagree a lot, have
different values, and communicate their ideas in very different ways.

Where we draw the line is if people are being intimidated, bullied, insulted,
or anything that even remotely resembles harassment.

Although I personally feel that some of the comments in that thread are pretty
unfair and poorly informed, they don't seem to violate the social contract.

------
0x0
It's scary how most people in that thread seem more concerned with forcing
the installation through than with pausing to consider why the hashes might
be wrong and why it might not be a good idea to install debs with incorrect
hashes.

If the apt repo was compromised (but the signing keys were not), this is very
likely exactly the symptom that would appear.

~~~
cjbprime
> If the apt repo was compromised (but the signing keys were not), this is
> very likely exactly the symptom that would appear.

I don't think that's correct. It would pass a checksum test and fail a
signature test with a "W: GPG Error". The checksum test is not about
cryptographic security, it's just about files referenced by the Packages file
having the same hash that the Packages file declares them to have. You don't
need any signing keys to make that happen.
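
To make the distinction concrete, here is a rough sketch of checking each
link of the chain by hand (file names as they appear inside a Debian-style
repo; apt normally does all of this for you):

    # 1. Signature: the detached Release.gpg signs the Release file.
    gpg --verify Release.gpg Release
    # 2. Checksums: Release lists a hash for each Packages index. Compare:
    sha256sum main/binary-amd64/Packages
    grep 'main/binary-amd64/Packages$' Release
    # 3. Packages in turn lists a SHA256 for every .deb it references.
    #    A mismatch at step 2 or 3 is what broke here, not step 1.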

~~~
0x0
What's more suspicious: Bad hashes or bad signatures? What would an attacker
choose if their goal was to get as many people as possible to force install?

~~~
cjbprime
It's impossible to force install the packages when they have bad hashes
(hence the severe breakage here), and it is possible to install the packages
when they have bad signatures if you didn't import the gpg key or you run
with signature checking disabled.

So I'd guess a rational attacker would choose a bad signature. But attackers
can be irrational; it doesn't prove it's not an attack. Just not my intuition.
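
In apt terms: a bad signature can be overridden with a real (if dangerous)
flag, while a bad checksum stops things one step earlier, at `apt-get
update`. A sketch (package name as it was at the time):

    # Bad signature: apt warns, but can be told to proceed anyway.
    apt-get install --allow-unauthenticated docker-engine
    # Bad checksum: the Packages index itself is rejected during
    # `apt-get update`, so there is no usable index to install from.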

~~~
0x0
That's interesting; I'm assuming you're talking about apt now. I don't
think dpkg checks signatures if you install straight from a .deb. :)

~~~
jwilk
It doesn't, mostly because there are no signatures in a deb. :)

------
mapleoin
This is a really bad title. There is nothing wrong with either Ubuntu's or
Debian's repositories. The problem is with Docker's repositories of
Ubuntu/Debian packages.

------
brazzledazzle
I'm a bit disappointed that people are willing to make public criticisms of
Docker when it's their builds that are failing. They made the decision to
depend on a resource that could be unavailable for a large number of reasons
entirely unrelated to Docker or their infrastructure.

Just like the Node builds that failed, this should cause you to rethink how
you mirror or cache remote resources, not prompt you to complain about your
broken builds on a GitHub issue page. There may be things you'll never be
able to fully mirror or cache (or it could just be entirely impractical),
but an apt repository is definitely not one of them.

~~~
willejs
+1 !

------
mschuster91
... which is why the clever sysop mirrors his packages and tests if an update
goes OK before updating the mirror.

If you're running more than three machines or regularly (re)deploy VMs, it is
a sign of civilization to use your mirror instead of putting your load on
(often) donated resources.

It's the same stupid attitude of "hey, let's outsource dependency hosting"
that led to the left-pad npm disaster and will lead to countless more such
disasters in the future.

People, mirror your dependencies locally, archive their old versions, and
always test what happens if the outside Internet breaks down. If your
software fails to build when the NOC's uplink goes down, you've screwed up.
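
For apt, a tool like apt-mirror makes the mirroring part straightforward; a
minimal sketch (base path and repo line illustrative):

    # /etc/apt/mirror.list
    set base_path /var/spool/apt-mirror
    deb https://apt.dockerproject.org/repo ubuntu-trusty main

    # Run periodically (e.g. from cron), then smoke-test the result on a
    # staging host before promoting the fresh mirror to production use.
    apt-mirror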

------
ajarmst
I often wonder why the community's response to issues with an
open/free/community package is to give the maintainers a strong argument to
discontinue it in favour of a commercial one, or just abandon it altogether.

~~~
ajarmst
"Why I Haven't Fixed Your Issue" \--- [http://www.brycematheson.io/post/why-i-
havent-fixed-your-iss...](http://www.brycematheson.io/post/why-i-havent-fixed-
your-issue/)

------
therealmarv
I think this is a chain of dependencies failing together, especially when
you use Travis CI: 1) apt-get is not flexible enough to ignore that error on
apt-get update; 2) Travis CI has so much external stuff installed that it's
a big, big image with more failure points; 3) the Docker repo failed.

------
perlgeek
Outages or misconfigurations can happen to pretty much any source of
packages you use, be it Debian, PyPI, npm, Bower or Maven repositories, or
source control. Anybody remember left-pad?

So as soon as you depend heavily on external sources, you should start to
think about maintaining your own mirror. Software like Pulp and Nexus is
pretty versatile and gives you a good amount of control over your upstream
sources.

------
smegel
Sometimes paying for RHEL isn't a bad thing.

