
Kraken down 24 hours on system upgrade - ifdefdebug
https://status.kraken.com/
======
JorgeGT
They erased this glorious status update:

> _I fell asleep at my desk. In a dream, the ghost of a butterfly from the
> future whispered to me "the dollar hit 0 bitcoin". I wake up, people around
> me dabbing the blood from their eyes. Progress has been made but not enough.
> I crack my whip -- "we aren't getting to zero like this!". The flogging will
> continue until morale improves. Expect another update after I make my
> rounds. I'm sorry I've let you down._ Jan 12, 03:11 UTC

But not before archive.org saved it for posterity:
[https://web.archive.org/web/20180112034351/https://status.kr...](https://web.archive.org/web/20180112034351/https://status.kraken.com/)

~~~
donquichotte
It appears that we are at the mercy of mentally unstable amateurs.

~~~
eks
Just because the social media guy likes to play social media'ish doesn't mean
the rest of the team is like that.

But definitely not a good moment to make a joke.

~~~
arethuza
Saying things like "ahead of the team passing out" in their latest status
updates doesn't exactly give confidence that they have either good
communications or engineering leadership.

~~~
bhouston
If everyone has been up 24hrs then you will get status updates like that.

I guess it is all hands on deck and there isn't a large enough team to do
rotations of shifts? Or the issue is very complex and you need all your best
guys around to fix it.

Yikes.

As someone who runs a high availability website, I can feel for them, but it
is really bad to be offline this long. I would look to make some changes to
that team or address an understaffing issue or something. This should not happen.
This should have been preventable/foreseeable and a risk mitigation strategy
could have been in place.

~~~
drinchev
From what I suspect, we are talking about some backend issue that they are
handling. Throughout my career I've never seen a "rotating shifts team" that
can handle situations like this. It's usually one guy trying to fix it and a
couple of others hanging in the office with a cup of coffee and giving advice
on what the company should do next or being forced to update the status page.

------
whalesalad
This team is in over their heads. If you don’t have stateless infrastructure
with a big red “oops, this isn’t right, rollback” button – you’re doing it
wrong. Read the reviews of their iPhone app on the App Store. They are not
equipped to be running a financial company.

Bummer because I’m looking to leave Coinbase. Looks like everyone in this
space is nerf-grade.

~~~
nasalgoat
No kidding - their front page is literally black text on a white background.

They don't have a single person who can use Photoshop and put up a proper
maintenance page? They didn't _already_ have a failover maintenance page? This
shit is Web Hosting 101 stuff from decades ago.

This is new levels of amateur idiocy.

~~~
nasalgoat
I get a downvote for speaking the truth? Someone from the Kraken dev team I
assume.

Feel free to explain to me how the current result from
[http://kraken.com/](http://kraken.com/) is at all even remotely professional.

------
gtsteve
For those wondering, Kraken is a Bitcoin exchange.

[https://en.wikipedia.org/wiki/Kraken_(bitcoin_exchange)](https://en.wikipedia.org/wiki/Kraken_\(bitcoin_exchange\))

~~~
bhouston
How much do they have under management or their daily volume, etc?

~~~
miguelrochefort
Billions.

~~~
dx034
Billions for both or just volume? Money in accounts would be interesting as
that's where regulators (or the FBI) will step in at some point if it looks
like potential fraud.

~~~
forthefuture
No one releases money in accounts numbers.

Kraken was doing $711M in daily volume before the maintenance. [1]

Only the top 8 exchanges have cracked a billion in daily volume. [2]

[1]
[https://coinmarketcap.com/exchanges/kraken/](https://coinmarketcap.com/exchanges/kraken/)

[2]
[https://coinmarketcap.com/exchanges/volume/24-hour/all/](https://coinmarketcap.com/exchanges/volume/24-hour/all/)

------
RafiqM
They didn't communicate that there would even be a maintenance to their users.
I use Kraken every day and did not receive any sort of email and there wasn't
a notice on the site that I could see. Maybe they announced on social media,
but that's about it.

Starting an unannounced maintenance in the middle of the day is crazy, even if
it was only for the original 2 hours. Over 24 hrs now.

I don't know if this is due to incompetence or is intentional.

If they announced the site was going down for major upgrade the smart thing
for users to do would be to transfer their crypto to their offline wallets.
Possibly withdraw their fiat also. This is what many do for a major fork of a
single coin.

Separately, there is for sure going to be tons of money lost because Kraken
supports margin trading and the market took a 15% dive right around this time.
I suspect there are at least a few that got a margin call and are screwed.
These people would have been able to manage their positions if the service
didn't have extended and unannounced downtime.

~~~
nasalgoat
It's clearly incompetence. I suspect someone dropped a table they weren't
supposed to and they started scrambling to restore a backup they've never
tested and discovered it wasn't working.

Going to kraken.com tells you all you need to know about these folks, all bad.

------
donquichotte
This is really bad. I hope they can handle everyone withdrawing their funds as
soon as they are back online.

Edit: current status:
[https://status.kraken.com/incidents/nswthr1lyx72](https://status.kraken.com/incidents/nswthr1lyx72)

~~~
bhouston
They are cancelling all pending withdrawal and liquidation requests as well
according to their status updates.

So basically a software upgrade/bug can now cause a bank run? Interesting
times.

------
jatsign
I wrote an open-source CLI utility to query the APIs of exchanges like kraken,
bittrex, bitfinex, etc, and had to basically abandon the kraken portion back
in November. It started to become incredibly unreliable, even back then.

I moved all my coins off of there a few weeks later, because it was becoming
impossible to get a buy or sell order in.

In other words, to users of the site, this isn't surprising. :(

------
bflesch
I wish them the best, it must be very stressful for an update to go this
south. Their founder Jesse reached out to us in the past and I got a good
impression from him.

Maybe they got so rich from the high crypto valuations that most of the senior
team are on their private islands already?

~~~
Tepix
These exchanges must have such a hard time recruiting new staff. They need to
trust their employees with millions or even hundreds of millions of dollars.
And they are growing at an insane rate. Ouch.

------
libx
I had an account with them some months ago, but their system is like shit.
Completely shit. When I was long, the balance showed the opposite, when the
market was against me my position improved and vice-versa. This had all the
bad consequences that one can imagine, as I couldn't close some positions
because supposedly there were no funds for that, when in fact, there were.

They need a system fix, I really don't trust them.

------
crypticjf
As a long time software developer and manager, it is unimaginable to me that
they had not done proper analysis of the risks and had mitigation strategies
in place. The whole future butterfly comment was bizarre and I saw it when it
happened. I am hoping at the very least they have DB backups but I would think
if they did, they would already have been restored. With each passing hour I
am in fact growing very concerned. Not to mention the run on that place could
likely put them out of business when and if they ever do come back online.
Just look at the owner over there. Smoking too much weed I suspect?

------
mnx
At least they are communicating. Hopefully there will be a postmortem.

~~~
dx034
Not really, just saying that it takes longer than expected. Certainly better
than giving no update at all but that would likely cause panic for anyone with
money on kraken.

------
edf13
Never, ever, not ever...

Run major updates to production...

Near a weekend. Surely they're not going to complete the go-live later today
are they??

I'm assuming of course all of this has been heavily tested on test networks for
a couple weeks... and of course this is just some deployment issue...
(Sarcasm)

~~~
rando444
Speaking in general terms, not running updates near a weekend is only so you
know you have personnel on hand to fix problems.

If you have the people available, then weekends are best, because it's usually
a lower impact for end users.

~~~
liquidise
This varies industry by industry. As someone who runs a dating site, we've all
but outlawed non-trivial Friday releases because Fri-Sun are peak days.

It wouldn't surprise me if crypto volumes saw a similar trend... though in
reality I suspect people account for a minority of volume when compared to
bots.

------
bhouston
When upgrades go this bad, can you not just roll back to the previous version?

I guess this was some type of breaking change, like a DB re-structure that
makes rolling back problematic? And then somehow this wasn't tested enough?

~~~
lykr0n
Some upgrades are a one shot deal. If you refactor a data structure, you have
to plan to unrefactor it. If you don't, then there are two options, push
forward (or hotfix) or restore from backups.

Massive schema migrations would be my guess. And I'm willing to bet that their
Test/QA Environment only has a small set of test data and not a replication of
production
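
To make "plan to unrefactor it" concrete, here is a minimal sketch of a forward migration paired with its inverse, using Python's built-in `sqlite3`. Everything in it is an assumption for illustration: the `balances` and `balances_long` tables, their columns, and the `run_migration` helper are hypothetical and say nothing about Kraken's actual schema or tooling.

```python
import sqlite3

# Hypothetical schema change: wide table -> one row per (user, currency).
UP = [
    "CREATE TABLE balances_long (user_id INTEGER, currency TEXT, amount INTEGER, "
    "PRIMARY KEY (user_id, currency))",
    "INSERT INTO balances_long SELECT user_id, 'BTC', btc FROM balances",
    "INSERT INTO balances_long SELECT user_id, 'EUR', eur FROM balances",
    "DROP TABLE balances",
]

# The planned inverse: pivot the long table back into the wide one.
DOWN = [
    "CREATE TABLE balances (user_id INTEGER PRIMARY KEY, btc INTEGER, eur INTEGER)",
    "INSERT INTO balances "
    "SELECT user_id, "
    "SUM(CASE WHEN currency = 'BTC' THEN amount ELSE 0 END), "
    "SUM(CASE WHEN currency = 'EUR' THEN amount ELSE 0 END) "
    "FROM balances_long GROUP BY user_id",
    "DROP TABLE balances_long",
]


def run_migration(conn, statements):
    """Run all statements in one transaction; undo all of them if any fails."""
    conn.execute("BEGIN")
    try:
        for sql in statements:
            conn.execute(sql)
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")


def round_trip():
    """Apply UP then DOWN on sample data and return the rows before and after."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN/COMMIT/ROLLBACK above are the only transaction boundaries.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(DOWN[0])  # create the original wide table
    conn.executemany("INSERT INTO balances VALUES (?, ?, ?)",
                     [(1, 5, 100), (2, 0, 250)])
    before = conn.execute("SELECT * FROM balances ORDER BY user_id").fetchall()
    run_migration(conn, UP)
    run_migration(conn, DOWN)
    after = conn.execute("SELECT * FROM balances ORDER BY user_id").fetchall()
    return before, after
```

Running `round_trip()` against a copy of production-sized data, rather than a handful of test rows, is the part that would catch the problem guessed at above.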

~~~
aneutron
If you run a 700M$ exchange, you should probably have a cold backup of your
database every other hour. I think.

~~~
lykr0n
At the very least a full _offline_ backup before you start.
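
A rough sketch of "back up before you start, and check the backup is usable", again with Python's built-in `sqlite3` as a stand-in for whatever database is really involved. The helper names (`table_counts`, `verified_snapshot`) are made up for this example, and comparing row counts is only a weak sanity check, not proof of a restorable backup.

```python
import sqlite3


def table_counts(conn):
    """Return {table_name: row_count} for every table in the database."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names}


def verified_snapshot(source, path=":memory:"):
    """Copy `source` into a separate database and refuse to continue on mismatch."""
    copy = sqlite3.connect(path)
    source.backup(copy)  # Connection.backup is available since Python 3.7
    if table_counts(copy) != table_counts(source):
        copy.close()
        raise RuntimeError("backup does not match source; do not start the upgrade")
    return copy
```

The point of returning the opened copy is that the upgrade script has to read from the backup at least once before it is allowed to touch the live data.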

------
openplatypus
Oh good. Next stop: upgrade support and compliance. They still didn't work out
how to update a user's physical address. You are to send all your documents IN
PLAIN TEXT over email. Good job crypto(sic)exchange!

------
hotsauceror
This feels like a Resume Generating Event for the DBA team. Sounds like their
DR plan doesn't work or doesn't exist. Even if the devs drew up a breaking
schema change that wasn't adequately tested and can't be rolled back, the
shot-caller would have moved them into disaster recovery mode a long time ago.

~~~
lykr0n
That's assuming they had proper DBAs. Modern ORMs, Docker, and other "DevOps"
tools allow you to build stuff without really knowing what's going on.

Edit: An example that comes to mind is that you can deploy a MySQL cluster in
Docker with persistent storage. Playing with fire.

------
miguelrochefort
They said it would take 2 hours. It's been more than 24 hours.

~~~
eks
It's scary. Why not just revert to the old engine when things started to fall
out of place? Having the exchange down for that long is bound to have a big
backlash when it comes back online.

They should have put a Kraken 2.0 trade engine alongside the first one, and
moved people gradually there. It doesn't matter how confident they were with
the upgrade before it happened, it's crypto, everything is new. A few lines of
wrong code and you can lock millions of dollars in multi-sig wallets.
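
A minimal sketch of what "moved people gradually" could look like: each user is hashed into a stable bucket, and only buckets below the current rollout percentage are routed to the new engine. The function names and the `"v1"` / `"v2"` labels are hypothetical, not anything Kraken has described.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_engine(user_id: str, rollout_percent: int) -> str:
    """Route a user to the new engine only if their bucket is inside the rollout."""
    return "v2" if bucket(user_id) < rollout_percent else "v1"
```

Because the bucket depends only on the user ID, raising the percentage only ever adds users to the new engine, and setting it back to 0 sends everyone to the old one.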

I have most of my funds there because their eur SEPA transfers worked very
well. I really hope they can get back in shape after it comes back online.

~~~
mrweasel
I noticed the "upgrade is coming" notice on their site last week, but I don't
have a Kraken account, I was just interested in learning more about
cryptocurrency trading. Now they made me curious about their platform, from a
purely technical perspective.

As you say, why didn't they just revert? Are they not able to? What are the
steps in their system upgrade? Are they moving to new hardware? What's their
setup like? What software are they running (custom written surely, but what
language, which database technologies?)

Incidents like this make me curious, and I would love to read the post mortem
on something like this.

~~~
stp-ip
Second the notion of a post mortem, but I assume due to their focus on
security they value security by obscurity as an additional factor.

~~~
dx034
Explaining why an upgrade didn't work doesn't compromise security. I'd guess
moving the data didn't work or they corrupted a database and don't know how to
repair it. Explaining that you don't have proper data backups in place can be
embarrassing but with a post mortem you can at least get some trust back (like
Gitlab's incident).

Not explaining why you're offline for 24h doesn't help people to trust you.

~~~
wonderwonder
If they corrupted a live database and are not able to recover it they are in a
world of hurt. While it is bad form, many people keep their coins on the
exchanges and even if the bulk of an individual's coins are offline, they still
likely have at least a small amount on there for trading.

If a table that connects user accounts to kraken owned wallets is corrupted
and not recoverable people will be out millions. For some that would be the
equivalent of your 401k issuing a post mortem for losing all of your
retirement.

If this worst case scenario happened they are likely in severe damage control.

Most likely explanation though is that things are just taking longer than
expected to upgrade what is by all measures likely a very technical and
convoluted system.

------
tedunangst
The graphs at the bottom seem to be telling a different story?

~~~
dx034
A website with no functionality responds quickly and the API graph can be up
100% if it only returns a static message (didn't test their API). But they
should've removed that API chart, a bit misleading.
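
To illustrate the difference, a small sketch of an uptime check that only counts a response as "up" when it carries real data instead of just answering with HTTP 200. The response shape (`error` and `result` keys) and the sample error string are assumptions for the example, not taken from their API documentation.

```python
import json


def is_functional(status_code: int, body: str) -> bool:
    """Count a response as 'up' only if it parses and contains a non-empty result."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:  # not JSON at all, e.g. a static maintenance page
        return False
    if not isinstance(payload, dict):
        return False
    # Assumed shape: {"error": [...], "result": {...}}
    return not payload.get("error") and bool(payload.get("result"))
```

A static "down for maintenance" page returns quickly with a 200 and would pass a plain status-code probe, but fails this check.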

