Kraken down 24 hours on system upgrade (kraken.com)
63 points by ifdefdebug 8 months ago | 57 comments



They erased this glorious status update:

> I fell asleep at my desk. In a dream, the ghost of a butterfly from the future whispered to me "the dollar hit 0 bitcoin". I wake up, people around me dabbing the blood from their eyes. Progress has been made but not enough. I crack my whip -- "we aren't getting to zero like this!". The flogging will continue until morale improves. Expect another update after I make my rounds. I'm sorry I've let you down. Jan 12, 03:11 UTC

But not before archive.org saved it for posterity: https://web.archive.org/web/20180112034351/https://status.kr...


Guess in his dream BTC was over 33.5M and the butterfly was only using float16 precision...
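For anyone checking the math, a quick numpy sketch (mine, not from the thread):

    import numpy as np

    # float16's smallest subnormal is 2**-24; anything at or below half of
    # that (2**-25 == 1/33,554,432) rounds to zero under round-to-nearest-even.
    # So in half precision the dollar only "hits 0 bitcoin" once BTC trades
    # above roughly 33.5M.
    print(np.float16(1 / 33_554_432))  # 0.0   (the exact tie rounds to even, i.e. zero)
    print(np.float16(1 / 33_000_000))  # 6e-08 (still a subnormal, not yet zero)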


Besides their tendency to create antipodal hurricanes, type safety is a major problem with butterflies.


It appears that we are at the mercy of mentally unstable amateurs.


Just because the social media guy likes to goof around on social media doesn't mean the rest of the team is like that.

But definitely not a good moment to make a joke.


Saying things like "ahead of the team passing out" in their latest status updates doesn't exactly give confidence that they have either good communications or engineering leadership.


If everyone has been up for 24 hours, you will get status updates like that.

I guess it's all hands on deck and the team isn't large enough to rotate shifts? Or the issue is very complex and you need all your best people around to fix it.

Yikes.

As someone who runs a high-availability website, I can feel for them, but it is really bad to be offline this long. I would look at making some changes to that team, or at addressing an understaffing issue, or something. This should not happen. It was preventable and foreseeable, and a risk-mitigation strategy should have been in place.


I suspect we're talking about some backend issue that they're handling. Throughout my career I've never seen a "rotating shifts team" that can handle situations like this. It's usually one guy trying to fix it and a couple of others hanging around the office with a cup of coffee, giving advice on what the company should do next or being forced to update the status page.


Based on the text-only homepage, I'd say the tech team is much worse - at least the social media guy is amusing.


The amount of downtime, and how their wire transfers have been for months, say as much.


You have to differentiate between the social media person and the actual engineers.


Considering some of the comments I wrote at three in the morning, I guess that the actual engineers had a quick break after the social media person went home.


This team is in over their heads. If you don’t have stateless infrastructure with a big red “oops, this isn’t right, roll back” button, you’re doing it wrong. Read the reviews of their iPhone app on the App Store. They are not equipped to be running a financial company.

Bummer because I’m looking to leave Coinbase. Looks like everyone in this space is nerf-grade.


I've personally had great experience with GDAX, which is an exchange operated by the folks at Coinbase. Though if your issues with Coinbase extend beyond just pricing, this might not be a very good alternative.


No kidding - their front page is literally black text on a white background.

They don't have a single person who can use Photoshop and put up a proper maintenance page? They didn't already have a failover maintenance page? This shit is Web Hosting 101 stuff from decades ago.

This is a new level of amateur idiocy.


I get a downvote for speaking the truth? Someone from the Kraken dev team I assume.

Feel free to explain to me how the current result from http://kraken.com/ is at all even remotely professional.


For those wondering, Kraken is a Bitcoin exchange.

https://en.wikipedia.org/wiki/Kraken_(bitcoin_exchange)


How much do they have under management, what's their daily volume, etc.?


Billions.


Billions for both or just volume? Money in accounts would be interesting as that's where regulators (or the FBI) will step in at some point if it looks like potential fraud.


No one releases money-in-accounts numbers.

Kraken was doing $711M in daily volume before the maintenance. [1]

Only the top 8 exchanges have cracked a billion in daily volume. [2]

[1] https://coinmarketcap.com/exchanges/kraken/

[2] https://coinmarketcap.com/exchanges/volume/24-hour/all/


Yet another illustration that bitcoinz is a joke of an ecosystem.


Thanks.


They didn't even communicate to their users that there would be maintenance. I use Kraken every day and did not receive any sort of email, and there wasn't a notice on the site that I could see. Maybe they announced it on social media, but that's about it.

Starting unannounced maintenance in the middle of the day is crazy, even if it was only for the original 2 hours. It's been over 24 hours now.

I don't know if this is due to incompetence or intentional.

If they had announced the site was going down for a major upgrade, the smart thing for users to do would have been to transfer their crypto to their offline wallets, and possibly withdraw their fiat too. This is what many do before a major fork of a single coin.

Separately, there is for sure going to be tons of money lost, because Kraken supports margin trading and the market took a 15% dive right around this time. I suspect at least a few people got a margin call and are screwed. They would have been able to manage their positions if the service hadn't had extended, unannounced downtime.


It's clearly incompetence. I suspect someone dropped a table they weren't supposed to, and they started scrambling to restore a backup they'd never tested, only to discover it wasn't working.

Going to kraken.com tells you all you need to know about these folks, all bad.


This is really bad. I hope they can handle everyone withdrawing their funds as soon as they are back online.

Edit: current status: https://status.kraken.com/incidents/nswthr1lyx72


They are cancelling all pending withdrawal and liquidation requests as well, according to their status updates.

So basically a software upgrade/bug can now cause a bank run? Interesting times.


"Withdrawals in all currencies will be offline for an additional 2-3 hours after other systems come online. If you submit a withdrawal request while withdrawals are offline, the withdrawal will be sent after withdrawals come online again."

That may help stabilize things a bit.


I wrote an open-source command-line utility to query the APIs of exchanges like Kraken, Bittrex, Bitfinex, etc., and had to basically abandon the Kraken portion back in November. It had started to become incredibly unreliable, even back then.

I moved all my coins off of there a few weeks later, because it was becoming impossible to get a buy or sell order in.

In other words, to users of the site, this isn't surprising. :(
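For the curious, a query against Kraken's public REST API looks roughly like this. A minimal sketch, not the parent's actual utility; the pair name is just Kraken's label for BTC/USD:

    import requests

    # Public Ticker endpoint from Kraken's REST API; no auth needed.
    resp = requests.get(
        "https://api.kraken.com/0/public/Ticker",
        params={"pair": "XXBTZUSD"},
        timeout=10,
    )
    data = resp.json()
    if data.get("error"):
        # The failure mode the parent describes: the HTTP call succeeds,
        # but the exchange hands back an error payload instead of a ticker.
        raise RuntimeError(data["error"])
    for pair, ticker in data["result"].items():
        print(pair, "last trade:", ticker["c"][0])  # "c" = last closed trade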


I wish them the best; it must be very stressful for an update to go south like this. Their founder Jesse reached out to us in the past, and I got a good impression of him.

Maybe they got so rich from the high crypto valuations that most of the senior team are on their private islands already?


These exchanges must have such a hard time recruiting new staff. They need to trust their employees with millions or even hundreds of millions of dollars. And they are growing at an insane rate. Ouch.


I had an account with them some months ago, but their system is shit. Completely shit. When I was long, the balance showed the opposite; when the market moved against me, my position improved, and vice versa. This had all the bad consequences one can imagine, as I couldn't close some positions because supposedly there were no funds for that, when in fact there were.

They need a system fix, I really don't trust them.


As a long-time software developer and manager, it is unimaginable to me that they did not do a proper analysis of the risks and have mitigation strategies in place. The whole future-butterfly comment was bizarre, and I saw it when it happened. I am hoping they at the very least have DB backups, but I would think that if they did, those would already have been restored. With each passing hour I am in fact growing very concerned. Not to mention that the run on that place could well put them out of business if and when they ever do come back online. Just look at the owner over there. Smoking too much weed, I suspect?


At least they are communicating. Hopefully there will be a postmortem.


Not really; they're just saying it's taking longer than expected. Certainly better than giving no update at all, which would likely cause panic for anyone with money on Kraken.


Never, ever, not ever...

Run major updates to production...

Near a weekend. Surely they're not going to complete the go-live later today, are they??

I'm assuming, of course, that all of this has been heavily tested on test networks for a couple of weeks... and that this is just some deployment issue... (Sarcasm)


I reckon you've never worked in an investment bank. Weekends are often the only time to run major changes/upgrades.


Speaking in general terms, not running updates near a weekend is only so you know you have personnel on hand to fix problems.

If you have the people available, then weekends are best, because it's usually a lower impact for end users.


This varies industry by industry. As someone who runs a dating site, we've all but outlawed non-trivial Friday releases because Fri-Sun are peak days.

It wouldn't surprise me if crypto volumes saw a similar trend... though in reality I suspect people account for a minority of volume compared to bots.


When upgrades go this bad, can you not just roll back to the previous version?

I guess this was some type of breaking change, like a DB restructure, that makes rolling back problematic? And then somehow it wasn't tested enough?


Some upgrades are a one-shot deal. If you refactor a data structure, you have to plan how to un-refactor it. If you don't, then there are only two options: push forward (or hotfix), or restore from backups.

Massive schema migrations would be my guess. And I'm willing to bet that their test/QA environment only has a small set of test data, not a replica of production.


I believe the proper way to upgrade a schema is to:

1) don't change the existing schema (don't alter columns),

2) introduce the new schema side by side (add new columns, all null, without defaults),

3) update code to read the new schema first, falling back to the old schema if nothing is found,

4) create triggers so that a write to either schema updates the other,

5) have a separate process sequentially migrate data from the old schema to the new one,

6) once all records are migrated, start slowly removing code that reads the old schema,

7) once nothing reads the old schema, delete it.

Handling schema changes is a tricky process spread across multiple steps, each of which can be rolled back independently.
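A minimal sketch of steps 2 through 5 using SQLite (illustrative only: the table and column names are invented, and a real system would also need insert triggers and batch sizes tuned to load):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Old schema: balances stored as floats (what we want to migrate away from).
    cur.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    cur.executemany("INSERT INTO accounts (balance) VALUES (?)", [(1.5,), (2.25,)])

    # Step 2: new column side by side (nullable, no default).
    cur.execute("ALTER TABLE accounts ADD COLUMN balance_sats INTEGER")

    # Step 4: a write to the old column keeps the new one in sync.
    cur.execute("""
        CREATE TRIGGER sync_new AFTER UPDATE OF balance ON accounts
        BEGIN
            UPDATE accounts
            SET balance_sats = CAST(NEW.balance * 100000000 AS INTEGER)
            WHERE id = NEW.id;
        END
    """)

    # Step 5: backfill old rows in small batches instead of one giant UPDATE.
    while True:
        cur.execute("""
            UPDATE accounts
            SET balance_sats = CAST(balance * 100000000 AS INTEGER)
            WHERE id IN (SELECT id FROM accounts
                         WHERE balance_sats IS NULL LIMIT 1000)
        """)
        if cur.rowcount == 0:
            break
        conn.commit()

    # Step 3: readers prefer the new column and fall back to the old one.
    cur.execute("""
        SELECT id, COALESCE(balance_sats, CAST(balance * 100000000 AS INTEGER))
        FROM accounts
    """)
    print(cur.fetchall())  # [(1, 150000000), (2, 225000000)]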


If you run a $700M exchange, you should probably have a cold backup of your database every other hour. I think.


At the very least, take a full offline backup before you start.
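As a sketch of that pre-flight step (assuming a MySQL backend, which is purely a guess on my part):

    import datetime
    import subprocess

    # Hypothetical pre-migration snapshot. Assumes mysqldump is on PATH and
    # credentials live in ~/.my.cnf; --single-transaction takes a consistent
    # InnoDB snapshot without locking tables.
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    with open(f"pre_upgrade_{stamp}.sql", "wb") as out:
        subprocess.run(
            ["mysqldump", "--single-transaction", "--all-databases"],
            stdout=out,
            check=True,  # no backup, no migration
        )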


Oh good. Next stop: upgrade support and compliance. They still haven't worked out how to update a user's physical address. You are to send all your documents IN PLAIN TEXT over email. Good job, crypto(sic)exchange!


This feels like a Resume Generating Event for the DBA team. Sounds like their DR plan doesn't work or doesn't exist. Even if the devs drew up a breaking schema change that wasn't adequately tested and can't be rolled back, the shot-caller should have moved them into disaster-recovery mode a long time ago.


That's assuming they had proper DBAs. Modern ORMs, Docker, and other "DevOps" tools allow you to build stuff without really knowing what's going on.

Edit: an example that comes to mind is that you can deploy a MySQL cluster in Docker with persistent storage. Playing with fire.


They said it would take 2 hours. It's been more than 24 hours.


It's scary. Why not just revert to the old engine when things started to fall out of place? Having the exchange down this long is bound to cause a big backlash when it comes back online.

They should have run a Kraken 2.0 trade engine alongside the first one and moved people there gradually. It doesn't matter how confident they were in the upgrade beforehand; it's crypto, everything is new. A few lines of wrong code and you can lock millions of dollars in multi-sig wallets.
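That gradual cutover can be as simple as stable per-user bucketing. A hypothetical sketch (the function and names are mine, not anything Kraken runs):

    import hashlib

    def use_new_engine(user_id: str, rollout_percent: int) -> bool:
        # Hash the user id to a fixed bucket in [0, 100) so the same user
        # always hits the same engine, then dial rollout_percent up from
        # 1 to 100 as confidence in the new engine grows.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return bucket < rollout_percent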

I have most of my funds there because their EUR SEPA transfers worked very well. I really hope they can get back in shape after the site comes back online.


The site was already so unreliable and hard to use that this downtime is almost no more inconvenient than the "working" site was before. Now at least they are fixing it. They held the update back for months for testing; they had to take the jump eventually, and after x hours of downtime the damage is done, so fixing it once and for all instead of rolling back might actually be the better solution.


I noticed the "upgrade is coming" notice on their site last week, but I don't have a Kraken account; I was just interested in learning more about crypto-currency trading. Now they've made me curious about their platform, from a purely technical perspective.

As you say, why didn't they just revert? Are they not able to? What are the steps in their system upgrade? Are they moving to new hardware? What's their setup like? What software are they running (custom written surely, but what language, which database technologies?)

Incidents like this make me curious, and I would love to read the post mortem on something like this.


Second the notion of a postmortem, but I assume that, given their focus on security, they value security by obscurity as an additional factor.


Explaining why an upgrade didn't work doesn't compromise security. I'd guess moving the data didn't work, or they corrupted a database and don't know how to repair it. Explaining that you don't have proper data backups in place can be embarrassing, but with a postmortem you can at least win some trust back (like GitLab's incident).

Not explaining why you're offline for 24 hours doesn't help people trust you.


If they corrupted a live database and are not able to recover it, they are in a world of hurt. While it is bad form, many people keep their coins on exchanges, and even if the bulk of an individual's coins are offline, they still likely have at least a small amount on there for trading.

If a table that connects user accounts to Kraken-owned wallets is corrupted and not recoverable, people will be out millions. For some, that would be the equivalent of your 401k issuing a postmortem for losing all of your retirement.

If this worst case scenario happened they are likely in severe damage control.

The most likely explanation, though, is that things are just taking longer than expected in upgrading what is, by all measures, likely a very complex and convoluted system.


It boggles my mind that it's in the state it is. Some kid in his basement knows you do staged rollouts and build in parallel; this is seriously the most basic IT knowledge in existence, and yet they still screwed it up, now over 24 hours later.

Whoever their CTO is, they're clearly the worst kind of incompetent, and the team is the most amateur I have ever seen in my 20 years of Internet systems management.


The graphs at the bottom seem to be telling a different story?


A website with no functionality responds quickly, and the API graph can show 100% uptime if it only returns a static message (I didn't test their API). But they should've removed that API chart; it's a bit misleading.



