Kraken Goes Down for 40 Hours, Drawing Mt. Gox Comparisons (fortune.com)
84 points by yonilevy on Jan 13, 2018 | 33 comments

Latest Kraken update says that:

> We are still working to track down an elusive bug which is holding up launch. This bug did not appear in our many weeks of testing and only emerged in the production environment. Unfortunately, it is not consistently reproducible and we cannot launch until this issue is resolved.

Sounds pretty realistic to me. During my 8+ years of programming, I've had a lot of pain and sleepless nights trying to catch elusive bugs that can be reproduced only on production servers, and only sometimes. Many times, my clueless managers got nervous and demanded a time estimate from me. Unfortunately, it's impossible to give a realistic estimate. It could take hours, days or weeks. Also, searching for such bugs doesn't scale well: having 5 devs look for an elusive bug doesn't mean they will find it 5 times faster.

I even had clueless managers who demanded I explain the technically complex problem to them, because "maybe our insights could shed new light on the problem". They just don't get that the best thing to do is leave us alone and let us do our job. They are just nervous because they are totally powerless in such a situation.

In a factory, a manager might be smarter than the one doing manual labour. In tech, it's the other way around.

Right idea, wrong reason. You have a tonne of assumptions about how the product works that I don’t. If the bug is that subtle, dollars to donuts says that one of your assumptions is wrong, but you just haven’t thought to question it yet.

When you explain the issue to me, I’ll ask dumb questions about context I lack, so you’ll explain stuff you take for granted. And it’s answering those questions that will make you figure out where the problem is.

If I explain the technical stuff to another programmer, spot on indeed.

With managers who don't know what a variable is, or SQL or TCP or a while loop or whatever, in my experience it is just that: dumb questions.

Something's terribly wrong with the hiring if managers for a tech team are that clueless about tech, though.

Sometimes a good manager can help by expanding the solution space or by shifting the problem somewhere else.

Spot on.

If you assume potential malice (i.e. a lazy dev just not handling the bug at all) then a manager is required to ask about the bug, if only to try to verify that you've put real effort into the problem (based on, say, confidence in whatever you spew). The worst possible scenario is that five months later, it turns out no one actually looked into it properly: they just claimed to.

In that case, the manager should have fired that dev a long time ago.

The best managers I worked for had trust in their developers, and asked for estimates on planning (not bugs). The worst managers didn't trust us, and pushed their own planning on us because they thought it would make us work faster. They thought planning could be negotiated.

I can tell you which projects were within 7% of the deadline, with good quality, and on which ones we were always putting out fires, some of them never finished.

I can't agree more. I always maintained that you should hire devs qualified and motivated enough to be self-managed. In other words, ideally, I would hire people who don't require management at all.

I mean everybody wants this but even the greatest ‘self-managers’ need management to do stuff like this:

- put their work in context
- remove dumb obstacles for them
- ensure that they are working in ways that benefit the team
- check they are having an ok time interacting with others and sharing knowledge
- check they have not said yes to too many things
- ensure they don’t get swayed by better offers
- acknowledge their work

Etc etc ... management is not about enforcing deadlines and correcting problems (which you seldom have to do with truly independent people). There needs to be the right level of communication with each employee so you are not in their way but not hanging them out to dry, and this needs to be maintained even in times when there are no problems.

As I see it, managers need to be at the service of their team. But a lot of managers think their team is at the service of them.

I think one problem is people tend to be locally self-managed: they'll do work assigned just to them, or a close-knit team, but as soon as something is blocked by say the dba team, they'll put very little effort into ensuring the work progresses (unless they have a personal interest in seeing that task through). Or they'll prioritize incorrectly: towards their skill/interests, or what their friends are waiting on (as opposed to the whole team/company), or even a simple misunderstanding of importance.

And of course you have things like deadlocks that have to be resolved, ie dba waiting on programmer, and programmer waiting on artist, and artist waiting on dba.

So the manager serves a distinct, and presumably necessary, role. And by necessity, he must be invasive: if there was sufficient communication, the deadlock would resolve itself. But if the manager is necessary, there is not sufficient communication: that is, someone is not announcing important information of their own accord (perhaps because they aren't aware it's worth announcing). It has to be wrung out.

And if you're trying to stop such problems before they occur, then you'll end up asking seemingly redundant or even arbitrary questions. (Because you can't know the correct questions to ask!)

And this is why revert-first continuous deployment with feature toggles is the most stable way to run production. Deploy the new stuff, add some feature toggles. If the last version is stable, and your code is easy to revert, then you don't have to worry about downtime. If you're doing a major deploy, you can use the feature toggle to slowly turn it on (or off) and even use it to test your new code directly in production (the systems I've used in the past allowed a developer to specify a feature toggle to be on for just their session). Databases can be tricky to toggle, but you have to build that into your design.
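
To make the toggle idea concrete, here's a minimal sketch in Python. Everything in it is made up for illustration: the `FeatureToggles` class, the `new_engine` flag name and the session IDs are assumptions, not how Kraken or any particular toggle library works. It shows the three things described above: a gradual percentage rollout, a per-session override for a developer, and an instant revert by setting the rollout back to 0.

```python
import hashlib


class FeatureToggles:
    """Hypothetical in-memory toggles: gradual rollout plus per-session overrides."""

    def __init__(self):
        self._rollout = {}    # feature name -> percent of sessions enabled (0-100)
        self._overrides = {}  # (feature name, session id) -> forced True/False

    def set_rollout(self, feature, percent):
        """Turn a feature on for roughly `percent` of sessions; 0 reverts it."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[feature] = percent

    def override(self, feature, session_id, enabled):
        """Force a feature on or off for a single session (e.g. a developer's)."""
        self._overrides[(feature, session_id)] = enabled

    def _bucket(self, feature, session_id):
        # Stable hash, so a given session stays in the same bucket (0-99)
        # across requests and process restarts.
        digest = hashlib.sha256(f"{feature}:{session_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_enabled(self, feature, session_id):
        forced = self._overrides.get((feature, session_id))
        if forced is not None:
            return forced
        # Unknown features default to 0 percent, i.e. off.
        return self._bucket(feature, session_id) < self._rollout.get(feature, 0)


def match_orders(toggles, session_id):
    """Route a request to the new or the legacy code path (both are stand-ins)."""
    if toggles.is_enabled("new_engine", session_id):
        return "new engine"
    return "legacy engine"
```

The point of the design is that the legacy path stays deployed the whole time, so "rollback" is a config change (`set_rollout("new_engine", 0)`) rather than a redeploy. It doesn't solve the database part; schema changes still have to be compatible with both paths.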

Pretty irresponsible to completely brick production servers when doing an upgrade.


Sounds like their staff has been going without sleep for a few days now.

> Sounds like their staff has been going without sleep for a few days now.

Given that sleep deprivation has effects similar to being blackout drunk, I'd consider that equally irresponsible[0].

[0]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1739867/

I've lost count of the number of times "watched the sunrise" was the climax of an IT war story. Bonus points for two nights in a row! :D

Worked for an IT managed services place that made staff pull 24+ hrs non-stop regularly. I walked away from that career and spent years recovering from a 38 hr shift where I was begging my managers to get me relief. "You're the only guy that knows this well enough" is a sign you work for a failing organization pinning that shit on you. I learned that the "business" and sales dudes absolutely will trade your health/livelihood for a few extra bucks on commission.

I can't help but feel less stressed about my own job after reading this status page.

It's not an upgrade, it's a bugfix. Probably an important bug, they don't want to risk tens/hundreds of millions of dollars.

They could have left the legacy system running before switching irreversibly to the new, bugged system. That way they could roll back the change.

Unless the bug being fixed was some sort of critical security failure that enabled people to game the exchange or manipulate balances. I am guessing this is more of a screw-up by amateurs than a 'take it all down until this hole is patched' situation, but there is a slim possibility that keeping it up was a bigger risk than the reputation hit of extended downtime.

Looks like these are announced improvements to the Exchange engine. Really look forward to a post-mortem on this issue. Curious if they had issues with database migrations or such, that required recovery from backup in case of errors. Plus, for sure there were bugs in the code itself.

No, they couldn't. It's a large trading platform. Any serious bug can lead to the theft of tens of millions of dollars worth of cryptocurrencies.

Upgrade a 750 million dollar a day trading system with no rollback strategy.

Amateur hour, and if this were a "real" exchange, they'd lose their licence and have lawsuits and government agents knocking down their door (real-money MTFs have strict uptime and infrastructure requirements for anyone who didn't know)

They're back with free trading until end of January: https://blog.kraken.com/post/1449/kraken-returns-with-free-t...

Down again. Amateur hour. Lol.

Still no withdrawals..

They say the next big thing is here, that the revolution's near, but to me it seems quite clear that it's all just a little bit of history repeating

I tend to trust them for security because they are handling the Gox bankruptcy. For uptime... not so much.

It sounds like this week is going to be profitable on Kraken if you trust their status reports.

everything is up and trading is free for the rest of the month. </drama>

Kraken is a notoriously slow, unresponsive and buggy exchange and frankly I don't know why they get away with it. For months, or even year(s), you would often run into site unavailable errors (load balancing issues?). Or if you are lucky enough to get to the trading screen, you are unable to get trades through because of engine issues (capacity issues again?). Sometimes all you see is an empty red background rectangle where the error text should be, indicating no dice, but then minutes/hours later you might get one or more of the positions you tried to open despite only having gotten (ambiguous) errors.

Let's hope they finally fixed their engine and purchased more capacity.
