ShadowRegent's comments | Hacker News

I'm not so sure. One of the potential benefits of removing ports from the iPhone is improved water resistance (personally, I'd still rather have the port). I don't foresee going swimming with my AirPods case.



Maybe, but I think their "Informed Speculation" section was probably unnecessary. They may or may not be correct, but give Flexential an opportunity to share what actually happened rather than openly guessing at what might have happened. Instead, state the facts you know and move on to your response and lessons learned.


Yeah, that part really rubbed me the wrong way. If this was a full postmortem published a couple of weeks after the fact and Flexential still wasn't providing details, I could maybe see including it, but this post is the wrong place and time.


I prefer to have their informed speculation here.

Has Flexential provided a similarly detailed, public root cause analysis? If so, maybe we can refer to it. If not, how do you expect us to read it?


It’s only been a couple of business days, and it’s likely that they themselves will need root-cause details from equipment vendors (and perhaps information from the utility) to fully explain what happened. Perhaps they won’t publish anything, but at least give them an opportunity before trying to do it for them.


I expect them to start reporting out what they know immediately, and update as they learn more. If they're not doing that, and indeed haven't reported anything in days, that is a huge failure.

Imagine if the literal power company failed, and took days to tell people what was going on. You can see why people are reading the postmortem that exists, rather than the one that doesn't.


Interesting choice to spend the bulk of the article publicly shifting blame to a vendor by name and speculating on their root cause. Also an interesting choice to publicly call out that you're a whale in the facility and include an electrical diagram clearly marked Confidential by your vendor in the postmortem.

Honestly, this is rather unprofessional. I understand and support explaining what triggered the event and giving a bit of context, but the focus of your postmortem needs to be on your incident, not your vendor's.

Clearly, a lot went wrong and Flexential needs to do their own postmortem, but Cloudflare doesn't need to make guesses and do it for them, much less publicly.


If Flexential and PGE aren't sharing information or otherwise cooperating as much as Cloudflare might like, then going public with some speculation might be an attempt at applying some pressure to get to the bottom of what happened.

It might also be an effort to get out in front of the story before someone else does the speculating.

In any case, with at least three parties involved, with multiple interconnected systems… if Cloudflare is going to effectively anticipate this cluster of failure modes in future design decisions, it's reasonable for them to want to know what happened all the way down.

Edit to add: I for one am grateful for the information Cloudflare is sharing.


>If Flexential and PGE aren't sharing information or otherwise cooperating as much as Cloudflare might like, then going public with some speculation might be an attempt at applying some pressure to get to the bottom of what happened.

It's been 2 days. I doubt PGE or Flexential have even root-caused it yet, and even if they have, good communication takes time.

You don't throw someone under the bus and smear their name publicly just because they haven't replied for two days, and you certainly don't start speculating on their behalf. That's bad partnership.

You also don't publicly share what "Flexential employees shared with us unofficially" (quote from the article) - what a great way to burn trust with people who probably told you stuff in confidence.

>if Cloudflare is going to effectively anticipate this cluster of failure modes in future design decisions, it's reasonable for them to want to know what happened all the way down.

They can do all of that without smearing people on their company blog. In fact, they can do all of that without even knowing what happened to PGE/Flexential, because per their own admission they were already supposed to be anticipating this, but failed at it. Power outages and data center issues are a known thing, which is exactly why HA exists. HA that Cloudflare failed to deliver. This post-mortem should be almost entirely about that failure rather than speculation about a power outage.


> You don't throw someone under the bus and smear their name publicly just because they haven't replied for two days, and you certainly don't start speculating on their behalf. That's bad partnership.

1. When you’re paying them the kind of money I imagine they’re paying and they don’t reply for 2 days, yea that’s crazy if true. I’d expect a client of this size could talk to an executive on their personal number.

2. Telling the facts as you know them to be especially regarding very poor communication isn’t a smear.


They aren't telling the facts as they know them. Cloudflare themselves say that the information in the article is "speculation" (the article literally uses that term).

Publicly casting blame based on speculation isn't something you do to someone that you want to have a good working relationship with, no matter how much money you pay them.


That's not true. This behaviour would be enough for me to pull the plug on working with this DC; it's more than unacceptable.


> if you want to have a good working relationship with

What are you disagreeing with OP about?

He is talking about how to behave if you continue the relationship, not about whether to continue it.


The post you're replying to is pointing out that multiple days without reporting out a preliminary root cause analysis is so absurdly below the expected level of service here that it would prompt them to reconsider using the service at all.

2 days is outrageous here; I have to imagine whoever thinks that's acceptable is approaching this from the perspective of a company whose downtime doesn't affect profits.


If you actually worked with datacenters you'd understand that what PGE and Flexential did is unacceptable as well.


Agreed. Our DC sends us notifications any time power status changes. We had a dark-building event once, due to something similar sounding: a power failover caused an arc fault in the HV gear that took out the failover switchgear. We received updates frequently.

UPS failing early sounds like it may be a battery maintenance issue.


We have no idea what their contract is. But two business days without a reply isn’t exactly a long time. Especially if they are conducting their own investigation and reproduction steps.


> But two business days without a reply isn’t exactly a long time

What???? We have 4-hour boots-on-the-ground support with Supermicro and that's a few thousand dollars a year lol.

That doesn't make any sense for a customer as big as CF.


My impression from reading the writeup is that CF did receive support and communication from Flexential during the event (although not as much communication as they would have liked), but hasn't received confirmation from Flexential about certain root cause analysis things that would be included in a post-mortem.

Two days without support communications would be a long time, but my original comment about the two-day period is about the post-mortem. It's totally reasonable IMO for a company to take longer than two days to gather enough information to correctly communicate a post-mortem for an issue like this, and IMO it's unreasonable for CF to try to shame Flexential for that.


Especially since it shouldn't matter why the DC failed — Cloudflare's entire business model is selling services allegedly designed to survive that. 99% of the fault lies with Cloudflare for not being able to do their core job.


In all fairness the rest of the article is about that


So why spend so much time trying to shift blame to the vendor? They could've just started the article with something like:

> Due to circumstances beyond our control the DC lost all power. We are still working with our vendors to investigate the cause. While such a failure should not have been possible, our systems are supposed to tolerate a complete loss of a DC.


I don't think I read it as charged as you did

Here's what happened, here's what went wrong, here's what we did wrong, here's our plans to avoid it happening again

Seems like a standard post mortem tbh


Because a small handful of decisions probably led to the Clickhouse and Kafka services still being non-redundant at the datacenter level, and those add up to one mistake on Cloudflare's side. But a small handful of mistakes were made by the vendor, and calling out each one of them was bound to take up more page space.

The ordering in which they list the mistakes would be a fair point to make, though, in my opinion. They hinted at a mistake they made in their summary, but don't actually tell us point blank what it was until after they've told us all the mistakes their vendor made. I'd argue that was either done to make us feel some empathy for Cloudflare as victims of the vendor's mistakes, misleading us somewhat, or because it was genuinely embarrassing for the author to write and subconsciously they want us to feel some empathy for them anyway. Or some combination of the two. Either way, I'll grant that I would have preferred to hear what went wrong internally before hearing what went wrong externally.


Slightly less than half, and the bottom half, so that people just skimming over it will mostly remember the DC operators' problems, not Cloudflare's own. This is very deliberately manipulative.


It is of course possible they've shuffled things around since this was posted but it seems that the first part addresses their system failings.

Paragraphs 5 through 9 are Cloudflare's "we buggered up" section before they get to the power segment. They then continue with the "this is our fault for not being fully HA" after the power bit.

Each to their own, I'm going to read it as a regular old post mortem on this one.


Yeah I agree. The data center should be able to blow up without causing any problems. That's what Cloudflare sells and I'm surprised a data center failure can cause such problems.

Going into such depths on the 3rd party just shows how embarrassing this is for them.


You are way off here; this is 100% on Flexential. They have a 100% Power SLA, which means the power will always be available, right? They also clearly hadn't performed any checks on the circuit breakers, and this is a NEWER facility for them. The batteries didn't even last HALF of their rated runtime before the generators were needed, and they DEFINITELY should have fully moved to generators during this maintenance; they clearly couldn't, because they were MORE than likely assisting PGE. Cloudflare's CEO is right on here: you pay for data center services to be fully redundant. They have 18MW at this location, and from what I can see they have (2) feeds? That I can't find? Do they? If (1) feed goes down, the 2N they have should kick in, and with generators there should be NO issues.


As far as I'm aware, this is the initial post-mortem to describe the events that took place.

And yes, that also means the initial event description reflects what they know so far.

Highly likely there will be another one https://twitter.com/eastdakota/status/1720688383607861442?t=...


I actually disagree. I think the post mortem clearly lays out that there were disappointing things that happened with the vendor, _as well as_ disappointing things that happened internally. I don't think it's unfair to point out everything that happened in an event; I do think it would be unfair to ignore all the compounding issues that were within the vendor's power, and to swallow all of the blame for an event, when a huge reason businesses go through vendors at all is to have an entity responsible for a set of responsibilities the business doesn't feel it has the expertise to handle itself. That implies a relationship built on trust, and it's fair to call out when trust is lost.

And even though Cloudflare did put some of the blame, as it were, on the vendor, the post mortem recognizes that Cloudflare wasn't doing due diligence on their vendor's maintenance and upkeep to verify that the state of the vendor's equipment was the same as the day they signed on. And that's ignoring a huge focus of the post mortem where they admit fault for not knowing about, or not changing, the fact that Kafka and Clickhouse were only in that datacenter.

Furthermore, we do not know that Cloudflare didn't get the vendor's blessing to submit that diagram to their post mortem. You're assuming they didn't. But for what it's worth as someone that has worked in datacenters, none of this is all that proprietary. Their business isn't hurt because this came out. This is a fairly standard (and frankly simplified for business folk) diagram of what any decently engineered datacenter building would operate like. There's no magic sauce in here that other datacenter companies are going to steal to put Flexential out of business. If you work for a datacenter company that doesn't already have any of this, you should write a check to Flexential or their electrical engineers for a consultancy.

And finally, the things that Cloudflare speculated on were things like, to paraphrase, "we know that a transformer failed, and we believe that its purpose was to step down the voltage that the utility company was running into the datacenter." Which, if you have basic electrical engineering knowledge, just makes sense. The utility company is delivering 12,470 volts; of course that needs to be stepped down somewhere along the way, probably multiple times, before it ends up coming through the 208-volt rack PDUs. I'm willing to accept that guess in the absence of facts from the vendor while they're still being tight-lipped.

However, that's not to say I'm totally satisfied by this post mortem either. I am also interested in hearing what decisions led to them leaving Kafka and Clickhouse in a state of non-redundancy (at least at the datacenter level) or how they could have not known about it. Detail was left out there, for sure.


That isn't a voltage change where you'd generally use multiple transformers in sequence, let alone at the same site for the main/primary feed (a redundant feed counts the same, just to be clear). It's more that some low-power, "control plane of the electrical switchyard" applications may use a lower voltage if it's conveniently available, even if that means a second transformation step between the generators/grid and the load.

That said, the existence of the 480V labeled intermediary does suggest they have a 277/480 V outside system, and a 120/208 V rack-side system.
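For what it's worth, those pairs check out arithmetically (my own sanity check, not something from the post): on a standard three-phase wye service, the line-to-line voltage is the line-to-neutral voltage times the square root of three.

    V_{LL} = \sqrt{3}\,V_{LN}
    277\ \mathrm{V} \times \sqrt{3} \approx 480\ \mathrm{V} \quad \text{(outside plant)}
    120\ \mathrm{V} \times \sqrt{3} \approx 208\ \mathrm{V} \quad \text{(rack side)}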


It's replies like these that make companies not want to share detailed postmortems. It's not crazy for many things in an incident to go wrong and for >0 of them to be external. It would be negligent for Cloudflare to not explicate what went wrong with the vendor, which, I would note, reflects poorly on them too: who picked the vendor? If anything, I would have liked to hear more about how Cloudflare ended up with a subpar vendor.

(none of this takes away from the mistakes that were wholly theirs that shouldn't have happened and that they should fix)


Close, but not quite. The keys were for consumer Microsoft accounts, but accepted for organization accounts as well.


> Am I right to assume that because the attacker had the signing key all of the extra authentication mechanisms that would have been enabled on accounts were bypassed by the attacker...?

That's my understanding.


> Interestingly, GSM does allow for priority of emergency calls (112, 999, etc). IIRC, the SIM can make a priority request to be connected to the emergency services. The network could disconnect a non-emergency call to free up the air interface.

An emergency call on GSM doesn't really "call" a number in the traditional sense-- the number is never even transmitted. This is also why emergency services can be reached by the GSM standard number (112) or other emergency numbers specified by the phone or the SIM itself.
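A minimal sketch of the idea, as hypothetical pseudo-handset logic rather than any real modem or SIM API (the default numbers come from 3GPP TS 22.101, and the SIM's EF_ECC file can add carrier/country-specific ones):

    # Hypothetical sketch: how a handset might decide to send a GSM EMERGENCY SETUP
    # instead of a normal call SETUP. Names and structure are illustrative only.
    DEFAULT_EMERGENCY_NUMBERS = {"112", "911"}   # 3GPP TS 22.101 defaults

    def is_emergency_number(dialed, sim_ecc_numbers):
        """True if the dialed string should be treated as an emergency call."""
        return dialed in (DEFAULT_EMERGENCY_NUMBERS | sim_ecc_numbers)

    def place_call(dialed, sim_ecc_numbers):
        if is_emergency_number(dialed, sim_ecc_numbers):
            # The emergency setup carries no called-party number; the network
            # routes it to local emergency services regardless of what was dialed.
            return "EMERGENCY SETUP (no number transmitted)"
        return "SETUP called-party=" + dialed   # normal mobile-originated call

    print(place_call("112", set()))      # EMERGENCY SETUP (no number transmitted)
    print(place_call("999", {"999"}))    # EMERGENCY SETUP (SIM-provisioned number)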


“Twitter, whose press office has been largely destaffed and set to autoreply to requests for comment with a poo emoji, did not acknowledge the reports.”

Of course he did that.


A CA doesn't need to be online for HTTPS to work. If it's offline, it won't be able to issue new certificates, but existing ones will be just fine. In fact, Root CAs are often kept offline as a best practice to limit their exposure.
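A quick sketch of that point, assuming a standard Python install (example.com is just a placeholder): the server's certificate chain is verified against root certificates already stored locally, so no connection to the CA is needed at handshake time. Revocation checks (CRL/OCSP) are a separate, optional mechanism.

    import socket
    import ssl

    # create_default_context() loads the local trust store (root certificates
    # shipped with the OS or Python); verification happens against these anchors.
    ctx = ssl.create_default_context()
    print(len(ctx.get_ca_certs()), "locally installed trust anchors")

    # The chain is validated locally during the TLS handshake; no request is
    # made to the issuing CA itself.
    with socket.create_connection(("example.com", 443)) as sock:
        with ctx.wrap_socket(sock, server_hostname="example.com") as tls:
            print("verified issuer:", tls.getpeercert()["issuer"])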


I stand corrected!

My original cursory understanding of HTTPS was that it required SSL/TLS to make it work, which in turn required CAs issuing certificates and the validation of those certificates to make that work...

Now, all of the above is true -- but the subtle distinction I realized after reading your comment (and doing some more research on the matter) is that everything above is (apparently -- to the best of my knowledge at this point!) based on local Root Certificates, stored in local files (i.e., no need to reach out to a CA server for validation), which act as the "Trust Anchor" for all other SSL/TLS certificates...

I.e., new certificates handed to a user's browser by a new website do not need to be validated by making an Internet connection to the CA and asking the CA to validate them -- instead, they are validated cryptographically, by verifying the certificate chain's signatures against the user's local CA Root Certificates...

These CA Root Certificates are apparently X.509 certificates -- or follow that format:

https://en.wikipedia.org/wiki/X.509

Now, that's a good thing!

It means there's one less dependency (one less possible point of failure): a CA doesn't have to be up and running for all of those HTTPS/TLS/SSL transactions to work...

So, you are correct -- and I stand corrected!


spend less time in the comments section and more time building bro


Question #1: Building what exactly, "bro"?

Question #2: How would that change anything?

Question #3: How do I know that you're not:

a) A GPT-3 or other AI powered chat bot?

b) A Troll?

c) A paid disinformant and/or propagandist and/or someone else with an agenda -- foreign or domestic?

?

Question #4: What value do you genuinely believe your comments add to the discussion?



EarthBound took a similar approach with its anti-piracy measures if you work around the obvious ones. There are far, far more enemies to make the game less enjoyable, and random freezes were added when entering certain areas. If you manage to get to the final boss despite everything else, the game freezes and deletes your save.

