Rust needs to get rid of .unwrap() and its kin. They're from pre-1.0 Rust, befor...

svieira · 2025-11-26T19:12:54 1764184374

All that means is that the `Failure` bubbles up to the very top of `main` (in this scenario) because we're only caring about the happy path (because we can't conceive of what the unhappy path should be other than "crash") and then hits the `panic("Well, that's unexpected")` explicitly in Place B rather than Place A (the `.unwrap`). I'm not sure how that's _better_.

jacquesm · 2025-11-26T19:47:50 1764186470

It would not because it would be a compile time error rather than run time error which is a completely different beast if I understand the argument correctly.

Dylan16807 · 2025-11-26T23:26:37 1764199597

What would be a compile time error? The compiler rejecting unwrap? And then you fix that by bubbling the error case up, which fixes the compiler error and leaves you with a runtime error again. But one that's less ergonomic.

You can't force a config file loaded at run time to be correct at compile time. You can only decide what you're going to do about the failure.

jacquesm · 2025-11-26T23:48:00 1764200880

The point they are - trying, apparently - making is that if you had a flag or an annotation that you could make to a function that you do not want that function to be built on top of anything that can 'unwrap' that you can rule out some of these cases of unexpected side effects.

echelon · 2025-11-26T19:50:34 1764186634

Not really. Handler and middleware can handle this without much ceremony. The user gets to, and is informed of and encouraged to, choose.

We also don't get surprised at runtime. It's in the AST and we know at compile time.

The right API signature helps the engineer think about things and puts them in the correct headspace for systems thinking. If something is panicking under the hood, the thought probably doesn't even occur to them.

svieira · 2025-11-27T02:52:06 1764211926

Yes, but my point is that without a reasonable supervision tree and crash boundary the difference between a composition of Result-returning functions that bottoms out in main's implicit panic and an explicit panic is nil operationally.

While lexically the unwrap actually puts the unhandledness of the error case as close to the source of the issue's source as possible. In order to get that lexical goodness you'd need something much more fine grained than Result.

12_throw_away · 2025-11-26T18:30:49 1764181849

> There's no reason to use [panics] as the language provides lots of safer alternatives.

Dunno ... I think runtime assertions and the ability to crash a misbehaving program are a pretty important part of the toolset. If rust required `Result`s to be wired up up and down the entire call tree for the privilege of using a runtime assertion, I think it would be a lot less popular, and probably less safe in practice.

> Alternatively, and perhaps even better, Rust needs a way to mark functions that can panic for any reason other than malloc failures.

I 100% agree that a mechanism to prove that code can or cannot panic would be great, but why would malloc be special here? Folks who are serious about preventing panics will generally use `no-std` in order to prevent malloc in the first place.

zbentley · 2025-11-27T03:51:43 1764215503

> a mechanism to prove that code can or cannot panic would be great

As appealing as the idea of a #[cfg(nopanic)] enforcement mechanism is, I think linting for panic() is the optimum, actually.

With a more rigidly enforced nopanic guarantee, I worry that some code and coders would start to rely on it (informally, accidentally, or out of ignorance) as a guarantee of completion, not return behavior. And that’s bad; adding language features which can easily be misconstrued to obscure the fact that all programs can terminate at any time is dangerous.

Lints, on the other hand, can be loud and enforced (and tools to recursively lint source-available dependencies exist), but few people mistake them for runtime behavior enforcement.

echelon · 2025-11-26T18:43:19 1764182599

> I 100% agree that a mechanism to prove that code can or cannot panic would be great, but why would malloc be special here? Folks who are serious about preventing panics will generally use `no-std` in order to prevent malloc in the first place.

In one of the domains I work in, a malloc failure and OOMkill are equivalent. We just restart the container. I've done all the memory pressure measurement ahead of time and reasonably understand how the system will behave under load. Ideally it should never happen because we pay attention to this and provision with lots of overhead capacity, failover, etc. We have slow spillover rather than instantaneous catastrophe. Then there's instrumentation, metrics, and alerting.

A surprise bug in my code or a dependency that causes an unexpected panic might cause my application or cluster to restart in ways we cannot predict or monitor. And it can happen across hundreds of application instances all at once. There won't be advanced notice, and we won't have a smoking gun. We might waste hours looking for it. It could be as simple as ingesting a pubsub message and calling unwrap(). Imagine an critical service layer doing this all at once, which in turn kills downstream services, thundering herds of flailing services, etc. - now your entire company is on fire, everyone is being paged, and folks are just trying to make sense of it.

The fact is that the type of bugs that might trigger a user-induced panic might be hidden for a long time and then strike immediately with millions of dollars of consequences.

Maybe the team you implemented an RPC for six months ago changes their message protocol by flipping a flag. Or maybe you start publishing keys with encoded data center affinity bytes, but the schema changed, and the library that is supposed to handle routing did an unwrap() against a topology it doesn't understand - oops! Maybe the new version handles it, but you have older versions deployed that won't handle it gracefully.

These failures tend to sneak up on you, then happen all at once, across the entire service, leaving you with no redundancy. If you ingest a message that causes every instance to death spiral, you're screwed. Then you've got to hope your logging can help you find it quickly. And maybe it's not a simple roll back to resolve. And we know how long Rust takes to build...

The best tool for this surely can't be just a lint? In a supposedly "safe" language? And with no way to screen dependencies?

Just because somebody's use case for Rust is okay with this behavior doesn't mean everyone's tolerates this. Distributed systems folks would greatly appreciate some control over this.

All I'm asking for is tools to help us minimize the surface area for panics. We need as much control over this as we can get.

Dylan16807 · 2025-11-26T23:32:18 1764199938

If you replace panic with a bespoke fallback or retry, have you really gained anything? You can still have all your services die at the same time, and you'll have even less of a smoking gun since you won't have a thousand stack traces pointing at the same line.

The core issue is that resilience to errors is hard, and you can't avoid that via choice of panic versus non-panic equivalents.

bigstrat2003 · 2025-11-27T04:55:45 1764219345

Unwrap is not only fine, it's a valuable part of the language. Getting rid of it would be a horrible change. What needs to happen is not using an assert (which is really what unwrap is) if an application can't afford to crash.

jacquesm · 2025-11-26T18:07:04 1764180424

I'd say the equivalent of Erlang's supervisor trees is what is needed but once you go that route you might as well use Erlang.

zbentley · 2025-11-27T03:57:31 1764215851

I’m not sure that panic (speaking generally about the majority of its uses and the spirit of the law; obviously 100% of code does not obey this) is the equivalent of an Erlang process crash in most cases. Rather, I think unwrap()/panic are usually used in ways more similar to erlang:halt/1.

jacquesm · 2025-11-27T13:06:15 1764248775

Exactly, but that is kind of the point here. An Erlang 'halt' is something that most Erlang programmers would twig is not what you want in most cases, in most cases you want your process to crash and for the supervisor to restart it if the error is recoverable.

What happened here is systemic: the config file contained an issue severe enough that it precluded the system from running in the first place and unfortunately that caused a runtime error when in fact that validation should have been separate from the actual use. This is where I see the problem with this particular outage. And that makes it an engineering issue much more than a language issue.

Bad configuration files can and do happen, so you take that eventuality into account during systems design.

ViewTrick1002 · 2025-11-26T18:21:56 1764181316

Or just deploy containers with an orchestrator restarting them when failing?

It is not like an Erlang service would be able to make progress with an invalid config either.

jacquesm · 2025-11-26T18:58:16 1764183496

That's fair, but even there the roll-back would be a lot smoother, besides the supervisor trees are a lot more fine grained than restarting entire containers when they fail.

lenkite · 2025-11-26T18:56:03 1764183363

What happens when they "keep" failing ? You never get to know what is causing your nightmare.

burntsushi · 2025-11-26T19:08:50 1764184130

I'm on libs-api. We will never get rid of unwrap(). It is absolutely okay to use unwrap(). It's just an assertion. Assertions appear in critical code all the time, including the standard library. Just like it's okay to use `slice[i]`.

echelon · 2025-11-26T19:15:32 1764184532

This is the Hundred Billion Dollar unwrap() Bug.

You can keep unwrap() and panics. I just want a static first class method to ensure it never winds up in our code or in the dependencies we consume.

I have personally been involved in nearly a billion dollars of outages myself and am telling you there are simple things the language can do to help users purge their code of this.

This is a Rust foot gun.

A simple annotation and compiler flag to disallow would suffice. It needs to handle both my code and my dependencies. We can build it ourselves as a hack, but it will never be 100% correct.

This is why I want it:

https://news.ycombinator.com/item?id=46060907

burntsushi · 2025-11-26T19:21:56 1764184916

You said:

> Rust needs to get rid of .unwrap() and its kin.

Now you say:

> You can keep unwrap() and panics.

So which is it?

> I just want a static first class method to ensure it never winds up in our code or in the dependencies we consume.

Now this is absolutely a reasonable request. But it's not an easy one to provide depending on how you go about it. For example, I'd expect your suggestion in your other comment to be a non-starter because of the impact it will have on language complexity. But that doesn't mean there isn't a better way. (I just don't know what it is.)

This is a classic motte and bailey. You come out with a bombastic claim like "remove unwrap and its ilk," but when confronted, you retreat to the far more reasonable, "I just want tools to detect and prevent panicking branches." If you had said the latter, I wouldn't have even responded to you. I wouldn't have even batted an eye.

> This is the Hundred Billion Dollar unwrap() Bug.

The Cloudflare bug wasn't even caused by unwrap(). unwrap() is just its manifestation. From a Cloudflare employee:

> In this case the unwrap() was only a symptom of an already bad state causing an error that the service couldn't recover from. This would have been as much of an unrecoverable error if it was reported in any other way. The mechanisms needed to either prevent it or recover are much more nuanced than just whether it's an unwrap or Result.

gishh · 2025-11-26T23:25:23 1764199523

> unwrap() was only a symptom of an already bad state causing an error that the service couldn't recover from. This would have been as much of an unrecoverable error if it was reported in any other way. The mechanisms needed to either prevent it or recover are much more nuanced than just whether it's an unwrap or Result.

This sounds like the kind of failure Bobby Tables warned about a long time ago. An entire new, safe language was developed to prevent these kinds of failures. “If it compiles it’s probably correct” seems to be the mantra of rust. Nuts.

burntsushi · 2025-11-27T00:04:53 1764201893

The fact that this wasn't RCE or anything other than denial of service is a raging success of Rust.

“If it compiles it’s probably correct” has always been a tongue-in-cheek pithy exaggeration. I heard it among Haskell programmers long before I heard it in the context of Rust. And guess what? Haskell programs have bugs too.

gishh · 2025-11-27T02:59:21 1764212361

> “If it compiles it’s probably correct” has always been a tongue-in-cheek pithy exaggeration.

If you say so, I believe you. That isn’t how it comes across in daily, granted pithy, discourse around here.

I have a lot of respect for you Andrew, not meaning to attack you per se. You surely can see the irony in the internet falling over because of an app written in rust, and all that comes with this whole story, no?

burntsushi · 2025-11-27T03:10:31 1764213031

Nope. Because you've completely mischaracterized not only the actual problem here, but the value proposition of Rust. You're tilting at windmills.

Nobody credible has ever said that Rust will fix all your problems 100% of the time. If that's what you inferred was being sold based on random HN commentary, then you probably want to revisit how you absorb information.

Rust has always been about reducing bugs, with a specific focus on bugs as a result of undefined behavior. It has never, isn't and will never be able to eliminate all bugs. At minimum, you need formal methods for that.

Rust programs can and will have bugs as a result of undefined behavior. The value proposition is that their incidence should be markedly lower than programs written in C or C++ (i.e., implementations of languages that are memory unsafe by default).

gishh · 2025-11-27T04:01:35 1764216095

> If that's what you inferred was being sold based on random HN commentary, then you probably want to revisit how you absorb information.

Heard, chef.

Dylan16807 · 2025-11-26T23:39:54 1764200394

In a local sense, "quit out safely when the config is corrupt" is pretty correct.

Coordinated systems that test and rollback are way beyond the scope of what a compiler can check.

gishh · 2025-11-27T00:06:41 1764202001

What about “detect when the content isn’t correct and take protective measures so that a core service of the global internet _doesn’t_ crash?” Wasn’t that the whole point of rust? I’ll repeat again “if it compiles it is almost absolutely correct” is a mantra I see on hn daily.

Apparently that isn’t true.

Edit: isn’t the whole idea of C/C++ being flawed pivoted around memory management and how flawed the languages are? Wasn’t the whole point of rust to eliminate that whole class of errors? XSS and buffer overflows are almost always caused by “malformed” outside input. Rust apparently doesn’t protect against that.

Dylan16807 · 2025-11-27T01:12:09 1764205929

If you corrupt memory, a huge variety of unpredictable bad things can happen.

If you exit, a known bad thing happens.

No language can protect you from a program's instructions being broken. What protective measures do you have in mind? Do they still result in the service ceasing to process data and reporting a problem to the central controller? The difference between "stops working and waits" and "stops working and calls abort()" is not much, and usually the latter is preferred because it sets off the alarms faster.

Tell me what specifically you want as correct behavior in this situation.

jacquesm · 2025-11-27T09:13:43 1764234823

Ok, I'll take a stab at that:

I would expect such a critical piece of code to be able to hot-load and validate a new configuration before it is put into action. I would expect such a change to be rolled out gradually, or at least as gradually as required to ensure that it functions properly before it is able to crash the system wholesale.

I can't say without a lot more knowledge about the implementation and the context what the best tools would be to achieve this but I can say that crashing a presently working system because of a config fuckup should not be in the range of possible expected outcomes.

Because config fuckups are a fact of life so config validation before release is normal.

Dylan16807 · 2025-11-27T12:02:41 1764244961

> I would expect such a critical piece of code to be able to hot-load and validate a new configuration before it is put into action.

And if that config doesn't validate, what should the process do? Maybe it had a previous config, maybe it didn't. And if it keeps running the old config, that adds extra complication to gradual rollout and makes it harder to understand what state the system is in.

> I would expect such a change to be rolled out gradually, or at least as gradually as required to ensure that it functions properly before it is able to crash the system wholesale.

Me too. Note that doing a gradual rollout doesn't care whether the process uses unwrap or uses something gentler to reject a bad config.

> I can say that crashing a presently working system because of a config fuckup should not be in the range of possible expected outcomes.

By "working system" do you mean the whole thing shouldn't go down, or the single process shouldn't go down? I agree with the former but not the latter.

burntsushi · 2025-11-27T11:49:58 1764244198

Yes, we are lacking information.

But the operative point in this sub thread is whether unwrap() specifically is load bearing.

If instead they bubbled up the error, printed it and then exited the program---without ever using unwrap---then presumably they still would have had a denial of service problem as a result of OOM.

And even if unwrap were load bearing here, then we would be in agreement that it was an inappropriate use of unwrap. But we are still nowhere near saying "unwrap should literally never be used in production."

saati · 2025-11-26T23:29:38 1764199778

Who lost a hundred billion dollars?

ViewTrick1002 · 2025-11-26T18:21:03 1764181263

I’ve been seeing you blazing this trail since the incident and it feels short sighted and reductive.

Rust is built on forcing the developer to acknowledge the complexity of reality. Unwrap acknowledges said complexity with a perfectly valid decision.

There are a few warts from early days like indexing and the ”as” operator where the easy path is doing the wrong thing.

But unwraps or expects are where Rust shines. Throwing up your hands is a perfectly reasonable response.

With your approach, what should Cloudflare have done?

Return an error, log it and return a 500 result due to invalid config? They could fail open, but then that opens another enormous can of worms.

There simply are no good options.

The issue rests upstream where deployments and effects between disparate services needs to be mapped and managed.

Which is a truly hard problem, rather than blaming the final piece throwing up its hand when given an invalid config.

echelon · 2025-11-26T18:27:27 1764181647

> I’ve been seeing you blazing this trail since the incident and it feels a short sighted and reductive.

Why is it inappropriate to be able to statically label the behavior?

Maybe I don't want my failure behavior dictated by a downstream dependency or distracted engineer.

The subject of how to fail is a big topic and is completely orthogonal to the topic of how can we know about this and shape our outcomes.

I would rather the policy be encoded with first class tools rather than engineering guidelines and runbooks. Let me have some additional control at what looks like to me not a great expense.

It doesn't feel "safe" to me to assume the engineer meant to do exactly this and all of the upstream systems accounted for it. I would rather the code explicitly declare this in a policy we can enforce, in an AST we can shallowly reason about.

ViewTrick1002 · 2025-11-26T18:31:23 1764181883

How deep do you go? Being forced to label any function that allocates memory with ”panic”?

Right now you all the instances where the code can panic are labeled. Grep for unwrap, panic, expect etc.

In all my years of professional Rust development I’ve never seen a potential panic pass code review without a discussion. Unless it was trivial like trying to build an invalid Regex from a static string.

echelon · 2025-11-26T18:33:56 1764182036

Malloc is fair game.

Unwrap, slice access, etc. are not.

asa400 · 2025-11-27T00:41:31 1764204091

You probably know about these, but for the benefit of folks who don't, you can forbid slice access and direct unwraps with clippy. Obviously this only lints your own code and not dependencies.

  - https://rust-lang.github.io/rust-clippy/master/#unwrap_used
  - https://rust-lang.github.io/rust-clippy/master/#indexing_slicing
  - https://rust-lang.github.io/rust-clippy/master/#string_slice

dpark · 2025-11-26T18:53:25 1764183205

So slicing is forbidden in this scheme? But not malloc?

This doesn’t seem to be a principled stance on making the language safer. It feels a bit whack-a-mole. “Unwrap is pretty easy to give up. I could live without slicing. Malloc seems hard though. I don’t want to give that up.”

echelon · 2025-11-26T18:55:47 1764183347

I posted about why this is important for distributed systems engineering:

https://news.ycombinator.com/item?id=46060907

Malloc is fine. We can and do monitor that. It's these undetectable runtime logic problems that are land mines.

In distributed systems, these can cause contagion and broad outages. Recovering can be very difficult and involve hours of complex steps across dozens of teams. Meanwhile you're losing millions, or even hundreds of billions, of dollars for you and your customers.

Someone unwrapping() a Serde wire message or incorrectly indexing a payload should not cause an entire fleet to crash. The tools should require the engineer handle these problems with language features such as Result<>.

Presently, who knows if your downstream library dependency unwrap()s under the hood?

This is a big deal and there could be a very simple and effective fix.

The Cloudflare outage was a multi-billion dollar outage. I have personally been involved in multiple hundred million dollar outages at fintechs, so forgive me for being passionate about this.

dpark · 2025-11-26T19:23:38 1764185018

I don’t actually work in Rust. I think I understand what you’re going for, though. The choice to use panic as a way of propagating errors is fundamentally problematic when it can arise from code you don’t control and potentially cannot even inspect.

I don’t necessarily agree that malloc should be okay (buggy code could try to allocate a TB of memory and OOMKiller won’t fix it) but I can understand that it’s probably workable in most cases.

Unfortunately I think the fix here would require a compatibility break.

ViewTrick1002 · 2025-11-26T18:37:28 1764182248

And now the endless bikeshedding has begun.

Thanks for making abundantly clear how such a feature wouldn’t solve a thing.

echelon · 2025-11-26T18:48:14 1764182894

https://news.ycombinator.com/item?id=46060907

Copying this so you see it too -

The Cloudflare outage was a multi-billion dollar outage. I have personally been involved in multiple hundred million dollar outages at fintechs, so forgive me for being passionate about this.

Several of the outages I've been involved in were the result of NPEs or incorrectly processing runtime data. Rust has tools to enforce safety here, but it doesn't have tools to enforce your use of them. If also doesn't have a way to safeguard you from others deciding the behavior for you.

There is potentially a very easy set of non-onerous features we could build that allow us to prevent this.

sfink · 2025-11-27T06:27:49 1764224869

Except that the outage would still have happened without that .unwrap(). So go ahead and build those features, they sound useful, but don't think that they'd save you from a failure like this.

As the poster here said, the place to build in features that would have prevented this from happening is the DB schema and queries. 5NF would be onerous overkill here, but it seems reasonable to have some degree of forced normalization for something that could affect this much.

(Requiring formal verification of everything involved here would be overkilling the overkill, otoh.)