What the Fastly outage can teach us about writing error messages (onlineornot.com)
266 points by rozenmd on June 9, 2021 | 132 comments



I'm a principal at AWS and spent years dealing with this on CloudFront and Route 53. A lot right, and a little off, in the post. Yes, we should definitely communicate to the viewer what's happened and appropriate next steps if any. But there are a handful of challenges that I've not seen reasonably solved.

The first problem is that different viewers need different context, which is especially true for service providers like CDNs. Telling an end user "the page is taking too long" makes sense. But if it's the CDN customer (developer), they're going to need request IDs and other diagnostics, just like Fastly did display. Playing the "we need the RID from the headers you didn't know to capture" game is a losing proposition. The post and its "good" examples suggest a way to contact support, but there is ~0% chance support can help without additional information that isn't included.

Expanding on that, the post is making really dangerous assumptions and carrying those through to the semantics of the text. Exposing authnz failures as Access Denied vs Not Found is problematic in its own right. Going further to tell the user "You need to login first" is asking for pain in the other 10% of cases where it's a credential or authz problem. Again, different users need different context. And there's no way to know what context to provide a priori.

Lastly, it's somewhere between incredibly hard and impossible to distinguish the cause ("why") on a per-request basis, especially for service providers like CDNs. The CDN has a cache miss, but can't retrieve the content. Is it a 504 Gateway Timeout because the origin is not responding, 502 Bad Gateway because the origin TLS is broken, 502 because the local clock is off, or a 503 Service Unavailable because of an internal service timeout? Even if you can distinguish, what does the CDN represent to the end user: a general fault in the CDN? A specific failure of the origin? Or a simplistic "unavailable"? Again, different viewers need different context and semantics. Which you're not going to do at thousands or millions of TPS.


Very true, at the individual error level. That said, Fastly fixed the entire issue in less time than it would have taken AWS to update the status page to the blue diamond, so there’s that.


"Service is experiencing increased error rates" - Error rates increased from 0 to 100%


Or 50% error rates in case one of two nodes is down and you keep calling the faulty node.

But the faulty node should not be called in a well designed system.


But then you're serving 100% faulty.

Error rates (as alert thresholds and end user reporting) are a service thing, not a node thing.


No service has 0% error rate, and 100% errors doesn't even mean much without context (you never receive all possible errors at once).


Error rate = (number of requests which returned an error) / (total number of requests)

You can choose plenty of time periods to satisfy a 0% error rate for most services - somewhere from milliseconds to days or even years.
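To make the formula and the time-window point concrete, here is a minimal Python sketch. The request log and the `error_rate` helper are made up for illustration and are not taken from any real monitoring system.

```python
from typing import Iterable, Tuple


def error_rate(requests: Iterable[Tuple[float, bool]], start: float, end: float) -> float:
    """Return (failed requests) / (total requests) for start <= timestamp < end."""
    in_window = [is_error for ts, is_error in requests if start <= ts < end]
    if not in_window:
        raise ValueError("no requests in window")
    return sum(in_window) / len(in_window)


# Hypothetical log: (timestamp in seconds, whether the request failed)
log = [(0.0, False), (1.0, False), (2.0, True), (3.0, True), (4.0, False)]

print(error_rate(log, 0, 2))  # a window that happens to contain no failures
print(error_rate(log, 2, 4))  # a window that contains only failures
print(error_rate(log, 0, 5))  # the whole log
```

The same log yields very different rates depending on the window, which is why a bare percentage says little unless the period is stated alongside it.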


In large organizations, status boards are "external communications" and this must be handled by a team dedicated to that. It works about as well as you might expect.


Fastly isn’t exactly small, and their status page was updated continuously throughout the incident.


This was in response to the AWS status page.


these seem solvable.

there's no reason that a good error message AND pertinent support information are mutually exclusive. show them both.

show the errors that make sense for that action and that error. no one is saying that there should be a small list of "approved" errors that everyone would see. show what makes sense in that situation, and don't show something that doesn't help.

you don't need to have prophetic insight into the actual causes. just show the flipping errors. don't hide things behind friendly messages; show the friendly message and the error. it's insulting when an error message is hidden from me in favor of something like "oopsies! we broke something! so sorry!!" insulting and disrespectful.

there is no need for a single solution. do what's right for you, and most of all, start thinking about ways that things can work, instead of going straight to the reasons that they might not. don't talk yourself out of a good decision because it might not work for every last situation.


Showing the error itself only opens the provider to information disclosure - with the various automated infrastructure monitoring systems running on Fastly or CF they know when and where errors are happening with full traces that help them debug.

> there's no reason that a good error message AND pertinent support information are mutually exclusive. show them both

As the OP stated, there is quite literally nothing support can do for system-wide outages like this, and when you show support information on these pages, the regular users that see them end up asking for support. There is a non-zero number of posts monthly on the Cloudflare community forum from people asking about error messages they see for sites they don’t own.


You could do a mix of both: give information useful for support (trace id, error code, etc) but don't explicitly say "send an email to support@bigcorp.com / call 01 23 45 67". The people on the tech side who are the CDN's clients will know who and how to call for support and if it's an end-user they'll be able to transmit this info. Bonus points for the tech guys to figure that there's an issue on their end and that they don't need to bother the CDN support.

I also agree with GP's point that extremely vague error messages are aggravating. Maybe I did something stupid / unexpected, and with a usable error message I'll be able to fix it on my own. Like sending the wrong file format, or whatever.

I absolutely hate when a system will just say "there's an error". And having cute animals or phrases doesn't change the fact that the error is useless. This is one of the reasons why I absolutely hate working with Windows. "An error occurred. Call your sysadmin". Right. I'm the sysadmin. What now?

This also trains users to never say what's wrong when they call for support. "Yeah, there's an issue with [thing]. What issue? It doesn't work. Oh, of course. There's only exactly one failure mode, so I'll get right on it."

I work in operations and get those fairly often even from people who should know better (software engineers and such) and I find it frustrating to no end that I always have to pull information from them even when there are clear error codes. So no, hiding this kind of info serves absolutely no purpose.


> What now?

Event Viewer. (You already know, but for the benefit of whomever else.) Hiding things here stinks if you don't know there's something to see there.


Well, the problem with the event viewer is that you kinda have to know what you're looking for beforehand.

I rarely use windows, so I don't know the conventions (if there are any), but I remember one case of someone attempting a remote desktop connection that had an error along the lines of what I said: "Cannot connect". I've never found where the logs for the RDP client were. For servers, we dump the logs in Elasticsearch, so a naïve query will at least set me on the right track.

I'm also not a particularly patient person, so between the unbearable slowness of the event viewer's search function and the fact that I never know if I'm looking for Thing, Microsoft Thing, Windows Thing, or ThngSvc, I usually give up in frustration.

And, again, had I had an event id or something, the Find in the event viewer might have been more helpful.


> Showing the error itself only opens the provider to information disclosure

No offence but as an end user I really despise this attitude. Sure there might be automatic monitoring at a place like Fastly, but even they can use help in tracking down the problem. Also, if the error is distinct people can put it into Google and call on the vast power of the Internet to help figure out how to fix it or work around it, especially on smaller services where a fix may or may not be forthcoming anytime soon. A good error message leads to a Stackexchange page leads to a solution. A vague error leads to a support call with a bewildered frontline tech and a lot of work for some sorry engineer who has to dig through log files.


It's a matter of information security. Error messages potentially disclose sensitive information on the internal workings of the service. Disclosing that information is a potential security vulnerability.

Also note that SE went down yesterday


I would counter that your service is not as special as your security guys think it is. The paranoia over "Oh no! Those dastardly hackers will discover that we use PostgreSQL on our backend! Our secret sauce will be spread all across the Internet!" is usually overblown.

Obviously this does require your developers not to be complete idiots by putting their passwords in the error messages or something, but this is an extremely easy bar to hurdle.

At the very least you can put an error code up. Just make sure it is a reasonably long string so you don't have collisions. If you are big enough people will work out what the codes mean and what they can do. For example, I know 0x80D02017 on Windows means Microsoft has broken their IPv6 service endpoint for the Windows Store again, and you can temporarily disable IPv6 support to work around it. Even though the error message is the monumentally unhelpful "Unknown", the Internet can come to the rescue. Of course if the error message had been something like "Windows Store update connection to address 2603:1061::9f4d failed: Service responded with protocol version 4, this client only supports version 5" it would have been even better.

Would it disclose data about Microsoft internals? Maybe a little, but most of that was observable anyway, and if you had error messages like this maybe it wouldn't take them 8 goddamn months to find and fix the problem?


Hey,

Fair point, I mainly picked the first contrived example that came into mind, rather than thinking long and hard about what the "correct" error message should be.

Will update the article to clarify that.


Cheers. Didn't mean to be too contrarian here or pick apart the broader message, which I absolutely agree with. More of an expression of my personal experience struggling with the same problem. As unsatisfying as it is, I suspect you settle for getting it right 95% of the time and including enough details that the user can self-service, or contact support, in the other cases.


Agreed.


I created that particular error-message in varnish cache 15 years ago.

Max has many good points about error messages in general, but all of them require access to out of band information.

We in the Varnish Cache Project do not have access to that information: we don't know who runs the Varnish instance or what kind of information they serve to what kind of clients.

This is why the default 503 message only exposes the "XID" nonce: That allows the administrator of this cache instance to find all the details in the log files.

Varnish Cache users who want to present something else can do that from VCL, and I'm pretty sure Fastly normally does.

But when all else fails, and here it must have, Varnish Cache errs on the side of caution.


I've taken the opportunity to write up the story of the first time Varnish Cache's 503 was heard around the world, or at least in Norway:

http://varnish-cache.org/docs/trunk/phk/503aroundtheworld.ht...


Reminds me of the quintessential tweet about error messages

https://twitter.com/cherrikissu/status/972524442600558594?s=...


Right up there with Slack etc.'s fake loading messages

"Gearing up the dildonator"

"Implicating the fairies"

"Hogtying George Bush"

Dude - just give me a spinner or a progress bar, and if something errors during the load out give me some sort of stack trace or error ID I can use to help


The one I especially hated: "You look nice today"

1. "Nice" is an adjective with nearly zero meaning. 2. Either you have access to my camera outside of calls and are analysing my appearance (WTF) or you're making stuff up. 3. You're making chat software. Stop trying to butter me up.


Nice reply, you’re on fire today!

No, really, you should find a fire extinguisher.


You sound British.


What's the first instance of joke progress messages? Is it "reticulating splines" from SimCity 2000 (1993)? Difficulty: must be presented to the user as if it were an actual status message during some process that might have real status messages, not as something explicitly fictional—despite occurring in a game, that message in SC2K qualifies.


Ah interesting, I was going to ask the same, but the earliest I knew of was The Sims (2000).

(Actually I only now learn that they're related in series - always thought they were competitors.)


ProductBoard likes to tell me:

> Without good products, life would be a mistake.

> Make products that matter.

> Product excellence isn't a point in time. It is a state of mind, it is a way of life.

> So many feature requests, so little time.

Maybe they're being silly? I hope so. But I've also met product managers and designers who seem to think like that. It makes me internally sigh every time I have to use their product.


They should at least be reticulating splines.


Once I was at /r/androidapps and someone made a new app, asked for feedback, and «experienced devs» told OP something like «instead of „fetching data“ you should be using „getting everything ready“ because it might scare users to think that app is collecting their data».

Are people really that dumb? I'm afraid to use apps made by those people.


We have to remember the median phone user isn't usually tech savvy. It's an average Joe or Stacy who uses that device to communicate with friends and family, use Facebook, watch Netflix, and buy from Amazon, and that isn't something bad. We just have to keep in mind that most people don't know tech-related terms and can confuse them.

Maybe if we want to be a bit funnier we can write "changing the brake pads" on the loading screen, but in case of an error we have to write something clearer, like "error accessing the game server" or "error reading car data files", so the problem can be solved.


Well, Discord has fake loading messages too, and I think they're funny. Though it's a bit different since Discord is not aimed at professionals; the geeky jokes match better with the gamer audience.


For purely cosmetic marketing sites I tend to enjoy it


With a small group of Corporate Memphis people frowning or looking bewildered.

Give me “Unspecified error” any day over that.


One of the reasonable reasons why errors often get returned generically is because of security though.


Exactly. Usually the real error message behind the “oopsie” is a stack trace containing sensitive information and environment variables.

It is no use to the user since they can’t do anything and is actually dangerous to give out.


AWS does exceptionally well on this question: some IAM-related (permission management) API error messages contain encrypted details, so they do not leak sensitive data, but support can decrypt and analyze them if the user pastes them back.


An alternative is just showing the request ID so support can look up the error in logging
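A minimal Python sketch of that idea, under assumptions: the `render_error` helper, the logger name, and the wording of the user-facing text are all invented for illustration. The full exception goes to the server-side log keyed by a random request ID, and only the ID reaches the user.

```python
import logging
import uuid
from typing import Tuple

logger = logging.getLogger("errors")  # hypothetical logger name


def render_error(exc: Exception) -> Tuple[str, str]:
    """Log the full error server-side; return (request_id, user-facing text)."""
    request_id = uuid.uuid4().hex
    # The sensitive details stay in the log, keyed by the request ID.
    logger.error("request_id=%s error=%r", request_id, exc)
    body = (
        "Something went wrong on our side. "
        f"If you contact support, please quote request ID {request_id}."
    )
    return request_id, body


rid, text = render_error(RuntimeError("db password rejected for user admin"))
print(text)
```

Support can then search the logs for the ID instead of asking the user to reproduce the failure.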


I don't mean to sound frustrated, because this is a volunteer open-source project, not VC-backed corporate software, but whenever the Emacs package flycheck reports an error message, I find the fact that it says (paraphrased) "Please include these details when filing your error report. Thanks!" to be a bit condescending.

It's just something about being thanked in advance for an action I did not intend to do that irks me. It implies to me that the reward for doing it has been given to me without my consent, and now I'm obligated to follow up on my part to prove I deserve it. It ultimately makes me less likely to file a bug report, stemming from this unease.

It's a really minor thing, and definitely not worth getting angry over, but for some reason I always remember it whenever a discussion about out-of-touch error messages happens.


Relatedly, when I opened that link I got "something went wrong. Try reloading". I happen to know that's usually rate limiting with Twitter (since Firefox is suspicious, maybe?), so reloading is exactly the wrong thing to do, but switching between the mobile and desktop sites seems to work (maybe they have different quotas?).


I've had more than one intern or young engineer that I've had to teach to resist the urge to put witty comments and jokes in project docs, comments, or errors. Especially in error messages or things that can be exposed to users/customers. Even if it's an error you think only the dev team will see, you never know if it will make its way out to audiences you didn't intend. Somehow I doubt people faced with this error message really were amused by the reference to the old Amiga OS errors....


I'd say that some safe joke on the end of your error message is perfectly fine, as long as you are perfectly clear where the problem is. Just be very sure not to unintentionally blame the user. It is fine even if the user won't get the joke.

Software is a highly personal and creative thing, it should have a personality. The everything gray enterprise spaces have gone too far already. Also, I really miss Linux yelling on panics.

Now, the same joke on a highly visible position or on the beginning of the message is harmful and will impede people from solving the problem.


> Software is a highly personal and creative thing, it should have a personality

No it isn't and it shouldn't. The process of creating the software is; huge distinction.

The product of your efforts should not have a personality or feel personal, it should just work as intended. Software is hard as it is and we don't need to make it more whimsical. A little experience can teach us, that whether we like or not, it will ultimately exhibit its own whims anyway.


> The product of your efforts should not have a personality or feel personal, it should just work as intended.

Genuine question: why can't it have both? I know it's hard to convey tone on the web but I'm asking the question because I'm genuinely interested in knowing what you think.

I personally think you can create software the right way, ship something that works and still incorporate some personality and make it less boring. I don't see why the two can't live happily together.


Because we haven't solved the main problem yet, which is to create robust software that does exactly what it's supposed to do, no more no less. We have opinions, we have some indications of what may improve software development, but we are far away from comparing software engineering with other engineering domains. And it's only logical, because compared to other disciplines, software engineering is in its first baby steps.

Imagine if construction/aviation/etc engineers wanted their building/bridge/airplane to be whimsical and have its own personality. Are you scared yet?

Imagine if your out-of-band(not in the initial requirements/spec) and whimsical software contribution was responsible for a bug that brought down an airplane, or killed a patient. How whimsical would you be then? Well at least you wouldn't feel bored at your day job right? Anyway, I think you get my point.


> Imagine if construction/aviation/etc engineers wanted their building/bridge/airplane to be whimsical and have its own personality. Are you scared yet?

Are you aware that those things go through a design phase with the explicit objective of giving them a personality, right?

Mechanical engineers have a habit of breaking that personality due to their profession's constraints, so most airplanes lose the original ones, but bridges usually are built just as intended.

Anyway, it's not like you can avoid giving your software a personality. You can't. What you can decide is if it will behave like a dull humorless thing, a holier-than-thou, all-knowing braggart, or something people like having around. And yes, some software should have those two first options too, it depends on their application.


> Are you aware that those things go through a design phase with the explicit objective of giving them a personality, right?

Not really. I see quite the opposite; during the design phase the team sets the rules in order to guarantee consistency and cohesiveness and avoid any deviation from what has been agreed upon. This rules out personal or whimsical contributions, because by definition it would ruin the process.

> Anyway, it's not like you can avoid giving your software a personality. You can't.

I alluded to that, if you re-read my comment, but a software having its unintended whims, compared to intentionally trying to give it some "personality" is not the same thing at all.


Teams? Rules?

The bridge architects you are talking about behave very differently from the ones I've met.


I haven't met that many bridge architects :D

But do they build bridges by themselves? Are bridges built by a one man show?

And they don't have rules? And how do they get anything done?


Are you sure you did read my previous comment? You are replying to stuff that isn't there.

About rules, no, nobody passes rules down the stream. People communicate full designs of some issue (that is not the full design of the thing, designs are "sectorial" where people add their concerns into the overall thing). When it's done right, the design goes to and from those sectors changing the entire time. When it's done badly, somebody finishes a "general" design and sends it downstream for people to fill the other parts. A team does not work on the same issue, that would be chaos.


> Somehow I doubt people faced with this error message really were amused by the reference to the old Amiga OS errors....

I'll bet whoever put that in there feels extra silly about misspelling "meditation", too.


It looks like the spelling was fixed a while ago; does that mean Fastly is running an old version of Varnish?


Fastly changed it on purpose so they could distinguish between error messages from their Varnish and error messages from a customer-owned origin Varnish.




> Unfortunately the majority of internet users aren't trained in the art of reading HTTP status codes

I think the majority know what 404 is, and possibly 403, but I agree about the more obscure ones.

That said, I don't think it's a bad idea to rely on the "default exception handling behaviour" that the majority of users, even non-computer-literate ones, will have: they'll retry a few times, see that it doesn't work, and go elsewhere for a while.


If we keep dumbing down the world for everyone instead of teaching those who don't know we will destroy ourselves. People are not stupid and if we keep assuming it we are doing a disservice to all.

No more www, no more protocol in the address bar and Apple is selling iMac colors in its commercials...


How will it destroy us?

Do you think we should use hex or binary instead of their ASCII or UTF-8 equivalents? After all, ASCII must be dumbing down as it makes it easier for non-specialists to read.


I am not saying that there shouldn't be some abstraction. However I think we are at a point where one additional customer at any cost results in products becoming so simplistic that they become useless for people who actually need them for work etc.

Not having people learning new things and learning to approach things with some logic results in them never having to do this in their lives. In the end you get a dumbed down public voting on their emotions and against their interests.


... and blending the search with the address bar.

I spent a good hour explaining to a tech illiterate why typing "mywebsite.com" was different from typing "mywebsite com" and picking the first result on Google.

I'm not sure he understood, but I really have to admit that this trend of dumbing down things is only making them worse, in a way.


They'll understand when they try and buy airline tickets that end up going through some third party with a 'service charge'.

One of my friends has recently learnt that lesson when she went to buy flights on Ryanair; just typing that into Google and clicking the first link. £30 service charge.


you vastly overestimate the majority.


I don’t remember off the top of my head what 506 is; that’s what DDG is for.

When Fastly was broken and telling me that London was broken (lon3356 or something), that told me I could reroute via Cleveland and have a chance of it working. It also made me comfortable it was a CDN error rather than a site error.

That’s far better than “oops something went wrong, we’re trying to fix it”


I have a particular dislike for Apple over this - I think they set a trend of unhelpful error messages tied to obscure codes back in the 1990s.

Sadly they've been joined by Google. 'Something went wrong' could have come from the Sirius Cybernetics Corporation.


Obscure but unique and googlable error codes are great. At a previous job, we had a large project to go through the codebase and tag every single user facing error with one, then publicly document them with likely causes and solutions. Support volume dipped noticeably.


Google is unfortunately getting worse for such things. Search for some error code and you get pages of spam and those that don't contain the code you're looking for, but a "very similar" one. Even one digit off makes a huge difference.


Your plastic pal is busy remembering how to be fun! Come back later, and go stick your head in a pig.


Wow, the internet. It’s like my meetings where everyone feels entitled to expressing their opinion, splitting hairs on the most tangential aspects of the issue.

Thank god for coffee, and nice views out of the window


10 of our clients called and asked why they are paying us if the sites are not working today. I think a proper 503 browser message could help us a little?


Uh, really? I bet Prometheus would do a better job


Have they published a post-mortem for the outage yet? I'm curious as to what happened here


Here it is https://www.fastly.com/blog/summary-of-june-8-outage

Posted about 17 hours after the incident.

In short, a valid customer configuration change triggered a bug. One thing I don't see in this writeup is a commitment to ensure that customer configurations cannot break the whole system. Cloudflare does seem to make this promise with their zero trust architecture,

https://www.cloudflare.com/learning/security/glossary/what-i...


Zero trust has little to do with individual clients and more to do with the idea of the internal "corp" network not being trusted. That is, there isn't a conventional network you VPN into at which point all traffic is trusted.

The generic topic you're looking for is probably something like "customer isolation" ("service isolation" might also be relevant, but is used also in the context of "tenant isolation" which isn't really what you want). See this thread: https://news.ycombinator.com/item?id=25237836 for some talk about how AWS does "cellularization" which is a form of workload/service isolation/partitioning.

In general I don't think there's much discussion of this issue on the wider web.


Zero Trust is not really applicable here. ZT is about not implicitly allowing someone elevated access just because they have access to network and requiring explicit grant based on the identity and other rules.

Fastly's downtime seems to be caused by an automatically generated config that got deployed in production as a result of change requested by a legitimate customer.

This happened to CF in its early days and I really doubt that ZT had anything to do with the fact that they do not have this kind of problem anymore. It's probably some sanity checks before they deploy updated lua scripts to their fleet of nginx's if anything.


They wrote that they had to reinstall Workbench.

Isn't that too early for a post mortem? I didn't experience it myself, but I think it happened at most a few hours ago.


Good error messages are hard. There’s so many things that can go wrong, and each needs its own custom explanation. Never really found a good way to organize this. And since errors should be uncommon, it feels like a waste of time coming up with thoughtful messages.


The problem is that you have to know exactly what has gone wrong to show a proper message. But if you know what is wrong, most of the time the correct action is to fix the program so it does not go wrong instead of showing a better error.

The main exception is validation of user-submitted data, where it is up to the user to submit correct data. For the average web service, the error is a stack trace and there is nothing the user can do, so a blank screen with a "something went wrong, we have logged this error" is the only real option.


Even if there’s usually nothing the user can do I still like to have all the stack traces, etc visible - because sometimes there’s enough of a hint that you can get a workaround.

A number of DNS failures I’ve worked around with the hosts file.


You've worked around server-side DNS failures?


Yes - often the DNS issue is unrelated to the server itself - and sometimes a quick lookup shows an obvious cname error - or shows a round-robin dns where the response you have is failing but the others work.

Or if the host name has .eu.domain or something else indicating a geographic location changing that can get around a localized failure sometimes.


Good error messages have a template: a UUID, a description, and a suggested solution. This gives folks enough to try to solve it themselves and also enough to Google / grep with.
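One way to read that template, sketched in Python. This assumes the UUID identifies the kind of error (so it is stable and searchable) rather than a single occurrence; the class name, the UUID value, and the message text are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorMessage:
    error_id: str     # fixed identifier for this kind of error, greppable in code and docs
    description: str  # what happened, in plain language
    suggestion: str   # something the person reading it can actually do

    def render(self) -> str:
        return (
            f"[{self.error_id}] {self.description}\n"
            f"What you can try: {self.suggestion}"
        )


# Hypothetical catalog entry
UPLOAD_WRONG_FORMAT = ErrorMessage(
    error_id="3f2b8c1e-9d4a-4e6f-8a7b-5c0d1e2f3a4b",
    description="The uploaded file is not a supported image format.",
    suggestion="Convert the file to PNG or JPEG and upload it again.",
)

print(UPLOAD_WRONG_FORMAT.render())
```

Keeping the entries as named constants means the same ID appears in the source, the logs, and any public documentation.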


You can’t suggest a solution because if you were aware of the problem enough to suggest a solution you would just fix the problem. This 503 error from fastly was never meant to show to a user. You can’t do much to account for novel situations which shouldn’t ever happen.


A solution that the end-user can take. It's not unreasonable to ask user to check that their URL is correct for 404s, for instance. For 5xx errors, especially 502s and 503s, it's still helpful to tell users that there's nothing much they can do (apart from maybe monitoring the status page).


Without telling malicious actors how to break your systems further.


But who's going to host those nice unicorn images? And that nice CSS? I agree with putting up a link tho.


Embed both in the page.


I suspect most folks, especially ones which operate at GB/s rates serving traffic would not like to serve up a 100kb base64 encoded image every time an error is thrown.


Browsers send the Accept headers, and I have configured my CDN to return a plain text page for static assets (that users wouldn't see on the browser), but text/html requests get a useful 10-20 kb response. Should be a trivial thing to setup.
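A rough Python sketch of that kind of content negotiation, not the commenter's actual CDN configuration. The helper names and response text are hypothetical, and the parsing is deliberately simplified: it only looks for an explicit `text/html` entry with a non-zero q-value and ignores wildcards such as `*/*`, so asset requests fall through to the plain-text branch.

```python
from typing import Tuple


def wants_html(accept_header: str) -> bool:
    """True if the Accept header explicitly lists text/html with q > 0."""
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if fields[0].lower() != "text/html":
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:
            return True
    return False


def error_response(status: int, accept_header: str) -> Tuple[str, str]:
    """Return (content type, body) for an error page."""
    if wants_html(accept_header):
        body = (
            f"<!doctype html><title>Error {status}</title>"
            f"<h1>Error {status}</h1><p>The site is temporarily unavailable.</p>"
        )
        return "text/html; charset=utf-8", body
    # Images, scripts, and API clients get a tiny plain-text body.
    return "text/plain; charset=utf-8", f"{status} error\n"


page = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
image = "image/avif,image/webp,*/*"
print(error_response(503, page)[0])
print(error_response(503, image)[0])
```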


Sure, which is why they probably wouldn’t serve up a 100kb base64 encoded image every time an error is thrown.

It’s more than possible to embed a small vector image to add some humanity to an error page without breaking the bank, bandwidth-wise


Why not? The non-error page was likely much larger.


Or just cache the image with a major CDN, like Fastly. No wait...


We create separate 404 and 500 error pages. Question is - should we do this for most status codes?


Just today I had some code that helpfully said "should never get here" and aborted. It left us to figure out what the possible options of how we could end up in that spot "we should never get to". It was maddening because it's ~12 year old code that had some library updates. A little bit more help like the error code or really *anything* would've been helpful.


A couple of years ago, I redid the error code definitions in an internal C++ library. The error returns were of the form:

  -__LINE__
which, if you know C++, means your error code is just the negation of the current line number. That's super convenient when writing. It's really annoying when someone actually sees such an error code and emails you about it -- because they obviously can't do anything with it in software. The obvious problem is that the semantics of an error code depend on the software revision they built with, but you also need to figure out which source file has an error return at that line that the user could have reached at their context.

The new error codes are almost as ergonomic to write (involving a little splash of code generation to make it so, not ideal but worth it). However, they can actually be handled in software and there's a functional perror-equivalent.
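
A rough illustration of the difference, sketched in Python for brevity rather than in the C++ of the library described above. The code names, values and messages here are hypothetical, not the actual definitions. The point is that each code keeps one meaning across revisions and can be turned into text by a perror-style lookup.

```python
from enum import IntEnum


class ErrorCode(IntEnum):
    """Hypothetical stable error codes; values never change between releases."""
    OK = 0
    INVALID_ARGUMENT = -1
    OUT_OF_MEMORY = -2
    IO_FAILURE = -3


_MESSAGES = {
    ErrorCode.OK: "success",
    ErrorCode.INVALID_ARGUMENT: "invalid argument",
    ErrorCode.OUT_OF_MEMORY: "out of memory",
    ErrorCode.IO_FAILURE: "I/O failure",
}


def describe_error(code: int) -> str:
    """perror-style lookup: map a returned code to a human-readable message."""
    try:
        return _MESSAGES[ErrorCode(code)]
    except ValueError:
        # Codes outside the enum (e.g. from a newer release) stay reportable.
        return f"unknown error {code}"
```

Callers can branch on `ErrorCode.IO_FAILURE` directly, which a negated line number never allowed.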


So that looks like a stock Varnish 503. I'm pretty sure the magic behind Fastly is a ton of Varnish cache. What's also interesting (VCL can be unforgiving) is that it was a big central stuff-up. Which makes me think they have layers and layers of caching in depth and some top-level config just blacked out everything. I kind of would have thought they'd have lots of small instances.


Fastly is indeed powered by a lot of Varnish magic. My company used (and still uses) them as the CDN for some of our large website frontends, and they have engineers who will help generate the insanely convoluted and large VCL templates which can be set up in your account and will handle all of the magic routing / caching / much more complex things you need for your origin.


Yeah I've used Varnish before. I like it a lot, but I do think coloring inside the lines as much as possible is probably the safest thing. Some of the VMODs and more exciting things you can do with Varnish become really tricky to manage.


> So that looks like a stock Varnish 503

Looks like it, except that Varnish has it spelled “Guru Meditation”, not “Guru Mediation”. Anyone know why that would be?

https://github.com/varnishcache/varnish-cache/search?q=medit...


You've found a kind of "trap street". This disambiguates the error message generated by Fastly from the error message generated by an origin which happens to use Varnish as well.

Edit: As to why Fastly wants to expose the relatively stock error message instead of custom text, I don't know. My guess would be Fastly (or their customers) have tooling built around parsing the HTTP response and they're preserving compatibility.


https://twitter.com/dormando/status/1402466173778677764

my fault. I would sometimes monitor for "guru mediations" popping up to tell if we were throwing errors without it being caught by other systems. Among other reasons.


This question was already answered in the Fastly outage thread.

It was to distinguish Fastly's Varnish from a customer's Varnish.

https://news.ycombinator.com/item?id=27433139


Try searching an older branch manually. GitHub only searches the main branch.


This is not true

Edit: trying to reverse engineer the “magic”, architecture, and the failure from a varnish error message is folly, and misleads others. How do I know the comment is patently false? I ran those teams at Fastly for 3 years.


Alternately, document your 'magic architecture' in public so that people don't have to offend you or other employees of the firm by making incorrect assumptions based on the little you do show in public.

Also helps others pick up if you do go out of business or end up acquired, and before you do it even leads to a more level playing field.

However I guess a level playing field wouldn't have given you a NYSE listing... so I guess you're just being selfish?


Most outages are config errors. Code errors would be detected before they broke everything.


More in reply to the comments than the original article.... In my experience no developer ever actually imagines that their code could possibly have bugs. Yes yes, they will say that of course all software has bugs, but deep down they know that that doesn't mean their code, just others' code. Turn all that code into products and that means when something goes wrong it isn't the product's fault. Therefore it must be the user's fault.

Error messages can't possibly explain the problem, because the product can't know what the dumb user did. So don't bother really trying.

This from the perspective of someone doing customer support.


I'm a believer in not hiding things and logging the whole problem all the way to the end consumer:

http://test.rupy.se/?id=2

Because when you develop things you are the end consumer.


This is considered a critical security vulnerability. [0]

[0] https://owasp.org/www-project-top-ten/2017/A6_2017-Security_...


Not if it's open-source.

Security by obfuscation is generally not a good option if you can avoid it.


If there are other vulnerabilities present, stack traces can be forced to dump all sorts of data like env variables and network information and maybe someone else's personal information.

I strongly urge everyone to hide their stack traces in production. This will reduce your application's attack surface.
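
One way to do that without losing the debugging information is to log the full trace server-side and hand the user only an opaque ID. A minimal sketch, assuming a hypothetical `safe_error_response` helper and a made-up payload shape; only the standard-library `logging`, `traceback` and `uuid` calls are real.

```python
import logging
import traceback
import uuid

logger = logging.getLogger("app.errors")


def safe_error_response(exc: BaseException, debug: bool = False) -> dict:
    """Build a client-facing error payload without leaking the traceback."""
    incident_id = uuid.uuid4().hex
    details = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    # The full traceback goes to the server log, keyed by the incident ID.
    logger.error("incident %s\n%s", incident_id, details)

    payload = {
        "status": 500,
        "message": "Internal error. Quote the incident ID when contacting support.",
        "incident_id": incident_id,
    }
    if debug:
        # Only for local development: expose the traceback to the caller.
        payload["traceback"] = details
    return payload
```

Support can then look up the incident ID in the logs, so the user never needs to see or forward the trace itself.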


>Security by obfuscation is generally not a good option if you can avoid it.

Of course, but obscurity increases security


Also, consumers may not understand backtraces but they generally understand how to Google them or ask other people what they mean, or tell you about them.

Technical errors are way better than just "I'm sorry we couldn't process that right now."


This is probably a dumb question, but isn't a backtrace likely to reveal some sensitive information from a security perspective?


Possibly, though I would say that is a little paranoid.

But anyway I was really using "backtrace" as a synonym for "technical details that users don't understand". Not very clear, sorry!


Ahhhh. I’ve spent countless hours debugging “Guru Meditation” errors from FreeRTOS on my ESP32 dev boards. I always wondered what the heck that meant. Turns out I was just not in on the joke.


> I’ve spent countless hours debugging

You are the guru. And you are meditating on the problem.


The problem with writing user friendly error messages is in a lot of cases you have no idea what context they'll show in. I mean, do you think they could have predicted yesterday's particular problem at Fastly?

Better to give some plain debug info than tell the user "have you tried turning it off and on again?" in my opinion.


My preferred format is simply what & why, which is the second piece of information in the list in the article.

For example: Cannot serve website. (What?) Reason: could not connect to database. (Why?)

Most of the time, it is very easy to programmatically assemble such messages. It is much harder to automatically figure out who caused it and when it will be fixed.
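
A small sketch of assembling such a message from chained exceptions in Python. `WhatWhyError`, `describe`, `connect_to_database` and `serve_website` are illustrative names, not from any library; the chaining itself uses the standard `raise ... from ...` mechanism.

```python
class WhatWhyError(Exception):
    """States what failed; the cause (why) is chained with `raise ... from`."""


def describe(exc: BaseException) -> str:
    """Render 'What. Reason: why.' by walking the explicit __cause__ chain."""
    parts = [str(exc)]
    cause = exc.__cause__
    while cause is not None:
        parts.append(str(cause))
        cause = cause.__cause__
    what, reasons = parts[0], parts[1:]
    if not reasons:
        return f"{what}."
    return f"{what}. Reason: " + ": ".join(reasons) + "."


def connect_to_database():
    # Hypothetical stand-in for a real connection attempt that fails.
    raise ConnectionError("could not connect to database")


def serve_website():
    try:
        connect_to_database()
    except ConnectionError as err:
        # Each layer adds only its own "what"; the lower layer supplies the "why".
        raise WhatWhyError("Cannot serve website") from err


try:
    serve_website()
except WhatWhyError as err:
    print(describe(err))
    # prints: Cannot serve website. Reason: could not connect to database.
```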


It seems to me one of these articles pops up every time there's an outage related to Varnish. There are plenty of poor error messages out there but "Guru Mediation" really grinds folks' gears.


Isn't it time we had more error messages? I feel like we got 40x, 50x and the pioneers left lots of space for more.

I quite liked the Windows 0x800*** hex error numbers, though some were more useful than others.


The old, original HTTP error codes are useful because they are standardized in their meaning (even if you have to beat the difference between 404 and 410 into an SEO Consultant's head with a bat).

New error codes are only useful if they are generally understood. Maybe instead of using more codes for sub-use-cases, use the permissible error text to express these (HTTP-Status: 503 0x63F0 Data corrupt ?)
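
A sketch of that idea: keep the standard status code and put a sub-code in the reason phrase. The `0x63F0` value and the function names are hypothetical, and one caveat applies: HTTP/2 does not carry a reason phrase at all, so this only survives on HTTP/1.1 connections.

```python
import re
from typing import Optional, Tuple


def format_status_line(status: int, sub_code: int, text: str) -> str:
    """Build an HTTP/1.1 status line whose reason phrase carries a sub-code."""
    return f"HTTP/1.1 {status} 0x{sub_code:04X} {text}"


# Optional "0x...." prefix in the reason phrase, followed by free text.
_STATUS_RE = re.compile(r"^HTTP/1\.1 (\d{3}) (?:0x([0-9A-Fa-f]+) )?(.*)$")


def parse_status_line(line: str) -> Tuple[int, Optional[int], str]:
    """Return (status, sub_code or None, text); plain reason phrases still parse."""
    match = _STATUS_RE.match(line)
    if match is None:
        raise ValueError(f"not an HTTP/1.1 status line: {line!r}")
    status, sub_code, text = match.groups()
    return int(status), (int(sub_code, 16) if sub_code else None), text
```

Clients that only understand the standard code still see a plain 503, while tooling that knows the convention can extract the sub-code.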


A while back, in a personal project, I made it mandatory for every error-logging call to include the suggested fix along with the error message.

I suppose no one does it.


Also try to translate the messages into the user's language when that is feasible. The native language for the majority of internet users is not English.


The Fastly outage can teach us to stop relying entirely on Fastly, AWS, Cloudflare, and so on. Diversify your deployment setup!


This way, if any one of them has problems, you get incident management practice. Win win!


Bullsh*t! 99% of users are only interested in accessing the content they requested. It either works, or it doesn't.


I tend to agree. But I also think that bad error messages are what got us there in the first place. People just click OK on error messages without reading them, because they have learned that most of the time it doesn't really matter whether they read them.


A great example of content marketing.


I would consider a cutesy rainbow unicorn to be unnecessary.



