
404 Found - luu
https://flak.tedunangst.com/post/404-Found
======
LeonM
I wrote some scrapers for a customer a few years back. After about a month I
got a call that my scraper flagged a certain URL as unreachable, but the URL
did work in a browser.

As it turned out: the webserver returned a 500 for every file, but still
served it. So the website rendered flawlessly in a browser.

I still wonder if it was just a badly configured webserver or if the owner did
this on purpose to deter scrapers and search engines.

~~~
evilpie
I wrote a simple Firefox extension mostly for personal use that finds
bookmarks that point to unreachable pages.

After publishing it on addons.mozilla.org, I almost immediately got messages
that it marked reachable websites as expired.

So instead of just checking the HTTP status code for 404, I now also check if
the page content contains strings like "404" or "not found". If it doesn't
contain those I mark the bookmark as maybe expired.

~~~
MichaelMoser123
How do you deal with languages other than English? I guess one might generate
a very unlikely URL and then take that response as 'not found'. Still wouldn't
work sometimes.
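
A rough sketch of that probe idea, in Python. The names `probe_url` and
`looks_like_soft_404` and the 0.9 threshold are made up for illustration;
fetching the two bodies is left to whatever HTTP client you already use, and
pages with lots of dynamic content would likely need a smarter comparison.

```python
import difflib
import uuid
from urllib.parse import urljoin


def probe_url(page_url):
    """Build a sibling URL that almost certainly does not exist on the site."""
    return urljoin(page_url, "probe-" + uuid.uuid4().hex)


def looks_like_soft_404(page_body, probe_body, threshold=0.9):
    """Guess whether a page is really the site's 'not found' page.

    page_body:  text returned for the bookmarked URL
    probe_body: text returned for the bogus probe URL
    threshold:  assumed similarity cut-off, tune to taste
    """
    ratio = difflib.SequenceMatcher(None, page_body, probe_body).ratio()
    return ratio >= threshold
```

Since this compares against the site's own error page, it does not care what
language the "not found" text is written in.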

~~~
dzek69
I guess it works like this:

\- 404 status with "not found" or "404" in the content: surely not found

\- 404 status without the above: "maybe" not found

\- non-404 status: bookmark is fine :)
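
Written out as code, that guess would look roughly like this. It is only a
sketch of the three cases above, not the extension's actual source;
`classify_bookmark` and the labels it returns are hypothetical names.

```python
import re

# Strings that suggest a genuine "not found" page (English only).
NOT_FOUND_HINTS = re.compile(r"404|not found", re.IGNORECASE)


def classify_bookmark(status, body):
    """Classify a bookmark from the HTTP status code and the page text."""
    if status != 404:
        return "ok"               # any non-404 status: bookmark is fine
    if NOT_FOUND_HINTS.search(body):
        return "not found"        # 404 and the page says so: surely expired
    return "maybe not found"      # 404, but the page looks like real content
```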

------
ComputerGuru
With respect to Ted, there’s virtually nothing of content in this post... I’m
confused. 404 SEO issues are nothing new, and “friendly” 404 browser
intercepts have been discussed much more coherently elsewhere. I don’t at all
mind discussions bringing up something other people might already know, but
this doesn’t really delve into anything and merely mentions the existence of
this issue. A comment from OP (or anyone upvoting this) explaining what they
found interesting here would be helpful.

~~~
mirimir
It seems like Firefox is trying to be helpful. They show the content for a
dead project ("Support the sites you love, avoid the ads you hate ..." ) with
a header explaining that "This study is no longer active. Thank you for your
participation." That's arguably much better than an enigmatic 404 error.

~~~
Liquid_Fire
> with a header explaining that "This study is no longer active. Thank you for
> your participation."

I think that header is way too easy to miss, because it occupies the same
space as the usual pointless cookie or "sign up"-type banners that many
websites show. I certainly didn't see it at first.

I suspect that the author of the linked blog post did not see that message
either, since they describe that page being a 404 as "probably a bug".

~~~
happytoexplain
I saw the banner and was equally confused - a 404 seems like the wrong
abstraction layer for that content, just intuitively.

------
Sharlin
This is a potential use case for 410 Gone: there used to be content, but not
anymore, and it is unlikely to reappear in the future, so you can cache this
response and not bother trying to fetch it again. Of course, 410 is only
appropriate if you can be fairly sure that you don't want to reuse the URI in
the future.
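
A minimal sketch of that distinction on the server side, assuming the server
keeps a list of paths that were deliberately retired. `status_for`,
`live_paths` and `retired_paths` are hypothetical names, not part of any
framework.

```python
from http import HTTPStatus


def status_for(path, live_paths, retired_paths):
    """Pick a status code for a request path.

    live_paths:    paths that currently have content
    retired_paths: paths that used to have content and will not be reused
    """
    if path in live_paths:
        return HTTPStatus.OK
    if path in retired_paths:
        # Content existed once and is not coming back: 410 Gone.
        return HTTPStatus.GONE
    # Never existed, or might exist again later: plain 404.
    return HTTPStatus.NOT_FOUND
```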

~~~
corebit
Interesting idea. It would also need to be the case that the content has not
moved either, so a 302 is not appropriate.

I don't necessarily agree that you should be certain not to reuse the URI. Why
do you think that should be the case?

~~~
Sharlin
It depends, really, but I seem to recall that at least some browsers ignore
cache headers on 410 responses and always cache them ”forever”, which is
arguably allowed by the spec.

As a cautionary example, I once was trying to be fancy and used 410 to denote
the expiration of a session-bound resource (actually an API endpoint). That
would have been fine had the resource URI been unique across sessions… but it
wasn’t, so after one session expiration some browsers naturally assumed that
the endpoint URI isn’t going to come back even after starting a new session.
Should have used 404 or 403 instead.

------
dzek69
Years ago I had a problem serving downloads with PHP. IIRC it worked just fine
in Opera (Presto-based) and showed a blank page in Firefox.

The problem was serving a 404 Not Found status with `Content-Disposition:
attachment` and the actual contents of the file. Opera had no problem with
that; Firefox was confused.

If you would like to test your browser's behavior, here is a replica of what
I did in the past:
[http://o7o.pl/down404.php](http://o7o.pl/down404.php)

My results:

\- Chromium-based browsers (tested on Chrome, Vivaldi, Opera Developer) show a
generic Chromium error with the `ERR_INVALID_RESPONSE` code.

\- Firefox displays its own page about the resource not being found

\- Edge (the non-Chromium version) removes the URL (if opened in a new tab) and
shows an infinite loader in the tab favicon, or just restores the previous URL
(if navigating to the URL from another page)

\- Old Presto-based Opera (newest from 12.x) just downloads the file

\- wget just returns 404 error

\- Internet Explorer 11 shows its own "page cannot be found" page

Can anyone test it on Safari on Mac?

I wonder which is the correct behavior? Personally I am satisfied with just
downloading the file and ignoring the 404 status.
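
For comparison, a script can be made to behave like old Opera here. This is a
sketch using Python's standard `urllib`; `fetch_ignoring_status` and
`is_attachment` are made-up helper names, and the `opener` parameter exists
only so the function can be exercised without a network.

```python
import urllib.error
import urllib.request


def fetch_ignoring_status(url, opener=urllib.request.urlopen):
    """Return (status, headers, body) even if the server sends an error status."""
    try:
        with opener(url) as resp:
            return resp.status, resp.headers, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the exception still carries the
        # status code, the headers and the response body.
        return err.code, err.headers, err.read()


def is_attachment(headers):
    """True if the response asks to be saved as a file."""
    return headers.get_content_disposition() == "attachment"
```

A caller could then write the body to disk whenever `is_attachment(headers)`
is true, regardless of the status code.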

~~~
909090ffe4
\- Safari shows a blank page, downloads it as a file, and auto-displays the
content in TextEdit

~~~
dzek69
> auto-displays the content in TextEdit

wow, that's kind of a "brave" thing to do for a file marked with the
`application/binary` mime type. its filename does have a `.txt` extension,
however

of course I have no idea how TextEdit behaves with big binary files, but such
apps on Windows/Linux can't handle them well (they usually hang forever/for a
long time)

~~~
ambentzen
Safari on macOS has an option to automatically open "safe" files, like text
files and other media.

~~~
Nextgrid
PDFs and ZIPs are also considered safe, despite the formats being very
complex and the high chance that there's an exploitable bug in there.

That's one of the very first things I disable on any Mac.

------
jancsika
Would be neat if there was a 404-blocker extension to take over the 404 "bling
space."

For example, in Firefox the extension could replace 404 bling content with
randomly chosen little nostalgic animations and concept art based on the
history of the web. It could just ship with a small collection of such stuff
without much of a footprint so there's no network hit.

Then

~~~
CiPHPerCoder
Replace all 404 errors with 404 Party

[https://www.youtube.com/watch?v=qvwwzV6ruGc](https://www.youtube.com/watch?v=qvwwzV6ruGc)

------
zaarn
You missed the chance to set the response code of the post to 404 instead of
200, I was very disappointed.

------
dzek69
the website is down :) here's a copy:
[https://webcache.googleusercontent.com/search?q=cache:yli5sn...](https://webcache.googleusercontent.com/search?q=cache:yli5sny2_k0J:https://flak.tedunangst.com/post/404-Found+&cd=1&hl=pl&ct=clnk&gl=pl)
(click on text version link if it cannot load anyway)

------
megous
Maybe it's 404, because: "This study is no longer active. Thank you for your
participation."

~~~
dzek69
As the article stated, there is a difference between opening that link with
and without a trailing slash.

Without it, the server returns exactly the same page with 200 OK (and
redirects you to the slash URL, but with JavaScript, not with a Location
header).

So I guess there is just a little server misconfiguration, mixed with a
JavaScript application that thinks about URLs the opposite way from the
server.
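
If anyone wants to check a URL for this kind of mismatch, here is a small
sketch. `slash_variants` and `compare_statuses` are hypothetical helpers, and
`get_status` stands in for whatever function you use to fetch a status code.

```python
from urllib.parse import urlsplit, urlunsplit


def slash_variants(url):
    """Return the URL without and with a trailing slash on its path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return (
        urlunsplit(parts._replace(path=path)),
        urlunsplit(parts._replace(path=path + "/")),
    )


def compare_statuses(url, get_status):
    """Map each slash variant of the URL to the status code it returns."""
    return {variant: get_status(variant) for variant in slash_variants(url)}
```

If the two variants come back with different codes, the server and the
application probably disagree about which one is canonical.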

~~~
megous
Ah, ok. Trailing slash can be a source of "fun" in webserver configurations,
yes.

