
Firefox’s latest Test Pilot: No More 404s - onion2k
http://www.ghacks.net/2016/08/03/firefoxs-latest-test-pilot-no-more-404s/
======
zeveb
An interesting idea, but I'm concerned about the privacy implications. Does it
send all 404 URLs to the Wayback Machine to see if it has a copy? That reveals
one's browsing habits without one affirmatively doing anything. Does it strip
URL query parameters? These are commonly used for session IDs and auth tokens.
Does it strip URL path parameters? Conceivably these might have private
information in them as well. For that matter, paths themselves may contain
private information.

If all it does is offer a link to the Wayback Machine, then it sounds great.
But I worry that folks might want it to do more, without realising what that
might mean.

~~~
jgruen
The No More 404s experiment, like all of Test Pilot, is totally opt-in. The
whole idea is to let us try things with Firefox, get feedback and iterate
quickly.

The No More 404s telemetry ping only gathers data about how often the add-on
is triggered and how often it is clicked. We (Mozilla) don't know anything
about URLs. The Wayback Machine has its own privacy protections in place that
you can learn about here: [https://blog.archive.org/2013/10/25/reader-privacy-
at-the-internet-archive/](https://blog.archive.org/2013/10/25/reader-privacy-
at-the-internet-archive/)

Also worth noting: every Test Pilot experiment comes with a brief explainer of
all its data collection. Here's the explainer for No More 404s:

In addition to the data collected by all Test Pilot experiments, here are the
key things you should know about what is happening when you use No More 404s:

* We collect basic usage on how many times you encounter a Page Not Found error (code 404), how many times a cached version of that page exists from Archive.org, and how many times you choose to view the cached version.

* To provide cached versions of pages, we send 404 error page URLs to Archive.org. Archive.org discloses its privacy policy here ([https://archive.org/about/terms.php](https://archive.org/about/terms.php)).

* We do not collect URLs of the pages you request or the URLs we send to Archive.org.

* We may share survey results you submit to us and aggregated telemetry data related to this experiment with the Internet Archive.

~~~
zeveb
> To provide cached versions of pages, we send 404 error page URLs to
> Archive.org.

Why not only do that on a user action, instead of on every single 404?

~~~
rcthompson
I think that's exactly what the text you quoted is saying. When you actually
request the cached page, it sends the URL to Archive.org.

~~~
zeveb
It could also legitimately mean that they send _every_ 404 page URL to
archive.org — that's why I'm asking for clarification.

------
sp332
[http://archive.org/donate/](http://archive.org/donate/) They're a non-profit,
it's tax-deductible!

The Archive is a lot bigger than just the Wayback Machine. Its collections
include books, audio, video, and software.

~~~
whatever_dude
PS. Please use [https://archive.org/donate/](https://archive.org/donate/)
instead if you're going to submit anything like credit card numbers.

~~~
tonmoy
Although I found the comment funny, the PayPal button sends the data to
[https://www.paypal.com/..](https://www.paypal.com/..), so wouldn't http be
just as safe? (As long as the MITM doesn't serve an altered version of the
page that points at the attacker's site instead of PayPal's, dressed up to
look exactly like PayPal, I guess.)

~~~
qu4z-2
Yeah, so your details are protected from passive wiretapping/MitM-ing, but
not from any active attempts (see sslstrip).

~~~
icebraining
Paypal is on the HSTS preload list, so if you're using a reasonably recent
browser version, sslstrip (without changing the domain) shouldn't work.
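
For background, HSTS works via the `Strict-Transport-Security` response header, and preloading additionally ships the domain inside the browser, so even the very first request is upgraded to HTTPS and sslstrip never sees a plain-HTTP request to downgrade. A small illustrative parser for the header's directive syntax (the helper name is mine, not from any library):

```python
def parse_hsts(header):
    """Parse a Strict-Transport-Security header value (RFC 6797 syntax).

    Directives are semicolon-separated; max-age carries a value, while
    includeSubDomains and preload are bare flags.
    """
    policy = {}
    for directive in header.split(";"):
        directive = directive.strip()
        if not directive:
            continue
        name, _, value = directive.partition("=")
        name = name.strip().lower()
        if name == "max-age":
            policy["max_age"] = int(value.strip().strip('"'))
        else:
            policy[name] = True
    return policy


# A policy roughly like PayPal's: a long max-age, subdomains, and the
# preload flag that makes the domain eligible for browsers' built-in lists.
print(parse_hsts("max-age=63072000; includeSubDomains; preload"))
```

The `preload` directive only signals eligibility; actual first-visit protection comes from the domain being baked into the browser's shipped list.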

~~~
qu4z-2
Ah, I didn't know this. Thanks.

------
wvh
The title makes it sound as if this is the default, an unconditional redirect,
which appears not to be the case; it seems to offer an optional redirect,
which sounds like a better idea. I don't think unconditionally redirecting
users would be the way to go; it would be better for the end user to clearly
understand that the page they were trying to access is not there, to prevent
security issues, spam, or other abuse resulting from the confusion and
automation.

Offering the user clearly explained options after a clear 404 does sound like
a good idea, though, provided that the Wayback Machine can (and wants to)
handle the additional load.

~~~
awqrre
It appears that you need to click a link to be redirected to the archived
version, but it may still be a privacy issue if they check whether an
archived version of the page exists before showing the link:

> The notification reads: "This page appears to be missing. View a saved
> version courtesy of the Wayback Machine". You may click on the link to open
> the Internet Archive website to read an archived snapshot of the page on the
> site.

------
perfectfire
I publish an open source Chrome plugin that does this for 404 or 503 errors.
Except it tries the Google cache first and only hits the Wayback Machine if it
gets a 404 from the Google cache. You can also turn it on and off manually if
you don't want automatic redirection. When it finds a page in the Google cache
or Wayback Machine it will change all links on that page to cache links, so if
you're trying to browse a site that's down (rather than a single page) it's
completely seamless.

I started a Firefox extension and got the basic functionality working before
getting bored reimplementing something I had already done. But now that I've
switched to Firefox for Android, I'm thinking of reviving it so I can have my
plugin on mobile too. It's been pretty useful: I use it a couple of times a
week, usually with about a 75% success rate at getting a cached or Wayback
Machine copy of a page that is either down or no longer has the content that
used to be there.

Edit: Also, it only turns on per-tab instead of browser-wide. I could see how
similar Chrome plugins that turned the whole browser into "browse cache mode"
when activated annoyed everybody who used them.

~~~
TheWorldIsFun
If by Chrome plugin you meant extension, then you might not have to recreate
your Chrome extension from scratch for Firefox. Look at Firefox's
WebExtensions page; it's pretty much compatible with most of Chrome's API,
and you probably won't have to change your code all that much.

~~~
perfectfire
Wow, yeah, I definitely meant extension. Don't know where plugin came from. I
was already writing my Firefox extension using the then-WIP WebExtensions
API. I didn't remember it being compatible with Chrome extensions, but it's
right there on their WebExtensions homepage: "To a large extent the system is
compatible with the extension API supported by Google Chrome and Opera.
Extensions written for these browsers will in most cases run in Firefox or
Microsoft Edge with just a few changes." And it should even work in Edge. Now
I get to go back and look at what the heck I was doing with the Firefox
extension. Thanks for the info!

------
robert_tweed
Nice idea. My only concern is that it could lead to an increase in overzealous
removal requests, especially if the idea is copied by other vendors.

~~~
J_Darnley
You don't have to request anything. Just alter robots.txt and you can make the
Wayback Machine memory hole your entire website.
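
For context: at the time, the Wayback Machine honored a site's *current* robots.txt retroactively, and its crawler goes by the `ia_archiver` user agent, so a two-line robots.txt was enough to hide a site's entire archived history:

```
User-agent: ia_archiver
Disallow: /
```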

~~~
idsout
The keyword here is 'your'.

~~~
dingaling
Or someone else's website whose domain name you now own.

~~~
minikites
It's so infuriating when I come across a domain squatter that nuked the entire
history of a domain in the Wayback Machine. I sort of get why they have to do
that but it also defeats most of the point of the Wayback Machine.

~~~
slrz
_I sort of get why they have to do that_

I don't. Can you explain?

~~~
btgeekboy
Archive.org doesn't know the domain changed hands, just that it used to be
allowed to show the results but now no longer is.

~~~
dragonwriter
Doesn't really explain why they have to nuke it, even if it is the current
site owner. _Respecting_ robots.txt is one thing, but that just means _not
spidering and archiving the content that is now there_. Deleting already
archived material based on later changes to robots.txt is a non-obvious
behavior, given the usual understanding of the general meaning of robots.txt.

~~~
aab0
They're not deleting it, just hiding it from public access. Once the squatter
goes away, the content comes back.

~~~
J_Darnley
What's the difference? Both make this feature (and more general use of the
archive) useless.

~~~
aab0
The difference is exactly what I said: if they deleted it, it's gone forever.
If they hide it, it can come back. I've seen pages I cite disappear for a year
or two thanks to scummy squatters - but they came back! It's the difference
between being sentenced to execution and to 1 year of prison.

------
speps
That's a huge amount of traffic to the Wayback Machine; I hope they donated
some money.

~~~
jklinger410
As someone who uses Wayback all the time for work I would not be a fan of this
extra traffic slowing the tool down.

~~~
XMPPwocky
How much does your employer donate to the Internet Archive?

------
mynewtb
How does this handle oblivious users entering their credentials into forms? I
have seen that happen in the wild, and it was really hard to explain.

I don't think this is a great idea; it would make me worried about the
Wayback Machine.

~~~
throwanem
> it was really hard to explain

No doubt; I'm not sure what you're describing here.

~~~
mywittyname
I think his/her question is: what about POSTs from forms to 404 URLs.

So, say I put my username/password into a form field and hit submit. The
browser makes a POST to site.com/login with my information, but the server
returns a 404. What happens in this case?

~~~
throwanem
The same thing that happens in any other case: you're presented a banner that
includes a link to a Wayback Machine search for the URL that 404ed.

There's no magic here; it's just a link that produces a GET request when you
click it. I feel like a lot of the concerns and objections raised in this
comment thread originate in not having taken a minute to find out what this
extension actually _does_.
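
For the curious: the Wayback Machine exposes a public availability API (`https://archive.org/wayback/available?url=...`) that returns the closest archived snapshot for a URL as JSON, and an extension like this could in principle build its link from such a response. A minimal sketch of reading that response (the helper name and example values are mine, not from the extension):

```python
from typing import Optional


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the closest archived snapshot URL, or None if none exists."""
    snap = payload.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None


# Abridged example of what the availability API returns for an archived URL:
example = {
    "url": "example.com",
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20160803000000",
            "url": "http://web.archive.org/web/20160803000000/http://example.com/",
        }
    },
}

print(closest_snapshot(example))
# A URL with no archived copy comes back with empty archived_snapshots:
print(closest_snapshot({"url": "example.com", "archived_snapshots": {}}))
```

Clicking the banner's link then just issues an ordinary GET to the returned `web.archive.org` URL, consistent with the comment above.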

------
abstractbeliefs
"No More 404s" has been a long time coming; I remember it was one of the
original projects announced with the launch of Test Pilot.

You can find more upcoming projects and information on Test Pilot here:
[https://testpilot.firefox.com/](https://testpilot.firefox.com/)

------
kenrick95
Take note that it is an opt-in Test Pilot project.

------
makecheck
The problem is, I don’t always want the URL that returns 404. In fact, I’d say
almost every time I _don’t_ want it; I’m seeing the page because I made a
typo.

When I have a mistake, the only thing I want is my UNALTERED original input,
and a chance to fix it.

The ire-inducing thing about most “helper” pages, including the ad-ridden
search pages that ISPs like to return instead of a 404, is that they
_completely rewrite my address bar_. This means that instead of being able to
fix a single character and try again, I most likely have to retype the whole
damned thing.

Especially on mobile, there’s nothing quite like entering
"simplething.com/foo", accidentally typing "simplethin.com/foo", and being
redirected to
"godawfulisp.com/applications/helper.jsp?unnecessarycrap=1&adtracker=2&garbage=xyz&useextraobnoxiousads=true".
At that point, my options are “go back” (returning a blank page) or trying to
“select” the URL field (which is horrible on the iPhone) and reentering
everything. All because I saw a page I didn’t even want. This made me so mad
that I configured special blocker patterns to ensure that ISP pages could not
even be loaded.

In other words, please stop “helping” users if you don’t really understand
what the problem is. Sometimes an error and the unaltered original input is
exactly the right thing to show.

------
DonaldFisk
Good idea. One of the ideas in Ted Nelson's Xanadu project was unbreakable
links.

The problem on the WWW is that you don't always get a 404. Sometimes the
original web site goes up for sale and you end up with something like this:
[http://www.strictlybowhunting.com/Anov01issue/crows.htm](http://www.strictlybowhunting.com/Anov01issue/crows.htm)
which is not a 404. It would have to recognize that as being equivalent to a
404 and serve up the archive.org page anyway.

Ideally, a URL should allow the option of including a date.

~~~
slrz
Man, do I hate that crap. Oracle is another ugly example. After buying Sun,
some genius there apparently thought that it's OK to break tons of links all
over the web (and in Usenet archives, etc.), redirecting them to the fucking
generic Oracle front page.

Nothing spells out "fuck you" quite as clearly as breaking thousands of
hyperlinks overnight.

------
urda
Why was this flagged?

~~~
akkartik
Looks like the ghacks domain has really poor karma:
[https://news.ycombinator.com/from?site=ghacks.net](https://news.ycombinator.com/from?site=ghacks.net)

------
merqurio
Let's see how long until someone asserts their right to be forgotten against
the Wayback Machine. I really like the idea of ephemerality on the internet.
I kind of enjoy seeing things come and go; that nothing lasts forever.

~~~
pYQAJ6Zm
In the European Union, the “right to be forgotten” on the Internet is
protected by law. I believe it is the right thing to do, but I recognize the
issue rests on a delicate equilibrium, maybe yet to be found.

On the one hand, we wouldn’t expect our daily interactions to be recorded and
stored for long. Most of them, anyway – and certainly not our most personal
ones. On the other hand, the web is a medium where things tend to last a
while, if not for very long, thanks to efforts such as the Internet
Archive’s. The delicacy of where to rest the equilibrium, I think, stems from
the particularity of the web as a medium itself: a web page is kind of like
ye olde book, but it adds the possibility of easy changes and user
interactivity. In other words, if so used, the web sits between traditional
written media and face-to-face (or technologically mediated, but unrecorded)
interaction.

Perhaps by custom, perhaps by nature, we expect personal interaction to be
ephemeral, and we have no problem with other media, such as books,
newspapers, etc., storing their contents for long. There is no harm in
publishing a book with our thoughts and having it found across libraries in
the world (indeed, it is a very nice thing), but we would feel violated if
our conversations at home were recorded without proper cause.

The web stays somewhere in between. It’s a place for words to lie and last,
but also for personal interaction. Personally, I favor ephemerality on the
Internet, but don’t see it as characteristic of it. It’s not a given, and we
should find a place for it, but only as far as personal interactions go. I
want to be able to remove (or, at least, anonymize) whatever I post online in
the manner of a casual conversation. But I want other kinds of content, such
as manuals, news, etc., to be preserved from ephemerality.

I’m not sure whether the Internet Archive stands within these bounds. I would
like to see it more as an opt-in facility for website administrators, instead
of one that requires manually opting out.

------
Karunamon
Neat!

Are there any plans to integrate, or allow integration of, other services? If
you could search for a dead URL on the Wayback Machine, Archive.is, Google
Cache, and some others, you'd have a _lot_ of coverage, and in the case of
the last two, be immune to IA's (frankly boneheaded) robots.txt policy.

------
ape4
Just change the error number, problem fixed :)

------
swehner
Sounds like this is better left to a browser plugin.

~~~
pbhjpbhj
My first thought too. Firefox _seems_ to be cramming stuff in as default
recently that's really peripheral; strikes me we're approaching that point in
the cycle where a new streamlined offering comes along and people start
jumping ship to it. That's the place where Firefox itself started: as a
simple, fast web browser without the bells and whistles.

~~~
lmorchard
The irony of your comment is that this is exactly one of the purposes of Test
Pilot: to give new features a trial run rather than just "cramming stuff in
as default".

It's an opt-in experiment, and at the end it's possible for it to a) be
rolled in as a feature, b) be spun off as an optional add-on, or c) be
scrapped entirely.

Oh yeah, or d) taken back to the drawing board & re-engineered for another go-
round.

~~~
pbhjpbhj
I've taken part in Test Pilot, so I'm aware of how it works. Hence the wording
of my comment that it was my initial reaction.

The idea that it's appropriate as a plugin is in anticipation that they'd
decide to integrate it like Hello and Pocket.

It's a feeling, there are all sorts of different browser efforts, FF just
feels like we're getting to a point where it is drifting closer to the other
browsers and getting heavier. YMMV.

------
J_Darnley
Does this mean it only took Firefox some 10 years to integrate (part of) the
Resurrect Pages extension?

~~~
Sylos
Not every possible feature has to be part of Firefox. And for now, this is
also just a suggestion that Mozilla is making, so that users can weigh in on
it through the Test Pilot program.

