
Show HN: Mirror of current articles on HN front page - enan
http://hn.getpageback.com/
======
aaronpk
You seem to have some UTF-8 issues. Lots of things like "itâ€™s" everywhere.

~~~
enan
Thanks, fixed! The Content-Type header on the S3 object was missing the
charset!
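For the curious: the mojibake above is exactly what you get when UTF-8 bytes are decoded as Windows-1252. A minimal sketch (the boto3 call is illustrative only; the bucket and key names are placeholders, not our real config):

```python
# The curly apostrophe U+2019 is three bytes in UTF-8: E2 80 99.
# A browser falling back to Windows-1252 renders those bytes as "â€™".
garbled = "it’s".encode("utf-8").decode("cp1252")
print(garbled)  # itâ€™s

# The fix is to declare the charset when uploading to S3, e.g.:
# import boto3
# boto3.client("s3").put_object(
#     Bucket="my-bucket", Key="page.html", Body=html_bytes,
#     ContentType="text/html; charset=utf-8",
# )
```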

------
enan
This is my attempt to solve the problem of sites becoming unavailable when
they make it to the HN front page. It is also a trial balloon for a new
service that I am developing. AMA! :)

~~~
greenyoda
If you're mirroring articles on these sites without permission, wouldn't that
be a violation of the owners' copyrights?

~~~
enan
That's a valid concern. I believe this service is very similar to Google's
cache, and the copying should be permissible under fair use [1]. But IANAL :)

[1] [http://fairuse.stanford.edu/case/field-v-google-inc/](http://fairuse.stanford.edu/case/field-v-google-inc/)

edit: we also respect robots.txt!

~~~
greenyoda
I looked at the judge's order in that case, which was very interesting. Some
of the points he makes in Google's favor are:

- Google respects any "noarchive" tags that are on the page, so the page
owner can control whether Google copies each page.

- The site owner can also prevent Google from copying the entire site (or
parts of it) via robots.txt.

If I understand the argument correctly, this metadata, as set on the
plaintiff's site, gave Google an implied license to use the content, based on
widely-understood web conventions.

Also, the order notes that Google places a prominent banner on top of its
cached pages stating that they're copies that may not be current. However,
your copies seem to be indistinguishable from the original content. If
somebody were to send someone else a link to one of your cached articles, it
would be difficult to tell that it was a cached copy.

~~~
enan
Thanks for the comments. Our crawler does respect the robots.txt standard and
the nofollow tag. It seems noarchive is what Google recommends - will look
into it further.

We do put a banner on the index page, but we don't have one on each cached
page. Thanks for pointing it out - will fix!
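Roughly what the crawler's checks look like - a simplified stdlib-only sketch (the real crawler differs; the bot name, rules, and HTML here are made up for illustration):

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# robots.txt check: parse the rules, then ask before fetching a URL.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
print(rp.can_fetch("PagebackBot", "http://example.com/private/x"))  # False
print(rp.can_fetch("PagebackBot", "http://example.com/article"))    # True

# noarchive check: scan <meta name="robots"> tags after fetching.
class MetaRobots(HTMLParser):
    noarchive = False
    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "meta" and d.get("name", "").lower() == "robots":
            self.noarchive |= "noarchive" in (d.get("content") or "").lower()

p = MetaRobots()
p.feed('<html><head><meta name="robots" content="noarchive"></head></html>')
print(p.noarchive)  # True -> don't cache this page
```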

~~~
Paulods
Even more important than that for me (and possibly for you too) is making
sure that none of these pages make it into Google's index.

The duplicated content (potentially pushing the original pages down in the
search rankings) and the fact that you're polluting the organic search
results for the sites you mirror could be a big issue for the page owners.

~~~
enan
Good point! There is now a robots.txt that prevents the site from being
indexed:
[http://hn.getpageback.com/robots.txt](http://hn.getpageback.com/robots.txt)
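For reference, blocking all crawlers only takes two lines - this is the standard pattern (the file at the URL above is what's actually served):

```
User-agent: *
Disallow: /
```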

