
Show HN: Updating archive of HN's /newest, with pre-rendering of the URLs - avifreedman
http://hnflood.com
======
avifreedman
I'm frequently on mobile or in the air when scanning /newest, so the browser
CPU cost of opening 50+ tabs at once, plus the latency and loss across all the
RTTs modern pages require, is a time waster.

And I couldn't find any RSS feeds of /newest (no one else that crazy, I
guess), so I wrote something to grab pointers to the submissions and render
them as GIF, HTML, and text.

If this is useful to anyone I'm happy to improve it (including design).

------
bazzargh
There are i18n bugs in the characters you're sending - I notice you've used
"text/html; charset=iso-8859-1" but HN itself is "text/html; charset=utf-8" -
i.e. you can only correctly display a subset of HN's headlines. There's an
example on the page right now: the Lockheed story from medium.com.
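To illustrate the mismatch (hypothetical headline; the actual one may differ): Latin-1 only covers code points U+0000..U+00FF, so a headline with typographic quotes simply cannot be encoded in it, and serving UTF-8 bytes under a Latin-1 label mangles those characters.

```python
# Hypothetical headline with curly quotes (U+2019, U+201C, U+201D),
# which sit outside Latin-1's U+0000..U+00FF range.
title = "Lockheed\u2019s \u201cSkunk Works\u201d"

utf8_bytes = title.encode("utf-8")  # UTF-8 handles any headline

try:
    title.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False  # Latin-1 cannot represent the curly quotes

print(latin1_ok)  # False
```

Declaring `charset=utf-8` on the generated pages (matching what HN itself sends) avoids the problem entirely.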

Also, I really dislike the target="_blank" on every link - I can get that
anyway with ctrl-click.

I was surprised you're generating GIF previews but not inlining them as
thumbnails?

~~~
avifreedman
Good point(s). Will look at encoding later and rerun all the pages.

Re: _blank - it was a peeve of mine that I'd accidentally miss a ctrl on the
click reading /newest, but with this interface that's not a problem since
moving back will work and keep your place. I changed it and the historic pages
are regenerating now.

Re thumbnails, I wanted pages I could load quickly on an iPad, Gogo wireless,
bad 3G, etc. Happy to do a version with thumbs this week if you'd like.

~~~
bazzargh
Don't do it for me! I'm happy enough using
[http://ihackernews.com/](http://ihackernews.com/) on my mobile; was just
asking the question.

------
rb2e
Personally, as someone who has submitted to HN, I like this. I don't mind my
content being used this way, since a link points back to the main site and
it's obvious where the main site is; but if it downloaded every post I made,
then I might take issue.

However, some website owners may not like that you're downloading the content
of their articles and hosting it elsewhere. There is a great army of spam
sites that just watch [http://pingomatic.com/](http://pingomatic.com/) and
scrape each new entry, then host it on a splog and stick adverts on it, which
gets website owners annoyed.

The trouble is that Google might not realize which is the main site, so the
original won't get the PageRank, and visitors land on the copy instead of the
source. Owners get annoyed because they can't monetize those readers or see
who is reading the page through analytics.

So you may want to set up a DMCA page and an abuse email address to stay on
the right side of the law, plus a robots.txt that blocks Google and Bing from
crawling the pages you downloaded.

~~~
avifreedman
Thanks for the suggestions. robots.txt should now be blocking /db/, which
holds all the saved content, and a link to the DMCA page has been added on
every generated page (putting it only at the bottom would be obscure, since
the pages can get so long).
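A minimal robots.txt along those lines might look like this (a sketch; the live file may differ):

```
User-agent: *
Disallow: /db/
```

Well-behaved crawlers, including Googlebot and Bingbot, honor the wildcard `User-agent` rule, so a single `Disallow` covers them both.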

I'm not planning on copying any of the actual HN content, and I don't present
a copy at all if the submission is on news.yc itself. At some point I'll hook
into the API to grab comments/points every so often, fold them into the index
pages, and probably allow voting from the pages.

~~~
rb2e
Great! Thank you for doing that.

------
nezza-_-
How do you generate the gifs? This one didn't work too well:
[http://hnflood.com/db/ab/70a/ab70a72b415067b3130bd0f3ea77baa...](http://hnflood.com/db/ab/70a/ab70a72b415067b3130bd0f3ea77baa3_gif.html)

Very nice idea otherwise! I wonder if you could get in (legal) trouble for
providing the screenshots/texts.

~~~
avifreedman
Using phantomjs. For medium.com it just grabs an image on every render. I
considered hiding the GIF in the index for medium sites for now... but I'll
try to get that debugged soonish with help from someone familiar with
phantomjs.

------
ddod
This is actually really awesome from a research perspective. I imagine you
could do things like analyze word pools, website layouts, colors, and other
latent link data, and correlate them with success/failure rates on HN. If I
had more time, I could definitely see myself building things on top of this.
Some sort of API would be useful for that.

Great work!

~~~
avifreedman
I think the thriftdb folks (integrated with HN and hnsearch) do a great job
with some of what you're looking for. If not, I'm happy to make the data
accessible.

Actually... it's just using the Unix fs as a db right now, and the structure
is open and pretty easy to decipher. The dir tree for the objects is just
"db/substr(md5hashofurl, 0, 2)/substr(md5hashofurl, 2, 3)/md5hashofurl", and
bytime/yyyy/m/d/hr/min is just symlinks to the md5hash entries (0-length
files).
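That layout can be sketched in a few lines (the function name and example URL are my own; the scheme is as described above):

```python
import hashlib

def db_path(url):
    # db/<first 2 hex chars>/<next 3 hex chars>/<full 32-char md5 hex>
    h = hashlib.md5(url.encode("utf-8")).hexdigest()
    return "db/{}/{}/{}".format(h[:2], h[2:5], h)

print(db_path("http://example.com/"))
```

The two prefix levels keep any single directory from accumulating too many entries, which is the usual reason for hash-sharded layouts on a plain filesystem.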

------
nwh
Could you save something better than the incredibly lossy GIF? Even a JPEG
would be better if you're trying to save size on disk.

~~~
avifreedman
I tried all of the formats phantomjs supports, and GIF didn't look worse than
JPEG. Disk space isn't an issue... I'll take another look, or maybe just add
JPEG or BMP as well.

Usually I'm just scanning to get a sense of whether a page is worth reading
word for word, and if so I go to the original source.

