The front-end specifically asks for pages in chunks of 200. So whether you slide to page 299, 300, or 385, it will request items 200-400 from the API. This means I can very easily serve these requests out of Redis [1], and during usage spikes requests never hit the disk :)
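Roughly, the idea looks like this (a minimal sketch with redis-py; the key scheme and the fetch_range() helper here are illustrative, not my actual code):

    import json
    import redis

    CHUNK = 200
    r = redis.Redis()

    def get_chunk(page):
        # Round down to the chunk boundary: pages 200-399 all map to 200.
        start = (page // CHUNK) * CHUNK
        key = "items:%d-%d" % (start, start + CHUNK)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)              # spike traffic never touches disk
        items = fetch_range(start, start + CHUNK)  # hypothetical disk read
        r.setex(key, 3600, json.dumps(items))      # cache the chunk for an hour
        return items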
I see that you are using pyquery to scrape the content. I was using BeautifulSoup for my previous projects, but pyquery seems like a better choice because its API mirrors jQuery's, so I am planning to switch too. Are there any downsides to pyquery, though?
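For a feel of the difference, here's the same lookup in both (a minimal sketch, nothing from the project's code; assumes both libraries are installed):

    from pyquery import PyQuery as pq
    from bs4 import BeautifulSoup

    html = '''<table><tr>
      <td class="title"><a href="http://example.com">A story</a></td>
    </tr></table>'''

    # pyquery: jQuery-style CSS selectors and chaining
    print(pq(html)('td.title a').attr('href'))       # http://example.com

    # BeautifulSoup: CSS selectors via select(), dict-style attributes
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.select('td.title a')[0]['href'])      # http://example.com

As far as I know, the main catch with pyquery is that it sits on top of lxml, so you need that C extension installed; BeautifulSoup can fall back on pure-Python parsers.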
Very nice. I actually thought of implementing this a few months ago but a quick Google search brought up http://hackerslide.com/ (also open source), which works fine as well.
I've learnt two things from this project in particular. First, most similar projects don't seem to stick around very long (the Reddit one it was based on disappeared after a few months, as have several others; http://hckrnews.com/ is one exception I can recall). Second, these tools seem to be popular at first but are rarely used over time. Luckily I still find it useful to catch up after vacations, etc ;-)
"Perhaps one minor bonus to HackerSlide for now is anyone can take the data collected."
Indeed. May I make a few suggestions here:
1) Why not make a datadump, so people wouldn't need to scrape ~800*24 json files individually (see the sketch after this list)?
2) OP ought to load this data into his version so the timeline goes back further
3) It seems quite a few people get the urge to tinker with HN like this. I'm sure pg doesn't mind the scraping, but it strikes me as vastly more efficient if some sort of shared resource were set up and perhaps added to the footer, in the vein of HNsearch, so people don't waste time getting their own crawling set up.
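For scale, without a dump everyone ends up repeating something like this per file (a rough sketch; the URL pattern and file layout are guesses, not hackerslide's actual scheme):

    import os
    import requests

    BASE = "http://hackerslide.com/data/%s.json"  # hypothetical endpoint

    def fetch_hour(stamp, out_dir="dump"):
        # Download one hourly snapshot, e.g. stamp="2012-10-09-18".
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        path = os.path.join(out_dir, stamp + ".json")
        if os.path.exists(path):  # resumable: skip files we already have
            return
        resp = requests.get(BASE % stamp)
        resp.raise_for_status()
        with open(path, "w") as f:
            f.write(resp.text)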
I'm sure somebody else has a dataset just like yours that goes back further still. :)
Also, thank you for making this and OP for making his. Fun.
This looks amazing! Good job! If I may ask (yes, I have taken a look at your code, but unfortunately my Python is not too good), what's the "secret" to grabbing HN's historic data? Also, would it be possible to go back beyond one month, to maybe a year or even a few years? (I don't know if it is just me, but I can only go back as far as October 9th, 2012.)
This is where the scraping happens. The code is a little uglier than I'd like, but that's largely down to how hard it is to extract data from HN's markup. I'll look at adding more data soon.
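For the curious, the extraction boils down to something like this (a simplified sketch of pyquery against HN's table-based markup, not the exact code from the repo):

    from pyquery import PyQuery as pq

    def scrape_front_page(html):
        d = pq(html)
        stories = []
        # Story titles live in td.title cells; the numeric rank cells
        # ("1.", "2.", ...) share the same class but contain no link.
        for td in d('td.title').items():
            a = td('a')
            if not a:
                continue
            stories.append({'title': a.text(), 'url': a.attr('href')})
        return stories  # a real scraper would also skip the trailing "More" link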
There was something very similar to this for reddit that was very handy. It was at redditsnapshot.sweyla.com, but it shut down at some point. I still think there is an opportunity for someone to index reddit's front page hour by hour, and maybe offer a premium version covering all the subreddits you subscribe to.
When I saw that sweyla's site had disappeared, I emailed them to ask if it would be back. I didn't get an answer, so I implemented my own version: http://redditsnapshots.com/ (running since March 2012).
Awesome. One can also use it to analyse various trends on HN. For example, look at this story, http://news.ycombinator.com/item?id=4739649, from November 4th, 6:40 pm to 11:00 pm. It shows you how eager HN'ers are to help each other.
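Something like this would pull one story's rank over time, assuming the hourly snapshots are JSON lists with 'id' and 'rank' fields (the real format may well differ):

    import glob
    import json

    def rank_history(story_id, snapshot_dir="dump"):
        # Walk the snapshots in time order and record the story's rank.
        history = []
        for path in sorted(glob.glob(snapshot_dir + "/*.json")):
            with open(path) as f:
                stories = json.load(f)
            for s in stories:
                if s.get('id') == story_id:
                    history.append((path, s['rank']))
                    break
        return history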
Loved this. I would love a feature where I could "highlight" (add a glow for effect) a particular story and watch it move up and down as I play with the arrows. Right now everything moves, so this would help spot the trend for a particular story.
Seems like there's an API if you want to get some data out of it. Otherwise, you might want to try using JavaScript on trusted websites... it actually often enhances the pages.
I created this for the HN community and I'm very happy to see so many people enjoying it :)
If you have any questions about how I built this or if you'd like to suggest something new, let me know!