A few comments:
1. To other commenters, as with the HN Vue demo a week ago (https://news.ycombinator.com/item?id=14284877), the project is a technical proof-of-concept; the aesthetics aren't the primary focus.
2. The Algolia API is better for scraping because it allows bulk requests, unlike the official API (my old 2014 script still works, I think: https://github.com/minimaxir/get-all-hacker-news-submissions...).
3. How much time did it take to manually label the training/test set before training the RF classifier? Even with topic modeling for extrapolating tags, accurate labeling for 20,000 submissions is a task.
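To make point 2 concrete, here is a minimal sketch of how bulk requests against the Algolia HN Search API could be laid out. It only builds the request URLs and does no fetching. The endpoint and the `tags`, `hitsPerPage`, and `numericFilters` parameters are as I remember them from the public HN Search API docs, so treat them as assumptions and check before relying on them; `window_urls` and the one-day window size are just illustrative names and defaults, not anything from the linked script.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Algolia-backed HN Search API (newest-first listing).
BASE_URL = "https://hn.algolia.com/api/v1/search_by_date"


def window_urls(start_ts, end_ts, step_seconds=86400, hits_per_page=1000):
    """Build one request URL per time window between two Unix timestamps.

    Splitting the range into windows keeps each response under the
    per-request hit limit, so a whole archive can be pulled in bulk.
    """
    urls = []
    lo = start_ts
    while lo < end_ts:
        hi = min(lo + step_seconds, end_ts)
        params = {
            "tags": "story",  # submissions only, no comments
            "hitsPerPage": hits_per_page,
            # Half-open window [lo, hi) on the creation timestamp.
            "numericFilters": f"created_at_i>={lo},created_at_i<{hi}",
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
        lo = hi
    return urls
```

Each URL can then be fetched with whatever HTTP client you like. If a single window has more stories than one page holds, shrink `step_seconds` for that window.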
1. That's the way we were thinking about it :)
2. Oh, excellent! We hadn't found that or we'd have used it, and we'll start working with it.
3. Tomorrow I'm going to blog about how we approached the machine learning. Short version: we manually came up with regular expressions to classify a training set based on titles. When we experimented with manual annotations on titles, the vast majority of the time we were looking for only a few key words. There's no question that this adds biases and will not be entirely accurate, but manual inspection convinced us it was a good enough approach for our hackathon, and most of the articles we identified with the resulting algorithm would not have been found by the title regex alone.
You can see the table of regular expressions [here](https://github.com/dodger487/analyze_hn/blob/master/topics.c...) and a bunch of (pretty unstructured) analysis code [here](https://github.com/dodger487/analyze_hn/blob/master/hn-analy...).
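For anyone who wants the gist without reading the repo, the title-regex labeling step looks roughly like the sketch below. The topic names and patterns here are made up for illustration and are not the ones in our table, and `label_title` and `build_training_set` are hypothetical helper names.

```python
import re

# Illustrative patterns only; the real table lives in the linked repo.
TOPIC_PATTERNS = {
    "python": re.compile(r"\bpython\b", re.IGNORECASE),
    "security": re.compile(
        r"\b(security|vulnerabilit(y|ies)|exploit)\b", re.IGNORECASE
    ),
    "machine_learning": re.compile(
        r"\b(machine learning|neural net(work)?s?|deep learning)\b",
        re.IGNORECASE,
    ),
}


def label_title(title):
    """Return the sorted list of topics whose regex matches the title."""
    return sorted(
        topic
        for topic, pattern in TOPIC_PATTERNS.items()
        if pattern.search(title)
    )


def build_training_set(titles):
    """Pair each title with its regex-derived labels (possibly empty)."""
    return [(title, label_title(title)) for title in titles]
```

The pairs that come out of `build_training_set` are noisy labels, not ground truth. The classifier is then trained on the article text with those labels, which is how it can pick up articles whose titles never match a regex.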
The Firebase API is excellent. I have been using it to keep http://searchhn.com up to date in real time.
Also, BigQuery is updated every day with all comments and posts.
This is what I started with to update the Searchera (https://searchera.io) index, which powers Searchhn.
During a hackathon it can be hard to tell when to keep searching for an easy solution like that, as opposed to going with something slow you know will work; sometimes the search turns out to be a dead end.
Thanks for the recommendations!
BTW, congrats on the projects, well done!
Might wanna tweak that...
Also, in 10 minutes there is no time to read every article (or any good long one, for that matter), so I'm not sure what that leaves on the front page.
I've also posted this sign-up as a discussion if you want to leave any comments for everyone to see: https://news.ycombinator.com/item?id=14338456
There are also options like HN top 10: http://www.daemonology.net/hn-daily/
Or the ultra-cynical weekly antidote to HN: http://n-gate.com/
If only there were a way to make the default search order recency instead of popularity — most of my searching is before posting something, to make sure it hasn't already been posted.
I do wish there was a way I could set the defaults. I almost always want to search comments and sort by date.
Somebody contact Newsweek!
I thought he was funny, and I think lighthearted humor has value to it. It didn't seem snarky to me; did it to someone else?
Did it seem off-topic? It's a joke rather than useful information, but I'd argue that it is on-topic per the rules: "Anything that good hackers would find interesting."
Am I missing something?
There are instances where humorous comments can add value (e.g. irony/satire), but it's not common and hard to do properly.
(Stupid ideas? I'm full of em! Execution on stupid ideas until they get enough VC capital to become obviously good ideas? I'm pretty lame at that... Anyone wanna be my "Executing Co-founder"?)
?X is, in fact, Satoshi Nakamoto!
Since this topic is already fairly meta, and because of the nature of your question, I'll chime in here as to why I would normally down vote your comment in other threads.
"why did people downvote this comment?"
Any discussion of voting (outside of a few exceptions such as this) gets down voted quickly. Not only is it discouraged in the guidelines, it's also generally self-correcting. I've seen far too many comments that ask why they are down voted when they clearly have more votes up than down. In addition, the conversations in reply generally revolve around why people might be voting down, and whether that is wrong.
Basically, it creates a bunch of useless commentary for no good reason.
In addition to this, asking people why they voted down a comment is annoying. The goal of commentary should be to spark either conversation or thought. If it does neither, it's really not worth my trouble to explain why I down vote it. I vote down the comment because it is a bad comment, and not worthy of worthwhile discussion.
I've voted comments up that I disagree with because the discussions they've sparked were interesting and voted down comments I agree with because they don't honestly contribute to the active discussion and exchange of ideas.
Not everyone thinks this way. I'm sure people vote up what they agree with and vote down what they disagree with, without a care for the overall discussion, simply to fit an agenda. I admit I've done it in the past (I am not perfect, after all), and I've regretted it. But overall voting corrects itself, and frankly, it does not matter. Karma is representative of your value.
If you are that concerned about the karma of a comment, do not post. If people are voting down your comment and not replying, start by addressing the failings in your comment to spark proper discussion.
Blaming others (tripe like "people would rather vote down than explain where I am wrong") is weak and childish, and will get voted down without hesitation. HN should be better than that weak (non-existent?) rhetoric, and the moment you add that to a comment, you've lost.
It refreshes with new stories every few hours. You can check it out here: http://headlinr.herokuapp.com/
EDIT: click on the bubbles to see individual headlines. Also, here's the GitHub page: https://github.com/dgarrick/headliner
Custom tag groups could be a useful extension.
Obvious suggestions that would make it usable as a primary HN interface:
• Login and voting (not sure the HN API supports this though)
• Tag suggestions to feed into the model
Hashtag spam makes #hashtags mostly useless as a method of discovery.
Random forests are a method that's often effective at accounting for many interactions among features in high-dimensional data.
I, for one, do like the idea of tagging stuff, since I might favorite a lot of stories but then years later it's hard to find a particular one, even if you do remember the general topic.
Tags for this would be really helpful for me, ergo it's not perfect for me, ergo it's good someone else is trying to make it better.
Since it's not affecting the original site, why would you want to stop them?
Materialistic on Android makes it decent, but I would still prefer something like Boost for Reddit for comments.