2. Oh, excellent! We hadn't found that or we'd have used it, and we'll start working with it.
3. Tomorrow I'm going to blog about how we approached the machine learning. Short version: we manually came up with regular expressions to classify a training set based on titles. The idea is that when we experimented with manual annotations on titles, the vast majority of the time we were looking for only a few keywords. There's no question that this adds bias and is not entirely accurate, but manual inspection convinced us it was a good enough approach for our hackathon, and most of the articles we identified with the resulting algorithm would not have been found by the title regex alone.
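For anyone curious what regex-based labeling of titles can look like, here is a minimal sketch. The topic names and patterns are hypothetical placeholders, not the expressions we actually used, and `label_title` is just an illustrative helper name.

```python
import re

# Hypothetical topic patterns; the real expressions from the project are not shown here.
TOPIC_PATTERNS = {
    "machine_learning": re.compile(
        r"\b(machine learning|deep learning|neural net(work)?s?)\b", re.IGNORECASE
    ),
    "security": re.compile(
        r"\b(vulnerabilit(y|ies)|exploit|malware|ransomware)\b", re.IGNORECASE
    ),
}


def label_title(title):
    """Return the set of topic labels whose pattern matches the title."""
    return {
        topic
        for topic, pattern in TOPIC_PATTERNS.items()
        if pattern.search(title)
    }


if __name__ == "__main__":
    for title in [
        "Deep Learning for Ransomware Detection",
        "Show HN: My weekend project",
    ]:
        print(title, "->", sorted(label_title(title)))
```

Titles that match nothing get an empty set. Labels like these are noisy, so they only make sense as training targets for a classifier that looks at the article text rather than the title.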
Oh, that was silly of us not to use BigQuery! I was just able to use it to download a full million stories (though we still would have had the rate-limiting step of downloading the articles).
During a hackathon it can be hard to tell when to keep searching for an easy solution like that, as opposed to going with something slow that you know will work; sometimes the search turns out to be a dead end.
A few comments:
1. To other commenters, as with the HN Vue demo a week ago (https://news.ycombinator.com/item?id=14284877), the project is a technical proof-of-concept; the aesthetics aren't the primary focus.
2. The Algolia API is better for scraping because it allows bulk requests, unlike the official API (my old 2014 script still works, I think: https://github.com/minimaxir/get-all-hacker-news-submissions...)
3. How much time did it take to manually label the training/test set before training the RF classifier? Even with topic modeling for extrapolating tags, accurately labeling 20,000 submissions is quite a task.
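On point 2, a rough sketch of what bulk fetching through the Algolia API can look like. This is not the linked script. It assumes the `search_by_date` endpoint, the `tags`, `hitsPerPage` and `numericFilters` parameters, and a `created_at_i` field on each hit; `build_url` and `fetch_stories` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the Algolia-hosted HN search API (newest items first).
API_URL = "https://hn.algolia.com/api/v1/search_by_date"


def build_url(before_ts=None, hits_per_page=1000):
    """Build a request URL for one page of stories older than before_ts."""
    params = {"tags": "story", "hitsPerPage": hits_per_page}
    if before_ts is not None:
        # Assumed filter syntax: only stories created before this Unix timestamp.
        params["numericFilters"] = f"created_at_i<{before_ts}"
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_stories(max_pages=1):
    """Fetch up to max_pages pages, walking backwards in time."""
    stories, before_ts = [], None
    for _ in range(max_pages):
        with urllib.request.urlopen(build_url(before_ts)) as resp:
            hits = json.load(resp)["hits"]
        if not hits:
            break
        stories.extend(hits)
        # Use the oldest timestamp seen so far as the cursor for the next page.
        before_ts = hits[-1]["created_at_i"]
    return stories
```

Each request returns a whole page of stories, whereas the official API needs one request per item. Note that the strict `<` cursor can skip stories sharing the boundary timestamp, so a real scraper would want to deduplicate on the item ID and handle that edge.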