This is really cool. I've been working on a search engine for lectures, and I recently set it up so you can filter conference talks by programming language: https://www.findlectures.com/?p=1&class1=Technology&category...
For the first iteration I wrote heuristics based on a list of language-specific subreddits. The technique this blog post describes is the next logical step, so I'm thrilled about this write-up.
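Roughly this kind of thing, where the term lists were seeded from the subreddit names (illustrative, not my actual code):

    # First-iteration heuristic: tag a talk with a language if its title
    # mentions a term seeded from the language-specific subreddits.
    LANGUAGE_TERMS = {
        "python": ["python", "django", "flask"],
        "golang": ["golang", " go ", "goroutine"],
        "haskell": ["haskell", "ghc"],
    }

    def guess_language(title):
        # Pad with spaces so a term like " go " can match at the edges too.
        text = " %s " % title.lower()
        for language, terms in LANGUAGE_TERMS.items():
            if any(term in text for term in terms):
                return language
        return None

    print(guess_language("Concurrency in Go: goroutines and channels"))  # golang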
Thank you. Post author here. I'm glad you enjoyed it. I'll be writing many more posts on the same topic over the Christmas period, so if you enjoyed this one, do keep checking back.
Indeed. I've been very busy with work, but I'll be dedicating a lot more time to articles like this, and I'll list them all on the front page of http://ioloop.io. So if you bookmark that for now, you should find more content like this before the New Year.
What I really like about this technique is that I can play with scikit-learn while offline, which goes hand in hand with the holiday travel ahead.
Wait, you had 15 test cases and didn't even say how the classifier did? I guess I could run the code myself, and it's nice that you provided it on GitHub, but still, I feel like you've got to post results in a blog post like this.
You are indeed right, and I encourage anyone interested in computing the precision metrics to do so.
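For anyone who wants a head start, here's a minimal sketch using scikit-learn's classification_report; the toy training data and test titles below are placeholders, not the dataset from the post:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import classification_report

    # Toy stand-ins for the real training data (subreddit comments)
    # and the HN-title test cases.
    train_texts = ["list comprehensions are great", "pip install works fine",
                   "goroutines make concurrency easy", "the go compiler is fast",
                   "monads and type classes", "ghc extensions everywhere"]
    train_labels = ["python", "python", "golang", "golang", "haskell", "haskell"]

    test_titles = ["Writing a JSON parser with type classes",
                   "Channels and goroutines explained",
                   "Why I still use pip and virtualenv"]
    test_labels = ["haskell", "golang", "python"]

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    # Per-class precision, recall and F1 in one call.
    print(classification_report(test_labels, model.predict(test_titles)))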
The focus of this post is the Go proxy used for the caching. It's actually used in a CI/CD environment, but I'm finding it incredibly useful for a whole variety of tasks.
In the scikit-learn article the classifier scores over 90% precision, so I'd expect it would be possible to do the same.
I'll be posting more about this over the Christmas period, so I'll write a part II where I compute the precision metrics.
Another thing worth noting is that it's not caching the HN comments, while it is using the Reddit comments. Despite using tf-idf, this still skews the results heavily towards Reddit as opposed to HN. So that's something else any interested reader can look into.
But the results, and how well the classifier performs, really just depend on the quality and amount of training data you have. So it would be interesting to see how this does if you can get a bunch more data from each of the subreddits and have some more test examples!
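Something like this untested sketch would be a quick way to grow the training set, using Reddit's public JSON listings (mind the rate limits, and use the official API with OAuth for real volume):

    import requests

    # Pull recent comments from each language subreddit via Reddit's
    # public JSON listing endpoint.
    SUBREDDITS = {"python": "python", "golang": "golang", "haskell": "haskell"}

    def fetch_comments(subreddit, limit=100):
        url = "https://www.reddit.com/r/%s/comments.json" % subreddit
        resp = requests.get(url, params={"limit": limit},
                            headers={"User-Agent": "classifier-data-script"})
        resp.raise_for_status()
        return [child["data"]["body"]
                for child in resp.json()["data"]["children"]]

    texts, labels = [], []
    for label, sub in SUBREDDITS.items():
        for body in fetch_comments(sub):
            texts.append(body)
            labels.append(label)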
Thank you, Jack. I think your point is important, so I'll add a note at the end of the article. I'll also add a link to your article, as I think it's complementary.
I'm still confused about the use case. Is the purpose of a cache like this to provide a unified interface for accessing and storing web pages? So you can write code that scrapes pages by URL, then re-run it without changing your code, using the locally cached data?
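Something like this, I mean (the proxy address and port are just a guess on my part):

    import requests

    # Hypothetical: the caching proxy listening locally on port 8080,
    # whatever the Go proxy is actually configured to use.
    PROXIES = {"http": "http://localhost:8080"}

    # First run goes out to the network and the proxy stores the response;
    # later runs are served from the local cache with no code changes.
    page = requests.get("http://example.com/", proxies=PROXIES)
    print(len(page.text))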
I'm just getting started experimenting in this area, and it's great to see something approachable that isn't about classifying flower petals!
Hi Spooky23. Thank you. I'll be writing more posts on the topic over the Christmas period, so keep checking back. I also find the work more engaging when I'm using data that interests me.
This is a great post, and overlaps somewhat with a project I'm working on (scraping and classifying large amounts of text).
I know it's not the focus of the post, but was there any particular reason you went with the MultinomialNB classifier? I've been getting pretty good results recently with LinearSVC, which seems to be a lot faster and, in my case, a bit more accurate too.
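For anyone curious, swapping them is a one-line change in a scikit-learn pipeline; here's a toy comparison (my own sketch, not code from the post):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # Toy corpus; swap in the real comment data.
    texts = ["pip and virtualenv", "list comprehensions", "the gil again",
             "goroutines are cheap", "gofmt settles debates", "channels not locks"]
    labels = ["python", "python", "python", "golang", "golang", "golang"]

    for clf in (MultinomialNB(), LinearSVC()):
        model = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(model, texts, labels, cv=3)
        print("%s: mean accuracy %.2f" % (type(clf).__name__, scores.mean()))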
An interesting metric for a future post might be how your proxy compares with Scrapy and its httpcache middleware.
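For reference, enabling the Scrapy side is just a few settings (these are the standard httpcache settings, with illustrative values):

    # settings.py -- enables Scrapy's built-in HttpCacheMiddleware, which
    # stores responses on disk and replays them on re-runs.
    HTTPCACHE_ENABLED = True
    HTTPCACHE_DIR = "httpcache"
    HTTPCACHE_EXPIRATION_SECS = 0   # 0 = never expire
    HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"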
Looks very interesting. Have you looked at H2O.ai, which spits out classifiers as Java code that can be wrapped in an ultra-low-latency API without caching?