
Show HN: Subreddit classifier using scikit-learn and a high-performance Go proxy - ioloop
http://ioloop.io/blog/hoverpy-scikitlearn/
======
garysieling
This is really cool. I've been working on a search engine for lectures, and I
recently set it up so you can filter conference talks by the programming
language.

[https://www.findlectures.com/?p=1&class1=Technology&category...](https://www.findlectures.com/?p=1&class1=Technology&category_l2_Technology=Programming%20Languages)

For the first iteration I wrote heuristics based on a list of language-
specific subreddits. The technique this blog post describes is the next
logical step, so I'm thrilled about this write-up.

~~~
ioloop
Thank you. Post author here. I'm glad you enjoyed it. I'll be writing many more
blog posts on the same topic over the Christmas period, so if you enjoyed this
one, do keep checking back.

~~~
derekja
Agreed, nice article! I'm afraid that, other than your posts making it to the
front page, I don't see how to keep checking:

-- can't find an RSS feed for your blog
-- ioloop.io has a blank homepage
-- ioloop.io/blog gives a 404

So where's the best place to keep checking?

Thanks!

~~~
ioloop
Indeed. I've been very busy with work, but I'll dedicate a lot more time to
articles like this, and shall list them all on the front page of
[http://ioloop.io](http://ioloop.io). So if you just bookmark that for now,
you should find more content like this before the New Year.

What I really like about this technique is that I can play with scikit-learn
while offline, which goes hand in hand with the holiday travels ahead.

------
jackschultz
Wait, you had 15 test cases and didn't even say how the classifier did? I
guess I could run the code myself, and it's nice that you provided it on
GitHub, but still, I feel like you gotta post results in a blog post like this.

~~~
ioloop
You are indeed right, and I encourage anyone interested in computing the
precision metrics to do so.

The focus of this post is the Go proxy used for the caching. It's actually
used in a CI/CD environment, but I'm finding it incredibly useful for a whole
variety of tasks.

Regarding the precision metrics, you can find all the information required to
compute them here: [http://scikit-learn.org/stable/tutorial/text_analytics/worki...](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

In the scikit-learn tutorial the classifier scores over 90% precision, so I'd
expect it would be possible to do the same here.

I'll be posting more about this over the Christmas period, and I'll write a
Part II where I compute the precision metrics.
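
For anyone who wants to compute this before Part II lands: here is a minimal
sketch using scikit-learn's built-in metrics, in the spirit of the linked
tutorial. The tiny corpus and labels here are made up for illustration; the
subreddit comments from the post would slot in the same way.

```python
# Sketch: measuring precision for a TF-IDF + Naive Bayes text classifier.
# The toy training/test data below is illustrative, not from the post.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["def foo(): pass", "func main() {}", "import numpy", "go build ./..."]
train_labels = ["python", "golang", "python", "golang"]
test_texts = ["import os", "go run main.go"]
test_labels = ["python", "golang"]

# Same pipeline shape as the scikit-learn tutorial: vectorize, then classify.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)
predicted = clf.predict(test_texts)

# Macro-averaged precision across the classes.
print(precision_score(test_labels, predicted, average="macro"))
```

With real data you'd also want `classification_report` for per-class precision
and recall, since a skewed corpus (like the Reddit/HN imbalance mentioned
below) can hide poor performance on the minority class.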

Another thing worth noting is that the HN comments aren't being cached, while
the Reddit comments are. Despite using TF-IDF, this still completely skews the
results towards Reddit as opposed to HN. So that's something else any
interested reader can look into.

Thank you for pointing this out.

~~~
jackschultz
Well, sure, it's possible; I've done writeups on text classification with
scikit-learn as well:

[https://bigishdata.com/2016/12/05/classifying-amazon-reviews...](https://bigishdata.com/2016/12/05/classifying-amazon-reviews-with-scikit-learn-more-data-is-better-turns-out/)

But the results, and how well the classifier performs, really just depend on
the quality and amount of training data you have. So it would be interesting
to see how this does if you can get a bunch more data from each of the
subreddits and have some more test examples!

~~~
ioloop
Thank you, Jack. I think your point is important, so I'll add a note at the
end of the article. I'll also add a link to your article, as I think it's
complementary.

------
maksimum
I'm still confused about the use case. Is the purpose of a cache like this to
create a unified interface for accessing and storing web pages? So you can
write code to scrape pages based on their URLs, and then re-run it against
locally cached data without changing your code?

~~~
garysieling
It's possible that the proxy caches duplicate requests, e.g. if you're
crawling articles linked on reddit.
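
To make the use case concrete, here is a dependency-free sketch of the
capture/replay idea behind a caching proxy like the one in the post. This is
not HoverPy's actual API, just the concept: the first run records each
response keyed by URL, and later runs are served locally, so the same scraping
code can be re-run offline. `fetch_remote` is a hypothetical stand-in for a
real HTTP request.

```python
# Capture/replay in miniature: first call hits the "network" and records the
# response; subsequent calls (and repeat URLs during a crawl) replay from disk.
import json
import os

CACHE_FILE = "cache.json"

def fetch_remote(url):
    # Placeholder for a real network call (e.g. urllib.request.urlopen).
    return "<html>body of %s</html>" % url

def cached_get(url):
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    if url not in cache:
        # Capture mode: fetch once and record the response.
        cache[url] = fetch_remote(url)
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)
    # Replay mode: served from the local cache from now on.
    return cache[url]

body = cached_get("https://www.reddit.com/r/Python.json")
```

A real proxy does this transparently at the HTTP layer, which is what lets
unmodified scraping code benefit from it.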

------
Spooky23
This is a really great resource. Thank you.

I'm just getting started with experimenting in this area, and it's great to
see something approachable that isn't about classifying flower petals!

~~~
ioloop
Hi Spooky23. Thank you. I'll be writing more posts on the topic over the
Christmas period, so keep checking back. I also find it more engaging to work
with data I actually find interesting.

------
sandwell
This is a great post, and overlaps somewhat with a project I'm working on
(scraping and classifying large amounts of text).

I know it's not the focus of the post, but was there any particular reason why
you went with the MultinomialNB classifier? I've been getting pretty good
results recently with LinearSVC which seems to be a lot faster and in my case
a bit more accurate too.

An interesting metric for a future post might be how your proxy compares with
scrapy + httpcache middleware.
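
For anyone curious about this comparison, swapping classifiers in a
scikit-learn pipeline is a one-line change, so it's cheap to try both. A
sketch with toy data (illustrative only; the subreddit corpus from the post
would slot in the same way):

```python
# Comparing MultinomialNB and LinearSVC on the same TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["def foo(): pass", "import numpy as np", "func main() {}", "go build ./cmd"]
labels = ["python", "python", "golang", "golang"]

for name, clf in [("MultinomialNB", MultinomialNB()),
                  ("LinearSVC", LinearSVC())]:
    # Only the final pipeline step changes between the two runs.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)
    print(name, pipe.predict(["import os"])[0])
```

On a real corpus you'd compare them with cross-validation rather than a single
prediction; LinearSVC often wins on speed for high-dimensional sparse text
features, which matches sandwell's experience above.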

------
zero-x
Looks very interesting. Have you looked at H2O.ai, which spits out classifiers
as Java code that can be wrapped in an ultra-low-latency API without caching?

~~~
garysieling
That looks really interesting. Do you have any experience working with it that
you can share?

