
Tagger News takes a subset of HN articles, analyzes using ML, and applies tags - var_explained
https://techcrunch.com/2017/05/14/building-a-smarter-hacker-news/
======
minimaxir
Direct URL to project details:
[https://devpost.com/software/tagger-news](https://devpost.com/software/tagger-news)

A few comments:

1\. To other commenters, as with the HN Vue demo a week ago
([https://news.ycombinator.com/item?id=14284877](https://news.ycombinator.com/item?id=14284877)),
the project is a technical proof-of-concept; the aesthetics aren't the primary
focus.

2\. The Algolia API is better for scraping because it allows for bulk
requests, unlike the official API (my old 2014 script still works, I think:
[https://github.com/minimaxir/get-all-hacker-news-submissions...](https://github.com/minimaxir/get-all-hacker-news-submissions-comments))

3\. How much time did it take to manually label the training/test set before
training the RF classifier? Even with topic modeling for _extrapolating_ tags,
accurate labeling for 20,000 submissions is a task.
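
For point 2, a minimal sketch of bulk retrieval through the Algolia HN Search
API, not the 2014 script linked above. It assumes the `search_by_date` endpoint
accepts `hitsPerPage` values up to 1000 and a `numericFilters` clause on
`created_at_i`; the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Public Algolia endpoint that returns items newest-first.
ALGOLIA_BASE = "https://hn.algolia.com/api/v1/search_by_date"


def build_url(page=0, hits_per_page=1000, before_ts=None):
    """Build a URL that requests one large page of stories."""
    params = {"tags": "story", "hitsPerPage": hits_per_page, "page": page}
    if before_ts is not None:
        # Assumption: restrict to stories created before this Unix timestamp,
        # which lets a scraper walk backwards through time in big chunks.
        params["numericFilters"] = f"created_at_i<{before_ts}"
    return ALGOLIA_BASE + "?" + urllib.parse.urlencode(params)


def fetch_page(url):
    """Download one page and return its list of story records."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["hits"]
```

Each call to `fetch_page(build_url(...))` would return many stories at once,
whereas the official API serves one item per request.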

~~~
var_explained
One of the devs here.

1\. That's the way we were thinking about it :)

2\. Oh, excellent! We hadn't found that or we'd have used it, and we'll start
working with it.

3\. Tomorrow I'm going to blog about how we approached the machine learning.
Short version: we manually came up with regular expressions to classify a
training set based on _titles_. The idea is that when we experimented with
manual annotations on titles, the vast majority of the time we were looking
for only a few key words. There's no question that this adds biases and will
not be entirely accurate, but manual inspection convinced us it was a good
enough approach for our hackathon, and most of the articles we identified with
the resulting algorithm would not have been found by the title regex alone.

You can see the table of regular expressions
[here](https://github.com/dodger487/analyze_hn/blob/master/topics.csv)
and a bunch of (pretty unstructured) analysis code
[here](https://github.com/dodger487/analyze_hn/blob/master/hn-analysis.Rmd).
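
A rough sketch of what title-regex labeling can look like. The patterns and
topic names below are hypothetical stand-ins, not the contents of topics.csv,
and the project's own analysis code is in R rather than Python.

```python
import re

# Hypothetical patterns for illustration only; the real table is topics.csv.
TOPIC_PATTERNS = {
    "Blockchain": re.compile(r"\b(bitcoin|blockchain|ethereum)\b", re.IGNORECASE),
    "Machine Learning": re.compile(
        r"\b(machine learning|deep learning|neural net(work)?s?)\b", re.IGNORECASE
    ),
    "Security": re.compile(r"\b(vulnerabilit(y|ies)|malware|ransomware)\b", re.IGNORECASE),
}


def label_title(title):
    """Return every topic whose regex matches the title, in sorted order."""
    return sorted(topic for topic, pattern in TOPIC_PATTERNS.items() if pattern.search(title))


# Titles with at least one match become noisy training labels; a classifier
# trained on the article text can then tag stories whose titles match nothing.
```

Labels produced this way inherit whatever the keywords miss or over-match,
which is the bias mentioned above.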

~~~
searchhn
This is awesome! Congrats.

[https://github.com/HackerNews/API](https://github.com/HackerNews/API)

The firebase API is excellent. I have been using that to keep
[http://searchhn.com](http://searchhn.com) up to date in real time.

Also, BigQuery is updated every day with all comments and posts:
[https://bigquery.cloud.google.com/dataset/bigquery-public-da...](https://bigquery.cloud.google.com/dataset/bigquery-public-data:hacker_news)

This is what I started with to update the Searchera
([https://searchera.io](https://searchera.io)) index, which powers Searchhn.
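
For anyone curious how polling the Firebase API might look, here is a minimal
sketch. It uses the documented `maxitem.json` and `item/<id>.json` endpoints;
the polling logic and function names are assumptions for illustration, not how
Searchhn actually does it.

```python
import json
import urllib.request

BASE = "https://hacker-news.firebaseio.com/v0"


def item_url(item_id):
    """URL of a single story or comment in the official API."""
    return f"{BASE}/item/{item_id}.json"


def fetch_json(url):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def new_item_ids(last_seen_id, max_item_id):
    """IDs that appeared since the previous poll, oldest first."""
    return list(range(last_seen_id + 1, max_item_id + 1))


def poll_once(last_seen_id):
    """Fetch everything newer than last_seen_id, one request per item."""
    max_id = fetch_json(f"{BASE}/maxitem.json")
    items = [fetch_json(item_url(i)) for i in new_item_ids(last_seen_id, max_id)]
    return max_id, items
```

One request per item is fine for keeping an index current, but it is the reason
a full historical download is slow through this API.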

~~~
var_explained
Oh, that was silly of us not to use BigQuery! I was just able to use it to
download a full million stories (though we still would have had the
rate-limiting step of downloading the articles).

During a hackathon it can be hard to tell when to keep searching for an easy
solution like that, as opposed to going with something slow you know will
work; sometimes the search turns out to be a dead end.

Thanks for the recommendations!

------
robertelder
I think the biggest value proposition of this is the ability to do
subreddit-like filtering on specific tags. As Hacker News grows, I think
dealing with the number of new submissions would become a bottleneck. During
high-traffic times, new submissions sometimes drop off the first 'new' page in
around 10 minutes. Of course there is more traffic during these times to upvote
good content, but I'm not sure that is better than letting a smaller number of
people have a longer period of time to filter a smaller collection of content.

~~~
HiroshiSan
Not to mention that a lot of the content has strayed away from technology and
has gone towards mental health, personal growth, news, politics, etc.

~~~
rfrey
I don't remember a time when all those topics were not common.

~~~
HiroshiSan
Maybe now that I'm older I notice it more than I did when I first joined.

------
nickpsecurity
Strength of Hacker News is the network effect of a diverse, intelligent crowd.
It would be hard to replace. A supplemental site tagging it and aiding search
has value. Biggest problem I have searching HN, though, is Google mixing up
stories and comments. The fix might be as simple as two domains that contain
stories and comments cloned from HN, one domain for each, followed by Google
Searches within those domains. Not sure if Google would automatically crawl
it, though.

~~~
CaliforniaKarl
[https://hn.algolia.com/](https://hn.algolia.com/) does the job pretty well!

~~~
gnicholas
Also, you can easily search HN in Chrome: start typing news.ycombinator.com,
it will be ready for autocomplete, and it will say in grey text "Press [tab]
to search HN". The results show up from algolia.

If only there were a way to make the default search order recency instead of
popularity — most of my searching is before posting something, to make sure it
hasn't already been posted.

~~~
oneeyedpigeon
Weirdly, the default values for the search url are slightly different from
what the Chrome search populates them with. The 'address bar search' defaults
'dateRange' to 'all', whilst the script itself defaults 'dateRange' to
'last24h'. Does anyone know how the Chrome address bar search is implemented?
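
A small sketch of building a search URL with the parameters spelled out, so
the defaults don't matter. `dateRange` is the parameter discussed above; the
`query`, `sort=byDate`, and `type` names are assumptions based on the URLs the
search page appears to generate, so treat them as unverified.

```python
from urllib.parse import urlencode


def hn_search_url(query, sort="byDate", date_range="all", item_type="story"):
    """Build an hn.algolia.com URL with explicit sort order and date range."""
    params = {
        "query": query,
        "sort": sort,             # assumed value for newest-first ordering
        "dateRange": date_range,  # 'all' or 'last24h', as mentioned above
        "type": item_type,        # assumed: restrict results to stories
    }
    return "https://hn.algolia.com/?" + urlencode(params)


print(hn_search_url("tagger news"))
```

If those names hold, a URL like this could be saved as a custom search engine
in the browser to check for duplicates before posting.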

------
asymmetric
Congratulations to the team, although it seems the algorithm isn't very
accurate, since this article[0] from 1997 was tagged with "Blockchain".

[0]: [https://www.gnu.org/philosophy/right-to-read.html?source=tec...](https://www.gnu.org/philosophy/right-to-read.html?source=techstories.org)

~~~
georgemcbay
Or maybe the algorithm is scary accurate and has correctly deduced that
Richard Stallman is, in fact, Satoshi Nakamoto!

Somebody contact Newsweek!

~~~
thedevil
As a sincere question, why did people downvote this comment?

I thought he was funny and I think lighthearted humor has value to it. It
didn't seem snarky to me, did it to someone else?

Did it seem off-topic? It's a joke rather than useful information, but I'd
argue that it is on-topic per the rules: "Anything that good hackers would
find interesting."

Am I missing something?

~~~
minimaxir
It's a Reddit-esque comment that adds no value. (and at best, it's off topic
where the topic is about an improved Hacker News)

There are instances where humorous comments can add value (e.g. irony/satire),
but it's not common and hard to do properly.

~~~
bcjordan
Maybe we need a machine learning comment value predictor

~~~
bigiain
Then nobody would need to visit the site at all, it'd be completely self-
ranking and automated - imagine the productivity improvements in the valley?
(Disrupting the web forum industry!)

~~~
bcjordan
> Someone somewhere takes this seriously, actually starts next unicorn

~~~
bigiain
And I go home and cry into my beer. Again.

(Stupid ideas? I'm full of em! Execution on stupid ideas until they get enough
VC capital to become obviously good ideas? I'm pretty lame at that... Anyone
wanna be my "Executing Co-founder"?)

------
hntop
I did a similar project a few months ago. It does automatic tagging and
summarization of HN, largely using scipy and numpy. You can see it in action
at [http://hntop.org](http://hntop.org), and here is the GitHub link:
[https://github.com/bexp/textai](https://github.com/bexp/textai)

------
salmonfamine
Not to hijack, but this is similar to a small ML project a friend and I built.
It takes news headlines from a bunch of sources and classifies them by common
topic. We took a lot longer than a day to build it, though. ;)

It refreshes with new stories every few hours. You can check it out here:
[http://headlinr.herokuapp.com/](http://headlinr.herokuapp.com/)

EDIT: click on the bubbles to see individual headlines. Also, here's the
GitHub page:
[https://github.com/dgarrick/headliner](https://github.com/dgarrick/headliner)

------
rileymat2
Personally, I find that the visual weight of the tags exceeds their value to
me.

~~~
qznc
That page is to show off their work, which it does. The goal is probably not
to replace news.ycombinator.com.

------
shawkinaw
I think this is great, especially being able to click a tag and see a top 30
list of that tag.

Obvious suggestions that would make it usable as a primary HN interface:

• Login and voting (not sure the HN API supports this though)

• Tag suggestions to feed into the model

------
big_spammer
Link to try it out [http://www.taggernews.com/](http://www.taggernews.com/)

------
sasoon
Here is my take on it from a few years ago. I tried to make it more like a
magazine, with article text and photos, and there is a section with only the
articles that reached the top position.
[http://www.hnzine.com](http://www.hnzine.com)

------
egypturnash
Possibly off-topic, but: this was done at the "Disrupt NYC" hackathon, and
somehow "add tags to HN" feels like about the least disruptive thing ever.

------
pera
Wow, I was thinking of making something similar: an experimental HN fork where
submissions are tagged (collaboratively) _but_ without titles, as these are
rarely useful for predicting the content of an article. And of course there is
also the convenience of categorization.

~~~
sitkack
You are describing lobste.rs

------
RichardHeart
This article on book publishing:
[https://news.ycombinator.com/item?id=14334845](https://news.ycombinator.com/item?id=14334845)
is tagged with "blockchain" only. Any idea why?

------
the_arun
Tried
[http://www.taggernews.com/tags/aws/](http://www.taggernews.com/tags/aws/) and
didn't find any results. Is it because it is listing only a few tags for now?

------
Imagenuity
This is what @twitter needs to do to make discovering worthwhile information
better.

Hashtag spam makes #hashtags mostly useless as a method of discovery.

------
faragon
Using ML? Why not just use Bayesian filters?

~~~
var_explained
You mean Naive Bayes? Because it can't account for interactions between the
effects of multiple words.

~~~
faragon
Just add some "magic", e.g. per-response analysis and an inter-response
rule-based system.

~~~
var_explained
If you keep adding "magic" and doing careful research on what magic works and
what doesn't, you end up roughly with the modern field of machine learning.

Random forests are a method that's often effective in taking into account many
interactions among high dimensional data.
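
A toy illustration of the interaction point, under made-up data: a tag that
applies only when exactly one of two words is present (XOR). Bernoulli Naive
Bayes scores every input at 0.5 here, while a model that looks at the words
jointly separates the cases. The joint-count model is a stand-in for anything
that can represent interactions; it is not a random forest and not the
project's classifier.

```python
from collections import Counter

# Toy data: features are (word_a_present, word_b_present); label is their XOR.
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)] * 10


def naive_bayes_posterior(x, data=DATA):
    """P(label=1 | x) under Bernoulli Naive Bayes with add-one smoothing."""
    scores = {}
    for y in (0, 1):
        rows = [feats for feats, label in data if label == y]
        score = len(rows) / len(data)  # class prior
        for j, value in enumerate(x):
            # Each word is treated independently given the class.
            matches = sum(1 for feats in rows if feats[j] == value)
            score *= (matches + 1) / (len(rows) + 2)
        scores[y] = score
    return scores[1] / (scores[0] + scores[1])


def joint_posterior(x, data=DATA):
    """P(label=1 | x) from counts of the full word combination."""
    counts = Counter(label for feats, label in data if feats == x)
    return (counts[1] + 1) / (counts[0] + counts[1] + 2)


for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(naive_bayes_posterior(x), 3), round(joint_posterior(x), 3))
```

A decision tree that splits on one word and then the other can represent the
same pattern, which is the kind of structure a random forest builds on.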

~~~
faragon
Expert systems "magic" predates neural networks by decades, and it is
predictable and gives verifiable results (unlike most ML models).

------
magicmikexxl
Can you also add a way to add TLDRs to everything, pls? :D

------
rocky1138
Stop trying to remake Hacker News. It's pretty much perfect the way it is.

~~~
saganus
Maybe it's not perfect for everyone.

I, for one, do like the idea of tagging stuff, since I might favorite a lot of
stories but then years later it's hard to find a particular one, even if you
do remember the general topic.

Tags for this would be really helpful for me, ergo it's not perfect for me,
ergo it's good someone else is trying to make it better.

Since it's not affecting the original site, why would you want to stop them?

