Tagger News takes a subset of HN articles, analyzes using ML, and applies tags (techcrunch.com)
308 points by var_explained on May 14, 2017 | 64 comments

Direct URL to project details: https://devpost.com/software/tagger-news

A few comments:

1. To other commenters, as with the HN Vue demo a week ago (https://news.ycombinator.com/item?id=14284877), the project is a technical proof-of-concept; the aesthetics aren't the primary focus.

2. The Algolia API is better for scraping because it allows for bulk requests, unlike the official API (my old 2014 script still works I think: https://github.com/minimaxir/get-all-hacker-news-submissions...)

3. How much time did it take to manually label the training/test set before training the RF classifier? Even with topic modeling for extrapolating tags, accurate labeling for 20,000 submissions is a task.
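
On point 2, here is a minimal sketch of what bulk paging through the Algolia API might look like. It assumes the public HN Search endpoint `search_by_date` with the `tags`, `hitsPerPage`, and `numericFilters` parameters, and a page-size cap of 1000; the function names are hypothetical and this is not taken from the linked script. It only builds URLs and reads a parsed JSON response, so the actual HTTP call is left out.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public Algolia-backed HN Search API.
ALGOLIA_SEARCH_BY_DATE = "https://hn.algolia.com/api/v1/search_by_date"


def build_page_url(before_ts=None, hits_per_page=1000):
    """Build a URL for one page of stories, newest first.

    before_ts (hypothetical cursor): only return stories created
    strictly before this Unix timestamp.
    """
    params = {"tags": "story", "hitsPerPage": hits_per_page}
    if before_ts is not None:
        params["numericFilters"] = f"created_at_i<{before_ts}"
    return f"{ALGOLIA_SEARCH_BY_DATE}?{urlencode(params)}"


def next_cursor(response):
    """Return the oldest timestamp in a parsed page, or None if the page is empty."""
    hits = response.get("hits", [])
    if not hits:
        return None
    return min(hit["created_at_i"] for hit in hits)
```

The idea would be to fetch `build_page_url()`, pass the parsed JSON to `next_cursor`, and repeat with `build_page_url(before_ts=cursor)` until the cursor comes back as `None`.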

One of the devs here.

1. That's the way we were thinking about it :)

2. Oh, excellent! We hadn't found that or we'd have used it, and we'll start working with it.

3. Tomorrow I'm going to blog about how we approached the machine learning. Short version: we manually came up with regular expressions to classify a training set based on titles. The idea is that when we experimented with manual annotations on titles, the vast majority of the time we were looking for only a few key words. There's no question that this adds biases and will not be entirely accurate, but manual inspection convinced us it was a good enough approach for our hackathon, and most of the articles we identified with the resulting algorithm would not have been found by the title regex alone.

You can see the table of regular expressions [here](https://github.com/dodger487/analyze_hn/blob/master/topics.c...) and a bunch of (pretty unstructured) analysis code [here](https://github.com/dodger487/analyze_hn/blob/master/hn-analy...).
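
For readers who want a feel for the title-regex labeling described above, a rough sketch follows. The tag names and patterns here are made up for illustration; the project's real table is the one in the linked repo, and `label_title` is a hypothetical helper, not the team's code.

```python
import re

# Hypothetical tag -> pattern table, for illustration only.
TAG_PATTERNS = {
    "Blockchain": re.compile(r"\b(bitcoin|blockchain|ethereum)\b", re.IGNORECASE),
    "Apple": re.compile(r"\b(apple|iphone|macbook)\b", re.IGNORECASE),
    "Microsoft": re.compile(r"\b(microsoft|azure)\b", re.IGNORECASE),
}


def label_title(title):
    """Return the sorted list of tags whose pattern matches the title."""
    return sorted(tag for tag, pattern in TAG_PATTERNS.items() if pattern.search(title))
```

Titles labeled this way would serve as the (noisy) training set; a classifier trained on the article text can then tag stories whose titles match no pattern at all.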

This is awesome! Congrats.


The firebase API is excellent. I have been using that to keep http://searchhn.com up to date in real time.

Also, BigQuery is updated every day with all comments and posts: https://bigquery.cloud.google.com/dataset/bigquery-public-da...

This is what I started with to update the Searchera (https://searchera.io) index which powers Searchhn

Oh, that was silly of us not to use BigQuery! I was just able to use that to download a full million stories (though we still would have had the rate-limiting step of downloading the articles).

During a hackathon it can be hard to tell when to keep searching for an easy solution like that, as opposed to going with something slow you know will work; sometimes the search turns out to be a dead end.

Thanks for the recommendations!

I've now blogged in more detail about building Tagger News. Check it out here: https://news.ycombinator.com/item?id=14343854

Hey mate, you should follow this guide step by step when you deploy a django app: https://docs.djangoproject.com/en/1.11/howto/deployment/chec...

BTW, congrats on the project, well done!

The Awful Reign of the Red Delicious (2014) (theatlantic.com) is tagged 'Microsoft' 'Apple'

Might wanna tweak that...

I think the biggest value proposition of this is the ability to do subreddit-like filtering on specific tags. As Hacker News grows I think dealing with the number of new submissions would become a bottleneck. During high-traffic times, new submissions sometimes drop off the first 'new' page in around 10 minutes. Of course there is more traffic during these times to upvote good content, but I'm not sure that is better than letting a smaller number of people have a longer period of time to filter a smaller collection of content.

Being able to filter out and automatically hide stories from the front page by tag would be lovely. There's just so much stuff I don't care about at all that gets pushed up the front page and takes up residence there, and I'm starting to wear out the "hide" link...

Not to mention a lot of the content has strayed away from technology and has gone towards mental health, personal growth, news, politics, etc.

I don't remember a time when all those topics were not common.

Maybe now that I'm older I notice it more than I did when I first joined.

The strength of subreddits is their communities.

Also, in 10 minutes there is no time to read every article (or any good long one, for that matter), so I'm not sure what that leaves on the front page.

If you or anyone else is interested in a 10 minute audio summary of Hacker News, let me know.


I've also posted this sign-up as a discussion if you want to leave any comments for everyone to see: https://news.ycombinator.com/item?id=14338456

There are also options like HN top 10: http://www.daemonology.net/hn-daily/

Or the ultra-cynical weekly antidote to HN: http://n-gate.com/

Strength of Hacker News is the network effect of a diverse, intelligent crowd. It would be hard to replace. A supplemental site tagging it and aiding search has value. Biggest problem I have searching HN, though, is Google mixing up stories and comments. The fix might be as simple as two domains that contain stories and comments cloned from HN, one domain for each, followed by Google Searches within those domains. Not sure if Google would automatically crawl it, though.

https://hn.algolia.com/ does the job pretty well!

Also, you can easily search HN in Chrome: start typing news.ycombinator.com, it will be ready for autocomplete, and it will say in grey text "Press [tab] to search HN". The results show up from algolia.

If only there were a way to make the default search order recency instead of popularity — most of my searching is before posting something, to make sure it hasn't already been posted.

Weirdly, the default values for the search url are slightly different from what the Chrome search populates them with. The 'address bar search' defaults 'dateRange' to 'all', whilst the script itself defaults 'dateRange' to 'last24h'. Does anyone know how the Chrome address bar search is implemented?

The search bar at the bottom of news.ycombinator.com also takes you to that site.

Yes, I used to be critical of the search but it's actually pretty good.

I do wish there was a way I could set the defaults. I almost always want to search comments and sort by date.
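
One workaround for the missing defaults might be a bookmark or custom search-engine entry that bakes the preferred parameters into the URL. The sketch below assumes the search page reads `query`, `sort`, `dateRange`, and `type` from the query string, with `byDate` and `comment` as accepted values; only `dateRange` and its `all`/`last24h` values are mentioned in this thread, the rest is my assumption. `hn_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def hn_search_url(query, sort="byDate", date_range="all", item_type="comment"):
    """Build a search URL with comment search and date ordering as the defaults.

    Parameter names other than dateRange are assumptions about the search page.
    """
    params = {
        "query": query,
        "sort": sort,
        "dateRange": date_range,
        "type": item_type,
    }
    return "https://hn.algolia.com/?" + urlencode(params)
```

For example, `hn_search_url("tagger news")` gives a link that should open a date-sorted comment search, if the parameter names above are right.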

The brand new https://hacker-search.net isn't bad either.

http://searchhn.com is a small demo we have been building at Searchera (https://searchera.io)

IMO the domain-to-user facet lookup is more useful than the tagging option. I am sure you can just deduce tags from that alone on 90% of the submissions. Good demo.

Thank you!

Congratulations to the team, although it seems the algorithm isn't very accurate, since this article[0] from 1997 was tagged with "Blockchain".

[0]: https://www.gnu.org/philosophy/right-to-read.html?source=tec...

Or maybe the algorithm is scary accurate and has correctly deduced that Richard Stallman is, in fact, Satoshi Nakamoto!

Somebody contact Newsweek!

As a sincere question, why did people downvote this comment?

I thought he was funny and I think lighthearted humor has value to it. It didn't seem snarky to me, did it to someone else?

Did it seem off-topic? It's a joke rather than useful information, but I'd argue that it is on-topic per the rules: "Anything that good hackers would find interesting."

Am I missing something?

It's a Reddit-esque comment that adds no value. (and at best, it's off topic where the topic is about an improved Hacker News)

There are instances where humorous comments can add value (e.g. irony/satire), but it's not common and hard to do properly.

Maybe we need a machine learning comment value predictor

Then nobody would need to visit the site at all, it'd be completely self-ranking and automated - imagine the productivity improvements in the valley? (Disrupting the web forum industry!)

> Someone somewhere takes this seriously, actually starts next unicorn

And I go home and cry into my beer. Again.

(Stupid ideas? I'm full of em! Execution on stupid ideas until they get enough VC capital to become obviously good ideas? I'm pretty lame at that... Anyone wanna be my "Executing Co-founder"?)

My impression is that the bar for humor is higher on Hacker News than most other places. Perhaps at a minimum the joke should be intellectually interesting enough that there is no need to lawyer up behind a claim that the joke is on topic. Here the nexus is formulaic:

    ?X is, in fact, Satoshi Nakamoto!

Chuck Norris, Paul Graham, or my dog Spot for ?X are each about equally humorous. I think this is because each is about as clever an intellectual move. That's not to say that a joke connecting Stallman to Nakamoto couldn't work on Hacker News. Just that it would probably require a lot more work, e.g. a better premise than "a hackathoned machine learning classifier might be the singularity". Even here it might have worked if the author had gone all in and backed up the claim with examples, anecdotes, and rationales that pushed the joke-telling art via absurdity. https://www.youtube.com/watch?v=itWxXyCfW5s

Note: While I'm replying to you, please note that I'm not claiming you've done anything like this or that you are doing these things. Rather, it's just your comment sparked these thoughts. That is all. =)

Since this topic is already fairly meta, and because of the nature of your question, I'll chime in here as well as to why I would normally down vote your comment in other threads.

"why did people downvote this comment?"

Any discussion of voting (outside of a few exceptions such as this) gets down voted quickly. Not only is it discouraged in the guidelines, it's also generally self-correcting. I've seen far too many comments that ask why they are down voted when they clearly have more votes up than down. In addition, the conversations in reply generally revolve around why people might be voting down, and whether that is wrong.

Basically, it creates a bunch of useless commentary for no good reason.

In addition to this, asking people why they voted down a comment is annoying. The goal of commentary should be to spark either conversation or thought. If it does neither, it's really not worth my trouble to explain why I down vote it. I vote down the comment because it is a bad comment, and not worthy of worthwhile discussion.

I've voted comments up that I disagree with because the discussions they've sparked were interesting and voted down comments I agree with because they don't honestly contribute to the active discussion and exchange of ideas.

Not everyone thinks this way. I'm sure people vote up what they agree with and vote down what they disagree with, without a care for the overall discussion, simply to fit an agenda. I admit I've done it in the past (I am not perfect, after all), and I've regretted it. But overall voting corrects itself, and frankly, it does not matter. Karma is representative of your value.

If you are that concerned about the karma of a comment, do not post. If people are voting down your comment and not replying, start by addressing the failings in your comment to spark proper discussion.

Blaming others (tripe like "people would rather vote down than explain where I am wrong") is weak and childish, and will get voted down without hesitation. HN should be better than that weak (non-existent?) rhetoric, and the moment you add that to a comment, you've lost.

It detracts from the conversation and from meaningful responses (i.e., why did the classifier classify it as blockchain?).

Humor is not allowed on HN.

I did a similar project a few months ago. It does automatic tagging and summarization of HN, largely using scipy and numpy. You can see it in action at http://hntop.org and the GitHub link is https://github.com/bexp/textai

Not to hijack, but this is similar to a small ML project a friend and I built. It takes news headlines from a bunch of sources and classifies them by common topic. We took a lot longer than a day to build it, though. ;)

It refreshes with new stories every few hours. You can check it out here: http://headlinr.herokuapp.com/

EDIT: click on the bubbles to see individual headlines. Also, here's the GitHub page: https://github.com/dgarrick/headliner

Personally, I find the visual weight that the tags have exceeds their value to me.

That page is to show off their work, which it does. The goal is probably not to replace news.ycombinator.com.

I would agree, for me the bright blue tag with the sharp box really makes it hard to scan the headlines without a ton of concentration ignoring the tags separately.

Agreed. Tag colours draw too much attention, particularly contrasted against visited links.

Custom tag groups could be a useful extension.

I find the white on light blue very unreadable.

Same first impression. No subtlety there

I think this is great, especially being able to click a tag and see a top 30 list of that tag.

Obvious suggestions that would make it usable as a primary HN interface:

• Login and voting (not sure the HN API supports this though)

• Tag suggestions to feed into the model

Link to try it out http://www.taggernews.com/

Here is my take on it from a few years ago. I tried to make it more like a magazine, fetching the article text and photo, and there is a section with only the articles that reached the top position. http://www.hnzine.com

Possibly off-topic, but: this was done at the "Disrupt NYC" hackathon, and somehow "add tags to HN" feels like about the least disruptive thing ever.

Wow, I was thinking of making something similar: an experimental HN fork where submissions are tagged (collaboratively) but without titles, as these are rarely useful to predict the content of an article. And of course there is also the convenience of categorization.

You are describing lobste.rs

This article on book publishing: https://news.ycombinator.com/item?id=14334845 is tagged with "blockchain" only. Any idea why?

Tried http://www.taggernews.com/tags/aws/ and didn't find any results. Is it because it is listing only a few tags for now?

This is what @twitter needs to do to make discovering worthwhile information better.

Hashtag spam makes #hashtags mostly useless as a method of discovery.

Using ML? Why not just use Bayesian filters?

You mean Naive Bayes? Because it can't account for interactions between the effects of multiple words.
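
A toy illustration of that limitation, as a sketch only: two binary "word present" features where the label depends purely on their interaction (XOR). The tiny Bernoulli Naive Bayes below is hand-rolled for the example, with hypothetical function names, and has nothing to do with the project's code. Because each word on its own is equally common in both classes, the model can say nothing better than 50/50 for every input.

```python
def train_bernoulli_nb(samples):
    """Fit a Bernoulli Naive Bayes model.

    samples: list of (features, label) with features a tuple of 0/1 and label 0/1.
    Returns {label: (prior, [P(feature_i = 1 | label), ...])}.
    """
    n_features = len(samples[0][0])
    model = {}
    for label in (0, 1):
        rows = [x for x, y in samples if y == label]
        prior = len(rows) / len(samples)
        # Laplace-smoothed estimate of P(feature_i = 1 | label).
        probs = [
            (sum(row[i] for row in rows) + 1) / (len(rows) + 2)
            for i in range(n_features)
        ]
        model[label] = (prior, probs)
    return model


def posterior_positive(model, x):
    """Return P(label = 1 | x) under the independence assumption."""
    scores = {}
    for label, (prior, probs) in model.items():
        score = prior
        for xi, p in zip(x, probs):
            score *= p if xi else (1 - p)
        scores[label] = score
    return scores[1] / (scores[0] + scores[1])


# Label is 1 only when exactly one of the two words is present.
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
```

Training on `XOR_DATA` and calling `posterior_positive` on each of the four inputs yields the same posterior every time, since the per-word likelihoods are identical across classes; a model that can represent interactions is needed to separate them.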

Just add some "magic", e.g. per response analysis and inter-response rule-based system.

If you keep adding "magic" and doing careful research on what magic works and what doesn't, you end up roughly with the modern field of machine learning.

Random forests are a method that's often effective in taking into account many interactions among high dimensional data.

Expert Systems "magic" predates neural networks by decades, being predictable and giving verifiable results (unlike most ML models).

Can you also add a way to add TLDRs to everything, pls? :D

Stop trying to remake Hacker News. It's pretty much perfect the way it is.

Maybe it's not perfect for everyone.

I, for one, do like the idea of tagging stuff, since I might favorite a lot of stories but then years later it's hard to find a particular one, even if you do remember the general topic.

Tags for this would be really helpful for me, ergo it's not perfect for me, ergo it's good someone else is trying to make it better.

Since it's not affecting the original site, why would you want to stop them?

Rebuilding HN in some fashion is a yearly thing. I rebuilt it nearly 5 years ago, and even then people told me it's a yearly occurrence haha.


It's difficult to find a more mobile-unfriendly page nowadays than Hacker News.

Materialistic on Android makes it decent, but would still prefer something like Boost for Reddit for comments

This is the best joke I've ever read on HN.
