
Algorithmic tagging of Hacker News or any other site - doppenhe
http://blog.algorithmia.com/post/86295023534/algorithmic-tagging-of-hackernews-or-any-other-site
======
Goosey
Looking at the HN demo, I'm impressed. There are definitely relevant tags
being generated. Unfortunately, there are also some noisy tags which clutter
the results. Taking one example, the post "DevOps? Join us in the fight
against the Big Telcos" is given the tags "phone tools sendhub we're news
experience customers comfortable"; I would say that "we're" is unarguably
noise. Another example: "Questions for Donald Knuth", with the tags "computer
programming don i've knuth taocp algorithms i'm", where I would call out
"i've" and "i'm".

There are other words in both examples that I personally would not use as
tags, but I can't really say they would be universally useless. I think a
vast improvement could be made just by having a dictionary blacklist filled
with things like these; from this tiny sampling, contractions seem to be a
big loser.
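
The blacklist idea can be sketched in a few lines, assuming a hand-curated stoplist; the entries below are hypothetical, drawn only from the examples quoted above:

```python
# Hand-curated noise blacklist; these entries are just a sample taken
# from the tags quoted in the comment above, not a complete dictionary.
STOPLIST = {"we're", "i've", "i'm", "don", "it's"}

def filter_tags(tags):
    """Keep only tags not present in the blacklist (case-insensitive)."""
    return [t for t in tags if t.lower() not in STOPLIST]

tags = "phone tools sendhub we're news experience customers comfortable".split()
print(filter_tags(tags))
# → ['phone', 'tools', 'sendhub', 'news', 'experience', 'customers', 'comfortable']
```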

~~~
doppenhe
Agreed. Actually, we could turn up the number of iterations the LDA algorithm
runs and that would clean those up, but it affects performance. This was just
a quick and dirty example (built with an expectation of high traffic).

You can also seed LDA with a whitelist of words, which we didn't do either;
again, all in the name of a quick and dirty solution to show.

Glad you liked it!

~~~
ppod
Try using tf-idf instead of raw word frequencies.
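
For anyone unfamiliar, here is a from-scratch tf-idf sketch (standard library only) showing why it helps with the noise problem: a word that appears in every document gets an idf of zero and drops out of the tag candidates.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each word of each tokenized document by tf * idf."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    scored = []
    for doc in docs:
        tf = Counter(doc)
        scored.append({w: (c / len(doc)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scored

docs = [["erlang", "code", "style", "code"],
        ["python", "code", "tags"],
        ["erlang", "process", "code"]]
scores = tf_idf(docs)
# "code" appears in all three documents, so it scores 0 everywhere,
# while "style" and "process" score highest in their own documents.
```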

~~~
doppenhe
We're not using raw word frequencies, but
[http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation).

Didn't know about tf-idf; thanks for the tip.

~~~
hnriot
It's hard to imagine someone knowing LDA without also knowing about TF.IDF
(it's a dot product, not a hyphen).

~~~
GFK_of_xmaspast
The OP is pushing "Algorithmia", whose manifesto is here:
[http://blog.algorithmia.com/post/75680476188/algorithm-development-is-broken](http://blog.algorithmia.com/post/75680476188/algorithm-development-is-broken)

and this doesn't really strike me as much of a victory for the idea that the
implementation of an algorithm is the sticking point in practice.

~~~
doppenhe
We are just showing the versatility of the platform through a real-world use
case. LDA is hard to implement and scale for the untrained, the same as many
other machine learning, optimization, graph traversal, etc. algorithms. What
we are building is a crowd-sourced, generalized API where all these
algorithms can be combined and used together to really make any application
smarter.

The demo we show here is a version of how we used our platform to generate
tags for all entries in our API by combining algorithms that already existed
in Algorithmia (modified for performance over quality due to the volume that
HN would bring).

Cheers.

~~~
s0x
It's only hard because most libraries I've seen have so little documentation
available. It's simple once you understand the library. We need people
picking these libraries up, implementing them in fun weekend projects,
documenting their work and code, and publishing it for everyone to learn
from.

------
vhf
Very interesting.

I have been doing some research into automatic tagging lately, and I found
several Python projects coming close to this goal:
[https://pypi.python.org/pypi/topia.termextract/](https://pypi.python.org/pypi/topia.termextract/),
[https://github.com/aneesha/RAKE](https://github.com/aneesha/RAKE),
[https://github.com/ednapiranha/auto-tagify](https://github.com/ednapiranha/auto-tagify)

but none of them is satisfying, whereas Algorithmic Tagging of HN looks
pretty good.

I have been trying to implement a similar feature for
[http://reSRC.io](http://reSRC.io), to automagically tag articles for easy
retrieval through the tag search engine.

~~~
doppenhe
Got your email; I will be responding later today. We enable automated tagging
for any site directly from our API, no need to implement anything else.

------
sytelus
Well, it's not that easy. The algorithms are very primitive and too full of
noise to be useful.

For example, try this on restaurant reviews like
[http://www.yelp.com/biz/el-gaucho-seattle](http://www.yelp.com/biz/el-gaucho-seattle).
I get these tags:

steak reviews seattle food service gaucho restaurant review

Not useful, right?

The current state of the art would use much more sophisticated NLP for
generating POS tags and use sentiment analysis. For example, check out MSR
Splat at
[http://research.microsoft.com/en-us/projects/msrsplat/default.aspx](http://research.microsoft.com/en-us/projects/msrsplat/default.aspx).

------
Theodores
This does well on the 'T Shirt test' on some sites, e.g.
[http://www.riverisland.com/men/t-shirts--vests](http://www.riverisland.com/men/t-shirts--vests)

This could be really useful in ecommerce for creating search keywords for
category pages. The noise in the results doesn't matter much; so long as it
gets 'T-Shirt' and someone searches for 'T-shirt', all is well and good.

Are you looking to plug what you have into something such as the Magento
e-commerce platform? The right clients could pay proper money for this
functionality. It is something I would quite like to speak to you about.

~~~
doppenhe
definitely interested please email me at diego at algorithmia dot com

------
EGreg
LDA is very impressive. But it might be better to have an iterative algorithm
that forms a linear-algebraic basis from several tags (and lets people add
more tags as vectors into the mix). Then every time people upvote something,
you update their interests (points in the linear-algebraic space), and every
time an article gets upvoted you update ITS tags.

After a while the system converges to a very useful structure: new members
can see correctly tagged articles, and the system learns their interests by
itself.

Do you know of anything like this already existing?
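
The update rule being described might look something like this rough sketch; the learning rate and the three tag dimensions are arbitrary choices for illustration, not anything from the demo:

```python
# Users and articles live in the same tag space; each upvote nudges
# the user's interest vector and the article's tag vector toward each
# other. LR = 0.1 is an arbitrary choice.
LR = 0.1

def upvote(user_vec, article_vec, lr=LR):
    """Return updated (user, article) vectors after one upvote."""
    new_user = [u + lr * (a - u) for u, a in zip(user_vec, article_vec)]
    new_article = [a + lr * (u - a) for u, a in zip(user_vec, article_vec)]
    return new_user, new_article

user = [0.0, 0.0, 1.0]       # hypothetical dims: (erlang, python, devops)
article = [1.0, 0.0, 0.0]
for _ in range(50):
    user, article = upvote(user, article)
# After repeated upvotes the two vectors converge toward each other.
```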

~~~
frik
slashdot.org tag system ?

~~~
doppenhe
Should be easy enough to implement; our API will be in public beta very soon
and I can show you how to build it.

~~~
EGreg
Please do. Can you contact me at
[http://qbix.com/about](http://qbix.com/about)? Shoot me an email there,
please.

------
dlsym
This has real poetic potential:

    "Erlang and code style"

    process erlang undefined
    file write data
    true code

------
zokier
After watching "Enough Machine Learning to Make Hacker News Readable
Again"[1], I thought of a recommendation-engine/machine-learning-based
linkshare/discussion system (e.g. HN/reddit style). Your frontpage would be
continuously shaped by your up/down-votes. I'm not sure if the same could be
applied to comment threads too, essentially creating automatic moderation.
Algorithmic tagging would certainly be useful for that kind of site.

[1]
[https://news.ycombinator.com/item?id=7712297](https://news.ycombinator.com/item?id=7712297)

------
NKCSS
Not too impressed, to be honest; singular/plural forms are not treated
equally. I'm not familiar with LDA, but I've written an LSA implementation in
the past, and it did a lot better than what is shown here.

~~~
lugg
I'm sure there is an amusing LSD joke in there somewhere.

------
NicoJuicy
Lol, this seriously took me by surprise. I'm currently developing a
HackerNews-style site with tags (you can self-host it). I quickly generated
this Google Form, if you are interested in being a beta user in the near
future:

[https://docs.google.com/forms/d/1UeSD11hrjwhsVbbPiv63VZBrEcz...](https://docs.google.com/forms/d/1UeSD11hrjwhsVbbPiv63VZBrEczzG5Tr4lwkuKAzY8A/viewform?usp=send_form)

PS. Screenshot included, and it's already in alpha at a company with 100
users.

~~~
zokier
Do you know [https://lobste.rs/](https://lobste.rs/), which is essentially HN
with tags?

~~~
NicoJuicy
Yeah, I know lobste.rs. But I go much further than HN or lobste.rs (not
limiting posts to only URLs or text is just one feature). It's more a
"Document Management System" with HN influence for larger businesses (or
public websites) with (> 30 users) than an HN copy.

Call it lobste.rs 2.0.

------
snippyhollow
I did that in 2012 for a pet project with a friend
[https://github.com/SnippyHolloW/HN_stats](https://github.com/SnippyHolloW/HN_stats)

Here is the trained topic model (Nov. 30, 2012) with only 40 topics (mainly
for file size):
[https://dl.dropboxusercontent.com/u/14035465/hn40_lemmatized.ldamodel](https://dl.dropboxusercontent.com/u/14035465/hn40_lemmatized.ldamodel)

You can load it with Python:

    from gensim.models import ldamodel
    lda = ldamodel.LdaModel.load("hn40_lemmatized.ldamodel")
    lda.alpha = [lda.alpha for _ in range(40)]  # because there was a change since 2012
    lda.show_topics()

Now, if you can figure out what this file is:
[https://dl.dropboxusercontent.com/u/14035465/pg40.params](https://dl.dropboxusercontent.com/u/14035465/pg40.params)
I'll buy you a beer next time you're in Paris or I'm in the Valley. ;-)

~~~
doppenhe
I'll take a look, thanks!

~~~
sitkack
[0.006791154078718692, 0.004654721825624361, 0.004632114646875695,
0.011976800134937546, 0.01799155954072435, 0.009181094647452455,
0.0345230793213232, 0.005232498042562552, 0.011402661654834138,
0.009024477282685779, 0.007034922780349653, 0.0031922239118904504,
0.007097905854058182, 0.004999249488505551, 0.016499595508879424,
0.024729974527642036, 0.004985711413178751, 0.03119529793092641,
0.015437847520669401, 0.2948424084650949, 0.06912364384956156,
0.004776347484051836, 0.0893067258386264, 0.018226129712679208,
0.0315656235097838, 0.006267920316323028, 0.01240414536928756,
0.005343403072840281, 0.006566103139036195, 0.009403510178615212,
0.009875448474490003, 0.0038449507757111973, 0.007531241580033292,
0.0014680865916836157, 0.00767397135040071, 0.002118254148078463,
0.02605710351719099, 0.034716581721697254, 0.002314474872742398,
0.12599103592023264]

~~~
snippyhollow
Yes but what does it mean?

~~~
waterside81
Looks like cosine similarity output

~~~
snippyhollow
It's PG's parameters for a naive Bayes model based on these LDA topics,
learned by taking his comments on HN as upvotes for the articles content.

------
andrew_gardener
After this page received 42 comments, I ran their tagging algorithm on it and
got:

tags tagging hours link doppenhe reply ago lda

looks pretty promising!

------
gibrown
LDA/topic modeling is interesting stuff. I always feel like the way this data
gets surfaced as "tags" is very ineffective. Any non-tech person would look
at this and generally be confused. So this item is triggering my rants
against tagging:

- Tagging is like trying to predict the future. What word will help some
future person get to this content?

- Tagging often tries to fill the hole left by bad search.

- There is no evaluation method to measure how good a set of tags is.

- Tags make very bad UI clutter.

Some of these points are related to encouraging users to tag content, but
auto-tagging also seems problematic.

To me something more along the lines of entity extraction is more useful
because it is a well defined problem, and can be used to improve a lot of
other applications.

~~~
sitkack
It seems like you would want to run k-means over the comments and the tags to
pull out semantically meaningful words for tags, and then reduce the total
number of tags over the corpus. Then, say, use Wikipedia to generate an
automatic taxonomy where those extracted words occur.
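
A toy k-means sketch over made-up 2-D "word vectors" (real tag vectors would come from co-occurrence counts or an embedding); it initializes from the first k points to keep the demo deterministic, where real k-means would use random or k-means++ initialization:

```python
def kmeans(points, k, iters=10):
    """Cluster 2-D points; return the final k cluster centers."""
    centers = points[:k]  # deterministic init for the demo
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl
                   else centers[i] for i, cl in enumerate(clusters)]
    return centers

words = [(0, 0), (1, 0), (0, 1), (1, 1),          # one tight group
         (10, 10), (11, 10), (10, 11), (11, 11)]  # another tight group
print(kmeans(words, 2))
# → [(0.5, 0.5), (10.5, 10.5)]
```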

~~~
doppenhe
We do have k-means in our API as well; this wouldn't be hard to do.

------
NicoJuicy
I like this project (I am creating something like this, so I'm pretty
serious).

But doesn't the auto-tagging feature make too much noise for a business use
case? For example, it tags an article about Amazon and includes Google in the
tags. White-listing words wouldn't fix this (Google is a whitelisted word if
Amazon is).

I don't know about LDA, though. Perhaps a proper tag administration would fix
this, but then you'd have to remove tags on the go.

~~~
doppenhe
would love to chat more diego at algorithmia dot com.

------
platypii
Direct link to HN with tags:
[http://algorithmia.com/demo/hn](http://algorithmia.com/demo/hn)

~~~
EGreg
I always wondered -- how are some sites able to get an up-to-date mirror of
HN, when HN blocks usage of its API after a while?

Are they using some alternative API that was blessed by HN?

~~~
doppenhe
We just use their RSS.

~~~
EGreg
And they don't block that after a while?

How does iHackerNews show all the comments and everything?

Where are these RSS feeds? I doubt that's how "the pros" do it.

~~~
maxerickson
There is also an API:

[https://hn.algolia.com/api](https://hn.algolia.com/api)

I would think lots of apps are still scraping pages.

------
nopal
Has anyone seen Open Calais [1]? It does tagging and categorization. It's been
around for years and seems pretty powerful. It's a bit lower-level than
Algorithmia (not href aware), but it seems more powerful, and a system like
Algorithmia could be built on it.

[1] [http://www.opencalais.com/about](http://www.opencalais.com/about)

------
doczoidberg
It does not work so well with German sites. Is there no blacklist for
too-generic terms in languages other than English?

------
shawabawa3
Doesn't seem to handle pdfs properly. For the mtgox link it comes up with

> stream rotate type/page font structparents endobj obj endstream

~~~
doppenhe
In the demo version, if we can't scrape the text from the HTML we can't
really run the topic analysis against it. PDFs, images, etc. won't work.

~~~
r00fus
PDF is easy to support if you use pdftk. You can even simply get the first
page for large docs.

[http://www.pdflabs.com/tools/pdftk-server/](http://www.pdflabs.com/tools/pdftk-server/)

~~~
doppenhe
thanks for the tip.

------
pjbrunet
I signed up. Not sure if I would use it, but the Algorithmia concept is
pretty interesting.

------
draz
@doppenhe - any hunch as to how well it would work on transcripts?

~~~
doppenhe
You can try it yourself at the bottom of the blog post, or you can send me a
URL and I can try a bunch for you: diego at algorithmia dot com or @doppenhe.

------
vincentbarr
error: failed to find worker for algorithm

~~~
doppenhe
HN took us down for a second. Back up and running. Thanks for reporting.

------
hnriot
Maybe also take a look at AlchemyAPI

------
justplay
looks cool.

