

An algorithm for generating automatic hashtags - shlomib
http://blog.swayy.co/post/61672584784/an-algorithm-for-generating-automatic-hashtags

======
Udo
I think we - and to some degree the Twitter platform itself - are using hash
tags redundantly, and this algorithm is just a manifestation of this
redundancy that is killing data quality. These pathological tweets do tend to
look like the example sentence provided, maybe even more extreme:

    
    
      #Swayy #Launches Into Public #Beta To Curate #Content For Your #SocialMedia Audience
    

Now, all of these words would be reachable with a normal search, so why do we
over-tag everything? Are users really going to see what other Tweets have been
recently tagged #Content? It makes even less sense with product names like
#Swayy.

A more reasonable approach would be to tag things that are _not_ part of the
sentence itself:

    
    
      We're launching into public beta to curate content for social media! #Swayy
    

Or inline, on occasion, to express that you're taking part in a meme:

    
    
      Dear gods, #IHateIt when it's cold outside
    

We don't need algorithmic help to find hash tags in these cases either, and
I'm arguing that automatically converting every third word into a hash tag
doesn't do Twitter feeds any good, quality-wise.

~~~
sp332
#Content means you're talking about content, and it's not just a word you used
in passing. Similarly, #launches adds your tweet to the conversation people
are having about launches, instead of making people sort through missile
launches, some guy who launches into a story, and misspelled lunches to get to
relevant tweets.

------
danmaz74
The tech is nice, but adding hashtags like this does, in my opinion, more harm
than good. Hashtagifying common words rarely makes sense, except in those few
cases when there is a specific conversation going on about that word for some
special reason.

Users ask us all the time to add an automatic hashtagging feature to
hashtagify.me, but I'm resisting those requests because bad hashtagging makes
hashtags less useful. It would be great to find an algorithm that (almost)
always finds hashtags that are really relevant, but until that happens it's
better to ask users to make a little effort.

[edited for clarity]

------
AznHisoka
For those people who are interested in topic/article classification and NLP,
Twitter can be a gold mine, especially hashtags. If you gather the hashtags
for a million articles, you pretty much have a co-occurrence database. Now you
can mine that data and see which hashtags are common if you have "Google
Panda" in your title, for instance, or which hashtags are commonly used with
#seo. Hashtags are basically structured semantic data, if you look at them in
aggregate. A good tool for doing this is Solr or Elasticsearch. Simply import
all the hashtags for a bunch of articles into the index, do a faceted search
for a specific hashtag or keyword, and you'll get the top 10 hashtags most
closely associated with that keyword.
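
A minimal in-memory sketch of the co-occurrence idea described above, using
plain Python counters instead of Solr or Elasticsearch facets. The sample
articles and the function names (`build_cooccurrence`, `related`) are made up
for illustration; a real index would hold far more data.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(tagged_articles):
    """Map each hashtag to a Counter of the hashtags seen alongside it."""
    cooc = defaultdict(Counter)
    for tags in tagged_articles:
        # Count each pair once per article, ignoring case and duplicates.
        unique = {t.lower() for t in tags}
        for a, b in permutations(unique, 2):
            cooc[a][b] += 1
    return cooc


def related(cooc, tag, n=10):
    """Return up to n hashtags that most often appear together with tag."""
    counts = cooc.get(tag.lower(), Counter())
    return [t for t, _ in counts.most_common(n)]


# Hypothetical sample data: one list of hashtags per article.
articles = [
    ["#seo", "#google", "#panda"],
    ["#seo", "#google"],
    ["#seo", "#content"],
    ["#startup", "#content"],
]

cooc = build_cooccurrence(articles)
print(related(cooc, "#seo", n=1))
```

This plays the role of the faceted search: the counter for `#seo` is the
"facet", and `most_common` returns its top entries.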

~~~
AznHisoka
Of course, the applications of this go beyond hashtags. You can apply this
towards topic classification of content.

------
joosters
A strange example. Does the author really think hashtagifying the word
'content' has improved the tweet in some way? Do they expect people to be
searching Twitter for #content and getting some useful results?

~~~
shlomib
[https://twitter.com/search?q=%23Content](https://twitter.com/search?q=%23Content)

~~~
dtauzell
You can even get a #content t-shirt:
[http://bunnyaimee.tumblr.com/post/61678190440/we-opted-for-a...](http://bunnyaimee.tumblr.com/post/61678190440/we-opted-for-a-nice-pose-haha-buffwoto)

------
stephen_mcd
Here's a terrible one I wrote years ago for a Twitter bot:

[https://github.com/stephenmcd/babbler/blob/master/babbler/ta...](https://github.com/stephenmcd/babbler/blob/master/babbler/tagging.py)

If I recall, it simply extracts non-dictionary words from the outgoing tweet,
then actually queries the Twitter API itself to gauge the popularity of each
potential hashtag, only using the most popular.
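
A rough sketch of the approach as described in this comment, not the linked
code itself. The word list and the `fake_popularity` lookup are hypothetical
stand-ins: a real version would use a full dictionary and query a remote
service for popularity, which is left out here.

```python
import re

# Assumption: a tiny stand-in word list; a real dictionary would be far larger.
DICTIONARY = {
    "launches", "into", "public", "beta", "to",
    "curate", "content", "for", "your", "audience",
}


def fake_popularity(word):
    """Hypothetical stand-in for a remote popularity lookup."""
    counts = {"swayy": 120, "socialmedia": 5000}
    return counts.get(word, 0)


def pick_hashtags(text, dictionary, popularity, limit=1):
    """Keep non-dictionary words and return the most popular as hashtags."""
    words = re.findall(r"[A-Za-z0-9_]+", text)
    candidates = {w.lower() for w in words if w.lower() not in dictionary}
    # Most popular first; ties are broken alphabetically for stable output.
    ranked = sorted(candidates, key=lambda w: (-popularity(w), w))
    return ["#" + w for w in ranked[:limit] if popularity(w) > 0]


tweet = ("Swayy launches into public beta to curate content "
         "for your SocialMedia audience")
print(pick_hashtags(tweet, DICTIONARY, fake_popularity, limit=2))
```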

------
AnSavvides
Although very basic, this is really nice - the wonders of NLTK! You explain
your algorithm in very simple terms so kudos for that.

It might be a good idea to put this in a GitHub repository rather than a
simple gist. I am sure there will be plenty of people (including myself)
interested in contributing, and that is much easier to do with a repository
than with a gist :)

~~~
shlomib
Good idea! Actually I'm thinking about creating some open source “social-NLP”
python package. What do you think?

~~~
AnSavvides
That would be really cool!

------
yankoff
Cool stuff. I have been playing with NLTK and gensim myself recently in order
to solve a similar problem. But I want to combine unsupervised topic modeling
(LDA) with some supervised learning algorithm (probably naive Bayes).
Struggling with LDA at this point, but pretty interesting stuff.

------
louyang
A friend and I cofounded another content curator. We also do tagging, but in a
different fashion: [http://wintria.com](http://wintria.com)

Nice article, too. I don't think you guys take enough advantage of the article
body, though.

~~~
shlomib
Obviously this is NOT the algorithm we use in our product (Swayy)...

~~~
louyang
I didn't say it was, what was that remark about?

------
solvemenow
Build a classifier for these generated hashtags. Give each one a score and put
that score into a feedback loop to build better hashtags.
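
One possible reading of this suggestion, sketched with a smoothed acceptance
rate rather than a trained classifier. The `HashtagFeedback` class and the
notion of users accepting or rejecting a suggested tag are assumptions made
for illustration; the comment does not say where the feedback would come from.

```python
from collections import defaultdict


class HashtagFeedback:
    """Track how often each suggested hashtag is kept by users."""

    def __init__(self):
        self.shown = defaultdict(int)
        self.accepted = defaultdict(int)

    def record(self, tag, accepted):
        """Record one suggestion and whether the user kept it."""
        self.shown[tag] += 1
        if accepted:
            self.accepted[tag] += 1

    def score(self, tag):
        """Laplace-smoothed acceptance rate; tags never shown score 0.5."""
        kept = self.accepted.get(tag, 0)
        seen = self.shown.get(tag, 0)
        return (kept + 1) / (seen + 2)

    def filter(self, candidates, threshold=0.5):
        """Keep only candidates whose score meets the threshold."""
        return [t for t in candidates if self.score(t) >= threshold]


fb = HashtagFeedback()
for _ in range(3):
    fb.record("#content", accepted=False)
for _ in range(2):
    fb.record("#swayy", accepted=True)

print(fb.filter(["#content", "#swayy", "#beta"]))
```

Tags that users keep rejecting drop below the threshold and stop being
suggested, while unseen tags still get a chance.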

------
btbuildem
What's the point exactly?

~~~
icedog
So later generations will have good material for ridiculing us.

~~~
bowerbird
that time can't come soon enough for me. :+) #fedupwiththehashtagcrap

-bowerbird

------
quarterto
#Eww. I #hate it when #hashtags are used inline.

~~~
shlomib
So just filter the hashtags from the output and put them at the end of the
original title :P
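
A small sketch of that post-processing step. It assumes the original title can
be recovered by dropping the `#` characters, which will not hold if the tagger
also merged words (for example "Social Media" into "#SocialMedia"); the
function name is made up for illustration.

```python
import re

HASHTAG = re.compile(r"#(\w+)")


def move_hashtags_to_end(tagged_title):
    """Strip inline '#' marks and append the hashtags after the title."""
    tags = []
    for match in HASHTAG.finditer(tagged_title):
        tag = "#" + match.group(1)
        if tag not in tags:  # keep first occurrence, preserve order
            tags.append(tag)
    plain = HASHTAG.sub(r"\1", tagged_title)
    return plain if not tags else plain + " " + " ".join(tags)


example = move_hashtags_to_end(
    "#Swayy #Launches Into Public #Beta To Curate #Content "
    "For Your #SocialMedia Audience"
)
print(example)
```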

~~~
zeckalpha
Or just make them links and hide the #.

------
zeckalpha
f (x) = x + " #yolo"

------
jheriko
#hashtagsarelame :)

~~~
krapp
#honestly_though_hacker_news_could_use_them_or_at_least_something_like_them
#meta #hackernews

------
sv123
#tinytext

