
What we count is slightly different from what you see in the Twitter widgets (yes, those tweets come directly from Twitter). In our backend, we have a fairly conservative filter that matches a bag of phrases, such as "voted for barrack obama", "voted for pres obama", etc. The accuracy is over 95%. Of course, political tweets are full of sarcasm and humor, and Twitter users are a demographically biased sample. This is just a fun project for us.
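
As a rough illustration of the phrase-bag filter described above, here is a minimal Python sketch. It is not the actual backend: the names `OBAMA_PHRASES`, `normalize`, and `matches_obama` are hypothetical, and the list holds only the two phrases quoted in this comment, since the full list is not given here.

```python
import re

# Hypothetical phrase bag: only the two phrases quoted in the comment are included.
OBAMA_PHRASES = ("voted for barrack obama", "voted for pres obama")

def normalize(text):
    """Lowercase the text and collapse every run of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def matches_obama(text):
    """Return True only if the tweet contains one of the phrases as whole words."""
    padded = f" {normalize(text)} "
    return any(f" {phrase} " in padded for phrase in OBAMA_PHRASES)
```

A tweet that merely contains "voted" matches nothing and is discarded, which is what makes a filter of this shape conservative: it trades recall for precision.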



How many of the votes are from the 25,000+ people that retweeted Michelle Obama saying 'voted for President Obama'?

https://twitter.com/MichelleObama/status/265906946530496513

You'd probably clean up a whole lot by ignoring tweets containing "RT". That seems to be much of the stream.


We don't remove RT tweets; instead, we count each user only once. If a user retweeted Michelle, he or she will probably vote for Obama. But if a user has several tweets in favor of Obama, that user is still counted only once.
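
Counting each user once can be sketched in Python as below. This is an illustration under assumptions, not the actual code: the function name `count_unique_voters`, the `(user_id, text)` tweet shape, and the stand-in matcher are all hypothetical.

```python
def count_unique_voters(tweets, is_match):
    """Count distinct users with at least one matching tweet.

    tweets: iterable of (user_id, text) pairs (hypothetical shape).
    is_match: function deciding whether a tweet text counts as a vote.
    """
    voters = set()
    for user_id, text in tweets:
        if is_match(text):
            voters.add(user_id)  # adding the same user again has no effect
    return len(voters)

# Hypothetical sample: u1 retweets and also tweets separately, u3 is off-topic.
sample = [
    ("u1", "RT @MichelleObama: voted for President Obama"),
    ("u1", "I voted for President Obama this morning"),
    ("u2", "voted for President Obama"),
    ("u3", "long line at the coffee shop"),
]
# Stand-in matcher for illustration only, not the real phrase list.
print(count_unique_voters(sample, lambda t: "voted for president obama" in t.lower()))  # prints 2
```

Retweets are kept as input, so a retweet still contributes a vote, but a user with several matching tweets adds only one to the total.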


The accuracy is over 95%

Citation needed...

How can you draw this conclusion at this point in the process? I'm genuinely curious about your filtering scheme and how it extracts information from such a noisy data stream.


This is not scientific research, so I didn't compute standard deviations, t-statistics, etc. But I did pull a few hundred tweets from our database and counted how many we had classified wrongly. That's where the number comes from. The filtering scheme is very simple: classify only if we're confident. There are many tweets containing "voted", but we keep only the ones we are strongly confident about and throw away the rest. For the complete set of keywords used for filtering, please feel free to email.



