Isn't it a bit silly to throw away every single @<user> tweet? Imagine I'm having a Twitter conversation with my friends about who can make the best pangram, and that data gets ignored just because we're tweeting @ each other.
I'm guessing that the majority of tweets will reference a person. How about just stripping that @word from the tweet?
Interesting. The issue I was getting was that it would like tweets that were "@MyFriendWhoHasAUserNameLikeZXCVBNMQWERTYU hey". Should I just strip it of the word "@____"?
I'd try stripping such words (and URLs) when they appear at the start or end of the tweet. If they're in the middle, removing them seems likely to mess with the meaning.
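For what it's worth, here is a rough sketch of that edge-stripping idea in Python. It is only an illustration, not the code the bot actually uses: the names strip_edges and is_pangram are made up, and it assumes a mention looks like "@" followed by word characters and a URL starts with http:// or https://.

```python
import re
import string

# Assumption: a mention is "@" plus word characters, a URL starts with
# http:// or https://, and either must begin at a whitespace boundary.
_TOKEN = r"(?<!\S)(?:@\w+|https?://\S+)"
_LEADING = re.compile(rf"^(?:\s*{_TOKEN})+\s*")
_TRAILING = re.compile(rf"\s*(?:{_TOKEN}\s*)+$")


def strip_edges(tweet: str) -> str:
    """Drop runs of mentions/URLs at the start and end; leave the middle alone."""
    tweet = _LEADING.sub("", tweet)
    tweet = _TRAILING.sub("", tweet)
    return tweet.strip()


def is_pangram(text: str) -> bool:
    """True if every letter a-z appears at least once (case-insensitive)."""
    return set(string.ascii_lowercase) <= set(text.lower())


if __name__ == "__main__":
    raw = "@MyFriendWhoHasAUserNameLikeZXCVBNMQWERTYU hey"
    cleaned = strip_edges(raw)
    print(repr(cleaned), is_pangram(cleaned))
```

With this, the long username no longer counts toward the alphabet check, while a mention in the middle of a sentence ("hello @bob how are you") is left untouched.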
"?????????????????????the quick brown fox jumps over the lazy dog??????????????????"
This is considerably less interesting than I hoped it would be. (And a little surprising! That's a lot of data.)