

Early detection of Twitter trends explained - snikolov
http://snikolov.wordpress.com/2012/11/14/early-detection-of-twitter-trends/

======
jrmg
I find it interesting that, in contrast to the other things the article says
techniques like this could be used to predict ("We can try this on
traffic data to predict the duration of a bus ride, on movie ticket sales, on
stock prices, or any other time-varying measurements."), Twitter trends are
artificial phenomena, with a very precise definition that was created by
Twitter, not some natural emerging thing. The actual tweets are of course a
natural phenomenon, but how topics are selected from them as 'trending' is
not.

Of course, that's not to say this isn't impressive work. Predicting which
topics Twitter's proprietary algorithm will select as trending, without direct
knowledge of the algorithm, before it selects them, and before all the tweets
that cause them to be selected have been posted, is impressive, and no doubt
no easier than predicting more natural phenomena or emergent behaviours.

~~~
snikolov
You bring up a great point. To classify something as a trend or not a trend,
we have to use this artificial black box to supply ourselves with examples of
what's a trend and what isn't. The nice thing, IMO (and this is something I
admittedly gloss over at the very end) is that doing prediction/forecasting
with this method is almost the same as doing classification, even though you
don't have any labeled examples when doing prediction. To do classification,
we compare current activity to past examples of activity, and decide if it
looks like the positive examples or the negative examples. For prediction, we
compare current activity to past activity, and see how similar-looking past
activity continued to evolve over time.
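
A minimal sketch of the classification step described above, for illustration only. The thread says only that current activity is compared to past positive and negative examples; the squared Euclidean distance, the `exp(-gamma * distance)` vote weighting, the `gamma` parameter, and all function names here are assumptions, not necessarily what the article's implementation does.

```python
import math


def distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def min_window_distance(observed, reference):
    """Smallest distance between `observed` and any same-length window of `reference`."""
    n = len(observed)
    if len(reference) < n:
        raise ValueError("reference series is shorter than the observed window")
    return min(
        distance(observed, reference[i:i + n])
        for i in range(len(reference) - n + 1)
    )


def looks_like_trend(observed, positives, negatives, gamma=1.0):
    """Let every past example vote, with closer examples counting for more.

    `positives` are past series labeled as trends, `negatives` are past series
    labeled as non-trends. Returns True when the positive votes outweigh the
    negative ones.
    """
    pos_votes = sum(
        math.exp(-gamma * min_window_distance(observed, ref)) for ref in positives
    )
    neg_votes = sum(
        math.exp(-gamma * min_window_distance(observed, ref)) for ref in negatives
    )
    return pos_votes > neg_votes
```

Here `observed` would be the most recent stretch of activity for a topic, and each reference series a longer recording of a topic whose label is already known.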

------
rdw
You know this is a good explanation of the technique because in retrospect it
seems obvious and clear.

------
mailshanx
Thanks for the excellent explanation, and many congratulations on your
thesis! :)

Could you point to any resources on time series analysis? While I am well
familiar with supervised/unsupervised learning methods for tasks like
classification, anomaly detection, etc., analyzing time series is a different
beast. And most machine learning literature (eosl?) doesn't seem to address
time series data either.

~~~
snikolov
Thanks! I'm afraid my background, like yours, is more in methods that are not
specialized for time series, and so I couldn't credibly give any comprehensive
references. My understanding is that a lot of methods designed specifically
for time series draw heavily on the theory of stochastic processes. For
example <http://en.wikipedia.org/wiki/Autoregressive_model> and
<http://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model>.
I once took a course called Signals, Systems, and Inference that covers some
of these ideas (full course notes here:
<http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-011-introduction-to-communication-control-and-signal-processing-spring-2010/readings/>),
but that's about as far as I've gotten along that road.
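
For readers unfamiliar with the autoregressive models linked above, here is a minimal sketch of the simplest case, an AR(1) model of the form x[t] = c + phi * x[t-1] + noise, fitted by ordinary least squares. The function names are made up for this example, and a real analysis would more likely use a statistics library than hand-rolled code.

```python
def fit_ar1(series):
    """Estimate (c, phi) in x[t] = c + phi * x[t-1] + noise by least squares."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    var_prev = sum((p - mean_prev) ** 2 for p in prev)
    if var_prev == 0:
        raise ValueError("series is constant, phi is not identifiable")
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    phi = cov / var_prev
    c = mean_curr - phi * mean_prev
    return c, phi


def forecast_ar1(last_value, c, phi, steps):
    """Iterate the fitted recurrence forward, ignoring the noise term."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = c + phi * value
        forecasts.append(value)
    return forecasts
```

Unlike the comparison-based method discussed in this thread, this approach commits to a specific parametric model of how the series evolves.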

------
Nogwater
There's something I don't understand about this. It depends on Twitter
supplying its picks for trending topics. How do you use something like this in
general if you're just given the stream of tweets but nothing else?

~~~
snikolov
You can get the trending topics through the Twitter API:
<https://dev.twitter.com/docs/api/1/get/trends/%3Awoeid>

But I think what you are asking is how such a method would come up with its
own trends, given just a stream of tweets. This is a supervised approach
(<http://en.wikipedia.org/wiki/Supervised_learning>), so for now, you would
need to train it (possibly online) by giving it examples of what should be a
trend and what shouldn't. It would be interesting to make it semi-supervised
(<http://en.wikipedia.org/wiki/Semi-supervised_learning>) so that you would
only need to provide a small number of labels.

~~~
Nogwater
Thanks. That is what I was asking.

It sort of comes down to the question of what's really being learned here? Are
they modeling some inherent process of topics becoming popular (or memes
spreading in a population) that could be used in other situations, or are they
just modeling some arbitrary algorithm that Twitter uses to mark some topics
as "trends"? If they're just modeling Twitter's existing algorithm, then it's
less interesting because that algorithm already exists. Since they're able to
detect the trend before Twitter does (well, before Twitter announces it
anyway), then it seems like they're probably onto something more fundamental.

~~~
snikolov
_It sort of comes down to the question of what's really being learned here?_

That's a great question. We are learning to recognize trends and non-trends
based on previous examples. Since the Twitter trends algorithm gives us such
examples, you could say we are learning to replicate the outputs of an
arbitrary algorithm --- and you'd be right. But learning from examples is a
very general thing, so the method has applications beyond detecting trending
topics.

_Are they modeling some inherent process of topics becoming popular_

No, we don't model the process of something becoming popular. (To do this, one
might suppose people spread popular topics in X way and unpopular topics in Y
way, and try to estimate from the data whether the topic is popular or
unpopular.) The beauty of this is that we never have to build a model, because
we rely directly on the data. As a corollary, this approach is applicable out
of the box for any domain with time-varying data (though I suppose you might
have to take care to measure the right kind of time varying data).
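
To make the "rely directly on the data" idea concrete for forecasting, here is a rough sketch that works on any list of numeric measurements: find the past windows most similar to the recent activity and average what came next. The function name, the squared Euclidean distance, and the choice of averaging the `k` nearest continuations are all assumptions made for illustration, not details taken from the article.

```python
def forecast_from_history(recent, history, horizon, k=3):
    """Forecast the next `horizon` values by looking at similar past windows.

    `recent` is the latest stretch of measurements, `history` is a list of
    past series. No model of the underlying process is fitted.
    """
    n = len(recent)
    candidates = []
    for series in history:
        # Only consider windows that still have `horizon` values after them.
        for start in range(len(series) - n - horizon + 1):
            window = series[start:start + n]
            future = series[start + n:start + n + horizon]
            dist = sum((a - b) ** 2 for a, b in zip(recent, window))
            candidates.append((dist, future))
    if not candidates:
        raise ValueError("history is too short for this window and horizon")
    candidates.sort(key=lambda item: item[0])
    nearest = [future for _, future in candidates[:k]]
    # Average the continuations of the closest windows, step by step.
    return [sum(values) / len(nearest) for values in zip(*nearest)]
```

The same function could be pointed at bus ride durations or ticket sales, provided the measurements in `recent` and `history` are on comparable scales.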

Does that answer your question?

~~~
Nogwater
I think that answers it. Thanks. :)

------
karamazov
For me, the most striking thing about the article is looking at the chart of
#Barclays and seeing _just how early_ the trend was detected.

I would never guess that the pattern, when cut off just after 12, is
indicative of a topic that's about to trend.

