

Google Correlate - ssn
http://correlate.googlelabs.com/

======
leot
Methods section of their whitepaper:

In our Approximate Nearest Neighbor (ANN) system, we achieve a good balance of
precision and speed by using a two-pass hash-based system. In the first pass,
we compute an approximate distance from the target series to a hash of each
series in our database. In the second pass, we compute the exact distance
function on the top results returned from the first pass.

Each query is described as a series in a high-dimensional space. For instance,
for us-weekly, we use normalized weekly counts from January 2003 to present to
represent each query in a 400+ dimensional space. For us-states, each query is
represented as a 51-dimensional vector (50 states and the District of
Columbia). Since the number of queries in the database is in the tens of
millions, computing the exact correlation between the target series and each
database series is costly. To make search feasible at a large scale, we employ
an ANN system that allows fast and efficient search in high-dimensional
spaces.

Traditional tree-based nearest neighbors search methods are not appropriate
for Google Correlate due to the high dimensionality of the data resulting in
sparseness of the data. Most of these methods reduce to brute force linear
search with such data. For Google Correlate, we used a novel asymmetric
hashing technique which uses the concept of projected quantization [16] to
reduce the search complexity. The core idea behind projected quantization is
to exploit the clustered nature of the data, typically observed with various
real-world applications. At training time, the database query series are
projected into a set of lower-dimensional spaces.

Each set of projections is further quantized using a clustering method such as
K-means. K-means is appropriate when the distance between two series is given
by Euclidean distance. Since Pearson correlation can be easily converted into
Euclidean distance by normalizing each series to be a standard Gaussian (mean
of zero, variance of one) followed by a simple scaling (for details, see
appendix), K-means clustering gives good quantization performance with the
Google Correlate data. Next, each series in the database is represented by the
center of the corresponding cluster.
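
The Pearson-to-Euclidean conversion mentioned above can be checked numerically. This is a minimal sketch, not the whitepaper's appendix: it assumes each series is standardized with the population standard deviation, in which case the squared distance between two standardized series of length n equals 2n(1 - r). The helper names and the sample data are made up for illustration.

```python
import math

def standardize(series):
    """Rescale a series to mean 0 and (population) variance 1."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    return [(v - mean) / std for v in series]

def pearson(a, b):
    """Pearson correlation as the mean product of standardized values."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up sample series, for illustration only.
a = [1.0, 3.0, 2.0, 5.0, 4.0]
b = [2.0, 2.5, 2.0, 6.0, 3.0]

r = pearson(a, b)
d2 = squared_distance(standardize(a), standardize(b))

# Assumed identity: d2 == 2 * n * (1 - r), so ranking by smallest
# distance is the same as ranking by largest correlation.
print(d2, 2 * len(a) * (1 - r))
```

Because the distance is a decreasing function of r, a nearest-neighbor search under Euclidean distance returns the most correlated series first.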

This gives a very compact representation of the query series. For instance, if
256 clusters are generated, each query series can be represented via a unique
ID from 0 to 255. This requires only 8 bits to represent a vector. This
process is repeated for each set of projections. In the above example, if
there are m sets of projections, it yields an 8m-bit representation for each
vector.
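
A rough sketch of that encoding step, under some loud assumptions: the "projections" are taken to be plain contiguous chunks of the series (the whitepaper does not say how it projects), and the codebooks are tiny hand-written ones with 4 centers each instead of 256 K-means centers. `nearest_center` and `encode` are hypothetical helper names.

```python
import math

def nearest_center(chunk, centers):
    """Index of the center closest to chunk (squared Euclidean distance)."""
    def dist2(center):
        return sum((x - y) ** 2 for x, y in zip(chunk, center))
    return min(range(len(centers)), key=lambda i: dist2(centers[i]))

def encode(series, codebooks):
    """Split series into len(codebooks) equal chunks; one center ID per chunk."""
    m = len(codebooks)
    width = len(series) // m
    return [
        nearest_center(series[j * width:(j + 1) * width], codebooks[j])
        for j in range(m)
    ]

# Toy codebooks (assumed, not trained): m = 2 subspaces, 4 centers each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]],
]

code = encode([0.9, 1.2, 0.1, 1.8], codebooks)

# Bits needed per vector: m * log2(number of centers).
# With 256 centers this would be 8 bits per subspace, i.e. 8m in total.
bits_per_vector = len(codebooks) * math.ceil(math.log2(len(codebooks[0])))
print(code, bits_per_vector)
```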

During the online search, given the target series, the most correlated
database series are retrieved by asymmetric matching. The key concept in
asymmetric matching is that the target query is not quantized but kept as the
original series. It is compared against the quantized version of each database
series. For instance, in our example, each database series is represented as
an 8m-bit code. While matching, this code is expanded by replacing each of the
m 8-bit IDs by the corresponding K-means center obtained at training time, and
Euclidean distance is computed between the target series and the expanded
database series. The sum of the Euclidean distances between the target series
and the database series in m subspaces represents the approximate distance
between the two. Approximate distance between target series and the database
series is used to rank all the database series. Since the number of centers is
usually small, matching of the target series against all the database series
can be done very quickly.
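
One plausible reading of why this is fast, sketched below: because there are only a few centers per subspace, the distance from the unquantized target to every center can be computed once per query and then looked up for each database code. The lookup-table approach is an assumption on my part (the quoted text does not spell it out), the sketch sums squared distances rather than plain distances, and the codebooks, database codes, and function names are all invented for illustration.

```python
def distance_tables(target, codebooks):
    """For each subspace, squared distance from the target chunk to every center."""
    m = len(codebooks)
    width = len(target) // m
    tables = []
    for j, centers in enumerate(codebooks):
        chunk = target[j * width:(j + 1) * width]
        tables.append(
            [sum((x - y) ** 2 for x, y in zip(chunk, c)) for c in centers]
        )
    return tables

def approx_distance(code, tables):
    """Sum the precomputed per-subspace distances selected by the code."""
    return sum(tables[j][center_id] for j, center_id in enumerate(code))

# Toy codebooks (assumed): 2 subspaces, 4 centers each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]],
]

# Database series stored only as center IDs, one per subspace.
database = {"a": [1, 2], "b": [0, 0], "c": [3, 1]}

# The target stays unquantized; that is the "asymmetric" part.
target = [1.0, 1.0, 0.0, 2.0]
tables = distance_tables(target, codebooks)
ranked = sorted(database, key=lambda name: approx_distance(database[name], tables))
print(ranked)
```

Each database comparison then costs m table lookups and additions, regardless of how long the original series is.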

To further improve the precision, we take the top one thousand series from the
database returned by our approximate search system (the first pass) and
reorder those by doing exact correlation computation (the second pass). By
combining asymmetric hashes and reordering, the system is able to achieve more
than 99% precision for the top result at about 100 requests per second on
O(100) machines, which is orders of magnitude faster than exact search.
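
The two-pass idea can be sketched in a few lines. Everything concrete here is an assumption: the coarse representation is just the standardized series rounded to integers (a stand-in for the real hash codes), the shortlist is 2 instead of one thousand, and the series are made up.

```python
import math

def standardize(series):
    """Rescale a series to mean 0 and (population) variance 1."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    return [(v - mean) / std for v in series]

def pearson(a, b):
    """Exact Pearson correlation."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)

def two_pass_search(target, database, coarse, shortlist_size, top_k):
    zt = standardize(target)

    def approx_dist(name):
        # Pass 1 score: unquantized target vs. coarse database entry.
        return sum((x - y) ** 2 for x, y in zip(zt, coarse[name]))

    # Pass 1: cheap ranking of the whole database, keep a shortlist.
    shortlist = sorted(database, key=approx_dist)[:shortlist_size]
    # Pass 2: exact correlation, computed only for the shortlist.
    reranked = sorted(
        shortlist, key=lambda name: pearson(target, database[name]), reverse=True
    )
    return reranked[:top_k]

# Made-up database of four short series.
database = {
    "up": [2, 4, 6, 8, 10, 12.5],
    "down": [6, 5, 4, 3, 2, 1],
    "noisy_up": [1, 3, 2, 5, 4, 6],
    "flat": [1, 1, 2, 1, 1, 2],
}
# Stand-in for hashing: round each standardized value to an integer.
coarse = {name: [round(v) for v in standardize(s)] for name, s in database.items()}

results = two_pass_search([1, 2, 3, 4, 5, 6], database, coarse,
                          shortlist_size=2, top_k=2)
print(results)
```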

------
wxs
Heh, time correlation of "exam schedule" tells a sad story.

Shifted 1 week: gpa calculator

Shifted 2 weeks: final grades

Shifted 3 weeks: academic suspension

Shifted 4 weeks: academic dismissal

[http://correlate.googlelabs.com/search?e=exam+schedule&t...](http://correlate.googlelabs.com/search?e=exam+schedule&t=weekly&shift=1)

[http://correlate.googlelabs.com/search?e=exam+schedule&t...](http://correlate.googlelabs.com/search?e=exam+schedule&t=weekly&shift=2)

[http://correlate.googlelabs.com/search?e=exam+schedule&t...](http://correlate.googlelabs.com/search?e=exam+schedule&t=weekly&shift=3)

[http://correlate.googlelabs.com/search?e=exam+schedule&t...](http://correlate.googlelabs.com/search?e=exam+schedule&t=weekly&shift=4)
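
For anyone wondering what the shift does numerically, here is a small sketch of a lagged correlation. It assumes a positive shift means "compare week t of the first series with week t + shift of the second"; I don't know which direction Correlate actually uses, and the weekly counts below are invented.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def lagged_pearson(leader, follower, shift):
    """Correlate leader[t] with follower[t + shift] over the overlapping weeks."""
    if shift == 0:
        return pearson(leader, follower)
    return pearson(leader[:-shift], follower[shift:])

# Invented weekly counts: the second series is the first delayed by 2 weeks.
exam_schedule = [0, 1, 0, 5, 1, 0, 0, 2, 0, 6, 1, 0, 0, 1, 0, 4, 0, 0]
final_grades = [0, 0] + exam_schedule[:-2]

# Try shifts of 0 to 4 weeks and keep the one with the highest correlation.
best_shift = max(
    range(5), key=lambda s: lagged_pearson(exam_schedule, final_grades, s)
)
print(best_shift)
```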

------
hugh3
There's certainly the potential to extract interesting data here, but I
haven't found it yet.

I do note with distress, however, that:

[http://correlate.googlelabs.com/search?e=why+is+my+poop+gree...](http://correlate.googlelabs.com/search?e=why+is+my+poop+green&t=weekly#)

searches for "Why is my poop green?" peaked in March 2010 before subsiding,
and that it's correlated with "hiv symptoms in women" and "how to get a guy to
ask you out".

Meanwhile, "why is my poop black?" is correlated with "How to say I love you
in French"

~~~
ntoshev
If you upload stock market data, you could see if there are searches that
strongly predict certain stocks.

~~~
Symmetry
Well, it works well with unemployment at least:
[http://www.freakonomics.com/2011/05/25/mining-for-correlatio...](http://www.freakonomics.com/2011/05/25/mining-for-correlations-it-works/)

------
leot
What many here seem to miss is that this looks at coincidences in the timing
of searches. This is not "within-subject": it's not that
people-who-search-for-x-also-search-for-y. Rather, it's
WHEN-people-search-x-other-people-are-also-likely-to-search-y.

That being the case, can anyone come up with an explanation for this?
[http://correlate.googlelabs.com/search?e=accident&t=week...](http://correlate.googlelabs.com/search?e=accident&t=weekly)

~~~
rorrr
People drive more in the summer.

~~~
leot
The pattern looks more interesting than simply that. Multiple semantically
related terms are correlated with "accident". And, strangely, "fatal accident"
is not correlated very much with any of those other semantically related
terms. Further, there are bunches of other queries that you might imagine
"summer-ness" could also drive (air conditioners, heat rash, e.g.), but which
it doesn't.

------
koanarc
Interestingly, migraine headaches
([http://correlate.googlelabs.com/search?e=migraine+headaches&...](http://correlate.googlelabs.com/search?e=migraine+headaches&t=weekly))
correlate with:

\- small business development

\- us copyright office

\- education grants

\- legal advice

------
yesbabyyes
Search by drawing is really cool. <http://correlate.googlelabs.com/draw>

A bug report: I get 500 Internal Server Error when inputting non-ascii
characters:

[http://correlate.googlelabs.com/search?e=santa+f%C3%A9&t...](http://correlate.googlelabs.com/search?e=santa+f%C3%A9&t=weekly)

------
zzleeper
This is not robust to outliers. See for instance:
[http://correlate.googlelabs.com/search?e=payday&e=payday...](http://correlate.googlelabs.com/search?e=payday&e=payday%20loan&t=weekly)

Maybe they could use a different (and probably more computationally intensive)
correlation to fix this.
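
As one example of what a more robust measure could look like, here is Spearman rank correlation (my pick for illustration; nothing suggests Google considered it). It is Pearson correlation applied to ranks, so a single huge spike counts no more than any other top value. The sketch assumes there are no tied values, and the data is made up.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def ranks(values):
    """Rank of each value (0 = smallest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order):
        result[index] = float(rank)
    return result

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Made-up data: y falls steadily as x rises, except for one huge spike.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [8, 7, 6, 5, 4, 3, 2, 100]

# The outlier drags Pearson positive; Spearman still reports a negative trend.
print(pearson(x, y), spearman(x, y))
```

The extra cost is the sort needed to compute ranks, which fits the guess that a robust measure would be more computationally intensive.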

------
hammock
Interesting to go to the Winter example and then shift the lag time
(left-hand side). If you shift it 2 months, you get summer internships. If
you shift it 4 months, you get summer camps. Five months, you get baseball.
Eight months, you get spiders, fleas, and fire ants.

~~~
nooneelse
Searching for a popular movie name gets other things related to that movie
with shift=0, and other popular movies (or other things that spike like them)
with shifting. Movie names were just the first very "spiky" thing I thought of
to use as a filter like that; there might be something better (short of
loading a spike time series, of course).

------
JonnieCache
Nice to see that the tradition of comic book product launches is still going
strong.

<http://correlate.googlelabs.com/comic>

------
iandanforth
Anyone play the real-estate market?

A great leading indicator of 'selling a home' seems to be 'european airfare'
and 'florida apartments.' So here's what you do. Take out google adwords for
these searches, offer 'great deals' in return for your zipcode and email
address.

Then you can use these addresses to send inquiries about home sales and get in
on sales before they hit the market!

There, go make money.

P.S. If this actually works, be nice enough to let me know :)

------
jfager
This is cool, but I'd like to see it include the volume of the search term and
provide a way to filter out terms above or below a specified volume threshold
(i.e., show me the terms within the top 20% of all terms by volume that best
correlate to this curve).

If that's too much to ask, it could at least provide a way of skipping the
step of manually entering the returned search terms into Trends.

------
nck4222
Some of those are really obvious ("treatment for flu" had the highest
correlation with flu activity), but "disney vacation package" correlating with
a state's average rainfall was a bit surprising (in that the highest
correlation wasn't a search for umbrellas).

~~~
meowbark
Does anybody really google for "umbrellas"?

~~~
nooneelse
According to the tool in play here, yes, but mostly "patio umbrellas", in the
summer.

------
moonboots
Looking forward to version 2.0, Google Causality

------
tibbon
Now seems to be a good time to remember...

"correlation is not causation"

------
fogus
Dear lord -- Clojure is the most pornographic programming language ever!

------
orijing
[http://correlate.googlelabs.com/search?e=hacker+news&t=w...](http://correlate.googlelabs.com/search?e=hacker+news&t=weekly)

So many git commands...

------
powertower
> Google Correlate is an experimental new tool on Google Labs which enables
> you to find queries with a similar pattern to a target data series.

I don't understand what the correlation is here. Is this just matching queries
by frequency of search?

So you could have completely random and unconnected search phrases/queries
"correlating" because the quantity and time/date are matching?

~~~
Symmetry
They're using 'correlation' in its precise statistical sense:
<http://en.wikipedia.org/wiki/Correlation>

------
lostbit
A bunch of Portuguese words seem to follow a pattern with a big rise in 2008
and a huge drop afterwards. Looks like something was wrong with the data.
[http://correlate.googlelabs.com/search?e=suporte&t=weekl...](http://correlate.googlelabs.com/search?e=suporte&t=weekly#)

I searched: suporte, cadeira, filho, barata, coelho, figueira, orkut.

------
Semiapies
"War" gives an interesting and very regular yearly pattern, aside from the
obvious spike in 2003. Part of it might be explained by the summer slowdowns
in insurgent activity in Iraq, but the rest of it - especially the drop right
at the beginning of the year - mystifies me.

Maybe people avoid searching for anything war-related around the holidays.

~~~
moultano
Lots of patterns are determined by the time in the school year when they are
usually studied.

------
BradGutting
I'd like to see a search feature that maximizes "happy accidents," kind of
like how major scientific discoveries have been made inadvertently. Goes along
with the notion that sometimes it's more important to create conditions in
which success can occur rather than just "zero in" on something that you think
(possibly inaccurately?) to be successful.

~~~
JonnieCache
A sort of "I'm feeling serendipitous" button.

------
rhygar
The infamous "facebook login" search has some interesting correlations:
[http://correlate.googlelabs.com/search?e=facebook+login&...](http://correlate.googlelabs.com/search?e=facebook+login&t=weekly)

Also, there seems to have been a huge drop-off in this search over the last
few months.

------
eagletusk
[http://correlate.googlelabs.com/search?e=losing%20weight&...](http://correlate.googlelabs.com/search?e=losing%20weight&e=rental%20homes&t=weekly#)

This is strange:

US Web Search activity for losing weight and rental homes (r=0.9418)

------
ot
I wonder how much (if at all) query autocompletion biases the results. I think
they rank higher queries that are spiking right now, to give fresh
suggestions, and this may create some feedback effect on spikes.

BTW, how does "google" correlate (0.98) with "kratom"???

------
headShrinker
Are people getting more sick or are they just searching it more?

[http://correlate.googlelabs.com/search?e=sore+throat&t=w...](http://correlate.googlelabs.com/search?e=sore+throat&t=weekly)

------
dmboyd
[http://correlate.googlelabs.com/search?e=id:6n3Ji59CZ3S&...](http://correlate.googlelabs.com/search?e=id:6n3Ji59CZ3S&t=weekly#)

Did hemorrhoids cause the GFC?

------
torstein
Anyone care to explain why almost all computer-related queries have declined
since 2004?

Is it as simple as more "normal" people using the internet?

------
ciex
Does anyone know how they can match these so fast?

~~~
tsewlliw
They put it in Shazam...

Seriously though - DFT -> key -> build giant R-tree. You can probably munge
the key to get the week offset. Seems like a straightforward mapreduce problem
:)

------
meow
A quick search of programming languages (C#, java, C++, Objective C) is ...
disturbing ... :(

------
leot
... ultra-cool would be rankings by mutual information ...

------
Andrew_Quentin
google adwords correlates with

county detention center pain in ln nauseous sharp pain pain in back right side
el paso tx constantly take

------
klbarry
What I see from this is that, with this tool, I could make a lot of
meaningless correlations seem 95% certain to laymen. Almost any term can be
matched perfectly by an unrelated term. Search "Eco fashion".

