Also, why would they use the Dow Jones Industrial Average? It is a ridiculously bad measure of stock market performance for many reasons. This Planet Money podcast goes into why:
The whole methodology is a joke anyway. If you evaluate a huge number of search terms, some of them are going to do better than others by chance alone. So terms that might mean something ("debt") get mixed up with terms like "color" that just got lucky.
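To make the "got lucky" point concrete, here is a rough sketch that correlates pure-noise "search volume" series against a pure-noise "returns" series. Everything in it is an assumption for illustration: the 1000 terms, the 100 weeks, and the 0.197 cutoff (roughly the two-sided 5% critical correlation for 100 points). None of it comes from the study.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def count_lucky_terms(n_terms=1000, n_weeks=100, r_cutoff=0.197, seed=0):
    """Count noise series whose correlation with noise returns looks 'significant'."""
    rng = random.Random(seed)
    returns = [rng.gauss(0, 1) for _ in range(n_weeks)]
    lucky = 0
    for _ in range(n_terms):
        volume = [rng.gauss(0, 1) for _ in range(n_weeks)]  # carries no real signal
        if abs(pearson(volume, returns)) > r_cutoff:
            lucky += 1
    return lucky

lucky = count_lucky_terms()
print(f"{lucky} of 1000 meaningless terms pass the 5% cutoff")
```

You would expect somewhere around 50 of the 1000 to pass, even though none of them means anything.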
What you have to do is come up with one strategy that trains a classifier on the combined words, tunes it up with logistic regression and derives optimal trading actions from that. And you've got to factor in what you're paying to the broker and the market makers.
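The cost part is easy to sketch. The commission and half-spread figures below are placeholders I made up, not real broker quotes, and the model (pay both on the way in and on the way out) is a simplification.

```python
def net_return(gross_return, commission=0.0005, half_spread=0.0005):
    """Gross return minus round-trip costs (assumed: commission plus half the
    bid-ask spread, paid once on entry and once on exit)."""
    round_trip_cost = 2 * (commission + half_spread)
    return gross_return - round_trip_cost

def should_trade(expected_gross_return, commission=0.0005, half_spread=0.0005):
    """Only act when the expected edge survives the costs."""
    return net_return(expected_gross_return, commission, half_spread) > 0

print(should_trade(0.003))  # expected edge larger than the assumed costs
print(should_trade(0.001))  # expected edge smaller than the assumed costs
```

A signal with a small edge can be real and still not be worth trading once these costs come out.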
There are more considerations like these. Statisticians have been very busy for the last century or so.
So, while the methodology might be spurious (random search terms may give false positives), I don't think the use of the Dow vs. the S&P 500 is a key criticism.
An extreme example, illustrating a common mistake in the field. https://xkcd.com/882/ has the gist of it.
Uncorrected multiple comparisons are, of course, a big portion of the statistical dubiousness inherent in "data dredging".
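One standard correction is Bonferroni: divide the significance level by the number of tests you ran. A minimal sketch follows; the p-values are invented for illustration and are not from the study.

```python
def bonferroni_survivors(p_values, alpha=0.05):
    """Keep only terms whose p-value beats alpha divided by the number of tests."""
    cutoff = alpha / len(p_values)
    return {term: p for term, p in p_values.items() if p < cutoff}

# Made-up p-values for illustration only.
p_values = {"debt": 0.0001, "color": 0.03, "stocks": 0.004, "restaurant": 0.2}
survivors = bonferroni_survivors(p_values)
print(sorted(survivors))
```

With four tests the cutoff drops from 0.05 to 0.0125, so a term at p = 0.03 that looked fine on its own no longer counts.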
That's why you build another threshold-type learner and then apply logistic regression to convert the score produced by learner A into a probability score.
Then you can tune at the exact point of the precision-recall curve that maximizes business value.
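Roughly, that looks like the sketch below: fit a one-variable logistic regression on learner A's raw score (often called Platt scaling), then sweep the probability threshold and keep the one with the best payoff. The scores, labels, and the payoff numbers (gain of 1 per true positive, cost of 2 per false positive) are all made up for illustration.

```python
import math

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def best_threshold(probs, labels, gain_tp, cost_fp):
    """Try each observed probability as a cutoff; return (best value, cutoff)."""
    best_value, best_t = float("-inf"), None
    for t in sorted(set(probs)):
        value = sum(gain_tp if y else -cost_fp
                    for p, y in zip(probs, labels) if p >= t)
        if value > best_value:
            best_value, best_t = value, t
    return best_value, best_t

# Hypothetical raw scores from learner A and the true outcomes.
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
probs = [1.0 / (1.0 + math.exp(-(a * s + b))) for s in scores]
value, threshold = best_threshold(probs, labels, gain_tp=1, cost_fp=2)
print(f"act when p >= {threshold:.2f}, total value {value}")
```

Each threshold is one point on the precision-recall curve; the payoff numbers decide which point you actually want.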
If you look at enough words, the chance that some of them appear to have predictive value approaches 100%.
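The arithmetic behind that, as a quick sketch. It assumes the tests are independent and each uses a 5% significance level; both are simplifying assumptions of mine.

```python
def chance_of_false_positive(n_terms, alpha=0.05):
    """Probability that at least one of n independent tests passes by luck."""
    return 1 - (1 - alpha) ** n_terms

for n in (1, 10, 100):
    print(n, round(chance_of_false_positive(n), 3))
```

At 100 terms the chance of at least one spurious "hit" is already above 99%.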
EDIT: Ok, I couldn't resist the xkcd reference:
“Look,” he says, “it’s a ten dollar bill”.
“Nonsense,” says the economist. “If that was a ten dollar bill, someone would have picked it up by now.”
"One day we had a conversation where we figured we could just try to predict the stock market, and then we decided it was illegal. So we stopped doing that."
I mean, subject to a variety of risk and regulatory issues, perhaps, and not a core competency, and a variety of other things, but... the fundamental reason for insider-trading laws is to make sure the insiders don't abuse their positions and act against the interests of the company's owners (shareholders) by trading in stock tips instead of building shareholder value. If you get knowledge independently -- like if you sent your analyst out to count the number of cars in a firm's parking lots to estimate hiring/firing -- that's all well and good (and is something that hedge funds actually do.)
Heck, if it works, this sort of thing would be great. Moving the market earlier means fewer people buy and sell companies at the wrong price. That sort of knowledge is worth billions. (Think about it from the perspective of startups: if you could see the future and know whether a company would work out before you even founded it, then you would build only successful companies, and you'd be certain they'd be funded well. This is but a small fraction of that power, but it's still quite meaningful.)
Let's say you are not connected to company X, but you have a friend who works there. One day your friend tells you a material fact. Now, your friend probably broke a couple of rules, but _you_ cannot control what people tell you and you cannot be held liable for what you hear. However, acting on this information is illegal. Even if you presumably do not use this information, but trade shares of company X before the information you know becomes public, you are in murky waters.
Same with Google. Google cannot control what people search for. However, acting on this information would be illegal, unless they are absolutely sure that information in the search terms is consistent with public knowledge about the company.
At a high level you aren't allowed to use "material, non-public information" for investment purposes, but information isn't material just because you can make money off of it in some way; otherwise "channel checks" would be illegal. Material non-public information has to come from insiders of the company, so the only argument that it would be illegal for Google to invest is that corporate insiders were providing it with material information via search terms. If they are merely using the sentiment the public exposes to them through search terms, that is probably legal. Similarly, it's legal for hedge funds to fly planes over department stores and count the cars in their parking lots to gauge the level of business they are seeing at Christmas time, even though this isn't public information.
Six examples from 2008-2010: http://www.huffingtonpost.com/dan-mirvish/the-hathaway-effec...
A computer scientist who works with hedge funds: "We come across all sorts of strange things in our line of business, strange correlations"
As the intelligence of our technology grows, I find it amusing to consider these less-logical patterns in the algorithms (indirectly) responsible for so much of our economy.
Typically you split your data pool in half and use data analysis on the first half to determine some kind of relation, such as a change in searches for "color" seeming to correlate with an opposite change in the Dow Jones. You then generate a prediction algorithm based on that.
Finally, you run your prediction system on the other half of your data to see whether you are actually predicting or just correlating. Feel free to adjust any of your methodology, but make sure you don't include the second half in your generation step, or else you have only shown correlation, not prediction.
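A bare-bones sketch of that split, using pure noise as the data so you can see what "just correlating" looks like. The rows are (signal, return) pairs I generate at random; the "prediction algorithm" here is deliberately trivial (pick whichever direction worked more often in the first half), and I split chronologically, which is my assumption for time-series data.

```python
import random

def split_in_half(rows):
    """First half for building the rule, second half held out for testing."""
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]

def fit_direction(train):
    """+1 if signal and return agreed in sign at least half the time, else -1."""
    agree = sum(1 for signal, ret in train if signal * ret > 0)
    return 1 if agree * 2 >= len(train) else -1

def hit_rate(rows, direction):
    """Fraction of rows where the rule called the sign of the return correctly."""
    hits = sum(1 for signal, ret in rows if direction * signal * ret > 0)
    return hits / len(rows)

rng = random.Random(1)
rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(400)]  # no real signal

train, test = split_in_half(rows)
direction = fit_direction(train)        # the second half is never touched here
in_sample = hit_rate(train, direction)
out_of_sample = hit_rate(test, direction)
print(f"in-sample {in_sample:.2f}, out-of-sample {out_of_sample:.2f}")
```

The in-sample number is at least 50% by construction; only the out-of-sample number tells you whether the rule predicts anything.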
See also, a collection of other related links (helpful if you want to try your hand at making something like this):
Derivatives could also reduce risk for users. For instance: you're a business that accepts bitcoins for payment. But the price of bitcoins can fall by 50% in a day! So you'd better convert those bitcoins to cash the moment you get them, or you're at risk of losing half their value any day. Something that's too hot to handle like that kind of damages the ability to use it as a medium of exchange, doesn't it? But you could alternatively sell bitcoin currency futures, essentially locking in your current price. So maybe you could hang onto those bitcoins a while and use them to pay people for your referral program or something. (Oh, sure, there's a cost associated with it, but there are costs associated with translating cash back and forth as well. Depending on the price of the derivatives, it might make sense.)
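Here is the hedge as arithmetic, in a simplified sketch. It assumes the business holds the coins and is short one futures contract per coin at today's price, and it ignores margin, basis risk, and contract sizes. The prices and quantities are made up.

```python
def hedged_value(btc_held, spot_at_expiry, futures_price, hedge_cost=0.0):
    """Cash value of the coins plus the short-futures payoff, minus hedge cost."""
    spot_value = btc_held * spot_at_expiry
    futures_pnl = btc_held * (futures_price - spot_at_expiry)  # short position
    return spot_value + futures_pnl - hedge_cost

# Hypothetical: 10 BTC received at a price of 100, hedged at 100.
print(hedged_value(10, spot_at_expiry=50, futures_price=100))   # price halves
print(hedged_value(10, spot_at_expiry=150, futures_price=100))  # price rises
```

Either way the position comes out to the locked-in amount minus whatever the hedge costs, which is the trade-off: you give up the upside to get rid of the downside.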
If anyone in SF is also into this stuff, we should grab coffee. Email is antonlakin at gmail.
Hal Varian, Google's chief economist, has been talking about using Google Trends this way for a while: http://www.frbsf.org/economics/conferences/1103/Varian-part_...
What's interesting is that I had this exact idea with precious metals. The frequency of the terms searched (like gold or buy gold) matches the gold spot market index impressively closely.