
Numerai, a hedge fund built by a community of anonymous data scientists - joeykrug
https://medium.com/@Numerai/7b208deec5f0
======
dzdt
Some thoughts on the business model:

* traditional hedge funds have a problem with scaling: if you put more money in the same strategy, returns go down. Numerai hopes to scale the number of strategies it employs by scaling the number of researchers participating.

* by providing researchers only opaque streams of data, they prevent researchers from leaving and competing directly. If you don't know how the data corresponds to the market, you can't replicate the trading at another fund. (Some big hedge funds like D.E.Shaw do the same!)

* researchers may still leave and compete indirectly, using the same algorithms on different market features. But by paying anonymously in bitcoin, Numerai may be hoping for the reverse, that programmers from other quant funds will anonymously moonlight for Numerai using their algorithms from those other funds.

* by being opaque with the data, Numerai keeps researchers from knowing the true value their strategies are providing. That information asymmetry is in Numerai's favor, letting them underpay even strong performers.

------
lordnacho
Quant fund insider here.

The data is pretty pure, in the sense of not telling you any metadata at all.
It's literally just a bunch of numbers and 0/1 labels.

It's hard to implement a strategy without knowing what exactly you're looking
at. I get the feeling this "pure dataset" is part of some framework that
Numerai thinks will beat the market, given good predictors.

That's not necessarily the case. Say I assume the 0/1 means up/down over some
period. Well, being able to guess 0/1 correctly would obviously help. Say I'm
right 70% of the time, then I can equal weight my bets and it will be just
swell. But say I'm right about 51% of the time. Then it's going to take quite
a while longer for the law of large numbers to work in my favour. Remember
your ML algo will only be able to give you good predictions if some of the 21
features are actually meaningful, and we have no reason to think they are
actually meaningful.

Now, let's say I have some domain knowledge in finance. I want to predict
over/underachievement relatively. I would be able to guess which shares go up
relative to others, but not the market factor. That would require a different
framework to the one I'm supposing is presented here. Is there flexibility for
that?

The secrecy thing makes me wonder, too. If it's just a matter of not showing
your work, why don't you just have a website where people submit their
daily/weekly/monthly portfolios and you keep track of the tally?

~~~
valdiorn
> Say I'm right 70% of the time, then I can equal weight my bets and it will
> be just swell. But say I'm right about 51% of the time. Then it's going to
> take quite a while longer for the law of large numbers to work in my favour.

That's actually very far from being true. If you trade a single instrument,
sure, the variance will kill you in anything but the very long run. But if you
trade thousands of securities (like say, the entire US equity market), then a
55% prediction ratio and a market neutral strategy will absolutely crush. Even
if you blindly buy/sell on every signal without doing any sort of weighting
(excluding low confidence predictions, etc), then you should see a several
sigma strategy.

It only takes a very, very small edge to make a very low risk strategy if you
can diversify.

[https://en.wikipedia.org/wiki/Signal_averaging](https://en.wikipedia.org/wiki/Signal_averaging)

Now add on top of that the fact they will have _several_ low SNR prediction
signals, and the effects of signal averaging become even greater.

I'm also a "quant fund insider", as you put it...
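
To put rough numbers on the diversification argument above, here is a minimal sketch. It assumes every bet is independent, equally sized, and pays +1 or -1, which real markets do not guarantee; `edge_in_sigmas` and `bets_needed` are illustrative names, not anything from Numerai.

```python
import math

def edge_in_sigmas(hit_rate: float, n_bets: int) -> float:
    """Expected total P&L divided by its standard deviation.

    Assumes n_bets independent, equal-sized bets that each pay +1 with
    probability hit_rate and -1 otherwise (a strong simplification).
    """
    mean = 2.0 * hit_rate - 1.0          # expected payoff of one bet
    std = math.sqrt(1.0 - mean ** 2)     # standard deviation of one bet
    return mean * math.sqrt(n_bets) / std

def bets_needed(hit_rate: float, target_sigmas: float) -> int:
    """Number of independent bets needed to reach target_sigmas."""
    mean = 2.0 * hit_rate - 1.0
    std = math.sqrt(1.0 - mean ** 2)
    return math.ceil((target_sigmas * std / mean) ** 2)

if __name__ == "__main__":
    for p in (0.51, 0.55, 0.70):
        print(p, round(edge_in_sigmas(p, 1000), 2), bets_needed(p, 2.0))
```

Under these assumptions a 55% hit rate over 1,000 bets comes out a bit above 3 sigma, while a 51% hit rate needs on the order of 10,000 bets to reach 2 sigma. Correlated bets would reduce the effective number of bets, so treat these as upper bounds.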

~~~
lordnacho
Yes, Taleb did the actual calculation in one of his books. I'm exaggerating
because if I say it's 50.01% it will cause head scratching.

------
gtrubetskoy
I'm skeptical. There are skyscrapers in the NYCs, Londons, Singapores and Hong
Kongs of this world filled with people who are smart, have enormous
computer resources and funds, and are paid handsomely to work on solving this
problem with all manner of ML and AI at their disposal. The "crowd" has no
advantage over them. The "closed system" is much larger than the "crowd" in
this case.

~~~
karmacondon
I don't think this is true at all. 10,000 people are just going to have more
ideas and better individual ideas than 100 experts. The impact of that much
creativity and perspective can be exponential, and it's hard to duplicate.

When I'm designing a system, I hate to have to try to out think everyone on
the internet. If you have a known set of opponents you can predict what they
might do. When you're up against anybody from anywhere, you never know what
you're going to get. Global scale collaboration is a very powerful thing
because it allows a complete exploration of the solution space, and it's
difficult to stop.

~~~
iopuy
I absolutely disagree. I'll take the word of 100 experts over 10,000 amateurs.

~~~
taneq
Examples of the word of 10,000 amateurs:

* Anti-vaxxers

* The healing power of crystals

* Moon landing conspiracy nuts

* Multi-level marketing

~~~
darawk
There are also 'experts' in these fields, so your point is moot.

~~~
wallace_f
OK, but how many physicists support the moon landing conspiracy? How many
aerospace engineers support the moon landing conspiracy? What credentials do
the experts of the moon landing conspiracy have that I should trust?

I think he has a valid point if you have bias in which experts you place
trust. There are, in fact, a lot of experts -- and even academically tenured,
credentialed, published experts -- that I agree, don't have much of anything
worthwhile to say.

~~~
darawk
Ya, true. There are particular areas though where 'wisdom of the crowds' works
much better than any expert, and then there are obviously areas where it does
not.

I'm not sure exactly what the properties of each type of problem are, but it
doesn't seem at all obvious to me that stock picking is not one in which a
sort of herd optimization approach might be very effective.

------
joncooper
"It is intuitively obvious that an open access hedge fund will generate more
intelligence than a closed system built on a pre-internet, pre-cryptocurrency,
pre-AI organizational design."

Really? Because the folks with the magic black box aren't capable of funding
an Interactive Brokers account to keep 100% of their upside and 100% of their
IP?

(Also: risk management and order handling are harder problems than signal
generation.)

~~~
arcanus
Isn't that the magic of OSS? Linux::Windows, matplotlib::mathematica,
android::iPhone, etc. In each case, the free variety quickly catches up to the
proprietary version, and in doing so, cuts into the profitability of the
parent. Furthermore, this often breaks down monopolies, as they must innovate
or die.

~~~
superuser2
Matplotlib covers a tiny speck of a footnote of Mathematica's functionality.
SageMath (a composition of Numpy, Scipy, Sympy, matplotlib, R, etc.) is a more
appropriate analogy.

------
fpgaminer
I started poking at this out of curiosity, and a desire to begin sharpening my
TensorFlow axe, and one thing remains unclear. They give you two spreadsheets,
one being the training data and the other is the tournament data (what you
need to predict on). Each entry in the spreadsheet is 21 features and a single
binary class. The latter is what you predict. But for the submissions they
request a probability, not a class. They don't explain what "probability" here
means. Does it mean probability of class 0? Probability of class 1?
Probability of the moon exploding on a Thursday?

Overall interesting idea. Undecided whether it's real/scam/fake, but
definitely very interesting at face value. I just wish their documentation was
more clear. Seems kind of important...

EDIT: Found a comment on Reddit that indicates that it means probability of
class 1
([https://www.reddit.com/r/MachineLearning/comments/3wdr9e/num...](https://www.reddit.com/r/MachineLearning/comments/3wdr9e/numerai_a_global_ai_tournament_to_predict_the/cyathv5))
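
For what it's worth, here is a sketch of what "probability of class 1" looks like coming out of a binary classifier: plain logistic regression in pure Python. The toy rows use 2 made-up features rather than the 21 in the real data, the function names are hypothetical, and the actual submission file format is not shown here.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=1000):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y                      # gradient of log loss w.r.t. z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def predict_proba_class1(model, x) -> float:
    """Return P(label == 1 | x), the number the tournament asks for."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Made-up toy training data: low feature values -> class 0, high -> class 1.
train_rows = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]]
train_labels = [0, 0, 1, 1]
model = fit_logistic(train_rows, train_labels)

if __name__ == "__main__":
    for row in ([0.15, 0.15], [0.85, 0.8]):
        print(row, round(predict_proba_class1(model, row), 3))
```

The point is only that the submitted value is the model's estimate of P(class = 1), a number between 0 and 1, rather than a hard 0/1 label.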

~~~
TrickedOut
Have you found any good TensorFlow examples which handle financial or time
series data like this? Please do share! Most of the examples I find are either
image processing or text processing. Rarely time series or traditional DB type
data.

~~~
nl
Generally people use an LSTM for time series if they want to use a NN approach.

See
[http://robromijnders.github.io/LSTM_tsc/](http://robromijnders.github.io/LSTM_tsc/)

------
cryptokoala
Numerai comes across as fraudulently abusing cryptographic buzzwords like
homomorphic encryption
[https://medium.com/@Numerai/encrypted-data-for-efficient-mar...](https://medium.com/@Numerai/encrypted-data-for-efficient-markets-fffbe9743ba8#.ifdtksq5o)

~~~
_yvjs
Yes, they still haven't replied to a question about this.

[https://www.reddit.com/r/Bitcoin/comments/4p5xgx/ai_hedge_fu...](https://www.reddit.com/r/Bitcoin/comments/4p5xgx/ai_hedge_fund_predictions_using_bitcoin/d4il1xz)

Based off that article they don't seem to understand the homomorphic in
homomorphic encryption.

The mix of technical BS and seemingly expert advisers is weird.
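
For readers unfamiliar with the term, "homomorphic" means that an operation on ciphertexts corresponds to an operation on the underlying plaintexts. The sketch below uses textbook RSA with tiny, insecure parameters, which happens to be multiplicatively homomorphic. It is only an illustration of the property and says nothing about what Numerai actually does; the function names are made up.

```python
# Toy textbook RSA with tiny, insecure parameters (illustration only).
P, Q = 61, 53
N = P * Q      # modulus, 3233
E = 17         # public exponent
D = 2753       # private exponent: E * D == 1 mod (P - 1) * (Q - 1)

def encrypt(message: int) -> int:
    return pow(message, E, N)

def decrypt(ciphertext: int) -> int:
    return pow(ciphertext, D, N)

def multiply_encrypted(c1: int, c2: int) -> int:
    """Multiply two ciphertexts without ever decrypting them."""
    return (c1 * c2) % N

if __name__ == "__main__":
    a, b = 7, 3
    product_ct = multiply_encrypted(encrypt(a), encrypt(b))
    # Decrypting the product of ciphertexts yields the product of plaintexts.
    print(decrypt(product_ct) == a * b)
```

A scheme that merely obfuscates or rescales data, without any such correspondence between operations on ciphertexts and operations on plaintexts, would not normally be called homomorphic encryption.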

------
bberenberg
Understanding which features to create and why is significantly more impactful
than just trying new models on the same dataset.

------
Xcelerate
> Numerai was seed funded by Howard L. Morgan the co-founder of Renaissance
> Technologies.

Very interesting. This gives this idea some legitimacy in my opinion.

~~~
s_q_b
Very much agreed.

For those who are not aware, Renaissance Technologies is a massively
successful hedge fund that makes investment decisions solely from data, with
perhaps the most sophisticated mathematical models in the marketplace.

Their approach was entirely novel when James Simons founded the firm. Simons
is an incredible mathematician who graduated from MIT in his teens and obtained his
doctorate at 23. Before and during Renaissance, he made significant
contributions to cryptology, topology, and string theory.

His firm essentially invented quantitative trading. To this day, with close to
$30 Billion under management, Renaissance Technologies still makes investment
decisions purely algorithmically.

~~~
vostok
To provide a counterexample, P/NP was doing quantitative trading before
RenTech. I will say that I wouldn't comment on implementation details in this
industry unless I've worked at the company in question.

~~~
s_q_b
That's fair. It was not my intention to provide a comprehensive review of the
firm's approach, but rather a quick summary that glosses over a great many
details.

To address your second point, I wouldn't comment on implementation details in
a company for whom I _had_ worked.

P/NP is Princeton Newport Partners. I have a passing familiarity with that
story ;)

~~~
vostok
I agree. I wouldn't do that either.

------
powera
This is where the "accredited investor" warnings are appropriate. _Don't do
this with your money if you aren't willing and able to lose it!_

In the long run, it's impossible for people to beat the market simply by
looking at historic stock prices. _Impossible_. If it is possible in the short
run, more and more people will do it until they don't make any money at all,
or a "black swan" event occurs and they go completely bankrupt. (I suppose
there's a third option, that they all make so much money that the entire rest
of the world goes bankrupt, but that's absurd)

So be careful!

------
hault
As Peter Thiel says, great startup founders are those who can see the future
in ways in which others can't. This idea certainly looks like the future to
me. Very interested to see where this goes.

------
rgbrgb
So cool. This is the first time I've heard of homomorphic encryption. In my
case, Open Listings has a lot of real estate data that we're not allowed to
vend programmatically (sale prices, list prices, property characteristics). It
would be interesting to be able to release this data in a legally encrypted
form and let data scientists train predictors. We currently have an offer
creation API that's being used by algorithmic investors but they have to get
their data to decide what to bid on from another source. My immediate
questions, which maybe someone here can answer:

1) Is it legal to vend a dataset that is encrypted this way if you're not
allowed to vend the original? The OP implies that it is, but that seems too
good to be true.

2) Is there software purpose-built for this type of thing? What's good in this
domain? Our stack is mostly ruby but we're polyglots.

------
modeless
So in exchange for giving a hedge fund a stock tip that earns 20% in a month,
the guy gets $10k? That sounds like a ripoff to me! If you have the skills to
do that repeatedly you can make a whole lot more than $10k doing the trades
yourself.

~~~
onion2k
The point is that you probably can't do it repeatedly, or predict with any
confidence that you can even do it once. Numerai enables you to bet using
someone else's money, with a vastly reduced reward, but no risk to you.

Numerai won this time (hence the PR piece) but I don't think we should judge
their performance on one action in isolation. We should judge whether their
approach works based on a year or two of trading on these predictions. Maybe
longer, if reacting to unusual events (economic collapse, freak speculation on
tulips, etc) is something you care about.

~~~
mrkgnao
Plus, Numerai gives them opaque features for ML training that correspond to
real-life data in ways that only Numerai knows. So you can't bail and use your
model on your own.

------
davnn
Interesting idea. Not that crowd driven investment algorithms are new, but I
have not seen a machine learning one before.

What really annoys me about this kind of business is that they pay tiny
prizes and shut down the competitions once they have found what they were looking
for. However, Numerai might be completely different in that regard, and I wish
them the best!

Btw: The article kind of gives the impression that machine learning is
something new to the hedge fund business, and that's absolutely not the case.
Smart people have already been working on really complex algorithms for a
couple of years now.

~~~
dharmon
Even more than "a couple of years". About 15 years ago I worked at a day
trading firm and we were writing models that used machine learning. At the
time we thought of it more as "computational statistics", but it's basically
what is called ML now and taught in ML courses (although we didn't use Neural
Nets).

BTW, even in 2001 we were far from the first to do this.

~~~
T-A
Depending on where you draw the line between statistics and machine learning,
it could be argued to have originated with Bachelier's thesis in 1900 [1], or
Thorp's adaptation of Kelly's work first to gambling and then to finance in
the early 60s (he may have been first to use computers for this kind of thing)
[2] or maybe with James Simons' Renaissance Technologies in the early 80s [3].

[1]
[https://en.wikipedia.org/wiki/Louis_Bachelier](https://en.wikipedia.org/wiki/Louis_Bachelier)

[2]
[https://en.wikipedia.org/wiki/Edward_O._Thorp](https://en.wikipedia.org/wiki/Edward_O._Thorp)

[3]
[https://en.wikipedia.org/wiki/Renaissance_Technologies](https://en.wikipedia.org/wiki/Renaissance_Technologies)

~~~
hkmurakami
Regarding Rentec, the only decent account of their history I know of is in "The
Quants". Are there any others?

[https://www.amazon.com/Quants-Whizzes-Conquered-Street-Destr...](https://www.amazon.com/Quants-Whizzes-Conquered-Street-Destroyed/dp/0307453383)

------
ianpurton
_When you’re standing at the beginning of a super exponential curve, that’s
the time to buy insurance against any negative outcomes along that curve. So
today, we’re allowing users to donate Bitcoin to the Machine Intelligence
Research Institute (MIRI) as a hedge against things going horribly right._

If you're the kind of person that falls for this kind of thing, then you
should know I'm also standing in front of a super exponential curve raised to
the power of infinity and beyond. You can also send me bitcoin as a hedge if
you wish.

------
dharma1
I've been looking at this a few times. It's like a giant ensemble. But I'm not
sure ML will be able to beat chance on average on a data source like this.

And if someone discovers they are making money consistently on numerai, I
think they would set up their own fund quite quickly.

I do like the encrypted system though, could be used for other ML competitions
where you don't want to give your model away

~~~
HappyTypist
I know why they're paying out Bitcoin and keeping everything anonymous. They
are hoping quant hedge fund insiders submit their model to the site.

~~~
dharma1
You don't submit a model, just results

------
abcampbell
But _why_ did the machine want to go long Salmar ASA?

~~~
brycehidysmith
Does it matter? The machine saw a pattern, and it responded to the pattern. We
don't need to know.

~~~
datamingle
If the machine is "anonymous", it does matter. Scenario #1: A human gets
insider information that Solar City will be bought, then makes his anonymous "AI
machine" predict that Solar City is a great stock to buy!

~~~
alexmingoia
That's not possible. The data is encrypted. None of the participants can see
which stocks they are training on, or anything else about the data. Numerai turns stock
prediction into a pure ML problem.

~~~
theli0nheart
Doesn't the encrypted chart still need to display price history or volume? If
so it seems like it'd be a trivial task to match it up with its real-world
counterpart.

~~~
richard_craib
It would be easy to match an obfuscated stock market dataset with some third
party dataset, and this has happened on many Kaggle competitions (data leaks).
That's why the encryption here is important.
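
As a rough illustration of how that kind of matching can work, the sketch below assumes the obfuscation is only a rescaling and shift of the original series, in which case plain Pearson correlation against public candidates recovers the match. The tickers and prices are made up, and this is not a description of any specific Kaggle leak or of Numerai's data.

```python
import math

def pearson(a, b) -> float:
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def best_match(obfuscated, candidates) -> str:
    """Name of the public series most correlated with the obfuscated one.

    The absolute value also catches obfuscations that flip the sign.
    """
    return max(candidates, key=lambda name: abs(pearson(obfuscated, candidates[name])))

# Hypothetical public price histories (fabricated toy numbers).
public = {
    "AAA": [10.0, 10.5, 10.2, 11.0, 11.4],
    "BBB": [50.0, 49.0, 51.0, 49.5, 50.5],
}

# "Obfuscate" AAA by rescaling and shifting; correlation is unaffected.
obfuscated = [0.03 * price - 7.0 for price in public["AAA"]]

if __name__ == "__main__":
    print(best_match(obfuscated, public))
```

Because correlation is invariant under scaling and shifting, simple normalization alone does not hide which series is which; any scheme meant to prevent matching has to do more than that.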

~~~
Someone
But how do you encrypt a stock's historical performance without removing the
information (performs better in summer, went up after 9/11…) hidden inside it?

You can add noise, but I doubt that will be enough.

