
Numerai – A hedge fund built by a global community of anonymous data scientists - maxt
https://numer.ai/
======
lordnacho
Hedge fund guy here.

- You need to know something about the domain in order to make sensible
predictions. Is the data daily? Is it per second? Is it ticks? You can't build
a sensible model if you don't know that, even if you have good predictions.
Relative cost will vary a lot between timescales.

- It matters what the features are. Maybe there's some clever reason why it
doesn't, but until I hear why I'm going to take the ordinary view that some
features are different in nature to others. For instance maybe one feature is
volatility, a thing we typically model with GARCH, while another is some
fundamental like P/E, which we'd incorporate some other way.

- How are you executing the trades? It matters a lot whether you're click-
trading through some broker API, automating via Excel, or running your own
network of colo servers. Some things just aren't possible if you're too slow.

- If you make the data encrypted, you'd better know very well what it
represents. For instance, you might take all the closing prices of the LSE
stocks on a given day as inputs. You can make analyses that are valid with
that, and ones that aren't, because the data you've collected do not represent
a snapshot of the market at a specific time. It might sound like it does, but
it doesn't on deeper inspection (market opens and closes are not
simultaneous).

Does anyone know how it's going for them?

~~~
solaarphunk
Although numerai could control for some of the factors that you mention above,
I'm pretty sure they are falling into the data-mining fallacy that most people
do when they approach quantitative trading from the outside world. I
completely agree with your skepticism of their approach.

While prediction based on data may be valuable in some cases, it isn't robust
enough to scale up in any meaningful way. Context matters, like you state
above, and most quantitative traders start by taking their contextual
knowledge of the markets, and then collecting data on features, and THEN they
fit a model to it.

Skipping these steps is only going to lead to a bunch of blowups. I doubt they
have any meaningful sharpe that they could scale up or publicly defend with
the approach they have taken so far. I'd guess they are paying people VC
dollars, not actual market profits right now.

I do think it's cool that they have been able to use homomorphic encryption to
solve the problem of wanting to anonymize data, but I'm not sure it actually
helps in this case.

~~~
botexpert
Knowing what features represent is an advantage but isn't necessary. At least
that's what NN research shows. But currently a lot of data is required for
good representation learning. Ensembling also works, or can work, and it seems
that's what they're doing. Although it doesn't look like the datasets are large
enough for NNs or that the ensembles are large enough for good prediction.

------
delegate
"It's just a pure math problem. It's like a math competition. You don't need
to know anything about finance, you don't know anything about hedge funds...
you don't even have to speak English..."

This is not a pure math problem. Eventually the outcome of all these models
and predictions affects the stock prices and - if it becomes as successful as
you hope - the economy as a whole. And the physical world: people, animals,
plants, pollution, CO2 and so on.

I would much rather see work in that area - than this data juggling deep
learning bullshit which results in profits being paid out to a bunch of
intelligent, greedy and unwise people.

I just hope that when the intelligent people finally become wise, it won't be
too late.

~~~
wyager
> than this data juggling deep learning bullshit which results in profits
> being paid out to a bunch of intelligent, greedy and unwise people.

Holy anti-intellectualism, Batman!

If you describe machine learning as "data juggling bullshit", this is very
strong evidence that you simply don't understand what it is. This is an
indictment of you, not of machine learning. Machine learning would more
accurately be called "applied computational statistics" in 99% of cases.

What makes you think that the people using applied statistics to make money
are "unwise"? Based on your tone, I would guess that it's because what they're
doing doesn't agree with your folk definition of "an honest day's work" or
something like that. This isn't really a good criticism; it just means you
don't see the utility of what they're doing, which requires some degree of
abstract thinking about the market.

~~~
delegate
> What makes you think that the people using applied statistics to make money
> are "unwise"?

Because it is an incredible waste of talent.

I know this is in contradiction with all the 'values' that have been drilled
into our minds since we were born, but it's about time we wake up and reorder
our priorities.

> Based on your tone, I would guess that it's because what they're doing
> doesn't agree with your folk definition of "an honest day's work" or something
> like that.

If you look at the state of the biosphere / atmosphere / oceans - the data -
and if you have children - then it should be quite obvious.

Also there are the socio-economic challenges that the world faces right now -
in fact it is unclear if we're going to make it to the next century as a
species.

It really don't matter how much "money" you have in your account when your
city sinks under the ocean...

~~~
hueving
> It really don't matter how much "money" you have in your account when your
> city sinks under the ocean..

You just move. That's what money enables. No city is going to sink so rapidly
people just drown.

~~~
inimino
So you're literally arguing that we just need to focus on making money, and
ignore things like the environment? Yep, sounds about like a textbook example
of "unwise" to me.

~~~
hueving
Nope, just pointing out how stupid the conclusion is that the poster made.

------
jimfleming
Spent some time experimenting with Numerai. Really fun competition, clean
(encrypted) dataset, and Bitcoin payouts. I wrote about my experience and
open-sourced all of the models here[0] if you're looking to get started.

[0]
[https://github.com/jimfleming/numerai](https://github.com/jimfleming/numerai)

~~~
dharma1
Awesome. Thanks for the write up

------
ktamura
Their dataset reeks of startup hustling.

I just downloaded the training set [1] and plotted some of its descriptive
statistics [2]. It looks like all features are uniformly distributed and the
response variable is Bernoulli coin-flipping. In layman's terms, you can't
really come up with a good predictive model with this training set.

I give them the benefit of the doubt that they wanted to have something in
place to push the website live, but I cannot imagine any serious data
scientist not noticing this.

[1]
[http://datasets.numer.ai/57feb95/numerai_training_data.csv](http://datasets.numer.ai/57feb95/numerai_training_data.csv)

[2] [https://cl.ly/051G3Y2Z2O0W](https://cl.ly/051G3Y2Z2O0W)

~~~
antognini
The features are all encrypted using homomorphic encryption, which ends up
mapping the overall distribution uniformly onto [0, 1]. If you play with the
dataset, though, you'll find there are some significant correlations between
the different features and you can make a model that does substantially better
than chance.

~~~
daveguy
I would like to see a comparison between randomly generated uniform features
and these features. Can you fit the noise and come up with "statistically
significant" predictions? If so, is anyone doing any better than can be done
on random data? If no one is doing better than that, what is the likelihood
that these aren't any better than a monkey with a dartboard?

I would love to see peer-reviewed articles from numerai with some of their
behind-the-scenes results.

~~~
antognini
Whenever I make a model I always do cross-validation to make sure that I'm not
overfitting. If we were just fitting random noise I would always see the
performance of my test set being no better than chance (or worse).
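
A minimal sketch of that check, assuming nothing about the real Numerai data: the features are uniform noise, the labels are coin flips, and a nearest-centroid classifier (an arbitrary stand-in, as are the row and feature counts) is scored with 5-fold cross-validation. The in-sample score can drift above 50% because the centroids are fit to the noise, but the held-out score should stay near chance.

```python
import random

def make_noise(n_rows, n_features, rng):
    """Uniform [0, 1) features and coin-flip labels that share no signal."""
    X = [[rng.random() for _ in range(n_features)] for _ in range(n_rows)]
    y = [rng.randint(0, 1) for _ in range(n_rows)]
    return X, y

def fit_centroids(X, y):
    """Nearest-centroid model: the mean feature vector of each class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Pick the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min((0, 1), key=lambda label: sq_dist(centroids[label]))

def accuracy(centroids, X, y):
    hits = sum(predict(centroids, x) == t for x, t in zip(X, y))
    return hits / len(y)

def cross_validate(X, y, k=5):
    """Return (mean in-sample accuracy, mean held-out accuracy) over k folds."""
    n = len(X)
    train_scores, test_scores = [], []
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th row goes to the test fold
        X_tr = [X[i] for i in range(n) if i not in held_out]
        y_tr = [y[i] for i in range(n) if i not in held_out]
        X_te = [X[i] for i in sorted(held_out)]
        y_te = [y[i] for i in sorted(held_out)]
        model = fit_centroids(X_tr, y_tr)
        train_scores.append(accuracy(model, X_tr, y_tr))
        test_scores.append(accuracy(model, X_te, y_te))
    return sum(train_scores) / k, sum(test_scores) / k

rng = random.Random(0)  # fixed seed so the run is repeatable
X, y = make_noise(n_rows=4000, n_features=20, rng=rng)
train_acc, test_acc = cross_validate(X, y)
print(f"in-sample accuracy: {train_acc:.3f}")
print(f"held-out accuracy:  {test_acc:.3f}")
```

Running the same pipeline on the real training set and comparing the two held-out numbers would be one way to do the comparison asked for above.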

------
asdfologist
I don't see how Numerai can avoid the multiple comparisons problem [0]. If
people submit thousands of random models, then some subset of them will do a
fantastic job in predicting prices in historical simulations but do poorly
under real market conditions. As long as the models are black boxes, there's
likely no good way to distinguish them from noise.

[0]
[https://en.wikipedia.org/wiki/Multiple_comparisons_problem](https://en.wikipedia.org/wiki/Multiple_comparisons_problem)

~~~
dreamdu5t
That's true only if stock market data is completely random and there's no
signal to predict on. That doesn't seem to be the case, considering hedge
funds successfully use ML on historical data to capture alpha. You don't need
a model to work forever to be successful.

~~~
asdfologist
It looks like you missed the point of my comment. I'm saying that Numerai
won't be able to distinguish between zero alpha and positive alpha models, if
all they're doing is running historical simulations on black boxes.
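
A rough simulation of that selection effect, under made-up assumptions: 1000 hypothetical "models" that are pure coin flips, a 250-day backtest, and a 250-day live period. None of these numbers come from Numerai. The best backtest score looks impressive only because it is the maximum of many noisy scores, and the same model is back to chance on fresh data.

```python
import random

def hit_rate(calls, outcomes):
    """Fraction of up/down calls that matched what happened."""
    return sum(c == o for c, o in zip(calls, outcomes)) / len(outcomes)

def simulate(n_models=1000, n_days=250, seed=0):
    """Pick the best of n_models zero-alpha models by backtest, then score it live."""
    rng = random.Random(seed)
    backtest = [rng.randint(0, 1) for _ in range(n_days)]  # historical up/down moves
    live = [rng.randint(0, 1) for _ in range(n_days)]      # moves after selection
    best_backtest, best_live = -1.0, None
    for _ in range(n_models):
        # A zero-alpha model: its calls are coin flips in both periods.
        calls_past = [rng.randint(0, 1) for _ in range(n_days)]
        calls_future = [rng.randint(0, 1) for _ in range(n_days)]
        score = hit_rate(calls_past, backtest)
        if score > best_backtest:
            best_backtest = score
            best_live = hit_rate(calls_future, live)
    return best_backtest, best_live

best_backtest, best_live = simulate()
print(f"best backtest hit rate: {best_backtest:.3f}")
print(f"same model, live:       {best_live:.3f}")
```

The backtest alone cannot separate this winner from a model with real alpha, which is the point being made here; a fresh hold-out period that nobody has submitted against is the usual mitigation.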

------
kowdermeister
"In December 2015, we created the world’s first encrypted data science
tournament for stock market predictions. Since then, Numerai data scientists
have submitted 13,350,675,598 equity price predictions. The most accurate and
original machine learning models from the world’s best data scientists are
synthesized into a collective artificial intelligence that controls the
capital in Numerai’s hedge fund."

What does this mean? What do they do?

~~~
bigdubs
Monkeys throwing darts at the board.

This brings to mind the Buffett hedge fund wager, where he invested in a
Vanguard S&P 500 tracking fund (VFIAX) and a hedge fund actively managed an
equal amount, and Mr. Buffett ended up winning handily.

~~~
dsacco
It's easy to be dismissive of this, especially by working from the popular
Warren-Buffett-index-funds-beat-hedge-funds story that makes the rounds. It's
true that most hedge funds and active traders lose money (or at least,
underperform the market). But as I am fond of pointing out, there are a non-
negligible number of funds and traders who consistently and demonstrably earn
significantly stronger returns than the market benchmark, even net of fees.

I am skeptical of Numerai for different reasons. If someone can consistently
churn out profitable and novel equity pricing insights, it would be more
rational for them to work for a more well established hedge fund in
quantitative research. Perhaps more importantly, I'm skeptical of how they
judge accuracy in their participant-volunteered insights.

~~~
pjlegato
They are betting that there exist people who are capable of performing that
work who cannot work at a hedge fund, for whatever reason.

Perhaps they live in the wrong location -- it's hard/impossible to get a quant
job if you don't live in a major market center. Not everyone is 23 years old
and unattached and prepared to move around the world for a job.

Perhaps they can do the predictions but they didn't go to a high end
university and have no track record -- just try even getting an interview at a
hedge fund without one of those.

~~~
dharma1
Or people who do work at hedge funds, but want to moonlight on the side and
get paid via bitcoin anonymously

------
Houshalter
I don't understand how this works. What exactly is being predicted? The data
isn't a time series. The outputs are only binary. There are only 27 features.
I don't understand how this represents market data at all. In fact they
probably destroyed most of the information trying to convert the data to this
format.

------
bgitarts
The training set just has anonymized features. Data scientists generally would
like to know the nature of the data they are working with. Does the site at
any point give access to the labeled features?

~~~
yankoff
You don't necessarily have to know the meaning of the features to build a
successful model. That has been done on Kaggle a lot.

------
dooglius
This is fishy. The entire point of encrypted data is that one message cannot
be distinguished from another without decryption (which would require the
private key). In other words, the entire premise shouldn't work. This means
that either bad encryption is being used (i.e. statistical information about
the data is leaked) or the good results we see are just noise. Or, the whole
thing is a scam to get funding: the best algorithms are planted, and the
company just shuffles BTC between accounts it controls.

~~~
jamez1
The data is homomorphically encrypted, meaning you can do operations (such as
add and subtract) on the ciphered message and it will also perform them on the
underlying data.

~~~
dooglius
Yes, I realize that. The issue is that the result of any operation is also
encrypted, which means that there should be no way to connect the target of
the training data (encrypted or not) to the output of a function of encrypted
data. Suppose the unencrypted data is (a,b,c) where a+b=c, and
(x,y,z)=encrypt((a,b,c)). We have an addition function plus on encrypted data
such that decrypt(plus(x,y))=a+b=c=decrypt(z), but it is not the case that
plus(x,y)=z (at least, not if plus is computable in polynomial time, and
assuming the encryption scheme is sound). If it were, we could statistically
distinguish encrypt((a,b,c)) from encrypt((rand(),rand(),rand())) which would
mean the encryption is not sound.
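
To make the (a,b,c) example concrete, here is a toy sketch using the Paillier cryptosystem, a standard additively homomorphic scheme. The thread does not say which scheme Numerai uses, so Paillier is only a stand-in, and the primes 1009 and 1013 are far too small for real use. The point to observe is that plus(x,y) decrypts to the same value as z while being a different ciphertext, because each encryption mixes in fresh randomness.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier keys from two small primes (illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (Python 3.8+)
    return n, (lam, mu)

def encrypt(n, m, rng):
    """Encrypt integer m (0 <= m < n) using generator g = n + 1 and random r."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:  # r must be invertible mod n
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover m as L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) // n."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def plus(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (n * n)

# A seeded PRNG keeps the demo repeatable; a real system would need a CSPRNG.
rng = random.Random(42)
n, priv = keygen()
a, b, c = 3, 4, 7
x, y, z = encrypt(n, a, rng), encrypt(n, b, rng), encrypt(n, c, rng)
summed = plus(n, x, y)

print("decrypt(plus(x, y)) =", decrypt(n, priv, summed))
print("decrypt(z)          =", decrypt(n, priv, z))
print("plus(x, y) == z ?   ", summed == z)
```

Without the private key, nothing here lets an observer tell that plus(x,y) and z hide the same number, which is why a scheme like this would not leave usable correlations in the ciphertexts.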

~~~
jamez1
They could just be normalizing every data point to be between 0 and 1 by
dividing by the range. That's a homomorphic encryption.. it passes your weird
assumptions.

I don't know why you're harping on about sound encryption, the point of this
is to keep the statistical information intact in the cipher, without giving
away the underlying market data.
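
A small sketch of what that would look like, taking the divide-by-the-range guess at face value (it is speculation in this thread, not a confirmed description of Numerai's pipeline). The two columns are invented numbers. Min-max scaling hides the original units and levels but leaves ordering and linear correlation untouched, so a model can still learn from the scaled values.

```python
import math

def min_max_scale(values):
    """Map values linearly onto [0, 1] by subtracting the min and dividing by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical raw columns, invented for illustration only.
prices = [101.2, 99.8, 103.5, 107.1, 104.4, 110.9]
volumes = [5.1e6, 4.7e6, 6.3e6, 7.9e6, 6.0e6, 9.2e6]

scaled_prices = min_max_scale(prices)
scaled_volumes = min_max_scale(volumes)

print("correlation before scaling:", round(pearson(prices, volumes), 6))
print("correlation after scaling: ", round(pearson(scaled_prices, scaled_volumes), 6))
```

The flip side is that a transform this simple is easy to undo up to an unknown offset and scale, so it obfuscates rather than encrypts in the cryptographic sense.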

~~~
Houshalter
It's very weird to call data normalization "encryption". This is just a
standard procedure done on most datasets. Encryption implies they've gone to
extra processing to make sure you can't figure out what the variables mean.

I think it's either an abuse of the word 'encryption', or they really have
done something weird to this dataset, which will probably make it useless
for statistical algorithms. Even normalization destroys a lot of useful
information.

~~~
jamez1
They only need to go as far as to obfuscate the market data this was derived
from, so they don't have to pay exchange licencing fees.

It doesn't have to have an exponential time complexity on decryption to
qualify as 'encryption'. Multiplying by 2 could be considered homomorphic
encryption.

You might think encryption means something else and that it's an abuse of the
word but unlike the spy novella that you derive this impression from, these
guys actually are ex-spies.

~~~
nathan_f77
> They only need to go as far as to obfuscate the market data this was derived
> from, so they don't have to pay exchange licencing fees.

Thanks for explaining this, I was struggling to figure out the difference
between this and [https://www.quantopian.com/](https://www.quantopian.com/)

That's actually pretty clever.

------
baccredited
If I had a winning strategy why would I feed it to Numerai instead of
instavest.com?

~~~
yankoff
You don't know if you have a winning strategy. You'd have to put your own
money and take a risk to find out. Plus you'd have to take care of data and
feature engineering yourself.

------
zitterbewegung
Why should I think that being anonymous gives some advantage? I think it
would be more of a disadvantage, since how can I trust you? Also, how do I
know these people are truly anonymous?

~~~
daveguy
Anonymous in the sense that numerai does not ask for any personally
identifying information. You could easily prove you are who you say you are in
numerai.

Also, I think the big benefit from a data scientist working on this is you can
test methods with generic features, submit the results and get paid if they
are good, but not submit any part of the methodology to any third party.

If you can kill it on numerai then maybe you would consider buying data
sources and apply your methods to your own data. Although you still don't know
what the features are.

It's the polar opposite of open source.

The owners don't have to trust the data scientists. They evaluate their
results against additional data.

~~~
dharma1
That's what I thought - if you are smashing it on numerai, wouldn't you be
better off raising some capital for your own fund to scale it?

------
gravypod
Couldn't I make thousands of fake accounts and submit thousands of slightly
different models? Then after a while I could push one or two of my stocks
higher in my models artificially. If you represented a large enough % of the
"data scientists" in this hedge fund you could make it look like your stock is
"definitely" worth investing in. After they invest in your company, you could
take off to the hills.

Hell, you don't even need to be the owner of the company. This would be a
great way to obtain large amounts of political sway/power. Like a company/want
it to succeed for some agenda? Make it look better as an investment
opportunity. Dislike a company? Well that stock is going to do horribly next
quarter. It's also a self-fulfilling prophecy.

For all of you finance people out there, is what I am saying impossible or
stupid? I hope I'm wrong otherwise this is a horrible idea.

~~~
nradov
You don't even need to push one or two stocks higher. With thousands of fake
accounts you could just pick a different stock for each account at random.
Certainly a few of those stocks will turn in huge returns. Collect your
Bitcoin reward from Numerai. Repeat.

Maybe they have some way of preventing people from gaming the system that way?

~~~
mikbob
The thing is, with the data no one knows what each row represents, or even
what the features are or what is being predicted. Each submission has
30,000 predictions, so you would need to have an unreasonably good random
guess to get anywhere near the top of the leaderboard.
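
A back-of-the-envelope version of that argument, treating each of the 30,000 predictions as an independent coin flip and using a normal approximation to the binomial. The 52% target and the 1000 fake accounts are hypothetical numbers, and the real leaderboard may score on something other than raw accuracy.

```python
import math

def random_guess_spread(n_predictions, p=0.5):
    """Standard deviation of the accuracy of n independent guesses with hit rate p."""
    return math.sqrt(p * (1 - p) / n_predictions)

def tail_probability(target_accuracy, n_predictions, p=0.5):
    """Normal-approximation chance that random guessing reaches target_accuracy."""
    z = (target_accuracy - p) / random_guess_spread(n_predictions, p)
    return 0.5 * math.erfc(z / math.sqrt(2))

n_predictions = 30_000  # predictions per submission, as stated above
target = 0.52           # hypothetical accuracy needed to look good
n_accounts = 1000       # hypothetical number of fake accounts

sd = random_guess_spread(n_predictions)
p_one = tail_probability(target, n_predictions)
p_any = min(1.0, n_accounts * p_one)  # union bound over all accounts

print(f"spread of a random guesser's accuracy: {sd:.5f}")
print(f"chance one random submission reaches {target:.0%}: {p_one:.3g}")
print(f"upper bound with {n_accounts} accounts: {p_any:.3g}")
```

With a spread of roughly 0.3 percentage points, even a 52% target sits many standard deviations from chance, so spamming random submissions does not get far under these assumptions.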

------
DrNuke
In my limited experience in London at the trading level, they do not want
collective intelligence at all: they do the bulk with algorithms at very low
latencies and want outliers (unpredictable singletons are less reproducible
than the average from millions of heads) to run high-risk, high-reward
books.

------
Dowwie
Is this an execution of "algorithm(model) as a service"?

------
exit
i suppose this could be, uh, useful for insider trading

------
blazespin
[https://xkcd.com/1570/](https://xkcd.com/1570/)

------
lasermike026
How ominous.

In today's news anonymous scientists form human genetics laboratory to improve
the human species.

------
markovbling
Awesome!

------
late2part
This is darn cool

------
cheiVia0
Things I do not understand:

* How does a logloss relate to earnings?

* They only receive predictions based on old data (by definition) and not the models, so how do they use the predictions to make trading decisions?

* How can I invest in this hedge fund?

------
DeBraid
tldr - novel encryption method allows for sharing of data sets and better
machine learning models. Aggregation of models into a portfolio offered as a
hedge fund.

Found this company interesting, read a bunch of blogs from them and
tweetstormed:
[https://twitter.com/Royal_Arse/status/787725301908242432](https://twitter.com/Royal_Arse/status/787725301908242432)

------
dschiptsov
Someone got financing for crowdsourcing the data mining of a commercial data
set. Very clever.

It seems that only very few nerds are taking the theoretical impossibility of
predicting the future seriously and we are missing the opportunity to get
funding for some crappy models from the greater fools.

------
martinko
Hope they will have more luck than LTCM

