
7,500 Faceless Coders Paid in Bitcoin Built a Hedge Fund’s Brain - mhb
https://www.wired.com/2016/12/7500-faceless-coders-paid-bitcoin-built-hedge-funds-brain/
======
aresant
The company the article refers to is
[https://numer.ai/about](https://numer.ai/about)

From their homepage:

"February 27th 2016, an artificial intelligence named NCVSAI joined Numerai. .
. He uses an untraceable email address. . . He is completely anonymous. His
strongest prediction: buy Salmar ASA — a Norwegian salmon company. . .
Numerai’s hedge fund went long Salmar ASA."

Ok, so how is this anything but an anonymous proxy to arbitrage insider
information?

Wrap confidential, non-public information under the guise of developing "AI"
for trading?

I am serious, I don't necessarily understand the need for the data scientists
to maintain anonymity, seems like the only functional reason is to let them
break the law?

~~~
antognini
I've been participating in Numerai for a few months now. (I've only made some
beer money from it, nothing serious.) When you get the data, you have no idea
what it is. It's just a file with ~70,000 data points, each of which has 21
features. Each feature is uniformly distributed between 0 and 1. All you have
to do is make a binary classification of 0 or 1 (or, more accurately, the
probability that the data point is in class 0 or 1). They don't tell you what
the 21 features represent.

As far as you know, these predictions could be used to make currency trades,
or stock predictions, or real estate purchases, or something more exotic. You
really have no idea. And since you don't know what these data points represent
you can't use any insider knowledge about anything to help you.
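
For readers who have not seen the files, here is a minimal sketch of the task as described above, on synthetic stand-in data. The hidden rule inside `make_rows`, the reduced row count, and the plain logistic regression are all assumptions for illustration; what the real features encode, and how strong the real signal is, is not known.

```python
import math
import random

random.seed(0)
N_FEATURES = 21  # matches the 21 anonymous features described above


def make_rows(n):
    """Synthetic stand-in: 21 unlabeled features in [0, 1] plus a binary target."""
    rows = []
    for _ in range(n):
        x = [random.random() for _ in range(N_FEATURES)]
        # Hypothetical hidden rule: only the first three features carry signal.
        z = 4.0 * (x[0] - 0.5) - 3.0 * (x[1] - 0.5) + 2.0 * (x[2] - 0.5)
        p = 1.0 / (1.0 + math.exp(-z))
        rows.append((x, 1 if random.random() < p else 0))
    return rows


def predict(w, b, x):
    """Probability that the row belongs to class 1."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit(rows, epochs=30, lr=0.05):
    """Logistic regression trained with plain stochastic gradient descent."""
    w, b = [0.0] * N_FEATURES, 0.0
    for _ in range(epochs):
        for x, y in rows:
            err = predict(w, b, x) - y
            b -= lr * err
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w, b


def log_loss(rows, w, b):
    """Mean negative log-likelihood of the predicted probabilities."""
    eps = 1e-12
    total = 0.0
    for x, y in rows:
        p = min(max(predict(w, b, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(rows)


train, test = make_rows(2000), make_rows(1000)
w, b = fit(train)
model_loss = log_loss(test, w, b)
baseline_loss = math.log(2)  # loss of always answering 0.5
print(round(model_loss, 3), round(baseline_loss, 3))
```

The model never learns what the columns mean; it only has to beat the 0.5 baseline on held-out rows.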

~~~
iamthepieman
EDIT: This is total conjecture.

Only one, or at most a few, of those 21 features represent real data. The real
data represents information similar to the insider information they wish
to act upon.

Example data prep:

1\. Insider source says that a contract is falling through, a patent is being
filed for, quarterly numbers have been missed/surpassed etc.

2\. Similar information is gathered from historic performance data of the
company, similar companies or market segments.

3\. The information is correlated with whatever metric they wish to move along
and encoded in one of the 21 feature classes.

4\. Repeat for whatever relevant information that can be linked to the insider
source - i.e. competing companies, re-encoding separately for long and short
positions etc.

5\. Fill in remainder of 21 feature classes with noise.

6\. Profit.

~~~
gohrt
What would be the point of all this? Step 1 is the illegal part, and the other
steps don't erase that.

~~~
poikniok
Well of course it is illegal, you just don't want to get caught, and this
allows you to have plausible deniability.

------
markovbling
I went to school with Richard and so clearly remember him explaining this to
me over coffee less than a year ago when the website had an under-construction
style landing page and it's been amazing to watch this grow so fast.

The platform is great and I'd strongly recommend anyone wanting to get machine
learning experience or who has played with Kaggle to check out Numerai!

The homomorphic encryption piece is fascinating and I think it'll be an
important piece in balancing the privacy vs. utility of personal data as
machine learning seeps deeper into the fabric of our lives.

~~~
CN7R
> The trouble with homomorphic encryption is that it can significantly slow
> down data analysis tasks. “Homomorphic encryption requires a tremendous
> amount of computation time,” says Ameesh Divatia, the CEO of Baffle, a
> company that’s building encryption similar to what Craib describes.

> According to Raphael Bost, a visiting scientist at MIT’s Computer Science
> and Artificial Intelligence Laboratory who has explored the use of machine
> learning with encrypted data, Numerai is likely using a method similar to
> the one described by Microsoft, where the data is encrypted but not in a
> completely secure way.

Doesn't this imply that homomorphic encryption isn't being used, but something
like it instead?

~~~
BickNowstrom

        """
        https://arxiv.org/abs/1508.06574
        "An encryption scheme is said to be homomorphic 
        if certain mathematical operations can be applied 
        directly to the cipher text in such a way that 
        decrypting the result renders the same answer as 
        applying the function to the original unencrypted 
        data."
        The function = GradientBoostingRegressor
        the cipher text = X_encrypted
        original data = X
        same answer = mean absolute error
        """
        import numpy as np
        from sklearn.metrics import mean_absolute_error
        from sklearn.ensemble import GradientBoostingRegressor
    
        # Replicability
        np.random.seed(0)
    
        # Create a data set with 1000 samples and 3 features
        X = np.random.randint(0, 60, (1000,3))
    
        # Create ground truth (the product of the three 
        # features - 100) / 11
        y = (np.prod(X, axis=1) - 100) / 11.
        
        # Encrypt y
        y_encrypted = y + 20
    
        # Encrypt X
        X_encrypted = X * -0.5
    
        # Init our model
        rgr = GradientBoostingRegressor(random_state=42)
    
        # Fit model on the first 500 unencrypted samples
        rgr.fit(X[:500], y[:500])
    
        # Predict the remaining 500 samples
        preds = rgr.predict(X[500:])
    
        # Fit model on the first 500 encrypted samples and encrypted targets
        rgr.fit(X_encrypted[:500], y_encrypted[:500])
    
        # Predict the remaining encrypted samples and decrypt (undo the +20)
        preds_decrypted = rgr.predict(X_encrypted[500:]) - 20
    
        # Evaluate both functions
        print(mean_absolute_error(preds, y[500:]))
        print(mean_absolute_error(preds_decrypted, y[500:]))
    
        #>>> 323.09
        #>>> 323.72

~~~
lkowalcz
The encryption here is being done by "adding 20" / "multiplying by -0.5"?

Given this "encrypted" X , y dataset, I could easily find the unencrypted
version... (even if I don't know 20 or -0.5, this still reveals so much of the
structure that I don't believe it provides any real protection against
anything except the most lazy attackers)
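
A minimal sketch of why an affine "encryption" is weak, under one loud assumption: the attacker has somehow learned two plaintext/ciphertext pairs for a column. The helper names `recover_affine` and `decrypt` are hypothetical, and the numbers are taken from the toy transforms above.

```python
def recover_affine(pair_a, pair_b):
    """Solve c = a * x + b from two known (plaintext, ciphertext) pairs."""
    (x1, c1), (x2, c2) = pair_a, pair_b
    a = (c2 - c1) / (x2 - x1)
    b = c1 - a * x1
    return a, b


def decrypt(c, a, b):
    """Invert the affine map once a and b are known."""
    return (c - b) / a


# Feature transform from the toy example: c = -0.5 * x
a, b = recover_affine((10, -5.0), (40, -20.0))
print(a, b)                  # -0.5 0.0
print(decrypt(-12.5, a, b))  # 25.0

# Target transform from the toy example: c = y + 20
a_y, b_y = recover_affine((0.0, 20.0), (11.0, 31.0))
print(a_y, b_y)              # 1.0 20.0
```

Without any known pairs the attack is harder, but the shape of each column's distribution still survives the transform unchanged.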

~~~
BickNowstrom
It is a toy example to show that a form of homomorphic encryption is possible
without going all the way to Fully Homomorphic Encryption.

And simple linear transforms on already anonymized features are not as easy to
reverse engineer as you may think. Just try it on a few datasets from UCI.

~~~
lkowalcz
Ah ok, sure. I wouldn't call something like a linear transform on anonymized
features "encryption" (more like obfuscation?), but I guess it's good
marketing in that it lets them associate with the "recent advances in [real]
homomorphic encryption"

~~~
BickNowstrom
If you desire something more one-way, consider PCA, random projections,
feature expansions (with something like Random Bits Regression), hashing, or
the last hidden layer activations of your best in-house neural net. Then
combine these approaches for good measure.
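
As a rough illustration of one of these options, here is a sketch of a lossy Gaussian random projection. The sizes (21 inputs, 8 outputs) and the names `R` and `project` are assumptions made for the example, not anything Numerai has described.

```python
import random

random.seed(1)
D_IN, D_OUT = 21, 8  # hypothetical sizes: 21 raw features squeezed into 8

# Secret Gaussian projection matrix, kept by the data owner.
R = [[random.gauss(0.0, 1.0) / D_OUT ** 0.5 for _ in range(D_IN)]
     for _ in range(D_OUT)]


def project(x):
    """Map one raw feature vector to its obfuscated representation."""
    return [sum(r_ij * x_j for r_ij, x_j in zip(row, x)) for row in R]


raw = [[random.random() for _ in range(D_IN)] for _ in range(5)]
released = [project(x) for x in raw]
print(len(released), len(released[0]))  # 5 8
```

Because 8 < 21, many different raw rows map to the same released row, so the map cannot be inverted exactly even by someone who learns `R`. Every released column is also a mix of all raw columns, which hides individual features but still leaves linear structure for a model to use.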

Agreed on the clever marketing, but at least they put their money (expensive
dataset) where their mouth is (release it to reverse engineers the world
over).

Fully Homomorphic Encryption challenges would be interesting, but it would
disqualify our current state-of-the-art algorithms, and reduce the playing
field to a handful of people who know how to write algos that work with Fully
Homomorphic Encryption (if any competitor at all is allowed to work on this,
and not too busy working for the NSA).

------
Uptrenda
A while ago I wrote about how AI might be trained on a data set and then used
to form the basis of an alternative cryptocurrency for data mining. I.E. maybe
you train the AI to recognize images of mountains and then use that as your
proof-of-work algorithm to reward "miners" for finding related images on the
Internet.

What I never imagined is that you might outsource the process of building that
AI itself as a separate entity (maybe even a separate blockchain.) You could
do the entire thing with commitments:

* Build a blockchain that is about rewarding data scientists for predictive models on data sets.

* Commit to a hash of a data set (the challenge to the network.)

* Hash enters chain.

* Release data set.

* Give them a deadline for a solution, after which they commit to a hash of their solution / predictive model along with an ECDSA pub key for a reward.

* Solutions are released after N blocks.

* Top N solutions automatically receive rewards in this new cryptoasset.

* Zero trust would be required since it operates on outcomes that anyone can check.

* (This could also be done as an Ethereum smart contract instead of a blockchain.)
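
The commit/reveal steps above can be sketched off-chain with nothing but a hash function. This is a toy under stated assumptions: the participant names, the comma-separated payload format, and the accuracy scoring are all hypothetical, and no blockchain, deadline enforcement, or reward payout is modeled.

```python
import hashlib
import hmac
import secrets


def commit(payload: bytes):
    """Return (digest, nonce); publish the digest now, keep the rest secret."""
    nonce = secrets.token_bytes(16)  # stops brute-force guessing of small payloads
    digest = hashlib.sha256(nonce + payload).hexdigest()
    return digest, nonce


def verify(digest: str, nonce: bytes, payload: bytes) -> bool:
    """Check a revealed payload against the digest committed earlier."""
    expected = hashlib.sha256(nonce + payload).hexdigest()
    return hmac.compare_digest(expected, digest)


# Hypothetical round: two participants commit before the deadline...
truth = [1, 0, 1, 1]
submissions = {"alice": b"1,0,1,0", "bob": b"1,0,1,1"}
commitments = {name: commit(p) for name, p in submissions.items()}


# ...then reveal; anyone can check the hashes and score the outcomes.
def accuracy(payload: bytes) -> float:
    preds = [int(v) for v in payload.decode().split(",")]
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


valid = {name: p for name, p in submissions.items()
         if verify(commitments[name][0], commitments[name][1], p)}
ranking = sorted(valid, key=lambda name: accuracy(valid[name]), reverse=True)
print(ranking)  # ['bob', 'alice']
```

The nonce matters: without it, anyone could hash every plausible short submission and match it against the published digests before the reveal.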

Scaling that would be quite hard though and you would need to use standard
proof-of-work to avoid attacks.

IMO: What I find the most interesting about this article is that they're using
masked data as the input which I think is the kind of futuristic cypherpunk
vision a lot of people had in mind for Ethereum when it launched. Ethereum +
AI + crypto + game theory is a match made in heaven and we probably haven't
even scratched the surface of what's possible in this space. I can't wait to
see the kinds of things people come up with in the future.

Edit: formatting.

~~~
billconan
Great idea! Although sharing data or a model via a blockchain is very costly, I
think?

~~~
surrey-fringe
That would make it a _bad_ idea.

------
lkowalcz
Anyone want to speculate about how the data is being "encrypted"? It seems
like they don't want to say, which immediately sets off red flags in my
head...

I am pretty sure homomorphic encryption is not being used. I think if they are
doing anything rigorous, _maybe_ they are using order-preserving encryption
([http://www.cc.gatech.edu/~aboldyre/papers/bclo.pdf](http://www.cc.gatech.edu/~aboldyre/papers/bclo.pdf)).
This would mean that the only valid operations on the ciphertexts are
comparison operations. I can't seem to find anywhere where numer.ai actually
says how to interact with the "encrypted" data. I think it's a little strange
that they would suggest that they are using homomorphic encryption yet have
only comparison operations actually make sense on their "encrypted" data.
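
To make the "only comparisons are valid" point concrete, here is a toy order-preserving mapping. It is not the scheme from the linked paper, only a random strictly increasing lookup table; the domain size and the name `ope_encrypt` are assumptions for the example.

```python
import random

random.seed(7)
DOMAIN_SIZE = 100  # hypothetical plaintext domain: integers 0..99

# Secret key: a random strictly increasing table built from positive gaps.
table, running = [], 0
for _ in range(DOMAIN_SIZE):
    running += random.randint(1, 1000)
    table.append(running)


def ope_encrypt(m: int) -> int:
    """Look up the ciphertext for plaintext m; order is preserved by design."""
    return table[m]


a, b = 17, 42
print(ope_encrypt(a) < ope_encrypt(b))  # True
```

Sorting, thresholds, and tree splits still work on such ciphertexts, but sums and differences of ciphertexts no longer correspond to anything in the plaintext domain.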

A second hypothesis would be that no encryption is being used at all, and this
is just unlabeled features that have been renormalized within [0,1].
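
Under this second hypothesis, the preprocessing could be as simple as a rank transform. A sketch, with made-up input values and a hypothetical helper name:

```python
def rank_normalize(values):
    """Replace each value by its rank, rescaled into the open interval (0, 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, i in enumerate(order):
        out[i] = (rank + 0.5) / len(values)
    return out


prices = [105.2, 3.1, 48.0, 990.5]  # made-up raw feature values
print(rank_normalize(prices))       # [0.625, 0.125, 0.375, 0.875]
```

The output is uniformly spread over [0,1] no matter what the raw distribution looked like, which would be consistent with features that all look uniform.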

A third would be that order-preserving encryption is being used, but in an
ineffective way which is basically just resulting in the second scenario.
(understanding the security guarantees of order-preserving encryption in
practice is very complicated)

------
Twisell
I might be wrong, but this looks like Bernard Madoff in disguise.

"Give me your money and trust my big black box that nobody understands!"

Then, just to be more credible, add a meaningless data point, "We have 7,500
anonymous data scientists", because, as everyone knows, the less you know about
the people who write the abstract models that manage your money, the more
reassured you should be.

"My son told me about anonymous they are a hell of good hackers!"

~~~
usefulcat
Unless they require some sort of financial contribution from the 7500, it
seems like the biggest risk (apart from wasting your time) is that if you do
come up with something useful you have no way to know the value of it; you
basically just have to trust them.

~~~
azernik
I think parent's concern is scamming of investors, not programmers.

~~~
phpnode
Why not both?

~~~
azernik
Sure. I was just responding to the original comment that spoke exclusively to
the investor risk.

------
bluetwo
That all sounds awesome and impressive, but I have the same problem with it
that I have with any of the previous neural network hedge funds or, for that
matter any of the 24-hour business news stations:

If you truly had the ability to predict the price direction of ANY
stock/fund/option in ANY consistent time frame (24-hours/week/month/quarter/year)
ahead of time, with ANY accuracy greater than a coin flip, it would be
_ridiculously_ easy to turn this information into piles of cash.

So, if you are NOT talking to me from atop your piles of cash, I have to
assume your accuracy is no better than the flip of the quarter in my right
pocket.

~~~
spuz
The article says: "Numerai’s fund has been trading stocks for a year. Though
he declines to say just how successful it has been, due to government
regulations around the release of such information, he does say it’s making
money. And an increasingly large number of big-name investors have pumped
money into the company"

So perhaps they are sitting on piles of cash?

~~~
vkou
This screams 'Ponzi scheme'. If you are thinking of putting money into this,
don't walk away - run away.

Above-the-board trading institutions don't have problems with nebulous
'government regulations', when talking about how profitable their funds are.

~~~
tomp
You've no idea what you're talking about. Hedge funds (legally) can't
advertise to retail investors (if they do, they're no longer hedge funds,
which substantially increases their regulatory burden and narrows the range of
strategies they can trade). Having said that, even if they could, they
probably wouldn't want retail investors - if they're good, they can get rich
investors (e.g. $10M+), reducing their administrative costs and allowing them
to trade with longer horizons.

------
cwyers
The idea that data science is some kind of numerical alchemy where you can
just anonymize the data so much that the people doing the modeling don't even
know what the problem domain is just irks me. It's utter nonsense.

------
tinco
I wonder if in the end there will be just a single model, and a single
faceless coder who actually makes the best predictions, and the whole of
Numerai will be just an elaborate way of finding that particular person and
his model to make the market.

~~~
BickNowstrom
With meta-modeling you can build an ensemble (single model) that beats any
individual model. This should also have less variance (See Breiman's Bagging
Predictors). I'm saying that even if there is a super talent on Numerai, one
should still use, say, 0.95 to 0.05 weights, to "hedge your bets" and improve
accuracy. With enough competitors, it becomes near impossible for a single
agent to beat all the others combined.

Kaggle competitions also see this.
[https://www.boozallen.com/content/dam/boozallen/documents/20...](https://www.boozallen.com/content/dam/boozallen/documents/2015/12/2015-FIeld-Guide-To-Data-Science.pdf)
[page 93] shows a graph where a simple average of
the top models in a competition gives an ensemble model with higher accuracy
than any of the individual competitors.
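
A small simulation of the averaging effect, with one loud assumption: each model's error is independent noise around the truth. Real competitors' models are correlated, so the real gain is smaller than what this sketch shows.

```python
import random

random.seed(42)
N_SAMPLES, N_MODELS = 2000, 10

truth = [random.random() for _ in range(N_SAMPLES)]

# Hypothetical competitors: the truth plus independent Gaussian noise.
models = [[t + random.gauss(0.0, 0.2) for t in truth] for _ in range(N_MODELS)]


def mse(pred):
    """Mean squared error against the ground truth."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / N_SAMPLES


# Simple unweighted average of all models, sample by sample.
ensemble = [sum(column) / N_MODELS for column in zip(*models)]

best_single = min(mse(m) for m in models)
print(round(best_single, 4), round(mse(ensemble), 4))
```

With independent errors, the variance of the average falls roughly as 1/N, which is why the ensemble beats even the best single model here.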

I think that is largely the beauty of Numerai. Using adversarial agents to
build a "collaborative" model. Disclaimer: Won a few bitcoin on Numerai.

------
NumberCruncher
This reminds me of the "neural network black box based forex trading"
companies 10 years ago, based in countries where finance fraud did not count
as a crime.

------
tlrobinson
This is about 2 (big) steps from the plot of "Daemon".

~~~
ph0rque
This one? [http://a.co/6lbILZm](http://a.co/6lbILZm)

~~~
ersii
Yep! It's a good read. Highly recommend it.

It has a follow up called Freedom(TM). Both books by Daniel Suarez.

~~~
ph0rque
Thanks! I placed a hold for it at my library.

------
saycheese
>> " If you are a US taxpayer and have tournament winnings, you will be
required to submit to Numerai a Form W-9 with your Taxpayer Identification
Number, and you will receive from Numerai a Form 1099."

Bit of a stretch to say they do not know any of the names of coders.

Anyone know how many valid W-9s they've received? Seems like if the user was
connecting from an IP in the US that the user would need to prove they're not
a US taxpayer.

~~~
HappyTypist
You have the option to self-declare that you are not a US resident.

~~~
arbuge
If you actually are a US resident and declare that, you would be breaking the
law.

And if you receive payment in a US bank account and declare yourself a non-
resident, expect questions to be asked.

~~~
HappyTypist
Payment is in bitcoin.

~~~
HappyTypist
It's much more difficult to track. You can convert your bitcoin to cold hard
cash at an ATM anonymously.

------
macandcheese
How does this compare to Quantopian?

~~~
CN7R
> Write your algorithm in your browser. Then backtest it, for free, over 14
> years of minute-level US equities data, and soon, US futures. (source:
> Quantopian website)

Numerai encrypts the data they give you, so you can't just take your algorithm
and start your own hedge fund (because you won't have any data to improve
with).

------
popol12
So it could be 1 dev who supplied 7500 different models.

------
irln
No sarcasm intended here: Does it matter in terms of social utility whether
these types of predictive models are making predictions of human decisions to
buy or sell a stock versus predictions of AI based decisions to buy or sell a
stock?

~~~
lmm
A model that we understand seems more useful than one that we don't. If the
hedge fund is saying "buy company X but not company Y because company Y leases
its equipment and that will cost them in the long term" (say), that's
notionally more valuable than "buy company X because the computer says so",
because other companies can learn the lesson of this and not lease their
equipment.

In practice given that hedge funds tend to be secretive anyway maybe it makes
very little difference.

~~~
huac
It's reasonably easy (not trivial, not impossible) to track the trades of
hedge funds, since they do have to disclose their transactions. And some guys
will even tell you! [1] [2] [3]

 _Why_ they make their decisions is another question.

[1]:
[https://www.bloomberg.com/gadfly/articles/2016-09-14/herbali...](https://www.bloomberg.com/gadfly/articles/2016-09-14/herbalife-leave-it-for-icahn-and-ackman)

[2]: [http://etfdb.com/etfdb-category/hedge-fund/](http://etfdb.com/etfdb-category/hedge-fund/)

[3]: [https://whalewisdom.com/](https://whalewisdom.com/)

------
Paul-ish
I don't know much about how they obfuscate the data, but if anyone could figure
out a way to unblind it, they could reasonably predict what trades Numerai
will make, right? Even more important, they could influence those trades.

~~~
svantana
No, because the predictions are not public. If they were, everyone could just
copy the top predictions.

~~~
Paul-ish
It may be possible to anticipate the predictions of others working from the
same data if you know what techniques people tend to apply.

------
CN7R
Doesn't this assume that the market is composed of completely rational actors?
On the same note, why not just do sentiment analysis on the people trading and
buying stocks?

------
siliconc0w
I'm pretty sure they're using Order Preserving Encryption and not true Fully
Homomorphic Encryption. Marketing is great though - I love the look and
messaging.

------
stokilo
Investing like this is like implementing a Texas Hold'em poker AI without
knowing the history of previous hands or your opponents' attitudes. It just
won't work; all the marketing around it is just a bubble.

