
We're in the midst of a data gold rush. People who have data are struggling to monetize it. If you're a data buyer, you're probably swamped with the quantity and breadth of data providers out there. AI/ML techniques to make sense of this data are still only scratching the surface. I think this is where there is a lot of low-hanging fruit: creating services or tools that allow non-CS/non-Quant people to extract insights from TBs of data...

On the exchange side: these guys are always on the prowl for hot new properties to scoop up. The traditional business model of simply earning fees on exchange trading has been slowly eroding for the last 10 years, so they need to branch out into services and other data plays...




Alternative take: there isn't that much low hanging fruit there.

Hear me out.

"To the person who only has a hammer, everything looks like a nail."

The data in front of you is the data you want to analyze, but it doesn't follow that it's the data you ought to analyze. I predict that most of the data you look at will result in nothing: the null hypothesis will not be rejected in the vast majority of cases.

I think we -- machine learning learners -- have a fantasy that the signal is lurking in there, and that if we just employ that one very clever technique it will emerge. Sure, random forests failed, and the neural nets failed, and the SVR failed, but if I reduce the step size, plug the output of the SVR into the net, and change the kernel...

Let me give an example: suppose you want to predict the movement of the stock market using the movement of the stars. Adding more information about the stars, and more techniques, may feel like progress, but it isn't.

Conversely, even a simple piece of information that requires minimal analysis (this company's sales are way up and no one but you knows it) would be very useful in making that prediction.

The first data set is rich, but simply doesn't have the required signal. The second is simple, but has the required signal. The data that is widely available is unlikely to have unextracted signal left in it.


I've been selling good data in a particular industry for three years. In this industry at least, the so-called "low-hanging fruit" only seems low-hanging until you realize that the people who could benefit most from the data are the ones who are mentally lazy and least likely to adopt it. Data has the same problems as any other product and may even be harder, because you need to 1) acquire the data and 2) build tools that reliably solve difficult problems using huge amounts of noisy information...


Isn't there utility in accepting the null hypothesis? Knowing that there is no signal in the data is almost as valuable as knowing the opposite, i.e., it tells you where not to look for information.

I think your example is really justifying a "machine learner" that has some domain expertise and doesn't blindly apply algorithms to some array of numbers.


I think his argument is that some hypotheses can be rejected out of hand, and that people are wasting time and effort obtaining evidence that, with better priors, would be multiplied by something like 0.0000000000001 and end up as an insignificant posterior. That's what the astrology example indicates.
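Roughly, in back-of-the-envelope numbers (everything here is made up for illustration): even if a backtest looks like fairly strong evidence for "the stars predict the market", a sensible prior leaves the posterior negligible.

    # Hypothetical numbers, purely to show the shape of the argument.
    prior_odds = 1e-13      # prior odds that astrology moves the stock market
    bayes_factor = 100.0    # how strongly the observed data favours that hypothesis
    posterior_odds = prior_odds * bayes_factor

    posterior_probability = posterior_odds / (1.0 + posterior_odds)
    print(f"posterior probability: {posterior_probability:.1e}")  # ~1e-11, still nothing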


The effort to evaluate the null hypothesis can be costly. In the competitive environment found in most hedge funds, how would you allocate to accepting the null hypothesis?

As in, if you worked at a data acquisition desk, and spent a quarter churning through terabytes of null hypothesis data, what's your attribution to the fund's performance?



Accepting the null hypothesis has utility only if you have some reason to believe it would not be accepted.

Accepting it per se has no particular value. You could generate several random datasets, and accept/reject the null hypothesis between them ad infinitum.

To put it another way, it's only interesting if it's surprising.
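A minimal sketch of that point, comparing pairs of pure-noise datasets with NumPy/SciPy (sample sizes and thresholds are arbitrary): you "accept" the null in roughly 95% of trials and falsely reject it in about 5%, and neither outcome tells you anything, because nothing about it is surprising.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha = 0.05
    trials = 1000

    rejections = 0
    for _ in range(trials):
        a = rng.normal(size=200)   # random dataset A
        b = rng.normal(size=200)   # random dataset B, same distribution
        _, p_value = stats.ttest_ind(a, b)
        if p_value < alpha:
            rejections += 1

    print(f"Rejected the null in {rejections}/{trials} trials "
          f"(expected ~{alpha * trials:.0f} by chance alone)")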


Bingo. You nailed it. I work in finance. Developed markets have efficient, highly liquid stock markets, and the reality is that there are a lot of people competing for the same profits. When there are that many players, if there's a profit to be had from a dataset you can buy from a vendor, chances are one of your many competitors already bought it and found it. This is why we now say don't try to beat the market: you likely can't, and mostly you just need to get lucky by holding the right thing when an unforeseen event occurs. There are too many variables at play that we just don't understand. Most firms are buying these datasets to stay relevant, but they make no real difference to their actual investing strategies.


This is where you might use a genetic algorithm or similar to learn which data to use for a particular prediction. Good AI won't use all the data, just trim it down to the signal.
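For what it's worth, a rough sketch of what that could look like: genetic-algorithm-style feature selection over a synthetic scikit-learn dataset, where each individual is a bitmask over columns and fitness is cross-validated accuracy. The model, fitness function, and GA hyper-parameters are all placeholder choices, not a recommendation.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(42)
    X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                               random_state=42)

    def fitness(mask):
        """Cross-validated accuracy using only the selected features."""
        if mask.sum() == 0:
            return 0.0
        model = LogisticRegression(max_iter=1000)
        return cross_val_score(model, X[:, mask.astype(bool)], y, cv=3).mean()

    pop_size, n_generations, mutation_rate = 20, 15, 0.05
    population = rng.integers(0, 2, size=(pop_size, X.shape[1]))

    for _ in range(n_generations):
        scores = np.array([fitness(ind) for ind in population])
        # Keep the better half, then refill by crossover + mutation.
        parents = population[np.argsort(scores)[-pop_size // 2:]]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, X.shape[1])
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(X.shape[1]) < mutation_rate
            children.append(np.where(flip, 1 - child, child))
        population = np.vstack([parents, children])

    best = population[np.argmax([fitness(ind) for ind in population])]
    print("Selected feature indices:", np.flatnonzero(best))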


I would like to see a use case where AI selects a data source that humans would never consider using.


It's about weighting relative importance, especially in conjunction with multivariate information that may be correlated.


I read a neat criticism of AI techniques. The author pointed out that humans can pick out a strong signal as well as or better than AI, and can also pick out signal from an array of weak sources. AI could identify that second case with fewer weak signals required, but it was hard to trust because it was sometimes wrong.

I wish I could remember the source. I’m sure it was an article here a few years ago. I want to say it was medical diagnosis based on charts.

Anyway, the point was that there is a very narrow valley where AI is useful beyond an expert. And that valley is expensive to explore. And there might not be anything there.


For finance in particular, I'd say we're drowning in a massive volume of shitty data.

A client of mine purchases several fundamental feeds from Quandl, and I email them regularly to point out errors. Not weird, hard, tricky errors, but stuff like "why are all these volumes missing" or "there's a 1-day closing price increase of 1200%" or "you divided when you should have multiplied". This tells me neither Quandl nor the original provider (e.g. Zacks) does any serious data validation, despite claiming to.
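For illustration only (this isn't their feed or schema), the kind of trivial sanity checks that would catch most of what I'm describing, assuming a daily bar table with ticker/date/close/volume columns:

    import pandas as pd

    def basic_validation(df: pd.DataFrame, max_daily_move: float = 5.0) -> pd.DataFrame:
        """Flag missing volumes and implausible one-day close-to-close moves."""
        df = df.sort_values(["ticker", "date"]).copy()
        df["missing_volume"] = df["volume"].isna() | (df["volume"] <= 0)
        df["close_ratio"] = df.groupby("ticker")["close"].pct_change().abs()
        df["suspicious_move"] = df["close_ratio"] > max_daily_move  # e.g. a "1200%" jump
        return df[df["missing_volume"] | df["suspicious_move"]]

    # Usage: flagged = basic_validation(bars); send the flagged rows back to the vendor.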

If the companies people have been paying for decades for this data get it wrong this often, how can I trust any weirder data they're trying to sell me? I thought the point of buying these feeds was to let the seller worry about quality assurance.


This doesn't matter - any sophisticated user will have their own software to clean the data anyway. Their concern is getting the data; they know how to clean it once they have it.


We're not talking about data cleaning, but about data validation. I can fix a weirdly formatted field (cleaning), but I can't reliably impute most kinds of missing data. I can detect errors, but can't fix most of them without additional information...which is exactly what I'm paying the provider for.


You can, there are ways to do it: interpolation, etc. Sometimes the data is missing just because it's not available, and you still have to handle that case. The proper way of filling in the missing data will depend on what you are using it for - so for the provider to do it would actually be kind of wrong.
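As a small hypothetical example of why the right fill depends on the use case (made-up series, pandas):

    import pandas as pd

    prices = pd.Series(
        [100.0, None, None, 103.0, 104.0],
        index=pd.date_range("2019-01-01", periods=5),
    )

    # Backtesting a trading signal: forward-fill, since only past prices are known.
    backtest_view = prices.ffill()

    # Plotting or smoothing for a report: linear interpolation may be acceptable.
    report_view = prices.interpolate(method="linear")

    print(backtest_view)
    print(report_view)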


I think we're talking past each other here. I don't expect the provider to do imputation for me, but I shouldn't have to bug them to get the best version of the data they have. Sure, sometimes missing is missing, but in my experience with Quandl/Zacks, it's usually an error on their end. The price jumps are sometimes because they conflated two different tickers. If they divided instead of multiplying (split factors), I need external information to even detect the error! Same goes if they get a date wrong somewhere.


This is what people in this thread don't really understand: investors want the raw feed. There's nothing to be gained from an aggregated, cleaned feed that everyone has.


You are right, extracting insights from data is low-hanging fruit. From what I observe, there is a huge lack of proper services and tools that can automatically produce insights. There are of course automated machine learning solutions, but they focus more on model tuning (in the Kaggle style) rather than giving users understanding and awareness of their data.


I think data scientists need to produce more actionable insights as opposed to living in their own world. I suspect there will be a rising group of people who can understand data science techniques and communicate them effectively to drive business decisions. These people will be the ones who clinch the top posts.


I've been running an automated machine learning SaaS for 2 years, and after this time I can tell you that it is a huge problem that data scientists are living in their own world (including me! and including data science tools).

I had a situation like this: my user created 50 ML models (xgboost, lightgbm, NN, rf) and an ensemble of them. Let's say the best tuned single model was 5% better than the best single model with default hyper-params, and the ensemble was 2% better than the best tuned single model. For me it was a huge success, but the user didn't care about model performance. He wanted insights about the data, not a tuned blackbox.
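To make the gap concrete, here's a toy sketch on synthetic data (the models and numbers are made up, not my product or the user's actual pipeline): the ensemble may edge out the single model on accuracy, but feature importances from one understandable model are much closer to the "insights" he was asking for.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                                  VotingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, n_informative=4,
                               random_state=0)

    # An ensemble chasing the last bit of accuracy (the "blackbox").
    ensemble = VotingClassifier([
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ], voting="soft")
    print("ensemble accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())

    # A single model whose importances at least say something about the data.
    rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
    print("single model accuracy:", cross_val_score(rf, X, y, cv=5).mean())
    top = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])[:5]
    print("most informative features:", top)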


I understand every single word you said and fully agree with you. Good point!


In an interview, Ryan Caldbeck from Circle Up describes two categories of models: brainy models and brawny models. The ensemble described above sounds like a brawny model: you don't care how it made the decision, you're glad it did the heavy lifting and you might even double check the result.

However, the user's concern about the black box suggests they wanted what Ryan refers to as a brainy model, one with explicable decisions. Even within the features of the model there could be things to learn about the data.

How else are data scientists stuck within their own world?


Nasdaq already makes more money on data licensing than on trading fees or IPOs. Each time a professional in the financial services industry wants real-time display data, for example, they have to pay Nasdaq a monthly fee. Nasdaq and NYSE now compete for listings less for the trading fees than because each listing makes their data licensing packages more valuable.


There is Ocean Protocol (https://oceanprotocol.com/) that lets you sell your data.

There is ChainLink (https://chain.link/) that lets you sell your data via API service through decentralized oracle nodes.

https://blog.goodaudience.com/the-four-biggest-use-cases-for...

Monetization is coming soon... in a big way.


How do these services make it easier to evaluate data? The Medium article starts with a disclaimer about DLT... Talking with investors buying data, one shouldn't be surprised to hear them request uploads to their FTP. Their data teams are overcommitted when it comes to the evaluation side of consuming data. They aren't (yet) resourced like a tech startup.

How should they prioritize learning about ingesting data from a DLT? They have data brokers (like Quandl) coming to them with assurances of frictionless integration, with data they can understand and use, today!


In addition to the FTP, half the work in getting these alt-data feeds into finance is getting the metadata right: tying each record to an entity or security, and knowing with how much of a lead/lag the data becomes available and how soon it's tradable. Quandl helps with the technical friction and also with this metadata and security-mapping aspect.
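As a hypothetical sketch (illustrative field names, not any vendor's actual schema), the metadata that ends up mattering for each record looks something like this:

    from dataclasses import dataclass
    from datetime import date, datetime

    @dataclass
    class AltDataRecord:
        raw_entity_name: str    # e.g. the merchant name as it appears in the raw feed
        security_id: str        # identifier assigned after entity/security mapping
        period_end: date        # what period the observation describes
        available_at: datetime  # when the record actually reached the buyer (lead/lag)
        value: float            # the observation itself

    # A backtest should only use records where available_at <= the simulated trade
    # time, otherwise the strategy is peeking into the future.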


Oh sure, I'm with you there! But how about those blockchain upstarts mentioned above?


I'm calling it here: the most useful data is private, or can't be sold due to confidentiality. The fact that data is kept confidential is strong evidence that its owner knows it is useful and hopes others won't be able to use that signal.


What about the banking initiatives in the EU requiring every bank to open up its data via API? That data seems pretty private and confidential.

https://www.evry.com/en/news/articles/psd2-the-directive-tha...

ChainLink has an option to use Intel SGX along with TownCrier to provide trusted execution at the processor level. That ensures confidentiality without exposing the data at all.

http://www.town-crier.org/


Confidentiality is not an obstacle to using material data. As long as you can independently obtain it through unprivileged research, you're fine to use it, sell it or trade on it.


We also pitched something in that direction with Rlay at Techcrunch Disrupt Berlin: https://techcrunch.com/2018/11/29/rlay-startup-battlefield/


In the ESG use case, how do you measure inter-rater reliability? How do you control for exposure to MNPI?


This is something I've thought about and worked on over the last several years. I'm more than happy to have an in-depth conversation on it if anyone is interested.


Fwiw, that's what I have been trying to do for the past few years: infrastructure for easier access to algorithmic trading.

Shameless plug: https://KloudTrader.com/narwhal



