
Ask HN: Rules of thumb to test feasibility of Machine learning applications? - adamwi
With the recent developments in the field of machine learning there are plenty of problems that can now be solved that were not possible a couple of years ago. As someone with only a basic understanding of the field I find it hard to judge the feasibility of product ideas involving machine learning.

Given that unique and relevant data sets are often hard to come by (for obvious reasons), I'm wondering if there are any good rules of thumb for judging the feasibility of different ideas.

Let me give you a concrete example: build a system that looks at medical records and approximates the risk of a certain illness. It is fairly easy to get overall data on how common the illness is, which symptoms are relevant, and even whether those symptoms are commonly recorded in medical records. But the granular data in the actual medical records is fairly hard to come by and would require a significant effort to collect. In this situation it would be preferable to approximate, e.g., how many medical records are needed to reach a certain precision before pursuing the idea and starting to collect data.

A less well-defined example would be: build an application that identifies whether a picture contains a golden retriever with a red scarf around its neck. Here too it would be relevant to have rough numbers on the number of data points needed, etc. (even if the actual data in this case is probably much easier to come by).

In the first case I could probably get OK approximations using statistics assuming normal distributions, but it is less straightforward in the second example.
======
PaulHoule
If you're interested in commercialization you should start from day one with
some estimate of the value the application creates. That is, "saves $X
dollars" or "creates $X in revenue".

I do work in the natural language and item matching areas and in those cases I
do what I call "preliminary evaluation" by working a small number of cases
(say 10-20) in depth and putting together some story about what kind of
outputs would be expected, what the actual requirements are, and what a
decision process is going to have to take into account. You've got to put
together a plausible story that the decision process exists.

For your case I would say the dog example is more feasible than the health
care one. The caveat is what the negatives are like for the dog: are we
looking at photos that have a lot of yellow and red? Are we looking at photos
of dogs, etc.? As for health care, prediction just adds to the health care
boondoggle unless you can make the case that it improves outcomes and cost,
as opposed to just getting a better score on Kaggle.

In the case of text, I'd say you want 10,000 examples of items in the class,
and at least that many out of it, for a problem that bag-of-words can handle,
to get results you'd really be proud of. You might get that down to as little
as 1,000 if some dimensionality reduction is in use.
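One cheap way to sanity-check numbers like these is a learning curve: train a simple bag-of-words classifier on growing subsets of whatever labeled data you can get, and see whether test accuracy is still climbing at your largest subset. A minimal sketch in pure Python, with a toy synthetic corpus standing in for real labeled text (the word lists and sizes here are invented for illustration):

```python
import math
import random
from collections import Counter

def train_nb(docs, labels):
    """Train a multinomial naive Bayes classifier on bag-of-words counts."""
    counts = {0: Counter(), 1: Counter()}   # word counts per class
    priors = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(doc.split())
    vocab = set(counts[0]) | set(counts[1])
    totals = {y: sum(counts[y].values()) for y in (0, 1)}

    def predict(doc):
        scores = {}
        for y in (0, 1):
            s = math.log(priors[y] / len(labels))
            for w in doc.split():
                # Laplace smoothing so unseen words don't zero out the score
                s += math.log((counts[y][w] + 1) / (totals[y] + len(vocab)))
            scores[y] = s
        return max(scores, key=scores.get)

    return predict

# Toy synthetic corpus; in practice this is your real labeled data.
random.seed(0)
pos_words = ["dog", "scarf", "golden", "retriever", "red"]
neg_words = ["cat", "hat", "blue", "poodle", "green"]

def make_doc(class_words):
    return " ".join(random.choices(class_words + ["the", "a", "photo"], k=8))

pairs = [(make_doc(pos_words), 1) for _ in range(300)] + \
        [(make_doc(neg_words), 0) for _ in range(300)]
random.shuffle(pairs)
train, test = pairs[:500], pairs[500:]

# Learning curve: if accuracy is still climbing at your largest n,
# you probably need more data than you have.
for n in (50, 100, 200, 500):
    predict = train_nb(*zip(*train[:n]))
    acc = sum(predict(d) == y for d, y in test) / len(test)
    print(f"n={n:4d}  test accuracy={acc:.2f}")
```

If the curve has flattened well before your data runs out, the 10,000 rule of thumb may be pessimistic for your problem; if it is still rising steeply, it may be optimistic.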

The center of my approach, when precision matters, is case-based reasoning,
where you find that there is one simple strategy that works, say, 70% of
the time, then a patch that gets you to 80%, and then you keep adding
exceptional cases to work up the asymptote. In a lot of cases like that you
can establish a proof of a lower bound on how accurate the results are and
work up to handling more and more cases.
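That strategy-plus-patches idea can be sketched as an ordered rule cascade: the first rule whose condition fires wins, and you track how often each rule handles a case so you can see where the next patch would pay off. A toy illustration (the task, rules, and data are all hypothetical):

```python
from collections import Counter

def cascade_predict(rules, fallback, x):
    """Return (label, rule_name) from the first rule whose condition fires."""
    for name, cond, label in rules:
        if cond(x):
            return label, name
    return fallback, "fallback"

# Hypothetical toy task: classify numbers as "big" (>= 10) or "small".
rules = [
    ("three_digits", lambda x: x >= 100, "big"),  # the simple first strategy
    ("two_digits",   lambda x: x >= 10,  "big"),  # the patch
]
data = [(5, "small"), (12, "big"), (250, "big"), (7, "small")]

hits, correct = Counter(), 0
for x, y in data:
    pred, rule = cascade_predict(rules, "small", x)
    hits[rule] += 1            # which rule handled this case
    correct += (pred == y)
print(f"accuracy={correct / len(data):.2f}", dict(hits))
```

Because each rule's coverage and accuracy can be measured separately, the worst-case accuracy of the whole cascade has a provable lower bound, which is the point of the approach.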

A core issue though is evaluating what matters, which is why I say follow the
money. There is no better way to destroy evaluators than making them split
hairs that don't matter.

~~~
adamwi
Interesting. First of all, I fully agree with you regarding the business side.
You always need to have the customer, their problems, and the revenue model in
mind. E.g., for the health care example I think there is a big upside for the
insurer in being able to proactively identify illnesses and treat them early
(typically cheaper than emergency care, but not always), not to mention
minimizing the human suffering.

But back to the actual question: rules of thumb to estimate the feasibility of
a machine learning application without having access to an actual data set for
the specific problem. It makes sense to break it down into different problem
domains as you mention: NLP, bag of words, image classification, etc.

The 10,000 examples for bag of words is something I will keep in mind going
forward, thanks! When it comes to image classification, I guess a fairly good
benchmark can be established by looking at available image datasets and public
models built on top of them (e.g. ImageNet and later versions) and then
extrapolating the precision and the number of images needed to achieve it
(assuming similar image datasets).
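One common way to do that extrapolation is to fit a power law to a measured learning curve: test error tends to fall roughly as err(n) ~ b * n^(-c), so a least-squares fit in log-log space on a few measured points lets you solve for the n that reaches a target error. A sketch with made-up numbers (the (n, error) pairs are placeholders for what you'd actually measure on a public dataset and model):

```python
import math

# Measured (training-set size, test error) pairs -- placeholder numbers
# standing in for what you'd measure on a public dataset and model.
points = [(1000, 0.30), (2000, 0.24), (4000, 0.19), (8000, 0.15)]

# Fit log(err) = b - c * log(n) by ordinary least squares.
xs = [math.log(n) for n, _ in points]
ys = [math.log(e) for _, e in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
c = -sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
    / sum((x - mx) ** 2 for x in xs)
b = my + c * mx

def examples_needed(target_err):
    # err = exp(b) * n^(-c)  =>  n = (exp(b) / target_err)^(1 / c)
    return (math.exp(b) / target_err) ** (1 / c)

print(f"exponent c = {c:.2f}")
print(f"examples needed for 10% error ~ {examples_needed(0.10):,.0f}")
```

The caveat is that the exponent c is problem- and model-specific, so an exponent fitted on ImageNet-like data is only a rough guide for a different image task.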

Anyone aware of other relevant rules of thumb for other problem domains?

------
bioboy
A lot of what machine learning offers goes beyond correlation and is more
about the interaction between variables to get a result. So think multivariate
analyses. If you can do a multivariate analysis and get to something that is
statistically significant for a certain disease, then it would probably be
worth checking out.
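As a feasibility check along those lines, you can fit a small multivariate model, e.g. a logistic regression, on whatever sample you have and see whether the variables taken together beat the majority-class baseline. A self-contained sketch in pure Python with synthetic "symptom" data (the data-generating rule, variable names, and sizes are invented for illustration):

```python
import math
import random

random.seed(1)

# Hypothetical records: disease is likely only when BOTH symptoms are
# present (an interaction effect), plus a little label noise.
X, y = [], []
for _ in range(400):
    fever = random.random() < 0.5
    cough = random.random() < 0.5
    p_disease = 0.9 if (fever and cough) else 0.05
    X.append([1.0, float(fever), float(cough)])  # leading 1.0 = intercept
    y.append(1 if random.random() < p_disease else 0)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Full-batch gradient descent on the logistic loss.
w = [0.0, 0.0, 0.0]
lr = 1.0
for _ in range(3000):
    grad = [0.0, 0.0, 0.0]
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        for j in range(3):
            grad[j] += (p - yi) * xi[j]
    w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]

acc = sum(
    (sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) > 0.5) == bool(yi)
    for xi, yi in zip(X, y)
) / len(X)
baseline = max(sum(y), len(y) - sum(y)) / len(y)  # majority-class accuracy
print(f"train accuracy={acc:.2f} vs baseline={baseline:.2f}")
```

If a model this simple cannot beat the majority-class baseline on a pilot sample, that is a cheap early signal that the variables you can actually collect may not carry enough predictive information.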

Think of it this way: machine learning is all about grabbing the features that
would normally make us say "duh, it's right there, that's what is causing
it," but in an automated manner. So how do we make the rules for it?

We need many, MANY, examples. If you can provide CONCRETE examples for each
occurrence, then you MAY have a chance at giving it some sort of predictive
capability.

The more important issue is HOW you plan to extract these features, the things
that make you go "duh, that's what's causing it." So focus on this last part,
and the rest will come easier.

