Neglected machine learning ideas (scottlocklin.wordpress.com)
319 points by rbanffy on Aug 4, 2014 | 49 comments

I think this article exemplifies the difference between the kinds of things you see codified in books and on the internet, versus what's active research and well-known folklore in academia. And maybe it highlights the substandard search mechanisms for published research, or the difficulty of learning from published research papers. But it's definitely not about neglect, at least not for most of the topics listed in the article.

For example, the recent AAAI 2014 conference had a bunch of papers on online algorithms for various problems. [1] Likewise, COLT had five or so papers on online learning. [2] Same with KDD [3], SODA [4], and the many other conferences this year that accept papers about ML.

And learning in the presence of noise? Unsupervised learning? Feature engineering? I am literally doing multiple research projects in all of these areas right now! The only way I can imagine that you think they're neglected is that you just don't know where to look for them, because these topics are all over the place in my world. For example, one common term for "feature engineering" is "representation learning," and this was a big topic at this year's SDM conference, specifically w.r.t. data mining in networks.

Why can't you find a book you like for topic X? Maybe it's because researchers have little incentive to write books. You folks in industry could fix that. What with all your ridiculous market valuations of various mobile apps, surely you could scrape together enough funding to convince the experts in their field to write a book.

[1]: http://www.aaai.org/Conferences/AAAI/2014/aaai14accepts.php [2]: http://orfe.princeton.edu/conferences/colt2014/the-conferenc... [3]: http://www.kdd.org/kdd2014/program.html [4]: http://www.siam.org/meetings/da14/da14_accepted.pdf

One often neglected idea is transduction. The idea is that you can learn to label data better by augmenting your labeled dataset with unlabeled data.

The unlabeled data helps discover the structure of the data, which in turn helps the supervised learning.

There are many two-phase approaches to this problem, which first discover features with unsupervised learning and then use those features in supervised learning, but in many cases it's possible to do both at the same time.

If you have a generative model for your data, you can treat the labels as potentially missing values and learn the joint distribution.
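
One way to write that down, as a sketch: with labeled indices L, unlabeled indices U, and a joint model p_theta(x, y) (all notation introduced here, not taken from the thread), the missing labels are summed out of the likelihood:

```latex
\ell(\theta) = \sum_{i \in \mathcal{L}} \log p_\theta(x_i, y_i)
             + \sum_{j \in \mathcal{U}} \log \sum_{y} p_\theta(x_j, y)
```

Maximizing this uses both kinds of data at once. EM is a common way to do it, though nothing above requires that choice.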

Afaik what you're talking about is more commonly referred to as "semi-supervised learning". Transduction is a more specific case of semi-supervised learning where you know the test set, i.e. you know which data points the model will have to make predictions for. That means you can exploit this data for training the model, for example by unsupervised pre-training or with a pseudo-labeling approach.
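
A minimal sketch of the pseudo-labeling idea, assuming a toy 1-D nearest-centroid classifier. The function names and data are hypothetical, picked only to keep the example short; real pseudo-labeling setups usually add a confidence threshold.

```python
# Pseudo-labeling sketch with a 1-D nearest-centroid classifier.
# All names and data are hypothetical, for illustration only.

def centroids(points, labels):
    """Return the mean of the points for each label."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(cents, x):
    """Assign x to the label whose centroid is closest."""
    return min(cents, key=lambda y: abs(x - cents[y]))

def pseudo_label(labeled_x, labeled_y, unlabeled_x, rounds=5):
    """Alternate between guessing labels for the unlabeled points and refitting."""
    cents = centroids(labeled_x, labeled_y)
    for _ in range(rounds):
        guesses = [predict(cents, x) for x in unlabeled_x]
        cents = centroids(labeled_x + unlabeled_x, labeled_y + guesses)
    return cents

labeled_x = [0.0, 4.0]
labeled_y = [0, 1]
# In the transductive setting these would be the known test points.
unlabeled_x = [0.5, 1.0, 1.5, 5.0, 5.5, 6.0]

cents = pseudo_label(labeled_x, labeled_y, unlabeled_x)
predictions = [predict(cents, x) for x in unlabeled_x]
```

The two labeled points alone put the centroids at 0.0 and 4.0. After pseudo-labeling, the centroids move toward where the unlabeled points actually cluster.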

Bayesians don't need a test set 8)

Why is that?


Nuh uh, anyone doing LOOCV is basically using AIC. There are also other principles, such as MDL, which do not rely on a test set.

Because the prior on your parameters smooths out the prediction.

Most cookbook techniques, such as ridge regression, cross-validation, etc., have a Bayesian interpretation as a prior on the parameters.
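
For the ridge case, the correspondence can be spelled out. This sketch assumes Gaussian noise with variance sigma^2 and a zero-mean Gaussian prior with variance tau^2 on the weights (symbols introduced here for illustration):

```latex
\begin{aligned}
y_i &= w^\top x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \qquad w \sim \mathcal{N}(0, \tau^2 I) \\
\hat{w}_{\mathrm{MAP}} &= \arg\max_w \; \log p(y \mid X, w) + \log p(w) \\
&= \arg\min_w \; \frac{1}{2\sigma^2} \sum_i (y_i - w^\top x_i)^2 + \frac{1}{2\tau^2} \lVert w \rVert^2 \\
&= \arg\min_w \; \sum_i (y_i - w^\top x_i)^2 + \lambda \lVert w \rVert^2, \qquad \lambda = \frac{\sigma^2}{\tau^2}
\end{aligned}
```

So under these assumptions the ridge penalty is the ratio of noise variance to prior variance: a tighter prior means a larger penalty.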

Bayesian techniques allow you to use all the data available.

That said, sometimes they are computationally expensive, and it's better to approximate them by using a test set.

Unless you have an infinite regress on priors for your priors, and uncomputable Kolmogorov penalties on the structure of your model, I think you need a test set. (This means you need a test set.)

There's an interesting chapter in MacKay's book on Occam's razor. I'm not sure how I feel about it, but it's very thought-provoking.

If your priors are that strong, why bother with the data?

You don't need a validation set. I'm pretty sure you still want a test set.

I think the last point is important, since it connects the ML and Stats ideas.

It comes down to verbiage really - some general form of label propagation vs censored data.

All ML is density estimation using a parametric probability distribution.

Some algorithms make the estimation easy and the distribution quite implicit; others make the distribution explicit but training is harder.

> All ML is density estimation using a parametric probability distribution.

Umm, is there a typo there? Otherwise it is so wrong that I cannot even begin to pick out the flaws.

In fact the key revolutionary idea in classification theory, courtesy Vapnik, was that one needn't estimate the density at all.

I cannot over-emphasize the revolutionary part: prior to Vapnik, the overarching consensus was that one should try to learn the density and then threshold it to obtain the classifier. The motivation for this is the result that the thresholded conditional probability is the optimal classifier for a {0,1} loss.
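
For reference, that result in symbols for binary labels (eta, g and g^* are notation introduced here, not from the thread):

```latex
\eta(x) = P(Y = 1 \mid X = x), \qquad
g^*(x) = \mathbf{1}\{\eta(x) > 1/2\}, \qquad
P\bigl(g^*(X) \neq Y\bigr) \le P\bigl(g(X) \neq Y\bigr) \ \text{for every classifier } g
```

The plug-in approach replaces eta with an estimate and thresholds that estimate at 1/2.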

What Vapnik showed was that it is better to learn the classifier directly (he called it structural risk minimization, SRM, for short), without any parametric assumptions whatsoever. The main reasoning is that density estimation is freaking hard, and depending on a freaking hard task as a pre-requisite is backwards. Learning to classify is actually easier than learning densities. It was one of those paradigm shifts that come to a field only once in a while.

Even if we let SRM be, even in traditional statistics there have been methods as old as time that do not assume any parametric model. Those into non-parametric statistics (I am not one) would be very upset with your claim :) The problem with parametric models is that you almost never know the parametric family. You can of course test, but how many would you test? There are an infinite number of such families. Non-parametric methods don't make parametric assumptions (they do make some, considerably weaker, assumptions; assumptions are necessary for learning). The drawback is that they are more expensive. Similarly, if you happen to know the correct form of the family, a parametric model would be hard to beat.

@srean -- I agree with you. To make the GP statement true, you have to adopt sweepingly wide definitions of the terms "parametric" and "density estimation", and a very narrow concept of ML to boot. You have given some good background and I feel a need to add more examples.

The idea of boosting comes to mind. A quintessential ML concept, widely applicable, but really nothing to do with density estimation.

All the PAC learning results are another example. They show very strong results over very wide problem classes, but don't have to do with density estimation. They apply, for example, to problems where no density exists (just a probability measure), much less a smooth-enough density to estimate.

Another example of ML research that comes to mind is time and space efficient learning. There is a whole body of work on probabilistic analysis of data structures used for ML (I'm thinking partly about Alex Gray's work, http://www.cc.gatech.edu/~agray/). The work fits solidly into ML, but it's not about density estimation, it's about making provably fast algorithms for large problems.

SVMs induce an implicit probability distribution over labelled samples of points, whether you like it or not.

Yes, "parametric" is often meant to imply a finite parametrization, but that's a dumb terminology.

For conversation, at least, it helps to stick with the agreed-upon definitions; for example, "parametric" means finite parametrization. Otherwise the discussion is moot: if you use definitions that are too broad, then "everything is parametric" becomes a tautology, a vacuous statement.

PAC and MDL aren't as much at odds as you make them out to be. If you are familiar with the PAC-Bayes and PAC-MDL theorems you will know what I mean. My main issue was with the ridiculous (or, with your definition of what's "parametric", vapid) claim that all of ML is parametric.

Secondly, I wanted to highlight that it's a bad way to go about ML to estimate the data distribution first and then find the decision function. This is the precise reason why SVMs broke out of the then state of the art. The two-stage "plug-in" approach used to be the preferred way then. A decision function might be Bayes with respect to some density, but that's beside the point: it is ill-advised to match the density and then derive the decision function, except for very special and narrow cases.

Even if you throw PAC out, parametric stats (am using the standard definition here) is a tool with very narrow scope. If it happens to be in the current scope then by all means use it.

BTW you make a good point about transductive learning. It got compared to semi-supervised learning, but they are not the same thing. Both use unlabeled data, but for transductive learning, the points where you seek labels have to be given ahead of time. This enables very strong guarantees. I think its potential has not come to fruition yet.

EDIT: @murbard2 > Re: See what I did there.

Color me violently unimpressed. That is not a parametric distribution.

I continue to stress that plug-in estimates are in general not a good idea. Even die-hard Bayesians would not use them; they would rather estimate the MAP or the mean a posteriori decision function directly, rather than fit the data distribution and then derive the optimal decision function from it.

I find Bayesian inference to be a good idea, except that it is costly unless you use conjugate priors (these are motivated more by convenience than by data), and it has a "turtles all the way down" problem.

It's not that you have advocated plug-in estimators, but your claim that everything in ML is about fitting parametric densities is likely to mislead readers into believing that the two-step way is a good idea.

I agree that Bayesian techniques can be costly. The argument I'm making is that Machine Learning techniques are best understood as approximations to Bayesian methods. Not for some math reason but for some epistemological reason.

You like SVM? Fine, then why do large margins matter at all? "Intuition" ?

The decision function can be seen as a distribution too. Say you're thinking of building a classifier, you can build the distribution over the domain x {-1,1}

Give me f: A -> B and I'll give you P : (A x B) -> R+ See what I did there?

And yes, you can try very hard to be a Popperian and avoid the problem of induction by inventing a contrived framework like PAC. Or you can embrace description length as a universal prior.

That sounds interesting, do you have a good reference for reading on SRM?

It's hard to pick the one true reference. It is of course covered in Vapnik's two books. Just to set expectations right, these aren't what I would call hacker-friendly books; they need a fair bit of maths. I think another good one is the book by Devroye, Gyorfi and Lugosi; you will find more examples there. You could also google "statistical learning theory", which would give you lots of relevant hits, in particular course notes.

DGL is great, but also not "hacker friendly"

These are great topics that are often neglected by machine learning textbooks. Some of the reason has to do with machine learning textbook writers not really doing research in Reinforcement Learning or Time Series. For things like Online Learning the author cites a great book but nothing for a more mainstream audience has been written yet.

Much of this stuff is being actively worked on though. If I could give one practical tip: read KDD conference papers. Those are very applied and usually very accessible demonstrations of what techniques are out there, what problems they are typically applied to and, importantly, how well they worked.

Excellent post.

Really good summary. I find it interesting that Reinforcement Learning, Online Learning, and Time Series modeling are all in the neglected category. They are all methods that seem to fit very well in both autonomous robotics and finance. I would venture to guess that they aren't really neglected, just hidden behind the firewall of Intellectual Property.

I agree. I wouldn't consider online learning neglected. The article mentions Vowpal Wabbit, which is being used at Microsoft.

> "online learning" ... and the subject is unfortunately un-googleable

That's not a surprise; "online learning" is a pretty misleading name. Here are some better alternatives:

  * incremental learning / training
  * stream-based learning / training
So, let's use these terms instead -- then it will be easier to find in your favorite search engine.

Here is one example using the term "stream-based learning":


Just qualify it:

  "online learning" algorithm
That gives me 10 relevant hits out of 10 on the first page.

I wouldn't say it's "misleading"; it is a correct usage of the word online (especially contrasted with offline learning) and is the name of the sub-area. You won't find many (most?) papers in the area with those search terms, unfortunately.

These so-called "online" techniques do not require being online (as in connected to the internet); they simply require being exposed to a stream of ongoing training data. They learn incrementally. This is why I say the term "online" is misleading.

See: http://en.wikipedia.org/wiki/Online_machine_learning

> Online machine learning is a model of induction that learns one instance at a time.
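
To make "one instance at a time" concrete, here is a minimal sketch using the classic perceptron update. The stream and data are hypothetical stand-ins; nothing here touches a network.

```python
# Online perceptron sketch: the model is updated after each single example
# and never stores or revisits earlier data. Names and data are hypothetical.

def perceptron_update(weights, bias, x, y):
    """Process one (x, y) pair with y in {-1, +1}; update only on a mistake."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    if y * score <= 0:
        weights = [w + y * xi for w, xi in zip(weights, x)]
        bias += y
    return weights, bias

def stream():
    """Stand-in for an ongoing data stream; yields one example at a time."""
    yield [2.0, 1.0], 1
    yield [-1.0, -2.0], -1
    yield [1.5, 2.0], 1
    yield [-2.0, -0.5], -1

weights, bias = [0.0, 0.0], 0.0
for x, y in stream():
    weights, bias = perceptron_update(weights, bias, x, y)

def classify(x):
    """Predict +1 or -1 with the current weights."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1
```

The loop body is the whole learner: memory use stays constant no matter how long the stream runs.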

Online learning existed before the internet, dammit.

This is actually a great overview/review of some of the things that I've encountered regularly at work, but never seen discussed seriously in a book or class.

Hierarchical Temporal Memory based algorithms such as the Cortical Learning Algorithm (CLA) implemented by NuPIC [1] should also be on this list (unsupervised, online, realtime)

[1] http://numenta.org/

I think most people would classify those as AI rather than machine learning.

I would love to see more books/articles/blogs on unsupervised learning and ensemble techniques. E.g., can I use k-means clustering as input to train a naïve Bayes classifier?

I'm actually doing some research on that example now. The problem I've run into is that you need to use parameters that are relevant to making classifications in the clustering algorithm. This is unfortunately kind of a chicken/egg problem, because removing parameters from the clustering algorithm changes the clusters.

Interesting...I should write a blog post about some techniques I've used that are similar to your example (modulo specific algorithms). Is there something specific you are trying to accomplish?

I would read it! I work with a specific kind of high(er) dimensional medical imaging data, and I think unsupervised learning could be used for classification and foreground separation. K-means is giving me some promising preliminary results, but I'd like to assign samples continuous probabilities rather than binary classifications. I'm relatively new to ML but trying to incorporate it into my research, so I apologize if any of that doesn't make sense!
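
One possible way to get continuous memberships out of k-means centroids, as a sketch only: take a softmax over negative squared distances. The `beta` stiffness parameter and the example centroids are assumptions made for illustration. These weights are not calibrated probabilities; a Gaussian mixture model is the more standard route.

```python
import math

def soft_assign(point, centroids, beta=1.0):
    """Return membership weights over the centroids; the weights sum to 1."""
    sq_dists = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    # Subtract the smallest distance before exponentiating, for numerical stability.
    d_min = min(sq_dists)
    scores = [math.exp(-beta * (d - d_min)) for d in sq_dists]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical centroids, as if taken from an earlier k-means run.
centroids = [[0.0, 0.0], [4.0, 0.0]]

near_first = soft_assign([0.5, 0.0], centroids)  # almost all weight on centroid 0
midpoint = soft_assign([2.0, 0.0], centroids)    # equidistant, so an even split
```

Larger `beta` pushes the weights toward hard k-means assignments; smaller `beta` makes them more uniform.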

K-means clustering has been successfully used to extract features in a "deep learning" style architecture (with good results at image recognition). You'll probably find this useful: http://web.stanford.edu/~acoates/papers/coatesng_nntot2012.p...

A lot of these ideas, with the possible exception of conformal prediction, have quite a lot of academic literature. They are hardly neglected.

I've scanned a lot of literature on various AI fields over the last few years (and jeesh, we should say AI if we're talking about getting superb, actually-intelligent algorithms as opposed to the work-a-day, reliable algorithms that "machine learning" arguably already has).

I would contend that there can be a "significant"-seeming amount of literature on field X, but field X may still wind up not pursued in the larger scheme of things.

Often what happens is that a single individual or small circle gets interested in a given field and researches it, among other things, for as long as the funding persists, and then once the funding dries up they move on. Or one person has tenure and keeps researching, but everyone else moves on because it doesn't look like a way to keep getting funding.

Even more, as the author mentions, a big question is what approaches are taught as the way to do it (and I guess it again comes down to whether you're aiming for just machine-learning/a-better-heuristic-statistics-for-big-data or if you are aiming for moving towards intelligent algorithms, even if intelligence means just flexible adaptivity).

Yes, you can find lots of results if you search for "online learning", say. Otoh, for whatever given algorithm that has mindshare currently, is there a quest to find an online version? My sampling of the literature says no and I happen to agree with the author that online processing could be an important piece of artificial intelligence advances.

Was anyone else taken with the images interspersed throughout that blog post? Does anyone know where they are from?

Edit: found them thanks to Google image search: http://www.darkroastedblend.com/2014/01/machines-alive-whims...

I saw a nice attribution at the end of the post. (But possibly it was added belatedly, e.g. after seeing your comment.)

> Images by one of my heroes, the Ukrainian-American artist Boris Artzybasheff. You can find more of it here.

Agreed, very nice attribution. It wasn't added in response, I noticed it 8 hours ago (last night for me).

The invisible hands re-ranking HN posts continue to astound me with their reach.

We're experimenting. Expect to see more of this.


Interesting, I'm glad to see more thought applied to rankings. What kinds of posts are you hoping to see more often?

Substantive posts that would otherwise have fallen through the cracks.

Yup, this is not so revealing
