Machine Learning 101: An Intro to Utilizing Decision Trees (talend.com)
192 points by markgainor1 on Sept 30, 2016 | hide | past | favorite | 32 comments

Here's the truth.

If you want to work in ML, and are all excited about the (justified) deep learning hype, you are much better off learning decision trees, random forest and gradient boosting first, and then learning neural networks.

Most of the time, for most non-binary data XGB/VW will outperform a NN, be easier to use and more interpretable.

Plus you'll learn all the shitty programming stuff you need to do to get your data in a reasonable shape for a NN anyway.

The upside of teaching decision trees first is that it allows me to spot the people that dropped out of the online course after week one right away ;)

Seriously, I see an amazing number of people who deploy basic vanilla decision trees (no boosting/bagging, no particular thought about features, etc.) without any real thought or reason. It has become one of my main "ML smells".

In my experience, 95% of DT optimization comes from pruning, not boosting/bagging.

How does someone use a DT without some feature engineering?

I mean - I like people who say they'd try a random forest first for a given problem because it's almost always a good answer.

But yes - feature engineering is almost always the difference then.

In the end, the most important step is good feature engineering (including dimensionality reduction, regularisation, and stuff like that).

Can you tell me more about what you mean by "features"? I taught myself part of how computer vision works, and now I'm wondering if "feature points" in CV map to... something, elsewhere, like NP problems map to NP problems.

Or do you mean features like a website?

Feature is an individual measurable property. Features are the input variables you give to your model.

Feature engineering is the work process where you massage, select, transform, normalize, regularize, standardize, adjust, and whiten your input into a format that is easy to digest in the next phase.

To confuse things even more, another name for feature learning is representation learning. Deep learning does just this: one layer learns new features that are inputs for the next layer.

Real world example I'm working with:

I have raw EEG data from multiple channels and I want to use machine learning to classify it. The raw input is rows of integer values, one reading per channel, at 300 readings per second.

Consider all these issues: 1) every person and every electrode has a different level of impedance and noise every time they are attached, 2) during a measurement the impedance and the noise level change, 3) there are several periods where the person moves, messing with the measurements. These segments should be removed automatically.

The pipeline from raw data to the beginning of the deep learning pipeline may be as follows: linear and constant detrending, normalize to standard deviation, remove 50 Hz noise, whiten, STFT into 0.5 second frequency segments, (remove epochs where the patient moves), run through a carefully adjusted logit function.
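That pipeline (minus the whitening and logit steps) can be sketched with plain NumPy. Everything concrete here — the channel count, the notch width around 50 Hz, the synthetic input — is invented for illustration, not taken from the parent's actual setup:

```python
import numpy as np

FS = 300  # samples per second, as stated above

def preprocess(raw, fs=FS):
    """raw: (n_samples, n_channels) integer EEG readings."""
    x = raw.astype(float)
    # 1. Linear + constant detrend: subtract a least-squares line per channel.
    t = np.arange(len(x))
    for ch in range(x.shape[1]):
        slope, intercept = np.polyfit(t, x[:, ch], 1)
        x[:, ch] -= slope * t + intercept
    # 2. Normalize each channel to unit standard deviation.
    x /= x.std(axis=0)
    # 3. Crude 50 Hz mains removal: zero the bins near 50 Hz in the spectrum.
    spec = np.fft.rfft(x, axis=0)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[np.abs(freqs - 50.0) < 1.0] = 0
    x = np.fft.irfft(spec, n=len(x), axis=0)
    # 4. STFT into 0.5 s segments (150 samples at 300 Hz), no overlap:
    #    magnitude spectrum per epoch, per channel.
    seg = fs // 2
    n_epochs = len(x) // seg
    frames = x[: n_epochs * seg].reshape(n_epochs, seg, x.shape[1])
    return np.abs(np.fft.rfft(frames, axis=1))  # (epochs, freq bins, channels)

# 10 seconds of fake 4-channel integer data.
features = preprocess(np.random.randint(-500, 500, size=(3000, 4)))
print(features.shape)  # (20, 76, 4)
```

A real pipeline would use a proper notch filter and overlapping windows; this is only meant to make the steps concrete.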

So, if my understanding from this (and the other comments) is correct, then yeah; kinda.

In CV, the first-ish level of processing determines the points on the image that will be used as "individual measurable properties" (or, maybe more accurately, the sources for those). These are the features, and then they go into the rest of the system for recognition, etc.

The basic theme is stripping something analogous to "noise" and "bias" - noise being the stuff you don't want to pay attention to, and "bias" being the predictable/regular offset per thing - because you want to recognize that two things are actually the same thing, despite differences in noise and bias.

(An example of "bias" might be the color of the lights where the picture was taken).

Then the idea is pretty much that, if the same thing is seen in two different places, after stripping the noise and the bias from it, they "look" identical (diff is within an epsilon).

That sound about right?


Manually correcting the bias by engineering a feature with it stripped out is a good example.

Features are variables which are inputs into your model.

Imagine you are predicting a basketball game. Features would be things like:

Season win/loss ratio of each team

Number of wins in a row

Proportion of points by best player on team

etc etc.
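For raw data that's just a list of game results, the first two of those features are a few lines of code. The input format here is made up for illustration:

```python
def team_features(results):
    """results: list of 'W'/'L' outcomes for one team's season, oldest first."""
    win_ratio = results.count('W') / len(results)
    # Current winning streak: count trailing 'W's.
    streak = 0
    for r in reversed(results):
        if r != 'W':
            break
        streak += 1
    return {'win_ratio': win_ratio, 'win_streak': streak}

print(team_features(['W', 'L', 'W', 'W', 'W']))
# {'win_ratio': 0.8, 'win_streak': 3}
```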

These are pretty easy, but then sometimes you need to do work to generate the features - this is called feature engineering.

Some features which might need more "feature engineering":

Average sentiment of social network posting by team members

1-on-1 effectiveness of defensive players vs opposition offense (something like this might require video analysis and sometimes manual annotation)

In the computer vision world, things like edges and points are often used as features.

Deep learning is nice because it automatically learns these features given enough time. Otherwise people spend a lot of time building algorithms like SIFT etc.

Good examples. Let's do another one, more text-oriented.

Say you want to classify whether text is spam or not. Features could be each word. Then you can build additional features on top of those words, like:

- is the header all caps?

- does the text contain some particular keyword (viagra, enlargement...)?

- are there typos?

- links to weird domains?

and much more. You can create tons of those features. Some will be helpful, some neutral, and some will decrease the quality of your model.

The core of machine learning is creating those features, sending them to the algo, and measuring the impact. Choosing the algo is actually the smallest portion of your job (on most projects).
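A minimal sketch of hand-built features like the ones above — the keyword list and message format are made up for illustration:

```python
import re

SPAM_KEYWORDS = {'viagra', 'enlargement', 'winner'}

def spam_features(subject, body):
    words = re.findall(r"[a-z']+", body.lower())
    domains = re.findall(r'https?://([\w.-]+)', body)
    return {
        'header_all_caps': subject.isupper(),
        'has_spam_keyword': any(w in SPAM_KEYWORDS for w in words),
        'num_links': len(domains),
    }

print(spam_features('FREE OFFER', 'Cheap viagra here http://example.xyz/buy'))
# {'header_all_caps': True, 'has_spam_keyword': True, 'num_links': 1}
```

Each returned key is one feature; a model would consume these as columns of a feature matrix.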

N-grams. Text processing almost always benefits from n-gram features.
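For anyone unfamiliar, a word n-gram is just every run of n consecutive tokens — here a generic sketch for bigrams:

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams('buy cheap pills now'.split(), 2))
# [('buy', 'cheap'), ('cheap', 'pills'), ('pills', 'now')]
```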

Features is another term for inputs, variables, etc. It has a connotation of being recognisable as a property of the thing being modeled, that you might have cleaned up from the raw data.

Functionally speaking, the "features" are a transformation of the image pixel values,

    F: R^{n x n} => R^d
So `F(image)` is a vector of length `d`, containing `d` derived features from the image. Note that `F` could be an identity transformation of sorts, which simply flattens the image. In this case, the "features" are simply the image pixel values. However, in general, the feature vector attempts to use a priori information about the problem domain to make things easier down the line.

Now, for binary classification, the classifier is a map,

    C: R^d => {0,1}
To obtain a "good" classifier, one optimizes the parameters of the classifier so that

    C( F(image) ) = label
fairly well, by using a database of `( F(image), label )` pairs.

You could imagine a database of 100px by 100px images of circles and squares. In this case, you might have a hard time obtaining an accurate classifier if the "features" are simply the image pixel values. On the other hand, a small feature vector indicating the presence of corners and curvature, computed through some mathematical transformations, would likely perform well.
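A toy version of that setup, with the fill ratio of the shape's bounding box standing in for a corner/curvature feature (a filled circle covers ~pi/4 of its box, a filled square covers all of it) — the shapes, threshold, and helper names are all invented:

```python
import numpy as np

def make_circle(n=100, r=30):
    y, x = np.ogrid[:n, :n]
    return ((x - n // 2) ** 2 + (y - n // 2) ** 2 <= r * r).astype(float)

def make_square(n=100, r=30):
    img = np.zeros((n, n))
    img[n // 2 - r: n // 2 + r, n // 2 - r: n // 2 + r] = 1.0
    return img

def F(image):
    """Feature map R^{n x n} -> R^1: filled area / bounding-box area."""
    ys, xs = np.nonzero(image)
    box = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
    return np.array([image.sum() / box])

def C(features):
    """Classifier R^1 -> {0, 1}: 1 = square, 0 = circle."""
    return int(features[0] > 0.9)

print(C(F(make_circle())), C(F(make_square())))  # 0 1
```

With raw pixels as features, `C` would need 10,000 inputs; the one-number feature makes it a threshold.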

That's a very image-centric explanation, and I'm not at all sure it makes things any easier.

Conceptually, in the image case, features are "things" in images that ML tools use to perform tasks.

An image with lots of blue in it has a good chance of being sky.

An image with lots of hard edges might be something human made - a house, a book etc.

Pixels are really a distraction - there are alternative representations of images which don't use pixels at all. Think wavelet based compressed sensing techniques.

What resources would you recommend? And in what order of consumption?

It really depends on how you learn.

Traditionally the best answer is to do Andrew Ng's Machine Learning course[1]. It's a great course, and you won't regret doing it, but it is kind of annoying that it is in a language (Matlab/Octave) you'll (hopefully) never use again.

A lot of people now recommend working through CS229[2]. I haven't looked at it in depth, but I've been impressed with a lot of the class projects.

If you like books, then An Introduction to Statistical Learning (with applications in R)[3] is generally well regarded.

If you like doing stuff, then Kaggle and SciKit-Learn will throw you in the deep-end. Just be aware you can't just program, though - you really do need to understand some theory. It's good to run into a problem, and then really, really understand the reasons behind what you are seeing.

[1] https://www.coursera.org/learn/machine-learning

[2] http://cs229.stanford.edu/

[3] http://www-bcf.usc.edu/~gareth/ISL/

If you want to do Andrew Ng's ML course, but want to do it in python: https://github.com/icrtiou/coursera-ML

Thank you.

I agree with trying out xgboost and vw first, they are quite efficient for getting a baseline, maybe even for the final model. But it only takes a little more to whip up Keras and try a MLP, CNN or LSTM. It's quite accessible.

No argument.

But for most of the datasets that most people see you won't actually do any better - you just don't have enough data for a deep neural network to work, and you probably have to do the feature engineering work anyway.

When I finally learned about tree based methods, I was quite surprised this isn't covered in Andrew's Coursera course.

It's actually a lot easier than stuff like Support Vector Machines.

Honest question - I have several machine learning with Python books that I'm about to dive into but before doing so, which resources would you recommend a complete beginner in ML to read?

I gave another answer further down, but specifically:

First, learn and understand the central limit theorem. That will force you to understand enough statistics to not be dangerous.
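A quick way to see the central limit theorem in action is to simulate it: means of samples from a decidedly non-Gaussian distribution (uniform here) cluster around the true mean with standard deviation sigma/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 experiments, each averaging 50 uniform draws on [0, 1].
sample_means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)

# Theory: mean -> 1/2, std -> sqrt(1/12) / sqrt(50), about 0.041.
print(sample_means.mean(), sample_means.std())
```

Plotting a histogram of `sample_means` shows the familiar bell curve emerging.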

Then I'd work through https://www.kaggle.com/omarelgabry/titanic/a-journey-through... until I can do the whole thing myself.

Then I'd do a Kaggle. I like Kaggle because the datasets are well prepared, and the problems are well stated. There are plenty of other similar datasets etc if you don't want to do that.

I don't know any specific Python books though.

[1] https://www.khanacademy.org/math/statistics-probability/samp... maybe? I haven't watched these, but generally the Khan Academy stuff is a good place to start.

Blog post from a while ago, with nice visualizations, but heavy browser computation:


In theory, aren't decision trees universal approximators like Neural Networks?

Yes. And in theory, k-nearest-neighbors is also a universal approximator. (Being a universal approximator doesn't tell you much since so many ML algos are).
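For intuition, nearest-neighbor regression fits in a few lines of NumPy, and on a noise-free toy function its predictions get arbitrarily close to the target as the training set grows — which is the universal-approximation property in its simplest form:

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=1):
    """k-nearest-neighbor regression in one dimension."""
    dists = np.abs(x_train[None, :] - x_query[:, None])
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Approximate f(x) = x^2 from 1000 noise-free samples.
x_train = np.linspace(0, 1, 1000)
y_train = x_train ** 2
pred = knn_predict(x_train, y_train, np.array([0.5]))
print(pred)  # close to 0.25
```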

This [D]NN "universal approximator" claim that gets trotted around really gets my goat. Sorry, I have to vent now, and it's no fault of my parent post or grandparent post.

It's not that we were short of universal approximators. As far as learning is concerned, whether the class is universal or not has little or no import. For the sake of argument let's even assume that there is no noise in the training set [in other words, I have a white horse with wings and a horn on its forehead that shoots laser beams from its eyes and farts indigo rainbows]. There will be infinitely many functions that pass through your training points, and not all of them will have the same error on an unseen point.

Learning is a very different ball game compared to approximation.

"Universal", but in what class of functions? All differentiable functions (the typical NN is restricted to this class)? All continuous functions? All measurable functions? All computable functions? It's surely not _all_ functions.

Here is a function: +1 on rationals and -1 on irrationals; can it approximate that? Here's another that is harder: +1 on a non-measurable set and -1 on its complement. You may say these functions don't matter, and indeed, I would be scared if they did. But it shows the real deal is characterizing the class of functions that do "matter".

On MNIST KNN will get you 95% but a deep net will get 98+. I find that interesting.

On MNIST KNN with clever feature engineering will get you 99.4%, and a crazy deep net ensemble will get you 99.8%. I'm not sure what your point is.

MNIST is the todo list of machine learning - it's a necessary but not sufficient condition for knowing which algos are good. In other words it's only useful for finding out which algos aren't good (e.g. if your code only gets 90% on MNIST you know there's a bug, but if your code gets 99% on MNIST it doesn't really mean much).

Yes, but there's no free lunch[1].

In theory, on average over all possible problems all algorithms are the same.

In practice, that's a pretty useless piece of information, since the only thing that matters is your problem.

Tabular data, and not huge amounts? Tree-based systems are your friend. Vast amounts of binary data (or increasingly text)? Deep neural networks will get you the best result.

[1] https://en.wikipedia.org/wiki/No_free_lunch_theorem

Random forests don't get to be as 'deep' as NNs.
