Random Forest vs. Neural Network

eden_h · on June 3, 2019

This is a very thinly disguised advert for the author's product, and doesn't really advance on the benefits of either approach, as it doesn't go into any depth on why Random Forests/NNs are applicable to each type of data provided.

They're both generalised solvers, but default Random Forests aren't the most common Forest these days - LightGBM/XGBoost are both using Gradient Boosted Forests by default, which would be a much more interesting comparison to a NN's Gradient Boosting.

turingbike · on June 3, 2019

I don't know why it isn't as popular, but CatBoost should be on the list too https://catboost.ai/

eden_h · on June 3, 2019

I tried Catboost when it came out. It should be very popular, as working with categories is where a lot of people seem to fall down in Random Forests.

The 'typical' response is either to make them into numeric variable, so 1-3 for 3 categories, or to make an individual column for each one. The first approach makes sense for ordinals, but not so much for actual categories, and the latter makes it difficult to group categories when a group of two categories together has more predictive capability than any single group. I know that LightGBM did a lot of work in this to optimise testing groups of variables, as testing every possible group in a large set is very intensive.

When I tried Catboost in R, I remember it downloading a large binary to work with, which put me off considerably, and predicting with it was pretty fragile, even for R. I trust Yandex about as much as I'd trust Google, but it seemed 'odd'.

autokad · on June 3, 2019

in kaggle, I often turn categorical into numeric and call it a day (even if its not ordinal). I have even found that forcing ordinality (like software versions in the microsoft malware competition) usually makes things worse in hold out.

spending too much time on categoricals is a waste of time, there are other things you can improve in your limited time, and even 'doing the right thing' results in poorer performance in hold out.

catboost is great, it ensembles wonderfully with xgboost. if you find it being fragile, you probably have outliers that need droped - tree algs are really just fancy nearest neighbors so an outlier can ruin predictions considerably.

In general, lgbm trains fast and lets you try many things quickly, but almost always under performs catboost and xgboost. catboost performs really well out of the box and you can generally get results quicker than xgboost, but a well tuned xgboost is usually the best. since xgboost and catboost build trees differently and both perform really well, they make great friends in ensembles.

I have done pretty well on kaggle though I haven't invested much time, top 100 in zillow home price prediction

ScoutOrgo · on June 3, 2019

I think it is actually preferable to start by converting categorical variables to numeric most of the time, even if they are not ordinal. The RF algo can separate off individual classes with 2 splits (e.g. <=7 then >=7) if a single class is very important. The "pool" of features for RF sampling also doesn't get diluted with one hot encoded classes from the one feature.

I am pretty sure I've seen this done successfully in kaggle a bunch before, but don't have any sources on hand for evidence that this method is "better". It does however make it much easier to just throw the data into the RF and check the feature importances to see which features are helping the most.

eden_h · on June 3, 2019

The only case it struggles with is when the grouping is difficult to achieve in a small amount of splits, such as 1,3,5 against 2,4,6,7, especially when each split will need to show more predictive capability against any of the other column options.

pplonski86 · on June 3, 2019

You are right, CatBoost is an amazing algorithm. However, you will be shocked when you will talk with many older senior data scientist, that never heard of Xgboost or LightGBM. CatBoost for them is far too new.

ramraj07 · on June 3, 2019

Thank God I came across it in kaggle. Definitely amazing and it also gives easy python export of the model so you can put it as an udf in databases!

vladf · on June 8, 2019

Interesting -- could you elaborate on your use case here?

misterman0 · on June 3, 2019

>> as an udf

What exactly is that?

kurthr · on June 3, 2019

Just a guess on my part, but UserDefinedFunction?

sheeshkebab · on June 3, 2019

User defined field perhaps?

cheez · on June 3, 2019

I looked into the site but I didn't see how CatBoost actually accomplishes its goals compared to other gradient boosting algorithms. Is there a summary somewhere?

sanxiyn · on June 3, 2019

Read "CatBoost: unbiased boosting with categorical features". https://arxiv.org/abs/1706.09516

minimaxir · on June 3, 2019

With the current low cost of cloud computing, there's no reason not to just try everything and see what happens (which is why AutoML has become more popular).

It's more pragmatic than trying to rationalize which framework is "best" for a given dataset, as the results are often counterintuitive.

oneshot908 · on June 3, 2019

On the contrary, I think this is one of the biggest emerging blockades to progress in ML/AI research, especially in academia. It has always been more cost-effective to run ML algorithms on consumer HW such as GeForce GPUs and gaming CPUs. It's frequently even faster than contemporary cloud offerings when the consumer HW gets ahead of existing enterprise HW. And it's so effective that HW companies starting changing their EULAs and crippling previously available aspects of APIs to herd AI back into the datacenter where they seem to think it belongs.

And that IMO is a reinvention of the "Walled Garden" of academic HPC (ask any grad student begging and pleading for supercomputer time) which has always sucked and its new commercial incarnation is even worse because it's unclear how to get commercial cloud time on government grants.

OTOH it's fine for large shops like OpenAI, DeepMind, AWS AI, FAIR, MS Research etc because they have deep deep pockets. So if you're content with most future groundbreaking research coming from a small tribe of market leaders, well great, but I suspect innovation is already slowing down because of this.

dlphn___xyz · on June 3, 2019

this approach is also very inefficient- autoML takes hours searching for parameters when you could build the same model manually in a fraction of the time.

minimaxir · on June 3, 2019

There's no reason you can't do both: prototype a simple model to get a baseline performance, then use AutoML to fine-tune it, plus you now have a value to sanity check against.

Those hours of hyperparameter search aren't blocking. You can do other things while it's searching, or do the search when not actively using the resources (e.g. overnight).

pplonski86 · on June 3, 2019

It depends on what algorithms you use in AutoML. If you decide to use simple algorithms: logistic regression, decision tree, random forest then you will have a simple model very quickly. Using Neural Networks in AutoML requires much more computational resources.

riku_iki · on June 3, 2019

if you know optimal hyperparameters from the beginning, then yes, you can build such model manually. But in most cases this is not the case.

pplonski86 · on June 3, 2019

I don't like such brute-force approach. Even if you have low cost of computing the number of possible combinations of hyperparameters is huge! Google Cloud AutoML Tables solution cost is 20 USD per 1 hour of computing (I guess that's because of inefficient Neural Architecture Search algorithm). Running few ML experiment can easily end with huge bill.

minimaxir · on June 3, 2019

For independent ML hobbyists, sure. But if you're a business which can get a positive ROI from it, it's well worth the time.

mochomocha · on June 3, 2019

It depends. Complexity creeps up easily in ML systems. A kitchen-sink of gazillions of algorithms (and code bases) in an ensemble creates a very brittle system I wouldn't want to deal with.

windsignaling · on June 3, 2019

Not to mention the article is full of grammar and spelling errors. Not a great look if you are trying to promote your product.

turingbike · on June 3, 2019

The article recommends RF for tabular data because it is easier. In general I agree, but newer tools are making NN for tabular data as easy as can be... see, for example, fastai https://docs.fast.ai/tabular.html

ramraj07 · on June 3, 2019

The biggest reason to use RFs is that with sufficient trees it's basically impossible to overfit your data. You also don't need to spend days optimizing your hyperparameters. Hence, if you need a quick model where time is tantamount, and you want to err in the side of caution, I feel like an RF is the best choice.

claytonjy · on June 3, 2019

It's quite possible to overfit with an RF; if you grow deep trees on a self-similar feature set, 1000 trees can overfit just as much as 10.

ramraj07 · on June 3, 2019

Never heard anyone around me talk about self-similarity in feature sets for ml models, could you explain some more?

SubiculumCode · on June 3, 2019

I think the poster might be indicating the case where training data are higly redundant...

claytonjy · on June 3, 2019

right, it could be simple multicollinearity, or more complex relationships. Because RF is such a good first-try model, I often want to use it on feature sets I haven't carefully pruned, which can be dangerous if you're measuring the same underlying thing in multiple ways.

SubiculumCode · on June 4, 2019

Since it seems you know a bit of data science, may I ask you a quick question?

In my line of research I am frequently trying to use high dimensional data, but with few examples (<100 per class). Thus methods like SVM are used. I've been thinking about how I might leverage my sample to artificially simulate new training examples via pairwise warping of images within each class, with the assumption that informative features will be preserved with warping.The training examples within class are already quite variable, so I don't think a little increase in redundancy will hurt me much..but I am not sure.

Without knowing more concretely, do you have thoughts on such a strategy?

Data are 3D brain images and classes are disorder groups.

claytonjy · on June 4, 2019

this can be tricky, because it varies so much by domain. I imagine you have a good handle on the domain, so you can hopefully do a good job defining reasonable noise on each measure.

You can also try more generic upsampling techniques, like SMOTE, which should be easy from python or R. It's never actually helped me, but I assume it's useful somewhere.

I suspect at some point you're going to need to take an axe to some of your inputs, preferably based on human priors rather than a sketchy feature-selection process.

SVM's are great, but once you get past linear boundaries there's enough tuning complexity that I'd rather use that effort tuning a GBM. That's largely because of tooling though; I know there are modern SVM libs, but I haven't used them. Definitely try a random forest if you haven't!

SubiculumCode · on June 4, 2019

Thanks for your input.

pplonski86 · on June 3, 2019

You are right! The more trees in the RF the more stable the model is. What is more, you got feature importance for free!

claytonjy · on June 3, 2019

careful, the "free" feature importance (e.g. decreasing Gini) has a lot of known issues: https://explained.ai/rf-importance/index.html

yters · on June 3, 2019

I would have assumed the opposite, with enough trees you are guaranteed to overfit your data? Boosting increases the VC dimension of the aggregate model, which makes it more prone to overfitting.

pplonski86 · on June 3, 2019

Random Forest doesn't increase overfit error when adding more trees. I did the experiment on toy dataset to check it https://mljar.com/blog/random-forest-overfitting/

What is more, Leo Breimain wrote on his website: "Random forests does not overfit. You can run as many trees as you want" https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home...

rq1 · on June 3, 2019

These are (false) claims, therefor not proofs.

Deep trees will fortunately overfit your dataset.

Any binary tree of depth log2(P) can completely separate your P points.

ramraj07 · on June 3, 2019

But we are merging hundreds of trees each of which has been handicapped by removal of multiple features and a fraction of the data. Sounds to me like overfitting is not easy (no single data point or feature contributes to every tree so it can't be represented all the time).

False claims as they maybe, these are claims I've seen in at least two of the most commonly studied statistical learning text books, so given that it makes sense and that it's in the text books, it seems reasonably not false to me. Someone else posted that if too many features or data points are very similar then it will overfit, and that totally makes sense. Whatever you say doesnt. Clarification would be useful.

yters · on June 3, 2019

Adding bunches of trees will overfit the accidental patterns in your data.

I have an explanation here why reducing variance is not the same as reducing overfitting: https://news.ycombinator.com/item?id=20089890

thom · on June 3, 2019

Yeah, at a certain point it's just playing 20 questions with your data.

olooney · on June 3, 2019

Random Forest doesn't use boosting. Tree models like XGBoost which do use boosting are indeed very prone to overfitting.

gbrown · on June 3, 2019

Nope, RF works very different to boosting. RF trees give unbiased fits, but they're high variance. Bootstrapping is used to reduce the variance of the parallel tree fits.

Boosting is sequential, and relies on early stopping to control the magnitude of bias.

rq1 · on June 3, 2019

> RF trees give unbiased fits, but they're high variance.

Sounds like overfitting.

olooney · on June 3, 2019

Individual trees are high variance. The random forest itself is an ensemble of many trees - a "forest" if we've being cute. Each tree in the forest is randomized training on a bootstrap sample. This is sometimes called "bagging", a portmanteau of "bootstrap" and "aggregating." Each tree may be further randomized by selecting a different subset of dimensions to consider each time we split a node. The end result is that each tree uses very different rules to make its prediction. When all of these predictions are combined (by voting for classification, or by averaging for regression) error due to overfitting tends to cancel out, while signal due to the same true pattern being discovered independently by many trees is amplified. Thus we can continue to add trees indefinitely without worrying about over-fitting. Other hyperparameters of random forest, such as max tree depth or the minimum number of samples in a node necessary for splitting, can result in overfitting or underfitting so need to be tuned. However, because forest will eventually fit the data set even if we use so-called "stump" learners (max tree depth=1) we can choose very conservative parameters like max-depth=3 which makes trees less likely to overfit. And if they underfit, well, that's not a problem, the ensemble will take care of that. The number of trees in a forest can be cranked up as high as we want without worrying about overfitting; the only downside is that training takes longer, the model takes more space on disk and in memory, and predictions take longer to run.

rq1 · on June 3, 2019

Yes that’s exactly overfitting the bootstrapped samples, thus the high variance.

The « variance » won’t just magically vanish as you average things out[1], you need to change the scale and check out the asymptotic law of your estimator (CLT, Kolmogorov-Smirnov… etc.) and confront it to your data.

[1] the variance of the estimator itself vanishes thanks to LLN (in case of convergence), but that’s not actually the quantity of interest

Edit: don't get me wrong, I'm not saying that RFs are good or bad, just reacting to the bias/variance thing.

yters · on June 3, 2019

Based on the downvotes, it seems people think reducing variance is the same as reducing overfitting.

Think of the bias/variance tradeoff as a spotlight, and we are shining the spotlight on a bunch of cats, who reflect back the spotlight when their eyes are open. Eyes are open or closed randomly. Cat eyes are either green or brown. We want to know the distribution of cat eyes in parts of the population, which in general is an even 50/50 split. We determine the distribution in a certain location by taking the average of the eyes we see.

If variance is large, then the spotlight is very large, and we don't learn anything because we just average the entire population.

If the spotlight is small, then we can learn something, but only if there are enough samples in the region we shine the light.

So, what if we start with a large spotlight, and then when we see a region with a large number of open eyes of one color, we narrow the light down to just that region? Won't that allow us to avoid overfitting, while maximizing our ability to learn?

It unfortunately does not, because with a large enough population that is evenly distributed, there will always be pockets that exhibit what appear to be a pattern, but is just an accident of which cats happened to open their eyes.

This scenario of starting with the spotlight large and then zooming into a patterned region is the same as reducing variance with the training data. With a large enough dataset it is always possible to find these accidental patterns and then zoom into them by reducing variance.

yters · on June 5, 2019

More thoughts on this seemingly controversial claim.

https://stats.stackexchange.com/questions/20714/does-ensembl...

yters · on June 3, 2019

Ensemble models increase VC dimension so they are more prone to overfitting.

Example, you could have a tree that individually segments each data point and memorize your dataset. That's the definition of overfitting.

gbrown · on June 4, 2019

Sometimes, but even simple trees are high variance given that they're estimated using greedy algorithms rather than some more global optimization. Overfitting in RF does not occur as a function of the number of trees.

pplonski86 · on June 3, 2019

But still, NN has much more hyperparameters than RF and training good NN model requires far more experience.

ScoutOrgo · on June 3, 2019

fast.ai has some pretty good defaults. The main HP to choose in their tabular learner are the neurons in the linear layers and the number of embeddings for each category.

vbarrielle · on June 3, 2019

Random forests can be applied to images. The RF algorithm only needs to be tweaked to split its trees by comparing the difference of two pixels to a threshold.

meesterdude · on June 3, 2019

Been using the annoy library lately (https://github.com/spotify/annoy) after becoming frustrated with TF not behaving as I'd expect. There's another lib in python, nmslib, but I can't get seem to get it to work right - and the docs are crap.

Anyway, would love to pick someone's brain about this stuff to help fill in my gaps.

RocketSyntax · on June 3, 2019

No mention of backpropagation