What has been happening this past decade is a convergence of stats & ML, primarily driven by data scientists working in the domain of big data. The stats folks are slowly incorporating ML techniques into stats & finding rigorous heuristics for when they should be employed. Similarly, the ML guys, who are mostly CS folk who have unfortunately taken only one undergraduate course on stats & probability, are discovering you can do so much more without resorting to needless large-scale computation, by sampling intelligently & leveraging plain old statistics.
This schism between stats & ML can be leveraged very profitably during interviews :))
When I interview data science folks, I usually ask very simple questions from plain stats - how would you tell if a distribution is skewed... if you have an rv X with mean mu, and say rv Y = X - mu, then what is the mean of Y... if you have an rv A with mean 0 and variance 1, then what are the chances of being 3 standard deviations away from the mean if you have no clue about the distribution of A? What if you knew A was unimodal? What if A is now normally distributed?
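For anyone playing along at home, here's a rough sketch of the answers to the three-sigma question, using Chebyshev for the no-assumptions case and Vysochanskij-Petunin for the unimodal case (scipy assumed, and the variable names are mine):

    # Quick sketch, not a library. (And Y = X - mu has mean 0, of course.)
    from scipy.stats import norm

    k = 3  # number of standard deviations

    chebyshev = 1 / k**2                    # no assumptions: P(|A| >= 3) <= 1/9  (~11.1%)
    vysochanskij_petunin = 4 / (9 * k**2)   # unimodal: <= 4/81 (~4.9%), valid for k > sqrt(8/3)
    normal_tail = 2 * norm.sf(k)            # normally distributed: ~0.27%

    print(chebyshev, vysochanskij_petunin, normal_tail)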
Now if it's a stats guy, I ask very simple ML... what is a perceptron, have you heard of a neural network, etc.
Surprisingly, the stats guys do much better on ML than the ML guys do on stats!
I say this as an MLer still struggling to find out what R^2 is, among other things ..
Here I just coded up a 10-liner for you:
Covariance induces an inner product on the set of zero-mean random variables. The regression coefficient is precisely the projection coefficient <x,y>/<y,y>, and R^2 is precisely the Cauchy-Schwarz ratio <x,y>^2 / (<x,x><y,y>) (i.e. the product of the two projection coefficients between x and y).
It is a theoretically natural measure of linear quality-of-fit. It has the added bonus of being equal to the ratio of modeled variance to total variance (variance being the squared norm of a random variable in the norm induced by the covariance inner product).
It's also very very cheap to compute. Though there are more practically useful measures of "predictive power", like mutual information, R^2 does an admirable job for an O(1)-space and O(num data)-time predictiveness metric.
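To make the "10-liner" concrete, here's roughly what it could look like in numpy (the function name and toy data are mine, just for a sanity check):

    import numpy as np

    def r_squared(x, y):
        # R^2 as the Cauchy-Schwarz ratio <x,y>^2 / (<x,x><y,y>) on centered data.
        x = np.asarray(x, dtype=float) - np.mean(x)
        y = np.asarray(y, dtype=float) - np.mean(y)
        return np.dot(x, y) ** 2 / (np.dot(x, x) * np.dot(y, y))

    x = np.random.randn(10000)
    y = 2 * x + np.random.randn(10000)   # signal variance 4, noise variance 1
    print(r_squared(x, y))               # should be close to 4 / (4 + 1) = 0.8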
I suspect that I learned more about R^2 by reading these comments in the order presented (informal, then formal) than I would have had they been reversed.
It was great how he spent a lot of time on logistic regression before delving into SVMs or neural nets - it was much easier to understand the cost functions & regularization for other types of classifiers after having understood those for logistic regression.
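For anyone who hasn't sat through those lectures, the regularized cost he builds up is roughly the one below; a minimal numpy sketch, assuming X already includes a leading column of ones for the intercept:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_cost(theta, X, y, lam):
        # Cross-entropy loss plus an L2 penalty; the bias term theta[0] is left unpenalized.
        m = len(y)
        h = sigmoid(X @ theta)
        cross_entropy = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
        penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
        return cross_entropy + penalty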
My takeaway: if you can avoid the extra risk that more complicated models add to your systems, you should.
"Another extension to the naive Bayes model was developed entirely independently of it. This is the logistic regression model."
To oversimplify, each of the many, many decision trees is going to do a poor job building a classifier. Each tree only uses a subset of the data and a subset of the features. So, the results of each tree won't generalize well (and may not even do well classifying the data it was trained on). But, for each tree that misclassifies in one direction, another tree will misclassify in the other direction. The theory (which appears to be validated in practice) is that poor decisions cancel each other out but good decisions don't get canceled out.
Although it isn't quite as simple as a "good" tree and a "bad" tree. All decision trees are bad, but there should randomly be good parts in the bad trees. Each tree is going to be "mostly crap" (as Jeremy Howard puts it). But "mostly crap" means "a little good." The crap in one tree cancels out the crap in another tree. The good parts are left behind... and... like magic... you have a good classifier that is fast, although somewhat of a black box.
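If you have sklearn handy, you can see the effect for yourself by pitting one deep tree against a bag of them on the same arbitrary synthetic data; this is a sketch, not a benchmark:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)

    single_tree = DecisionTreeClassifier(random_state=0)
    forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

    # Each individual tree is "mostly crap"; averaging 200 of them usually scores noticeably higher.
    print("single tree:", cross_val_score(single_tree, X, y, cv=5).mean())
    print("forest:     ", cross_val_score(forest, X, y, cv=5).mean())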
Jeremy Howard from Kaggle is a big believer, and his material is where most of my knowledge of decision tree ensemble algorithms comes from: https://www.kaggle.com/wiki/RandomForests
You can, however, calculate importance scores for the features used. Breiman's original paper gives a good algorithm for doing this (in short: for each tree, permute the data along some feature for an out-of-bag sample and see how much worse it does).
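Here's a simplified sketch of that idea; it permutes each feature for the whole forest on a held-out set rather than per tree on out-of-bag samples as in the paper, so treat it as an approximation (sklearn and numpy assumed):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    baseline = rf.score(X_val, y_val)

    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        X_perm[:, j] = np.random.permutation(X_perm[:, j])   # break this feature's link to the labels
        print("feature", j, "importance ~", baseline - rf.score(X_perm, y_val))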
But then, some models are pretty much impossible to understand or interpret.
No advantage over taking a decent, statistically valid subset and running C4.5 (or a variant with floating-point splits), apart from the fact that it's less work.
To the extent that the errors of the individual trees are uncorrelated, the random forest keeps the same low expected error of a single tree while reducing the high error variance of a tree. This is the same reason a mutual fund is better than a single stock -- on average they have the same return, but the latter has far more variance.
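The mutual-fund effect is easy to simulate, assuming the per-tree errors really are uncorrelated (which real trees only approximate); a toy numpy sketch:

    import numpy as np

    n_trees, n_trials = 100, 10000
    errors = np.random.randn(n_trials, n_trees)   # stand-in for per-tree errors: mean 0, variance 1
    print(errors[:, 0].var())                     # one tree: variance ~ 1
    print(errors.mean(axis=1).var())              # average of 100 trees: variance ~ 1/100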
Another thing I sometimes do is run a few iterations of the algorithm myself on paper.
It's from CMU's course titled "Concepts of Math". The professor, Brendan Sullivan, just wrote the textbook in an easy-to-understand style. Here's a link to the course: https://colormygraph.ddt.cs.cmu.edu/21127-f12/ and if you click through to the AnnotateMyPDF website, you can get a copy of the textbook, practice problems, etc. for a more in-depth understanding.
You're kind of looking for the most similar records to match each record to. Am I understanding that right?
In the K-means case, it is good to know beforehand how many groups (clusters) you are expecting. Otherwise, there are other clustering algorithms (like gdbscan) which detect the number of clusters automatically (although you have to provide the average expected density). But once again, it all depends on the question you want to answer, or whether you are just cleaning the data a bit.
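As a toy contrast (sklearn assumed; I'm using plain DBSCAN here since that's what's readily available): KMeans needs k up front, while DBSCAN needs a density scale (eps and min_samples) instead:

    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

    kmeans_labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
    dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

    print(len(set(kmeans_labels)))             # exactly the 3 clusters we asked for
    print(len(set(dbscan_labels) - {-1}))      # clusters DBSCAN found on its own (-1 marks noise)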
Know your data, know your algorithms and always question the results.
That isn't to say the algorithm cannot be used. It just isn't bulletproof.