
Machine Learning: A Summary on Feature Engineering Using R Language - unbuckledbee
https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2017/03/23/feature-engineering-using-r/
======
graycat
_Overfitting_ can, indeed, be "fitting the data too perfectly"; e.g., Lagrange
interpolation, where we have some X-Y pairs and want a polynomial in x that
fits them exactly. How does that work? You can derive it for yourself: for
each of the X-Y pairs, write a polynomial that is zero at all the X points
except the one you want to fit, scale it so it passes through that pair, and
then add all those polynomials. Done. Presto. Bingo. "Look, Ma, fits anything
perfectly!" (as long as all the X points are distinct). "I should be able to
find a polynomial that fits points from a square wave!" Yup, but graph the
polynomial and see what it does away from the given X points -- the graph
_goes wild_ , that is, shoots off toward plus or minus infinity.
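
A minimal sketch of this in R (the sample count and degree are arbitrary, not
from the article): fit an exact-degree polynomial through points sampled from
a square wave, then evaluate it away from the sample points.

    # 10 samples from a square wave
    x <- seq(0, 1, length.out = 10)
    y <- ifelse(x < 0.5, -1, 1)

    # A degree-9 polynomial through 10 distinct points fits them exactly
    fit <- lm(y ~ poly(x, 9, raw = TRUE))

    # Evaluate between and slightly beyond the sample points
    grid <- seq(-0.1, 1.1, length.out = 500)
    yhat <- predict(fit, newdata = data.frame(x = grid))

    range(yhat)  # far outside [-1, 1]: the interpolant goes wild
    # plot(grid, yhat, type = "l"); points(x, y, col = "red")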

But in linear fitting, _overfitting_ is simpler: there, of course, you do get
a linear function that doesn't _go wild_, and the function can fit the data
nicely. The usual problem is that the coefficients in the linear function are
not unique.
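
A quick way to see that non-uniqueness in R (a minimal sketch with made-up
data): give lm() two perfectly correlated predictors and look at the
coefficients.

    set.seed(1)
    x1 <- rnorm(100)
    x2 <- 2 * x1                      # perfectly correlated with x1
    y  <- 3 * x1 + rnorm(100, sd = 0.1)

    # lm() detects the rank deficiency and reports NA for the aliased
    # coefficient: infinitely many coefficient pairs give the same
    # fitted values, so no single pair is identifiable.
    coef(lm(y ~ x1 + x2))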

------
asavinov
What they describe is actually (very) shallow feature engineering, and these
methods have been used for decades. Such techniques can hardly help in really
difficult data analysis tasks, where abstract features need to be either
manually defined or automatically mined.

~~~
shas3
Yeah! And they say "feature engineering" where "feature selection" would
suffice. The hard part of feature engineering is usually not selection but
creatively generating features based on expert knowledge, etc. Also, something
like a wrapper would be overkill if you had, roughly, #data/class >
#features and used a relatively overfitting-resistant classifier like an SVM,
regularized regression, etc. (a more rigorous statement of this would involve
the VC dimension).
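
For contrast, a minimal sketch of feature _generation_ in R (the derived
columns are invented for illustration, using the built-in mtcars data): the
creative step is deciding which derived quantities might matter, not picking
among existing columns.

    data(mtcars)

    # Domain-motivated derived features, not selection among existing ones
    mtcars$power_to_weight <- mtcars$hp / mtcars$wt   # performance proxy
    mtcars$log_disp        <- log(mtcars$disp)        # tame a skewed scale
    mtcars$hp_per_cyl      <- mtcars$hp / mtcars$cyl  # per-cylinder output

    head(mtcars[, c("power_to_weight", "log_disp", "hp_per_cyl")])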

------
habaryu
What does this paragraph even mean?

"On the same note, though perfectly correlated variables are redundant and
might not add value to the model (and if removed, would be computationally
efficient), a high-variable correlation could have additional information to
add. In other words, two variables that are not correlated could still be of
importance to the model. When in doubt, it’s safer to train the model and
observe it’s performance."

The "In other words" part seem to talk about uncorrelated variables while the
sentence before that talked about highly correlated variables. I always
thought to discard one of two variables that were highly correlated.

~~~
bunderbunder
Yeah, I'm not sure either; that seems to be an editing mangle.

The bit about "it's safer to train the model and observe its performance",
though, is reasonable advice. The main problem with multicollinearity is that
it gives you a model with poorly defined coefficients - that is, their
standard errors are high. That's a big problem if you're coming at it
with a statistician's mindset and trying to come up with a parsimonious model
with statistically significant parameter estimates. If you're just going for
the best predictive model you can get, though, then you don't necessarily care
about super tight standard errors on all your coefficients, so it's not such a
big deal.
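
A minimal simulation in R of that standard-error point (the numbers are
arbitrary): two nearly identical predictors give a fit that predicts fine but
has huge standard errors on the individual coefficients.

    set.seed(42)
    n  <- 200
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n, sd = 0.01)    # correlation with x1 is ~1
    y  <- x1 + x2 + rnorm(n)

    summary(lm(y ~ x1 + x2))   # inflated standard errors on x1 and x2
    summary(lm(y ~ x1))        # one predictor: tight standard error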

------
hellrich
A colleague of mine is quite happy using caret for streamlined feature
selection
[http://topepo.github.io/caret/index.html](http://topepo.github.io/caret/index.html)
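
For reference, a minimal sketch of caret's filter-style helpers (assuming
caret is installed; the 0.9 cutoff is arbitrary): findCorrelation() suggests
columns to drop from a correlation matrix, and nearZeroVar() flags
near-constant columns.

    library(caret)

    data(mtcars)
    drop_ix <- findCorrelation(cor(mtcars), cutoff = 0.9)
    if (length(drop_ix) > 0) {
      reduced <- mtcars[, -drop_ix]  # drop one of each highly correlated pair
    }

    nearZeroVar(mtcars)  # integer(0) here: no degenerate columns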

~~~
wakkaflokka
Is there a Python equivalent of this package?

~~~
gnaddel
The closest equivalent I can think of is scikit-learn [1], which also gives
you a unified way of using many different algorithms. I wouldn't say the two
are really equivalent, but both are excellent and a joy to use. One major
difference is that, as far as I know, caret is mainly a standardization
wrapper for other R packages' functionality, while scikit-learn uses its own
implementations.

[1] [http://scikit-learn.org/stable/](http://scikit-learn.org/stable/)

------
platz
See ISLr (linked in OP's post) for a good exploration of feature selection
[http://www-bcf.usc.edu/~gareth/ISL/](http://www-bcf.usc.edu/~gareth/ISL/)

------
stewbrew
This seems specific to Microsoft R -- or can I install the MicrosoftML package
in regular R and/or use it under Linux? I really wonder what MS's investment
in R means for the future of R.

~~~
digitalzombie
They're just trying not to lose in the ML space, that's all. They want people
to use their cloud servers.

Just because it's specific to Microsoft R and their package doesn't mean those
algorithms are owned by Microsoft. You just have to Google for other packages
that implement those algorithms. All of them are just statistics... R has more
statistics packages than Python does. I would be highly skeptical if you
couldn't find LDA, PCA, etc. in a package somewhere.

The only important part here is the technical knowledge of what these
algorithms are for, and when to use them and when not to.

~~~
stewbrew
MS has a history of "embracing" technologies. The arrogance is visible in the
title, which doesn't mention that it's specific to MS R.

I didn't say you can't do this with standard R. I didn't say they invented or
own anything.

~~~
digitalzombie
> MS has a history of "embracing" technologies.

Are you kidding me? You're telling me they're embracing it and not just trying
to grab a share of the ML market, playing catch-up ever since Hadoop came
about?

> This seems specific to Microsoft R or can I install the MicrosoftML package
> in regular R and/or use it under Linux?

You asked a question and I provided an answer. Just because you didn't like it
doesn't mean it's not an alternative to Revolution R. You're just taking it as
if it were some attack on you; grow up. I'm stating that if it's not possible,
you can always find a package for it. These are statistics algorithms that you
learn in graduate stat courses, and R is a statistics language. So you can
find them even if you can't use Microsoft's stuff, and you don't have to worry
about whether it works or not.

I also don't get the fascination with this library. You can just use packages
that are agnostic to the R version (Microsoft Revolution R or regular R).

They're trying to sell their version of R and hopefully their cloud stuff like
Azure.

> The arrogance is visible in the title that doesn't mention it's specific to
> MS R.

I have no idea what that means.

All the mx-prefixed functions are Revolution R only.

~~~
stewbrew
I didn't downvote your comment, but I don't understand what you're talking
about.

------
KCFforecast
A table of comparable functions in base R and Microsoft R:
[https://msdn.microsoft.com/en-us/microsoft-r/scaler/compare-base-r-scaler-functions](https://msdn.microsoft.com/en-us/microsoft-r/scaler/compare-base-r-scaler-functions)

------
iconvalleysil
Somewhat related: adding features based on trending concepts
[http://54.174.116.134/recommend/datasets/](http://54.174.116.134/recommend/datasets/)

~~~
nl
That looks interesting, but without any additional details, or even other
pages on the website, it is difficult to consider using it.

~~~
digitalzombie
Seriously, it's just "use these data too, they'll help your models!" Nothing
about how these features were selected or what algorithm was used.

~~~
nl
I don't trade, so they won't help my models.

I'm interested in how they are made, though.

------
mattxxx
Glad to see young kids studying! Hope you continue to take on more advanced
projects!

