From Linear Models to Machine Learning (draft) [pdf] (ucdavis.edu)
151 points by sonabinu on Aug 6, 2016 | 23 comments

Thanks for sharing. I also recommend "An Introduction to Statistical Learning - with Applications in R": http://www-bcf.usc.edu/~gareth/ISL/

A direct link to the PDF for ISL is here:


The grownup version of that, ESL, is also available free:


And for people who are genuinely curious about how this segues into graphical models, NNs, and the autoencoder (maybe the most interesting part of modern NNs), there's


The more curious or research oriented may appreciate


I doubt Gareth or Daniela (the primary authors of ISL) would mind my pointing you towards Hastie's archives since both of them were advised by Trevor Hastie during their PhDs.

Matloff is a great guy. The chapters on shrinkage and dimension reduction aren't yet written in his book, and since those are important topics, consider reading the other books mentioned here as well. These topics are mostly of interest to people who want to draw inferences about the underlying processes that may be generating the observed outcomes. If all you care about is prediction, fit a random forest, an xgboost GBM, or a DNN and be done with it. But if you're genuinely curious about how complex descriptions of rare events can be thoughtfully analyzed, this is the standard progression.

Matloff's book is a great introduction. I particularly like the example on page 204. /ducks

Has anyone come across something similar using Python?

Here you go: https://github.com/JWarmenhoven/ISLR-python

R is a popular language for stats and learning-theory work in research and academia. It's not as popular in production.

It's pretty popular.

I've put substantial amounts of R code into production - it's a nightmare, both for development and operationally. I think 2-3 years ago R was still a superior language for ML/data science dev work, but Python's library support has really caught up and is now mature. The policy I put in place on my current team is to minimize R in production, with Python and Scala preferred. R in some cases still has the best machine learning libraries, which is really the only reason I've found to use it in the production stack. Even then, I prefer to keep it to a few lines of R code (load the data, build the model, handle errors, export the model).

For analyzing data I love R and almost always prefer it to Python.

> R in some cases still has the best machine learning libraries

Would you mind naming some specific machine learning techniques that are still better in R? I've been studying machine learning and linear algebra the past few months, and I'd love to have a try at implementing one myself in Python, as a learning exercise.

Glmnet and Cox proportional hazards regression (survival analysis) are two recent examples I came across missing Python implementations.

Did you look in statsmodels? I appreciate the suggestions, and for a moment I was hopeful that survival analysis was a genuine gap, but it looks like both that and GLM are well covered in the latest version of statsmodels. Don't be misled by the old SourceForge site; there's been a huge flurry of recent activity in statsmodels, with hundreds of new PRs merged. Look at GitHub and the docs site linked from that repo: http://www.statsmodels.org/stable/

There's even a Jupyter notebook comparing the R, Stata (that takes me back, used Stata in survival analysis class 10 years ago), and Python versions of proportional hazards regression: http://nbviewer.jupyter.org/urls/umich.box.com/shared/static...

Glmnet has quite a bit of functionality that is lacking in the Python elastic net implementations. The most notable is the regularization-parameter sequence grid search (alpha in statsmodels, lambda in glmnet), which works remarkably well and can be orders of magnitude faster than a traditional grid search.
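For comparison, scikit-learn's `ElasticNetCV` also fits along an automatically generated penalty sequence with warm starts, which is the closest sklearn analogue to glmnet's lambda path (sklearn calls the penalty strength `alpha`). A sketch on invented data:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta = np.zeros(10)
beta[:3] = [2.0, -1.0, 0.5]                # sparse true coefficients
y = X @ beta + 0.1 * rng.normal(size=100)

# n_alphas sets the length of the generated path, like glmnet's nlambda;
# the path is fit with warm starts rather than a cold fit per grid point.
enet = ElasticNetCV(l1_ratio=0.9, n_alphas=50, cv=5).fit(X, y)
print(enet.alpha_)                         # penalty chosen by cross-validation
print(enet.coef_)
```

How this compares in speed and selection quality to glmnet's path is an open question per the comment above; the sketch just shows the equivalent API surface.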

Last time we tried Cox PH regression from statsmodels, it gave a (very) different result than R's, and we weren't comfortable with the lack of tests. Will give it another try.

not the OP, but glmnet is still the standard for lasso'ing, and I'm not aware of a Python implementation for post-selection inference and statistical testing yet.

That said, the secret sauce in all of those is FORTRAN.

Thanks for the suggestions, I appreciate it. As it turns out, there are Python implementations of all of those between scikit-learn and this project: https://github.com/selective-inference/Python-software. statsmodels has GLM capabilities, and there are even Python bindings for R's glmnet FORTRAN library.

I'm not sure how well those compare to the R implementations, but they look well-built at first glance.

Any other ideas out there?

Python does not lack for models. It has broad enough coverage, and good enough infrastructure to build the rest if you need it. Where it lacks is everything else you need in day-to-day modeling: design of experiments (DoE), graphics, utility functions, inference, etc.

Interesting! Great find on the selective inference, I should have known to look. Tucking this away for when it is needed.

I haven't seen it, maybe just me. Which companies do you know of?

Ebay, Electronic Arts, Google, Microsoft, many financial institutions.

Are these production systems? I'm a bit blown away that this can work with R.


R is a glue language. The fast bits are always written in C++ or FORTRAN, or call down into BLAS/LAPACK.

R-vs-Python is almost never the problem in production. Interpreted-vs-compiled is almost always the issue. (I'm aware of Numba and similar efforts. Last time I tried it, it fell well short. And Theano is a rather specialized tool that most people don't actually need.)

JMHO. But I've never seen anyone dealing with truly huge data and inference problems that had the low-level bits in anything other than C++ or FORTRAN. I could imagine that Scala can do a pretty good job now, especially if you use Spark a lot. But R vs Python seems like a really stupid question. Use the one that has the libraries you need.
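The glue-language point above holds for NumPy just as it does for R (a sketch with an invented matrix size): the `@` operator dispatches to a compiled BLAS `dgemm`, while the equivalent interpreted triple loop is orders of magnitude slower.

```python
import time
import numpy as np

n = 100
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))

t0 = time.perf_counter()
C = A @ B                                  # dispatches to BLAS dgemm
blas_time = time.perf_counter() - t0

t0 = time.perf_counter()
D = [[sum(A[i, k] * B[k, j] for k in range(n)) for j in range(n)]
     for i in range(n)]                    # pure interpreted triple loop
loop_time = time.perf_counter() - t0

print(f"BLAS: {blas_time:.4f}s  loop: {loop_time:.4f}s")
```

This is why "which interpreter" matters so much less than which compiled libraries the interpreter is gluing together.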

This. R vs. Python is a very stupid debate, especially over speed. (Pandas, for example, is slower than data.table.) There are plenty of ways to have performant models in R: if you're worried about performance in deployment, run H2O, for one; there are plenty of others. There's also flashr, if you want to write your own algorithms; it swaps R's basic operators and data management for C++.

Also, Google is doing to R what they did to JavaScript with V8. Expect GA next year.

What is GA?

I think the right answer is both.

And probably Javascript, too.

Yes, those are production systems, like I said. eBay scores their search results. EA scores customer lifetime value, churn, and marketing communications.

The use of non-monospaced fonts for code fragments in LaTeX-composed books must stop.
