
Data Science Rosetta Stone: Classification in R, Python, Matlab, SAS, Julia - jeffheaton
http://www.heatonresearch.com/2017/08/17/ds_rosetta_stone.html
======
minimaxir
"Rosetta Stone" implies that there is a universal stratagem for processing any
dataset in any language.

One of my pet peeves with everyone using the Titanic dataset as a Hello World
for data science is that real-world _datasets are not as clean or intuitive_.
ETL and variable selection are half the battle, if not more.

~~~
digitalzombie
No, but the Titanic set lets you practice essential skills and dip your toe
into Kaggle.

You still need to clean up the missing data and do some sort of imputation.
(Edit: you cannot use randomForest until you deal with those missing values,
in R at least, and from a theory perspective I don't recall CART handling
missing data.)
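
A minimal sketch of that imputation step, in plain Python rather than R. The rows, the `Age` and `Fare` values, and the `impute_median` helper are all hypothetical, and median fill is only one of several reasonable choices:

```python
from statistics import median

# Hypothetical toy rows; None marks a missing Age (an NA in R terms).
rows = [
    {"Age": 22.0, "Fare": 7.25},
    {"Age": None, "Fare": 71.28},
    {"Age": 26.0, "Fare": 7.92},
    {"Age": 35.0, "Fare": 53.10},
    {"Age": None, "Fare": 8.05},
]


def impute_median(rows, column):
    """Return (filled_rows, fill_value); missing values in `column` get the median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    # Build new dicts so the original rows stay untouched.
    filled = [{**r, column: fill if r[column] is None else r[column]} for r in rows]
    return filled, fill


filled_rows, age_median = impute_median(rows, "Age")
print(age_median)
print([r["Age"] for r in filled_rows])
```

Once no row has a missing value, a learner that rejects missing data can be fit on the result.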

If you want to eke out those last few percentage points of accuracy, you have
to do feature engineering, which gave me the chance for the first time to
actually understand and practice it.
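
As a rough illustration of what feature engineering can look like here, the sketch below derives two features from raw passenger fields. The passenger record is made up, the `Name`, `SibSp`, and `Parch` field names are assumed to follow the Kaggle Titanic layout, and `extract_title` and `family_size` are hypothetical helpers:

```python
import re


def extract_title(name):
    """Pull the honorific out of a 'Surname, Title. Given names' string."""
    match = re.search(r",\s*([^.,]+)\.", name)
    return match.group(1).strip() if match else "Unknown"


def family_size(sibsp, parch):
    """Siblings/spouses plus parents/children plus the passenger."""
    return sibsp + parch + 1


# Hypothetical passenger record, not a real row from the dataset.
passenger = {"Name": "Doe, Mrs. Jane", "SibSp": 1, "Parch": 2}

features = {
    "Title": extract_title(passenger["Name"]),
    "FamilySize": family_size(passenger["SibSp"], passenger["Parch"]),
}
print(features)
```

Whether features like these actually help is something to check against a held-out split, not assume.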

And of course you gotta do EDA on it.

The only thing I've seen that is bad about the Titanic is the data leakage.
People are getting 100% accuracy because you can look up who actually died,
or use the test data with your train data and increase your model accuracy.
But it also introduces you to the concept of data leakage.
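
One small way to see the train/test version of that leakage, sketched with made-up ages and hypothetical `fit_mean` and `apply_fill` helpers: a fill value computed over train and test together has already looked at the test rows, while one computed on train alone has not.

```python
from statistics import mean

# Made-up ages; None marks a missing value.
train_ages = [22.0, 38.0, 26.0, None, 35.0]
test_ages = [None, 80.0, 2.0]


def fit_mean(values):
    """Learn the fill value from the observed entries only."""
    return mean(v for v in values if v is not None)


def apply_fill(values, fill):
    """Replace missing entries with a previously learned fill value."""
    return [fill if v is None else v for v in values]


safe_fill = fit_mean(train_ages)               # learned from train only
leaky_fill = fit_mean(train_ages + test_ages)  # peeks at the test rows

# The leak-free path reuses the train-only statistic on both splits.
train_filled = apply_fill(train_ages, safe_fill)
test_filled = apply_fill(test_ages, safe_fill)
print(safe_fill, leaky_fill)
```

The two fill values differ, which is the point: anything fit with the test rows in view carries information the model should not have.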

I think the Titanic dataset is nice and compact enough that it lets you
practice a variety of skill sets within the data science domain. Much better
than when I had to deal with medical genetic data.

> variable selection

You mean multivariate data?

I think there's a reason why in applied statistics you take statistics and
regression first before you jump into multivariate.

Unless you just want to blindly do PCA and factor analysis on everything under
the sun without understanding the theory, sure.

------
SmellTheGlove
I like this. Being in large companies for a long time, SAS has been a staple,
but I'm looking to get folks to do more in Python or R since I think the
external talent pool is going there. Articles like this help illustrate that
the transition isn't particularly horrible.

~~~
tnecniv
This article made SAS seem very unfun to work with.

~~~
SmellTheGlove
I learned SAS in college and 15 years later I'm still using it, so it's become
pretty natural to me. It's not particularly fun to learn, though, I agree. The
lack of a modern editor kind of sucks too. You're not going to get hints, code
completion and linting. You do get a feel for the madness when you work on a
mainframe implementation, though - all of a sudden, the PROC syntax starts to
make sense. I'm not going to lie, though, I do most/all data manipulation and
ETL ops in PROC SQL. The only time I touch the DATA step is when I need to
loop. It does have really good Teradata integration, though, and that's really
important for most of my use cases.

That said, the one language that's gotten me away from SAS is Python. It's
nice to have a general purpose language with so much community support,
packages, etc. And I don't need to call an account rep if I need to extend
functionality. Budgets aside, SAS is seriously messy when it comes to its
modules.

~~~
closed
> And I don't need to call an account rep if I need to extend functionality.

Thank you for reminding me that this special kind of hell exists.

~~~
tnecniv
I'm thankful I've never had a serious bug with MATLAB. I have a friend who
spent two days debugging an issue in one of their distribution files with a
rep. Sounded like hell.

Similarly, MATLAB's concept of packages is also non-existent.

------
zitterbewegung
This gives me an idea to make a Rosetta Stone, not for languages but for
models (SVM / Gradient Boosting / Linear Regression...). Probably would have
to do it using two datasets or more.

