Data Science Rosetta Stone: Classification in R, Python, Matlab, SAS, Julia

minimaxir · on Aug 17, 2017

"Rosetta Stone" implies that there is a universal stratagem for processing any dataset in any language.

One of my pet peeves with everyone using the Titanic dataset as a Hello World for data science is that real-world datasets are not as clean or intuitive. ETL and variable selection are half the battle, if not more.

digitalzombie · on Aug 17, 2017

No, but the Titanic set lets you practice essential skills and dip your toe into Kaggle.

You still need to clean up the missing data and do some sort of imputation. (Edit: you cannot use randomForest until you deal with those missing values, in R at least, and from a theory perspective I don't recall CART handling missing data.)
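For anyone who hasn't done imputation before, here is a minimal sketch of the idea in plain Python: fill numeric gaps with the median and categorical gaps with the most common value. The toy rows and the helper name `impute` are made up for illustration; they are not the actual Kaggle file or any library's API.

```python
from collections import Counter
from statistics import median

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"age": 22.0, "embarked": "S"},
    {"age": None, "embarked": "C"},
    {"age": 38.0, "embarked": None},
    {"age": 26.0, "embarked": "S"},
]

def impute(rows, numeric_cols, categorical_cols):
    """Return copies of the rows with gaps filled:
    median for numeric columns, most frequent value for categorical ones."""
    fill = {}
    for col in numeric_cols:
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        counts = Counter(r[col] for r in rows if r[col] is not None)
        fill[col] = counts.most_common(1)[0][0]
    # Build new dicts so the original rows are left untouched.
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]

imputed = impute(rows, numeric_cols=["age"], categorical_cols=["embarked"])
```

Median/mode filling is only the simplest option; it is a placeholder for whatever imputation you actually pick.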

If you want to eke out those extra percentage points of accuracy, you have to do feature engineering, which gave me the chance for the first time to actually understand and practice it.
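A rough sketch of what that feature engineering can look like, in plain Python. It assumes names follow the "Surname, Title. Given names" pattern and that sibling/spouse and parent/child counts are available; the lowercase keys and the helper names `extract_title` and `add_features` are my own, purely for illustration.

```python
import re

def extract_title(name):
    """Pull the honorific (e.g. 'Mr', 'Mrs') out of a
    'Surname, Title. Given names' string; 'Unknown' if the pattern is absent."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "Unknown"

def add_features(passenger):
    """Return a copy of the passenger dict with derived features added."""
    out = dict(passenger)
    out["title"] = extract_title(passenger["name"])
    # Siblings/spouses + parents/children + the passenger themselves.
    out["family_size"] = passenger["sibsp"] + passenger["parch"] + 1
    out["is_alone"] = out["family_size"] == 1
    return out

example = add_features({"name": "Braund, Mr. Owen Harris", "sibsp": 1, "parch": 0})
```

Whether features like these actually help is something you have to check against a held-out set.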

And of course you gotta do EDA on it.

The only bad thing I've seen about the Titanic set is the data leakage. People are getting 100% accuracy because you can look up who actually died, or mix the test data into your training data to inflate your model's accuracy. But it also introduces you to the concept of data leakage.
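To make the leakage point concrete, here is a deliberately silly sketch: a "model" that only memorizes outcomes by passenger id. The ids, labels, and function names are invented for the example; no real model or leaderboard entry works exactly like this.

```python
def fit_lookup(rows):
    """'Train' by memorizing the known outcome for each passenger id."""
    return {r["id"]: r["survived"] for r in rows}

def accuracy(model, rows, default=0):
    """Share of rows predicted correctly; unseen ids fall back to `default`."""
    hits = sum(model.get(r["id"], default) == r["survived"] for r in rows)
    return hits / len(rows)

# Hypothetical toy split.
train = [{"id": 1, "survived": 0}, {"id": 2, "survived": 1}, {"id": 3, "survived": 1}]
test = [{"id": 4, "survived": 1}, {"id": 5, "survived": 0}]

honest = fit_lookup(train)        # has only seen the training rows
leaky = fit_lookup(train + test)  # test labels leaked into training

honest_acc = accuracy(honest, test)
leaky_acc = accuracy(leaky, test)
```

The leaky version scores perfectly on the test rows only because it has already seen their answers, which says nothing about how it would do on passengers it has never seen.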

I think the Titanic dataset is nice and compact enough that it lets you practice a variety of skill sets within the data science domain. Much better than when I had to deal with medical genetic data.

> variable selection

You mean multivariate data?

I think there's a reason why in applied statistics you take statistics and regression first before you jump into multivariate.

Unless you just want to blindly do PCA and factor analysis on everything under the sun without understanding the theory, sure.

jeffheaton · on Aug 19, 2017

Neither "Rosetta Stone" nor "hello world" implies anything about "universal". Actually, quite the opposite. The real Rosetta Stone is a very small subset of the 3 languages, but paved the way for greater understanding. The real Rosetta Stone text is actually pretty mundane. And "hello world" apps are anything but "universal". This is just how to do a simple (non-universal) data science task in 5 languages.

SmellTheGlove · on Aug 17, 2017

Of course, but the objective of this writeup appears to be illustrating the same basic problem in a few different languages.

It does strike me as a good idea to do something similar for data manipulation/cleansing. If I ever find some free time, I'll write it and post it somewhere.

SmellTheGlove · on Aug 17, 2017

I like this. Being in large companies for a long time, SAS has been a staple, but I'm looking to get folks to do more in Python or R since I think the external talent pool is going there. Articles like this help illustrate that the transition isn't particularly horrible.

tnecniv · on Aug 17, 2017

This article made SAS seem very unfun to work with.

SmellTheGlove · on Aug 17, 2017

I learned SAS in college and 15 years later I'm still using it, so it's become pretty natural to me. It's not particularly fun to learn, though, I agree. The lack of a modern editor kind of sucks too. You're not going to get hints, code completion and linting. You do get a feel for the madness when you work on a mainframe implementation, though - all of a sudden, the PROC syntax starts to make sense. I'm not going to lie, though, I do most/all data manipulation and ETL ops in PROC SQL. The only time I touch the DATA step is when I need to loop. It does have really good Teradata integration, though, and that's really important for most of my use cases.

That said, the one language that's gotten me away from SAS is Python. It's nice to have a general purpose language with so much community support, packages, etc. And I don't need to call an account rep if I need to extend functionality. Budgets aside, SAS is seriously messy when it comes to its modules.

closed · on Aug 17, 2017

> And I don't need to call an account rep if I need to extend functionality.

Thank you for reminding me that this special kind of hell exists.

tnecniv · on Aug 17, 2017

I'm thankful I've never had a serious bug with MATLAB. I have a friend who spent two days debugging an issue in one of their distribution files with a rep. Sounded like hell.

Similarly, MATLAB's concept of packages is also non-existent.

zitterbewegung · on Aug 17, 2017

This gives me an idea to make a Rosetta Stone, not for languages but for models (SVM / Gradient Boosting / Linear Regression...). It would probably have to use two or more datasets.