Software development skills for data scientists (treycausey.com)
53 points by mirceasoaica on May 21, 2015 | 17 comments

Going the other direction (from pure software engineering to data science), a great resource I found is a book called "Think Stats": http://greenteapress.com/thinkstats/. The book is available as a free PDF download in addition to the print version.

The book covers statistics concepts by prompting the reader to explore a large dataset with Python, writing statistical functions along the way. A lot of it is fairly basic, but it's a good primer.
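To give a flavor of what "writing statistical functions along the way" can look like, here is a minimal sketch in plain Python. It is not taken from the book; the function names and the sample data are made up for illustration.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def variance(values):
    """Population variance: the mean squared deviation from the mean."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def std_dev(values):
    """Population standard deviation: the square root of the variance."""
    return sqrt(variance(values))


# Hypothetical sample data, invented for this example.
sample = [3.2, 3.5, 2.9, 3.8, 3.1]
print(mean(sample), std_dev(sample))
```

Writing these by hand before reaching for a library is mostly a way to check that you understand the definitions.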

I'm doing the Johns Hopkins/Coursera "Programming in R"[1] course at the moment.

This course seems to be written for statisticians to learn programming, but as someone going the other way it is painful.

There's an assignment[2], where a function prototype is

  makeCacheMatrix <- function(x = matrix()) {}

makeCacheMatrix doesn't make a matrix (it wraps a preexisting matrix in a structure that lets it cache the inverse of the matrix along with it).

While I understand that the problem is a constructed one to teach R scoping, the completely wrong naming of the function makes me wince every time I see it.

[1] https://class.coursera.org/rprog-014

[2] https://github.com/rdpeng/ProgrammingAssignment2/blob/master...
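For readers who haven't seen the assignment, the wrap-and-cache pattern described above can be sketched roughly as follows. This is a Python analogue, not the course's R code: `wrap_with_inverse_cache` and `invert_2x2` are hypothetical names, and `nonlocal` stands in only loosely for what R's `<<-` does in the original exercise.

```python
def wrap_with_inverse_cache(matrix, invert):
    """Wrap an existing matrix so its inverse is computed at most once.

    `invert` is a caller-supplied function that returns the inverse of a
    matrix (an assumption of this sketch, to keep it dependency-free).
    """
    cached_inverse = None  # lives in the enclosing scope, shared by the closures

    def get_matrix():
        return matrix

    def set_matrix(new_matrix):
        nonlocal matrix, cached_inverse
        matrix = new_matrix
        cached_inverse = None  # a new matrix invalidates the cached inverse

    def inverse():
        nonlocal cached_inverse
        if cached_inverse is None:
            cached_inverse = invert(matrix)  # computed on first request only
        return cached_inverse

    return {"get": get_matrix, "set": set_matrix, "inverse": inverse}


def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists (illustration only)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


wrapped = wrap_with_inverse_cache([[2.0, 0.0], [0.0, 4.0]], invert_2x2)
first = wrapped["inverse"]()   # computed here
second = wrapped["inverse"]()  # served from the cache
print(first is second)
```

Note that the name says what the function does (wraps and caches) rather than suggesting it builds a matrix, which is the point of the complaint.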

I have taken the same classes, and they get much more interesting. Chances are that you could complete the work for the first courses in a couple hours without the lectures, or maybe just skimming the notes. I recommend doing that quickly and moving on to the later material, so that you don't get discouraged thinking that you are completely wasting your time.

Do they really get better? I'm either going to jump straight to the (R) Statistical Inference[1] course from JHU, or switch to the Berkeley/EdX Spark course[2].

I use a lot more Spark in my day job than R, but I really should learn statistics more formally.

[1] https://www.coursera.org/course/statinference

[2] https://www.edx.org/course/scalable-machine-learning-uc-berk...

I thought they got better compared to the first few classes, but they do really revolve around R. For a rigorous treatment of the subject matter, the MITx course on Probability is really good[1]. You could also take a look at the two JHU "Mathematical Biostatistics Bootcamp"[2] courses. Those are also quick compared to the MITx course, but a little more careful about the math than the courses in the data science specialization are.

I haven't ever used Spark, and I like R, but I am going to take the Berkeley/EdX course.

[1] https://www.edx.org/course/introduction-probability-science-...

[2] https://www.coursera.org/course/biostats & https://www.coursera.org/course/biostats2

'Software Carpentry' has some good lessons on the same material:


I think this would be appropriate for all self-taught programmers.

R, Python or Ruby would be a start; C would be a dream.

Spend the time to find folks who know R or similar and have written extensions. You want doers, not academics.

Would not FORTRAN make more sense than C if a real compiled language is required for performance?

Actually Fortran and C++ would be better options than C.

Actually you most probably want to learn Java or .NET, not make an enormous investment in C++ so you can do high performance stuff. If you want that kind of performance (very rare) you can contract it out. You can't contract out proofs of concept though.

True, but there is a certain stigma against those languages in HPC.

Otherwise, something like accelerate, Alea GPU or Aparapi might be better approaches.

Yes, but Java is such a swine to get to grips with for things like Hadoop.

Back when I did MR I found PL/1G much easier to get to grips with than having to spend 90% of your development time trying to work out the goddamned classpath and getting Eclipse to work.

Actually R, Python and Scala are the best languages for data science.

All three have great library support (e.g. Spark) and are the primary languages for big data.

What about the other way around?

You could start with a book like "Practical Data Science with R". This review http://www.amazon.com/review/R3IOXEI69G6044 gives a very good summary of its contents and the intended audience.

I may be a bit biased, but my book "Data Science from Scratch" is a good choice. :)

