

Software development skills for data scientists - mirceasoaica
http://treycausey.com/software_dev_skills.html

======
mojoe
Going the other direction (from pure software engineering to data science), a
great resource I found is a book called "Think Stats":
[http://greenteapress.com/thinkstats/](http://greenteapress.com/thinkstats/).
The book is available as a free PDF download in addition to the print version.

The book covers statistics concepts by prompting the reader to explore a large
dataset with Python, writing statistical functions along the way. A lot of it
is fairly basic, but it's a good primer.

------
nl
I'm doing the Johns Hopkins/Coursera "R Programming"[1] course at the
moment.

This course seems to be written for statisticians to learn programming, but as
someone going the other way it is _painful_.

There's an assignment[2] where the function prototype is:

    makeCacheMatrix <- function(x = matrix()) {}

makeCacheMatrix doesn't make a matrix (it wraps a preexisting matrix in a
structure that lets it cache the inverse of the matrix along with it).
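
A minimal sketch of what such a wrapper might look like, assuming a closure-based design. The name wrapCachedInverse and the accessors get, set and getInverse are hypothetical, chosen for illustration and not taken from the assignment:

```r
# Hypothetical name: wrapCachedInverse is not part of the assignment.
# It wraps an existing matrix and caches its inverse alongside it.
wrapCachedInverse <- function(x = matrix()) {
  inv <- NULL  # cached inverse; NULL until it is first computed

  list(
    get = function() x,
    set = function(y) {
      x <<- y       # replace the wrapped matrix in the enclosing environment
      inv <<- NULL  # the old inverse is stale, so drop it
    },
    getInverse = function() {
      if (is.null(inv)) {
        inv <<- solve(x)  # compute once and store in the enclosing environment
      }
      inv
    }
  )
}

m <- wrapCachedInverse(matrix(c(2, 0, 0, 2), nrow = 2))
first <- m$getInverse()   # computed with solve() and cached
second <- m$getInverse()  # served from the cache
```

The `<<-` operator assigns into the enclosing function's environment rather than creating a local variable, which is presumably the scoping behaviour the assignment is meant to teach.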

While I understand that the problem is a constructed one to teach R scoping,
the completely wrong naming of the function makes me wince every time I see it.

[1]
[https://class.coursera.org/rprog-014](https://class.coursera.org/rprog-014)

[2]
[https://github.com/rdpeng/ProgrammingAssignment2/blob/master...](https://github.com/rdpeng/ProgrammingAssignment2/blob/master/cachematrix.R)

~~~
rz2k
I have taken the same classes, and they get much more interesting. Chances are
that you could complete the work for the first courses in a couple hours
without the lectures, or maybe just skimming the notes. I recommend doing that
quickly and moving on to the later material, so that you don't get discouraged
thinking that you are completely wasting your time.

~~~
nl
Do they really get better? I'm either going to jump straight to the (R)
Statistical Inference[1] course from JHU, or switch to the Berkeley/EdX Spark
course[2].

I use a lot more Spark in my day job than R, but I really should learn
statistics more formally.

[1]
[https://www.coursera.org/course/statinference](https://www.coursera.org/course/statinference)

[2] [https://www.edx.org/course/scalable-machine-learning-uc-berk...](https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x)

~~~
rz2k
I thought they got better compared to the first few classes, but they really
do revolve around R. For a rigorous treatment of the subject matter, the
MITx course on Probability is really good[1]. You could also take a look at
the two JHU "Mathematical Biostatistics Bootcamp"[2] courses. Those are also
quick compared to the MITx course, but a little more careful about the math
than the courses in the data science specialization are.

I haven't ever used Spark, and I like R, but I am going to take the
Berkeley/EdX course.

[1] [https://www.edx.org/course/introduction-probability-science-...](https://www.edx.org/course/introduction-probability-science-mitx-6-041x-0)

[2]
[https://www.coursera.org/course/biostats](https://www.coursera.org/course/biostats)
&
[https://www.coursera.org/course/biostats2](https://www.coursera.org/course/biostats2)

------
new1234567
'Software Carpentry' has some good lessons on the same material:

[http://software-carpentry.org/lessons.html](http://software-carpentry.org/lessons.html)

------
collyw
I think this would be appropriate for all self-taught programmers.

------
lessthunk
R, Python, or Ruby would be a start; C would be a dream.

Spend the time to find folks who know R or similar and have written
extensions. You want doers, not academics.

~~~
mauricemir
Wouldn't FORTRAN make more sense than C if a real compiled language is
required for performance?

------
anonymousDan
What about the other way around?

~~~
bladecatcher
You could start with a book like "Practical Data Science with R". This review
[http://www.amazon.com/review/R3IOXEI69G6044](http://www.amazon.com/review/R3IOXEI69G6044)
gives a very good summary of its contents and the intended audience.

