
The Open Source Data Science Masters - olalonde
https://github.com/datasciencemasters/go
======
nl
I don't like this curriculum very much - I think it is way too heavy on the
data engineering side and covers way, way too little of the actual mechanics
of the data science bit.

For example, the words "validation" (as in cross validation) and "overfitting"
aren't mentioned anywhere on that page, and yet things like data scraping are
mentioned multiple times.

With all due respect, I can find lots of people to do scraping, but it is much
harder to find someone to explain a good strategy for cross validation on time
series data (for example).
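
A minimal sketch of what a sound strategy looks like, using scikit-learn's `TimeSeriesSplit` (the data here is a toy series invented for illustration): every training fold strictly precedes its test fold in time, so the model never peeks at the future, which is exactly the leakage an ordinary shuffled k-fold would introduce.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy ordered series: 10 observations indexed by time.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Expanding-window splits: the training window grows, and the
# test window always lies strictly after it in time.
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    assert train_idx.max() < test_idx.min()  # no lookahead leakage
    print("train:", train_idx, "test:", test_idx)
```

With 10 samples and `n_splits=3` this yields test windows of size 2, each preceded by all earlier points as training data.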

And yes, "data scientist" is a vague term that can mean pretty much anything.

Having said that, if you did everything on this list you'd be a pretty good
data scientist.

(I run a data science team, and I'm involved in building a data science
competency framework so it's something I think about a fair bit)

~~~
_coaxialcabal
Good point. That said, it's easy enough to augment an expert data scientist
with 4:1 data engineering support, whereas a data scientist working solo will
spend 80% of their time engineering data. With all the hype and inflated
expectations, IMO it's much easier to hire aspiring data scientists than
talented engineers who are satisfied with the data prep and admin aspects.
MSPA programs are realistic that the bulk of their graduates will be spending
much of their time as data janitors.

~~~
nl
_IMO much easier to hire aspiring data scientists than talented engineers who
are satisfied with the data prep and admin aspects_

As someone who hires both, I can guarantee this is incorrect. Well, maybe
hiring "aspiring" data scientists is ok, but aspirations alone will get me
models that do exactly the wrong thing. So that isn't useful.

------
thr0waway1239
"how can you afford not to take advantage of an open source education?"

But there is a time cost to learning. For example, if a Masters degree takes
two years, does the author have an estimate of how long it would take to
complete this list?

And since the list is a little old (given the rate at which this field is
progressing), I would add Apache Spark to it. I recently watched a video about
how Scala took over the big data world (probably not true) [1], but the
presenter made an interesting point: Spark subsumes a lot of different things
(streaming, machine learning, built-in support for SQL), and it is good enough
at those things even if it isn't the best tool for any of them. Not
surprisingly, that makes it a good candidate for adoption in the enterprise.

And then add the recent availability of things like the Databricks Community
Edition (and similar offerings from competitors), and I could also make the
case for starting from a completely different entry point - learn Spark first,
and then go deeper based on what you are interested in. But most of all, a
platform like Databricks removes a major pain point: the often painful process
of setting everything up before you can start your work. [2]

My last point is not aimed at this particular resource, but just a general
feeling I have when I see lists which incur significant time costs for the
readers to fully pursue. I would like to see in these lists a brief statement
about some popular things which were still omitted and why - simply because
that gives some added confidence about the effort that went into the curation
process.

[1][https://www.youtube.com/watch?v=AHB6aJyhDSQ&t=10m30s](https://www.youtube.com/watch?v=AHB6aJyhDSQ&t=10m30s)

[2]I am not associated in any way with Databricks. Also, obviously Databricks
is a commercial entity, so in some sense you are not just within the "all free
and open" domain.

~~~
Eridrus
I use Spark at work, it's really good when you need to do some large scale
analysis with your own custom code. Everything else it does is just ok. It
certainly doesn't subsume ML tools.

~~~
thr0waway1239
Great to hear from someone who uses Spark at work. I see what you mean -
subsumed is a bad word choice and I should not have tried to paraphrase. What
do you feel about the phrase that the presenter uses - "One tool that fits
everything"? (I am new to data science)

~~~
Eridrus
I think it's marketing nonsense. Databricks in particular has good "proximity
to data" IMO, in that once you've figured out how to use it with your data
sources, you can just fire up a web browser and connect to them, and
connecting those data sources to your code is easy.

The problem with seeing Spark as your one tool for everything is that this is
only true if it's trivial to integrate your code with Spark. Viz tools like
Plotly/Bokeh don't integrate well with Databricks' notebook, and deep learning
tools are not really supported yet unless you're running your own clusters and
special libraries to wire things together.

I think Spark is a good workhorse for big data; it does repetitive things well
at large scale. It's less good when you want to use more niche tools, since
most of the data science community is not focused on Spark. PySpark exists and
will probably be good enough, but only if your data fits into memory on a
single machine anyway.

And if you're not dealing with big data, Spark is overkill. It's usually
simpler to just get a box with more RAM.

------
nextos
All these curricula seem a bit too complex. IMHO, there's one thing that
should be prioritized above everything else: the concept of probability,
computable probability distributions, and Bayesian inference.

It's the one thing that brings a unifying umbrella to all modes of reasoning
under uncertainty. [https://probmods.org/](https://probmods.org/) and
[http://forestdb.org/](http://forestdb.org/) seem to be the best resources for
this at the moment.
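
To make the idea concrete, here is the "hello world" of Bayesian inference, a conjugate Beta-Binomial update for estimating a coin's bias (plain Python, invented numbers; the conjugacy shortcut is standard textbook material, not something from the linked resources):

```python
# Prior over the coin's bias p: Beta(a, b). After observing
# `heads` heads and `tails` tails, conjugacy makes the posterior
# simply Beta(a + heads, b + tails).

def update(a, b, heads, tails):
    """Return the posterior Beta parameters after the observations."""
    return a + heads, b + tails

def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

a, b = 1, 1                               # Beta(1, 1): uniform prior
a, b = update(a, b, heads=7, tails=3)     # observe 7 heads, 3 tails
print(posterior_mean(a, b))               # 8/12 ≈ 0.667
```

The same three ingredients (prior, likelihood, posterior) carry over unchanged to the far richer models on probmods.org; only the update stops being a one-liner.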

Besides, I dislike "Data Science", which seems to be a new buzzword. "Data
Engineering" would be more acceptable, as I think people working in companies
are building stuff rather than developing new theories. But I don't like that
term much either.

~~~
nl
I agree Bayesian techniques are important, and satisfying intellectually.

The problem is that it is entirely possible to build perfectly good models
without ever touching anything Bayesian (excluding naive-Bayes classifiers,
perhaps!), and then adding Bayesian techniques rarely improves the accuracy in
any way.

But I'm happy to admit my understanding of Bayesian techniques is incomplete.
It's something I'm working on ([https://probmods.org/](https://probmods.org/)
is great), but I just haven't found anywhere to use it in anger yet.

~~~
ronald_raygun
So something that no one usually admits is that there are three types of
reasoning about data (frequentist, Bayesian, and nonparametric), and each has
its pros, cons, and circumstances where it is the right choice.

So with frequentist statistics, it is really easy to reason about what the
correct estimator should be (it is almost always the obvious one). For
example, with functional time series (where each data point is a function
rather than a real value), it is straightforward to find an MLE - it is just
the average function. But defining a prior on the space of twice
differentiable functions isn't nearly as easy.
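
A quick numpy sketch of that frequentist point (the grid, noise level, and sine-shaped "true" function are all invented for illustration): under iid Gaussian noise, the MLE of the mean function really is just the pointwise average of the observed curves.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 50)            # shared evaluation grid
truth = np.sin(2 * np.pi * grid)        # the unknown mean function

# 200 noisy realisations of the function, one curve per row.
sample = truth + rng.normal(0, 0.3, size=(200, grid.size))

# MLE of the mean function under iid Gaussian noise:
# the pointwise average of the observed curves.
mle = sample.mean(axis=0)

print(np.abs(mle - truth).max())        # small estimation error
```

The obvious estimator falls out in one line; writing down a sensible prior over smooth functions (e.g. a Gaussian process with a carefully chosen kernel) is a genuinely harder modelling exercise.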

------
Joof
So far, the downside of this is that many companies seem to require a masters.

~~~
pYQAJ6Zm
What about showing a portfolio of data analyses? Does anybody have experience
with this?

~~~
jupiter90000
Someone was hired onto a data science team I worked on without a master's
degree, partly because they scraped some of our company's data from our
website and showed the people interviewing/hiring some cool data-science-style
work built on that scraped data. That said, this tactic may also have
succeeded there because the team would sometimes hire non-master's/PhD holders
if a person showed aptitude and 'passion' for this type of work. We had PhDs
who were really not equipped to do well at the business level, and Bachelor's
holders who made major contributions (and vice versa). Unfortunately, there
are still companies and teams out there that think a higher degree is a good
filter...

