

Getting started in data science - treycausey
http://treycausey.com/getting_started.html

======
uger
While all reasonable points, his resume doesn't actually list any experience
in industry. There are certainly a lot of data analysis skills that can be
learned while at academic institutions.

I would argue, however, that a skill that is often not put on these lists of
'what you need to become a data scientist' is some time spent at a real
private sector company. In my experience, there are quite a few differences
between writing academic papers and coming up with the terse, actionable
information that is useful for a profit-driven company.

~~~
achompas
Trey is a data scientist at Zulily. Think he's been there for a few years now.

------
stanmancan
This is actually a very timely article to pop up for me. I'm currently a Sales
Analyst at a big Telco in Canada. I landed the job with no previous analyst
experience, but years of programming behind me. I never had any formal
programming experience, and everything I know is from years of experience.
Because of this, I'm fairly confident that I can get anything done, but I have
no schooling to support me. Don't get me wrong, I don't consider myself a
highly skilled or knowledgeable programmer, but I know how to break down
complex tasks and produce results. I've been writing functional PHP my whole
life since I'm unable to focus on learning a new language long enough to
actually do so (and because of this, I have lots of self doubt that I'd
actually be able to learn a new one if I wanted to).

I consider my 'Sales Analyst' role an intro to data science. Essentially,
we're collecting all types of data from different sources and producing
reports on them. How many products were sold? From what channels? From which
sales persons on which team? How many customers did they talk to? How many of
their booked installations failed? How many succeeded? What's our market
penetration like? Where are we seeing trends in disconnections or connections?
Why? We're dealing with millions of records, from half a dozen data sources,
often with no unique key matching the data up.

It's a very interesting job, but without any previous experience in the field
it's all been self taught. I've looked into a few data science programs and
books, but I've felt that everything I see and read is so far above me that I
can't even get started on the books or courses. My math skills are pretty
basic, but that's never stopped me before as I've always been able to find the
results that I'm looking for. I know that with the proper knowledge and
experience I'd be able to land a data science job and get a 30-50K raise at
the same time, but just don't have anything on my resume to qualify me. Nor do
I know what an -actual- data scientist does, so I lack the confidence in even
pursuing any openings I see around. I'm getting married next year, have a
seven year old daughter, and bills to pay like anybody else, so the job security
I have in my current role definitely acts as a deterrent from taking a leap
into the unknown as well.

~~~
rahimnathwani
_I consider my 'Sales Analyst' role an intro to data science. Essentially,
we're collecting all types of data from different sources and producing
reports on them._

 _I know ... I'd be able to land a data science job ... but just don't have
anything on my resume to qualify me._

Who uses the reports you produce? What decisions do they make as a result? How
often do they make these decisions? What is the impact?

If you want to build on your current role as a way to become a data scientist,
in your current company or elsewhere, perhaps you can look at ways to:

\- Generate recommendations based on the reports you've designed

\- Automate some decisions that are being made manually now

\- Figure out a way to influence additional teams who could/should use the
data you're analysing

In short: figure out a way to generate additional $$$ impact, or
improve/automate an existing process which requires manual effort, so that
those people can focus on stuff which cannot easily be automated.

------
MrMan
Pretending to be a scientist without doing the academic work means you are
essentially a trade worker. Autodidact? Fine, but you better have put yourself
through the equivalent of at least a masters or phd in statistics along the
way.

We need vocational education to let semiskilled workers use statistical
modeling tools in an informed way, but why do we then call them "scientists"?
They are technicians.

~~~
rch
In the US it's perfectly alright to call yourself an engineer or scientist.
You're welcome in professional organizations like IEEE and ACM, you may submit
articles for journal publication, and attend or even present at conferences.
As long as you don't try to fashion your own PE or PhD out of nothing all the
rest is fine.

------
netcan
The problem here is, IMO, a semantic one.

"Data scientist" is giving people the wrong impression. What this demand
really is: companies now have a lot of data because everything is digitised,
and they need people to do stuff with it.

The actual demand for data related work is to data science like the actual
demand for computer related work is to computer science. Statisticians,
analysts, database engineers.

------
jobquestion123
>And certainly not quickly enough to be qualified to get a job as a data
scientist before the data scientist salary market comes crashing back down to
earth.

Do you think it's worth it to pursue if you're legitimately interested (as
opposed to primarily attracted to the $$$)?

------
nikentic
Great resources, was just about to buy a book on Linear Algebra & ML before I
read this.

~~~
merusame
Consider this book by Toby Segaran:

[http://www.amazon.com/Programming-Collective-Intelligence-Bu...](http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325)

------
darkxanthos
What are people using Linear Algebra for in Data Science? Aside from the stock
representing words as N-dimensional vectors I mean?

I ask because I do this kind of work as my J.O.B. and every skill he refers to
I totally see the necessity of except this one.

~~~
srean
I am actually _very_ puzzled by this comment because it's the polar opposite
of a point of view that I would have expected. In fact I can think of very few
datamining and machine learning algorithms where linear algebra does not play
a role.

Representing features of a datapoint as a vector pervades and populates every
pore of this field. Without an understanding of linear algebra you would have
no support vector machines, no kernel methods, no neural networks, no
perceptrons, no gradient descent methods, no Newton / Quasi-Newton methods, no
multi-dimensional (or as they say in statistics, multivariate) Gaussian random
variables, no matrix factorization, no Pagerank, no Markov chains; this list
can go on and on.

Take the simplest of data science problems: you have one variable x and
another variable y and you want to predict the value of y given x. Usually x
is not a single scalar but n scalars (called a feature vector). The simplest
thing you can do here is least squares, and that is as linear algebraic as you
can get. There are many fancy ways of dealing with this problem, but almost
always it is reduced to solving a related linear system.
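
To make the "least squares is a linear system" point concrete, here is a minimal sketch in plain Python. It is an illustration added to this comment, not code from the commenter: the helper names `solve_linear_system` and `least_squares` and the toy data are made up for the example. It builds the normal equations (X^T X) w = X^T y and solves them by Gaussian elimination.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column (pivoting).
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from all rows below the pivot row.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def least_squares(X, y):
    """Fit y ~ X w by solving the normal equations (X^T X) w = X^T y."""
    n_features = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n_features)]
           for i in range(n_features)]
    Xty = [sum(row[i] * target for row, target in zip(X, y))
           for i in range(n_features)]
    return solve_linear_system(XtX, Xty)


# Toy data generated from y = 1 + 2*x; the column of ones is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
weights = least_squares(X, y)
print(weights)  # approximately [1.0, 2.0] (intercept, slope)
```

In practice one would normally call a library routine such as `numpy.linalg.lstsq` rather than hand-rolling the solver; the point of the sketch is only that the fit reduces to solving a linear system.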

The bottom line is this: we understand very few things. Thankfully linear
algebra is one of the few things that we do understand, so almost every
analytical problem is reduced to this case (if only locally) and then solved.

I would be very curious to know how you have been able to avoid linear
algebra. It will give me a new and valuable perspective, because apart from
"click button, didn't work? ok click the next button" data analysis I find it
hard to see how one can do much data analysis without it. So please break my bubble,
I will be thankful for it.

Canned packages often do not work out of the box. Knowledge of linear
algebra comes in very handy in analyzing and debugging why the model is not
working: "oh I see, this matrix is near singular, that's why my estimates are
off the park", or "oh, these two variables are very correlated, that is why
gradient descent is having so much trouble converging fast", "ah, I see why I
am getting NaN here" etc etc.
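
As a rough illustration of the "near singular" diagnosis (a sketch added here, not the commenter's code; the function name `gram_condition_number` and the data are hypothetical), the condition number of X^T X blows up when two feature columns are almost copies of each other. For two columns the eigenvalues of the 2x2 matrix X^T X can be written in closed form:

```python
import math


def gram_condition_number(col_a, col_b):
    """Condition number of the 2x2 matrix X^T X for a two-column design X."""
    aa = sum(a * a for a in col_a)
    ab = sum(a * b for a, b in zip(col_a, col_b))
    bb = sum(b * b for b in col_b)
    trace = aa + bb
    det = aa * bb - ab * ab
    # Larger eigenvalue of a symmetric 2x2 matrix from trace and determinant.
    lam_max = (trace + math.sqrt(max(trace * trace - 4.0 * det, 0.0))) / 2.0
    # The two eigenvalues multiply to det; dividing avoids subtracting
    # two nearly equal numbers when the matrix is close to singular.
    lam_min = det / lam_max
    return lam_max / lam_min


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_near_copy = [1.001, 1.999, 3.001, 3.999, 5.001]  # almost identical to x1
x2_different = [2.0, -1.0, 0.0, 1.0, -2.0]          # not a near copy of x1

cond_near_copy = gram_condition_number(x1, x2_near_copy)
cond_different = gram_condition_number(x1, x2_different)
print(cond_near_copy)  # very large: X^T X is nearly singular
print(cond_different)  # small: the system is well conditioned
```

A huge condition number means tiny changes in the data can move the estimated coefficients a lot, which is one way the "estimates are off the park" symptom shows up. For real problems a library routine such as `numpy.linalg.cond` is the usual tool.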

EDIT: darkxanthos, appreciate your comment. I would say it is a bit like
driving. Knowing the internal mechanics is neither necessary nor sufficient,
and hardly correlated with good driving skills when things are going well. But
sometimes when things are not going as expected, it helps in debugging.

Let me try and pique your interest: Note that the decision boundary of naive
Bayes is actually a linear function of the log conditional probabilities,
which are all considered independent; with LA you can now also consider the
case that they have dependence. Consider updating multi-armed bandit problems:
the updates are variants of gradient descent, and its nature is indeed
characterized by the eigenvalues of the Hessian of the thing you want to
optimize. Consider K-means clustering: one way to get very close to its global
optimum is to solve the same cost function using linear algebraic updates
(called spectral graph partitioning). By trig I think you have the dot-product
of two vectors in mind. The related analysis actually does not rely much on
trigonometric properties but heavily on the linear algebraic properties. In
fact this is what allows one to escalate affairs from simple linear feature
vectors to extremely non-linear ones: even though they are nonlinear in the
data space, in some other space they are linear, so people do the math in that
space (called the kernel trick, although I find that term quite silly). This
thing, linear algebra, lurks everywhere, I tell you :)
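
A small sketch of the kernel trick mentioned above, added for illustration (the feature map `phi` and the degree-2 polynomial kernel are a standard textbook choice, not something the commenter specified): squaring the ordinary dot product of two 2-D vectors gives the same number as a plain dot product in a 3-D feature space, so the nonlinear computation is linear in that other space.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed without ever building phi."""
    return dot(x, y) ** 2


x = (1.0, 2.0)
y = (3.0, -1.0)

in_data_space = poly_kernel(x, y)        # cheap: one dot product, then square
in_feature_space = dot(phi(x), phi(y))   # same value via the explicit map
print(in_data_space, in_feature_space)   # both approximately 1.0
```

The kernel evaluates the feature-space dot product while only touching the original 2-D vectors, which is what makes much larger (even infinite-dimensional) feature spaces usable.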

~~~
darkxanthos
Thanks for this comment. Just to be clear I am not saying it isn't necessary
but am genuinely curious what I'm missing. I think a big influence on my
comment is using ideas that touch on linear algebra and possibly even doing
the computations one might end up doing via linear algebra but without knowing
it.

For example- Least squares regression. Totally use this. Even took a semester
in college on just regression. The linear algebra underpinnings though have
never been shown except for a quick blurb in my linear algebra text book. I
still understand the concepts of fitting a model and when it's a bad fit (such
as non-normal distribution of residuals, co-linearity) but the theoretical
underpinnings are more fuzzy to me.

Representing features as vectors, sure. But that's also a pretty superficial
use of linear algebra since from that point forward I'm using something on the
trig side to compute results (at least in clustering).

I also tend to lean rather heavily on probability and bayesian approaches to
many areas. So Naive Bayes classification is a love of mine, finding ideal
parameter values given data coming in becomes an online updating multi-armed
bandit problem to me (which also doesn't require explicit linear algebra). A
lot of my work is also in experiment design and analysis and for this I use a
mixture of Bayesian and frequentist statistical testing.
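
For readers unfamiliar with the idea, one common Bayesian way to do this kind of online updating is Thompson sampling with Beta posteriors. The comment does not say which method is actually used, so the sketch below is an assumption, and the function name, arm success rates, and seed are all made up for the example.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Simulate a Bernoulli bandit and return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its Beta posterior
        # (a uniform Beta(1, 1) prior updated with the observed counts).
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Simulated reward; in a real system this comes from live data.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls


# Hypothetical arms: the third has the highest true success rate.
pulls = thompson_sampling([0.1, 0.5, 0.9], n_rounds=2000)
print(pulls)  # most pulls should end up on the last arm
```

Each round only increments a pair of counters, which is why this style of online updating needs no explicit linear algebra.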

I use canned packages out of the box, with parameters to tweak, and
cross-validate to evaluate how well my model is working. If I happen to venture
out to other models I'm probably reading up on common pitfalls and how to test
for them.

To me, it's entirely possible that the gap between you and me is due to
experience and even just differences in training/learning (including but not
limited to the quantity of it). These discussions are important for me since
they help to inform my future learning aspirations.

------
samirmenon
"The reason I'm skeptical is because I believe in the science portion of our
field's name. One of the primary things that separates a data scientist from
someone just building models is the ability to think carefully about things
like endogeneity, causal inference, and experimental and quasi-experimental
design."

What exactly _is_ a 'data scientist'? Shouldn't scientists be the ones
analyzing their own data, instead of 'data scientists'?

~~~
agibsonccc
My reply to this would be that everyone has to do some sort of statistics and
modeling in their studies.

Data science tends to be more about having good software engineering,
understanding how to interface with production systems to pull data out, and
enough modeling experience, almost specializing in it, to be able to make
inferences about different kinds of business activities affecting revenue, or
other parts of the business.

You can get away and even grow into a data science role as a statistician or
software engineer (probably leading towards data engineering more than data
science).

Source: students of mine get hired by companies like facebook[1].

So: to summarize, data scientists get hired for roles at companies to focus
only on modeling, data quality, and data advocacy, and assisting product
roles.

Edit:

[1]: [http://zipfianacademy.com/](http://zipfianacademy.com/)

------
gaius
You'll need a Mac, some thick rimmed glasses, and an unshakeable belief that
what normal people have been doing for 20 years with 2 clicks in Excel can in
fact only be done on a Hadoop cluster "in the clouds".

~~~
babs474
In practice I find the bigger problem is from analysts/actuaries/statisticians
who have a disdain for programming, which sometimes is viewed as a task for
mere technicians.

Typically your excel model/analysis has not even solved half the problem of a
datascience system. It needs to be repeatable, it needs to be open to change
(source control!), it needs to be integratable with the wider system.

These things need to be considered upfront. There are plenty of reasonable
software tools for this. Yes hadoop shouldn't be your first step, but taking 5
minutes to put something on a server in ec2 (omg, the cloud) is not
unreasonable.

There is a swallowing abyss between excel and production. That is where
datascience projects die, it's a shame.

~~~
jebus989
I've never met a statistician who either uses excel or has a "disdain for
programming". R or Matlab are basic tools of the trade.

~~~
PaulHoule
I talk to a lot of people who've had trouble with "data scientists" who are
strong in statistics and know some matlab or R or something like that, but
know nothing about the craftsmanship of programming.

By that I mean skills like using version control, writing software that is
maintainable, working with a team that uses project management software,
things like that.

A common kind of workflow is that a data scientist develops an algorithm and
makes tweaks to it, and that this gets baked into a production system.

If the data scientist throws something over the wall and it takes the
developers a few weeks to get it ready for real use, the "real time"
productivity of the team is going to be awful. The closer we come to the data
scientist checking the changes in and that's that, the more valuable the data
scientist is.

~~~
jebus989
This is absolutely a fair comment, coders but not software engineers, and is
the same problem that's permeated bioinformatics for the last decade or so.
(As an aside, it's fun hearing grand claims about data science revolutionising
medicine in 10 years [0], when the same claims were made about bioinformatics
10 years ago.)

[0]
[https://twitter.com/HanChenNZ/status/473825783874859008](https://twitter.com/HanChenNZ/status/473825783874859008)

