
The Data Science loop: the day-to-day work of a data scientist - dude_abides
http://seanjtaylor.com/2012/09/18/the-data-science-loop/
======
dude_abides
The essence of what a Data Scientist is boils down to this awesome tweet by
Josh Wills (@cloudera):
<https://twitter.com/josh_wills/status/198093512149958656>

Data Scientist (n.): Person who is better at statistics than any software
engineer and better at software engineering than any statistician.

~~~
defen
Echoes of A.J. Liebling:

"I can write better than anybody who can write faster, and I can write faster
than anybody who can write better."

\----

Leading to a generalized strategy for success:

1) Identify two orthogonal metrics by which your work will be judged by
relevant stakeholders.

2) Work / practice until you are better than anyone by at least one of those
metrics (in practice, better than the large majority of your competitors).

The degenerate case involves being the best in the world by at least one of
the metrics; more achievable is to be somewhere in the middle for each. To use
our own patio11 as an example: be better at coding than anyone who is better
at SEO, and be better at SEO than anyone who is a better coder.

~~~
sadga
Only n people are the best at something, but n^2 people are the best at pairs
of things.

~~~
jules
Actually, an infinite number of people can be Pareto optimal at just two
things.
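
jules's point can be made concrete: with two metrics, everyone not dominated
on both axes sits on the Pareto frontier, and nothing caps the frontier's
size. A minimal sketch in Python (the (coding, SEO) scores are invented for
illustration):

```python
def pareto_frontier(points):
    """Return the points not dominated on both metrics.

    (a, b) dominates (c, d) when a >= c and b >= d and the
    points differ, i.e. it is at least as good on both axes
    and strictly better on one.
    """
    return [
        p for p in points
        if not any(q[0] >= p[0] and q[1] >= p[1] and q != p
                   for q in points)
    ]

# Hypothetical (coding skill, SEO skill) scores:
people = [(9, 1), (7, 5), (5, 7), (1, 9), (4, 4), (2, 2)]
print(pareto_frontier(people))  # the four undominated points survive
```

Here (4, 4) and (2, 2) drop out because (7, 5) beats them on both axes; the
other four each remain "the best" along some trade-off of the two metrics.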

------
hardtke
If you want to make an impact as a data scientist, it helps to prototype your
ideas as a testable product. Many product managers and engineers ignore their
data scientists' work because fighting fires almost always takes precedence.
The recent HBR article (posted here) showed how the data scientist at LinkedIn
was able to make a huge impact by running his own experiments against live
users. At my company, we always provide a non-optimized prototype that can be
run directly with our existing product. If the idea is good, we can optimize
the code later.

------
pav3l
I would be interested to see more "low level" posts about how people approach
their data analysis workflow. I do data analysis for a living and find that
an efficient workflow has just as much impact on your productivity and results
as the choice of technologies that you use.

~~~
dude_abides
That's a good idea; I will try to write up such a "low level" post this week.
As a preview, the things that will feature prominently in the post are
RStudio, custom functions/aggregates in Postgres PL/R, and Sweave. (Yes, I am
an R fanboy.)

~~~
pav3l
Would definitely like to read it!

I personally find the need to go back and forth quite a bit between different
tools when working even on a single dataset. Say, I'll do some work with the
text part of the dataset in Python, then use Matlab for some basic
quantitative analysis, then use R for some advanced statistics if I have to,
port those results back to Matlab/Python for speed, then back to R to Sweave,
maybe some spreadsheet stuff too, etc. So for every project I need to have my
dataset in an easily accessible format (CSV/some SQL/etc.), and I need scripts
in every language that I use that communicate with the data source and get "up
to speed" right away.

It's little tips like that I think people should be sharing more.
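
One way to script that "up to speed" step is a tiny loader each tool-specific
script can call, so every analysis starts from the same canonical export. A
self-contained sketch (the file name and columns are hypothetical, and the
sample data is written inline just so the snippet runs):

```python
import csv
from pathlib import Path

DATA = Path("dataset.csv")  # hypothetical shared export all tools read

# Inline sample so the sketch is self-contained:
DATA.write_text("id,text,score\n1,hello,0.9\n2,world,0.4\n")

def load_rows(path=DATA):
    """Read the shared CSV into a list of dicts, one per record."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

rows = load_rows()
print(len(rows), rows[0]["text"])  # 2 hello
```

The Matlab and R counterparts would be equally short wrappers around the same
file, so each environment gets to the interesting part of the analysis right
away.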

------
rz2k
It would be great to hear more about the communication section.

For instance is it better to offer an ambitious summary first, followed by a
deep exploration into the data and methods to understand how any conclusion
may have been determined?

I am loath to put something out there that might overstate a point which will
then be run with without a sincere understanding, but I also fear that if the
details are provided first without a thorough explanation of the nuances,
there is a danger they will be used carelessly.

I suppose my question is: have others found it preferable to start with a
headline and then support that headline with caveats, or to create a detective
story and walk people through arriving at a conclusion?

It's a little like the conundrum of TED Talks (or much of science news
reporting). A very highly produced talk will establish an argument for an
important adjustment of conventional wisdom in some field, only to create new
problems, because to people outside the field it sounds like the final word.

~~~
sadga
Make the strongest claim you can support.

~~~
rz2k
To be honest, I think it is reckless to disavow any responsibility for what
consumers of the information you produce will do with it within your own
organization.

Going back to TED Talks, ridiculous analysis and comments citing a TED talk
are often based on a talk that itself was not ridiculous.

I asked about elaboration on that area of responsibility because I'd assume
that he had a relatively small audience of highly capable people, yet even so
he had more intimate contact with the specifics and had to struggle with just
how to present that fuller knowledge in a concise but still responsible way.

------
cynusx
This post completely neglects the part where data scientists derive
statistical models from large sets of data in order to use them for
classification, clustering, or prediction purposes. Describing data science as
advanced applied accounting is only true if you have to 1) analyze big
datasets or 2) find hidden connections between data, a.k.a. data mining.
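
To make that modeling side concrete, here is a toy nearest-centroid
classifier, about the simplest way to derive a classification rule from
labeled data (the 2-D points and labels are invented):

```python
from collections import defaultdict
from math import dist

def fit_centroids(points, labels):
    """Derive a model from the data: the mean point of each class."""
    groups = defaultdict(list)
    for p, y in zip(points, labels):
        groups[y].append(p)
    return {y: tuple(sum(c) / len(ps) for c in zip(*ps))
            for y, ps in groups.items()}

def predict(centroids, p):
    """Classify a new point by its nearest class centroid."""
    return min(centroids, key=lambda y: dist(centroids[y], p))

# Invented data: two loose clusters in the plane.
X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_centroids(X, y)
print(predict(model, (2, 2)))  # a
```

Real work would swap in a proper library and cross-validation, but the shape
is the same: fit a model to labeled data, then use it to classify new points.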

------
michelleclsun
Thanks for the post, I am starting a role as a data analyst soon and am
curious to hear more about two aspects of your role:

1\. What are the dynamics between the engineering team, product managers, and
data scientists? I understand that LinkedIn's data team plays a huge role in
creating and improving features (like People You May Know); what is it like at
Facebook, and at other companies in general?

 _Engineers build things, managers make decisions, data scientists answer
questions_

2\. What are the tools you use (I see R mentioned in the thread), and what
books would you recommend? I mostly write in Python, and use D3 and Gephi for
visualization. I am also taking the Coursera course on Social Network Analysis
(<https://class.coursera.org/sna-2012-001/class/index>) and reading the course
book, Easley & Kleinberg's Networks, Crowds, and Markets. Thanks for sharing
again.

~~~
dude_abides
Sorry if I didn't make this clear earlier: I just posted this link to HN, I
didn't write this blog post. (Incidentally I'm also a data scientist.)

------
dxbydt
intersections are nonsexy.

ask the hawkers who ply their wares at intersections. not only do they have to
deal with traffic from the north, there is a constant barrage of traffic from
the south, not to mention the incessant traffic from the east, and hey, how
can we forget the speeding traffic from the west...

but the same intersections are also frequented by pedestrians from the north,
and the west, the south and the east, so the hawkers' trade is lucrative.

data science is at the intersection of linear algebra, machine learning,
statistics and distributed computing.

if you ask the hawkers sitting in a nicely furnished airconditioned shop at
the mall, they will tell you that the hawkers at intersections aren't real
hawkers, they are just fly-by-night hustlers selling a stalk of unkempt roses
that will wither away, selling evening tabloid newspapers that'll be useless
to read tomorrow, unhygienic ice-cream cones and candy, unhealthy street food,
braids for the hair that'll snap if you tug at them, fake watches, imitation
handbags, ... so if you want to be a real hawker selling real healthy food,
you must open a real restaurant in a mall. you wanna sell genuine cartier
watches, get a license and open a premium retail outlet in a mall. you want to
hawk useful literature, open a barnes and noble bookstore in a mall...

so also the genuine statisticians will mock the data scientists as fake...oh
these guys don't grok industrial strength SAS and S-PLUS, they fool around
with unproven toys like R.

the genuine linear algebraists are too busy submitting academic papers to the
MAA to worry about trivialities like data science.

the genuine distributed computing programmers know that data scientists
operate with a very tiny subset of distributed computing - usually just hadoop
or bigtable, and even that they dodge with syntactic sugar like pig, cascading
and scalding. they are not even real programmers - they don't refactor their
code, some of them just write ad-hoc scripts that they don't even check in to
the repository, they don't do agile, hell they don't care about readability of
code - they call their chebyshev decompositions "def cbd()" instead of "public
static void chebyshevDecomposition( DenseDouble2DMatrix inputMatrix)", how can
you trust these jokers...

the genuine ML guys work on world-changing technology like genome sequencing
and autonomous vehicles, natural language processing and credit card fraud
detection, not fluff like mining information out of tweets and facebook likes
and linkedin profiles and foursquare check-ins.

but you see, the same intersections are also frequented by pedestrians from
the north, and the west, the south and the east, so the data scientists can
put food on the table and make rent :)

------
elviejo
How do you become a self-made data scientist? What are the appropriate books,
courses, and majors to study?

