
Data science tips and tricks from the developer community - pplonski86
https://blog.algorithmia.com/becoming-a-10x-data-scientist/
======
minimaxir
It's worth noting that the article does not discuss the _reproducibility_ of
results (e.g. with a Jupyter Notebook) and the implementation of said results
(e.g. deploying/validating models), both of which matter _much_ more than any
code style conventions for data-related projects.

~~~
flaviuspopan
Nothing feels cleaner than storing everything (notebook, raw data, cleansed
data, misc scripts, etc.) in a docker image when you're finished with the
project. Data science and docker are meant to be besties.

~~~
Angostura
Potentially problematic for those who want to check your findings in 30 years'
time?

~~~
gilbetron
Is there a good solution to that problem, though? (Serious question). I
recently did a laptop refresh and am using it as an opportunity to solidify my
approach to ML development, and would love to hear if there is a good solution
to long-term reproducibility. I'm currently leaning towards Docker, but maybe
Vagrant or another "pure" VM approach is better...

~~~
delazeur
Carve it onto stone tablets.

~~~
KirinDave
You joke, but it's a major problem that our tech for very stable WORM media
has lagged behind demand.

Our use of data has grown so much faster than our network capacity (and
indeed, it seems like we're going to hit a series of physical laws and
practical engineering constraints here). "Data has gravity" but the only way
to "sustainably" hold a non-trivial volume of data for 20 years right now is
to run a data center with a big DHT that detects faults and replicates data.
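
As a small-scale illustration of the "detects faults" part, here is a minimal sketch that records SHA-256 checksums for a project's files and later reports any that changed or went missing. The function names and the manifest layout are assumptions made for this example, not anything described in the thread, and replication across machines is out of scope.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        str(p.relative_to(root)): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_faults(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose contents no longer match."""
    faults = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or file_sha256(target) != expected:
            faults.append(rel_path)
    return faults
```

Storing the manifest next to the archived notebook and data lets someone re-run `find_faults` years later to see whether the bytes survived. It says nothing about whether the software needed to read them still runs.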

------
mattmanser
A 10x developer is not ten times more productive than the _average_ developer,
they're 10x more productive than the _worst_ developer.

Wish this myth would stop being perpetuated; they're very clear in the
original study.

Here are some more details from the horse's mouth, Steve McConnell, who
popularized the concept in Code Complete:

[http://www.construx.com/10x_Software_Development/Origins_of_...](http://www.construx.com/10x_Software_Development/Origins_of_10X_%E2%80%93_How_Valid_is_the_Underlying_Research_/)

~~~
escribmac
The worst developers I've worked with take 2 weeks for tickets that should be
simple. That would mean doing 1 easy ticket every day or two makes you a
10x....

I've always hated this term and the mindset around it. I think organizational
practices, intelligent engineering strategy, etc are far more important to the
output of a team than hiring one genius dev.

~~~
mattmanser
Did it ever occur to you they might not be bad developers, they're just
goofing off because there's no consequence for being slow?

Like when my old work actually started measuring ticket closure times, our
best developers were only 2x more productive than our worst ones. But suddenly
a lot more tickets were getting closed.

I mean, I know that some complicated tasks needed the best developers, as the
worst ones literally were incapable of understanding the code, but then again
doesn't that say something about the code itself and how poorly it
communicates its intent? Perhaps clever code is simply confusing code...

~~~
escribmac
I agree; I think that falls under organizational practices.

------
TCM
Although I agree with the top post that reproducibility of results is
important, I think software engineering principles are severely lacking in
many data scientists. I attempted to deploy other people's models as a
research assistant, and the lack of understanding of code style conventions
was a big issue. Even now, when I go through some new ML system on GitHub,
many of them have code style issues.

As an aside, in academia, to share other people's results you basically need
to create a VirtualBox image to make them reproducible. I think Docker would
work but it may be too complicated.

------
JPKab
The first step to being a 10x data scientist:

Know how to actually write code, and also understand a broad range of modeling
approaches and the math behind them.

The majority of people passing themselves off as data scientists in the
traditional corporate world these days are at best unqualified and at worst
outright frauds.

~~~
carlmr
I.e. hire engineers and computer scientists who actually have the right math
background and know how to build software.

------
jupiter90000
In my opinion, this article would make more sense if two things were first
defined: 1) What is a data scientist? 2) What would it mean for someone
defined in 1 above to be 10x more productive?

Then it says nothing or little about understanding foundations of research,
math, and computer science, instead going into superficial things like
'understand the business' and code examples that could be produced by a
beginner level programmer.

This is not how to get to 10x, more like barely, possibly competent.

------
acdha
The advice doesn't look bad but the click-bait title really inclined me to
skip it, especially since it meant opening with a digression about whether 10x
developers even exist rather than the actual content.

------
tw1010
Funny that learning actual mathematics isn't mentioned.

~~~
kolbe
The whole time I was sarcastically thinking, "yeah I'm sure you get 10x by
choosing consistent naming conventions. That will make up for the months of
tearing your hair out trying to learn how ANNs work without a very solid
understanding of math/stats."

~~~
gipp
90% of data scientists do not use neural networks. Of those who do, 90%
shouldn't be, and are letting what's fun/interesting get in the way of
actually producing value.

The fact of the matter is that if you're not FB/GOOG/AMAZ, the vast majority
of what companies need from their data scientists actually requires very
little advanced mathematics, and much more focus on rigor, reproducibility,
and good deployment/engineering practices.

~~~
kolbe
The job you're describing may have the title of Data Scientist, but it isn't
data science if it doesn't involve advanced methods.

~~~
gipp
That's a semantic battle you've lost already.

~~~
kolbe
My point is that this 'data scientist' may be called one at fartapp.io, but
not at Google/Microsoft/DARPA/MIT.

------
gghyslain
I highly recommend "Best Practices for ML Engineering from Google" [1], which
contains one of the best pieces of advice on the topic:

> "To make great products: do machine learning like the great engineer you
> are, not like the great machine learning expert you aren’t."

[1] Previous HN discussion:
[https://news.ycombinator.com/item?id=13414776](https://news.ycombinator.com/item?id=13414776)

------
PLenz
The main job of a DS is to have creative ideas about how to solve difficult
problems. Creativity doesn't come on a schedule. Sometimes I have a rush of
ideas that come all together - often because one idea unlocks lots of other
ones. Sometimes I spend weeks or months reading papers and tinkering with only
<strike>failures</strike> learning how not to do it to show for it. The
only thing that a '10x DS' indicates to me is that you have a lot of low
hanging fruit to pick.

------
the_af
Is writing docstrings with argument types a thing in Python? If so, wouldn't
these developers benefit from using actual type annotations (or a language
with static types)? This is one area where types actually help a great deal
with rapid prototyping!

Also, I disagree with the Scala examples and the argument against brevity, but
I guess this is the stuff of flamewars. Not only do I not find his more
verbose examples any clearer, they also lack context: presumably the full
snippet looks like this:

    allClothesCount.sortBy(-_._2)

in which case any additional variable names don't help, and brevity makes the
snippet _clearer_.
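
For readers who don't write Scala, here is a rough Python analogue of that one-liner. It assumes `allClothesCount` is a collection of (item, count) pairs; the data below is made up for illustration. `-_._2` negates the second element of each pair, so the sort is descending by count.

```python
# Hypothetical data standing in for allClothesCount: (item, count) pairs.
all_clothes_count = [("socks", 12), ("hats", 3), ("shirts", 7)]

# Terse form: sort by the negated second element, i.e. descending by count.
by_count_desc = sorted(all_clothes_count, key=lambda pair: -pair[1])


# More verbose form with a named key function; it produces the same ordering.
def negated_count(item_and_count):
    _item, count = item_and_count
    return -count


by_count_desc_verbose = sorted(all_clothes_count, key=negated_count)
```

Whether the named version reads better is the style question under debate; both forms do the same work.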

~~~
nerdponx
Argument types in Python docstrings predate type hinting. They're part of a de
facto standard started by NumPy.

~~~
the_af
Fair enough. Shouldn't the advice be updated to use type hinting then? It
seems like a terrible standard these days.
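
For comparison, here is a minimal sketch of the two styles: a NumPy-style docstring that spells out argument types in prose, and the same function with type hints (Python 3.9+ syntax). The function names and parameters are hypothetical, chosen only for illustration.

```python
from typing import get_type_hints


def scale_documented(values, factor):
    """Multiply each value by a constant factor.

    Parameters
    ----------
    values : list of float
        Numbers to scale.
    factor : float
        Multiplier applied to every element.

    Returns
    -------
    list of float
        The scaled values.
    """
    return [v * factor for v in values]


def scale_annotated(values: list[float], factor: float) -> list[float]:
    """Multiply each value by a constant factor."""
    return [v * factor for v in values]


# Annotations are machine-readable; types written in a docstring are plain text.
hints = get_type_hints(scale_annotated)
```

Neither form is enforced at runtime, but a static checker such as mypy can verify calls against the annotated version, whereas the docstring types can drift out of date unnoticed.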

------
mooneater
Most of this nit-picky advice will not increase my leverage.

What I think would: occasional access to domain experts in specific niches of
data science.

~~~
mikeyanderson
What would that look like? Data-scientist AMAs? Or in person? What would you
like to see?

~~~
mooneater
- a marketplace for getting small doses of top-level expert advice
- more written about real-world, messy data-science and machine-learning
  implementations
- the vast majority of writing about ML/DS involves the elements. There is a
  lack of writing about how full systems integrate.

------
kolbe
The author seems to confuse "potentially competent data scientist" with "10x
data scientist."

------
amrrs
One big takeaway from the article is how easy Algorithmia makes it for data
scientists and teams to productionise data science models, which is still a
big challenge, with DevOps and data engineers scratching their heads over the
model output from data scientists.

