I can't count how many times I've been shown results and, when asking how to reproduce them, after several notes (and sometimes complaints to higher-ups) eventually received a series of command-line arguments or a barely functioning R script.
These conclusions are too important to be so sloppily produced. We need verification, validation and uncertainty quantification for any result provided to decision makers.
I was very happy to have that background when I took part in developing statistical models used for wind hazard analysis on nuclear power plants in my first job out of college.
Which is to say, I think starting with the idea that you're aiming for 10x is pernicious and tends to create dysfunctional teams. The claim that some developers in some circumstances are ten times more productive than others may or may not be true, but software development needs processes whose goal is to help an entire team rather than to lift an individual to that "level".
Developers should strive to better themselves, but it's important not to fool yourself, too. Having a strong team is almost always better from a business point of view.
I shun "rockstar" and "10x" (and whatever other bullshit moniker they will come up with next) team members. Give me a group of smart people that gel well together, and are highly self-confident without egos getting into the way, and we can move mountains.
And this definitely didn't happen after those of us working on tools to help said analysts produce reproducible results encouraged the analysts to use said tools... No, that would be crazy.
Reproducibility, like you say, is an issue far more particular to data science, and worth more serious consideration and discussion. Hand in hand with that is shareability. I'm a fan of what Airbnb has open-sourced to address some of those issues in their Knowledge Repo project: https://github.com/airbnb/knowledge-repo
Docker could be that stable build process, but it requires the additional assertion that there wasn't, say, a truckload of changes made using `docker exec`, or a bunch of customizations to files that were copied into the image. Simply putting a note on the source repo that says so might be enough.
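For the `docker exec` half of that, the assertion is at least checkable: `docker diff` lists every path added, changed, or deleted in a running container relative to its image. A minimal sketch (the container name is hypothetical, and note this says nothing about customizations baked into the image at build time; those you'd only catch by rebuilding from the Dockerfile):

```python
import subprocess

def container_drift(container_id):
    """Return filesystem changes made to a running container since it
    started from its image, e.g. via `docker exec` sessions.

    `docker diff` prints one line per changed path, prefixed with
    A (added), C (changed), or D (deleted); an empty result is at
    least weak evidence nobody mutated the container by hand."""
    out = subprocess.run(
        ["docker", "diff", container_id],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line.strip()]

# Hypothetical usage:
if container_drift("my-analysis-container"):
    print("warning: container diverges from its image; "
          "results may not be reproducible from the Dockerfile alone")
```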
(I really like what C. Titus Brown has written about reproducibility in computational research over the years: http://ivory.idyll.org/blog/tag/reproducibility.html)
There needs to be an immutable, high-performance read data store with a 30+ year survival plan if we're really going to retool our world around expert systems.
Our use of data has grown so much faster than our network capacity (and indeed, it seems like we're going to hit a series of physical laws and practical engineering constraints here). "Data has gravity," but the only way to sustainably hold a non-trivial volume of data for 20 years right now is to run a data center with a big DHT that detects faults and replicates data.
A lack of reproducibility is a major problem for DSEs and practitioners right now. In fact, I'd argue it's the single biggest problem.
> ... easy to deploy models, they are going to hate it. It just kills the fun and/or makes it obvious if they made a mistake.
Fun I almost get; it's obviously good for productivity (though I think you'd really be sacrificing productive output for non-productive output). But I just don't get where you're even coming from with the "making mistakes more obvious" angle.
I don't understand why we couldn't have some system, perhaps using strace and friends, that tracks everything I ever do and how every file was created. Then I could just ask, "how did I make X?"
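You can get a crude version of this today with strace alone. Below is a rough sketch of the idea, not a real provenance system (the wrapper, its log parsing, and the script names are all my own assumptions):

```python
import re
import subprocess
import sys

def trace_file_writes(cmd, log="provenance.log"):
    """Run `cmd` under strace and return the set of paths it opened
    for writing. `-e trace=%file` logs every file-related syscall and
    `-f` follows child processes; we then scan the log for openat()
    calls carrying write/create flags."""
    subprocess.run(["strace", "-f", "-e", "trace=%file", "-o", log] + cmd)
    writes = set()
    openat_write = re.compile(r'openat\(.*?"([^"]+)".*?O_(?:WRONLY|RDWR|CREAT)')
    with open(log) as fh:
        for line in fh:
            m = openat_write.search(line)
            if m:
                writes.add(m.group(1))
    return writes

# e.g. `python trace_writes.py Rscript analysis.R` to see what the
# script actually produced
if __name__ == "__main__":
    for path in sorted(trace_file_writes(sys.argv[1:])):
        print(path)
```

A real answer to "how did I make X?" would also need to persist the command line and input files per output as a queryable index, but the raw events are all there in the strace log.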
Wish this myth would stop being perpetuated; they're very clear in the original study.
Here are some more details from the horse's mouth, Steve McConnell, who popularized the concept in Code Complete:
I've always hated this term and the mindset around it. I think organizational practices, intelligent engineering strategy, etc. are far more important to the output of a team than hiring one genius dev.
When my old workplace actually started measuring ticket closure times, our best developers turned out to be only 2x more productive than our worst ones. But suddenly a lot more tickets were getting closed.
I mean, I know that some complicated tasks needed the best developers, as the worst ones literally were incapable of understanding the code. But then again, doesn't that say something about the code itself and how poorly it communicates its intent? Perhaps clever code is simply confusing code...
Those organizational practices and strategy make the best developers better.
If you hire shitty/unqualified developers who cannot communicate, don't know the tools and aren't functional, even the most amazing developer is kneecapped from a productivity point of view because she must be accountable for everything, forever -- the idiots drag her down.
It's like anything else -- if you work at McDonald's, a bunch of slow, unmotivated workers will slow down a fast/hard worker. It's just that the value of the labor + output for cheeseburgers is much lower than it is for software!
It may simply be that the "10x" people who do exist are 10x in ways that are challenging to observe. As an example, not making difficult-to-detect mistakes early in the software lifecycle that cause major problems later (classic real-world example: MongoDB). Or their influence on a software org may cause overall productivity improvements.
In any case, it's a toxic myth that pits individuals against each other for demonstrations of productivity. I'm of the opinion it's a "self-defeating prophecy" or a good example of the "basilisk" effects in game theory.
Which is why, in my experience, metrics-driven organizations slow everything down: they create a disincentive to help others.
It's a matter of perspective.
As an aside, in academia, to share other people's results you basically need to create a VirtualBox image to make them reproducible. I think Docker would work, but it may be too complicated.
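For what it's worth, the Docker version can be pretty small once there's a Dockerfile pinning the environment. A hypothetical two-command reproduction (the image tag, the Dockerfile location, and the `analysis.py` entry point are all placeholders):

```python
import subprocess

# Build the environment exactly as the authors pinned it in their
# Dockerfile, then run the analysis inside it. Both the
# "paper-results" tag and analysis.py are made-up names.
subprocess.run(["docker", "build", "-t", "paper-results", "."], check=True)
subprocess.run(
    ["docker", "run", "--rm", "paper-results", "python", "analysis.py"],
    check=True,
)
```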
Know how to actually write code, and also understand a broad range of modeling approaches and the math behind them.
The majority of people passing themselves off as data scientists in the traditional corporate world these days are at best unqualified and at worst outright frauds.
Then it says little or nothing about understanding the foundations of research, math, and computer science, instead going into superficial things like "understand the business" and code examples that could be produced by a beginner-level programmer.
This is not how to get to 10x; it's more like how to be barely, possibly, competent.
The fact of the matter is that if you're not FB/GOOG/AMZN, the vast majority of what companies need from their data scientists requires very little advanced mathematics and much more focus on rigor, reproducibility, and good deployment/engineering practices.
> "To make great products: do machine learning like the great engineer you are, not like the great machine learning expert you aren’t."
Previous HN discussion: https://news.ycombinator.com/item?id=13414776
Also, I disagree with the Scala examples and the argument against brevity, but I guess this is the stuff of flamewars. Not only do I not find his more verbose examples any clearer, they also lack context: presumably the full snippet looks like this:
As for other languages, I think Julia has the best chance of eventually overtaking Python, and it has a more deeply embedded awareness of types.
What I think would help: occasional access to domain experts in specific niches of data science.