Data science tips and tricks from the developer community (algorithmia.com)
204 points by pplonski86 on Sept 5, 2017 | 67 comments

It's worth noting that the article does not discuss the reproducibility of results (e.g. with a Jupyter Notebook) and the implementation of said results (e.g. deploying/validating models), both of which matter much more than any code style conventions for data-related projects.

^ this.

I cannot describe how many times I've been shown results and, when asking how to reproduce them, eventually gotten (after several notes and sometimes complaints to higher-ups) a series of command line arguments or a barely functioning R script.

These conclusions are too important to be so sloppily produced. We need verification, validation and uncertainty quantification for any result provided to decision makers.

Learning uncertainty propagation in engineering statistics was one of those concepts that seemed to be immediately useful and have far more implications than any textbooks emphasized.

I was very happy to have that background when I took part in developing statistical models used for wind hazard analysis on nuclear power plants in my first job out of college.

To put it cynically, if other people can reproduce your results, you might not be demonstrating that you are x10 more productive than them.

Which is to say I think starting with the idea that you should aim for x10 is pernicious and tends to create dysfunctional teams. The claim that some developers in some circumstances are ten times more productive than others may or may not be true, but software development needs processes whose goal is to help an entire team rather than helping an individual to that "level".

This is why the only time I use 10x is "your team will become 10x more productive."

Yes, it's better to have an x10 team, rather than an x10 developer.

Developers should strive to better themselves, but it's important not to fool yourself, too. Having a strong team is almost always better from a business point of view.

Indeed. I have seen several teams with a bunch of (self-styled) "10x" devs, and found that the productivity and quality of the team decrease in direct proportion to the number of "10x" devs on the team.

I shun "rockstar" and "10x" (and whatever other bullshit moniker they will come up with next) team members. Give me a group of smart people who gel well together and are highly self-confident without egos getting in the way, and we can move mountains.

This 100%. I have read postmortems of some "significant discoveries" which have turned out to only be reproducible on a particular build of software on a single analyst's machine. Or not at all. One "result" turned out to hinge on the iteration order of Python dictionaries.
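For illustration only (the postmortem isn't quoted here, so this is a hypothetical reconstruction, not what actually happened): a minimal Python sketch of how a number can end up depending on dictionary iteration order. Floating-point addition isn't associative, so summing the same values in a different order can give a different total. The names `total` and `stable_total` are made up for this example.

```python
def total(weights):
    """Sum values in whatever order the dict yields them."""
    result = 0.0
    for key in weights:
        result += weights[key]
    return result


def stable_total(weights):
    """Sum values in sorted-key order, independent of insertion order."""
    result = 0.0
    for key in sorted(weights):
        result += weights[key]
    return result


# Same contents, different insertion order
# (dicts keep insertion order in Python 3.7+).
a = {"big": 1e17, "small": 1.0, "neg": -1e17}
b = {"big": 1e17, "neg": -1e17, "small": 1.0}

print(a == b)                            # True: equality ignores order
print(total(a), total(b))                # 0.0 1.0
print(stable_total(a), stable_total(b))  # 1.0 1.0
```

Fixing the iteration order makes the answer repeatable, though not necessarily more accurate; something like `math.fsum` would be the tool for the accuracy side.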

And this definitely didn't happen after those working on tools to help said analysts make reproducible results encouraged the analysts to use said tools... No, that would be crazy.

I completely agree. Almost all of this article appears to have little to do with being a Data Scientist in particular and more to do with some good practices for writing code in general. So the advice itself is fine, just not what I was hoping for based on the title.

Reproducibility, like you say, however, is something that is an issue far more particular to data science, and worth more serious consideration and discussion. Hand-in-hand with that is shareability. I'm a fan of what airbnb has open sourced to address some of those issues in their knowledge repo project: https://github.com/airbnb/knowledge-repo

Hey thanks for the comment! I'm the author of the talk-turned-post :-) You are completely correct about reproducibility being super important in data science workflows. While I did mention it in the post (and in the talk the post was based on), I mentioned it as a part of version control tools. That said, I think it's not something that is focused on enough (obviously I'm guilty of that too) so I plan on doing a follow up post focused on reproducibility and the tools that can help you recreate your results. Kinda putting the "science" back in data science. Really I want an excuse to play around with tools like https://dataversioncontrol.com/ which looks super useful and I mentioned it in the post, but haven't had a chance to use.

'Kinda putting the "science" back in data science. ' Exactly! This is the primary goal in DVC project.

Nothing feels cleaner than storing everything (notebook, raw data, cleansed data, misc scripts, etc.) in a docker image when you're finished with the project. Data science and docker are meant to be besties.

I would prefer recommending a stable build process: a Docker image can be just like having a VM image or that one PC in the corner of the lab nobody is sure is unneeded. It's far better than having nothing or just the result file, but it still has the possibility of needing to reverse-engineer the internal state, and given how fast the Docker world moves I would not want to bet on format compatibility 5 years out.

Docker could be that stable build process but it requires the additional assertion that there wasn't, say, a truckload of changes made using `docker exec` or a bunch of customizations to files which were copied into the image. Simply putting a note on the source repo which says that might be enough.

(I really like what C Titus Brown has written about reproducibility in computational research over the years: http://ivory.idyll.org/blog/tag/reproducibility.html)

Potentially problematic for those who want to check your findings in 30 years time?

Is there a good solution to that problem, though? (Serious question). I recently did a laptop refresh and am using it as an opportunity to solidify my approach to ML development, and would love to hear if there is a good solution to long-term reproducibility. I'm currently leaning towards Docker, but maybe Vagrant or another "pure" VM approach is better...

Not perfectly, but a good start is to keep all the software assets AND data assets you used to train the model.
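As a rough sketch of what "keeping the assets" could look like in practice (my assumption about one way to do it, not something the comment spells out): write a manifest next to the model that records a hash of every data file plus the Python and package versions in use. The function names and the `manifest.json` file name are hypothetical.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_data_dir(data_dir):
    """Map each file under data_dir (relative path) to its SHA-256."""
    data_dir = Path(data_dir)
    return {
        path.relative_to(data_dir).as_posix(): sha256_of(path)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }


def installed_packages():
    """Map installed distribution names to their versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            packages[name] = dist.version
    return dict(sorted(packages.items()))


def build_manifest(data_dir):
    """Describe the data files and software behind a model run."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": installed_packages(),
        "data_sha256": hash_data_dir(data_dir),
    }


def write_manifest(data_dir, out_path="manifest.json"):
    """Write the manifest as JSON next to the model artifacts."""
    manifest = build_manifest(data_dir)
    Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

This only records what was used; it doesn't archive the packages or the data themselves, so it complements keeping the actual files rather than replacing it.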

There needs to be an immutable, high performance read data store that has a 30+ year plan for survival if we're really going to retool our world around expert systems.

I think the only real problem times you'll find are when the architectures are changed. x86, arm, you'd probably want to port your solution images then if ever. There will always be folks emulating hardware in software on new architectures.

Well you can always fire up LaTeX and write a report. If detailed enough, that in conjunction with the data set should be enough to survive anything.

Carve it onto stone tablets.

You joke, but it's a major problem that our tech for very stable WORM media has lagged behind demand.

Our use of data has grown so much faster than our network capacity (and indeed, it seems like we're going to hit a series of physical laws and practical engineering constraints here). "Data has gravity", but the only way to "sustainably" hold a non-trivial volume of data for 20 years right now is to run a data center with a big DHT that detects faults and replicates data.

I prefer gold.

I've never used Docker. Searching "reproducible research with Docker" yields lots of results. Any stand-out resource suggestions?

I would simply familiarize yourself with just the basics, because you don't have to go much further than that to make use of it for research purposes. My usual process involves breaking the work down into multiple stages (cleansing, conformance, testing, reporting), including a data dir, and finally creating a Dockerfile that simply adds the data/source to a simple hierarchy and includes all dependencies. As long as you know how to build a Dockerfile, you're golden. You can then upload the image to Docker Hub, and have somebody else pull the image and run it to reproduce your entire environment. Helps a ton for online DS courses and MOOCs.

This doesn't guarantee reproducible results though.
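One reason an image alone can fall short (an assumption on my part; the comment doesn't say which cause it has in mind) is unseeded randomness: the same code in the same environment can still shuffle or sample differently on each run. A minimal Python sketch, using a hypothetical `train_test_split` helper, of pinning the seed explicitly:

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows with a private, seeded RNG and split it."""
    rng = random.Random(seed)  # local RNG; does not touch global state
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(20))
first = train_test_split(rows, seed=42)
second = train_test_split(rows, seed=42)
print(first == second)  # True: same seed, same split
```

A seed only helps when the Python and library versions also match, since the shuffling algorithm itself can change between releases, so it goes together with a pinned environment rather than replacing one.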

A lack of reproducibility is a major problem for DSEs and practitioners right now. In fact, I'd argue it's the single biggest problem.

Thus making you a 1x Data Scientist because your result can only be demonstrated once? ;)

In my experience if you develop a "data science pipeline" forcing the data scientist to build

- reproducible

- validated

- back-tested

- easy to deploy

models, they are going to hate it. It just kills the fun and/or makes obvious if they made a mistake.

So we should sacrifice all the things that actually make a Data Scientist's work valuable in the name of fun and obscuring mistakes?

Fun I almost get, obviously good for productivity (though I think you'd really be sacrificing productive output for non-productive output), but I just don't get where you're even coming from with the "making mistakes more obvious" angle.

I think he was being facetious. Of course we need all these things, but data science right now is still not that mature I guess.

I blame software.

I don't understand why we couldn't have some system, perhaps using strace and friends, which tracks everything I ever do, and how every file was created. Then I could just say "how did I make X?"
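A much weaker version of that idea can be sketched without strace at all. Assuming you are willing to launch commands through a wrapper and declare the output files yourself (nothing here observes actual file writes, which is exactly the part strace would add), something like the following Python could answer "how did I make X?". The function names and the `provenance.jsonl` log file are hypothetical.

```python
import json
import subprocess
import time
from pathlib import Path

LOG = Path("provenance.jsonl")  # hypothetical log file name


def run_and_record(command, outputs, log_path=LOG):
    """Run a command, then log which files it was declared to produce."""
    subprocess.run(command, check=True)
    entry = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "cwd": str(Path.cwd()),
        "command": list(command),
        "outputs": [str(Path(p).resolve()) for p in outputs],
    }
    with open(log_path, "a") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def how_did_i_make(target, log_path=LOG):
    """Return the most recent logged entry listing target as an output."""
    wanted = str(Path(target).resolve())
    latest = None
    with open(log_path) as handle:
        for line in handle:
            entry = json.loads(line)
            if wanted in entry["outputs"]:
                latest = entry
    return latest
```

Usage would be along the lines of `run_and_record(["python", "clean.py"], ["clean.csv"])` followed later by `how_did_i_make("clean.csv")`; both file names are placeholders.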

Make it happen!

A 10x developer is not ten times more productive than the average developer, they're 10x more productive than the worst developer.

Wish this myth would stop perpetuating, they're very clear in the original study.

Here's some more details from the horse's mouth, Steve McConnell, who popularized the concept in Code Complete:


The worst developers I've worked with take 2 weeks for tickets that should be simple. That would mean doing 1 easy ticket every day or two makes you a 10x....

I've always hated this term and the mindset around it. I think organizational practices, intelligent engineering strategy, etc are far more important to the output of a team than hiring one genius dev.

Did it ever occur to you they might not be bad developers, they're just goofing off because there's no consequence for being slow?

Like when my old work actually started measuring ticket closure times, our best developers were only 2x more productive than our worst ones. But suddenly a lot more tickets were getting closed.

I mean, I know that some complicated tasks needed the best developers, as the worst ones literally were incapable of understanding the code, but then again doesn't that say something about the code itself and how poorly it communicates its intent? Perhaps clever code is simply confusing code...

I agree, I think that falls under organizational practices

You nailed it.

Those organizational practices and strategy make the best developers better.

If you hire shitty/unqualified developers who cannot communicate, don't know the tools and aren't functional, even the most amazing developer is kneecapped from a productivity point of view because she must be accountable for everything, forever -- the idiots drag her down.

It's like anything else -- if you work at McDonald's, a bunch of slow unmotivated workers will slow down a fast/hard worker. It's just that the value of the labor + output for cheeseburgers is much lower than software!

The worst programmers in a team sometimes have a negative contribution....

But also, even this data is questionable in the extreme.

It may simply be that "10x" people who do exist do so in ways that are challenging to observe. As an example, not making difficult-to-detect mistakes early in the software lifecycle that cause major problems later (classic real world example: mongodb). Or that their influence on a software org causes overall productivity improvements.

In any case, it's a toxic myth that pits individuals against each other for demonstrations of productivity. I'm of the opinion it's a "self-defeating prophecy" or a good example of the "basilisk" effects in game theory.

The real 10x developers in my experience are the hardest to measure, because they pull the whole team along by always being helpful and improving things where they see potential. That doesn't necessarily show up in their own results, but in the whole team's results.

Which is why, in my experience, metrics-driven organizations, with their disincentive to help others, slow everything down.

Do they? Or are they just wasting time refactoring instead of getting stuff done?

It's a matter of perspective.

Sure from a "task" stand point, but take quality, reusability, unique approaches, business sense, etc. and the best devs add easily 10x value if not more. Right?

Most of that comes from exposure to the code base, not innate skill. Experience.

Although I agree with the top post that reproducibility of results is important, I think software engineering principles are severely lacking in many data scientists. I attempted to deploy other people's models as a research assistant, and the lack of understanding of code style conventions was a big issue. Even now, when I go through some new ML system on GitHub, many of them have code style issues.

As an aside, in academia, to share other people's results you basically need to create a VirtualBox image to make them reproducible. I think Docker would work but it may be too complicated.

The first step to being a 10x data scientist:

Know how to actually write code, and also understand a broad range of modeling approaches and the math behind them.

The majority of people passing themselves off as data scientists in the traditional corporate world these days are at best unqualified and at worst outright frauds.

I.e. hire engineers and computer scientists who actually have the right math background and know how to build software.

In my opinion, this article would make more sense if two things were first defined: 1) What is a data scientist? 2) What would it mean for someone defined in 1 above to be 10x more productive?

Then it says nothing or little about understanding foundations of research, math, and computer science, instead going into superficial things like 'understand the business' and code examples that could be produced by a beginner level programmer.

This is not how to get to 10x, more like barely, possibly competent.

The advice doesn't look bad but the click-bait title really inclined me to skip it, especially since it meant opening with a digression about whether 10x developers even exist rather than the actual content.

Funny that learning actual mathematics isn't mentioned.

The whole time I was sarcastically thinking, "yeah I'm sure you get 10x by choosing consistent naming conventions. That will make up for the months of tearing your hair out trying to learn how ANNs work without a very solid understanding of math/stats."

90% of data scientists do not use neural networks. Of those who do, 90% shouldn't be, and are letting what's fun/interesting get in the way of actually producing value.

The fact of the matter is that if you're not FB/GOOG/AMAZ, the vast majority of what companies need from their data scientists actually requires very little advanced mathematics, and much more focus on rigor, reproducibility, and good deployment/engineering practices.

Agreed. But you could add "domain expertise" and "basic statistics" to that list as well.

The job you're describing may have the title of Data Scientist, but it isn't data science if it doesn't involve advanced methods.

That's a semantic battle you've lost already.

My point is that this 'data scientist' may be called one at fartapp.io, but not at Google/Microsoft/DARPA/MIT.

There's no hype in talking about mathematics!

I highly recommend "Best Practices for ML Engineering from Google" [1], which contains one of the best pieces of advice on the topic:

> "To make great products: do machine learning like the great engineer you are, not like the great machine learning expert you aren’t."

[1] Previous HN discussion: https://news.ycombinator.com/item?id=13414776

The main job of a DS is to have creative ideas about how to solve difficult problems. Creativity doesn't come on a schedule. Sometimes I have a rush of ideas that come all together, often because one idea unlocks lots of other ones. Sometimes I spend weeks or months reading papers and tinkering with only <strike>failures</strike> learning how not to do it to show for it. The only thing that a '10x DS' indicates to me is that you have a lot of low hanging fruit to pick.

Is writing docstrings with argument types a thing in Python? If so, wouldn't these developers benefit from using actual type annotations (or a language with static types)? This is one area where types actually help a great deal with rapid prototyping!

Also, I disagree with the Scala examples and the argument against brevity, but I guess this is the stuff of flamewars. Not only do I not find his more verbose examples any clearer, they also lack context: presumably the full snippet looks like this:

in which case any additional variable names don't help, and brevity makes the snippet clearer.

> Is writing docstrings with argument types a thing in Python? If so, wouldn't these developers benefit from using actual type annotations (or a language with static types)?

In theory yes (at least for Python 3) but it's a bit more nuanced. Python type annotations are still kind of... annoying? You have to import the things you need from `typing`, for example, and refer to classes differently if their definition comes after. A lot of this is based on convention and what IDEs implement since there's no fully fleshed-out standard. I think it's a good idea if done right -- I find myself more productive in TypeScript than JavaScript so it is possible to add value by transplanting a type system into a dynamic language -- but I don't think it's there yet.
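To make the two annoyances concrete, here is a small Python sketch (the `Node` class and `mean_of_children` function are invented for illustration): generic types are imported from `typing`, and a class that refers to itself before its definition is complete has to be named as a string.

```python
from typing import List, Optional


class Node:
    """A tree node whose annotations refer to Node itself."""

    def __init__(self, value: float, parent: Optional["Node"] = None) -> None:
        # "Node" is quoted because the class is not fully defined yet
        # when this signature is evaluated.
        self.value = value
        self.parent = parent
        self.children: List["Node"] = []

    def add_child(self, value: float) -> "Node":
        child = Node(value, parent=self)
        self.children.append(child)
        return child


def mean_of_children(node: Node) -> Optional[float]:
    """Average the children's values, or None if there are none."""
    if not node.children:
        return None
    return sum(child.value for child in node.children) / len(node.children)


root = Node(0.0)
root.add_child(1.0)
root.add_child(3.0)
print(mean_of_children(root))  # 2.0
```

The annotations are not enforced at runtime; a separate checker such as mypy is what actually reads them.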

As far as other languages, I think Julia has the best chance of eventually overtaking Python and it has a more deeply-embedded awareness of types.

I was thinking the same thing about those sortBy code snippets. I use shorthand for lambdas all the time because the individual item names are implied by the collection.

Argument types in Python docstrings predate type hinting. They're part of a de facto standard started by NumPy.
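For anyone who hasn't seen the convention, a short sketch of a NumPy-style docstring with the types written in prose. The `rescale` function is a made-up example, not taken from the article.

```python
def rescale(values, lower=0.0, upper=1.0):
    """Linearly rescale values to the range [lower, upper].

    Parameters
    ----------
    values : list of float
        Input values; must contain at least two distinct numbers.
    lower : float, optional
        Smallest output value. Default is 0.0.
    upper : float, optional
        Largest output value. Default is 1.0.

    Returns
    -------
    list of float
        The rescaled values, in the same order as the input.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    return [lower + (v - lo) * (upper - lower) / span for v in values]


print(rescale([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

The types here are documentation only: nothing checks them, which is the gap that annotations plus a type checker are meant to close.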

Fair enough. Shouldn't the advice be updated to use type hinting then? It seems like a terrible standard these days.

Most of this nit-picky advice will not increase my leverage.

What I think would: occasional access to domain experts in specific niches of data science.

What would that look like? Data-scientist AMAs? Or in person? What would you like to see?

- a marketplace for getting small doses of top-level expert advice

- more written about real-world, messy, data-science and machine learning implementations - the vast majority of writing about ML/DS involves the elements. There is a lack of writing about how full systems integrate.

The author seems to confuse "potentially competent data scientist" with "10x data scientist."

One big takeaway from the article is how easy Algorithmia makes it for data scientists and teams to productionise data science models, which is still a big challenge, with DevOps and data engineers scratching their heads over the model output from data scientists.
