
Ask HN: Too much code cleaning, not enough results (data science) - throwawaystress
How important is it for data scientists to have clean, modular, reusable code? Here’s my problem: while working on a project, I’ll start off in Jupyter notebooks, toying around with the data, doing some EDA, etc. Eventually I’ll pull out some of that code into functions in a Python file, and call those functions from the notebook. Neat.

The problem is, as I get more and more functions, I want to organize them more, make them more generalizable and consistent, etc. I’ll also get carried away with organizing files and source control, cleaning up my notes, and making documentation to explain what models/data/source files/results exist, what they mean, etc.

And then I realize I’ve been spending less and less time getting results, and more on this “overhead”. I struggle to balance the desire to rush ahead and get results with the compulsion to make the code “beautiful” and to have the project in the cleanest possible state. I’ve seen plenty of other projects with terrible organization, no documentation, and confusing, poorly formatted code. But if I’m not producing value, my neatness doesn’t matter.

All in all, I’m feeling pretty unproductive because of these habits. Any advice?
======
lordkrandel
It depends, so I'll ask you some questions to give you ideas.

How much of this code is going to be read, reused, modified, studied by you or
other people?

Is it open source or foundational?

Is it of any interest for the general public?

Could you actually spend that time doing something else that is more
productive?

Does this refactor make you learn a new technique?

Can you find or develop an auto-formatter that makes messy code just neat and
clean?

If you are building models for a process or phenomenon, could the results be
the subject of an article, maybe in the future, to show your techniques and
ask for feedback? Notebooks are just great for that.

------
itqwertz
A good rule to follow is to get it done dirty, add some tests, then refactor.
Real-world code is not always pretty or academic quality.
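The "dirty first, tests second, refactor third" rule can be sketched in a few lines. Everything here is hypothetical: `clean_prices` stands in for any helper you'd pull out of a notebook, and the point is that a couple of cheap asserts pin down its behavior before you dare restructure it.

```python
def clean_prices(raw):
    # First pass: quick and dirty, just enough to keep exploring.
    # Skips missing values, strips "$" and whitespace, rounds to cents.
    out = []
    for value in raw:
        if value is None:
            continue
        out.append(round(float(str(value).replace("$", "").strip()), 2))
    return out

# A few cheap tests freeze the observed behavior before any refactor:
assert clean_prices(["$1.50", None, " 2 "]) == [1.5, 2.0]
assert clean_prices([]) == []
```

Once those asserts pass, you can rename, split, or generalize the function freely; if they still pass afterward, the refactor didn't change results.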

Automation is also a good way to get rid of monotonous tasks and boilerplate.

~~~
throwawaystress
Does that work for data science, though? Along the way you build many models
and many kinds of ad hoc analyses, and they pile up. I’ve yet to see someone
write tests. For the most part, I’ve only seen people write big, long scripts
that they call, setting some global constants at the top. I’m aspiring to be
better than that, but it seems counter to the goal of getting results
quickly.
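One possible middle ground between "big script with globals" and a full test suite is inline sanity checks on data invariants, which cost almost nothing to keep as analyses pile up. The records, column names, and invariants below are made up for illustration; the pattern is what matters.

```python
# Hypothetical dataset standing in for whatever the script just loaded.
rows = [
    {"user_id": 1, "spend": 12.0},
    {"user_id": 2, "spend": 0.0},
]

# Invariants that should hold for any refresh of this data; if one
# fails, the script dies loudly instead of producing a quiet wrong answer.
assert rows, "dataset unexpectedly empty"
assert all(r["spend"] >= 0 for r in rows), "negative spend values"
assert len({r["user_id"] for r in rows}) == len(rows), "duplicate user_id"

total_spend = sum(r["spend"] for r in rows)
```

These checks are not unit tests, but they catch the most common failure mode of ad hoc analysis scripts: the upstream data changed shape and the script kept running anyway.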

