Hacker News new | past | comments | ask | show | jobs | submit login
How to write better scientific code in Python? (zerowithdot.com)
87 points by EntICOnc on Feb 19, 2022 | hide | past | favorite | 76 comments

I think to any audience other than fairly hardcore software engineers this is going to read a little... mad. The first code example was exceptionally clear, from then on, we get increasingly incomprehensible. "Better" it isnt.

Scientific code is often write-once, run-once, done. And since mathematics/statistics is largely already formalised at the level of distributions, sampling, and so on -- we shouldn't expect to need to "software engineer" this type of code.

This article should establish a specific audience it has in mind, presumably data scientists in long-runing scientific projects who need to write generic maintable code to be shared across the organization. Then the article should establish when the first example is GOOD, and when it fails in this specific use case.

I agree, the bad code at the start is clearer than 99% of the scientific code I've read, without exaggeration. The typing & the abstractions proposed are not only unnecessary but also unrealistic for research code.

If scientists want to improve their code, the first and most important step is to get them to use descriptive verbose names for their variables and functions, and to learn the single responsibility principle.

I disagree with the first half. Scientific code is meant to be read with the corresponding journal paper in hand. The journal paper didn't use "descriptive, verbose variable names", and foisting that into the code makes it even less comprehensible. The more the code looks like the paper, the better.

A paper is not a sufficient specification for code, the gap is too large. Most code I've read uses the naming scheme you propose and I've never seen it succeed, and I always read it with the paper in hand. It always degenerates when the author has to glue together those few well defined variables into a working piece of code. You end up with sub-variables named after the original ones whose purpose are not well defined, and with objects whose purpose and nature you aren't sure of (is q a function or number? whats q_enc, qp and qs for? Only q is defined in the paper.). And it only gets worse when they try to use latex syntax. In the end you have to read through a 500 lines function of obfuscated 1 character variables, it's a nightmare.

A verbose name makes which variable it corresponds to in the paper immediately obvious, and makes the code understandable.

A paper is a better specification for code than pretty much any design doc I've seen. It's usually not that hard to figure out what qp and qs and q_enc stand for from the context. It's much preferable to listening to the software engineers who want to explode a simple 10 character expression into an unreadable rats nest of line-continuation logorrhea.

It may be _possible_ to have a verbose variable that makes it obvious what it corresponds to in the paper, but I've never seen it. That really rings like a No True Scotsman argument to my ear. I much prefer the creole of latex style naming that most researchers tend towards. For example, take this code/paper pair


The first line in the main loop is

    z = A*x-y;
Yes, they could have written

    residual = dictionary_matrix * optimization_variable - sensed_data 
or some such nonsense, but it is just exhausting to read. Further, it's actually more confusing because it conflates an application (compressed sensing) with what is actually a generic algorithm for l1 regularized least squares. This is a persistent problem with verbose names. In the search for an expressive name, people reach for names from a specific problem. This is a violation of separation of concerns.

I beg to differ. Having dealt with the insanity of abbreviated variables/field names in cancer research, I can tell you that abbreviations often (1) do not immediately indicate what the actual underlying field is, and/or (2) will mislead, especially when variants are involved.

I don't find your second example exhausting to read at all - in fact I find it far _easier_ to read, especially to an astute reader (who is trying to make sense of a particular scientific literature).

No person will have any idea what `z = A*x-y` could mean 'in context', when it's so damn abstract of a reference.

Brevity in variable names is horrendous when cross-checking against papers. It is also a waste of time for scientists to reiterate what the legends/variable names mean, especially when their code could have used clear, unambiguous, verbose names instead.

> Scientific code is often write-once, run-once, done.

I don't buy that in the sense that you mean. In my experience, code generally lives on no matter what, and code that remains write-once is often because no one can read it and thus update it (i.e., write it again). I've seen a decent amount of scientific code, and although it was never meant to be ran more than once by anyone other than the original author, it almost always was done so. And because it wasn't designed well, written well, or even commented well, code like that basically becomes this immutable artifact that people simultaneously don't want touched but also want it updated and keep on running.

I did an REU in mathematics back in the day where I wrote a decent amount of MAPLE code. I wasn't even a software engineer at the time, since I was a math and EE major. I honestly had little concept of what computer science even was and was not a programmer. It was not hard at all to write decent code that was commented and organized.

The problem is that most scientists, in my experience, view software and code as somewhat below them. It's like a hammer and a box of nails to them.

I inherited someone's Ph.D. dissertation Python code that had lived on across several labs. It had basically reached the "needs a rewrite" state because it was impossible to get it to run. It was stuck on Python <2.7 and basically couldn't be upgraded without replacing a lot of it because the Python distribution it was based on had become deprecated and none of the packages could be upgraded without changing a lot of the code. I'm not sure I ever saw a single comment. The person who originally wrote the code actually ended up working at one of the major scientific Python shops, somewhere I once applied to and was unceremoniously turned down. An interesting turn of events.

Thanks for the level headed critique. I indeed began reading the article in earnest and the moment he started making ABCs I skipped to skimming and came here and was about to rant. This is the right perspective, people need to understand their audience first when writing and at the very least justify why the level of abstracts they employed are needed vs for example, their first suggestion for improving the die example which is what I'd go with, even better, you have numpy.average which would be even more clear for scientific programmers because they already know numpy has arithmetic average built-in and will recognize it.

I want to disagree with you. But I just can't. We go from having to read ten lines to about forty? With about five or six new concepts that are not related to the original.

The thing is the abstractions on the last code are optimized for solving a far more general version of the problem, that just to exercise (not even exhaust) all the kinds of variation supported would take hundreds, probably thousands, of lines with code taking the initial approach. But you don't really have even a hint of that, because the article doesn't show at each abstraction level a broader problem, the cost of solving it without the new abstractions introduced, and then the benefit of the abstractions.

No. I actually got that. But it doesn't help on a science report.

That is, I agree that this is better engineered code. I disagree that this is better science code.

To a software engineer this will also read mad. This article is ultimately explaining why abstraction is good and why it's helpful to build classes in python. That is already obvious to SWEs, not at all specific to "scientific computing", and explained elsewhere much more succinctly.

I see it more as an example of anti pattern: unnecessary layers obfuscating what happening. Good abstraction is hard and it is not free. It is better to make mistake of under using abstraction than overusing (the latter is much harder to maintain).

I would understand if the list comprehension were replaced with the corresponding numpy/scipy/pandas/etc code.

I agree. I didn't mean to imply this was a good explanation or example. I still think "abstraction/classes are good" was the intention/gist of the article.

I get the author’s idea and I’ve been there before. When you’re still exploring a problem domain it’s really useful to abstract and compose because it helps you both understand the problem better and iterate more efficiently.

At some point you’ll want to come back to code that looks more like the first example, ideally with plenty of documentation that explains how you got there. Some other comment put it really well: you create a kind of personal mapping of scientific concepts to programming abstractions. You don’t want to have colleagues (or your future self) figure out that mapping to understand your results in terms of the science you’re working on.

> Scientific code is often write-once, run-once, done.

If only it were so!

Lots and lots of "scientific" code turned out to be re-formatting data. There were some things I wrote in Perl, then re-wrote in Java (which was the standard for non-compute-bound code at the time).

But then there was the actual computation...

Holy cow the code from the 1970s we had to learn, to transform into freaking Java code. And holy cow the poor souls that came after me, howling in torment at my code.

> The first code example was exceptionally clear, from then on, we get increasingly incomprehensible.

I think the problem here is that each iteration is solving a different and more general problem, but while the use of the flexibility at each level is discussed, it's not framed as code solving one problem and the difficulty as the same approach is applied to a broader version of the problem.

IMO, from a pedagogical point of view, this could be improved a couple of ways:

(1) First, the initial code is not, and should not be described as, “bad code”. Given a simple initial case, it is clear, straightforward, good code. It's not abstracted to handle diverse cases, but it's not intended to be. It should be held up as an excellent model of solving a simple initial problem and a good starting point.

(2) Each of the subsequent levels of abstraction enables the code to solve a more general problem more clearly than just layering on more special-case code. So, each step should expand the problem, show code to solve the expanded problem at the pre-existing level of abstraction, and then introduce the refactoring that is appropriate for it. (An extra nice thing would be to show another extension to the problem after each that fits into the new level of abstraction well, leveraging the abstraction introduced.)

We should get out of the mindset that either more or less general code is “better” in some universal sense: different levels of abstraction and generality are suitable for different problems, and often you’ll start out solving, and coding for a narrower version of the target problem than the eventual goal. Eventually, you may develop a level of intuition for the level of abstraction necessary to solve the general problem initially, but it's also easy to overabstract a solution if you don't have a keen awareness of the limits of needed generality. It's a lot easier to avoid overabstraction—which makes code harder to read and understand—if you start simple and refactor incrementally as you generalize up to just what you need.

I think some degree of abstraction can lead to more clarity – especially if it hides a bunch of crap, that is done over and over again throughout the code. But these abstractions need to be very well chosen, clearly named and unless they really give you more clarity (e.g. by hiding some confusing details to bring the general point you are trying to make across more clearly) I would advice against them.

If your scientific code is meant to be used in actual software or if you want to write tests for the functions used, a little bit more abstraction might be a good idea, however.

I wish the article had discussed abstraction more, because it's especially tricky in "scientific" code.

The initial abstraction is actually pretty good, conceptually. You have a function that returns a value, like roll(sides=6), or an object you can manipulate to get a value: Die(sides=6).roll(). You then take those samples, somehow analyze them, and get a result. That code matches people's mental model very well.

If you were going to do this in C++, I'd stop here. Things only get more complicated because Python (Matlab and R too) penalize you heavily for leaving the BLAS sandbox.

Yeah, coding like this ("pure" functions, input value -> output value, less/no side effects) is probably best "improved by just having unit tests.

I'm not a TDD zealot, just that the unit test can provide the assurance of the computation, regardless of the spaghetti inside the computation.

> I think to any audience other than fairly hardcore software engineers this is going to read a little... mad

So most of what gets posted to HN?

Scientist here, who has been writing code for > 20 years. I don't buy pretty much anything in the article. This was a bunch of opinionated examples. Science programming in my opinion has many different levels which require different approaches and techniques.

Very often the first code written will be just a quick and dirty. If the idea worked out, I may refactor it, make the code prettier, speed it up a bit. Very-very rarely the bit of code will be something that I'll need to constantly reuse, and there I have to think about interfaces, organisation etc. Also a situation that often comes up, you have a working code that does the job, and then you keep adding more and more functionality to it, and that slowly requires making the code more generic, better organized. But this is not the first thing you do, you only do that if it is needed. That's why I think there are very few very generic recommendations for scientific programming.

(non-practicing scientist here, who has been coding for research since the late 1980s) This 1000x.

Having something that works quickly, which matches the paper you are looking at, or your own/groups notes on the calculation, is far more important than the various obscuring "optimizations" that followed. The critiques that the article author made, for the very first example, were simply wrong.

The code worked. Did the thing. Was easy to reason about. Far more important is not adding additional, literally unnecessary cognitive burden on your tool than you need. Does this make the code "bad" in some people's eyes? Mebbe. Does it matter? No, really, it doesn't.

Something I've seen in recent years has been the emergence of people providing critique of scientific code, claiming that "modern development techniques" would help fix all the "evils" they see in the practice of scientific coding. In this case, it is a case of premature optimization, and maybe ego flexing over esoteric language elements that are at best tangential to the problem, and significantly complicating to the process.

Some of these efforts are laudable. Source code control, CI/CD, unit/system tests, etc. Many of these you learn by "fire" when you develop the codes, some you gain through diffusion of knowledge.

Some of these efforts are, fundamentally ridiculous, attempting to replace good functional practices with the critics opinions of what they should be. We see that everywhere. Think of the rust vs the world debates, like "the linux kernel would be so much better if written in rust, because of 'safety'", which are frequent, wasteful of everyone's effort, and generate far more heat than light.

To be blunt, if you create something that works well, that you can understand, that was developed quickly, you only need to consider replacing it if it becomes a problem in the future. If you over-optimize at the outset, you have a rigid code base which will be very hard to maintain/adapt in the future, and will require a similar cognitive load to comprehend when you need to revisit it.

Exactly! The real decision you're making with scientific codes in lightly-explored problem domains is, how permanent and versatile of a solution do I need here?

Alas, the OP is deploying some CS abstractions to genericize very simple problems that have already been very well-explored ("find the sample mean").

It's stirring up antibodies among all the investigators who ever over-engineered an approach that was subsequently junked! (shuffles feet awkwardly)

> It's stirring up antibodies among all the investigators who ever over-engineered an approach that was subsequently junked! (shuffles feet awkwardly)


And I was supposed to be a CS major, a programmer. But I came along right when the industry was crazy into "objects". Then we had to have more "Java".

Simplicity was an accident.

I felt like I was losing my mind, so I returned to scientific computing.

Simple is hard. Simplicity via the correct application of abstraction, while keeping in mind the performance impact of your design changes, is what we do, on a good day.

This article starts with naive code that is unlikely to be written by working scientists unless they are really quite new this type of work.

And, quite quickly, the article reaches a level of detail that might not engage the initial readers.

As a scientist with several decades of programming experience, it seems that those who write this sort of naive code are at an early stage of learning, and they might be better off learning Julia, which offers high performance even if written in a naive style. Granted, Python has superior (formal and informal) documentation, and it's error message are a lot easier to understand than those spewed by Julia. But it is quite freeing to be able to express things in a manner that is closer to mathematical notation, without paying a high performance cost. And, if the task really requires it, Julia will let you tune things to Fortran-level speeds (and sometimes better ... but that's another story).

Came here to say the same thing. With multiple dispatch and JIT compilation, you’ll go a long way in Julia with the naive version of the code.

I think the author of this article should read the Mythical Man Month (https://en.wikipedia.org/wiki/The_Mythical_Man-Month), in particular on the difference between (all quotes are from that book)

- a program, “complete in itself, ready to be run by the author on the system on which it was developed”

- a programming product: “a program that can be run, tested, repaired, and extended by anybody. It is usable in many operating environments, for many sets of data”

- a programming system: “a collection of interacting programs, coordinated in function and disciplined in format, so that the assemblage constitutes an entire facility for large tasks“

- a programming systems product: “This differs from the simple program in all of the above ways. […] But it is the truly useful object, the intended product of most system programming effort”

They all have their place, and if you need a program, spending time on writing a product, a programming system or a programming systems product is a waste of effort.

Most scientific code falls in the program or, maybe, somewhat in the direction of the programming product category, and there’s nothing wrong with that.

(Note that quality is a concept that’s orthogonal to this distinction)

Former academic scientist, now software engineer.

I think this article misses the point that rarely in science is the code the actual product/deliverable. The way a scientific code should be judged is it performant enough to answer my question in a reasonable amount of time, clear enough to another reader (which face it 99% of the time, in science the only other reader will likely be your future self or a student with little coding experience), and does it give the correct result in a reproducible way. To my eye, the first "bad" example is actually the best to all these questions.

To my mind it feels more like how can I use my scientific coding I have to do as a way to improve my software engineering skills. Which isn't a bad endeavor and might make leaving academic science easier for you, heck I did the same thing. However, in my humble opinion, it won't make you write "better" scientific code.

The terrible irony of this post is that they use a completely wrong definition of expected value. They write down an integral...but then implement something totally different.

You cannot calculate the EV of a random variable X by taking the mean of random samples drawn from the distribution of X (e.g. try it with the Cauchy distribution—see if you get something approaching the actual EV). The only thing they accomplished is building up a very convoluted way of calculating a sample mean—which is already trivial with Numpy or just standard Python. Why do this? I’m perplexed.

This is the issue with software engineers writing scientific code—they often flagrantly misunderstand basic mathematical definitions, and then obfuscate this misunderstanding with their “pure” and “robust” code.

Sorry for the harsh criticism, but I’m tired of seeing this kind of thing.

Well the integral doesn’t work for a Cauchy distribution either so that seems a little unfair for criticism. And this kind of procedure is not unusual for more complex distributions that have no well-understood density function to integrate, like the results a physics simulation, aren’t they?

> You cannot calculate the EV of a random variable X by taking the mean of random samples drawn from the distribution of X

My understanding of statistics is rudimentary so forgive me but doesn't the sample mean of a normally distributed variable tend towards the expected value for the population?

The sample mean of a Normal RV does tend towards the population expected value, sure.

The GP comment is talking about a Cauchy RV, which has heavy tails. So it has enough probability mass at large values that the expected value is infinite. Discarding constant scale factors, in this case:

  E[X] = Int{0..infty} x * p(x) dx 
       = Int{0..infty} x * (1/x^2) dx 
       = Int{0..infty} (1/x) dx 
       = +infty
So, the sample mean of Cauchy random variates will not converge to any real number.

The uncomfortable truth is that merely subscribing to the tenets of current software development “best practices” does not compensate for limitations in knowledge of applied mathematical sciences.

Note that the actual EV of a Cauchy random variable is undefined...

For some reason, this post reminded me of Fizzbuzz Enterprise Edition [0]

[0] https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...

Great writeup, generalisation and data seperation is difficult but will more often than not improve code. Learning how to structure your code around general concepts has the advantage that there many low-level vectorized libraries available increasing compute efficiency.

This is especially true for numerical programming (numpy), stats (scipy) and ML (torch, tf, jax), but the difficulty is often in finding the name of functions to corresponding concepts. Better get good at formulating queries against the API documentation too!

I agree with the others commenters who say the code "improvements" lead to unwarranted complexity and unreadability. Academic version of enterprise fizz-buzz this is.

IMHO, the quality of the code is not so important but more the documentation and testing around the code, and the use of best practices like git, versioning, etc. A good place to learn about these things is Patrick Mineault's The Good Research Code Handbook which is available here https://goodresearch.dev/ (intended for scientists who must produce code artifacts as part of their research).

I've been doing scientific programming for almost 40 years. I'm still learning ways to write better code.

What I've come to realize is that there are some unique things about scientific coding, related to hyper-agile development and the use of specialized domain knowledge. But at the end of the day, good code is good code. I don't say this as an expert, but as an amateur noticing that the issues I read about resemble problems that I've experienced.

That doesn't necessarily mean we should follow the latest software development methods and tools. For one thing, the software development world doesn't even agree on those tools. For another, we're typically (ballpark guess) 10 or even 20 years behind the software world in terms of the complexity of our programs, and the size of our teams. We can safely get away with older techniques that are easier for us to learn and apply.

Also, there are things that we don't need, such as intuitive GUIs, seamless error recovery, installers, devops, monetization, file formats, and so forth. For a very small script, crashing is a perfectly reasonable way to handle something like bounds checking, and a language with decent error messaging will be 100x more sophisticated than anything we're likely to come up with ourselves.

I think the way to help us learn to code is to include good programming practices inline with instruction on coding. My Pascal textbook was that way. For instance variable scoping was introduced with an explanation about how to choose good scoping, such as why to avoid globals.

Refactoring might be the biggest thing: Teaching the discipline of going through old code and improving it. Human nature says that we will cut corners when we're having fun and focusing on the science and results. Also, most of us are learning this stuff on the go -- many of us were not hired to program, and our bosses may be unaware that we're doing it. So we have to circle back, and tune things up. This is how we do exploratory science anyway. My habit is, when I learn a new technique, I go back into some old code and try to apply it.

Testing is important too, but maybe in a different sense: A huge tool for testing scientific code is visualization of intermediate results.

Having worked in the field long enough, I always feel the need to add:

A) If you're writing high-performance code DON'T you _will_ hit bottlenecks which are non-trivial to the language unless you're a guru. If you're a scientist you're going to have to work to get to this level.

Apart2) Python is excellent for saying "perform analysis A with variables B,C,D,E,F on X,Y,Z 30 times using this random seed library". What Python is not excellent at is saying, "I have 20GB of raw doubles and I need to push these through non-euclidian algebra as fast as a modern CPU can cope". The biggest surprise is that modern Python really can do both, it just struggles a lot with the latter over the former. IPython and JuPyter notebooks as well are amazing for the former!

B) For the love of whatever deity you worship. PLEASE DOCUMENT YOUR DEPENDENCIES CORRECTLY. This is non-trivial and doesn't mean "Works on CentOS6". I mean at least tested on Python x.y with depA version 1.2.3. Python isn't as bad a npm, but it's getting there...

Bpart2) Unless you're starting out, _avoid_conda_. Installing/maintaining things through this is not nice. This is the sort of system that makes your local sysadmin groan when you come to him and say "it's easy because it has a UI". Behind the scenes it makes so many assumptions that are great for people starting out, but are often painfully wrong for people looking to run code at hyper-scale on clusters. I recommend investing _some_ time in learning how python packages work (not much!, packaging, is not coding). But as a result you will learn or employ a lot of skills which will help you with larger python codebases. There are large performant python packages/tools out-there and they all tend to leverage advanced features for good reasons.

C) Never! re-invent the wheel unless you know you explicitly need to. And avoiding pulling in a dependency (unless the dependency is _huge_ is not always the correct answer).

D) Ask an expert. It's fine to acknowledge you don't know how to use the tool that is modern python. It's a swiss-knife with a shotgun attached pointed at your face. When it goes wrong it will come back to bite, but when you get it correct it's so efficient you will marvel at how much you can do with just 5/6 lines of code.

E) Have fun, just try things and see if they don't work or work really well, python is great for avoiding too much boilerplate for mid-scale projects :)

>B) For the love of whatever deity you worship. PLEASE DOCUMENT YOUR DEPENDENCIES CORRECTLY. This is non-trivial and doesn't mean "Works on CentOS6". I mean at least tested on Python x.y with depA version 1.2.3. Python isn't as bad a npm, but it's getting there...

That is why for R code, I really like the renv project.


very similar to virtualenv by the sounds of it,, and doesn't really address things any more/less I would argue other than maybe snapshotting things which I'd argue isn't the same as understanding and testing.

Although I like the separation between deployment and OS that both solutions provide.

Doesn't address documenting dependencies? This is scientific code we are talking about right? Perhaps we're talking past each other.

This is what I mean. For application or package development, say in python, or even in R, then you will want to test for a wide range of dependencies. Then a snapshot such as provided by virtualenv or conda is inadequate. Better to have a requirements.txt file, or something equivalent (like a poetry toml file).

But for a lot of scientific code, such as is published along side a paper, a snapshot is sufficient, and more than is usually provided. How many times have I downloaded a paper's R code only to find it won't run because of some dependency kludge that isn't explicitly stated? Too often.

Insightful writeup!

I am just not sure what did you mean to discourage in part A.

> A) If you're writing high-performance code DON'T you _will_ hit bottlenecks which are non-trivial to the language unless you're a guru.

DON'T do what? don't try to hyper-optimize your code? don't use all types of "fancy" constructs like how the post shows?

Don't write code designed to burn CPU with numerical calculations. There is hyper optimisation and then there is the right tool for the job. The speed loss from doing mathS in an interpreter is huge and frankly usually not worth paying.

Python like this is great for low CPU hyper responsive code, or for increasing efficiency of existing constructs. As a rule of advice to anyone who asks me this, go away and read about the benefits of compiled over interpreted code. There's to much to conver in 30s and frankly if you're worrying about extreme performance you shouldn't be using python... That's not too say I don't think python is awesome, but, write a module in something else :p

This is an article about how to write theoretically better code (from a CS perspective) for a scientific use-case - but that is not the same thing as "better scientific code". For most purposes, the effort a scientist will go to in understanding how classes work (as an example) will outweigh any benefit they might see from using them.

One of the nice things about Python is that the programming style arrived at here is totally unnecessary - a point which was soundly missed by the author.

I don't like the code. My start point would be that

    import pandas as pd
    import numpy as np

    if __name__ == "__main__":
        n_samples = 10000
        samples_np = pd.DataFrame(np.random.randint(1, 7, n_samples), columns=["face_value"])

Speaking about abstraction, I don't know math, so first thought would be to look for *existing* abstractions. When I work with relational data, my first option to check is SQL. For math looks like DataFrame is a *standard* abstraction. To be fair, maybe first I would be using build-in `random.randin` I am not very familiar with `numpy`, but I would definitely google "pandas random sample", that would bring https://pandas.pydata.org/docs/reference/api/pandas.DataFram...

    if __name__ == "__main__":
        n_samples = 10000
        sample_pd = pd.DataFrame({'face_value': [1, 2, 3, 4, 5, 6]})
code uses lambda functions in some examples, it probably kills advantages of `numpy` performance. Using DataFrame API at least helps to avoid those pitfalls.

Type annotation, I like the idea, but in the end code looks like Java, but doesn't performs like Java. It is very hard to make it right in Python, also some of them wrong.

( @dataclass(frozen=True): - don't need ":" Gaussian.sample - missing return )

when return added it doesn't return `-> Sequence[float]:`

    >>> dtype('float64')
-> Sequence[Union(numpy.float64, numpy.float32, numpy.float16)]: # ?

I don't believe "scientific code" is fundamentally different from any other code, I would go with following normal development practices

1) review design ("don't reinvent wheel")

2) add tests

3) make code review

4) version control


> I don't believe "scientific code" is fundamentally different from any other code, I would go with following normal development practices

That depends on the field. In the part of bioinformatics I work (mostly combinatorial algorithms; floating point numbers are rare) normal software development practices often cease to be relevant the moment someone mentions design.

Writing research code is a part of doing research, and a key feature of research is that you often don't know what you are supposed to do. When I start writing code, I tend to expect that the code will solve the wrong problem in the wrong way. Once I have something that runs, I start experimenting with data to learn more about the problem domain. Eventually I have a better idea what the code is supposed to do (and maybe even how it should do that), and then it's time to rewrite and iterate. Design only becomes relevant in late stages of the project when I'm confident I know the problem I'm supposed to solve.

There are some similarities to prototyping, but it's prototyping over problems rather than over solutions to a particular problem.

> I don't believe "scientific code" is fundamentally different from any other code, I would go with following normal development practices

> 1) review design ("don't reinvent wheel")

> 2) add tests

> 3) make code review

> 4) version control

> etc.

In my experience, this is pretty quixotic and will lead to your failure as a scientist. Basically nobody is writing tests. Code review is pretty much unheard of. There only "design review" comes from your journal's peer review process and generally has nothing to with the code. You can jump through all those hoops while your colleagues keep cranking out papers.

I just got reminded of this scene from Big Bang Theory.

Sheldon: laughing at his own joke Howard: I haven't seen him laugh that hard since the day Leonard made that multiplication error. Sheldon: laughing hysterically Oh. Oh, Lord! That multiplication error. He though he carried a 1. But he didn't! Leonard: It's not funny. That mistake got published. Sheldon: Stop! I'm gonna wet myself.

Yes, that part is different, reusability of the code is not expected to be the same as normal software code. Not different that the code starts from requirements (which are different) and it should be correct, which basically is the "following normal development practices".

This isn't a bad article; the title is just completely misleading. More appropriate would be:

"How to implement the simplest possible version of TensorFlow Probability [1], a probabilistic programming framework".

[1] https://www.tensorflow.org/probability/overview

On a related note I recently wrote some scientific codes with Go. MPI bindings were ergonomic enough and generics is a big win for writing reusable & fairly performant codes... Built once and deployed on multiple nodes without a hitch. And zero mucking with venvs.

I feel like if you're reaching for interfaces and type hints in Python, it's time to revisit the ole toolkit.

Excellent to see functional programming ideas like deferred computation and clean interfaces make their way into the scientific computing space.

One thing though: I know these are good ideas. But to someone not as familiar with these patterns, they may wonder "why go through all this trouble?"

I thought the deferred part could be better.

For example, one easy win would be to replace the list comprehensions with a generator: there's no point in allocating that entire list just so that statistics.mean can iterate over it.

At the opposite end, the switch to a Distribution class also enables a huge speedup: keep the sampling machinery for situations where you actually need (Cacuhy?), but while die.expected_value() returns (self.n_sides+1)/2, which is effectively free.

>One thing though: I know these are good ideas. But to someone not as familiar with these patterns, they may wonder "why go through all this trouble?"

It's not only why, it ignores the fact that the vast majority of scientific code is written for the science. Usually you're already dealing with layers of abstract theory in the science you're working in, you often don't want to deal with additional cognitive load of then abstracting that for a computer because that's now a new problem. If you take this approach in scientific software, unless you're a commercial vendor implementing some battle tested theory with a paying market, you're going to quickly find yourself without a job--that or you're a wizard at efficiently abstracting abstract theory quickly. I've yet to see someone more efficiently abstract such an implementation compare to simply implement a working, good enough, concrete implementation. You only abstract and optimize what really has to be or will clearly result in a net savings of resources for your specific project context.

The vast majority of scientific code isn't written to be robust by design its a known cost cutting measure to meet budgetary constraints, it's written for a few target goals, not generalization. I've worked in scientific computing and applied science for quite awhile and the ideas like this just don't make sense in most contexts. From a software design perspective, sure, it makes complete sense but it's just unnecessary additional complexity and overhead in most contexts. People with software backgrounds often walk into scientific computing with naivity but often good intent that practices should mirror enterprise software approaches and it's just flat out misguided.

A lot of theory is iterative with a short half-life so to speak, so any code you write surrounding it as a grounding basis will often become outdated quickly. You're often not implementing a method or function like the toy problem here of computing expected value which is a common shared statistical idea that will long outlive the majority of the research you're involved in. You'll instead be writing something highly specific to some toy new theory/model that may later be found to be false or iterated on to find an all together better abstraction you need to again abstract in your software.

You're going to be using these sort of clean robust numerical abstractions, like expected value, when you need them though because they already exist and you can simply call them. It's the unique aspect tightly tied to the research (which is often inherently unique in terms of software it needs) you're doing that will be slapped together rapidly. The vast majority of it will be tossed away. If you hit something successful, that's when it's time to start thinking about these ideas because now you should consider refactoring your nice generalizable sharable theory others find value in and can use to such a nice clean implementation, then throw your existing prototype into the fires of hell where it belongs. There's usually not a lot of funding for that though and that's where commercial industries come in to scoop up publications and create nice clean efficient implementations of said theory they roll into some computational package to sell to people.

> If you hit something successful, that's when it's time to start thinking about these ideas

I agree with you completely. But the article did not specify the use case of these guidelines. They are not to be applied (in my opinion) for research code when you quickly need to publish something. They can be useful however when your already proven and battle tested ideas are used by other people. For example for keeping a shared code base inside a lab, or when you want to provide a robust implementation on top of your ideas.

I don't think that's exclusive to academia.

Business also need to choose wisely when to make ugly POCs for prototyping and when to create robust products / libs to save money long term.

It's not uncommon for labs to have frameworks and internal libs to aid prototyping and experimenting.

>First of all, the die function returns one sample at a time. It needs to be called N times to get N samples, which is slow.

This sort of thing is why I find the love of Python utterly baffling. Yes, the syntax is initially clean, but behind the scenes it is a total mess. If you just write the obvious thing the performance will be disgraceful. I don't know of another language where the for loop is considered harmful. You have to turn it into a hacked-together array language to get Python to exploit even one percent of the machine's real capabilities.

It reminds me of writing TI BASIC back in high school, where the classroom mythology was that leaving off the ending parentheses made programs run faster.

The author's solution to the toy problem is


die = Die(12)

expected_values(die, n=10000)

gaussian = Gaussian(mu=4.0, sigma=2.0)

expected_value(gaussian, n=100000)

coin = Coin(fairness=0.75)

expected_value(coin, f=lambda x: np.where(x == "H", 1.0, 0.0)


After all this work, the answers to the problems are right there in the constraints.

ev_die = (sides+1)/2

ev_gaussian = mu

ev_coin = sign * fairness

What is the point?

Isn’t there also a blatant mistake in the content of the post?

The naive approach for approximating e keeps track of two floats. No arrays. So minimal, aka negligible memory footprint.

Then they introduce iterators, promoting that yielding values is more efficient. Yet suddenly, using islice, they are actually producing and having to store a list of values. So now they do have to worry about memory, as opposed to before. Yet, the article claims/implies the opposite.

I've got a somewhat different critique of the article, in terms of the expected lifetime of the codebase. Part of this goes to language choice, feature choice in the language, etc. The author of the article skims this in the paragraphs above the Conclusion section.

Basically, Python is fundamentally, a moving target. You can use features today which may disappear in newer revisions of the language. This isn't theoretical, I've run into differences between 3.6 and 3.9 which caused me to add work-around code, so I didn't need to build functional versions for each different python. One may say "subclass!" but that is missing the point (and extending the cognitive load).

I've taken code I wrote in the very early 90s, compiled it, and run it against some of the data sets I had still sitting around on my disks (well, SSDs now) over the last year. These data sets were big endian, as I used to run on supers and the old style unix vendor workstations. My little endian machines were a) able to read the data thanks to a simple compiler option, b) compile the 30+ year old code, c) allow me to run this.

This is not because the language (fortran) has not evolved, but because it has been, and remains, remarkably stable over timescales comparable to peoples scien tific careers. Python, in contrast, doesn't remain stable over timescales of single digit years.

So if you code in (a very restricted subset of) Python, you may be able to get this type of stability, that should be a major feature of a stable research language. If you do any function calls, open files, create directories, etc. you need to code for the lowest common denominator, if you wish your code to be operational over even 5 year intervals.

I've got similar complaints about C++, that I ran into some 10yo boost that didn't compile on a recent compiler, even after setting the --std lower. I had to forward port that code to late model boost.

I didn't need to do that with the BLAS/GE calls in fortran.

If you think your work will be around in the long term, and you might need to use it in 10 years without redeveloping it, some of the other languages might be better fits.

Note: I have a similar, albeit more moderate critiques of Julia, in that something changed v0.7 to v1.x, that caused me 10 minutes of recoding in one case. Mildly annoyed, but given the far higher performance and far lower cognitive load of that language relative to the hyperoptimized (and likely still slow) version that article author wrote, I can handle that.

I think the title is a bit provocative / incorrect. There is nothing better about the scientific result of this.

The title could be, "How to improve engineering of scientific code".

Write it like natural language, rather than poorly named, succinct/hard to decipher variables.

That alone will put your code above 90% of scientific code.

There's a typo in "Distribution" in the "Boosting up the purity" section.

How to better scratch some working thoughts on a napkin?


I`d use R instead. Python is best used as a glue language to access premade libraries.

R doesn't give you magic code. I have worked on several projects where my job was to rewrite the R code that took over a day to run in Python. Each time I was (fairly easily) able to get to a place where the Python code ran in < minute. Mostly this was because the people who developed the R code had no good concept of data structures.

That's exactly what this is doing. Numpy is glue around lapack/blas.

R is okay for batch processing, but what happens when management wants engineering to implement data science's models alongside some pytorch models?

Python is totally performant enough if you know how to wield it.

So "Python is totally performant" when it's Fortran :-)

Pretty much. The only fast parts of Python aren't Python.

I always feel this is a little unfair. Sure, a Python program and an R program, written the same way, using the same data structures, etc.. is going to usually show the R program is faster.

But getting to that point is where the challenge is, and I feel that Python makes thinking about things like data structures and the algorithms you're using (in the case of external libraries) or writing much easier than other languages.

In my experience once you "get there" that is enough.

The catch, IMO, is that the need to stay in the numpy sandbox really constrains how you choose data structures and algorithms.

Storing N-dimensional points in a Point class, for example, is often so slow as to be a non-starter. Admittedly, it's not just a Python problem: struggles with array-of-structs vs. struct-of-array representations are pretty ubiquitous.

That's true if you need to stay in that sandbox, which isn't always the case!

My mistake. I scrolled through the article but didn’t see the part where numpy is called.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact