
Array Programming with NumPy - hardmaru
https://www.nature.com/articles/s41586-020-2649-2
======
jofer
Don't underestimate the impact this has on getting funding or even just
tenure/etc recognition for working on numpy. I'm in industry these days, but
coming from the academic side, it's _really_ hard to get recognized for
building the underlying infrastructure that tons of people use. I've built and
maintained libraries that are used in a ton of publications, but was always
told my work was "utterly and completely useless". It was also always
unpublishable, as methods are never publishable in my field. Numpy has
(obviously) vastly more respect and impact than my work, but the general
problem remains.

Articles like this are a _huge_ deal for that reason. It's an immense delayed
recognition for over a decade of work from a lot of folks.

~~~
throwaway-wroc
props to your work; like numpy, i assume it has been immensely useful for
loads of people.

but 'building the underlying infrastructure that tons of people use' is not
science. in my department we had to fail a phd student because 90% of his work
was just implementing a bunch of existing methods as a python library. useful,
yes; science, no. wasn't his fault, had a shitty supervisor, but making useful
tools is not the same as undertaking scientific research.

~~~
wjnc
That's quite the wrong way of approaching science. The scientific method is
based on building on the shoulders of giants. Those giants aren't the
professors in the direct vicinity nor are they only the papers you cite. The
whole of the process is science and if we need further specialization for
building better tools (hey, maths and statistics are scientific tools as well)
I would classify that as science to a large extent.

What could be useful is openness in used tools and software and a way of
getting citation counts for software used. It's nothing more than a table.
That way the hotness of publication could start to flow for the underlying
tools.

~~~
throwaway-wroc
> The scientific method is based on building on the shoulders of giants.

> The whole of the process is science

no. scientific research is proposing a useful model of an observable
phenomenon. this is what you train for during a phd, at least in natural/life
sciences: you learn how to test a hypothesis, not an easy skill.

refactoring code or transforming a bunch of C++ into a python library is useful,
but it's not science.

> What could be useful is openness in used tools and software and a way of
> getting citation counts for software used. It's nothing more than a table.
> That way the hotness of publication could start to flow for the underlying
> tools.

agreed 100%

~~~
mantap
Maybe we should take all this "not science" software away from the scientists
and see how much science they can do without it.

If you write code that allows science to be done that couldn't be done
otherwise then that is science. As a high profile example, a large amount of
specialist software was developed for the LHC to allow it to process all the
events coming from the detectors.

It sounds like the refactoring here was not really that useful in the first
place.

~~~
throwaway-wroc
yes, in 2020 you mostly cannot do science without software, electricity, desks
and chairs and buildings, printers, pick your own irreplaceable tool. yet
building these things to enable research is emphatically not itself scientific
research.

doing a phd -> training to be a scientist.

~~~
mantap
Printers? Desks? Take your strawmen somewhere else.

------
westurner
Looks like there's a new citation for NumPy in town.

"Citing packages in the SciPy ecosystem" lists the existing citations for
SciPy, NumPy, scikits, and other -Py things:
[https://www.scipy.org/citing.html](https://www.scipy.org/citing.html) (
source:
[https://github.com/scipy/scipy.org/blob/master/www/citing.rs...](https://github.com/scipy/scipy.org/blob/master/www/citing.rst)
)

A better way to cite requisite software might involve referencing a
[https://schema.org/SoftwareApplication](https://schema.org/SoftwareApplication)
record in JSON-LD, RDFa, or Microdata; for example:
[https://news.ycombinator.com/item?id=24489651](https://news.ycombinator.com/item?id=24489651)

But there's as yet no way to publish JSON-LD, RDFa, or Microdata Linked
Data from LaTeX with _Computer Modern_.

------
alextheparrot
As someone who cut his teeth on bioinformatics before eventually just
completing a full computer science degree, I was a bit worried that this
article wouldn’t address the applied scientific audience well. Pleasantly
surprised, though, at how well this article evangelizes NumPy to that exact
community.

Many labs are gaining access to or creating physical tools that create data-
analysis problems rather than experimental-design ones. Biologists are
transitioning from running gels to detect the existence of a gene to running
sequencing or flow cytometry to segment populations, for example. I remember
seeing an RNA-seq experiment by a professor that tracked the full lineage of
hematopoietic (blood) stem cells; my jaw was on the floor at the level of
insight.

One remaining step is transitioning many bioinformatics courses from applied
tooling to general program design and open-sourcing. I’ve noticed a few labs
have done that really well for years, but it is not often found as a
field-level skillset.

~~~
rubatuga
The coolest thing I read last year was about single-cell RNA-seq trajectory
analysis of the differentiation of different blood cells.

~~~
dm319
You might like the mass cytometry reconstruction of the human haematopoietic
system [1]. I've done a bit of scRNA-seq and mass cytometry, and the issue I
have with scRNA-seq is the tiny dynamic range it has compared to flow/mass
cytometry, which can make identifying populations much harder. Not to mention
the cost!

[1]
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3273988/figure/...](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3273988/figure/F2/)

------
hoytech
Let's not forget to give at least some credit to the Perl Data Language (PDL). It
pioneered a lot of these ideas 10 years before NumPy existed, and is still a
pretty great tool today:

[http://pdl.perl.org/index.php?page=FirstSteps](http://pdl.perl.org/index.php?page=FirstSteps)

~~~
mjburgess
Er... array programming and statistical programming languages pre-date both by
decades.

APL from 1966, I believe, is the key lang for array programming.

And statistical languages like S from 1976 come to mind:
[https://en.wikipedia.org/wiki/S_(programming_language)](https://en.wikipedia.org/wiki/S_\(programming_language\))

At a quick glance, it seems PDL is just a variation on S.

~~~
ktpsns
R
([https://en.wikipedia.org/wiki/R_(programming_language)](https://en.wikipedia.org/wiki/R_\(programming_language\))
) is kind of the successor of S. In certain communities (not only statistics,
but for instance also biogenetics), there is quite some concurrency between R
and (scientific) Python for data science.

To my understanding, the numpy syntax most closely resembles what is
possible in matlab. And matlab in turn has roots in Fortran. Thanks
to that, young folks nowadays can switch easily between Fortran and Numpy;
the syntax and call structures can be made to almost fit each other.

~~~
uryga
> there is quite some concurrency

i think you meant "competition" here :)

(in polish, my native language, it's "konkurencja", but it's a _"false friend
of the translator"_; i'm guessing you're in a similar boat)

~~~
yqx
Similar false friends in Dutch: Concurrentie

~~~
jefft255
Or French : concurrence :D

------
teorema
For some reason this struck me as inappropriate for the outlet. It's a nice
piece as an introduction to array programming with numpy, but seemed out of
place to me.

~~~
lacker
I think this is great. Nature is really a way for scientists to score points,
not a publication that you read cover to cover that needs stylistic
consistency. Right now the academic citation-count scoring mechanism doesn’t
give enough incentive for people to work on the important infrastructure
pieces like Numpy. So this is a good step towards putting scientific
priorities in the right place.

~~~
teorema
I definitely think things like the infrastructure don't get enough credit. I
also mean no criticism of numpy. But is Numpy per se _conceptually_ that
innovative, from a computer science perspective? I guess to me this just
seemed unusually introductory, about a specific library for a specific
language.

Put another way: if I were going to cite numpy, would I cite this? Probably
not. Would I cite this paper for any of the more general concepts it covers?
Probably not. I'd probably even argue someone _shouldn't_ cite it for that
latter reason, as those concepts supersede numpy (and appear in other
languages under other names).

~~~
konjin
I forget who, probably Hamming, said his most cited paper ever was just an
intro to statistics for biologists. This is not a new thing and it's not a bad
thing. Papers are meant to spread information. If a field isn't aware that
another field has solved a problem of theirs, then even a 101-level paper is
worth writing.

------
1vuio0pswjnm7
BLAS and LAPACK were originally written for FORTRAN, and FORTRAN is
particularly suited to numerical computing and array programming.

[https://modelingguru.nasa.gov/docs/DOC-1762](https://modelingguru.nasa.gov/docs/DOC-1762)

~~~
6gvONxR4sf7o
Aren't BLAS and LAPACK just for the linear algebra bits? Numpy is so much more
than linear algebra.

~~~
bagels
Yes, there's a lot more to it than that.

For instance, the interface and array/matrix types make vector operations
really natural and efficient in python.
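As a toy sketch of what that looks like in practice (assuming `numpy` is
installed; the variable names here are made up for illustration):

```python
import numpy as np

# Elementwise arithmetic applies across whole arrays, with the loop
# happening in compiled code rather than in Python.
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([2, 3, 1])

totals = prices * quantities  # elementwise product
grand_total = totals.sum()    # reduction over the whole array

print(totals)       # [20. 60. 30.]
print(grand_total)  # 110.0
```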

------
enriquto
Young programmers are slowly but steadily reaching the power that Fortran
programmers had 40 years ago. That's good news!

~~~
mumblemumble
That statement only feels true to me if you interpret the word "power" as an
exact synonym for "performance." Which is a definition that is valid, but also
just about perfect for leading someone to miss the point.

Numpy approaches the performance you could get with Fortran, largely because
its numerical core hands off to compiled C and Fortran routines. What Numpy
offers that Fortran never did,
though, is _leverage_. The article mentions, but doesn't really do justice to,
the sheer volume of interoperability that Numpy has enabled. It's not just
that all these libraries were built on top of Numpy. It's also that their
common Numpy substrate makes them all deeply interoperable with each other.
And that works both above and below the boundary. You can swap out BLAS and
LAPACK for something else - say, CUDA, or a distributed representation - and
as long as the replacement also speaks Numpy's language, you can plug it into
existing libraries that were originally written against Numpy.

In short: Fortran gets you performance. Numpy gets you that, and also
productivity. I would argue that that actually makes Numpy _more_ powerful
than what was possible with just Fortran.
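A minimal sketch of that "speaks Numpy's language" idea, via the `__array__`
protocol (the `DiskBackedVector` class is made up for illustration):

```python
import numpy as np

class DiskBackedVector:
    """Toy stand-in for an alternative array backend (a made-up example).

    Anything exposing __array__ can be consumed by code written against
    NumPy, without that code knowing this class exists.
    """
    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None):
        # Hand NumPy a plain ndarray view of our data on demand.
        return np.array(self._values, dtype=dtype)

def rms(x):
    """'Library' code written purely against the NumPy API."""
    a = np.asarray(x)
    return float(np.sqrt(np.mean(a ** 2)))

print(rms([3.0, 4.0]))                    # works on plain lists
print(rms(DiskBackedVector([3.0, 4.0])))  # and on the custom backend
```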

~~~
teleforce
Try the D language: with its numerical library you can get both productivity
and performance that is even better than OpenBLAS (which the Numpy and Julia
libraries are based on) [1].

In D you get the consistency of a single unified language semantics, unlike
the impedance-mismatched approach that is inherent in the Python-plus-Numpy
programming combination.

[1][http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...](http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/glas-
gemm-benchmark.html)

------
Mikhail_K
Numpy array syntax is inconsistent to the degree that it can be considered
broken. For instance, to take elements number 4, 3, 2 of an array in Python,
one writes A[3:0:-1], but for the elements 3, 2, 1 one has to write A[2::-1],
because someone decided that making A[-1] refer to the last element of an
array is "intuitive", and so A[2:-1:-1] returns an empty slice.

Now if you want to use array comprehensions, the first case looks similar:
[A[k] for k in range(3,0,-1)] but the second now has to be [A[k] for k in
range(2,-1,-1)].

Further, for some reason, array and matrix are different types and one has to
convert back and forth between them.

~~~
tgb
The matrix class is now officially deprecated, FYI, so just use array
everywhere.

I guess the generic way to write that would be A[bottom:top+1][::-1]. But the
blame there goes to Python, not numpy, since the same is true of lists.
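Indeed, a plain Python list shows the same asymmetry (a quick sketch, with
`A`, `bottom`, and `top` as in the comments above):

```python
# Plain Python lists exhibit the same slicing behavior NumPy inherits.
A = list(range(10))  # [0, 1, 2, ..., 9]

print(A[3:0:-1])   # elements 4,3,2 (indices 3,2,1) -> [3, 2, 1]
print(A[2::-1])    # elements 3,2,1 (indices 2,1,0) -> [2, 1, 0]
print(A[2:-1:-1])  # empty: -1 means "last element", not "before index 0"

# The generic "indices bottom..top, reversed" form avoids the stop pitfall:
bottom, top = 0, 2
print(A[bottom:top + 1][::-1])  # -> [2, 1, 0]
```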

------
throrthaway
Ah, just the right time to publish about numpy - right when everyone is moving
over to Julia because of numpy's warts.

~~~
microcow
I don't think that's accurate:

[https://trends.google.com/trends/explore?date=today%205-y&ge...](https://trends.google.com/trends/explore?date=today%205-y&geo=US&q=%2Fm%2F0j3djl7,%2Fm%2F021plb)

