NumPy receives first ever funding, thanks to Moore Foundation (numfocus.org)
548 points by happy-go-lucky on June 13, 2017 | 84 comments

Wow, I'm surprised that this is the first funding they've ever got.

It wouldn't be a big stretch to say that 90% of quantitative hedge funds use Numpy in some fashion, whether it's directly or via a library that sits on top of it like pandas or tensorflow.

I can't think of a more ubiquitous library in the financial space, maybe QuickFIX (http://www.quickfixengine.org/)...

Maybe numpy's problem is visibility?

Possibly it does its job so well that people don't know they are using it when they use libraries like scikit-learn and pandas?

I didn't know it needed funding until I read this.

It's just an assumed resource in quant finance, like air or water. You do realise you're using it, though. When you're using scikit or pandas it's very normal to do "import numpy as np". And you get the odd np.nan reminding you.

Do people still use QuickFIX? I dropped it years ago; it was noticeably slower than alternatives when I tested it.

Tons of people still use QuickFIX. As I think you know, people very concerned with speed try to avoid using any kind of FIX for trading but still use QuickFIX for things outside the hot path such as drop copy (duplicate stream of trades for reliability).

QuickFIX is open source too, so you can make it somewhat faster without abandoning it. I did.

Don't be surprised, and don't feel bad. There is a great deal of open source software we all depend on every day without thinking about it. Consider zlib, libpng, libjpeg(-turbo), bash, bsd or gnu core utilities, random stuff ... and then the stuff you might sometimes think about, like openssl ...

That's maybe the biggest practical benefit of open source software. You don't have to keep track of who you owe what. A lot of these projects have had a few different critical creators and maintainers over the past decade or so. And we don't have to keep track of any of that. That's a huge efficiency boost.

(You should not re-distribute open-source software in a way that violates the license, but that's a separate issue from using it, and it scales a lot easier - everyone receives/uses many more different software works than they distribute.)

There's been quite a bit of private funding, through Enthought and other companies.

I don't think this is very true. There hasn't been much private funding for NumPy. People working at private companies have worked sometimes on NumPy but very rarely as part of their full-time job.

Oh, hey, Travis! (This is Joe Cooper, I worked at Enthought as a contractor for a few years before you joined the company full-time; we met a few times in Austin.)

You would obviously know better than I. I guess I considered employing people who were working on NumPy as "funding", but possibly not on the scale or with the focus of a specific grant. So many of the folks who do scientific computing with Python have gone through Enthought, it seems kinda like everyone has drawn a salary or contract work from there at some point. But, I guess a lot of the work at Enthought was focused on making the tools palatable to industry rather than the actual science side of things, and much of the math they're packaging came from the academic world.

Numpy is so 'normal' I think many people believe it is part of Python.

Visibility, REALLY? Was OpenSSL's problem also visibility?

It's great that the Moore Foundation provided funding for open source data science tools in Python. Good for them!

That being said, I do wonder if numpy is the most appropriate recipient. In my experience with data science, the tool that would benefit the most is not numpy, but pandas. While data scientists rarely use numpy directly, every data scientist I know who uses pandas says they are constantly having to google how to do things due to a somewhat confusing and inconsistent API. I use pandas at work every day and I'm always looking stuff up, particularly when it comes to confusing multi-indexes. In contrast, I rarely use R's dplyr at work, but the API is so natural that I hardly ever need to look things up. I would love it if pandas could make a full-throated commitment to a more dplyr-like API.

Nothing against pandas -- I know the devs are selflessly working very hard. It's just that it seems there is more bang for the buck there.

If you look at the design documents for pandas 2 there is a good illustration of how a lot of pain points in pandas 1 spring from numpy (https://pandas-dev.github.io/pandas2/internal-architecture.h...). I think any significant development effort on numpy would probably greatly benefit both libraries.

Will have to check out dplyr :) love to see how they master the magic that is multi-indexes.

In many cases, the use of multi-indexes in Pandas is (I think) a result of culture/style or expectation that the cells of a dataframe should have scalar values. If that would change and it became common to have nested dataframes, the use of multi-indexes would diminish.

The tooling to support nested dataframes (and maybe even lists) is simple to create; it can even be a third-party library. I find that multi-indices, though they may be an accurate conceptual way of thinking about certain data, tend to be more inconvenient in practice than nesting the dataframes. In all the cases I have encountered, only a single level of nesting is required.

If you're excited about non-scalar values in DataFrames, you should take a look at xarray (http://xarray.pydata.org), which implements a very similar idea in its Dataset class.

Thanks for the link! Good stuff.

By the way, dplyr doesn't use multi-indexes. I actually think this is one of the reasons (although not the biggest reason) dplyr is easier to use.

The funding source used by NumPy here is equally available to pandas developers. If someone with the experience to deliver wrote a good proposal I think there's a decent chance that it would be funded.

But... pandas uses numpy under the hood. If numpy is better and can offload some of the core functionality from pandas, that will also benefit pandas, right?

Right, but I'm talking about the pandas API. Stuff like how easy it is to remember exactly how to do aggregations, transformations, etc.

Here's some specific examples:



I could be wrong, but I'm pretty sure that these would be solved by pandas API design improvements, not with numpy improvements under the hood. (NB: As always, a big thanks to the developers for all their work.)

I had a similar issue. I had to read a certain piece of R code and it used a lot of dplyr, I read the dplyr documentation and I immediately felt more comfortable manipulating data in R than in Python. Later on I created https://github.com/has2k1/plydata, a dplyr imitation.

A lot of people already mentioned that pandas is built on top of numpy. Also, pandas and numpy are housed under the same non-profit: https://www.numfocus.org

First time I heard about NumFOCUS. Under their umbrella also sit IPython, Jupyter Notebook, Julia, Matplotlib, and a dozen more projects.

But isn't the issue here that completely redoing the API would break a lot of code? I don't see how throwing money at the problem would fix this. I don't use any of these libraries, so maybe I'm totally off base, but it sounds like it's more of a tech debt/design issue than an issue that requires the kind of programming hours that only money can buy.

On the other hand if lots of libraries use numpy, making it more efficient and/or capable would seem to give quite a lot of bang for the buck. And it sounds like that's the kind of problem that money can actually solve.

Pandas makes lots of backwards-incompatible changes. See for example these changes in the latest release


There have been a few independent attempts to add dplyr-like functionality to pandas without being backwards incompatible (e.g. dplython). I'd be very happy if the core pandas team went down this path.

That being said, I don't have a good understanding of how strong the distinction is between "design issues" and "issues where money helps". There must be some overlap.

I'll have to speak in generalities as I don't know enough about NumPy in particular to comment.

> That being said, I don't have a good understanding of how strong the distinction is between "design issues" and "issues where money helps". There must be some overlap.

That's true, but many projects have turned out badly no matter how much money was spent on them, while cheaper projects turned out better. See: design by committee. The design of an API obviously requires careful thought, which I suppose is work that could be paid. But the issue of getting everyone to agree on a design isn't one that money can solve, and then you need to make some hard decisions about backward incompatibility. Perhaps you'd fund a fork of the project, splitting it into an old legacy one and a new, fancy version with a new API, but then you're committed to maintaining two projects, which is its own headache.

These are the kinds of things I mean by design issues. Problems that aren't necessarily hard because they require many people to work for many billable hours to solve them, but because finding acceptable compromises is a very human issue quite irrespective of the programming effort involved.

Many a software project has recognized that serious, backwards-incompatible changes would improve the project, and often there is even a working implementation, but these human and legacy support issues prevent widespread adoption and then the new implementation dies a quiet death because nobody is using it, so nobody finds it worth their time to work on it.

Perhaps what you really want is a new library, rather than trying to contort a different project into the shape you want. Which is of course something money helps with, but then when the money dries up the question of adoption is going to determine whether it lives or dies as an open source project.

Again, those were some general thoughts, I don't know much about this particular project, so maybe I'm way off base. Just offering an alternative POV regarding what exactly constitutes "getting your money's worth" with respect to choosing which OS projects to fund.

pandas is often used for one-off reports, where backwards compatibility is not as important. Production software relying on the API could always depend on previous versions if a new version brings a significantly improved API.

I'm a regular user of pandas, would definitely say it's my favorite Python library by far... but it is very hard to do certain operations with it (as the OP said, anything involving multiple indexes, and things like plotting multiple plots after a groupby, etc.)

Ok, I might very well be totally off base. Sorry for butting in on a subject that I don't know much about.

> every data scientist I know who uses pandas says they are constantly having to google how to do things due to a somewhat confusing and inconsistent API.

That's a design error, not necessarily something that money will fix for you. This is why you need to think really long and hard before deploying a public API, it is very hard to change those.

Well, one could at least hope that some additional funding would improve the chance that these design errors are addressed, although I agree that it is no panacea.

Just because the API is bad doesn't mean we should throw money at it. I agree that NumPy might not be the best recipient either. It's hard telling, really.

Personally, I believe the biggest blocker for me is the lack of good visualization tools. Ultimately, what gets me paid is showing other people my work and getting them to give me money to continue it.

On the core science stack IMO there's numpy, scipy, sympy, matplotlib, pandas and xarray. I probably use it next to least, but I really think sympy is the one that could benefit the most from some funding.

Do you not use Seaborn?

I can't speak to the reasons why pandas wasn't funded, but the team is looking for funding.

At the end of the day a lot of code uses NumPy and not Pandas.

Pandas is sponsored by AQR I thought?

Nope, just developed in-house there (and the original developer now works at Two Sigma).


Just a note: if you know of anyone who is interested in working on NumPy and potentially moving to UC Berkeley, tell them they should probably contact Nathaniel. Since NumPy got funding, they'll likely hire developers, a community manager, a technical writer, etc. UC BIDS is a fantastic place to work, and Nathaniel is an extraordinary person to work with. I'm going to assume there is also some opportunity for remote work.

Really surprised there wasn't already funding for this.

Numpy is an amazing library, and it's basically Python's "killer app." The fact that you can seamlessly blend numerical/data science computing with more general web applications is what makes Python great.

Imagine if .1% of wall street profits from shops that use numpy were donated to the project. Or some similar scheme for the other OSS projects used for profit by large firms.

Why would they do that? Numpy gave it to them. It's theirs now. "Screw you nerds" is their opinion. Don't like that? Then dual license. Copyleft and a commercial license. They can pay, break the law, or go leech off something else.

The funny thing is, open source projects scramble to offer the most permissive license possible. It's either BSD/MIT or go home. I've seen people shit on projects that use copyleft (GPL, even LGPL, etc).

Sounds a little schizophrenic to complain about the (lack of) money at the same time?

I personally think this may be connected to the "academic" origins and mindset of much OSS. Everything's "free" in that world, which can make a software's transition into the world with real economic constraints and sustainability challenges somewhat painful.

I would think that's because most open source projects don't start with money in mind, so their first goal is users. And as a user I can tell you GPL or even dual-licensing frustrates me, because it limits my options.

Consider: today I might be working on FOSS. But maybe tomorrow a friend asks me to help him with his (small) business, by e.g. adding a little bit of automation. Suddenly, all my knowledge and experience of GPL-ed libraries goes to waste, as I won't be able to use any of that to help my friend.

Given two equivalent libraries, one on GPL and one on MIT, I'll always go for the one on MIT. MIT, BSD, etc. seem to be the libraries that give you most options (I'd say freedom, but that's not how GPL sees freedom) while still maintaining the integrity of the library itself. Those are the licenses that best and at the same time satisfy the needs of developers who are not in it for money, and users who don't want to waste their brain cycles on going through possible legal scenarios around all their actual and imagined use cases.

> Suddenly, all my knowledge and experience of GPL-ed libraries goes to waste, as I won't be able to use any of that to help my friend.

Help me understand this, please. In what legal way is the GPL obstructing you, as opposed to, say, an MIT or a BSD license?

The GPL has nothing against use in a proprietary setting; it's only if GPL'ed software is being sold/distributed that it is required that the source and the changes be made available under the GPL. Google uses GPL'ed software all the time and is far from the only one.

Last time I checked, the GPL, as opposed to LGPL or GPL with Classpath Extension, basically forced your product to become open source.

Only if you are distributing/selling the software to others. If you and your friend are using it as a company internal tool, GPL has no issues with that.

> In what legal way is GPL obstructing you as opposed to say an MIT or a BSD license.

Which one? 2 or 3?

I can totally understand why an entity (individual or business) would want the most options for themselves. It's completely natural, if a little selfish.

But the discussion here is about FOSS sustainability from the dev perspective, not yours as a user. Dual licensing was one option, proposed by the OP.

Or do you think it's OK that critical libraries like NumPy, Django, etc end up with scraps (if lucky), and then we read odes to that on HackerNews? Long-term planning needs a certain reliable continuum (no pun intended) of people and resources.

You know, experts able to meaningfully contribute at this level (core NumPy, core Scikit-learn, core Django, whatever) have very real trade-offs to make, regarding the cost of their labour, free time, family time etc, once out of academia. The "I HAZ BUG PLS FIX NOW, GIMME GIMME FREE" users are only one piece of the open source puzzle.

I'm sure your friend that runs a business understands this very well (or he'll be out of business quickly).

I agree what I presented is quite a selfish POV. What I was aiming at is an explanation why those very liberal licenses are attractive to developers at the initial stage of an open source project. Initially, the userbase is much more valuable than monetary contributions - I doubt that e.g. NumPy developers at the beginning of their work were confident that their project will end up being widely used by the financial sector.

The presented reasoning may indeed be a big part of the problem here. I'm just describing it, not encouraging it. I am definitely not happy to see so much open source being used as critical components of worldwide infrastructure and businesses, and yet receiving little support both in terms of money and professional effort. FWIW, I do donate the little spare money I have to open source projects.

Absolutely. Sustainability only becomes an issue when there's something to sustain :-)

Beginnings are a creative effort, undertaken by passionate pioneers (rarely in it for the money), with outcomes that are notoriously hard to predict in advance. The hallmark of academia.

That's why I said the later transition hurts -- it's a conceptual and cultural shift, not merely financial.

Don't they do that to get market share? I know it sounds crazy but it kinda makes sense. Get people using it and then figure out how to get monetary support from the users later. Although this second step often proves hard if you've already given away the goods.

I understand the GPL, but I personally prefer BSD/MIT and I am disappointed when I come across things I like in the GPL. IMO the GPL is not really 'free'.

On the other hand, the GPL wouldn't make a difference here unless they are actually distributing it, and my understanding of the LGPL is that they could do whatever they wanted as long as they use NumPy and don't change it.

Why would copyleft affect them? Copyleft is about distribution. These Wall Street firms keep all their software in-house.

I wonder what they plan to use it for. Numpy kind of seems finished already.

There are about 200 open "Numpy Enhancement Proposal" issues on Github, https://github.com/numpy/numpy/labels/01%20-%20Enhancement

Many things could be improved in NumPy:

  * make it easier to implement and deploy custom dtypes, fix the time-related dtype
  * support for ragged arrays
  * consolidate internals, especially around ufuncs
I also think some non-trivial part of pandas' lowest levels belongs in NumPy, though I have not thought very deeply about that one: support for missing values, some kinds of indexing, etc.
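
For readers unfamiliar with the term, a "ragged array" is one whose rows have different lengths. A minimal sketch of one common way to represent that, in plain Python: a flat list of values plus row offsets. The RaggedArray class and its methods are hypothetical and only illustrate the idea; this is not a NumPy API or any actual proposal.

```python
class RaggedArray:
    """Illustrative ragged array: one flat list of values plus row offsets."""

    def __init__(self, rows):
        # All values stored contiguously, row after row.
        self.values = [x for row in rows for x in row]
        # offsets[i] .. offsets[i + 1] delimits row i inside self.values.
        self.offsets = [0]
        for row in rows:
            self.offsets.append(self.offsets[-1] + len(row))

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.values[self.offsets[i]:self.offsets[i + 1]]

    def row_sums(self):
        # A per-row reduction that works even though row lengths differ.
        return [sum(self[i]) for i in range(len(self))]


r = RaggedArray([[1, 2, 3], [], [4, 5]])
print(r.offsets)     # [0, 3, 3, 5]
print(r[2])          # [4, 5]
print(r.row_sums())  # [6, 0, 9]
```

A regular 2-D array cannot hold these three rows without padding, which is why this needs separate support.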

Yes, I agree. I would love to see parts of what is in Pandas actually supported better in NumPy.

That's not to say there aren't things NumPy doesn't do that it could. How about lazy evaluation or even just matrix chain multiplication? Either could save a lot of computation.

Matrix chain multiplication was added to NumPy recently, see numpy.linalg.multi_dot: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/...
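
For anyone wondering why the grouping matters, here is a minimal sketch in plain Python. It is the textbook dynamic-programming cost calculation, not NumPy's multi_dot code; the function names and the example shapes are made up for illustration.

```python
def left_to_right_cost(dims):
    """Scalar multiplications when evaluating ((A0 @ A1) @ A2) @ ... in order.

    dims has length n + 1: matrix i has shape dims[i] x dims[i + 1].
    """
    rows = dims[0]  # the running product always keeps A0's row count
    return sum(rows * dims[i] * dims[i + 1] for i in range(1, len(dims) - 1))


def optimal_cost(dims):
    """Minimum scalar multiplications over all groupings (classic DP)."""
    n = len(dims) - 1
    # cost[i][j] = cheapest way to multiply matrices i..j inclusive
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]


# Hypothetical shapes: A is 1000x10, B is 10x1000, C is 1000x10.
dims = [1000, 10, 1000, 10]
print(left_to_right_cost(dims))  # (A @ B) @ C costs 20000000
print(optimal_cost(dims))        # A @ (B @ C) costs 200000
```

Same result either way, but a factor of 100 fewer multiplications with the better grouping for these shapes.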

Hooray! Still, it's not exactly transparent in a way that you could get with lazy array evaluation.
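
A rough sketch of what "transparent" could look like with lazy evaluation: `@` only records the operands, and the grouping is chosen when the result is actually requested. LazyChain and its methods are hypothetical, use nested lists rather than NumPy arrays, and are not an existing NumPy API.

```python
from functools import lru_cache


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LazyChain:
    """Hypothetical wrapper: `@` records operands, evaluate() does the work."""

    def __init__(self, *operands):
        self.operands = operands

    def __matmul__(self, other):
        # No arithmetic here, just a longer chain.
        return LazyChain(*self.operands, *other.operands)

    def evaluate(self):
        mats = self.operands
        # Operand i has shape dims[i] x dims[i + 1].
        dims = [len(mats[0])] + [len(m[0]) for m in mats]

        @lru_cache(maxsize=None)
        def best(i, j):
            """Return (cost, split) for the cheapest product of operands i..j."""
            if i == j:
                return 0, None
            return min(
                (best(i, k)[0] + best(k + 1, j)[0]
                 + dims[i] * dims[k + 1] * dims[j + 1], k)
                for k in range(i, j)
            )

        def build(i, j):
            # Multiply operands i..j using the cheapest split found above.
            if i == j:
                return mats[i]
            k = best(i, j)[1]
            return matmul(build(i, k), build(k + 1, j))

        return build(0, len(mats) - 1)


a = LazyChain([[1, 2], [3, 4]])
b = LazyChain([[0, 1], [1, 0]])
c = LazyChain([[1], [1]])
expr = a @ b @ c        # nothing computed yet
print(expr.evaluate())  # [[3], [7]]
```

The user writes the ordinary `a @ b @ c` and never picks a grouping by hand, which is the difference from calling a dedicated function.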

Dynamic arrays anyone?

I could have sworn that Continuum had gov't funding for numpy development, but maybe that was just for Blaze?


It was blaze and bokeh. The description of blaze given in the article is pretty out of date at this point.

Continuum co-founder here and previous NumPy developer. We received funding for Bokeh and also explorations around scalable "NumPy-like" work. The efforts around Blaze produced Numba, Dask, and Datashape as well as a more limited in scope library called Blaze.

While we continue to develop those things, we are working on taking the core ideas in NumPy and creating a set of lower-level C and Python libraries that could be used by pandas, xarray, Numba, dask, arrow, and potentially NumPy: https://github.com/plures. This work is not directly related to the NumPy funding.

Thanks for the clarification, Travis, that's pretty exciting to hear.

Are they going to make Numpy work on GPU? There is a library called CuPy (from Chainer) that does that but not quite well enough. In fact, on my attempt to swap Numpy with CuPy, my program ran slower.


If you are looking for performance optimizations beyond Numpy (GPU or CPU) take a look at Numba by Continuum

They've come a long way without funding. Good for them. Mathworks taking notice, I'm sure.

Slightly off topic but you can use most of the Python stack from MATLAB since version 2014b. The syntax is a little funny but it works well outside of ABI mismatches in shared libs.

Wasn't Google funding the lead dev on NumPy for a while?

No. I'm not aware of this happening.

NumPy built on Numeric, which was primarily written by Jim Hugunin while he was a grad student at MIT. While I was a professor at BYU, I wrote the core of NumPy with a lot of community input -- outside of my regular job. I sold a book "Guide to NumPy" for a while that I used to replace the grants I should have been writing and pay for a grad student to help write iterative solvers for SciPy. Chuck Harris joined the NumPy effort early and has been steadily contributing ever since without direct funding. Many others have contributed volunteer time since then.

A major reason I started Continuum was to help create places where people could get paid to write open-source. I am happy to say we have been doing this for 5 years though mostly outside core NumPy itself (Numba, Dask, Bokeh, conda, etc.) We are working to support many more open source projects more generally -- and our devs have now made additional contributions to NumPy itself. We have a thriving 40 person Community Innovation team at Continuum supporting many open source projects. I expect this funding to help bring more new people to the NumPy development ecosystem.

The community also started NumFOCUS at this same time to be a community-run foundation that could be a focal point for donations and support to projects including NumPy.

Nathaniel Smith wrote a great proposal and put the effort into securing this funding. It is real work to secure funding. I look forward to NumPy getting better for the benefit of all because of this work.

Congratulations!! Nice work...looking for lots more math libraries :-D!

Does anyone have a link to the text of the proposal?

I really wish I could help!

What's stopping you?

About time.

$645,020 is good for what? 4 jr developers or 3 slightly experienced developers, working full time on numpy for 2 years?

What if that's what it takes? And besides that, it is $645,020 more than they had otherwise; that's serious dough. (Though not serious in terms of the value created with Numpy, but that is the nature of open source.)

It's a heck of a lot more than Instagram donated to Django! (Around $25k/y!) [1]


I think this is a great question: academia is still finding out how expensive software development can be. From my experience in the UC system, 35-50% of this will go to grant administration overhead. After factoring in benefits, my guess would be one senior and one junior developer for 2 years, if those developers are willing to take rather large pay cuts to work on a public good / neat project (thankfully a lot are!).
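
A back-of-the-envelope version of that estimate, as a small Python sketch. The grant total, the 35-50% overhead range, the 2-year duration and the two-developer headcount all come from the comments above; the cost of benefits is not stated anywhere in the thread, so it is not deducted here and the real take-home figures would be lower.

```python
GRANT = 645_020   # total award in USD, as quoted in the thread
YEARS = 2         # duration guessed in the thread
DEVELOPERS = 2    # "one senior and one junior developer"


def per_developer_year(overhead_percent):
    """USD left per developer per year after overhead (benefits not deducted)."""
    remaining = GRANT * (100 - overhead_percent) / 100
    return remaining / (YEARS * DEVELOPERS)


for pct in (35, 50):
    print(pct, per_developer_year(pct))
# 35 104815.75
# 50 80627.5
```

So roughly $80k to $105k per developer per year before benefits, under these assumptions.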

20 graduate students

What's your point with this post?

free/open/libre software takes time and skill to create. more companies should probably be paying some kind of support or at least donating since they are no doubt using it to extract some kind of value from the market.

Ok. It came off as if 'this measly sum means nothing'.

This is a charitable foundation that gives scientific grants, not a company.

my point stands. more public money should be going to free/libre projects too.
