
PEP 450: Adding A Statistics Module To The Standard Library - petsos
http://www.python.org/dev/peps/pep-0450
======
fiatmoney
It's not a terrible idea to support the absolute basics like mean & variance,
but anything beyond that (particularly things like models or tests) is not a
good idea for a standard library. Once you hit even something simple like a
linear regression you have issues of how to represent missing or discrete
variables, handling collinearity, or whether to do online or batch modes, which
can give different results. Tests in particular are fraught because if you're
going to make them available for general consumption they need a good
explanation of when they're appropriate, which is basically a semester course
in statistics and well out of scope for standard library docs.

Basically, the idea of "batteries included" should also mean that if something
looks like you can put a D-cell in there, you're unlikely to blow your arm
off.

~~~
daniel-levin
Agreed. I'm studying statistics at the moment and I'm continually reminded of
how easy it is to choose the wrong model / distribution and be incorrect
because of some non-obvious and technical reason. For example, just the other
day, I wanted to use the binomial distribution to solve a problem. To use this
distribution, the trials must be independent of one another. In that
particular problem, there was a subtle condition that made the trials non-
independent. I arrived at correct-appearing answers (0 <= P <= 1) that were
actually all wrong. Statistics is way too easy to break to be used naively.
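A toy simulation makes the trap concrete (this is a generic sketch, not the problem from the course): drawing without replacement makes the trials dependent, so the binomial formula gives a plausible-looking but wrong answer.

```python
import math
import random

def binomial_pmf(k, n, p):
    # P(X = k) for n *independent* trials with success probability p.
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def simulate_without_replacement(trials=100_000):
    # Draw 5 balls from an urn of 5 red + 5 blue WITHOUT replacement,
    # so the draws are not independent; estimate P(exactly 3 reds).
    hits = 0
    for _ in range(trials):
        urn = ["red"] * 5 + ["blue"] * 5
        draws = random.sample(urn, 5)
        if draws.count("red") == 3:
            hits += 1
    return hits / trials

binomial_pmf(3, 5, 0.5)  # 0.3125 -- the naive (wrong) model
# The exact dependent-trials answer is hypergeometric:
# C(5,3) * C(5,2) / C(10,5) = 100/252 ≈ 0.3968, which the simulation approaches.
```

Both answers are valid probabilities (0 <= P <= 1), which is exactly why the mistake is hard to spot.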

~~~
lutusp
> Statistics is way too easy to break to be used naively.

Fair enough, but the same argument could be made about using an unskewed
standard distribution on non-symmetrical datasets, a common error even among
people who should know better.

I think binomial functions should be included, on the grounds that they're
very useful and their probability of misuse is only equal to that of the
continuous statistical forms, not greater.

~~~
dandellion
Hell, sometimes they use a dictionary when they should be using a list. Almost
everything can be used wrongly by a beginner, which doesn't mean it shouldn't
be there.

I think having a basic stats module always handy would be very convenient.

~~~
lutusp
> I think having a basic stats module always handy would be very convenient.

I absolutely agree. My only point was that these tools are sometimes
misapplied, not at all to argue that they shouldn't be readily available. They
should be.

------
clutchski
Batteries included is a fine philosophy when starting a language to encourage
early adoption, but at this point, I don't think it's worth adding new
libraries to the stdlib. Here's why:

- It's very easy to find and install third-party modules

- Once a library is added to the stdlib, the API is essentially frozen. This
means we can end up stuck with less-than-ideal APIs (shutil/os,
urllib2/urllib, etc.) or Guido & co. are stuck in a time-consuming
PEP/deprecate/delete loop for even minor API improvements.

- Libraries outside of the stdlib are free to evolve. Users of those
libraries who don't want to stay on the bleeding edge are free to stay on old
versions.

~~~
dalke
Those are good reasons for rejecting any addition to the standard library.
However, new libraries are sometimes added to the standard library, which
means the reasons you listed can be overcome by even better reasons for
inclusion.

What are those reasons for why a new library can be included, and why aren't
those reasons appropriate justification for including this proposed statistics
package?

~~~
Peaker
> However, new libraries are sometimes added to the standard library, which
> means the reasons you listed can be overcome by even better reasons for
> inclusion.

Or overcome by bad judgement.

------
ot
Just out of curiosity, I submitted this yesterday:

[https://news.ycombinator.com/item?id=6190603](https://news.ycombinator.com/item?id=6190603)

The URL was

    http://www.python.org/dev/peps/pep-0450/

While this is

    http://www.python.org/dev/peps/pep-0450

That is, exactly the same except for a trailing slash. Doesn't the
deduplication algorithm handle this case?

~~~
daGrevis
Technically speaking, they are separate URLs that may lead to separate
resources. For example, Google engine treats them as separate URLs.

That's the reason why opening
[http://www.python.org/dev/peps/pep-0450](http://www.python.org/dev/peps/pep-0450)
redirects to
[http://www.python.org/dev/peps/pep-0450/](http://www.python.org/dev/peps/pep-0450/)
. The HN engine should follow the redirect to avoid situations like this.

~~~
ot
Technically speaking, no URLs are equivalent in general; different strings
may lead to different resources.

Still, there are a number of common-sense heuristics to normalize URLs, which
HN applies for de-duplication. I was wondering what the rationale is for not
having trailing-slash removal among them. I mean, is there any _legitimate_
website that serves a different resource if you remove the trailing slash?
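For what it's worth, the kind of heuristic being discussed is a few lines with the standard library. This is a sketch of a plausible normalizer, not HN's actual algorithm:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    # A minimal normalizer of the kind a deduplicator might apply:
    # lowercase the scheme and host, drop a single trailing slash
    # from a non-root path, and strip any fragment.
    scheme, netloc, path, query, _fragment = urlsplit(url)
    if path.endswith("/") and path != "/":
        path = path[:-1]
    return urlunsplit((scheme.lower(), netloc.lower(), path, query, ""))

normalize_url("http://www.python.org/dev/peps/pep-0450/")
# -> 'http://www.python.org/dev/peps/pep-0450'
```

With this, both submissions would hash to the same key.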

~~~
gsnedders
Per RFC 3986/3987, [http://example.com/%61](http://example.com/%61) and
[http://example.com/a](http://example.com/a) are equivalent. (Indeed, all
major browsers will request the latter regardless of which is input.) Equally,
a punycode-encoded IRI and the original IRI are equivalent. There is a whole
section on equivalence in both of the RFCs (3987 includes 3986 by reference,
so is a superset).

~~~
U2EF1
Browser equivalence is another thing entirely. Most browsers will accept
[http://www。google。com](http://www。google。com) (because in Japanese '。' is
'.'). But if you try to request that actual resource, it doesn't lead
anywhere.

But yeah HN should just use browser equivalence.

~~~
gsnedders
What's entered in the address bar is totally undefined and every browser does
its own thing; there's more consistency for URLs in content, which doesn't do
stuff like normalising '。' but does handle the percent-encoded case (for
unreserved characters, as the spec says).

Following what the spec says for equivalence makes sense, at least. Anything
more drastic is technically treating distinct URLs as equivalent.

------
cabalamat
> For many people, installing numpy may be difficult or impossible. For
> example, people in corporate environments may have to go through a
> difficult, time-consuming process before being permitted to install third-
> party software.

I do not regard this as a good justification for putting something in the
standard library! If you don't have root access, use virtualenv (which you
might want to do anyway) and install the package somewhere under your home
directory.

~~~
siddboots
NumPy is not nearly portable enough to do what you are describing as a user
on a Windows machine. You cannot simply do a `pip install numpy` into a
virtualenv. Instead you must either install the package system-wide or
compile it yourself, which means getting a working MinGW environment or
similar.

Edit: Although, I do agree that NumPy being difficult to install is not, on
its own, a good justification for the PEP.

~~~
cdavid
numpy is quite portable; I am not sure what you mean by "not nearly portable
enough".

The reason you can't do pip install numpy is pip's fault; there is nothing
numpy can do to make that work. Note that easy_install numpy does work on
Windows (without the need for a C compiler).

~~~
bachback
The problems I have had with numpy are endless. Usually I'll just prefer to
write my own, because it's quicker. If you have tried to get numpy running on
a cloud machine you'll know what I'm talking about: basically you have to
know how to compile from source, know some gcc, etc. The last time I tried to
get it running I promised myself never to use numpy again.

~~~
malkarouri
I suggest you try again. I think the situation has changed. I have been using
Numpy/Scipy for a few years now, with all the negative experience of having to
sort out whether to use g77 or gfortran and the right gcc flags etc. The most
recent version, Numpy 1.7.0, is the first one that installed for me smoothly
on Windows and OS X (official packages). YMMV, but I think it is alright now.

------
lutusp
Great idea, but while assembling this library, don't leave out permutations,
combinations, and the binomial Probability Mass Function (PMF) and Cumulative
Distribution Function (CDF). Small overhead, easy to implement, very useful.
More here:

[http://arachnoid.com/binomial_probability](http://arachnoid.com/binomial_probability)
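For reference, both functions are tiny in pure Python. This is a sketch using math.comb (added in Python 3.8; older versions can use factorials instead), not the linked page's implementation:

```python
from math import comb

def binomial_pmf(k, n, p):
    # Probability of exactly k successes in n independent trials.
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_cdf(k, n, p):
    # Probability of at most k successes.
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

binomial_pmf(5, 10, 0.5)   # 0.24609375
binomial_cdf(10, 10, 0.5)  # 1.0
```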

~~~
enalicho
Permutations and combinations already exist within the itertools module.

~~~
lutusp
> Permutations and combinations already exist within the itertools module.

Not exactly. Given argument sequences, itertools provides iterators over the
original elements permuted and combined, but it doesn't provide numerical
results for numerical arguments, as shown here:
[http://arachnoid.com/binomial_probability](http://arachnoid.com/binomial_probability)

I was referring to permutation and combination mathematical functions, not
generator functions.

~~~
megrimlock
Put differently, you want the functions that count the number of permutations
and combinations possible, not functions that yield/generate the actual
permutations and combinations.
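The distinction in code, as a sketch using the counting functions that did eventually land in the math module (Python 3.8+; at the time of this thread you'd write them with factorials):

```python
import itertools
from math import comb, perm

# Counting: cheap even for large arguments, thanks to Python's big integers.
comb(52, 5)  # 2598960 distinct five-card hands
perm(52, 5)  # 311875200 ordered five-card deals

# Generating: itertools yields the arrangements themselves, so counting this
# way means materializing every single one of them.
sum(1 for _ in itertools.combinations(range(52), 5))  # also 2598960, far slower

# For large arguments the generator approach is simply infeasible:
comb(1000, 500)  # a 300-digit integer, computed instantly
```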

~~~
lutusp
Yes, exactly, for a number of reasons including the problem of large arguments
and results.

------
aristus
About damned time. Writing your own stats library is like writing your own
crypto.

~~~
aidos
You wouldn't write your own - numpy / scipy have everything you'll need.

~~~
dalke
Some of the things I need are:

- fewer dependencies for my package

I've written the average() and standard_deviation() functions at least a
couple of dozen times, because it doesn't make sense to require numpy in order
to summarize, say, benchmark timing results.

- reduced import time

NumPy and SciPy were designed with math-heavy users in mind, who start Python
once and either work in the REPL for hours or run non-trivial programs. They
were not designed for light-weight use in command-line scripts.

"import scipy.stats" takes 0.25 second on my laptop. In part because it brings
in 439 new modules to sys.modules. That's crazy-mad for someone who just wants
to compute, say, a Student's t-test, when the implementation of that test is
only a few dozen lines long. (Partially because it depends on a stddev() as
well.)

Sure, 0.25 seconds isn't all _that_ long, but that's also on a fast local
disk. In one networked filesystem I worked with (Lustre), the stat calls were
so slow that just starting python took over a second. We fixed that by
switching to zip import of the Python standard library and deferring imports
unless they were needed, but there's no simple solution like that for SciPy.

- less confusing docstring/help

Suppose you read in the documentation that the Student's t distribution is
implemented as scipy.stats.t.

    
    
        >>> import scipy.stats
        >>> scipy.stats.t
        <scipy.stats.distributions.t_gen object at 0x108f87390>
    

It's a bit confusing to see scipy.stats.distributions.t_gen appear, but okay,
it's some implementation thing.

Then you do help(scipy.stats.t) and see

    
    
        Help on t_gen in module scipy.stats.distributions object:
        
        class t_gen(rv_continuous)
         |  A Student's T continuous random variable.
         |  
         |  %(before_notes)s
         |  
            ...
         |  
         |  %(example)s
    
    

Huh?! What are %(before_notes)s and %(example)s?

The answer is that scipy.stats auto-generates many of the distribution
objects, including things like their docstrings. Only, help() gets confused
about that, because help() uses the class docstring while SciPy modifies the
generator instance's docstring. To see the correct docstring you have to
print it directly:

    
    
        >>> print scipy.stats.t.__doc__
        A Student's T continuous random variable.
        
            Continuous random variables are defined from a standard form and may
            require some shape parameters to complete its specification.  Any
            optional keyword parameters can be passed to the methods of the RV
            object as given below:
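
The "fewer dependencies" point above is easy to see in code: a dependency-free mean and standard deviation of the kind the PEP proposes fit in a dozen lines. This is a sketch, not the PEP's reference implementation:

```python
import math

def mean(data):
    # Arithmetic mean of a non-empty sequence of numbers.
    data = list(data)
    if not data:
        raise ValueError("mean requires at least one data point")
    return sum(data) / len(data)

def pstdev(data):
    # Population standard deviation; divide by (n - 1) instead of n
    # for the sample standard deviation.
    data = list(data)
    mu = mean(data)
    return math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))

mean([2.5, 3.25, 5.5])   # 3.75
pstdev([1.0, 2.0, 3.0])  # ≈ 0.8165
```

Enough to summarize benchmark timings, with zero third-party imports.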

~~~
cdavid
scipy.stats distribution objects are a bit particular; it's a bit unfair to
single them out.

Generally, numpy and scipy have much better docstrings than python stdlib
itself.

~~~
dalke
Well, help(scipy.optimize.nonlin.Anderson) has the same problem, but you're
right that that failure mode is rare and that numpy/scipy have good
documentation. However, in the context of a stats library, I think it's okay
to point out that scipy.stats has some annoying parts. ;)

In all honesty, I seldom use NumPy and rarely use SciPy, so I can't judge that
deeply. I know that when I read their respective code bases I get a bit
bewildered by the many "import *" and other oddities. It doesn't feel right to
me. I know the reason for most of the choices - to reduce API hierarchy and
simplify usability for their expected end-users - but their expectations don't
match mine.

So I looked at more of the documentation. I started with
scipy/integrate/quadpack.py. The docstring for quad() says, in essence, "this
docstring isn't long enough, so call quad_explain() to get more
documentation." I've never seen that technique used before. The Python
documentation says "see this URL" for those cases.

Again, this is a difference in expectations. I argue that NumPy and Python
have different end-users in mind. Which is entirely reasonable - they do! But
it means that it's very difficult to simply say "add numpy to part of the
standard library."

There's also a level of normalization that I would want should numpy be part
of the standard library. For example, does out-of-range input raise
ValueError or RuntimeError? scipy/ndimage/filters.py does both, and I don't
understand the distinction between one and the other.

Now, in the larger sense, I know the history. RuntimeError was more common in
Python, and used as a catch-all exception type. Its existence in numpy
reflects its long heritage. It's hard to change that exception type because
programs might depend on it.

But it means that integrating all of numpy into the standard library is not
going to work: either it breaks existing numpy-based programs, or the merge
inherits a large number of oddities that most Python programmers will not be
comfortable with.

~~~
cdavid
Actually, I don't think the import * in numpy is anything other than a
historical artefact. Numpy just happens to be one of the oldest still-widely-
used Python libraries (considering numpy started as Numeric), as you point
out. As for import speed, have you considered using lazy imports in your
script?

I don't see numpy being integrated into Python anytime soon. I don't think it
would bring much, and one would have to drop the performance enhancements
that rely on BLAS/LAPACK.

I think installing has improved a lot, and once pip + wheel matures, it should
be easy to pip install numpy on windows.
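The lazy-import suggestion looks like this in practice. A generic sketch, with the lightweight json module standing in for a heavyweight import like numpy or scipy.stats:

```python
def load_config(text):
    # The import runs only when the function is first called, so scripts
    # that never take this code path skip the module-load cost entirely.
    # Python caches the module in sys.modules, so repeat calls are cheap.
    import json
    return json.loads(text)

load_config('{"threshold": 0.25}')  # {'threshold': 0.25}
```

This helps command-line scripts, though as noted above it can't recover the time a library spends doing work at import time.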

~~~
dalke
I've asked on the numpy mailing list. The "import *" was a design decision,
now irrevocable without breaking existing code.

For example, from [http://mail.scipy.org/pipermail/numpy-discussion/2008-July/036228.html](http://mail.scipy.org/pipermail/numpy-discussion/2008-July/036228.html):

Robert Kern: Your use case isn't so typical and so suffers on the import time
end of the balance

Stéfan van der Walt: I.e. most people don't start up NumPy all the time --
they import NumPy, and then do some calculations, which typically take longer
than the import time. ... You need fast startup time, but most of our users
need quick access to whichever functions they want (and often use from an
interactive terminal).

I went back to the topic last year. Currently 25% of the import time is spent
building some functions which are then exec'ed. At every single import. I
contributed a patch, which has been hanging around for a year. I came back to
it last week. I'll be working on an updated patch.

There's also about 7% of the startup time because numpy.testing imports
unittest in order to get TestCase, so people can refer to
numpy.testing.TestCase. Even though numpy does nothing to TestCase and some of
numpy's own unit tests use unittest.TestCase instead. _sigh_. And there's
nothing to be done to improve that case.

Regarding the age - yes, you're right. BTW, parts of PIL started in 1995,
making it the oldest widely used package, I think. Do you know of anything
older?

------
zokier
Reminds me of the story that made the rounds here a couple of years ago: The
Python Standard Library - Where Modules Go To Die

[https://news.ycombinator.com/item?id=3913182](https://news.ycombinator.com/item?id=3913182)

------
bachback
Nice proposal. I think the problem is numpy itself. If you could just do pip
install numeric_package then nobody could complain. I don't quite understand
why a package has to depend on LAPACK. I will probably switch to julia-lang,
because numpy is (at least for me) not that great to work with.

~~~
cdavid
Numpy does not depend on LAPACK, only scipy does (numpy only uses BLAS if you
have one installed; it is optional).

The reason why scipy (and Julia BTW) need blas/lapack is because that's the
_only_ way to have decent performance and reasonably accurate linear algebra.
The alternative is writing your own implementation of something that has been
used and debugged for 30 years, which does not seem like a good idea.

~~~
bachback
This is what I don't understand at all. Imagine somebody in a different area
of computing saying: oh, we solved that 30 years ago and now there is no room
for improvement at all. Why can't this be done at least in C?

~~~
cdavid
There is room for improvement, and it gets improved all the time (e.g.
openblas is a recent contender). LAPACK is essentially an API for linear
algebra, which is what allowed people to improve implementations and to
benefit from them in older programs.

Think of it as the C library of numerical computing.

~~~
mistercow
I'm confused how a library written in Fortran 90 can be called "the C library
of numerical computing".

~~~
cdavid
I was comparing LAPACK to the C library: nobody claims it is weird to use the
C library, written more than 30 years ago, instead of something 'more
modern'. Everybody uses the C library for system/low-level programming, and
the same is true in numerical computing w.r.t. BLAS/LAPACK. NumPy/SciPy use
it, R uses it, Julia uses it, MATLAB uses it, Octave uses it.

------
andrewflnr
I'm in favor. I was surprised and annoyed to find there wasn't a standard
library for doing excel-level statistics. If you throw basic least-squares
linear regression in there too, I can eliminate Excel from my physics classes.
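Basic least squares is indeed small enough to qualify. A pure-Python sketch of the closed-form fit for y = a + b*x (an illustration, not part of the PEP as proposed):

```python
def linear_fit(xs, ys):
    # Ordinary least squares for y = a + b*x; returns (intercept, slope).
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of >= 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)           # sum of squares in x
    sxy = sum((x - mean_x) * (y - mean_y)              # cross products
              for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

linear_fit([1, 2, 3, 4], [2.1, 4.0, 6.2, 7.9])  # ≈ (0.15, 1.96)
```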

------
bthomas
One side effect is that this would accelerate the adoption of Python 3 in the
scientific community.

------
rev
Kudos to PHP for apparently being ahead of the curve among dynamic languages
with regard to statistics. Another interesting, yet unmentioned option is
Clojure/Incanter.

~~~
scribu
I upvoted this comment, then I realized that those PHP stats functions aren't
in the standard library. They're in a PECL extension (equivalent to Python C
extensions).

------
tvst
No mention of Pandas?

[http://pandas.pydata.org/](http://pandas.pydata.org/)

~~~
goronbjorn
They probably left out pandas because it depends on numpy, and also because
of this point:

> For many people, installing numpy may be difficult or impossible.

That's just as true, and arguably more so, for pandas.

~~~
maaku
Although the solution, imho, is to make numpy easier to install across many
platforms (ideally `pip install numpy pandas` should just work).

~~~
dagw
There are two types of potential limitations to installing software, technical
and policy. Just solving the technical won't necessarily make it easier to
install numpy/pandas on a users computer from a policy perspective. Being able
to rely on basic statistics functions being available in a default python
install would be nice.

------
bayesianhorse
I'm against this. Either you have to create a new statistics module or you
have to include numpy/pandas/statsmodels in the standard library. In either
case it would essentially freeze the modules against further development
outside the Python release cycle...

------
Demiurge
I would like having these simple functions, but I think they could just go
into the 'math' library.

------
matiasb
Nice idea.

