
Show HN: st – simple statistics from the command line - nferraz
https://github.com/nferraz/st
======
imurray
For casual purposes st may be convenient, but it doesn't have state-of-the-art
numerical stability:

    
    
        my $variance = $count > 1 ? ($sum_square - ($sum**2/$count)) / ($count-1) : undef;
    

Taking the difference between two similar numbers loses precision, and in
extreme cases squaring the raw numbers could cause overflow. For comparison,
see the recently posted:
[http://www.python.org/dev/peps/pep-0450/](http://www.python.org/dev/peps/pep-0450/)
and
[https://en.wikipedia.org/wiki/Algorithms_for_calculating_var...](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance)
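To make the cancellation concrete, here is a small Python sketch (Python rather than Perl, since PEP 450 is linked above) comparing the naive single-pass formula quoted from st with a two-pass computation; the data set is invented for illustration:

```python
import statistics

# Data with a large mean and a tiny spread: the classic failure case
# for the sum-of-squares formula. True sample variance is 30.
data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]
n = len(data)

# Naive single-pass formula: (sum_sq - sum**2/n) / (n - 1).
# The two terms agree in their leading ~18 digits, so the
# subtraction cancels away essentially all the precision.
s, sq = sum(data), sum(x * x for x in data)
naive = (sq - s * s / n) / (n - 1)

# Two-pass formula: subtract the mean first, then square.
m = s / n
two_pass = sum((x - m) ** 2 for x in data) / (n - 1)

print(naive)                      # wildly off (cancellation)
print(two_pass)                   # 30.0
print(statistics.variance(data))  # 30.0
```

Shifting the same spread down to a small mean (e.g. `4, 7, 13, 16`) makes all three agree, which is why the naive formula looks fine in casual testing.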

~~~
Doches
If you need numerical stability, I'd use Gary Perlman's
[|STAT](http://oldwww.acm.org/perlman/stat/history.html). It's older and
somewhat harder to get a copy of, but it's as reliably correct as a piece of
software can be...

~~~
imurray
I've just had a look at the |STAT source, and it computes the variance with

    
    
            double  M       = Sum/N;                /* mean */
            double  var     = (s2 - M*Sum)/(N-1);   /* variance */
    

where s2 is the sum of squares. In most reasonable situations this approach
works fine; it just can't handle as wide a range of inputs as ordinary
floating-point doubles would easily allow. In fairness, the |STAT terms and
conditions state:

|STAT PROGRAMS HAVE NOT BEEN VALIDATED FOR LARGE DATASETS, HIGHLY VARIABLE
DATA, NOR VERY LARGE NUMBERS.

------
Sprint
I'd just use octave. It's as simple as

    
    
      $ octave
      octave:1> a=load('numbers.txt');  
      octave:2> sum(a)
      ans =  55
      octave:3> mean(a)
      ans =  5.5000
      octave:4> std(a)
      ans =  3.0277
      octave:5> quantile(a)
      ans =
      
          1.0000
          3.0000
          5.5000
          8.0000
         10.0000
    

etc

~~~
nferraz
I like octave and R!

The reason I wrote this script was to get quick results from the command line.

For instance: I could use grep, cut and other unix tools to get the numbers
from a file and make quick calculations.

Of course, for complex processing I would use octave or R.

~~~
Sprint
Yeah, I was thinking about that and spent the past few minutes making myself
some Bash functions like:

    
    
      function mean() {
              octave -q --eval "mean = mean(load('$1'))"
      }
    

Then just run "mean numbers.txt".

I am sure your approach is much quicker; octave takes a good 0.5s(!) to load
on my machine.

~~~
nferraz
Yup, octave requires more time to warm up.

Regarding speed: for simple calculations like sum, mean and variance, the
bottleneck is I/O.

------
philsnow
One suggestion: whatever the default may be, give an option to have line-
delimited output rather than column-delimited.

IMHO if you want your script's output to be easily usable by other scripts,
line-delimited is easier since you can grep out what lines you want rather
than having to rely on the column position never changing (since you can give
cut only a field number and not a field name like "average").
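A toy Python sketch of the difference (both output formats here are invented for illustration; they are not st's actual output):

```python
# Hypothetical columnar output: one header row, one value row.
columnar = "N      sum    mean   sd\n" \
           "10     55     5.5    3.0277\n"

# Hypothetical line-delimited output: one "name value" pair per line.
linewise = "N 10\nsum 55\nmean 5.5\nsd 3.0277\n"

# Column-delimited: the consumer must hard-code a field position
# (here field 3), which silently breaks if columns are reordered.
mean_by_position = columnar.splitlines()[1].split()[2]

# Line-delimited: the consumer selects by name, exactly like
# piping through `grep '^mean '`.
mean_by_name = next(line.split()[1]
                    for line in linewise.splitlines()
                    if line.startswith("mean "))

assert mean_by_position == mean_by_name == "5.5"
```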

------
sprayk
suckless' terminal emulator already uses the name st, though it's not quite
popular enough to be in any major repos.

[http://st.suckless.org/](http://st.suckless.org/)

~~~
nferraz
Thanks for the information!

I wanted to use "stat", but that name was taken (it displays file status);
"statistics" was too long.

Just as a curiosity: I got the idea for this script when I wanted to calculate
the sum of some numbers and discovered that the "sum" command was already
taken for another purpose (displaying file checksums and block counts)!

~~~
Kliment
sta seems to be available

------
riffraff
Maybe you are not aware of it, but there is a nifty little tool in FreeBSD
called ministat that somewhat overlaps with what you did. It may be of
interest:

[http://www.freebsd.org/cgi/man.cgi?query=ministat&apropos=0&...](http://www.freebsd.org/cgi/man.cgi?query=ministat&apropos=0&sektion=0&manpath=FreeBSD+8-current&format=html)

~~~
codemac
EXACTLY!

I ported this tool to Linux for the Arch Linux package ages ago:

[https://github.com/codemac/ministat](https://github.com/codemac/ministat)

There are a few forks (adding autoconf, an osx branch, etc) as well.

------
pjungwir
Lovely! As someone with my own little script to sum the values in a given
column, I can see how you'd want this tool sitting ready to hand in ~/bin or
wherever. And this script seems to adhere to the Unix way better than mine:
it's easy to use cut(1) to extract whatever column you want, and it makes
sense for one tool to do sum, mean, sd, etc. Thanks for sharing!

------
fsiefken
Nice, is there a maximum rowcount? What would be nice is a way to do a sum on
a second or third column - or would you use awk to get those and pipe the
result to st?

~~~
nferraz
There isn't a max rowcount for sum, mean, variance, etc, because it is not
necessary to hold the data in memory.

Calculating the median and quartiles requires that the whole data set be
stored and later sorted, so those are limited by the available memory.
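The streaming approach described above can be sketched like this in Python, using Welford's online algorithm for the variance (illustrative only, not st's actual code); memory use is O(1) regardless of how many rows the stream yields:

```python
def running_stats(xs):
    """Single-pass mean and sample variance (Welford's algorithm).

    Only three accumulators are kept, so there is no row-count limit
    beyond the length of the stream itself; it also avoids the
    cancellation problems of the sum-of-squares formula.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # uses the updated mean
    var = m2 / (n - 1) if n > 1 else None
    return n, mean, var

# Works on any iterable, e.g. numbers parsed line-by-line from stdin.
n, mean, var = running_stats(iter(range(1, 11)))
print(n, mean, var)  # n=10, mean=5.5, var ≈ 9.1667
```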

Regarding your suggestion -- I'm considering the idea of dealing with multiple
columns and even CSV and other types of tabulated data.

------
jldugger
So, [http://suso.suso.org/programs/num-utils/index.phtml](http://suso.suso.org/programs/num-utils/index.phtml)
already exists and is written in Perl. It seems like your major contribution
is the statistical slant, which might be compatible with the existing code
base.

edit: it's perl, not python. brainfart on my part.

------
montecarl
I wrote a program to quickly generate histograms from data. It seems like it
would complement "st" nicely for quick command-line stats calculations.

[https://github.com/SamChill/hist](https://github.com/SamChill/hist)

------
hnriot
why not write a one-line Python program instead? I'd never use the shell for
these kinds of things; they quickly grow into more than a simple one-liner can
handle. Before you know it you're reading from a CSV and summing column "foo",
and your shell approach turns into a mess instead of a (now 5-line) Python
program.
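A sketch of the kind of short Python program meant here; the column name "foo" comes from the comment, and the CSV is an inline stand-in for a real file:

```python
import csv
import io

# Inline stand-in for open("data.csv"); any file object works the same.
data = io.StringIO("foo,bar\n1,x\n2,y\n3,z\n")

# DictReader selects the column by header name, so reordering or
# adding columns doesn't break the sum.
total = sum(float(row["foo"]) for row in csv.DictReader(data))
print(total)  # 6.0
```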

