

No, shut up. What statistical programming languages can learn from Dropbox. - erehweb
http://erehweb.wordpress.com/2011/02/19/no-shut-up-what-statistical-programming-languages-can-learn-from-dropbox/

======
scott_s
The code that makes him say "what a mess" is, I think, beautiful:

    
    
      from itertools import groupby
      from operator import itemgetter
      
      def summary(data, key=itemgetter(0), value=itemgetter(1)):
        for k, group in groupby(data, key):
          yield (k, sum(value(row) for row in group))
    

Perhaps that's because I'm a programmer, and Python is a general purpose
programming language. But I think that's what his complaint boils down to: the
Python statistical code looks too much like _Python_. Which, yeah, it does.
Python is a general purpose programming language, not a domain specific
language for statistical programming.

However, I don't think the programming concepts one needs to understand to
make effective use of a well-designed Python library are too much to ask. I've
only dabbled in R, but when I did, it required me to exercise my general
programming knowledge to understand lists, matrices, and functions. I think
the author is also falling into the trap of assuming that what is obvious to
him is obvious to everyone. I'm actually not sure what the SAS code is doing,
and much prefer the Python.
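For context, here is the recipe run on some made-up sales rows (region, amount). One caveat worth knowing: groupby() only merges adjacent rows, so the input has to be sorted by the key first.

```python
from itertools import groupby
from operator import itemgetter

def summary(data, key=itemgetter(0), value=itemgetter(1)):
    # groupby() only groups adjacent rows; data must be sorted by key
    for k, group in groupby(data, key):
        yield (k, sum(value(row) for row in group))

rows = [("East", 100), ("East", 50), ("West", 75)]
print(list(summary(rows)))  # [('East', 150), ('West', 75)]
```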

~~~
kenjackson
_Perhaps that's because I'm a programmer, and Python is a general purpose
programming language._

Exactly. You shouldn't have to be a programmer to do statistics. Just like you
shouldn't have to be a network engineer to share files. What if DropBox had
stuff in there about http, ports, levels of service, bandwidth, etc... You'd
probably say, "Great! I always wanted to specify that DropBox use SSL4.7 draft
B over CDMA EvoX.1 -- who wouldn't?"

When you're designing a DSL, make it as simple as possible. And if you have
time, in v2, give it hooks to just break out and do crazy stuff... but the 90%
case should be simple as pi.

~~~
_delirium
There are GUI statistics apps for people who just want the common case,
Dropbox-style: packages like Weka for data mining / predictive statistics,
SPSS for descriptive statistics, and a dozen other such things.

The statisticians who choose to use a programming language like R or Python
typically do it because they actually _do_ want a programming language. I
mean, that's why Bell Labs statisticians invented S (R's predecessor) to begin
with.

~~~
ltjohnson
I am a statistician that does both research and applied work.

I use R for three reasons: (1) It's Free Software; (2) It's a programming
language; (3) Other statisticians use it so it's easier for me to collaborate.

There are the usual supporting arguments for (1). (2), I've only used SAS a
little bit, and it was extremely unpleasant to use it for non-built-in stuff,
which makes research harder for no good reason. For (3), I have nothing
against Python but most other statisticians don't use it. If I want to share
my work in R, it's easy (statisticians know how to install R packages). If I
want to share my work in Python, I first have to teach [most] other
statisticians how to use Python. There's nothing wrong with that, but why
raise the start-up cost for them?

tl;dr I conjecture that most statisticians don't want what the author is
suggesting. Also, there are plenty of companies that are trying to do what the
author is asking for, but most of them seem to miss the desired sweet spot, or
charge lots of money, or both. I haven't taken a survey of the available
software in quite some time.

~~~
crocowhile
Can you recommend a book to get started with R?

~~~
ltjohnson
No, but I can give some suggestions. It would help to know what you want to
do.

First of all, you need to decide if you want a language reference, or an
application guide, as R books fall into those two categories.

If you have a specific type of work in mind (bio-informatics, data mining,
data visualization, ...) I'd say to find a book that focuses on that topic. I
haven't looked in a while, but I haven't seen a general R book that I like;
anything I suggest there would be guessing on my part.

There are plenty of good references on the web. I'd start by looking at the
material available from the R web site:

R's core manuals [1] are typically correct and reasonable to use. The
"Introduction to R" guide will get you up to speed fairly well if you already
know another programming language. There is also the contributed documentation
[2]. I haven't gone through these, so I can't say much about them, or promise
that they are up-to-date. I suspect not, as R develops rapidly. The one
reference I can _recommend highly_ is "The R Inferno" by Patrick Burns [3].
This is not a starter guide, but something you read after one. It gives
excellent advice on avoiding common pitfalls in R.

[1] <http://cran.r-project.org/manuals.html>

[2] <http://cran.r-project.org/other-docs.html>

[3] <http://www.burns-stat.com/pages/Tutor/R_inferno.pdf>

~~~
crocowhile
Thanks. I do biology with a limited amount of data and my needs are very
basic. Here is some software I wrote to do sleep analysis in Drosophila:
<http://www.pysolo.net>

So far I could satisfy most of my statistics needs with the functions in numpy
and scipy, but occasionally I need to do something slightly more fancy, and I
guess R is the way to go.

~~~
ltjohnson
Possibly. R is really great at doing "fancy" statistical analyses. It's very
lousy at things like text manipulation. When I have a project that needs text
manipulation on the front end, I frequently use other tools (Python, vi, sed,
...) to beat the data into a nicer form for R. I couldn't say without knowing
more about your project.
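As a tiny illustration of that front-end step (the file name, column names, and data are all made up): strip comment lines from a whitespace-separated log and emit a CSV that R's read.csv() can ingest directly.

```python
import csv

# Hypothetical cleanup: drop comment lines and turn whitespace-
# separated records into a CSV for R's read.csv().
raw_lines = [
    "# sleep bouts, minutes",
    "fly1   12  30  45",
    "fly2   7   22  18",
]

with open("clean.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "bout1", "bout2", "bout3"])
    for line in raw_lines:
        if line.startswith("#"):
            continue  # skip comments
        writer.writerow(line.split())
```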

------
jhamburger
To be honest I find the whole repeated "no, shut up" thing to be a bit crass
and it makes me unsympathetic if anything. I hope this doesn't become a
catchphrase in blogs.

~~~
apu
I thought it was perfect in the original dropbox quora post, but I agree that
it doesn't quite fit here, and there is certainly a danger of it becoming a
meme.

~~~
dkarl
The original post wasn't insightful at all. That arrogant, know-it-all
attitude is not how DropBox got their interface right. They got their
interface right through careful attention to their users, by being humble
enough to trust the user data and throw away features they had thought would
be useful.

EDIT: Downvoted, great. This must be the ultimate triumph of snark: we are now
perpetuating the myth that common sense and a sassy attitude is how DropBox
created a breakthrough product, instead of careful beta testing and analysis
of usage data.

------
dkarl
If 90% of usage boils down to a small number of rigid patterns, then there is
a simple solution: a handful of convenience functions. Often these functions
are missing, because the demand for convenience functions is obscured by the
fact that every experienced user defined them for himself years ago. That
forces newbies to suffer through the unnecessary task of understanding the
fully generalized API before they can accomplish simple tasks.

Languages that have good support for optional arguments, such as Python and
Lisp, also make it possible to create APIs that are elegant and concise for
experts but extremely intimidating for beginners. It may be more elegant to
have a single function with a slew of optional arguments, and an experienced
user may be able to accomplish any task quite concisely by specifying a few
arguments, but a beginner would be better served by a handful of specific
functions with specific names. API writers should consider providing those
functions as simple wrappers to the general API, in order to provide a simpler
learning curve for users who might never need more complex functionality.
Examining how those wrapper functions are implemented can help intermediate
users figure out the general API, too.
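A hypothetical sketch of that idea (every name here is invented): one general aggregate() with a slew of optional arguments, plus two named convenience wrappers that a beginner can use without reading the full signature.

```python
from collections import defaultdict

# General API: flexible, but its signature intimidates beginners.
def aggregate(rows, key_index=0, value_index=1, func=sum, sort_keys=True):
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[value_index])
    keys = sorted(groups) if sort_keys else list(groups)
    return [(k, func(groups[k])) for k in keys]

# Convenience wrappers: specific names, nothing to configure.
def total_by_group(rows):
    return aggregate(rows, func=sum)

def count_by_group(rows):
    return aggregate(rows, func=len)

rows = [("East", 100), ("West", 75), ("East", 50)]
print(total_by_group(rows))  # [('East', 150), ('West', 75)]
print(count_by_group(rows))  # [('East', 2), ('West', 1)]
```

The wrapper bodies are one line each, so an intermediate user who opens them up immediately sees how the general API is meant to be called.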

~~~
hyperbovine
Exactly. It's trivial to write a prettied-up interface to those Python
functions that would make as much sense (?) as PROC MEANS. Good luck trying to
extend SAS to do anything the designers didn't implement as a procedure,
though. Having had to navigate through a complex SAS macro or two in my day, I
can assure you that it was the single worst experience I have had in 20 years
of programming.

------
btilly
Actually in my experience business users want one thing on that list, and if
they have that thing they don't care about whether you provide the other two.
They won't ask for what they want because they don't know that they can get
it. But they will be happy if they get it.

They want to get at nicely organized data easily from inside of Excel. Excel
is a toolbox that they already know from which they can do their own pivot
tables and graphs. And they'd prefer to do that because then they can just do
it instead of less efficiently having someone else do it for them.

They want it to arrive nicely aggregated and organized, since Excel is not
very good at that. But they are more than happy to do the pretty reports
themselves. Just get them the data.

See [http://bentilly.blogspot.com/2009/12/design-of-reporting-
sys...](http://bentilly.blogspot.com/2009/12/design-of-reporting-system.html)
for a more detailed description of one set of experiences that taught me that.

------
tom_b
What Dropbox does is eliminate the programmer from the equation. You don't
need your IT staff to do any special setup or to follow a special process.

I think the truth is business users don't want to work with you - they just
want their data relationships to be discovered in a simple and intuitive way.
If your data crawling is sufficiently good, maybe you can do that.

I understand the point the article is making, but I feel rather strongly that
"(y)ou’ll want a third thing – to read in and parse data" translates to the
person that builds a tool that does that nicely and _automatically_ creates
pivot tables, graphs, and other dashboard-y things will probably have to hire
a team to shovel the money off so he/she can breathe.

I suppose it's an OK example to talk about this re: statistical programming
languages, but in my own experience the three requirements that preface the
whole discussion (pivot tables, graphs, data parsing) are a big example of
something just screaming for a new solution, not a new prog language . . .

------
skept
There's a nice NumPy-based Python package called Tabular
(<http://www.parsemydata.com/tabular/index.html>) that makes this super easy:

    
    
      import numpy as np
      import tabular
      
      # CSV with Region, City and Sales columns
      data = tabular.tabarray(SVfile = 'data.csv')
      
      # Calculate the total sales within each region
      summary = data.aggregate(On = ['Region'], AggFuncDict = {'Sales':np.sum}, AggFunc = len)
      summary.saveSV('summary.csv')

------
cawhitworth
Last I checked, though, Python wasn't a statistical programming language.

------
elmarks
I can't speak for the creators of R, but I have a strong suspicion that it
wasn't intended for erehweb or MBAs. It's not Microsoft Excel, it's R. It's
used by PhD researchers in Mathematics, Statistics, Economics and Political
Science.

------
Duff
Maybe I'm just a simpleton, but it seems odd to attack something designed to
make statistics analysis easy for statisticians, because it doesn't meet the
needs of mid-level managers.

Managers and other folks who need to make pivot tables, graphs and related
things without programming have a great tool to do that: Excel. For these
people, Excel is Dropbox.

------
raymondh
> People don’t use that crap. But they do want pivot tables, ...

That's what the cookbook recipe provides, a function called summary() that
makes a pivot table. Problem solved :-)

> I should be clear that my complaint is with Python rather than the code as
> such.

There are plenty of ways to write the summary() function with plain, straight-
forward Python code that doesn't use generators, itertools, or any other
advanced feature.

So, why does the recipe author use itertools? Because itertools provides a way
to get C speed without having to write extension modules. Had the author used
map() instead of a generator expression, the inner loop would run entirely at
C speed (with no trips around the Python eval loop):

    
    
      for k, group in groupby(data, key):
          yield k, sum(map(value, group))
    

I think it's wonderful that a two-line helper function is all it takes to
implement pivot tables efficiently.
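Side by side (sample data assumed), the two spellings produce identical results; the map() version just keeps the summation loop in C:

```python
from itertools import groupby
from operator import itemgetter

def summary_genexpr(data, key=itemgetter(0), value=itemgetter(1)):
    # generator-expression version from the recipe
    for k, group in groupby(data, key):
        yield k, sum(value(row) for row in group)

def summary_map(data, key=itemgetter(0), value=itemgetter(1)):
    # map(value, group) runs the inner loop at C speed
    for k, group in groupby(data, key):
        yield k, sum(map(value, group))

data = [("East", 100), ("East", 50), ("West", 75)]
assert list(summary_genexpr(data)) == list(summary_map(data))
```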

------
freyrs3
> “What about tuples, lambda functions, generator objects?” No, shut up.
> People don’t use that crap.

I use those every day.

------
patio11
Despite "No, shut up." being in the title, it does not add anything to the
submission, and should be a _strong hint_ that this is not HN material.

------
URSpider94
There are quite a few really nice tools out there for this kind of data
analysis (generally called Business Intelligence, or BI for short). None of
them that I have found are procedural, they are all based on interactive
dashboards. I'm sure that there is some scripting or XML formatting required
behind the scenes to get the system set up to accept data, but after that it's
all point-and-click.

The systems that I've seen/evaluated are Needlebase, Birst and Spotfire. None
of them are particularly cheap, but if you're in a business where real-time
access to data would help your team make better decisions, they could be very
valuable.

------
vinyl
For the record, the Lisp dialect mentioned in the comments is
<http://lush.sourceforge.net/>. With respect to the discussion, it is on the
R/Python side, i.e. a powerful general-purpose (Lisp) language with built-in
statistics facilities.

------
btcoal
Anytime I hear a discussion about designing computational tools for "non-
programmers" I'm reminded of the subway in Mexico City, Mexico. The subway
stops have nicely detailed pictures that are descriptive of the locations
around the stops. This is because many people are illiterate. It's about time
people realized that programming is literacy.

Also, leave Python alone!

------
momotomo
Kirix Strata looks like the Dropbox equivalent for this space
(<http://www.kirix.com/>). All it does is suck in data -> create relationships
-> pivot, graph, and report. The people I know who use it swear by it because
it fills one small gap instead of many.

------
sapuser
R syntax is not too bad:

    
    
      # group by Species (INDICES can be multivalued, see ?by)
      # sum Sepal.Length and Sepal.Width; mean Petal.Length and Petal.Width
      by(data = iris, INDICES = iris$Species, FUN = function(x) {
        y <- colSums(x[, 1:2])
        z <- colMeans(x[, 3:4])
        list(y, z)
      })

------
huyegn
I've been using <http://tablib.org/> for a while now to read in tabular data.
With this summary fn and a few other functions to simplify aggregating data
into useful views, I think you've got a winning solution to the author's
complaint.

------
molecule
funny, Dropbox makes it happen w/ Python:

"Dropbox uses Python on the client-side and server side as well. This talk
will give an overview of the first two years of Dropbox, the team formation,
our early guiding principles and philosophies, what worked for us and what we
learned while building the company and engineering infrastructure. It will
also cover why Python was essential to the success of the project and the
rough edges we had to overcome to make it our long term programming
environment and runtime."

[http://us.pycon.org/2011/blog/2011/02/07/pycon-2011-announci...](http://us.pycon.org/2011/blog/2011/02/07/pycon-2011-announcing-
startup-stories/)

------
Zeelar
There's also Tableau (<http://www.tableausoftware.com/>) for people interested
in just pivoting data and charting. It's kind of expensive and PC-only, but it
serves that function well.

As a Statistician, I used to use SAS, Stata, R, Excel and of course SQL to
extract data but for the purposes of pretty, pretty charts, Tableau is king.

------
badwetter
Basically you're saying do the KISS thing ...

------
Pooter
No, shut up. Python isn't a statistical programming language.

------
micah63
"No, shut up. People don’t use that crap." hahaha, ahhh, I love it...

