How to share data with a statistician (github.com/jtleek)
70 points by shrikant on Nov 25, 2013 | 40 comments

If you liked my guide to sharing data, you may also like my guides:

For writing R packages: https://github.com/jtleek/rpackages

For writing scientific reviews: https://github.com/jtleek/reviews

Or on the future of stats: https://github.com/jtleek/futureofstats

Before all that, you need to choose your statistician. My advice is: resuscitate E.T. Jaynes. Failing that, find one of his disciples. Failing that, find a Bayesian. Failing that, read Probability Theory: the Logic of Science, and just do the analysis yourself. Failing that, maybe a Frequentist statistician will do. Maybe.

And your Bayesian statistician will acquire their priors how exactly?

Bayesian statistics can be very powerful, but it would be a terrible idea to prefer Bayesian approaches to Frequentist ones in all situations.

> And your Bayesian statistician will acquire their priors how exactly?

As if Frequentists somehow didn't need priors. Everyone starts with prior knowledge. We might as well use it. Or do you advocate not using every scrap of knowledge available to you? That would be stupid.

Sure, prior knowledge can be shaky, or difficult to justify. But at least, a Bayesian will be explicit about it, instead of, like, sweeping normal probability distribution assumptions under the linear regression rug.

> it would be a terrible idea to prefer Bayesian approaches to Frequentist ones in all situations.

Name three examples that don't involve the Frequentist using better prior information than the Bayesian.

By the way, Bayesians know that using probability theory correctly is sometimes intractable (combinatorial explosion and all that). In those cases, they will use approximations. But at least, they will know it's an approximation.


You really should read chapters 1 and 2 of Probability Theory: the Logic of Science. They give a good feel of why Bayesians are correct as a simple matter of fact.

Here are two examples:

I'm testing the effectiveness of a drug. Drugs of this class have a certain likelihood of working, the noise in my data is known, the experimental group did this much better than the control... does the drug really work? So far so trivial, in either Bayesianism or Frequentism. Now, I happen to mention that I tested 10000 variants of this drug and only sent data for the one that seemed to work. The rest aren't interesting after all. Under Frequentism, it's easy to take this into account. Under Bayesianism, it requires complex definitions of observations, and is easy to overlook as there's no space for it in the formula.
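The frequentist adjustment for "I tested 10000 variants" is typically a multiple-comparisons correction. Here is a minimal sketch of the simplest one, Bonferroni, with invented p-values (not from the comment):

```python
# Bonferroni correction for testing many drug variants at once.
# Hypothetical numbers: 10000 variants, one p-value apiece.

def bonferroni_significant(p_values, alpha=0.05):
    """Return indices of tests that survive the Bonferroni correction."""
    m = len(p_values)
    threshold = alpha / m  # each test must clear a much stricter bar
    return [i for i, p in enumerate(p_values) if p < threshold]

# A p-value of 0.001 looks impressive in isolation, but with 10000
# tests the corrected threshold is 0.05 / 10000 = 5e-6, so it fails.
p_values = [0.5] * 9999 + [0.001]
print(bonferroni_significant(p_values))  # -> []
```

This is what makes the adjustment "easy to take into account" here: the only extra input is the number of tests actually run.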

I have a collection of unfair dice. Unfortunately, they all look the same and got dumped on the floor. Now someone grabbed one off the floor at random and wants to make bets with me about it. Even experienced Bayesians are likely to mix up their propositions in a case like this. I say that from having read discussions of similar problems. Yes, if you do it right, it comes out correctly, but Frequentism makes sure you've thought about what you're asking in the same way Bayesianism makes sure you've thought about your priors.

Somebody else will have to give a third example.

Bayesianism and Frequentism are based on the same math, and math is math. If you use them correctly, they'll get you the same answer every time. The difference is what they make easy, and what mistakes they protect you against.

Third example: A researcher has millions of datasets to analyze, with each dataset containing enough data points that frequentist asymptotics are satisfied. You are tasked with finding summary statistics for all datasets. The maximum likelihood and maximum a posteriori (MAP) estimators are equal within some tolerance for a subset of these datasets. However, the marginal likelihood function is computationally intractable, so the Bayesian must use expensive methods to produce MAP estimates, e.g. Markov chain Monte Carlo (MCMC). For complex posterior distributions, MCMC requires careful programming and verification procedures, which can be prohibitive in practice.
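As a toy illustration of that cost gap (my own example, with made-up numbers, not from the comment): estimating a normal mean. The frequentist MLE is one line; a flat-prior MAP estimate found by random-walk Metropolis needs a sampling loop, proposal tuning, and convergence checks, even though here both target exactly the same point.

```python
import math
import random
import statistics

random.seed(0)
data = [random.gauss(3.0, 1.0) for _ in range(1000)]

# Frequentist: the MLE of the mean is just the sample mean.
mle = statistics.fmean(data)

# Bayesian: with a flat prior the posterior mode equals the MLE, but we
# locate it the expensive way, via Metropolis over the log-likelihood.
def log_lik(mu):
    return -0.5 * sum((x - mu) ** 2 for x in data)

mu = best = 0.0
for _ in range(5000):
    prop = mu + random.gauss(0, 0.1)                  # random-walk proposal
    delta = log_lik(prop) - log_lik(mu)
    if delta >= 0 or random.random() < math.exp(delta):
        mu = prop                                     # accept the move
    if log_lik(mu) > log_lik(best):
        best = mu                                     # track the MAP estimate

print(abs(best - mle) < 0.1)  # same answer, very different cost
```

Now multiply the sampling loop by millions of datasets and the practical argument writes itself.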

There are very many real-world problems that have fast and accurate frequentist solutions, but slow and difficult Bayesian solutions. Despite my personal bias -- my research primarily relies on Bayesian inference -- I can't fathom how one can reasonably argue that frequentist approaches are always inferior, even in applied statistics.

> I can't fathom how one can reasonably argue that frequentist approaches are always inferior, even in applied statistics.

My original claim is broader than I wanted it to be. The fact is, a Frequentist approach will always be less accurate than the correct application of probability theory. But of course,

> Bayesians know that using probability theory correctly is sometimes intractable (combinatorial explosion and all that). In those cases, they will use approximations. But at least, they will know it's an approximation.


The key to the Bayesian outlook is to remember that no matter what, there is a correct answer, even if you can't afford to compute it. As Eliezer Yudkowsky put it, there are laws of thought. Want to use Frequentist tools? Sure, why not. Just remember that they often violate the laws of ideal thought. Some inaccuracy inevitably ensues.

First example: of course it's complex (though I have no idea what you mean by "definitions of observations"). The correct answer needs my probability distribution over the algorithm your brain used to select which drug to send to me (my brain already hurts). Also, could you describe the Frequentist method in more detail? I'm sure it must overlook something.

Second example: Okay, Bayesian statistics are harder. That's a disadvantage.


> If you use [Bayesianism or Frequentism] correctly, they'll get you the same answer every time


If they invariably gave the same answer, then why the endless debates? By the way, here is an apparent factual disagreement between Bayesianism and Frequentism:


As the author of the guide in question, I feel I should speak up here. Neither frequentist nor Bayesian statistics is "just right". It is 100% dependent on the user and whether they use/interpret the quantities correctly. There are both really good and horrible Frequentist and Bayesian statisticians. To imply one or the other is "better" is incorrect and disingenuous. Just my 2 cents.

Oops. It seems I got carried away.

> There are both really good and horrible Frequentist and Bayesian statisticians.

Yeah. If I had to choose between Fisher and Anonymous Bayesian, I might choose Fisher.

However, unless both kinds of statistics yield the same results (I don't think they do), at least one of them is bogus, by the principle of non-contradiction. So, while I can imagine there are good Frequentist statisticians out there, I insist that frequentism itself is bogus.

Those chapters show that any consistent (in some sense) statistical reasoning system must result in the same answers as a Bayesian system.

They don't show that the frequentist approach is wrong. If both methods result in the same answer and frequentist methods are easier to use then what is the problem?

No problem of course. But that's a big if. And how are you going to evaluate whether both methods do in fact deliver the same answer?

Also, the Frequentist approach is not wrong. It's inaccurate, to the extent its results differ from the Bayesian ones. This inaccuracy tends to go down as we gather more data. Which is a good thing, or else science itself wouldn't work.

One more thing. You said "in some sense". Are you seriously suggesting that the assumptions behind Cox's Theorem can reasonably be challenged?

I'm ashamed to say that I said "in some sense" to be deliberately vague about what I meant by consistency - I didn't want to overstate the strength of Cox's theorem or end up arguing about Godel.

Actually, I was accusing you of understating the strength of Cox's theorem.

> Are you seriously suggesting that the assumptions behind Cox's Theorem can reasonably be challenged?

Sure: probability is continuous.

You mean, "it is not totally obvious that probability is continuous".

Well… To me, it is obvious.

You may be very interested to read about Bayes factors, which allow interpretable hypothesis tests without prespecified priors:


(though one does have to specify the hypotheses to be tested more exactly than, say, "the effect is not zero.")
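A minimal worked Bayes factor, assuming a coin-flip setup of my own invention (not from the thread): H0 fixes p = 0.5, H1 gives p a uniform prior, and both marginal likelihoods have closed forms, so no data-dependent prior is needed. Note that H1 still had to be specified exactly, as the parenthetical above says.

```python
from math import comb

def bayes_factor_10(k, n):
    """Evidence for H1 (p unknown, uniform prior) over H0 (p = 0.5),
    given k successes in n Bernoulli trials."""
    # Under a uniform prior, the marginal likelihood of any k is 1/(n+1).
    marginal_h1 = 1.0 / (n + 1)
    # Under H0 it is the plain binomial probability at p = 0.5.
    marginal_h0 = comb(n, k) * 0.5 ** n
    return marginal_h1 / marginal_h0

print(bayes_factor_10(50, 100) < 1)   # a fair-looking coin favours H0
print(bayes_factor_10(90, 100) > 1)   # a lopsided one favours H1
```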

On the other hand, if you let yourself update your prior with the actual data, you'll see that frequentist statistics has had the better part of a century worth of successes.

The same could be said of Newtonian mechanics. That did not stop it from being completely dead by the 1930s.

Both Newtonian mechanics and Frequentists statistics are approximations. Which makes them both useful, and inaccurate (how inaccurate depends on the situation at hand).

Just out of curiosity, what kinds of statistical problems do you tend to solve, or are you just a fanboy?

I saw you posted a link to something on LessWrong, which confuses an application of Bayes' Rule with Bayesian statistics. Maybe you should read what an actual statistician has to say on the matter: http://normaldeviate.wordpress.com/2012/11/17/what-is-bayesi... instead of a Harry Potter fanfic author.

Fanboy. I have read half of E.T. Jaynes' Probability Theory: the Logic of Science so far.

I have read the link you speak of. I have already responded here: http://normaldeviate.wordpress.com/2012/11/17/what-is-bayesi...

I don't think my LessWrong link confuses Bayes' Rule with Bayesian statistics. Why do you think Eliezer responded 1/2 to the brain teaser? He does not say it, but I'm pretty sure he just assumed that a mathematician who has 2 boys is twice as likely to spontaneously say "I have at least one boy", compared to a mathematician who has only one boy.

The disagreement between his inference (which he did not know was "Bayesian" at the time) and "Orthodox" statistics didn't come from Bayes' Rule. It came from the use of a non-uniform prior to begin with. Which Frequentism rejects, because it's "subjective". So, instead of using this highly relevant prior information, it just uses a uniform prior. (By the way, this is nuts. A scientist should never throw away relevant information.)

Both methods then use Bayes' Rule. They just start with different priors: best guess and objective-looking, respectively. (I'm still wondering why anyone would use anything but one's best guess.)
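The two priors in that brain teaser can be checked by simulation. This sketch assumes my reading of the comment above: under model A, any parent of two children with at least one boy always says "I have at least one boy"; under model B, a two-boy parent is twice as likely to volunteer it.

```python
import random

random.seed(42)

def estimate(model, trials=200_000):
    """Estimate P(two boys | parent said 'at least one boy')."""
    said, two_boys = 0, 0
    for _ in range(trials):
        boys = sum(random.random() < 0.5 for _ in range(2))
        if boys == 0:
            continue  # this parent can't truthfully make the statement
        speaks = True if model == "A" else (random.random() < boys / 2)
        if speaks:
            said += 1
            two_boys += boys == 2
    return two_boys / said

print(round(estimate("A"), 2))  # ~0.33: the classic "1/3" answer
print(round(estimate("B"), 2))  # ~0.5: the answer under the non-uniform prior
```

Both answers are correct; they just encode different assumptions about how the statement came to be uttered, which is exactly the disagreement described above.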

So you're basically just Monday morning quarterbacking here. It's kind of telling you didn't understand Wasserman's reply to you, it was perfectly clear to any practitioner.

Here's an example of a real-world problem I had to solve in my last job. We wanted to compare two treatments on an object, OLD and NEW, and evaluate which was better. We got a bunch of volunteers and showed them each a random sample of objects, the OLD treatment, and the NEW treatment (suitably blinded of course) and asked them to rate each on a scale of 1-5. The goal was to determine if the OLD treatment should be replaced by the NEW treatment.

What's your Bayesian approach to solving this? I was the closest thing to a domain expert, and I certainly didn't have any prior beliefs about what the ratings would be (except for a weak expectation that OLD>NEW).

Tell me the tests you would have run, and then I'll tell you the non-Bayesian thing I did that gave a very solid answer and took well under a second of computation.

> It's kind of telling you didn't understand Wasserman's reply to you, it was perfectly clear to any practitioner.

Condescension doesn't impress me. Explanations do. Could you please explain his "perfectly clear" reply to me?

> What's your Bayesian approach to solving this?

Applying probability theory as best I could. If it turns out to be too complicated, or computationally intractable, I'll resort to approximations.

For this particular case… Well, first, we don't have 2 treatments, we have as many treatments as we have objects. It's like flipping a coin. When you flip a coin in the obvious way, you have no idea which side will eventually come up, because even if you know it came up heads the previous time and you try to perform the same move again, in fact you don't perform the same move, and you don't know exactly how your second move differed from the first.

So first, there is an unpredictable variability in the way the treatment will be performed. This is the first source of uncertainty.

Second, the objects aren't all the same. I expect them to be more or less clean to begin with, and other characteristics such as sharp angles may more or less hinder the treatment. This is the second source of uncertainty.

Once you perform a treatment on an object, it has a definite effect. Unfortunately, it is hard to define how well it actually went. Maybe you can come up with a definite criterion, but apparently, since you needed to ask a sample of human volunteers, that criterion is either not well understood or hard to measure properly. Anyway, you have access to an uncertain measure, which in this case is a human giving you a score on a scale from 1 to 5. These are the third and fourth sources of uncertainty (akin to the first and second ones respectively: variability in how a given human will assess an object, and variability across humans).

From what I have understood of what you told me, there were 2 sets of objects, and 2 groups of people. One group looked at every object before any treatment. Then you applied OLD to half the objects, and NEW to the other half. You then showed the newly cleaned up objects to the second group. Of course, you don't tell the group which objects they get to examine. An alternative would be to photograph every object before and after the treatment, then show both sets of photos to everyone, and ask them to rank the objects (both before and after), and the perceived efficiency of the treatment. I don't like it much however, because photos add an additional source of uncertainty.

This looks like a complicated version of a compound estimation problem, described in chapter 6 of Jaynes's PT:TLOS (elementary parameter estimation). Pretty basic stuff.

We're not finished yet. You wanted to know which treatment should be chosen: OLD or NEW. You basically have 3 alternatives: always apply OLD, always apply NEW, or look at the object then decide which treatment to apply. For each alternative, you should have a probability distribution over the "cleanness" distribution of the objects. Decide a utility function for the cleanness distribution, and choose the method that maximizes expected utility.

And now I'm stuck because I don't know probability theory deeply enough to actually give you a non-ambiguous procedure in less than a couple days. Not to mention that I don't have your prior information about OLD and NEW (why you expect OLD to be better etc.) Anyway, I believe a specialist would only need a couple minutes, or a couple hours tops. (To find the procedure, that is. Actually running the numbers in a computer may or may not be expensive. I personally have no idea.)

Maybe I was unclear. To be explicit, the experimental setup was:

* Draw a sample of objects (in this case, an 'object' was a set of text documents)

* Apply the OLD and NEW treatments to each sampled object

* Show the results of OLD and NEW to volunteers (double blind, of course), get an evaluation.

And so the data I get is: (object 1, SCORE OLD 1, SCORE NEW 1), (object 2, SCORE OLD 2, SCORE NEW 2), etc.

So again:

* I don't have any prior knowledge of how the scores are going to be distributed

* The primary reason I expected OLD to be 'better' than NEW is because OLD had theory behind it, and NEW was an ad hoc tweak of a previously rejected treatment

* The very initial text corpus was fixed

* The objects were selected from the initial corpus through a multistep process; at least one of those steps was a stochastic optimization of an objective function

* One of the treatments was also obtained through a constrained optimization

* The initial text corpus wasn't all that large, and the samples were pretty small

So what I did was look at all observations for which OLD SCORE was different from NEW SCORE. Say there were n of them, and say that for p of those OLD<NEW. If NEW weren't producing results that scored better than OLD, we would expect p/n to be about 0.5, up to some uncertainty from the random sample. This is just a straight-up test of a binomial parameter, so I computed confidence intervals and found that 0.5 was several standard deviations below the observed p/n, from which I concluded that NEW was better than OLD and we should switch. (As described here: http://en.wikipedia.org/wiki/Sign_test)
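The normal-approximation version of that sign test is tiny. A sketch with hypothetical counts (the real n and p aren't given in the thread):

```python
from math import sqrt

def sign_test_z(p, n):
    """Normal-approximation z-score for a binomial proportion vs 0.5.
    p = pairs where NEW beat OLD, n = pairs with differing scores."""
    phat = p / n
    se = sqrt(0.25 / n)  # standard error of a proportion under H0: 0.5
    return (phat - 0.5) / se

z = sign_test_z(140, 200)  # made-up counts: NEW won 140 of 200 pairs
print(z > 3)               # many standard deviations above 0.5
```

Ties are excluded before counting, which is why only observations with differing scores enter n.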

In a more responsible analysis I could have done something more sophisticated and modeled the scores as a sum of object effect, tester effect, treatment effect, fixed term, noise, etc. Given the magnitude of the sign test score, I didn't feel that was necessary.

So, there was no "first source of uncertainty", and you had the advantage of real-world observation and photos put together. Neat.

Indeed, given the apparent massive evidence in favour of NEW, you probably didn't need to twist your brain further.

Now, if we're doing A/B testing on a commercial web site, I think we should use every scrap of information available to us, and go into full Bayesian mode if at all possible: if one of A and B is worse, you don't want to run one test too many before you figure that out. The sooner you positively know which is best, the better.

I've only read the first two chapters of Jaynes' book, and bits and pieces of various other books. (One of the late chapters in http://uncertainty.stat.cmu.edu/ is a great reference for outlining the problems with significance and hypothesis testing.) Anyway, to me the problem described above sounds like it can generally be phrased as the common problem of ranking user-rated products on an e-commerce page. (A good Bayesian write-up is here, with simple code at the end: http://masanjin.net/blog/how-to-rank-products-based-on-user-...) At least if one assumes that treatments are applied more-or-less the same, and sets of objects amongst a treatment are more-or-less the same. There's also the question of how rankings change when both treatments are presented to a volunteer instead of just one. I interpreted the problem as just showing the OLD sets or just showing the NEW sets, never showing the original sets pre-treatment.
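The ranking idea in that write-up boils down to a Bayesian average: shrink each item's mean rating toward a global prior mean, weighted by a pseudo-count. The prior values below are assumed for illustration.

```python
def bayesian_average(ratings, prior_mean=3.0, prior_weight=5):
    """Posterior mean rating: prior_weight acts as that many
    imaginary ratings of prior_mean."""
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + len(ratings))

one_great_rating = bayesian_average([5])        # shrunk hard toward 3.0
many_good_ratings = bayesian_average([4] * 50)  # barely shrunk at all
print(many_good_ratings > one_great_rating)     # True
```

The same shrinkage logic applies to the OLD/NEW scores: a treatment with few ratings shouldn't outrank one with many on the strength of a lucky sample.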

How does sanskritabelt solve the problem, I wonder? Set an arbitrary number of volunteers for each group to say "this is the point we can stop collecting data", assume normality of ratings, take the means, and use some test to see if the difference is significant or not?

Really, the heart of the problem is defining what is meant by 'better'. Is 'better' a higher score? (Which scoring function do you use -- mean, median, most 5-stars, least 1-stars, something else that takes into account the number of ratings per rating bucket?) Does it have to be higher by a certain amount (how meaningful are deltas in your scoring function)? Once it's defined what is meant by "A is better than B", then you can go about the business of computing the likelihood of your data given "A is better than B" to fulfill the RHS of Bayes' theorem.

In my career, and maybe this is just the kinds of problems I've been presented with, I've found that assuming normality is, by and large, like drinking tequila. It's fun at the time, but I end up regretting it in the morning.

EDIT: it's this kind of experience that makes me reach for things like nonparametric tests and the bootstrap.

How does everyone feel about a format of JSON dicts separated by newlines? (Like the JSON format that mongoimport can accept.)

Each sample of the type of data that I'm often dealing with tends to be nested in nature. Yes, I do have a script that can flatten out the nested dicts into a regular table, but that always results in a blowup into hundreds of columns.
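A flattening script like the one described can be sketched in a few lines. The dotted-path column convention and the sample record are my own invention, not from the comment:

```python
import json

def flatten(d, prefix=""):
    """Flatten a nested dict into a single-level dict with dotted keys."""
    out = {}
    for k, v in d.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))  # recurse into nested dicts
        else:
            out[key] = v
    return out

line = '{"id": 1, "user": {"name": "ada", "geo": {"lat": 51.5}}}'
print(flatten(json.loads(line)))
# {'id': 1, 'user.name': 'ada', 'user.geo.lat': 51.5}
```

The "blowup into hundreds of columns" happens because every distinct nested path across all records becomes its own column in the resulting table.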

Nice suggestion to share the raw data. I've never seen a researcher do that; I think many don't even save the raw data to disk before extracting what they want. But I always try to.

It's rare (unfortunately), but Dr. Jeffrey Leek isn't the first proponent of sharing raw data, or otherwise a believer in reproducible results. Dr. Eamonn Keogh has made very important contributions in data mining and is also a huge advocate of reproducibility. See for example 'Why the lack of reproducibility is crippling research in data mining and what you can do about it' dated 2007 @ http://dl.acm.org/citation.cfm?id=1341922

I've been using this format recently and appreciate the additional fidelity over simple rows. The flexibility makes it good for the raw stage, but I usually have to extract tidy subsets for real analytic work. I have written many little scripts to pull arbitrarily deep keys out of these structures and produce tidy tables for further analysis.

actually, this format is also nice because iterating over the lines of a file is very similar to running through a mongo cursor. that makes it easy to write code that works with both inputs.

I'd like to follow this up with a plea to statisticians to make sure you're not sending data back to programmers with bizarre newline formats - certain versions of Mac Excel save with \r newlines, which haven't been used in over a decade and pretty much break everything. If in doubt, it's probably best to save in Windows newline mode from MS software; at least most utilities are used to dealing with that.

I once debugged a case like that. A NumPy script expected to read in a table from a file, but the file used \r newlines, so the NumPy script read it in as a very long single line. Then later references to table lines >1 broke the script.

The strange part is, the file and the script were provided by the same person.

And a plea to programmers: handle \n, \r\n and \r. And UTF-8, UTF-16, and all the other common text encodings. It's sad that in 2013 it's still hard to read in a text file, so the burden is shifted to the user.
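In Python, at least, the newline half of this plea is already solved by universal-newline mode; a quick check (the encoding still has to be stated, or sniffed, explicitly):

```python
import io

# newline=None (the default) translates \n, \r\n, and old Mac \r
# to \n on read, so all three files below parse identically.
for raw in (b"a\nb\n", b"a\r\nb\r\n", b"a\rb\r"):
    with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8") as f:
        assert f.read().splitlines() == ["a", "b"]

print("all three newline conventions read identically")
```

`open(path, encoding=...)` behaves the same way, so a script that splits on a hard-coded "\n" after a binary read is usually the real culprit.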

The Leek group is doing some fantastic things, pushing for better transparency and coding integrity from statisticians. I met with Jeff a little while back and was just really enthused by his desire to change the way traditional statistics/biostatistics is viewed and interacted with.

Bit of a side-question, but I have been looking for such a thing without really knowing what to look for: is there some kind of small and easy JS and/or PHP program that lets you do some easy work on a tidy database? Pretty much like having a full Excel table in a browser where you can add/modify/remove rows and columns? No need for all the Excel formulae and such...

Thanks very much for any help.

Most databases have a companion GUI tool that will allow you to "edit" the database visually; what DB are you using?

pgAdmin (Postgres) - http://www.pgadmin.org/

MySQL Workbench (MySQL) - http://dev.mysql.com/downloads/tools/workbench/

SQL Developer (Oracle) - http://www.oracle.com/technetwork/developer-tools/sql-develo...

I strongly recommend learning some SQL though; this will give you the ability to bulk edit columns and apply formulae.

I was thinking TSV in plain text format, not SQL. Hence the option of a pure JS solution?

A TSV is typically referred to as a "flat file". You're going to get a lot of confused looks if you talk about a "TSV database".

That said, you can:

- create an external table in an RDBMS (any of the above will work), which allows you to work on a flat file in place: http://www.fromdual.com/csv-storage-engine

- import the TSV to an RDBMS, work with it, and export it again.

I'm sure someone has written what you're asking for, but I don't get the appeal. RDBMS aren't scary, you can install MySQL and work with it for free, and if you use the GUI tool you don't even have to touch SQL.

edit: Excel is a tool that can edit TSV files, and as a bonus it looks and works exactly like Excel. What exactly do you want?
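The "import the TSV to an RDBMS, work with it, export it again" route can be sketched with SQLite from Python's standard library (nothing to install); the table and data here are invented:

```python
import csv
import io
import sqlite3

tsv_in = "name\tscore\nada\t5\nbob\t3\n"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT, score INTEGER)")
reader = csv.reader(io.StringIO(tsv_in), delimiter="\t")
header = next(reader)
con.executemany("INSERT INTO t VALUES (?, ?)", reader)

# Bulk edit with SQL -- the kind of thing that's awkward in a raw flat file.
con.execute("UPDATE t SET score = score + 1 WHERE score < 4")

# Export back to TSV.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(con.execute("SELECT name, score FROM t ORDER BY name"))
print(out.getvalue())  # ada keeps 5, bob is bumped to 4
```

Swap `:memory:` for a file path and the edited table persists between sessions, which gets most of the way to the "flat file you can query" workflow described above.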

I see. Well, thanks for your answer. Maybe now I have a better idea of what the options are and what I'm looking for :)

As I said, the idea would be to edit a database/file directly on the web (i.e. no local Excel file). Since the original post talked about TSV, I thought there might be a niche for such a lightweight JS editor.

or just use R tools yourself
