

How to share data with a statistician - shrikant
https://github.com/jtleek/datasharing

======
jtleek
If you liked my guide to sharing data, you may also like my guides:

For writing R packages:
[https://github.com/jtleek/rpackages](https://github.com/jtleek/rpackages)

For writing scientific reviews:
[https://github.com/jtleek/reviews](https://github.com/jtleek/reviews)

Or on the future of stats:
[https://github.com/jtleek/futureofstats](https://github.com/jtleek/futureofstats)

------
loup-vaillant
Before all that, you need to choose your statistician. My advice is:
resuscitate E.T. Jaynes. Failing that, find one of his disciples. Failing
that, find a Bayesian. Failing that, read _Probability Theory: the Logic of
Science_, and just do the analysis yourself. Failing _that_, maybe a
Frequentist statistician will do. Maybe.

~~~
pja
And your Bayesian statistician will acquire their priors how exactly?

Bayesian statistics can be very powerful, but it would be a terrible idea to
prefer Bayesian approaches to Frequentist ones in all situations.

~~~
loup-vaillant
> _And your Bayesian statistician will acquire their priors how exactly?_

As if Frequentists somehow didn't need priors. _Everyone_ starts with prior
knowledge. We might as well use it. Or do you advocate _not_ using every scrap
of knowledge available to you? That would be stupid.

Sure, prior knowledge can be shaky, or difficult to justify. But at least a
Bayesian will be explicit about it, instead of, say, sweeping normality
assumptions under the linear regression rug.

> _it would be a terrible idea to prefer Bayesian approaches to Frequentist
> ones in all situations._

Name three examples that don't involve the Frequentist using better prior
information than the Bayesian.

By the way, Bayesians know that using probability theory correctly is
sometimes intractable (combinatorial explosion and all that). In those cases,
they will use approximations. But at least, they will _know_ it's an
approximation.

---

You really should read chapters 1 and 2 of _Probability Theory: the Logic of
Science_. They give a good feel for why Bayesians are correct as a simple
matter of fact.

~~~
dspeyer
Here are two examples:

I'm testing the effectiveness of a drug. Drugs of this class have a certain
likelihood of working, the noise in my data is known, the experimental group
did this much better than the control... does the drug really work? So far so
trivial, in either Bayesianism or Frequentism. Now, I happen to mention that I
tested 10000 variants of this drug and only sent data for the one that seemed
to work. The rest aren't interesting after all. Under Frequentism, it's easy
to take this into account. Under Bayesianism, it requires complex definitions
of observations, and is easy to overlook as there's no space for it in the
formula.
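
To make the frequentist bookkeeping concrete, here is a minimal sketch
(Python; the setup is hypothetical, simulating a null where none of the
10,000 variants actually works, and Bonferroni stands in for whatever
correction you'd actually use):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical setup: 10,000 drug variants, all actually ineffective,
# each compared to a control with a one-sided z-test.
n_variants = 10_000
z = rng.standard_normal(n_variants)   # the null is true for every variant
p = norm.sf(z)                        # one-sided p-values

alpha = 0.05

# Reporting only the best-looking variant: almost certainly "significant",
# because 10,000 tests were silently run.
print("best p-value:", p.min())
print("naive verdict:", p.min() < alpha)                   # True: a false discovery

# Bonferroni makes the hidden tests explicit in the threshold:
print("corrected verdict:", p.min() < alpha / n_variants)  # False
```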

I have a collection of unfair dice. Unfortunately, they all look the same and
got dumped on the floor. Now someone grabbed one off the floor at random and
wants to make bets with me about it. Even experienced Bayesians are likely to
mix up their propositions in a case like this. I say that from having read
discussions of similar problems. Yes, if you do it right, it comes out
correctly, but Frequentism makes sure you've thought about what you're asking
in the same way Bayesianism makes sure you've thought about your priors.
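
Done right, it looks something like this (a toy Python sketch; the bias
tables are made up). The unknown proposition is which die was grabbed, and
the quantity to bet on is the posterior predictive for the next roll:

```python
import numpy as np

# Toy version: three unfair dice with made-up bias tables.
dice = np.array([
    [0.40, 0.12, 0.12, 0.12, 0.12, 0.12],  # die A: loaded toward 1
    [0.12, 0.12, 0.12, 0.12, 0.12, 0.40],  # die B: loaded toward 6
    [1/6] * 6,                              # die C: fair after all
])

posterior = np.full(3, 1/3)          # grabbed uniformly at random off the floor
for roll in [6, 6, 3, 6]:            # observed rolls
    posterior *= dice[:, roll - 1]   # likelihood of this roll under each die
    posterior /= posterior.sum()     # renormalise

# Bet on the posterior predictive for the next roll, averaging over the
# remaining uncertainty about which die it is.
predictive = posterior @ dice
print("P(which die | rolls):", posterior.round(3))
print("P(next roll is 6):", predictive[5].round(3))
```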

Somebody else will have to give a third example.

Bayesianism and Frequentism are based on the same math, and math is math. If
you use them correctly, they'll get you the same answer every time. The
difference is what they make easy, and what mistakes they protect you against.

~~~
landismj
Third example: A researcher has millions of datasets to analyze, with each
dataset containing enough data points such that frequentist asymptotics are
satisfied. You are tasked with finding summary statistics for all datasets.
The maximum likelihood and maximum a posteriori (MAP) estimators are equal
within some tolerance for a subset of these datasets. However, the marginal
likelihood function is computationally intractable, so the Bayesian must use
expensive methods to produce MAP estimates, e.g. using Markov chain Monte
Carlo (MCMC). For complex posterior distributions, MCMC requires careful
programming and verification procedures, which can be prohibitive in practice.
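
To make the cost gap concrete, here is a deliberately tiny sketch (Python; a
Gaussian mean with a flat prior, so the MLE and the posterior mean coincide)
comparing the closed-form estimator with a toy random-walk Metropolis
sampler:

```python
import time
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=1_000)  # one of the "millions"

# Frequentist summary: the MLE of a Gaussian mean is the sample mean --
# one vectorised pass, trivially repeatable across datasets.
t0 = time.perf_counter()
mle = data.mean()
t_mle = time.perf_counter() - t0

# Bayesian route when no closed form is assumed: random-walk Metropolis
# targeting the posterior (flat prior, known unit variance).
def log_post(mu):
    return -0.5 * np.sum((data - mu) ** 2)

t0 = time.perf_counter()
mu, samples = 0.0, []
for _ in range(20_000):
    prop = mu + 0.1 * rng.standard_normal()
    if np.log(rng.random()) < log_post(prop) - log_post(mu):
        mu = prop                 # accept the proposal
    samples.append(mu)
t_mcmc = time.perf_counter() - t0

print(f"MLE  {mle:.4f} in {t_mle * 1e6:.0f} microseconds")
print(f"MCMC {np.mean(samples[5_000:]):.4f} in {t_mcmc:.2f} seconds")
```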

There are very many real-world problems that have fast and accurate
frequentist solutions, but slow and difficult Bayesian solutions. Despite my
personal bias -- my research primarily relies on Bayesian inference -- I can't
fathom how one can reasonably argue that frequentist approaches are always
inferior, even in applied statistics.

~~~
loup-vaillant
> _I can't fathom how one can reasonably argue that frequentist approaches
> are always inferior, even in applied statistics._

My original claim is broader than I wanted it to be. The fact is, a
Frequentist approach will always be less _accurate_ than the correct
application of probability theory. But of course,

> _Bayesians know that using probability theory correctly is sometimes
> intractable (combinatorial explosion and all that). In those cases, they
> will use approximations. But at least, they will_ know _it's an
> approximation._

[https://news.ycombinator.com/item?id=6793905](https://news.ycombinator.com/item?id=6793905)

The key to the Bayesian outlook is to remember that no matter what, there _is_
a correct answer, even if you can't afford to compute it. As Eliezer Yudkowsky
put it, there are _laws of thought_. Want to use Frequentist tools? Sure, why
not. Just remember that they often violate the laws of ideal thought. Some
inaccuracy inevitably ensues.

------
compare
How does everyone feel about a format of JSON dicts separated by newlines?
(Like the JSON format that mongoimport can accept.)

Each sample of the type of data that I'm often dealing with tends to be nested
in nature. Yes, I do have a script that can flatten out the nested dicts into
a regular table, but that always results in a blowup into hundreds of columns.
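
The core of that kind of flattening script is small. A minimal sketch
(Python; `samples.ndjson` is a stand-in filename, one JSON dict per line):

```python
import json

def flatten(record, prefix=""):
    """Recursively flatten a nested dict into dotted column names."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

# Each line of the file is one JSON dict (the mongoimport-style format).
with open("samples.ndjson") as f:
    rows = [flatten(json.loads(line)) for line in f if line.strip()]

# The union of all keys becomes the header -- this is exactly where the
# blowup into hundreds of columns comes from.
columns = sorted({key for row in rows for key in row})
print(len(columns), "columns")
```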

Nice suggestion to share the raw data. I've never seen a researcher do that;
I think many don't even save the raw data to disk before extracting what they
want. I always try to.

~~~
edraferi
I've been using this format recently and appreciate the additional fidelity
compared to simple rows. The flexibility makes it good for the raw stage, but I
usually have to extract tidy subsets for real analytic work. I have written
many little scripts to pull arbitrarily deep keys out of these structures and
produce tidy tables for further analysis.

Actually, this format is also nice because iterating over the lines of a file
is very similar to running through a mongo cursor. That makes it easy to write
code that works with both inputs.
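
A sketch of that symmetry (Python; assumes pymongo and a running mongod, and
the filename, database, and collection names are made up):

```python
import json
from pymongo import MongoClient  # assumes pymongo and a running mongod

def process(doc):
    """Placeholder for whatever per-record work you do."""
    print(doc.get("_id"))

# From a newline-delimited JSON file:
with open("samples.ndjson") as f:    # stand-in filename
    for line in f:
        if line.strip():
            process(json.loads(line))

# From a live collection: same loop shape, same process().
# "mydb" and "samples" are made-up names.
for doc in MongoClient().mydb.samples.find():
    process(doc)
```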

------
bermanoid
I'd like to follow this up with a plea to statisticians: make sure you're
not sending data back to programmers with bizarre newline formats. Certain
versions of Mac Excel save with \r newlines, which haven't been used in over a
decade and pretty much break everything. If in doubt, it's probably best to
save in Windows newline mode from MS software; at least most utilities are
used to dealing with that.
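
If you're on the receiving end, one defensive habit is to normalise line
endings before anything downstream touches the file. A minimal sketch
(Python; the filenames are hypothetical, and text mode's universal-newline
handling does the heavy lifting):

```python
def normalise_newlines(in_path, out_path):
    # newline=None (the default) turns on universal-newline translation,
    # so old-Mac \r endings are read correctly alongside \r\n and \n.
    with open(in_path, newline=None) as src, \
         open(out_path, "w", newline="\n") as dst:
        dst.write(src.read())

normalise_newlines("from_excel.csv", "clean.csv")  # hypothetical filenames
```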

~~~
sampo
I once debugged a case like that. A NumPy script expected to read a table
from a file, but the file used \r newlines, so the script read it in as one
very long line. Later references to table rows beyond the first then broke
the script.

The strange part is, the file and the script were provided by the same person.

------
tel
The Leek group is doing some fantastic things, pushing for better transparency
and coding integrity from statisticians. I met with Jeff a little while back
and was just really enthused by his desire to change the way traditional
statistics/biostatistics is viewed and interacted with.

------
croisillon
Bit of a side question, but I have been looking for such a thing without
really knowing what to look for: is there some kind of small and easy JS
and/or PHP program that lets you do some easy work on a tidy database? Pretty
much like having a full Excel table in a browser, where you can add, modify,
and remove rows and columns? No need for all the Excel formulae and such...

Thanks very much for any help.

~~~
alanctgardner2
Most databases have a companion GUI tool that will allow you to "edit" the
database visually; what DB are you using?

pgAdmin (Postgres) - [http://www.pgadmin.org/](http://www.pgadmin.org/)

MySQL Workbench (MySQL) -
[http://dev.mysql.com/downloads/tools/workbench/](http://dev.mysql.com/downloads/tools/workbench/)

SQL Developer (Oracle) - [http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/index.html](http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/index.html)

I strongly recommend learning some SQL though; this will give you the ability
to bulk edit columns and apply formulae.

~~~
croisillon
I was thinking TSV in plain text format, not SQL. Hence the option of a pure
JS solution?

~~~
alanctgardner2
A TSV is typically referred to as a "flat file". You're going to get a lot of
confused looks if you talk about a "TSV database".

That said, you can:

- create an external table in an RDBMS (any of the above will work), which
allows you to work on a flat file in place: [http://www.fromdual.com/csv-storage-engine](http://www.fromdual.com/csv-storage-engine)

- import the TSV to an RDBMS, work with it, and export it again (a minimal
sketch follows).
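
For the second option, a minimal sketch of the round trip with Python's
built-in sqlite3 module (the file and column names are made up for
illustration):

```python
import csv
import sqlite3

conn = sqlite3.connect(":memory:")

# Import: assumes a TSV with a header row; "data.tsv" is a stand-in name.
with open("data.tsv", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" * len(header))
    conn.execute(f"CREATE TABLE t ({cols})")
    conn.executemany(f"INSERT INTO t VALUES ({marks})", reader)

# Bulk edit: the kind of thing that is one statement in SQL.
# ("value" and "group" are made-up column names.)
conn.execute('UPDATE t SET "value" = "value" * 2 WHERE "group" = ?', ("a",))

# Export it again.
with open("data_edited.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(header)
    writer.writerows(conn.execute("SELECT * FROM t"))
```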

I'm sure someone has written what you're asking for, but I don't get the
appeal. RDBMS aren't scary, you can install MySQL and work with it for free,
and if you use the GUI tool you don't even have to touch SQL.

edit: Excel is a tool that can edit TSV files, and as a bonus it looks and
works exactly like Excel. What exactly do you want?

~~~
croisillon
I see. Well, thanks for your answer. Maybe now I have a better idea of the
options and of what I'm actually looking for :)

As I said, the idea would be to edit a database/file directly on the web
(i.e. no local Excel file). Since the original post talked about TSV, I
thought there might be a niche for such a lightweight JS editor.

------
imahboob
Or just use R tools yourself.

