
Tea: High-Level Language and Runtime System for Automating Statistical Analysis - furcyd
https://arxiv.org/abs/1904.05387
======
azhenley
From the abstract: "In Tea, users express their study design, any parametric
assumptions, and their hypotheses. Tea compiles these high-level
specifications into a constraint satisfaction problem that determines the set
of valid statistical tests, and then executes them to test the hypothesis."

This is awesome. I currently use R to run simple stats, but I make a lot of
mistakes and have to spend a lot of time verifying that I'm using appropriate
tests.

~~~
fxfan
Great point. I wish R had an interface to a well-typed language.

I'm surprised Microsoft hasn't done anything in this regard, despite the fact
that they own R.

~~~
kyllo
There's an interesting interface from F# to R, which is not maintained by
Microsoft
[http://bluemountaincapital.github.io/FSharpRProvider/](http://bluemountaincapital.github.io/FSharpRProvider/)

The most important errors in R have little to do with data type safety
(though NA/NULL handling is a common issue); they're about the statistical
soundness of the analysis you're conducting: experimental design, implicit
assumptions and invariants, confounding factors, and whether the statistical
tests you're applying suit the distribution of your data. It sounds like
tea-lang helps with the latter part.

Microsoft definitely doesn't "own" R, just one distribution of R (Microsoft R
Open) and a package repository (MRAN).

(I'm a Microsoft employee and heavy user of R at work)

~~~
emjun
This is so interesting/cool to learn about. Thanks for sharing :)

You're spot on - Tea is focused on ensuring statistical soundness. Another way
to think about this: Tea helps people analyzing data from studies to "maximize
internal validity."

I hope to support integration with existing tools/common workflows in the
future. Microsoft R Open definitely looks like an interesting target.

It'd be great to follow up. I visit Microsoft in Redmond periodically. If
you're interested, email me! (<my username> at cs.washington.edu)

~~~
kyllo
Cool, I'll shoot you an e-mail. I'm very interested in the Tea project and
potentially contributing to it too. It's the sort of thing that could make
every working data scientist noticeably more effective at their jobs.

I was thinking about whether it'd be possible to write something like Tea for
R, but then I found there is no R package for Z3 nor any other SMT solver! So
that would be an extra challenge...

------
bhattisatish
It's an open-source Python package that can be used in a Jupyter notebook.
Source is available at
[https://github.com/emjun/tea-lang](https://github.com/emjun/tea-lang)

------
chriswarbo
This looks really useful! Is there a typo in the example code, where the 'So'
variable is assumed to be normally distributed (I think it's meant to be the
'Prob' variable)?

I tended to avoid statistics in the past, knowing just enough to realise how
easy it is to draw wrong or misleading conclusions. I recently had to perform
some data analysis and got a bit paralysed at the thought of choosing the
wrong approach; something like Tea would have been great, if only to give
external validation/justification to the decisions I made.

~~~
emjun
Yes, nice catch! Thanks :) In case you're interested, we want to explore
opportunities to expand the "grammar" of assumptions people might want to
express.

And, that's our hope. We'd love to increase people's awareness and confidence
in their analyses.

~~~
bitminer
There are a few errors in the paper. "HCI" is not defined. Figure 2 refers to
independent and dependent variables for an observational study but the text
under "Study Design" says it should be "contributor" or "outcome".

The software has zero releases. This doesn't allow comments w.r.t. versions
and possible bug fixes.

I like the concept. Adding a graphic representing the model would help make it
interpretable by non-numerate stakeholders, such as management. Or politicians
(!)

It would be helpful to be able to make some statement on future support. Is
this going to be around for a few years?

~~~
emjun
Thanks so much for pointing these out :) Yes, as we speak, I am working on
updating the README, supporting docs, and a better website for the
language/project. I will let you know once I have these up and running! It
would be great to get your feedback.

My plan is to continue building on Tea for _at least_ a few more years and
provide support for new use cases, statistical methods, etc. Are there
topics/concerns you'd like to see addressed? :)

~~~
kyllo
I read and shared your paper and GitHub link with the rest of my data science
team here at Microsoft Azure--we do a lot of hypothesis testing of this nature
in our day-to-day work (using R or Python) and a tool like Tea would go a long
way in helping us apply stats tests more efficiently and thoroughly. Good to
know that you're planning to continue development; we'll be following with
interest!

Some other areas to potentially explore could be support for power analysis,
sequential analysis and stopping rules, for use in the pre-trial phase of
controlled experiments such as A/B tests.

I'd also suggest adding a LICENSE file to your project to clarify how it's
allowed to be used.

------
exp1orer
This is extremely interesting. This paper does a few different things that are
worth separating:

1\. Structures knowledge about common statistical tests, the assumptions they
need to satisfy, and the kinds of problems for which they are appropriate.

2\. Models this as a constraint satisfaction problem.

3\. Creates a DSL to write specifications for statistical testing.

4\. Develops an output format for statistical testing based on (3).

5\. Provides a Python package that implements the above.

IMO (1) is by far the most fundamental of these, but the paper spends most of
its time describing (3). I'll admit to not fully understanding (2), what the
alternatives were, and why it's an appropriate choice - I wish there were more
on that in the paper as well.

If the approach to (1) were described in more depth and published in a usable
way outside of the implementation in (5), I could see it being broadly useful
and leading to an ecosystem of different (competing) implementations. As it
stands, the work on (1) is spread across a few functions in solver.py.
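
To make the relationship between (1) and (2) concrete, here is a minimal sketch, not Tea's actual implementation. It keeps the knowledge about tests as plain data and selects the valid ones by checking which tests have all of their required properties satisfied. The `StatTest` class, the `TESTS` table, the property names, and `valid_tests` are all hypothetical and heavily simplified; a real system would hand richer constraints to a solver instead of using set inclusion.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class StatTest:
    """A statistical test plus the properties that must hold for it to be valid."""
    name: str
    requires: FrozenSet[str]


# Hypothetical, simplified knowledge base (illustration only, not Tea's rules).
TESTS = [
    StatTest("students_t", frozenset({
        "two_groups", "independent_samples", "continuous_outcome",
        "normal", "equal_variance",
    })),
    StatTest("welchs_t", frozenset({
        "two_groups", "independent_samples", "continuous_outcome", "normal",
    })),
    StatTest("mann_whitney_u", frozenset({
        "two_groups", "independent_samples", "continuous_outcome",
    })),
    StatTest("paired_t", frozenset({
        "two_groups", "paired_samples", "continuous_outcome", "normal",
    })),
]


def valid_tests(properties: Iterable[str]) -> List[str]:
    """Return the names of tests whose requirements are all in `properties`."""
    props = frozenset(properties)
    return [t.name for t in TESTS if t.requires <= props]


if __name__ == "__main__":
    # Study design and assumptions the analyst is willing to assert.
    study = {"two_groups", "independent_samples", "continuous_outcome", "normal"}
    print(valid_tests(study))
```

Because the table is just data, it could in principle be published and reused independently of any one implementation, which is the separation suggested above.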

Looking more broadly, statistics-as-an-HCI-problem is fascinating, and I think
this is a promising start. I'd also love to see more attention paid to (4),
the output format. Most "doing harm with statistics", I believe, comes from
misunderstanding/misapplying results, so looking at the full workflow / user
story is critical. The focus of this paper is instead making the input easier,
which is also obviously important.

emjun, so glad to see you in the comments here! What's the best way for
interested folks to contribute to this project?

~~~
emjun
Hi! I, too, think statistics-as-an-HCI-problem is a cool framing that allows
for new methods/solutions :)

Your observations and distinctions about what Tea does are great.

I am currently trying to fix some bugs/make improvements from the deadline
push. After that, I plan to restructure the internals to make the
logic/reasoning/implementation easier to follow and extend.

I would love contributors! It would be great if you could watch, open an
issue, or open a pull request on GitHub
([https://github.com/emjun/tea-lang](https://github.com/emjun/tea-lang)).

If there are enough interested collaborators, it might be worth opening up a
gitter or slack group? In the meantime, if you'd like to chat more, please
feel free to email me! (<my username> at cs.washington.edu)

