Tea: High-Level Language and Runtime System for Automating Statistical Analysis (arxiv.org)
88 points by furcyd 39 days ago | 14 comments

From the abstract: "In Tea, users express their study design, any parametric assumptions, and their hypotheses. Tea compiles these high-level specifications into a constraint satisfaction problem that determines the set of valid statistical tests, and then executes them to test the hypothesis."
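To make the abstract's idea concrete, here is a toy sketch of choosing valid tests from declared properties. It is not Tea's API or its real catalogue: the class `CandidateTest`, the function `valid_tests`, the property names, and the three-test catalogue are all made up for illustration, and the set-inclusion check stands in for a real constraint solver.

```python
# Toy sketch only: this is NOT Tea's API or its real test catalogue.
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateTest:
    name: str
    requires: frozenset  # properties that must all hold for the test to be valid


# Hypothetical, heavily simplified catalogue of two-group comparisons.
CATALOGUE = (
    CandidateTest(
        "Student's t-test",
        frozenset({"two_groups", "independent_samples", "normal", "equal_variance"}),
    ),
    CandidateTest(
        "Welch's t-test",
        frozenset({"two_groups", "independent_samples", "normal"}),
    ),
    CandidateTest(
        "Mann-Whitney U",
        frozenset({"two_groups", "independent_samples"}),
    ),
)


def valid_tests(properties):
    """Return the names of tests whose requirements are all met by `properties`."""
    held = frozenset(properties)
    return [t.name for t in CATALOGUE if t.requires <= held]


if __name__ == "__main__":
    # The user declares the study design and the assumptions they accept.
    print(valid_tests({"two_groups", "independent_samples", "normal"}))
```

In this simplified version, dropping the "normal" property leaves only the non-parametric test, which is the kind of mistake-prevention the abstract describes.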

This is awesome. I currently use R to run simple stats, but I make a lot of mistakes and have to spend a lot of time verifying that I'm using appropriate tests.

Yep. I love R, but serious statistical testing could really benefit from a strong framework that helps prevent mistakes. R's highly dynamic nature goes against that.

Great point. I wish R had an interface to a well-typed language.

I'm surprised Microsoft hasn't done anything in this regard, despite the fact that they own R.

There's an interesting interface from F# to R, though it is not maintained by Microsoft: http://bluemountaincapital.github.io/FSharpRProvider/

The most important errors in R are not related to data type safety (though NA/NULL handling is a common issue); they're about the statistical soundness of the analysis you're conducting: experimental design, implicit assumptions and invariants, confounding factors, and whether the statistical tests you apply are appropriate for the distribution of your data. It sounds like tea-lang helps with the latter part.

Microsoft definitely doesn't "own" R, just one distribution of R (Microsoft R Open) and a package repository (MRAN).

(I'm a Microsoft employee and heavy user of R at work)

This is so interesting/cool to learn about. Thanks for sharing :)

You're spot on - Tea is focused on ensuring statistical soundness. Another way to think about this: Tea helps people analyzing data from studies to "maximize internal validity."

I hope to support integration with existing tools/common workflows in the future. Microsoft R Open definitely looks like an interesting target.

It'd be great to follow up. I visit Microsoft in Redmond periodically. If you're interested, email me! (<my username> at cs.washington.edu)

Cool, I'll shoot you an e-mail. I'm very interested in the Tea project and potentially contributing to it too. It's the sort of thing that could make every working data scientist noticeably more effective at their jobs.

I was thinking about whether it'd be possible to write something like Tea for R, but then I found there is no R package for Z3 nor any other SMT solver! So that would be an extra challenge...

It's an open-source Python package that can be used in a Jupyter notebook. Source is available at https://github.com/emjun/tea-lang

This looks really useful! Is there a typo in the example code, where the 'So' variable is assumed to be normally distributed (I think it's meant to be the 'Prob' variable)?

I tended to avoid statistics in the past, knowing just enough to realise how easy it is to draw wrong or misleading conclusions. I recently had to perform some data analysis and got a bit paralysed at the thought of choosing the wrong approach; something like Tea would have been great, if only to give external validation/justification to the decisions I made.

Yes, nice catch! Thanks :) In case you're interested, we want to explore opportunities to expand the "grammar" of assumptions people might want to express.

And, that's our hope. We'd love to increase people's awareness and confidence in their analyses.

There are a few errors in the paper. "HCI" is not defined. Figure 2 refers to independent and dependent variables for an observational study, but the text under "Study Design" says they should be "contributor" or "outcome".

The software has zero releases, so there is no way to refer to a specific version when commenting or to track possible bug fixes.

I like the concept. Adding a graphic representing the model would help make it interpretable by non-numerate stakeholders, such as management. Or politicians (!)

It would be helpful to have some statement on future support. Is this going to be around for a few years?

Thanks so much for pointing these out :) Yes, as we speak, I am working on updating the README, supporting docs, and a better website for the language/project. I will let you know once I have these up and running! It would be great to get your feedback.

My plan is to continue building on Tea for at least a few more years and provide support for new use cases, statistical methods, etc. Are there topics/concerns you'd like to see addressed? :)

I read and shared your paper and GitHub link with the rest of my data science team here at Microsoft Azure. We do a lot of hypothesis testing of this nature in our day-to-day work (using R or Python), and a tool like Tea would go a long way in helping us apply stats tests more efficiently and thoroughly. Good to know that you're planning to continue development; we'll be following with interest!

Some other areas to potentially explore could be support for power analysis, sequential analysis and stopping rules, for use in the pre-trial phase of controlled experiments such as A/B tests.
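For a sense of what a power-analysis feature might compute, here is a minimal sketch of the standard normal-approximation sample-size formula for a two-sided, two-group comparison of means. The function name `n_per_group` and its defaults are assumptions for illustration, not part of Tea; exact t-based calculations give slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample test of means.

    Normal approximation: n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2,
    where d is the standardized effect size (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must be strictly between 0 and 1")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)  # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(f"d={d}: about {n_per_group(d)} subjects per group")
```

Because the required sample size scales with 1/d^2, halving the detectable effect roughly quadruples the sample, which is why this matters in the pre-trial phase of an A/B test.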

I'd also suggest adding a LICENSE file to your project to clarify how it's allowed to be used.

This is extremely interesting. This paper does a few different things that are worth separating:

1. Structures knowledge about common statistical tests, the assumptions they need to satisfy, and the kinds of problems for which they are appropriate.

2. Models this as a constraint satisfaction problem.

3. Creates a DSL to write specifications for statistical testing.

4. Develops an output format for statistical testing based on (3).

5. Provides a Python package that implements the above.

IMO (1) is by far the most fundamental of these, but the paper spends most of its time describing (3). I'll admit to not fully understanding (2), what the alternatives were, and why it's an appropriate choice - I wish there were more on that in the paper as well.
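For readers unsure what (2) means in practice, here is one way to picture it, as a sketch under stated assumptions rather than the paper's actual encoding. Each test gets an on/off variable, the constraint is "a test may be on only if all of its assumptions hold", and a search looks for the largest consistent assignment. The `REQUIRES` table and `solve` function are hypothetical, and brute-force enumeration is swapped in where an SMT solver such as Z3 (mentioned elsewhere in this thread) would normally go.

```python
from itertools import product

# Hypothetical requirements per test; not Tea's real knowledge base.
REQUIRES = {
    "students_t": {"normal", "equal_variance"},
    "welch_t": {"normal"},
    "mann_whitney_u": set(),
}


def solve(assumptions_held):
    """Enumerate every on/off assignment of tests and return the largest consistent one.

    Constraint: a test may be switched on only if every assumption it requires holds.
    """
    tests = sorted(REQUIRES)
    best = set()
    for bits in product([False, True], repeat=len(tests)):
        chosen = {t for t, on in zip(tests, bits) if on}
        consistent = all(REQUIRES[t] <= assumptions_held for t in chosen)
        if consistent and len(chosen) > len(best):
            best = chosen
    return best


if __name__ == "__main__":
    print(sorted(solve({"normal"})))
```

With constraints this independent, the search reduces to simple filtering; the solver framing pays off once constraints interact (for example, when properties of the data imply or exclude one another), which is where a dedicated solver replaces the exponential enumeration.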

If the approach to (1) were described in more depth and published in a usable way outside of the implementation in (5), I could see it being broadly useful and leading to an ecosystem of different (competing) implementations. As it stands, the work on (1) is spread across a few functions in solver.py.

Looking more broadly, statistics-as-an-HCI-problem is fascinating, and I think this is a promising start. I'd also love to see more attention paid to (4), the output format. Most "doing harm with statistics", I believe, comes from misunderstanding/misapplying results, so looking at the full workflow / user story is critical. The focus of this paper is instead making the input easier, which is also obviously important.

emjun, so glad to see you in the comments here! What's the best way for interested folks to contribute to this project?

Hi! I, too, think statistics-as-an-HCI-problem is a cool framing that allows for new methods/solutions :)

Your observations and distinctions about what Tea does are great.

I am currently trying to fix some bugs/make improvements from the deadline push. After that, I plan to restructure the internals to make the logic/reasoning/implementation easier to follow and extend.

I would love contributors! It would be great if you could watch the repo, open an issue, or open a pull request on GitHub (https://github.com/emjun/tea-lang).

If there are enough interested collaborators, it might be worth opening up a Gitter or Slack group. In the meantime, if you'd like to chat more, please feel free to email me! (<my username> at cs.washington.edu)
