This is awesome. I currently use R to run simple stats, but I make a lot of mistakes and have to spend a lot of time verifying that I'm using appropriate tests.
I'm surprised Microsoft hasn't done anything in this regard, despite the fact that they own R.
The most important errors in R are not at all related to data type safety (though NA/NULL handling is a common issue); they're about the statistical soundness of the analysis you're conducting: experimental design, implicit assumptions and invariants, confounding factors, and whether the statistical tests you're applying are appropriate for the distribution of the data you're applying them to. It sounds like tea-lang helps with the latter part.
Microsoft definitely doesn't "own" R, just one distribution of R (Microsoft R Open) and a package repository (MRAN).
(I'm a Microsoft employee and heavy user of R at work)
You're spot on - Tea is focused on ensuring statistical soundness. Another way to think about this: Tea helps people analyzing data from studies to "maximize internal validity."
I hope to support integration with existing tools/common workflows in the future. Microsoft R Open definitely looks like an interesting target.
It'd be great to follow up. I visit Microsoft in Redmond periodically. If you're interested, email me! (<my username> at cs.washington.edu)
I was thinking about whether it'd be possible to write something like Tea for R, but then I found there is no R package for Z3 nor any other SMT solver! So that would be an extra challenge...
I tended to avoid statistics in the past, knowing just enough to realise how easy it is to draw wrong or misleading conclusions. I recently had to perform some data analysis and got a bit paralysed at the thought of choosing the wrong approach; something like Tea would have been great, if only to give external validation/justification to the decisions I made.
And that's our hope. We'd love to increase people's awareness of, and confidence in, their analyses.
The software has zero releases, which makes it hard to comment on specific versions or to track possible bug fixes.
I like the concept. Adding a graphic representing the model would help make it interpretable by non-numerate stakeholders, such as management. Or politicians (!)
It would be helpful to have some statement on future support: is this going to be around for a few years?
My plan is to continue building on Tea for at least a few more years and provide support for new use cases, statistical methods, etc. Are there topics/concerns you'd like to see addressed? :)
Some other areas to potentially explore could be support for power analysis, sequential analysis and stopping rules, for use in the pre-trial phase of controlled experiments such as A/B tests.
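To make the power analysis suggestion concrete, here is a minimal sketch of the kind of pre-trial calculation I have in mind: the per-group sample size for a two-sided, two-sample comparison of means, using the normal approximation n = 2 * ((z_(1-alpha/2) + z_(power)) / d)^2. The function name and defaults are my own, not anything Tea provides, and an exact t-based calculation would give slightly larger numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample test of means.

    effect_size is Cohen's d (difference in means / pooled std dev).
    Uses the normal approximation, so it slightly underestimates the
    n that an exact t-based calculation would give.
    Hypothetical helper for illustration; not part of Tea.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must be in (0, 1)")

    std_normal = NormalDist()                      # mean 0, std dev 1
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)    # two-sided critical value
    z_power = std_normal.inv_cdf(power)            # quantile for desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # A "medium" effect (d = 0.5) at alpha = 0.05 and 80% power
    print(sample_size_per_group(0.5))  # prints 63
```

Sequential analysis and stopping rules need more machinery than this (the critical value changes with each interim look), so I've left them out of the sketch.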
I'd also suggest adding a LICENSE file to your project to clarify how it's allowed to be used.
1. Structures knowledge about common statistical tests, the assumptions they need to satisfy, and the kinds of problems for which they are appropriate.
2. Models this as a constraint satisfaction problem.
3. Creates a DSL to write specifications for statistical testing.
4. Develops an output format for statistical testing based on (3).
5. Provides a Python package that implements the above.
IMO (1) is by far the most fundamental of these, but the paper spends most of its time describing (3). I'll admit to not fully understanding (2), what the alternatives were, and why it's an appropriate choice - I wish there were more on that in the paper as well.
If the approach to (1) were described in more depth and published in a usable way outside of the implementation in (5), I could see it being broadly useful and leading to an ecosystem of different (competing) implementations. As it stands, the work on (1) is spread across a few functions in solver.py.
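To illustrate what I mean by publishing (1) in a usable way, here is a toy sketch of a test catalog as plain data, with test selection reduced to checking which tests have all their assumptions satisfied. Everything here is my own assumption: the property names, the (deliberately simplified) assumption lists, and the subset check, which is a much simpler stand-in for a real constraint solver. It is not how solver.py works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatTest:
    """One catalog entry: a test name plus the properties it requires."""
    name: str
    requires: frozenset


# Hypothetical, simplified catalog; real assumption lists are longer.
CATALOG = [
    StatTest("Student's t-test",
             frozenset({"two_groups", "independent", "continuous",
                        "normal", "equal_variance"})),
    StatTest("Welch's t-test",
             frozenset({"two_groups", "independent", "continuous", "normal"})),
    StatTest("Mann-Whitney U",
             frozenset({"two_groups", "independent", "ordinal"})),
    StatTest("Paired t-test",
             frozenset({"two_groups", "paired", "continuous", "normal"})),
]

# Properties implied by other properties (continuous data is also ordinal).
IMPLIES = {"continuous": {"ordinal"}}


def valid_tests(facts):
    """Return names of catalog tests whose requirements all hold."""
    known = set(facts)
    for fact in facts:
        known |= IMPLIES.get(fact, set())
    return [t.name for t in CATALOG if t.requires <= known]


if __name__ == "__main__":
    # Normality not established, so only the rank-based test qualifies.
    print(valid_tests({"two_groups", "independent", "continuous"}))
    # With normality (but not equal variance) established:
    print(valid_tests({"two_groups", "independent", "continuous", "normal"}))
```

A catalog in this shape could be shared across implementations, and a solver would earn its keep once assumptions start to interact, which a flat subset check like this can't express.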
Looking more broadly, statistics-as-an-HCI-problem is fascinating, and I think this is a promising start. I'd also love to see more attention paid to (4), the output format. Most "doing harm with statistics", I believe, comes from misunderstanding/misapplying results, so looking at the full workflow / user story is critical. The focus of this paper is instead making the input easier, which is also obviously important.
emjun, so glad to see you in the comments here! What's the best way for interested folks to contribute to this project?
Your observations and distinctions about what Tea does are great.
I am currently trying to fix some bugs/make improvements from the deadline push. After that, I plan to restructure the internals to make the logic/reasoning/implementation easier to follow and extend.
I would love contributors! It would be great if you could watch, open an issue, or open a pull request on GitHub (https://github.com/emjun/tea-lang).
If there are enough interested collaborators, it might be worth opening up a Gitter or Slack group. In the meantime, if you'd like to chat more, please feel free to email me! (<my username> at cs.washington.edu)