

Show HN: Statwing is statistical analysis, simplified  - glaugh
https://www.statwing.com/

======
taliesinb
I'm glad to see other people working on this problem. We (Wolfram|Alpha) are
doing this too, starting out by making the 'easy cases' nearly
automatic.

Here's a blog-post describing our effort:
<http://blog.wolframalpha.com/2012/02/09/launching-a-democratization-of-data-science/>

You can also play around with our examples without having to sign up to
Wolfram|Alpha Pro. My favorite is an automatic analysis of the Titanic data
that nicely illustrates that while the motto "women and children first"
applied, being rich certainly helped:
[http://www.wolframalpha.com/input/?i=+&examplefile=1&...](http://www.wolframalpha.com/input/?i=+&examplefile=1&datasetfile=DataInput%2Fcategories-
numbers-genders&examplefile=1&datasetfile=DataInput%2Fcategories-numbers-
genders)

We cover other kinds of simple analysis and visualization too, like heat maps,
Venn diagrams, graphs, and so on. As always, feedback welcome.

~~~
glaugh
Agreed, glad to be a part of the community of folks trying to democratize data
analysis. It feels like an important problem to work on, and we're passionate
about it (as I'm sure you are, too).

Thanks for chiming in.

------
aphyr
The walkthrough took me through finding a correlation between voting
preference and neuroticism. Great! But it's also worth noting that this
dataset shows _larger_ effect sizes at similar CIs for the correlation between
[preference and age] and [age and neuroticism]. This, folks, is why ANOVA is
important.

That aside, the product was clear, fast, and intuitive. Well-chosen
visualizations and a clean emphasis on the important moments for basic
covariate analysis. Well done.

~~~
glaugh
Nice, well pointed out.

Unfortunately, we don't have regressions yet. But just to make sure these
findings were still valid, we tossed this data into a program that does, and
the effect remained.

It's no substitute for regressions/ANOVA, but for now here's what Statwing can
do: add a filter that excludes datapoints below, say, 40 years old.
Looking only at folks older than 40, there's no relationship between
[neuroticism and age], but the relationship between [neuroticism and
preference] remains.

But, point taken. We'll count this as a vote for prioritizing regression.
Thanks!
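
A rough sketch of the concern raised above: when a third variable (age, in the tutorial) is related to both of the others, the raw correlation between them can overstate the direct relationship. Partial correlation is used here only as a simple stand-in for the regression/ANOVA approach mentioned; it is not what Statwing does, and the data below is synthetic, not the tutorial dataset. The names `pearson` and `partial_corr` are illustrative helpers.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after linearly controlling for zs."""
    rxy, rxz, ryz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Synthetic data: z drives both x and y, and x and y share nothing else.
z = [-1, -1, -1, -1, 1, 1, 1, 1]
e1 = [-1, -1, 1, 1, -1, -1, 1, 1]
e2 = [-1, 1, -1, 1, -1, 1, -1, 1]
x = [2 * zi + a for zi, a in zip(z, e1)]
y = [2 * zi + b for zi, b in zip(z, e2)]

raw = pearson(x, y)                 # strong correlation, driven entirely by z
controlled = partial_corr(x, y, z)  # essentially zero once z is controlled for
print(round(raw, 3), round(controlled, 3))
```

In the thread's case the effect reportedly survived a proper regression, so the controlled value there would not vanish the way it does in this constructed example.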

------
recardona
I do a lot of stat. analysis and I was impressed with the clarity of the
analyses. However, I ran through the Obama v. Romney tutorial and was
surprised to see that the software was averaging survey items (Likert-scale
data). I thought that this was not allowed since it is troublesome to
interpret the output (how do you interpret 8.36 Neuroticism?)

Aside from that, I can see this filling a need for those who are aware of the
importance of statistical significance but do not have the time to look up the
appropriate analysis function in R/SAS/SPSS/...

~~~
mgurlitz
You're right, they shouldn't be making that average. Specifically, Likert data
is ordinal, meaning 14 is less than 15 and greater than 13, but the gap
between 15 and 14 may differ from the gap between 14 and 13.

For example, let's say people measure neuroticism exponentially, and an
increase of one point means 10x perceived neuroticism. Because mean(log(x)) !=
log(mean(x)), the average won't be representative.

Everything else looks OK though: count, median, percentiles, a histogram.
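
A minimal sketch of the mean(log(x)) != log(mean(x)) point, assuming (as in the hypothetical above) that each extra point means 10x perceived neuroticism. The scores are made up for illustration.

```python
import math
import statistics

# Hypothetical Likert responses (made up for illustration).
scores = [1, 1, 2, 3, 5]

# Assumed mapping from the example above: each extra point = 10x perceived intensity.
intensity = [10 ** s for s in scores]

# Averaging on the Likert scale vs. averaging on the perceived scale.
mean_of_scores = statistics.mean(scores)
score_of_mean = math.log10(statistics.mean(intensity))

# The median depends only on order, so for an odd number of responses
# it gives the same answer on either scale.
median_of_scores = statistics.median(scores)
score_of_median = math.log10(statistics.median(intensity))

print(mean_of_scores, round(score_of_mean, 2))
print(median_of_scores, score_of_median)
```

The two means disagree by almost two points on the scale, while the medians agree, which is why order-based summaries are the safer choice for ordinal data.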

------
talbina
There was a company that applied to YC that wanted to do the "Google Docs for
Statistics" but was rejected. They wrote about it in a blog post but I can't
find it. They ended up not launching.

It will be worth it to connect with these people to see if there is anything
that can be learned from them.

~~~
glaugh
Definitely let us know if you think of their name or dig up their blog post.
Sounds interesting.

~~~
TrevorBurnham
I believe the company talbina's thinking of is mine: We applied as
Theoryville, got interviewed, got rejected, applied to Betaspring
(<http://betaspring.com/>), got accepted, changed our name to DataBraid, and
proceeded to fall apart over the course of a summer.

I do think the idea has a lot of potential, and what StatWing has built is
already more complete than what my team managed to build in 3 months. A few
suggestions I'd offer based on that experience:

1\. Parsing CSVs is easy in theory, but painful in practice, because CSVs in
the wild tend to be full of junk. I would provide a JSON API that makes it
easy for developers to put data in your system directly, allowing people to
build their own CSV parsers for you.

2\. Use GitHub as your model. You want people to collaborate around data the
same way that developers collaborate around code. Just about every day when I
was doing DataBraid, we'd discuss a use case and then say "Oh, GitHub already
figured out the right way to do this." The most compelling use case here is
that researchers can run different sets of tests on the same data and discuss
which approach is the most valid/insightful.

3\. Getting to revenue will be hard, but having paying customers will make it
much, much easier to attract investment. So find the MVP that people will pay
for and put everything else on a "nice-to-have" list.

Best of luck!
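
To illustrate point 1, here is a minimal sketch of the kind of junk a CSV importer has to tolerate before data can be handed over as JSON. The `csv_to_records` helper, the sample data, and the `{"rows": ...}` payload shape are all hypothetical; no actual Statwing API is implied.

```python
import csv
import io
import json

def csv_to_records(text):
    """Parse messy CSV text into a list of dicts, skipping obvious junk."""
    # Drop a leading byte-order mark, a common artifact of spreadsheet exports.
    rows = csv.reader(io.StringIO(text.lstrip("\ufeff")))
    header = None
    records = []
    for row in rows:
        cells = [c.strip() for c in row]
        if not any(cells):
            continue  # blank line or row of empty cells
        if header is None:
            header = cells  # first non-empty row is treated as the header
            continue
        if len(cells) != len(header):
            continue  # ragged row: skipped here; a real importer would report it
        records.append(dict(zip(header, cells)))
    return records

# Made-up input with a BOM, a blank line, a short row, stray spaces, and an empty row.
messy = "\ufeffname, age\n\nAda, 36\nBob\n  Cy ,41  \n,\n"
payload = json.dumps({"rows": csv_to_records(messy)})
print(payload)
```

With a JSON entry point, cleanup rules like these can live in whatever parser a developer writes for their own data rather than in one importer that has to guess.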

------
jenius
Looks really great overall - props! One small design thing in there that
bothered me was how the gradients reverse in the buttons on hover - this
should never happen. Just lighten or darken the color on hover (move the
gradient up with background position and add a transition is a good trick),
then consider reversing the gradient on active (or just adding an inset
shadow).

Everything else in the design looks great and this is totally nitpicky, but
hope it helps!

~~~
lejohnq
I also work on Statwing so thanks very much for the comment.

Now that I look at it more, the front page buttons do look weird compared to
all of our other buttons. We've become numb to it after looking at it so
often. Most of our buttons do the design thing that you described, so we'll
change that shortly! Thanks for the feedback.

------
Bill_Dimm
Very nice. One tip: Don't require an email address to provide feedback and
you'll get more feedback.

A bug that I found in the tour for "Politics and the Big 5":

The instruction bubble says: _To run a different analysis, remove
"Neuroticism" from the white box by clicking the X to the right of the
variable name._ But, "neuroticism" is not one of the variables I was using. It
seems that something was hard-coded when it shouldn't have been.

~~~
glaugh
Ah, thanks a bunch. Appreciate both the bug and the feedback tip. Have a good
one, thanks for checking out Statwing.

------
kylemaxwell
This looks great and I look forward to running some analyses of the same test
data between Statwing and Wolfram|Alpha Pro in a mini-bakeoff.

EDIT: Can you talk about your business model any? Sort of a freemium service,
or maybe charging for a future API, or something along those lines? Please
don't say "ads".

~~~
glaugh
Fortunately, people are pretty used to paying for this kind of a product. So
we'll do freemium based on number/size of datasets uploaded and some as-yet-
unreleased advanced features. Probably throw in some academic discounts for
good measure.

Thanks for the question. Cheers!

~~~
kylemaxwell
Good to hear. I always like seeing cool sites have a way to make money so I
can have confidence they'll be around for a while. :)

------
grantjgordon
Very nice. Who's the target audience for this? Students? Curious enthusiasts?
Analysts within companies?

~~~
glaugh
We think of our target audience in concentric circles. We'll likely have users
from each circle at any given time, but we'll prioritize our product and
marketing towards the inner circles then move outwards:

Circle 1. A few specific analysts in a few specific companies we're associated
with. They analyze survey data, they use only basic functionality of the fancy
tools, and they want a simpler solution.

Circle 2. People analyzing surveys generally. It's a straightforward
application where existing tools are way too complicated.

Circle 3. The rest of the 50% of stats tool users that never use more than the
core functionality of existing tools (that number is from our research).

Circle 4. People who analyze at work. In particular, Excel power-user analysts
and marketing folks for whom the go-to tool for analysis is the pivot table.
We want to ease them into the world of more powerful, statistical analysis. We
do a lot of usability testing with these folks and we're excited about their
reactions so far. But they're not in a lot of pain, so they're not a great
initial audience for us.

Grand vision stuff: Tools like SPSS and the like were built in the 80s, and
Excel pivot tables were built in the 90s. They've been updated but not
overhauled, and there's a gaping hole between them in terms of ease of use and
power. As small, rich datasets become ubiquitous, are people in 2020 really
going to be using tools from 1990? We hope not.

~~~
grantjgordon
Thanks! Very insightful.

------
kirillzubovsky
A statistics application that runs in the cloud and looks good too? Yes please!
Looking forward to playing around with the data to see what's possible. Where
are you guys planning to take this software?

------
tel
I'm worried for how quickly you can do tests with this interface. I feel my
fingers urging for hypothesis hunting---do you have multiple comparison
corrections in place?

~~~
glaugh
Totally valid. We do multiple comparison protection on ANOVA post hoc tests,
but not across all analyses.

Ultimately we'll need to address this. Hopefully doing so (automatically) will
differentiate Statwing from other stats packages, where one is quite free to
shoot one's self in the foot (and one often does).

We'll count this as a vote for the prioritization of that feature.

Thanks for the comment, really appreciate it.
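
For readers unfamiliar with multiple comparison corrections, here is a minimal sketch of one common method, the Holm-Bonferroni step-down procedure. The thread does not say which correction Statwing applies to its post hoc tests, so this is only an illustration, and the p-values are made up.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject

p_values = [0.01, 0.04, 0.03, 0.005]  # hypothetical results from four tests
naive = [p <= 0.05 for p in p_values]
corrected = holm_bonferroni(p_values)
print(naive)      # [True, True, True, True]
print(corrected)  # [True, False, False, True]
```

All four tests look significant when checked one at a time at 0.05, but only two survive the correction, which is the foot-shooting risk of quick hypothesis hunting.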

~~~
tel
Honestly, I wrote it as a disguised compliment. Doing tests quickly makes
statistical validation _available_ and that is solidly better than winging it
because you can't be bothered to do the math.

I did a bit of brainstorming previously about penalties and negative feedback
controls for hypothesis testing in a medical context. The metaphor I liked was
that you are buying hypothesis tests with data and therefore there is a
penalty risk for each attempted test. I never worked out the math very
thoroughly, but I'd love to see how a system like that would work live.

I think it'd be an amazing boon to your system to have these kinds of
feedback. You'd not only be easy and available but also _trusty_ since you
make sure you never promise too much.

Very cool project.

------
hashpipers234
I can do everything they can do in matlab with your data in less time and with
less hassle. my only price is a xmen comic book and a 6 pack of coke.

------
hokua
Similar to what Swivel was trying to do. Great idea, nice execution, but
really how will you monetize this? There is no real market for consumer grade
"intuitive" statistical software. While this will appeal to casual data
analyzers, these users aren't ready to spend much money on tools. And those
doing data analysis for a living prefer their power tools: R, SAS, Matlab,
NumPy, etc.

~~~
glaugh
Agreed that if you spend most of your day most days doing analysis of large
datasets you probably need a power tool.

But there's a whole class of overlooked folks who need to do statistical
analysis on smaller datasets on more of a weekly or several-consecutive-
intense-days-per-month basis. These folks, who split time between Excel and
stats tools, make up a surprisingly high proportion of the user base of stats
products. And they tell us they're willing to pay for something that makes
their analysis and communication more efficient.

Thanks for the comments, and for the kind words RE the idea and execution.

edit: And to be fair to your point, we're sort of comparing apples and oranges
insomuch as you're looking at what we have now (not nearly enough) and we're
looking at our roadmap for what we'll have in six months, a year, etc.

------
doleson
Are there any plans to add any realtime feeds? Like, say, weather data and
the Dow Jones close, to see any correlations?

~~~
georgek
+1. I really like the intuitive interface and the speed with which I can
conduct analysis. It would be great to have a library of feeds for each user
that is automatically curated / updated. This library could include both
public datasets (for free) but also proprietary feeds specific to my industry
or even my company that are only accessible by me (which I would pay for).

~~~
glaugh
This would indeed be super cool. Could definitely see us getting to this
eventually.

------
jqueryin
While I appreciate the graphs, I'd also like to see the numbers if I hover
over wording that says "Very clearly significant". What confidence interval
are we talking about? 95%?

If I were you, I'd hide this information from the average user but make it
available in a tooltip to those of us who care.

~~~
glaugh
Good call RE the tooltip.

Just for reference, everything's at 95% confidence. We do mention that in the
Advanced output but it's perhaps a bit too hidden.

------
leeny
The optional upgrade survey appears to be broken. After I submit the survey, I
get redirected back to the login page with my username in the query string.
After I click "login", I get the alert telling me I can take a survey to
upgrade. Rinse. Repeat.

------
dlf
I absolutely love this. I'm learning to code (slowly) and have an infatuation
with data visualization. I've imagined what something like this might look
like, and I think you guys absolutely nailed it. Well done!

~~~
dlf
P.S. I shared this with the Maxwell School alumni group on LinkedIn, so
hopefully that drives some traffic your way! I think that my fellow MPA alums
will dig it.

------
duaneb
Very cool. Why should I use this instead of R/gnuplot?

~~~
lejohnq
Thanks!

We are trying to make Statwing automatically display the right analyses for
the portions of your data you are most interested in.

If we can accomplish that, then hopefully we've helped make you faster at
understanding the relationships in your data. Maybe that is enough so you
don't need to break out R for basic analyses. Otherwise I would also use R. I
made some graphs in R that I wouldn't be able to make in Statwing right now, but
if we can output the right things based on your data then hopefully you could
save some time with us.

------
leeny
I'd like to throw in another vote for prioritizing regressions and
specifically adding logistic regressions to the mix. Thanks!

------
mcarvin
demo is very cool. love anything that can make pattern recognition in large
datasets this much easier.

------
danso
1\. Great looking product. I clicked through a little bit and liked the
general polish, but didn't have time to explore everything.

2\. For people whose jobs involve statistical analysis, how much need is there
for something like this? The more analysis I do, though, the more I realize
that the hard part is collecting the data and programmatically "piping" it
from package to package...And from professionals I know in various numbers-
based industries, their biggest blind spot seems to be that ability to gather
data that doesn't come in a CSV/Excel sheet for them.

* edit: in addition to the challenges presented above, the challenge of cleaning data so that a package like Statwing can do a proper analysis

~~~
glaugh
Thanks, really appreciate #1.

Agreed that quite often the hardest part is getting the data together
(particularly on the web). But from our perspective, it's still true that
conducting the actual analysis and visualizing the data should be a lot easier
than it is. And that's particularly true if you're in our initial audience,
the roughly 50% of SPSS/R/Minitab/etc. users who never use anything past the
basic functionality of those programs.

I guess a simpler answer is that we think there's a need because this is a
product that I badly wanted when I was an analyst/consultant, splitting time
between Excel and the basic functionality of SPSS.

edit: Also, and this isn't very helpful, but we talk to a _ton_ of people
about their data analysis needs, and we hear a good chunk of them talk a lot
about the pain of using highly technical solutions for relatively simple
problems like analyzing a survey.

~~~
danso
More notes:

#3 I have to say, my first impression was that the tutorial was a little
annoying, but it's actually done pretty slickly and it introduces features,
such as the multiple variable analysis, that I probably would not have
stumbled upon in the first place. Well done.

#4 That's my SOPA project you're referencing! :)
<https://www.statwing.com/demos/sopa> (though if you can edit the copy, credit
should also go to the Center for Responsive Politics, from which the campaign
finance data was collected)

~~~
glaugh
Nice! It's a great dataset, really fun, and we're big fans of ProPublica.

We'll definitely edit the copy. I'll ping you unicast to make sure we did it
right. Yay!

edit: Also, thanks for the feedback on the tour. Our goal is to make the
interface so intuitive that it shouldn't require the tour to know what to do.
We've got some updates in mind that should get us much closer to that goal.

------
fredsters_s
looks really awesome. interested to see what if any data analysis can be
linked to current events.

------
Flenser
"Female tends to have slightly higher values for Neuroticism than Male"

