
Apply HN: Statistics as a Service - augb
Basic Concept:
Provide simple-to-use, basic, automated statistics to a user based on an
uploaded file (e.g., a CSV) to help them spot trends, data anomalies, etc.
Deliverables would be either a PDF report or JSON, according to user
selection (possibly with the actual cleansed data used to generate the
results).

This is in the idea stage.

Business Model:
Freemium

Pricing tiers:
1. Free
2. Pro (additional, more advanced statistical smörgåsbord* to choose from,
suggested stats to run, etc.)
3. API (programmatic access to Free and Pro tier stats)
4. Custom (could be custom stats implementations w/ API access, more of a
consulting project, or even a statistician on call, etc.)

* Though initially the options would start relatively small, they would be
added to as time progresses and it makes sense to do so.

Use Cases:
1. Knowledge workers at an SMB or local government entity with little access
to data analytics or statisticians, but who still need to make use of
statistics.
2. Dev shops needing stats, but without the time, in-house resources, etc.
to bother with it.

It is *not* an attempt...
* at big data analytics
* at building an AI/ML service
* to be a replacement for R, Python, Julia, et al.
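As a rough sketch of the concept above, here is what the JSON deliverable for basic per-column statistics might look like, using only the Python standard library. The function name, output shape, and field names are my own invention, not anything the author has specified:

```python
import csv
import io
import json
import statistics

def basic_report(csv_text):
    """Sketch: compute basic stats for each numeric column of a CSV
    and return them as a JSON report (output shape is hypothetical)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for col in rows[0]:
        values = []
        for row in rows:
            try:
                values.append(float(row[col]))
            except ValueError:
                pass  # skip non-numeric cells in this column
        if values:
            report[col] = {
                "count": len(values),
                "mean": statistics.mean(values),
                "median": statistics.median(values),
                "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            }
    return json.dumps(report, indent=2)
```

A real service would of course need far more robust CSV handling (encodings, missing headers, messy cells), which is exactly the edge-case territory the kalzumeus article below discusses.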
======
gus_massa
Somewhat related: "Design and Implementation of CSV/Excel Upload for SaaS"
[http://www.kalzumeus.com/2015/01/28/design-and-implementatio...](http://www.kalzumeus.com/2015/01/28/design-and-implementation-of-csvexcel-upload-for-saas/)

Perhaps you can find some ideas there.

~~~
augb
Thank you for the pointer. Problems such as CSV/Excel handling are non-
trivial, but, thankfully, there are many OSS options out there to help with
some of the edge cases.

------
joehilton
I'm interested in this. Would you have a way for me to pipe you lots of data?
Like way bigger than a CSV file upload?

~~~
augb
Possibly. The API would likely be the venue to handle this. What do you have
in mind?

------
buss
I think one of the big problems with statistics for most people, besides the
complexity of the math, is understanding when to use which techniques. How
will you help the user choose the correct way to analyze their data?

~~~
augb
Thank you for your reply. I agree with you.

For the Free Tier, the idea is to provide some basic descriptive statistics,
perhaps with a histogram and a normal distribution curve overlaid on top.
Some example values that might be calculated are the mean, median, and mode,
along with the standard deviation. Down the road a bit, I could envision a
feature to suggest further exploration, which would likely guide the user
towards some of the Pro features. (Although I would like to expand what is
available in the free tier, where it makes sense.)
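The histogram-with-normal-overlay idea described above can be sketched with the standard library alone: bin the values, then evaluate a normal density (scaled to the histogram's area) at each bin center so the curve can be drawn over the bars. Function and variable names are mine, and the sketch assumes at least two distinct values:

```python
import math
import statistics

def histogram_with_normal(values, bins=10):
    """Sketch: histogram bin counts plus a scaled normal curve
    evaluated at each bin center (assumes non-constant data)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # Clamp the max value into the last bin.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    # Scale the normal pdf by (count * bin width) so the curve
    # matches the histogram's total area.
    scale = len(values) * width
    curve = [
        scale * math.exp(-((c - mu) ** 2) / (2 * sigma ** 2))
        / (sigma * math.sqrt(2 * math.pi))
        for c in centers
    ]
    return counts, curve
```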

For the Pro Tier, they would, of course, have access to the same results as
the Free Tier, but I imagine a wizard-like interface (which can be turned
off) to help guide the user through choosing what they want done. Initially,
the focus would be on helping spot basic trends and more obvious data
anomalies. Further down the road, I could see the options expanding.

One of the challenges to be handled is the "width" of the data set. For
example, if User A has a CSV with 100 columns, the basic calculations may
not be a big deal, but do we present 100 histograms? What if there are
1,000 columns? My current thinking is for the user to prioritize the
columns they want analyzed if they exceed a certain threshold (say, 25
columns). Results would then be generated only for the columns up to this
threshold.
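The column-threshold idea could look something like the following sketch. The cap of 25 comes from the example above; everything else (names, the fallback when no priority list is given) is an assumption on my part:

```python
# Hypothetical cap from the example above: analyze at most 25 columns.
COLUMN_THRESHOLD = 25

def columns_to_analyze(all_columns, user_priority=None):
    """Sketch: under the threshold, analyze everything; over it,
    honor the user's priority order, dropping unknown names."""
    if len(all_columns) <= COLUMN_THRESHOLD:
        return list(all_columns)
    ordered = user_priority or all_columns
    known = set(all_columns)
    picked = [c for c in ordered if c in known]
    return picked[:COLUMN_THRESHOLD]
```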

I am open to ideas and suggestions on this, and even on things such as which
"basic" descriptive statistics are most important.

Edit: Clarification (last sentence)

