

Ask HN: Recommended stack for a data-heavy application? - lemma

My brother and I had an idea for an application that will perform statistical analysis on arbitrary data sets. We plan to work on it as a side project over the summer (or longer if need be) to learn, as well as to potentially build it into a business if it turns out as planned. I'm currently doing research to see what would be the best tech stack to build this on.

I have experience with LAMP, but I'm not sure this would be the best tool for what we have in mind (he has no programming experience). Ideally, I want:

- A server that can efficiently handle large uploads (mostly csv/spreadsheets). Could this just be a matter of configuring Apache properly?

- A language/framework that is good/efficient for stats. (Maybe some combination of Python and R?)

- A database that can handle arbitrary data sets and preferably integrates well with the previous tools. I forced MySQL to do this in a proof-of-concept, but there are probably better tools for the job.

- A graphing/visualization tool. I kind of like rgraph.net for HTML5 charts, but I'm open to recommendations.

- Lastly, a lightweight/simple framework that handles all the typical features of a web app (user registration and management for now).

As I mentioned, this project is primarily for learning at this point, so I'm pretty open to any recommendations. Feel free to ask anything and let me know if I'm missing anything.

Thanks!
======
pgroves
Allowing datasets of arbitrary size is going to make things tough. My first
thought is to keep the data as .csv files on Amazon S3 or some other
persistent storage service. Getting a database tuned is tough even when you
know what data you have up front; Hadoop wouldn't be quite as bad, but it
still wouldn't be trivial.
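
A minimal sketch of that approach, assuming boto3 and an existing bucket (the
bucket name and file paths here are made up); upload_file switches to multipart
uploads for big files on its own:

    import boto3

    # Assumes AWS credentials are already configured (env vars, ~/.aws, etc.).
    # Bucket name and key are placeholders.
    s3 = boto3.client("s3")

    # upload_file transparently uses multipart upload for large files,
    # so a multi-gigabyte CSV never has to be held in memory at once.
    s3.upload_file(
        Filename="uploads/user123_dataset.csv",
        Bucket="my-dataset-bucket",
        Key="datasets/user123/dataset.csv",
    )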

If you do that, I would recommend looking at WEKA's ARFF file format. It's a
really clunky format, but it captures a bunch of metadata (data types,
max/min, etc.) needed by many typical machine learning algorithms. You could
capture that kind of metadata as the data is being loaded, which would make
later analysis easier.
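
To illustrate, here's a rough sketch of collecting that metadata in one pass
over a CSV and emitting an ARFF-style header. The column-type inference is
deliberately simplistic, and a real version would cap nominal cardinality and
quote values properly:

    import csv

    def csv_to_arff_header(csv_path, relation="dataset"):
        """One pass over a CSV: infer numeric vs. nominal columns, track
        min/max, and return an ARFF-style header string."""
        with open(csv_path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            stats = [{"numeric": True, "min": None, "max": None, "values": set()}
                     for _ in header]
            for row in reader:
                for i, cell in enumerate(row):
                    col = stats[i]
                    try:
                        x = float(cell)
                        col["min"] = x if col["min"] is None else min(col["min"], x)
                        col["max"] = x if col["max"] is None else max(col["max"], x)
                    except ValueError:
                        col["numeric"] = False
                    col["values"].add(cell)

        lines = ["@RELATION " + relation, ""]
        for name, col in zip(header, stats):
            if col["numeric"]:
                # ARFF comments start with '%'; record the observed range.
                lines.append("%% %s: min=%s max=%s" % (name, col["min"], col["max"]))
                lines.append("@ATTRIBUTE %s NUMERIC" % name)
            else:
                lines.append("@ATTRIBUTE %s {%s}" % (name, ",".join(sorted(col["values"]))))
        lines.append("")
        lines.append("@DATA")
        return "\n".join(lines)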

After that, you'd be in a position where you can either stream the data out of
the CSV files or chunk the files into subsets for use in map-reduce style
algorithms. I'm not sure what the performance is like when you start
requesting the middle of a large file from S3, though.
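
For what it's worth, S3 does support byte-range reads, so pulling a chunk out
of the middle of a large object is a single request. A minimal sketch with
boto3 (bucket and key names are made up, and the caller still has to align
chunk boundaries to row boundaries):

    import boto3

    s3 = boto3.client("s3")

    # Fetch roughly the second 64 MB of the object via an HTTP Range request.
    chunk_size = 64 * 1024 * 1024
    start = chunk_size
    end = 2 * chunk_size - 1

    resp = s3.get_object(
        Bucket="my-dataset-bucket",
        Key="datasets/user123/dataset.csv",
        Range="bytes=%d-%d" % (start, end),
    )
    chunk = resp["Body"].read()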

As for a stats package, if you know Python, I'd go with it. There are a few
stats packages already out there that seem pretty good. But really, if you're
just going to do basic stats like averages, standard deviations, moving
averages over time, etc., those are pretty trivial to implement yourself. That
can be beneficial if you have very large data sets that can't fit in memory at
once and a custom way of accessing the data.
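
As a concrete example of how simple those are to do in a streaming fashion,
here's a sketch of Welford's online mean/variance plus a windowed moving
average; nothing here assumes the data fits in memory:

    from collections import deque

    class RunningStats:
        """Welford's online algorithm: mean and variance in one pass, O(1) memory."""
        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0   # sum of squared deviations from the current mean

        def add(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        def variance(self):
            return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def moving_average(values, window=5):
        """Yield the average of the last `window` values as they stream past."""
        buf = deque(maxlen=window)
        for x in values:
            buf.append(x)
            yield sum(buf) / len(buf)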

I should say I haven't used a lot of the newer whiz-bang analytics setups that
have been coming out, but in general my experience has been that working
around the idiosyncrasies of stats packages is usually more difficult than
implementing my own methods while using their code as a reference.

My final advice is to not adopt an analytics framework that has to be the top
level of the program. You really need to be able to control the analytics
engine programmatically from your application. Stay away from systems that
make you create modules or data flows inside their application, where the only
way to modify them is through a GUI or a complex config file. These systems
are everywhere. They are nice as a high-powered replacement for Excel, but not
when you are trying to develop a software application.

------
dscape
You can try MarkLogic Server. We have a free edition that will cover what you
need, and Office toolkits that are really cool (so you can work directly in
the Excel sheets).

------
phren0logy
Hadoop + Clojure + (Ring for web and Incanter for stats)?

