

Ask HN: Interest in a new open source, fast, in-memory, data analysis system? - siganakis

Over the past few months I have been experimenting with building a fast, distributed, in-memory, append-only database designed for analytics.  The idea is that many NoSQL databases are pretty terrible at ad-hoc querying, and while ordinary relational databases provide good support for ad-hoc queries, their performance leaves a lot to be desired on large data sets.  Basically an open source Vertica / KDB.

The system will accept data over HTTP (as JSON / CSV) and can be queried in either SQL or an SQL-like language, with full support for joins, sub-queries and aggregations, with output as CSV or JSON over HTTP.

The idea is that you can fire off a JSON request to your analytics database whenever something happens (be it a signup, purchase, click, etc.) and have that data captured, ready for use in your dashboards or for ad-hoc querying.  The system will also be able to integrate with R for statistical fanciness.

If such a system existed, would you use it?

I will probably continue working on it regardless of any feedback (because it's fun!), but I'll spend more time on it if it's something people feel they might use.

If you would like to develop it too (using a combination of C for data manipulation and Go for everything else), send me an email.
======
jknupp
I'm not sure I understand how your design overcomes the issues of either type
of database. You'll have unstructured data like a NoSQL database by virtue of
your insertion mechanism, and there is no mention of how you plan to shrink
record sizes by the orders of magnitude that would be required to keep all the
data in memory. As a thought exercise, say my average record takes up 1 KB.
After only a few million records, with zero overhead for the database
structures themselves (not to mention unrelated processes also running on the
system), you've already exceeded the amount of memory typically used to run
these types of processes.

------
pjin
You didn't include an email.

I've been having success with Go for backend work, and easy C integration in
a database like this would be useful for what I do. I'm not sure if
R integration should be a priority though, since most R users I know are
ambivalent about it (whatever, small sample size), and because Julia
development continues to improve.

~~~
siganakis
Thanks for your comment. My email is terence (at) siganakis [dot] com if you
would like to chat.

The R integration stuff is more for the data science crowd. I feel it would be
useful to have a streamlined way of writing a query to return a subset of a
large database and then be able to manipulate it in R as if it were a data
frame / matrix. I am interested in seeing how Julia plays out too.

