The web server itself is a homemade HTTP server I wrote in Go and jokingly called "nging". Doing a lot of SSI and transfer compression can be a bit much for that machine, but it turned out easier than maintaining my nginx config across upgrades.
I happened to have DB logs up, saw some queries that weren't mine, and traced them back here. Good morning, HN.
All I saw on the entry was a blurb about its concurrency features.
I was pretty much production-ready with seriesly in two weeks by myself (though I'm starting to get contributions from other users). Last night, I closed my last open issue, "bulk interface", by adding an optional memcached binary protocol interface with custom packets for database selection and streaming data in.
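For anyone who hasn't looked at the memcached binary protocol, here is a rough sketch of how a request packet is framed: a fixed 24-byte header followed by the key and the body. The header layout is the standard one; the `opSelectDB` opcode value and the `encodeRequest` helper are made up for illustration and are not taken from seriesly.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	reqMagic  = 0x80 // request magic byte in the memcached binary protocol
	headerLen = 24   // every binary-protocol packet starts with a 24-byte header

	// Hypothetical opcode for "select database". The real value seriesly
	// uses is not given in the post.
	opSelectDB = 0xf0
)

// encodeRequest builds a request packet with no extras:
// header, then key, then body.
func encodeRequest(opcode byte, key, body []byte, opaque uint32) []byte {
	buf := make([]byte, headerLen+len(key)+len(body))
	buf[0] = reqMagic
	buf[1] = opcode
	binary.BigEndian.PutUint16(buf[2:4], uint16(len(key)))
	// buf[4] (extras length), buf[5] (data type) and buf[6:8] (vbucket)
	// stay zero in this sketch.
	binary.BigEndian.PutUint32(buf[8:12], uint32(len(key)+len(body)))
	binary.BigEndian.PutUint32(buf[12:16], opaque)
	// buf[16:24] is the CAS field, left as zero.
	copy(buf[headerLen:], key)
	copy(buf[headerLen+len(key):], body)
	return buf
}

func main() {
	pkt := encodeRequest(opSelectDB, []byte("mydb"), nil, 1)
	fmt.Printf("%d bytes, header: % x\n", len(pkt), pkt[:headerLen])
}
```

A streaming write would presumably be a sequence of such packets on one connection, each carrying a timestamp key and a JSON body, but that part is a guess.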
Today, I started a new project with a new guy, and got some pretty impressive internal demos working after a couple hours of work.
I get lots of things done fast and reliably. This doesn't happen as much for me in other languages. I went into more details in the follow-up post where I described how I built seriesly (and keep in mind, I wrote this after it had only been alive for two weeks): http://dustin.github.com/2012/09/13/inside-seriesly.html
When the data set is small, an RDBMS or Mongo is fine. When the data set is big, the general advice is to go with Cassandra, Hadoop, HBase and friends, or a similar NoSQL route, which is a bit of a pain to deal with.
Where I think this tool fits nicely is when you don't want to (or can't) invest time in the big NoSQL cannons, but your data and data processing are NoSQL-like in nature and big enough to be painful with existing solutions.
It would be really cool to have support for the following features [please keep in mind that I am a finance person, not a developer ;-)]:
1. Ability to query data in batch [in addition to live] mode. The mode should be managed transparently by application code. Example: my dataset is too big, I am fine with delays, so I run data processing [aggregation, etc] as a scheduled task during the nightly maintenance window.
2. Ability to define an aggregation schema that can be saved in the database. This is very useful for the following use case: I know my typical aggregation pattern, so I define it in the database schema and schedule a batch processing task, which aggregates data according to the already defined db schema and saves the results into the database. When I need to query data later on, I can use the pre-computed data instead of running a live query each time. This feature is very important from an ease-of-use perspective: the ease of use comes from the db handling this instead of the application.
3. A transparent, easy way to manage availability: master-slave or master-master, with automatic failover.
4. Sharding data automatically and/or easily across available nodes.
5. Some way to ensure that data is never lost.
I really hope that this project will get rolling and expand.
P.S.: shoot me a note if you need me to elaborate...
Thoughts on your specific items:
1. I could probably prioritize the query/doc processing and get most of this out of the way, or something like what I've been thinking about for #4.
2. I've thought about this one for sure. It's actually possible to do externally already, just not very magically. I'll learn more when I get more internal people pushing it.
3. I've been tempted to add replication -- not because I need it, but because it's just really easy. master-slave is completely trivial. master-master isn't hard, but it requires tracking a tiny bit of state that I don't have an easy way to handle yet. It'd be worth it just for fun.
4. I have a lot of infrastructure for this. To be efficient, I need something like _all_docs that doesn't include the values and/or something like get that evaluates jsonpointer. Then you could pretty well round-robin your writes and have a front-end that does this last-step work. Harvest the range from all nodes concurrently while collating the keys. Once you find a boundary from every node, you have a fully defined chunk and can start doing reductions on it. A slightly harder, but more efficient integration is to have two-phase reduction and let the leaves do a bunch of the work while the central thing just does collation. You wouldn't be able to stream results in that scenario, though.
5. Is this as simple as disabling DELETE and PUT (where a document doesn't exist)?
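The front-end collation step from item 4 can be sketched roughly as follows. This is only an illustration under assumptions: `point`, `chunk`, `stream` and `collate` are hypothetical names, each node is simulated by a goroutine feeding an already-sorted slice into a channel, timestamps are non-negative integers, and the reduction is a plain sum per fixed-width window.

```go
package main

import "fmt"

// point is one stored sample; ts stands in for the timestamp key.
type point struct {
	ts  int64
	val float64
}

// chunk is the reduced result for one time window.
type chunk struct {
	start int64
	sum   float64
}

// stream feeds one node's sorted points into a channel, standing in
// for a concurrent range fetch from that node.
func stream(pts []point) <-chan point {
	ch := make(chan point)
	go func() {
		defer close(ch)
		for _, p := range pts {
			ch <- p
		}
	}()
	return ch
}

// collate merges the sorted node streams in key order and sums values
// per window. Because the merged stream is ordered, a window is fully
// defined as soon as the merge moves past its boundary: by then every
// node has either crossed that boundary or run dry.
func collate(nodes []<-chan point, width int64) []chunk {
	heads := make([]*point, len(nodes))
	for i, ch := range nodes {
		if p, ok := <-ch; ok {
			heads[i] = &p
		}
	}
	var out []chunk
	for {
		lo := -1 // index of the node holding the smallest pending key
		for i, h := range heads {
			if h != nil && (lo < 0 || h.ts < heads[lo].ts) {
				lo = i
			}
		}
		if lo < 0 {
			return out // every node is exhausted
		}
		p := *heads[lo]
		start := p.ts - p.ts%width
		if n := len(out); n == 0 || out[n-1].start != start {
			out = append(out, chunk{start: start})
		}
		out[len(out)-1].sum += p.val
		if next, ok := <-nodes[lo]; ok {
			heads[lo] = &next
		} else {
			heads[lo] = nil
		}
	}
}

func main() {
	a := stream([]point{{1, 1}, {12, 2}, {25, 3}})
	b := stream([]point{{3, 10}, {14, 20}})
	for _, c := range collate([]<-chan point{a, b}, 10) {
		fmt.Printf("window %d: sum %.0f\n", c.start, c.sum)
	}
}
```

The two-phase variant would have each node return partial sums per window instead of raw points, leaving the front-end to combine partials with the same merge loop.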
There's a cool hack
But not as cool as a time series db.
You might find this talk informative: http://www.cs.nyu.edu/shasha/papers/jagtalk.html
I'm building an open source behavioral database (time series over objects) called Sky (https://github.com/skydb/sky). It comes with an LLVM-backed language called Qip that is a mix between procedural & declarative and provides easy integration with C libs, so you could plug in a machine learning library (relatively) easily. It's fast too. It'll crunch through tens of millions of events per second on a single core.
I'm releasing the initial v0.1.0 at the end of the month. Shoot me a message on Twitter (@benbjohnson) if you want some more information.
How important of a goal was it to follow/diverge from pp/sql's lead in designing Qip?
Thank you for your response!
I think you mean cache invalidation. ;)
The way I like to think about document-oriented databases is that you store what you have when you have it, and worry about what it was later when you need to get things back out of it.
e.g. the big bag of stuff I mentioned in the blog post contains a few things I know I don't need, a few things I think I probably need, and a lot of stuff I just don't want to think about (I might need it later, maybe after some manipulation, etc...). Lob it all in.
The downside of a system like seriesly vs. a system like rrd (or any modern equivalent) is the same as the downside of any nosql database vs. a sql database. By planning up front, I can keep the size down and get more performance by incrementally computing stuff from the beginning. In the meantime, I'll just buy more disk. :)
I'm reasonably happy with the performance, though. There's a good number of visitors on the page right now and this is what they're seeing:
2012/09/11 12:28:50 Completed query processing in 82.54ms, 6,266 keys, 1,280 chunks