
Seriesly - a document-oriented time-series database written in Go - mgrouchy
http://dustin.github.com/2012/09/09/seriesly.html
======
dlsspy
BTW, my demo site is running in my bedroom on a slow ARM5-based debian box
over my terribly slow DSL. If things get slow, that's why.

The web server itself is a homemade http server I wrote in go and jokingly
called "nging". Doing a lot of SSI and transfer compression can be a bit much
for that machine, but it got easier than maintaining my nginx config on
upgrades.

I happened to have the DB logs up, saw some queries that weren't mine, and
traced them back here. Good morning, HN.

~~~
michaelcampbell
If I may ask, why Go? As an excuse to exercise your Go skills, or did you pick
it for some specific reason? (In other words, were you doing this for Go, or
doing Go for this?)

All I saw on the entry was a blurb about its concurrency features.

~~~
dlsspy
What else would I use? I've been writing tons of go code for nearly three
years now. I get fast, concurrent, parallel code in very little time.

I was pretty much production-ready with seriesly in two weeks by myself
(though I'm starting to get contributions from other users). Last night, I
closed my last open issue "bulk interface" by making an optional memcached
binary protocol interface with custom packets for database selection and
streaming data in.
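
For the curious, here is a rough sketch of what building frames for that
interface could look like. Only the standard 24-byte binary protocol request
header layout is real here; the opcodes, key, and value conventions are
placeholders rather than seriesly's actual wire format.

    // Sketch of framing memcached-binary-protocol requests; the custom
    // opcodes and payloads below are hypothetical.
    package main

    import (
        "bytes"
        "encoding/binary"
    )

    const (
        reqMagic    = 0x80 // standard binary protocol request magic
        opSelectDB  = 0xf0 // hypothetical custom opcode: select database
        opStreamDoc = 0xf1 // hypothetical custom opcode: stream a document in
    )

    // packet frames one request: 24-byte header, then key, then value.
    func packet(opcode byte, key, value []byte) []byte {
        var buf bytes.Buffer
        hdr := make([]byte, 24)
        hdr[0] = reqMagic
        hdr[1] = opcode
        binary.BigEndian.PutUint16(hdr[2:4], uint16(len(key)))
        binary.BigEndian.PutUint32(hdr[8:12], uint32(len(key)+len(value)))
        buf.Write(hdr)
        buf.Write(key)
        buf.Write(value)
        return buf.Bytes()
    }

    func main() {
        // Select a database, then stream documents at it; a real client
        // would write these frames down one TCP connection.
        frames := [][]byte{
            packet(opSelectDB, []byte("testdb"), nil),
            packet(opStreamDoc, []byte("2012-09-11T12:28:50Z"),
                []byte(`{"temp": 21.4}`)),
        }
        _ = frames
    }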

Today, I started a new project with a new guy, and got some pretty impressive
internal demos working after a couple hours of work.

I get lots of things done fast and reliably. This doesn't happen as much for
me in other languages. I went into more details in the follow-up post where I
described how I built seriesly (and keep in mind, I wrote this after it had
only been alive for two weeks):
<http://dustin.github.com/2012/09/13/inside-seriesly.html>

~~~
michaelcampbell
Sorry, I didn't know your history. I was truly curious to know if this was a
problem you decided to solve strictly so you could use Go, or if that was just
your, pardon me, "go to" language. It appears the latter. I hope I'm not
implying that you SHOULDN'T have used it; my query was truly just curiosity,
since I'm curious about the language. I'm trying to get a feel for what other
people are doing with it "for real".

------
daemon13
This is a very cool project that can be very useful in certain specific use
cases. One example would be working with medium-sized time-series business
data.

When the data set is small, an RDBMS or Mongo is fine. When the data set is
big, the general advice is to go the Cassandra/Hadoop/HBase route, or a
similar NoSQL route, which is a bit of a pain to deal with.

Where I think this tool fits nicely is when you don't want to (or can't)
invest time in the big NoSQL cannons, but your data and data processing are
NoSQL-like in nature and big enough to be painful with existing solutions.

It would be really cool to have support for the following features [please
keep in mind that I am a finance person, not a developer ;-)]:

1. Ability to query data in batch [in addition to live] mode. The mode should
be transparently manageable by application code. Example: my dataset is too
big, I am fine with delays, so I run data processing [aggregation, etc.] as a
scheduled task during a nightly maintenance window.

2. Ability to define an aggregation schema that can be saved in the database.
This is very useful for the following use case: I know my typical aggregation
pattern, so I define it in the database schema, then schedule a batch
processing task which aggregates data according to that schema and saves the
results into the database. When I need to query the data later on, I can use
the pre-computed results instead of running a live query each time. This
feature is very important from an ease-of-use perspective; the ease of use
comes from the database handling this instead of the application.

3. Transparent, easy way to manage availability - master-slave | master-master
and automatic failover.

4. Sharding data automatically and/or easily across available nodes.

5. Some way to ensure that data is never lost.

I really hope that this project will get rolling and expand.

P.S.: shoot me a note if you need me to elaborate...

~~~
dlsspy
This is great feedback. You seem to get what I'm going for.

Thoughts on your specific items:

1. I could probably prioritize the query/doc processing and get most of this
out of the way, or something like what I've been thinking about for #4.

2. I've thought about this one for sure. It's actually possible to do
externally already, just not very magically. I'll learn more when I get more
internal people pushing it.

3. I've been tempted to add replication -- not because I need it, but because
it's just really easy. Master-slave is completely trivial. Master-master
isn't hard, but requires tracking a tiny bit of state that I don't have an
easy way to handle yet. It'd be worth it just for fun.

4. I have a lot of infrastructure for this. To be efficient, I need something
like _all_docs that doesn't include the values and/or something like get that
evaluates jsonpointer. Then you could pretty well round-robin your writes and
have a front-end that does this last-step work: harvest the range from all
nodes concurrently while collating the keys (there's a rough sketch of that
below). Once you find a boundary from every node, you have a fully defined
chunk and can start doing reductions on it. A slightly harder, but more
efficient, integration is to have two-phase reduction and let the leaves do a
bunch of the work while the central thing just does collation. You wouldn't
be able to stream results in that scenario, though.

5. Is this as simple as disabling DELETE and PUT (where a document doesn't
exist)?
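
Here's a rough, hypothetical sketch (in Go, since that's what seriesly is
written in) of the front-end collation idea from item 4: fan the range query
out to every node concurrently, gather just the keys, and collate them so
that fully defined chunks can be reduced. The fetchKeys function, node URLs,
and endpoint shape are placeholders, not seriesly's actual API.

    // Hypothetical scatter-gather front-end for item 4 above.
    package main

    import (
        "fmt"
        "sort"
        "sync"
    )

    // fetchKeys stands in for a values-free _all_docs-style range query
    // against one node; the real endpoint is an assumption here.
    func fetchKeys(node, from, to string) []string {
        return nil // placeholder: issue the HTTP range query, return the keys
    }

    // harvest queries every node concurrently and collates the keys into one
    // sorted stream. Once every node has reported past a boundary, the chunk
    // before that boundary is fully defined and can be reduced.
    func harvest(nodes []string, from, to string) []string {
        var (
            mu  sync.Mutex
            all []string
            wg  sync.WaitGroup
        )
        for _, n := range nodes {
            wg.Add(1)
            go func(node string) {
                defer wg.Done()
                keys := fetchKeys(node, from, to)
                mu.Lock()
                all = append(all, keys...)
                mu.Unlock()
            }(n)
        }
        wg.Wait()
        sort.Strings(all)
        return all
    }

    func main() {
        nodes := []string{"http://node1:3133", "http://node2:3133"}
        keys := harvest(nodes, "2012-09-01", "2012-09-11")
        fmt.Println(len(keys), "keys collated")
    }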

~~~
daemon13
Hi Dustin, I've sent you an e-mail to @spy.net to continue the discussion. Is
this your working e-mail?

------
zaphar
Nice. I did a time-series db as well and just recently put it up on Google
Code. I started out trying to wrap rrdtool and got sufficiently frustrated
with its API that I wrote something that wraps sqlite3 instead. Works great
and uses dygraphs to graph the feed.

<http://code.google.com/p/gomonitor/>

------
steve8918
As a side question, what exactly is the benefit of a time-series database vs.
a regular relational database? Is it just a special case of a relational
database, where rows are stored based on time so that time-based queries are
more efficient? Or is there some other fundamental difference that makes it
more useful?

~~~
est
I understood the need for a time-series server when I was developing a
real-time DAU metric server.

There's a cool hack:

<http://blog.getspool.com/2011/11/29/fast-easy-realtime-metrics-using-redis-bitmaps/>

But not as cool as a time-series db.

------
terjeto
Thanks for sharing. I've been working with the same concept for the last few
days. Cube (a MongoDB/Node.js REST server), <http://square.github.com/cube/>,
really gave me a kickstart on simple data capture and querying. (I use
Highcharts for the visualization.)

------
steve19
That is nifty. Does anyone know of any high-performance, time-series-oriented
databases (closed source or not) that support flexible/advanced querying
(like being able to find patterns)?

~~~
benbjohnson
TempoDB (<http://tempo-db.com/>) is a TechStars Cloud company that is doing a
hosted time-series database. Really smart guys.

I'm building an open source behavioral database (time series over objects)
called Sky (<https://github.com/skydb/sky>). It comes with an LLVM-backed
language called Qip that is a mix between procedural & declarative and
provides easy integration with C libs, so you could plug in a machine
learning library (relatively) easily. It's fast, too. It'll crunch through
tens of millions of events per second on a single core.

I'm releasing the initial v0.1.0 at the end of the month. Shoot me a message
on Twitter (@benbjohnson) if you want some more information.

~~~
bgilroy26
Thank you for sharing your code! This is really cool stuff.

How important of a goal was it to follow/diverge from pp/sql's lead in
designing Qip?

~~~
benbjohnson
I'm not sure that I know what pp/sql is. As far as diverging from SQL goes,
the data I want to analyze is really separate paths of actions performed by
distinct objects (e.g. users). It's not relational or tabular data, so I ran
into issues trying to use a language like SQL to query it.

~~~
bgilroy26
It was pl/sql, I'm sorry. I got autocorrected. PL/SQL is a procedural language
and a superset of SQL. It's what stored procedures etc. tend to be written in.

Thank you for your response!

------
xntrk
I love the name.

~~~
chocolateboy
It's a shame this project chose to duplicate the name of an existing
project/website:

<https://github.com/stefanw/seriesly>

<http://www.seriesly.com/>

~~~
dlsspy
There are two hard things in computer science. The next blog post I'm planning
is related to cache management.

~~~
fruchtose
> The next blog post I'm planning is related to cache management.

I think you mean cache invalidation. ;)

------
kanwisher
Looks like a good concept. I've wondered why there aren't better time-series
databases that are open source.

~~~
dlsspy
I think there are some good ones. I wrote this because I could get it running
faster than I could get the data I've got adapted to existing ones. That
doesn't mean they're bad as much as it means I don't understand the data I've
got. :)

The way I like to think about document-oriented databases is that you store
what you have when you have it, and worry about what it was later when you
need to get things back out of it.

e.g. the big bag of stuff I mentioned in the blog post contains a few things I
know I don't need, a few things I think I probably need, and a lot of stuff I
just don't want to think about (I might need it later, maybe after some
manipulation, etc...). Lob it all in.
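
A minimal sketch of that workflow, assuming an HTTP endpoint that accepts a
JSON document POSTed at a database path (the port, database name, and fields
here are all illustrative rather than a definitive description of the API):

    // Minimal "lob it all in" sketch: POST whatever JSON you have now and
    // decide which fields matter at query time. URL and fields are made up.
    package main

    import (
        "bytes"
        "log"
        "net/http"
    )

    func main() {
        doc := []byte(`{"temp": 21.4, "rssi": -67, "raw": "might need later"}`)
        resp, err := http.Post("http://localhost:3133/testdb",
            "application/json", bytes.NewReader(doc))
        if err != nil {
            log.Fatal(err)
        }
        resp.Body.Close()
    }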

The downside of a system like seriesly vs. a system like rrd (or any modern
equivalent) is the same as the downside of any nosql database vs. a sql
database. By planning up front, I can keep the size down and get more
performance by incrementally computing stuff from the beginning. In the
meantime, I'll just buy more disk. :)

I'm reasonably happy with the performance, though. There's a good number of
visitors on the page right now and this is what they're seeing:

    
    
        2012/09/11 12:28:50 Completed query processing in 82.54ms, 6,266 keys, 1,280 chunks
    

That means that for that query, it scanned through 6,266 keys in the on-disk
b-tree, grouped them into 1,280 separate result "rows" to be reduced and did
the necessary computation to emit all of them in under a tenth of a second
while lots of other queries were in flight. My "extreme" cases right now are
taking under 3 seconds on over half a million keys. I consider that acceptable
for two weeks of side-project.
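
For a sense of the shape of that work, here's a toy Go sketch of the
group-then-reduce step: timestamps bucketed into fixed-width chunks, with
each chunk's values reduced (averaged here). Seriesly's real reducers and
column handling are richer; the bucket width and field below are made up.

    // Toy illustration of grouping scanned keys into result "rows" and
    // reducing each group; not seriesly's actual code.
    package main

    import (
        "fmt"
        "time"
    )

    type doc struct {
        ts  time.Time
        val float64 // an assumed numeric field pulled out of each document
    }

    func main() {
        now := time.Now()
        docs := []doc{ // stand-ins for keys/values scanned from the b-tree
            {now, 20.5},
            {now.Add(10 * time.Second), 22.1},
            {now.Add(90 * time.Second), 19.8},
        }

        chunk := time.Minute // one result "row" per minute, for example
        sums := map[time.Time]float64{}
        counts := map[time.Time]int{}
        for _, d := range docs {
            bucket := d.ts.Truncate(chunk)
            sums[bucket] += d.val
            counts[bucket]++
        }
        for bucket, n := range counts {
            fmt.Printf("%s avg=%.2f\n", bucket.Format(time.RFC3339),
                sums[bucket]/float64(n))
        }
    }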

~~~
rz2k
What are some of the others? The last time I looked, MonetDB and LucidDB
seemed to be the most popular column-store open source projects, but they seem
to have been mostly subsumed by proprietary products.

~~~
dlsspy
Whisper (backend for graphite), cube, ganglia... possibly more. I used to
build things directly on top of rrdtool, but the schema definition can be a
pain when you've got a lot of dynamic data.

