
Time Series, the new shiny? - iamd3vil
http://basho.com/posts/technical/time-series-the-new-shiny/
======
eggy
I don't know Riak, other than that it's a distributed NoSQL key-value data
store.

Time series have been prevalent in fintech, quantitative finance, and other
disciplines for decades. I read a book in the early 1990s on music as time
series data, financial tickers, and so on.

How is Riak different from, or better suited to use than, kdb+ with q [1], J
with JDB (free), Jd (a commercial J database like kdb+/q) [2], or the new Kerf
lang/db being developed by Kevin Lawler [3]?

Kevin also wrote kona, an open-source version of the K programming
language [4].

kdb+ is very fast at time series analysis on large datasets and has many
years of proven value in the financial industry.

[1] [https://kx.com/](https://kx.com/)

[2] [http://www.jsoftware.com/jdhelp/overview.html](http://www.jsoftware.com/jdhelp/overview.html)

[3] [https://github.com/kevinlawler/kerf](https://github.com/kevinlawler/kerf)

[4] [https://github.com/kevinlawler/kona](https://github.com/kevinlawler/kona)

~~~
athenot
Minor quibble: the article is about Riak TS, their time-series-enhanced
version of riak_core.

riak_core's main strength is that it does key-value storage in a
distributed, resilient manner, spreading multiple copies of each value (at
least 3) across a cluster of servers. Kill one server, no problem. Need more
capacity? Add servers and it will rebalance itself.

The TS part is just an optimization built on top of that, to make values that
are near each other in time also near each other in storage, for faster
range-based retrieval.
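
To make that concrete, here's a minimal Python sketch of the general idea (my
own illustration, not Riak's actual key layout; the 15-minute quantum is an
arbitrary choice): quantize timestamps into a bucket prefix so rows close in
time sort next to each other in an ordered store.

    import datetime

    QUANTUM = 900  # seconds; a 15-minute bucket (illustrative choice)

    def ts_key(series, ts):
        """Build a sort-friendly key: series id, then the quantized time
        bucket, then the exact timestamp. Rows in the same 15-minute
        window share a prefix, so an ordered store keeps them adjacent."""
        bucket = int(ts) // QUANTUM * QUANTUM
        return (series, bucket, int(ts))

    # Readings 10 seconds apart produce adjacent keys:
    now = datetime.datetime(2016, 5, 5, 9, 0).timestamp()
    print(ts_key("sensor-42", now))
    print(ts_key("sensor-42", now + 10))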

~~~
siculars
Yes, exactly. (Author)

~~~
eggy
TS-enhanced version of riak_core, got it.

Most TS, or tick, DBs are columnar, memory-mapped, fast, and light systems.

Are there any benchmarks similar to STAC-M3, which runs a year's worth of
NYSE data on different hardware configurations to gauge kdb+'s effectiveness
[1]? It's a great way to gauge performance and TCO.

Does it do both in-memory (streaming data) and disk-based (historical)
storage for big-dataset analytics in real time?

I'd be interested to see numbers there.

A lot of people think kdb+ is only for finance. There is a conference coming
up in May with talks on q (the language for the kdb+ database) covering
natural language processing and machine learning, to name a few. Another is
about using it at a power plant to route power most efficiently based on
real-time data [2].

I only got into kdb+ and q with the free, non-commercial 32-bit version. I
usually use J and sometimes APL, which has had MapReduce-style operations
since at least the '80s. Check out this post from 2009 [3]. I guess the 'new
shiny' bit threw me in your chosen title.

[1] [https://stacresearch.com/news/2014/02/13/stac-reports-intel-ivy-bridge-ex-stac-m3-and-kdb](https://stacresearch.com/news/2014/02/13/stac-reports-intel-ivy-bridge-ex-stac-m3-and-kdb)

[2] [https://kxcon2016.com/agenda/](https://kxcon2016.com/agenda/)

[3] [http://blog.data-miners.com/2009/04/mapreduce-hadoop-everything-old-is-new.html](http://blog.data-miners.com/2009/04/mapreduce-hadoop-everything-old-is-new.html)

~~~
srpeck
You might find these benchmarks interesting:
[http://kparc.com/q4/readme.txt](http://kparc.com/q4/readme.txt)

~~~
eggy
I was inquiring about benchmarks for RiakTS, but your link was perfect. I am a
J/APL dabbler and have quite recently started learning kdb+/q (I prefer k).

As much as I step away from these languages, I always find my way back to
them in strange ways. I was studying music, and there is a great J article in
Vector magazine from August 2006 [1] that walks through scales and other
musical concepts in J.

A Forth-based music program called Sporth [2] has a kona ugen in it, so you
can generate scales or other musical material in kona and then use it in the
stack-based Sporth audio language.

My interests in kdb+/q, k, J, and APL are in applying them to mathematical
investigations of music and visuals, doing data analysis, and just code
golfing or toying around. They're so much fun!

I need more time with large streaming datasets (time series data) than with
large disk-based datasets to really test latencies. I am building a box much
better suited for it than my current machine. The goal is to stay in RAM as
much as possible.

[1]
[http://archive.vector.org.uk/art10010610](http://archive.vector.org.uk/art10010610)

[2]
[https://github.com/PaulBatchelor/Sporth](https://github.com/PaulBatchelor/Sporth)

~~~
srpeck
You should definitely check out JohnEarnest/RodgerTheGreat's iKe, built on his
open-source k interpreter in JS. Fun examples:
[http://johnearnest.github.io/ok/ike/ike.html?gist=bbab46d6132bf961976524393a5517e5](http://johnearnest.github.io/ok/ike/ike.html?gist=bbab46d6132bf961976524393a5517e5)
and
[http://johnearnest.github.io/ok/ike/ike.html?gist=b741444d04de8efc937f146b2b77904e](http://johnearnest.github.io/ok/ike/ike.html?gist=b741444d04de8efc937f146b2b77904e)

[https://github.com/JohnEarnest/ok/tree/gh-pages/ike](https://github.com/JohnEarnest/ok/tree/gh-pages/ike)

[https://github.com/JohnEarnest/ok](https://github.com/JohnEarnest/ok)

And related APL/J/K subreddit:
[https://www.reddit.com/r/apljk/](https://www.reddit.com/r/apljk/)

~~~
eggy
I had stumbled upon John's work before. I am currently dabbling with a
stack-based audio language called Sporth [1], and messing with the idea of
somehow mashing it up with John's iKe project.

See, vector/array languages aren't just for FinTech or Time Series!

[1]
[https://github.com/PaulBatchelor/Sporth](https://github.com/PaulBatchelor/Sporth)

------
the_alchemist
> Riak uses the SHA hash as its distribution mechanism and divides the output
> range of the SHA hash evenly amongst participating nodes in the cluster.

Wait, Riak uses SHA as its distribution hash? Why use a cryptographic hash
for distribution rather than something like Murmur3, if you're talking about
high performance [0]?

[0] [http://blog.reverberate.org/2012/01/state-of-hash-functions-2012.html](http://blog.reverberate.org/2012/01/state-of-hash-functions-2012.html)

~~~
baq
I'd bet it doesn't matter.

~~~
siculars
Author here. We hash the key, so the number of bytes hashed is negligible.
So I'd agree; I don't think it matters.
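
For a rough Python sketch of the scheme the article describes (my own
illustration, not Riak's actual code; the partition count is an arbitrary
choice), hash the key and map the digest onto an evenly divided ring:

    import hashlib

    NUM_PARTITIONS = 64  # illustrative; the real ring size is configurable

    def partition_for(key: bytes) -> int:
        """Hash the key and take its position in SHA-1's 160-bit output
        range, divided evenly among the partitions."""
        h = int.from_bytes(hashlib.sha1(key).digest(), "big")
        return h * NUM_PARTITIONS // 2**160

    print(partition_for(b"bucket/sensor-42/1462438800"))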

~~~
zcam
Isn't it more about the number of cycles it takes to generate the hash than
its size, since you're likely to do that very often in a DB context?

~~~
siculars
Probably. I'd venture to say we chose SHA for its uniform distribution over
its speed (or lack thereof).

~~~
zcam
I guess it's worth investigating; Murmur has been in use by some big names
(Cassandra, Elasticsearch, Hadoop, etc.) for a while in similar contexts.

------
fit2rule
I've always found it quite curious that human-computer interfaces have always
focused on the noun/verb proposition of describing data, and not the
time/place. Time is the only true constant in the universe, and yet computers
seem to track and control it as an afterthought.

Imagine if instead of having files/folders to (teach, confuse) Grandma, we
simply had a time-based system of references. If Time were a principal unit
of information that a user was required to understand as an abstract concept,
I feel that it would result in far better user interfaces.

We can see this in the music-making world, where Time is the most significant
domain over which a Musician exerts control. A DAW-like interface for managing
events seems to me so intuitive for so many other, non-musical applications
that it's almost extraordinary no one has built an email system, or an
accounting system, or a graphical-design system oriented around this aspect.
(Of course, they are out there - but it seems that Time management makes the
dividing line between "professional" and "dilettante" users rather thick...)

~~~
pbowyer
> Imagine if instead of having files/folders to (teach, confuse) Grandma, we
> simply had a time-based system of references. If Time were a principal unit
> of information that a user was required to understand as an abstract
> concept, I feel that it would result in far better user interfaces.

How would it be easier when I, let alone my Grandma, can't remember whether I
did something last week or the week before? It seems it would put a higher
cognitive load on the user.

~~~
x5n1
Yeah, I actually don't think the human brain accounts for time very well.
Even though a decade has passed, I don't feel as if time has moved very much
for me. And despite the fact that people age, recognizing your own age
doesn't seem to be an inbuilt thing; it just seems to be something that
happens to your body.

Thinking a bit more about it, I don't think the brain accounts for time in
long-term memory; it does a better job in short-term memory. That's why a
musician can use time, but we can't use it very well for stuff we stored a
week ago.

------
bbrazil
Are there performance numbers available?

We're on the lookout for suitable remote storage for prometheus.io, and would
want to know the hardware that'd be required to handle 1M samples/s and how
many bytes a sample takes up.

It doesn't support full float64, which we need, but we could work around that
by packing the bits into a 64-bit unsigned integer.
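
A minimal Python sketch of that workaround: reinterpret the float64's bit
pattern as a uint64 on write and reverse it on read (note the stored integers
won't sort numerically, so this is for storage only).

    import struct

    def f64_to_u64(x: float) -> int:
        """Reinterpret a float64's bits as an unsigned 64-bit integer."""
        return struct.unpack("<Q", struct.pack("<d", x))[0]

    def u64_to_f64(n: int) -> float:
        """Recover the original float64 from the stored bits."""
        return struct.unpack("<d", struct.pack("<Q", n))[0]

    v = 3.14159
    assert u64_to_f64(f64_to_u64(v)) == v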

~~~
svjethani
We have engaged a third party whose test results will be published. So far we
have done testing around specific customer use cases.

------
im_down_w_otp
Might be worth looking into dalmatiner.io (DalmatinerDB) as an alternative to
this. It's also built on riak_core, which manages cluster membership and
provides the top-level framework for routing and rebalancing.

I waited a long time for Riak TS to come out. I tried KairosDB and Cyanite,
but the operational overhead of Cassandra wasn't something I wanted to buy
into for such a narrow use case (an infrastructure metrics store), and then
suddenly, out of nowhere, DalmatinerDB was released. The code is clean, the
architecture is solid, and the ops story is simple.

I don't have any affiliation with the Dataloop folks; I am just a happy
end-user. We do currently use Riak KV for its CRDT support, though.

~~~
Licenser
DalmatinerDB has been around for a while now. I started working on it during
the EUC in 2014, and it was released as part of ProjectFiFo the same year; I
just really suck at marketing ;).

But I'm kind of curious: are you using Dalmatiner directly? And if so, what
is your use case?

~~~
im_down_w_otp
Our current deployment is Graphite (Carbon/Whisper) proper, due to a lack of
time to change it out, but I eval'd Dalmatiner and should be migrating to it
in the next six months (before the year is out). It made it through testing
like a champ.

The use case isn't particularly interesting, though: just multi-site machine
and application metrics aggregation. I needed something that could work with
Graphite tooling, would be highly available, and would be comparatively
trivial to operate and maintain.

------
rdtsc
I see SQL support; that is interesting. Isn't Riak the premier NoSQL
database? I guess it is a NoNoSQL DB now ;-)

The implementation of the SQL part is so neat. Great work, whoever did that.
It uses the yecc and leex that come with Erlang, and rebar even knows how to
compile those. Very cool!

[https://github.com/basho/riak_ql](https://github.com/basho/riak_ql)

~~~
gordonguthrie
[takes a bow]

~~~
rdtsc
Awesome work!

------
Confusion
Poses the question:

      So what’s the big deal? People have been recording
      temporally oriented data since we could chisel on tablets.

It never answers it, but instead explains how Riak handles large time series.
Certainly interesting, but I would like an answer to this question, as I
don't understand the big deal.

~~~
ttctciyf
Immediately following the part you quote (my emphasis):

      Well, as it turns out, thanks to the software-eating-the-world
      thing and the Internet of Things we happen to be amassing *vast
      quantities* of all sorts of data [...] The demand for systems
      that are capable of storing and retrieving temporal data on an
      *ever increasing scale* necessitates systems that are
      specifically designed for this purpose.

Strongly implying the big deal is the need to scale?

~~~
Confusion
That's an obstacle that needs to be overcome because you want to get
somewhere. A challenging obstacle that has spawned an entire industry, but
nevertheless not the goal. It is implied that we can get somewhere new and
exciting with these vastly larger time series. My question is: where?

------
flatwhiskers
In terms of getting data into RiakTS, would streaming something through
Kafka be an option, for instance?

~~~
mdigan
Hi, Basho employee here. Yes, Kafka is an option. Here's an example of using
Kafka with Riak TS and Spark Streaming:
[http://docs.basho.com/riak/ts/1.3.0/add-ons/spark-riak-connector/usage/streaming-example/](http://docs.basho.com/riak/ts/1.3.0/add-ons/spark-riak-connector/usage/streaming-example/)

------
epaulson
As someone who deals with sensor data: the tricky part is really not the
write rate, but dealing with messy data. There's a lot of parallelism in
sensor network streams, and for many domains you never look at the sensors
from one device against those of another device, so you can put them in
entirely different databases and it doesn't matter. (It's not true in every
case, of course, but if you're doing time series/streaming, ask yourself
whether it's true for you before picking a system.)

The real pain is handling data that arrives out of order or otherwise very
late, handling data that never arrives at all, or handling data that's
clearly wrong. Worse, you may have streams that are defined/calculated from
other streams by some algebra on series, e.g. series C is series A plus
series B, so handling new data on A means you need to recalculate/update the
view for C.
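
To make that restatement problem concrete, a toy Python sketch (my own
illustration, not tied to any particular system): a late point landing on
series A forces the derived series C to be recomputed at the affected
timestamps.

    # Series stored as {timestamp: value}; C is a derived view of A + B.
    a = {0: 1.0, 60: 2.0, 120: 3.0}
    b = {0: 10.0, 60: 10.0, 120: 10.0}
    c = {t: a[t] + b[t] for t in a}

    def ingest_a(t, value):
        """A late or out-of-order point on A restates C at that time."""
        a[t] = value
        if t in b:              # only recompute where both inputs exist
            c[t] = a[t] + b[t]

    ingest_a(60, 2.5)           # late correction arrives for series A
    print(c[60])                # 12.5 -- C restated for that timestamp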

Oh, and you'd like this all to be mostly declarative so you have some way to
migrate between systems if you need to switch for whatever reason.

Apache Beam/Google Dataflow gets a lot of this stuff right: it's not quite as
declarative as I'd like, but it gets the windowing flexibility right and
handles restatements at the data-model level.

~~~
siculars
> The real pain is handling data that arrives out of order or otherwise very
> late

Riak TS uses leveldb under the hood. leveldb is natively sorted, and in Riak
TS that includes the bucket, so the sort order is basically bucket/%PK, where
PK is your composite primary key as defined in your CREATE TABLE statement.
See Local Key [0].

[0]
[http://docs.basho.com/riak/ts/1.3.0/using/planning/](http://docs.basho.com/riak/ts/1.3.0/using/planning/)
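
For illustration, here's the sort of table definition that sets that up, held
in a Python string the way a client might send it (the table and column names
are my own, and the 15-minute quantum is an arbitrary choice); the second
group in the PRIMARY KEY is the local key that determines sort order.

    # Illustrative DDL in the style of the Riak TS docs; names are made up.
    # The first key group (with QUANTUM) picks the partition; the second,
    # the local key, sets the leveldb sort order within it.
    CREATE_TABLE = """
    CREATE TABLE SensorReadings (
        site    VARCHAR   NOT NULL,
        sensor  VARCHAR   NOT NULL,
        time    TIMESTAMP NOT NULL,
        reading DOUBLE,
        PRIMARY KEY (
            (site, sensor, QUANTUM(time, 15, 'm')),
            site, sensor, time
        )
    )
    """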

------
jsonninja
For the TS experts out there: any real-world experience with Influx?
([https://influxdata.com/](https://influxdata.com/))

~~~
thinkdevcode
Fantastic DB, but maybe not quite ready for production use. We are using
Influx for a small portion of our ingestion engine as well as for storing
server metrics. The updates/improvements have been pretty astounding over the
last year, but also hard to keep up with. I had to fork the Node.js library
just to update it from 0.9 to 0.12 [0] because there were a _LOT_ of breaking
changes. Pre-0.9 we ran into many issues with disk space and performance, but
they have all been resolved as of 0.9.

Another thing to note (after speaking with them on a few occasions) is that
they are only providing cluster support in their enterprise offering, which
won't be available until this summer. They do offer Relay, which is their
high-availability tool for the open-source version [1]. You really won't need
clustering unless you're doing an insane amount of writes: single-server
performance is excellent right now. We average 10k writes/sec with bursts up
to 5x that, and it doesn't break a sweat (on a cheap 2-CPU/7 GB RAM instance
with SSD block storage).

[0] [https://github.com/thinkdevcode/node-influx](https://github.com/thinkdevcode/node-influx)

[1]
[https://docs.influxdata.com/influxdb/v0.12/high_availability/relay/](https://docs.influxdata.com/influxdb/v0.12/high_availability/relay/)

~~~
pauldix
InfluxDB CTO here, thanks for the note! We've definitely made significant
improvements even over the last two months. And the breaking API changes will
be a thing of the past very soon:
[https://github.com/influxdata/influxdb/blob/master/CHANGELOG.md#v100-unreleased](https://github.com/influxdata/influxdb/blob/master/CHANGELOG.md#v100-unreleased)

0.13 drops Thursday and 1.0 is the next release :)

------
walrus01
Time series doesn't necessarily have to be about 'huge' data either, just a
much greater level of historical precision. Example:

An ISP sells a circuit with 95th-percentile billing to a customer.

If you poll SNMP data from a router interface at 60-second intervals and
store it in an RRA file, you will lose a great deal of precision over time
(because RRAs consolidate older data points). You'll have no ability to go
back and pull a query like "We want to see traffic stats for the DDoS this
customer took at 9am on February 26th of last year."

With time series statistics, you can also feed the data into tools such as
Grafana for visualization.

An implementation such as OpenTSDB that grabs the traffic stats for a
particular SNMP OID and stores them will allow you to keep all traffic data
forever and retrieve it as needed later on. The amount of data written per
60-second interval is minuscule; a server with a few hundred GB of SSD
storage will be sufficient to store all traffic stats for the relevant
interfaces on the core/agg routers of a fairly large ISP for several years.
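
As a quick illustration of the billing side mentioned above, a minimal Python
sketch of the usual 95th-percentile method over 60-second samples (the
traffic numbers are made up): sort a month of per-interval rates, discard the
top 5%, and bill on the highest remaining value.

    import random
    random.seed(1)

    # A month of per-minute rate samples (~43,200); mostly modest traffic
    # plus a brief DDoS spike that the 95th-percentile method discards.
    rates_mbps = [random.uniform(10, 100) for _ in range(43200)]
    rates_mbps[12000:12030] = [900.0] * 30   # 30-minute spike

    samples = sorted(rates_mbps)
    billable = samples[int(len(samples) * 0.95)]  # drop the top 5%
    print(f"billable rate: {billable:.1f} Mbps")  # spike is not billed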

------
rch
Love the simple install for development on a Mac. Thanks for that.

~~~
siculars
I can't speak for the engineers, but a lot of folks at Basho are constantly
building and tearing down single-instance or multi-instance Riak clusters on
Red Hat/Ubuntu virtual machines. I do that, and I also have different
versions of Riak sitting in their own folders on my Mac HD. Mac OS X support
+1.

------
gnufied
Looks really nice, although I am a bit sad to see that it requires a
structured schema. I have been on the lookout for a metric collection system
(like InfluxDB), and this would fit very well, except for the schema part.

~~~
siculars
You can kinda fudge it by making one of the columns a varchar. It will store
whatever you put in it, like stringified JSON, but won't compute over it
(arithmetic, filters, aggregations).

(author)
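
A tiny Python sketch of that fudge (the row layout here is hypothetical):
serialize the unstructured part to JSON and stuff it into a catch-all varchar
column.

    import json

    # Hypothetical row for a table whose last column is a catch-all
    # VARCHAR. You can filter on host/time, but not inside the blob.
    metric = {"cpu": 0.93, "tags": ["web", "us-east"]}
    row = ("host-17", 1462438800000, json.dumps(metric))
    print(row)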

------
siculars
Hello, I'm the author of the post. Thanks for all the interest! AMA!

