

Timeseries data storage in MongoDB - seigenblues
http://www.slideshare.net/sky_jackson/time-series-data-storage-in-mongodb

======
moe
Aw, this was physically painful to skim.

What you really want for time-series data is a column db such as Cassandra (or
Vertica etc.), perhaps HBase, _perhaps_ an RDBMS, or _perhaps_ a plain old
logfile.

What you most definitely don't want is Microsoft Access or MongoDB. Thinking
about it, MS Access might still work to a degree.

~~~
rbranson
As long as all the data for all metrics fits in RAM, it will work great in
MongoDB. Of course, as soon as it doesn't, you're totally hosed.

To expand on the parent, a column store like HBase or Cassandra is perfect:
each row can represent a timeslice, and each column within it can represent a
single event or record. As a row gets evicted from its initial storage in
memory, it is written to disk contiguously, in sorted order, so batch reads of
this data are sequential.
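The timeslice-row layout described here can be sketched in a few lines of plain Python (class and field names are my own invention, stdlib only; a real column store does the equivalent on disk with SSTables/HFiles):

```python
import bisect
from collections import defaultdict

class WideRowStore:
    """Toy wide-row model: one row per timeslice, events sorted within it."""

    def __init__(self, slice_seconds=3600):
        self.slice_seconds = slice_seconds
        self.rows = defaultdict(list)   # row key -> sorted [(ts, value)]

    def insert(self, ts, value):
        # The row key is the timeslice bucket (e.g. the containing hour).
        row_key = ts - (ts % self.slice_seconds)
        bisect.insort(self.rows[row_key], (ts, value))

    def scan(self, start, stop):
        """Yield events in [start, stop) in timestamp order.

        Because each row is already sorted, a range read walks whole
        buckets in order -- the sequential-batch-read property above.
        """
        first = start - (start % self.slice_seconds)
        for row_key in range(first, stop, self.slice_seconds):
            for ts, value in self.rows.get(row_key, []):
                if start <= ts < stop:
                    yield ts, value

store = WideRowStore()
for t in (30, 10, 3700, 20):
    store.insert(t, "event@%d" % t)
print(list(store.scan(0, 3600)))   # first hour only, in timestamp order
```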

It is possible to use MongoDB to store a timeslice as a document, but it is
not designed to scale out to store very large numbers of columns within a
single document.

~~~
seigenblues
happily, this is not a synchronous application with tens of thousands of
concurrent users, so "totally hosed" for us may have a very different
definition.

What do you mean by "very large numbers of columns" ? I've seen some mongo
users with very rich, i.e. large, document models.

~~~
rbranson
1,000s, 10,000s, 100,000s, millions. MongoDB documents are designed for
serialization of rich structures, not for storing unbounded ranges of data
values.

Even without tens of thousands of concurrent users, you'll eventually run into
a deeply critical performance wall when MongoDB starts reading from disk. It's
really best to think of it as an in-memory database.

------
ericHosick
For a temporal or time-based key-value store (I think this is kinda what the
presentation shows) I used a collection that was something like:

Temporal Collection { _id: "X1", data_temporal: [
    { time_start: SomeDate,  time_stop: SomeDate,  _id: "ID1" },
    { time_start: SomeDate2, time_stop: SomeDate2, _id: "ID2" } ] }

Data Collection { _id: "ID1", parent: "X1", data: { field1: "some info", field2: 34 } },
                { _id: "ID2", parent: "X1", data: { field1: "Some info new", field2: 34 } }

What is cool about this is that if you have access to a data document like
ID1, you can easily find out when it was added and how it changed.

If you have access to the temporal ID, X1, then at any time you can see what
the data looked like.

If you need to relate data, the "foreign key" used is the data_temporal ID. In
this way, it is possible to ask what your key value store data looked like at
any time.
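A toy, in-memory sketch of the lookup this schema enables (the `as_of` helper name and the concrete timestamps are my own; in MongoDB these would be two real collections):

```python
# Plain-dict stand-ins for the two collections described above.
temporal = {
    "X1": [
        {"_id": "ID1", "time_start": 100, "time_stop": 200},
        {"_id": "ID2", "time_start": 200, "time_stop": None},  # None = still current
    ]
}
data = {
    "ID1": {"parent": "X1", "data": {"field1": "some info", "field2": 34}},
    "ID2": {"parent": "X1", "data": {"field1": "Some info new", "field2": 34}},
}

def as_of(temporal_id, t):
    """Return the data that was live at time t for this temporal ID, or None."""
    for entry in temporal.get(temporal_id, []):
        start, stop = entry["time_start"], entry["time_stop"]
        if start <= t and (stop is None or t < stop):
            return data[entry["_id"]]["data"]
    return None

print(as_of("X1", 150)["field1"])   # -> some info
print(as_of("X1", 250)["field1"])   # -> Some info new
```

The same two-table shape (a validity-interval table pointing at a versioned data table) is what makes this work in a relational database too.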

But, this could be off from the article.

This also works quite well in a relational database.

------
cstuder
Being in a similar industry, I share the sentiments of the last couple of
slides: dataloggers are expensive and horrible pieces of hardware, proprietary
solutions with no regard for real-life scenarios (limited connectivity, power
failures, connection failures, weird and inflexible data formats...).

I would love to have a look at their Arduino-based solution.

------
iskander
Are there advantages over storing data in HDF? I've been working with a few
hundred gigabytes of financial data this summer and I'm finding that python's
data-oriented libraries (h5py, numpy, scipy, matplotlib, scikits.learn) cover
my needs.

~~~
seigenblues
depends on your usage. The rest of our toolchain is all Python: numpy, scipy,
and especially matplotlib.

This might be poorly titled: it's not so much about the storage as about the
aggregation of disparate sensor data into coherent, continuous data streams.

~~~
iskander
I'm totally ignorant of mongodb: what does it do for you (in the way of data
aggregation) that's not easy in numpy?

~~~
hogu
if your data can fit into arrays, then there's no advantage in terms of the
types of aggregations.

however mongo allows you to store complex structures, think nested
dictionaries/lists, and query on those nested structures, even allowing you to
reach inside of nested structures to do the querying.

I guess you could do nested structured arrays in numpy, though I've never done
that before.
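As a stdlib-only illustration of what "reaching inside nested structures" means here (in pymongo this is just dot notation, e.g. `find({"sensor.site": "roof"})`; the helper names and sample documents below are my own):

```python
def get_path(doc, dotted):
    """Walk a nested dict with a MongoDB-style dotted path like 'sensor.site'."""
    cur = doc
    for key in dotted.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur

def find(docs, query):
    """Match documents on dotted-path equality, like a tiny find()."""
    return [d for d in docs if all(get_path(d, k) == v for k, v in query.items())]

readings = [
    {"sensor": {"site": "roof", "kind": "temp"}, "value": 21.5},
    {"sensor": {"site": "lab",  "kind": "temp"}, "value": 19.0},
]
# The equivalent MongoDB query: db.readings.find({"sensor.site": "roof"})
print(find(readings, {"sensor.site": "roof"}))
```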

~~~
iskander
I use h5py's datasets (which are organized hierarchically and stored in
compressed chunks) to do basic filtering and then load a fraction of my data
into memory as numpy arrays.
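The same filter-first, load-a-fraction pattern can be sketched with numpy's memory mapping instead of h5py (file layout, column meaning, and names below are my own; h5py's chunked datasets support the same slice-then-materialize style):

```python
import os
import tempfile
import numpy as np

# Write a 1M-row array of (timestamp, value) pairs to disk.
path = os.path.join(tempfile.mkdtemp(), "ticks.npy")
rng = np.random.default_rng(0)
np.save(path, np.column_stack([np.arange(1_000_000.0),   # timestamp column
                               rng.random(1_000_000)]))  # value column

ticks = np.load(path, mmap_mode="r")   # memory-mapped; no bulk read yet
mask = ticks[:, 0] < 1000              # boolean mask over the mapped file
subset = np.array(ticks[mask])         # materialize only the matching rows
print(subset.shape)                    # (1000, 2)
```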

------
luigi
I saw this presentation live at MongoDC and it was awesome.

------
amalag
What about Hadoop? You can also use Hadoop as a backend for Hadoop-style
filesystems. What about using Hive? Does it also require a fixed schema?

------
snowwindwaves
he needed to find this board for his datalogger
[http://www.amazon.com/Webcontrol-Universal-Temperature-Humid...](http://www.amazon.com/Webcontrol-Universal-Temperature-Humidity-Controller/dp/B001H4JXLU/ref=wl_it_dp_o?ie=UTF8&coliid=I2CE1S2ZFOUCP6&colid=2RTLJLUX5CSOZ)

------
yannis
Interesting presentation, would do better with some more details in a blog or
pdf.

~~~
seigenblues
thanks, i'll try and write them up. any particular areas you'd like to see
expanded?

~~~
yannis
I am particularly interested in the software implementation. I am a mechanical
engineer involved with high-rise construction, looking to disrupt Building
Management Systems.

~~~
theatrus2
Taking over a BMS has huge problems (I know, I am in that space). On one hand,
you're talking about certified hardware for a variety of needs (hardware,
especially hardware certified to the various ASHRAE/ANSI/etc. specs, is
expensive for a small company, no two ways about it). Second is the "no one
got fired for buying IBM" mentality: if you hook up with Siemens, JCI and the
like, you have paid that huge maintenance contract to make the vendor fix
their problems, and you won't end up with an insolvent small vendor's VAV
controllers that don't work.

~~~
yannis
Think positive: that is why you and I are on HN while our colleagues live in a
30-year-old specs space.

