Nanocubes: Fast Visualization of Large Spatiotemporal Datasets (nanocubes.net)
71 points by chuhnk on Jan 14, 2014 | 14 comments



Kind of related, I recently made a similar style of visualisation for 99designs.com - http://a.tiles.mapbox.com/v3/dhotson.d2013/page.html

Once I got the data into PostGIS and aggregated it into hexes (slightly harder than it sounds), it wasn't that big a deal to visualise using TileMill and MapBox.
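
For anyone curious about the "slightly harder than it sounds" part: the fiddly core of hex binning is snapping each point to the center of its containing hexagon, usually done with axial/cube-coordinate rounding. A minimal Python sketch of just that step (hex size and points are made up; this shows the general technique, not the actual PostGIS pipeline):

    import math
    from collections import Counter

    HEX_SIZE = 0.5  # hex "radius" in projected units; an illustrative value

    def point_to_hex(x, y, size=HEX_SIZE):
        """Snap a projected (x, y) point to the center of its pointy-top hexagon."""
        # Pixel -> fractional axial coordinates.
        q = (math.sqrt(3) / 3 * x - y / 3) / size
        r = (2 * y / 3) / size
        # Round to the nearest hex using cube-coordinate rounding.
        cx, cz = q, r
        cy = -cx - cz
        rx, ry, rz = round(cx), round(cy), round(cz)
        dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
        if dx > dy and dx > dz:
            rx = -ry - rz
        elif dy > dz:
            ry = -rx - rz
        else:
            rz = -rx - ry
        # Axial -> pixel: the hex center in projected coordinates.
        return size * math.sqrt(3) * (rx + rz / 2), size * 1.5 * rz

    # Aggregate: count points per hex center, then style the counts in TileMill.
    counts = Counter(point_to_hex(x, y) for x, y in [(0.1, 0.2), (0.15, 0.22), (3.0, 4.0)])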

I haven't read the nanocubes paper yet, but it sounds much much faster than my approach using PostGIS. It took me many hours to import and process tens of millions of points in my visualisation. It'd be great to be able to turn something like this around in <10 mins.

Would this nanocubes approach work with hexes? (they look cooler IMO)

Off-topic: after all this, all I can think about is making a Settlers of Catan MMO. ;-)


Heh, I had a conversation a few weeks ago with Jeff Heer, co-author of a very similar paper, in which we talked about how to make this stuff work with hexes. I think it can be done, but the hierarchical part is a pain, and I'll need more than an HN comment to explain :) Look me up offline if you're interested and we can chat.


Or, as we in the graphics field call it: mipmapping, geoclipmapping, megatexturing, virtual texturing, or slippy maps.

There's a related capability called "interpolation", which could be linear, bicubic, etc. Looks nice too; try it.


(Hey, I'm a big fan of your WebGL work!)

This is really not about interpolation or mipmapping; it's about creating a data structure that lets you quickly answer the queries with which you build visualizations. For example, you might want to see a heatmap of all tweets that happened in 2013, or only a particular week of them, or a month of tweets just from Instagram. You might also want to see any of these over the entire world, or over a single metro region. What nanocubes enables is answering these queries in time essentially proportional to the size of the output (we take a polylog hit, but that's kind of to be expected). We don't even touch all of the input data, which, if you ask me, is kind of nice.
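
To make the query shape concrete, here's a hypothetical client-side sketch. The /count endpoint, port, and parameter names are assumptions for illustration, not the real nanocubes HTTP API; the request/response pattern is the point.

    import json
    from urllib.request import urlopen

    BASE = "http://localhost:29512"  # assumed host/port, not the real server config

    def count_query(tile, t0, t1, device=None):
        """Ask the server for an aggregate count over a tile, a time range,
        and optionally a category; it answers by walking its in-memory
        structure rather than scanning the raw points."""
        url = f"{BASE}/count?tile={tile}&time={t0}:{t1}"
        if device:
            url += f"&device={device}"
        with urlopen(url) as resp:
            return json.load(resp)

    # e.g. one week of iPhone tweets over a single metro-area tile:
    # count_query("9/82/197", 1370000, 1370168, device="iPhone")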

I wish I had bigger public datasets to show you, but the experience stays this fast with datasets in the billions of points.

About your interpolation point: yes, we could (and sometimes do) use interpolation and smoothing. We chose not to do it here so that the visualization looks the same in the WebGL-less mode; when you run it on an iPhone or iPad, there isn't a big visual difference.


> it's about creating a data structure that allows you to quickly answer queries with which to build visualizations

Hm, I see, so this can output the virtual miplevels quickly? Also, would it work with a "full" dataset (i.e. no empty cells)?

> About your interpolation point: yes, we could (and do sometimes) use interpolation and smoothing.

Also consider miplevel interpolation; it makes the transitions between levels much easier on the eye (it doesn't work well with nearest sampling).
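
For what it's worth, a minimal sketch of that idea, assuming two precomputed levels as numpy arrays (coarse at half the resolution of fine) and a fractional zoom t in [0, 1]; this is generic trilinear-style filtering, not nanocubes code:

    import numpy as np

    def blend_miplevels(fine: np.ndarray, coarse: np.ndarray, t: float) -> np.ndarray:
        """Linearly interpolate between zoom levels (trilinear-style filtering).
        `coarse` is upsampled by pixel doubling for simplicity; bilinear
        upsampling would look smoother still."""
        upsampled = coarse.repeat(2, axis=0).repeat(2, axis=1)
        return (1.0 - t) * fine + t * upsampled

    fine = np.random.rand(8, 8)
    coarse = fine.reshape(4, 2, 4, 2).mean(axis=(1, 3))  # box-filtered parent level
    half_zoom = blend_miplevels(fine, coarse, 0.5)       # halfway between levels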


> Hm, I see, so this can output the virtual miplevels quickly?

Yes: when you see the tiles on (say) http://www.nanocubes.net/view.html#twitter, what you're getting is not a precomputed tile: the server is actually visiting the data structure every time (and I hope I can convince you that at that speed, it is not touching 210 million points :). The reason you can't precompute all choices is simply that there are too many of them and you'd run out of bits in the universe (the paper, http://www.nanocubes.net/assets/pdf/nanocubes_paper.pdf, has details).
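
A quick back-of-the-envelope count makes the point; the per-dimension sizes below are illustrative assumptions, but the order of magnitude is what matters:

    # Distinct (tile, time interval, category subset) query results,
    # with made-up but plausible sizes for each dimension.
    zoom_levels = 25
    tiles = sum(4 ** z for z in range(zoom_levels + 1))  # every tile at every zoom
    time_bins = 2 ** 16                                  # e.g. hourly bins over ~7 years
    time_intervals = time_bins * (time_bins + 1) // 2    # every [start, end] range
    category_subsets = 2 ** 5                            # subsets of 5 device types

    total = tiles * time_intervals * category_subsets
    print(f"{total:.1e} possible query results")         # ~1.0e+26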

> Also, would it work with a "full" dataset (i.e. no empty cells)?

Not exactly. We get away with an in-memory data cube precisely because in many practical cases the data is sparse over the address space. With dense data you'll take a much more significant memory hit (but it's one you'd have to take with any aggregation scheme; the paper, again, has details).

> Also consider miplevel interpolation, makes the transitions between levels much easier on the eye (doesn't work well with nearest sampling).

That's fair, although again we're hoping to keep the WebGL version feature-compatible with the leaflet.js version (which is canvas-only and has no zoom transition at all); the smooth version looks nicer, as you can see in the following video (a bit on the PR-heavy side, sorry): https://www.youtube.com/watch?v=8P9QA6TJwys#t=69


I think what's novel here is the data structure, not the type of visualization.

I also think the ability to take slices of the data (just Saturdays, just evenings) is pretty slick from a UI standpoint.


Author here, happy to answer questions.


> and in some cases it uses sufficiently little memory that you can run a nanocube in a modern-day laptop.

What's the limitation? (I admit I haven't attempted to read the paper yet.)

EDIT: To reword it, this sentence gives the impression that there could be more factors than just plain quantity, and I was wondering whether that is the case and what those factors could be.


Memory usage gets worse as the number of dimensions increases (because there are more ways in which to build summaries), and also as the number of unique keys increases (because there are more distinct things to store). The main advantage nanocubes has is that it's a sparse scheme: previous work on fast data cubes for visualization (imMens, for example) uses a dense storage scheme, so memory usage goes up in proportion to the size of the address space instead.
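
A toy contrast of the two schemes, in case it helps (this is just the dense-vs-sparse bookkeeping, not the nanocube structure itself; sizes and dimension names are made up):

    from collections import defaultdict

    N_CELLS = 2 ** 16                    # spatial address space (illustrative)
    DEVICES = ["iPhone", "Android", "Web"]

    # Dense scheme (imMens-style): one counter per address, occupied or not,
    # so memory is proportional to the address space.
    dense = [[0] * len(DEVICES) for _ in range(N_CELLS)]

    # Sparse scheme: counters exist only for (cell, device) keys actually
    # seen in the data, so memory is proportional to the distinct keys.
    sparse = defaultdict(int)

    for cell, device in [(42, "iPhone"), (42, "iPhone"), (99, "Web")]:
        dense[cell][DEVICES.index(device)] += 1
        sparse[(cell, device)] += 1

    print(len(sparse))                   # 2 keys stored vs 3 * 2**16 dense counters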


A high-speed multidimensional query capability seems like it would be useful for visualizing and analyzing server logs to look for attacks and other trends.


You're right, it is :)

The version with online demos uses latitude/longitude for the spatial dimension, but a new version we're working on (under the master branch on GitHub) allows arbitrary x,y addresses. With that, we've encoded IP addresses as locations on space-filling curves to play around with source IP/destination IP datasets. It's particularly nice because the hierarchical nature of the spatial addresses ends up mapping to larger and larger subnets of the IPv4 space.
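
To sketch the encoding (a Z-order/Morton interleave here; treat it as one concrete possibility, not necessarily the exact curve that branch uses):

    def ip_to_xy(ip: str) -> tuple:
        """Interleave an IPv4 address's 32 bits: even bits -> x, odd bits -> y."""
        value = 0
        for octet in ip.split("."):
            value = (value << 8) | int(octet)
        x = y = 0
        for bit in range(16):
            x |= ((value >> (2 * bit)) & 1) << bit
            y |= ((value >> (2 * bit + 1)) & 1) << bit
        return x, y

    # Addresses in the same /16 share their high-order bits, so they land in
    # the same quadtree square; zooming out aggregates over larger subnets.
    print(ip_to_xy("10.0.0.1"), ip_to_xy("10.0.255.255"))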


Would this make sense as a type of index in PostGIS?


It's open source (http://github.com/laurolins/nanocube), so one could certainly try. That said, it is currently an in-memory store with no disk backing, so it would take a significant amount of work to get it going.

At the same time, I certainly believe that some sort of hierarchical, low-footprint data cube would be a great addition to PostGIS.



