
Nanocubes: Fast Visualization of Large Spatiotemporal Datasets - chuhnk
http://nanocubes.net/
======
dhotson
Kind of related, I recently made a similar style of visualisation for
99designs.com -
[http://a.tiles.mapbox.com/v3/dhotson.d2013/page.html](http://a.tiles.mapbox.com/v3/dhotson.d2013/page.html)

Once I got the data into PostGIS and aggregated into hexes (slightly harder
than it sounds) it wasn't that big a deal to visualise using Tilemill and
MapBox.
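For what it's worth, the "slightly harder than it sounds" hex aggregation step mostly comes down to snapping each point to the nearest hex center. A minimal Python sketch (axial coordinates on a pointy-top grid; my own toy code, not the PostGIS/SQL 99designs actually ran):

```python
import math

def point_to_hex(x, y, size):
    """Map a point to axial hex coordinates (pointy-top grid of the given size)."""
    # pixel -> fractional axial coordinates
    q = (math.sqrt(3) / 3 * x - 1 / 3 * y) / size
    r = (2 / 3 * y) / size
    # cube-round to the nearest hex center (q + s + r must sum to 0)
    xc, zc = q, r
    yc = -xc - zc
    rx, ry, rz = round(xc), round(yc), round(zc)
    dx, dy, dz = abs(rx - xc), abs(ry - yc), abs(rz - zc)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return int(rx), int(rz)

# Aggregating is then just counting points per hex key:
from collections import Counter
counts = Counter(point_to_hex(x, y, 1.0) for x, y in [(0.1, 0.1), (0.2, 0.0)])
```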

I haven't read the nanocubes paper yet, but it sounds much much faster than my
approach using PostGIS. It took me many hours to import and process tens of
millions of points in my visualisation. It'd be great to be able to turn
something like this around in <10 mins.

Would this nanocubes approach work with hexes? (they look cooler IMO)

Offtopic: After all this, all I can think about is making a Settlers of Catan
MMO. ;-)

~~~
cscheid
Heh, I had a conversation a few weeks ago with Jeff Heer, coauthor of a very
similar paper where we talked about how to make this stuff work with hexes. I
think it can be done, but the hierarchical part is a pain, and I'll need more
than a HN comment to explain :) Look me up offline if you're interested and we
can chat.

------
pyalot2
Or as we in the graphics field call it: mipmapping, geoclipmapping,
megatexturing, virtual texturing or slippy maps.

There's a related capability called "interpolation" which could be linear,
bicubic etc. Looks nice too, try it.

~~~
cscheid
(Hey, I'm a big fan of your WebGL work!)

This is really not about interpolation or mipmapping; it's about creating a
data structure that allows you to quickly answer queries with which to build
visualizations. For example, you might want to see a heatmap of all tweets
that happened in 2013, but you might want to see only the count of a
particular week, or a month of tweets from Instagram. You might also want to
see any of these over the entire world, or over a single metro
region. What nanocubes enables is answering these queries in time essentially
proportional to the size of the _output_ (we take a polylog hit, but that's
kind of to be expected). We don't even touch all of the input data, which, if
you ask me, is kind of nice.
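The flavor of this can be sketched with a toy structure: a quadtree whose nodes each carry per-time-bin counts. This is my own simplification, not the actual nanocube layout (the real structure shares subtrees across dimensions), but it shows why a query touches one path plus a few time bins rather than all the input points:

```python
from collections import defaultdict

class Node:
    def __init__(self):
        self.children = {}                    # quadrant (0-3) -> Node
        self.time_counts = defaultdict(int)   # time bin -> count

class CountQuadtree:
    """Toy spatiotemporal count index: every node aggregates its subtree."""
    def __init__(self, depth):
        self.root = Node()
        self.depth = depth

    def insert(self, x, y, t):
        """x, y in [0, 1); t is an integer time bin."""
        node = self.root
        node.time_counts[t] += 1
        for _ in range(self.depth):
            qx, qy = int(x >= 0.5), int(y >= 0.5)
            x, y = x * 2 % 1.0, y * 2 % 1.0
            node = node.children.setdefault(qy * 2 + qx, Node())
            node.time_counts[t] += 1

    def count(self, path, t0, t1):
        """Points in the tile addressed by `path` (quadrant list) with t in [t0, t1]."""
        node = self.root
        for quad in path:                     # walk one path, not the whole tree
            node = node.children.get(quad)
            if node is None:
                return 0
        return sum(c for t, c in node.time_counts.items() if t0 <= t <= t1)

qt = CountQuadtree(depth=4)
qt.insert(0.1, 0.1, 5)
qt.insert(0.9, 0.9, 5)
print(qt.count([0], 5, 5))  # lower-left quadrant, time bin 5 -> 1
```

A query's cost here depends on the path length and the time range, not on how many points were inserted, which is the property being described (modulo the polylog factors the real structure pays).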

I wish I had bigger public datasets to show you, but we keep this fast
experience with datasets in the billions of points.

About your interpolation point: yes, we could (and sometimes do) use
interpolation and smoothing. We chose not to here so that it looks the same
when you switch to the WebGL-less mode and run it on an iPhone or iPad; there
isn't a big visual difference.

~~~
pyalot2
> it's about creating a data structure that allows you to quickly answer
> queries with which to build visualizations

Hm, I see, so this can output the virtual miplevels quickly? Also, would it
work with a "full" dataset (i.e. no empty cells)?

> About your interpolation point: yes, we could (and do sometimes) use
> interpolation and smoothing.

Also consider miplevel interpolation, makes the transitions between levels
much easier on the eye (doesn't work well with nearest sampling).

~~~
cscheid
> Hm, I see, so this can output the virtual miplevels quickly?

Yes: when you see the tiles on (say)
[http://www.nanocubes.net/view.html#twitter](http://www.nanocubes.net/view.html#twitter),
what you're getting is not a precomputed tile: the server is actually visiting
the datastructure every time (and I hope I can convince you that at that
speed, it is _not_ touching 210 million points :). The reason you can't
precompute all choices is simply that there's too many of them and you'd run
out of bits in the universe (the paper,
[http://www.nanocubes.net/assets/pdf/nanocubes_paper.pdf](http://www.nanocubes.net/assets/pdf/nanocubes_paper.pdf),
has details).
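To put rough numbers on "too many of them" (my back-of-the-envelope figures, not the paper's), even a modest schema yields an astronomical number of distinct query results to precompute:

```python
# Hypothetical schema: 18 zoom levels, hourly bins over a year, 5 category values.
zoom_tiles = sum(4 ** z for z in range(18))      # every tile at every zoom level
time_bins = 52 * 7 * 24                          # hourly bins for one year
time_ranges = time_bins * (time_bins + 1) // 2   # every possible [t0, t1] interval
categories = 2 ** 5                              # every subset of 5 category values

total = zoom_tiles * time_ranges * categories
print(f"{total:.2e} precomputed answers")        # on the order of 10**19
```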

> Also, would it work with a "full" dataset (i.e. no empty cells)?

Not exactly. We get away with an in-memory data cube exactly because in many
practical cases the data is sparse over the address space. With dense data
you'll take a much more significant memory hit (but it's one you'd have to
take with _any_ aggregation scheme; the paper, again, has details).

> Also consider miplevel interpolation, makes the transitions between levels
> much easier on the eye (doesn't work well with nearest sampling).

That's fair, although again we're hoping to keep the WebGL version feature-
compatible with the leaflet.js version (that does canvas-only, and has no zoom
transition at all); the smooth version looks nicer, as you can see in the
following video (a bit on the PR-heavy side, sorry):
[https://www.youtube.com/watch?v=8P9QA6TJwys#t=69](https://www.youtube.com/watch?v=8P9QA6TJwys#t=69)

------
cscheid
Author here, happy to answer questions.

~~~
vanderZwan
> and in _some_ cases it uses sufficiently little memory that you can run a
> nanocube in a modern-day laptop.

What's the limitation? (I admit that I did not attempt reading the paper yet)

EDIT: To reword it, this sentence gives the impression that there could be
more factors than just plain quantity, and I was wondering whether that is the
case and what those factors could be.

~~~
cscheid
Memory usage gets worse as the number of dimensions increases (because there
are more ways in which to build summaries), and also as the number of unique
keys increases (because there are more distinct things to store). The main
advantage nanocubes has is that it's a _sparse_ scheme: previous work on fast
data cubes for visualization (immens, for example) use a _dense_ storage
scheme, and so memory usage goes up proportionally to the size of the address
space instead.
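A toy illustration of the sparse/dense distinction (my numbers, not the paper's benchmarks): a dense scheme pays for the whole address space up front, while a sparse one pays only for occupied cells.

```python
from collections import Counter

# Hypothetical binned dimensions for an event dataset.
bins = {"lat": 1024, "lon": 1024, "hour": 24 * 365, "device": 5}

dense_cells = 1
for n in bins.values():
    dense_cells *= n
# At 8 bytes per counter, the dense base level alone is ~367 GB,
# before adding any coarser aggregation levels on top.
print(dense_cells * 8 / 1e9, "GB for the dense base level")

# A sparse scheme only pays for keys that actually occur:
events = [(512, 300, 17, 2), (512, 300, 18, 2)]  # toy event tuples
counts = Counter(events)
print(len(counts), "occupied cells")  # 2, out of ~4.6e10 addressable
```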

------
nitrogen
A high speed multidimensional query seems like it would be useful for
visualizing and analyzing server logs to look for attacks and other trends.

~~~
cscheid
You're right, it is :)

The version with online demos uses latitude/longitude for the spatial
dimension, but a new version we're working on (under the master branch on
github) allows arbitrary x,y addresses. With that, we've encoded IP addresses
as locations in space-filling curves to play around with source IP/destination
IP datasets. It's particularly nice because the hierarchical nature of the
spatial addresses ends up mapping to larger and larger subnets of the IPv4
space.
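A minimal sketch of such an encoding (the comment doesn't say which curve they use; Z-order/Morton interleaving is the simplest to show, assuming IPv4 mapped to a 16-bit-by-16-bit grid):

```python
def ipv4_to_xy(ip: str) -> tuple[int, int]:
    """Map an IPv4 address to 16-bit (x, y) by de-interleaving its bits (Z-order)."""
    a, b, c, d = (int(p) for p in ip.split("."))
    n = (a << 24) | (b << 16) | (c << 8) | d
    x = y = 0
    for i in range(16):
        x |= ((n >> (2 * i)) & 1) << i       # even bits of the address -> x
        y |= ((n >> (2 * i + 1)) & 1) << i   # odd bits of the address -> y
    return x, y
```

Because the high-order bits of the address become the high-order bits of both x and y, addresses sharing a prefix (i.e. a subnet) land in the same coarse tile, which is the hierarchical property described above.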

------
teddyh
Would this make sense as a type of index in PostGIS?

~~~
cscheid
It's open source
([http://github.com/laurolins/nanocube](http://github.com/laurolins/nanocube)),
so one could certainly try that. With that said, it is currently an in-memory
store with no disk backing, so it would be a significant amount of work to get
it going.

At the same time, I certainly believe that _some_ sort of hierarchical, low-
footprint data cube would be a great addition to postgis.

