
Datashader: turns even the largest data into images, accurately - luu
http://datashader.org/
======
mhalle
Looks like a great project. Contrary to other comments, rendering !=
visualization. This project seems to have paid attention to lots of the
seemingly little but critical details of this type of visualization that are a
pain to handle yourself (anti-aliasing of multi-scale data, terrain shading,
large- and out-of-core visualization).

Any one of these topics can bring a visualization project to a screeching
halt, or make the results look misleading or bad.

Even better that they built a tool that works with existing libraries, rather
than replacing them. Good work!

~~~
BubRoss
> anti-aliasing of multi-scale data, terrain shading, large- and out-of-core
> visualization

Webgl will basically do all of that for you, including the out-of-core part if
you can stream the data in.

~~~
mhalle
No, it can't. WebGL is at a completely different level of the stack than
Datashader. WebGL is even lower level than what most people use for 3D
graphics (hence threejs and babylonjs).

Data visualization is at a higher semantic level than rendering; ideally you
don't want to deal with pixels and polygons. D3, for instance, binds graphical
primitives (usually in SVG) with data representations but requires more
programming to do actual data visualization (and that's why a bunch of
software layers on top of D3). Bokeh deals with still higher level primitives
closer to the data set level (plotting and charts).

And Datashader carves out a niche where there's too much data to have a 1:1
ratio of data element to graphical primitive on the screen. It does that by
rasterizing, but then also handling the hard part of mapping backwards from
image to data for selection and interactivity (I hope that's right; I got it
from watching the 2016 video).
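
In concrete terms, the aggregate-then-shade pipeline looks roughly like this
(a minimal sketch based on Datashader's documented API; the dataframe and
column names are made up):

    import pandas as pd
    import datashader as ds
    import datashader.transfer_functions as tf

    # Hypothetical dataset; in practice there are far more rows than pixels
    df = pd.DataFrame({'x': [0.1, 0.5, 0.9], 'y': [0.2, 0.8, 0.4]})

    canvas = ds.Canvas(plot_width=400, plot_height=400)  # the raster grid
    agg = canvas.points(df, 'x', 'y', agg=ds.count())    # aggregate per pixel
    img = tf.shade(agg)                                  # map aggregates to colors

As I understand it, everything downstream (color mapping, selection) operates
on that aggregate array rather than on the individual data points.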

Anyone who has had to do this stuff for a living knows it is hard to do right,
and that good modular tools are always welcome.

I don't see how this repeated "it's just like X" line of responses is
benefiting the discussion. Datashader is not just like WebGL or basic low-
level rendering, any more than D3 is just SVG or web apps are just TCP
connections. Completely different levels of abstraction with lots of value add
in between (and hard work, I'm sure).

~~~
BubRoss
I didn't say that webgl would do everything this can do; I said it would do
the things I copied from the post I replied to.

Again, my issue is that they seem to put a lot of focus on this having some
sort of sophisticated new rendering when it seems to be marketing of trivial
techniques. People seem to like what this library does, but they didn't invent
new rendering algorithms and their buzzwords and clever names just show a lack
of awareness of what they are doing in the rendering department.

~~~
jbednar
I'm not sure if you're objecting to the name "Datashader", but surely every
library needs a name, and this one is accurate in that it allows the sort of
shading that one does for 3D rendering to be applied to 2D data plotting. Or
are there other buzzwords used in the docs you find objectionable?

~~~
BubRoss
If I said I was an expert in 'big data visualization with billions of points'
and had written my own 'out of core' rendering library that I dubbed 'data
shader', complete with a paper where I coined the term 'Abstract Rendering' or
'AR' for short, and then you found out that I was just reading points from
disk and drawing them with opengl's draw-points function, what would you
think?

The term 'out of core' rendering comes from raytracing, where you really do
need all the geometry available. They are applying it to trivial
accumulation, where it was never a problem in the first place. That's like me
writing a paper on how to make a balloon airtight. That's how it has always
worked; why would I take credit for something that was never a problem?

~~~
jbednar
Sigh. Datashader is not a paper; it's an actual, usable piece of software, so
it should be compared to other tools and libraries for rendering data. Unlike
nearly every other 2D plotting library available for Python, it can operate
in core or out of core, so it's entirely appropriate to advertise that fact
(why hide it?). Unlike OpenGL's point-drawing functions and nearly every
other 2D plotting library available for Python, it avoids the overplotting
and z-ordering issues that make visualizations misleading (so why hide
that?). Unlike NumPy's histogram2d, it allows you to define what it means to
aggregate the contents of each bin (mean, min, std, etc.), to focus on
different aspects of your data. It's a mystery to me why you think Datashader
should somehow fail to advertise what it's useful for!
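
Concretely, that aggregation choice is just a parameter of the reduction
step; for example (with a made-up dataframe):

    import pandas as pd
    import datashader as ds
    import datashader.transfer_functions as tf

    df = pd.DataFrame({'x': [0.1, 0.5], 'y': [0.2, 0.8], 'value': [3.0, 7.0]})
    canvas = ds.Canvas(plot_width=400, plot_height=400)

    agg_count = canvas.points(df, 'x', 'y', agg=ds.count())       # histogram2d-like
    agg_mean = canvas.points(df, 'x', 'y', agg=ds.mean('value'))  # mean per bin
    agg_std = canvas.points(df, 'x', 'y', agg=ds.std('value'))    # spread per bin
    img = tf.shade(agg_mean)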

~~~
BubRoss
> Datashader is not a paper

[https://www.semanticscholar.org/paper/Abstract-rendering%3A-...](https://www.semanticscholar.org/paper/Abstract-rendering%3A-out-of-core-rendering-for-Cottam-Lumsdaine/41a157f01b27bcf20143fc578a0db49f4e7b90f3)

You keep defending the project as a whole without confronting the fact that
they are touting rendering breakthroughs, even though I have explained at
length why there are no rendering breakthroughs: the actual rendering, no
matter where it is done and no matter how much data is used, is trivial. I'm
not sure what can help you focus on the point I'm making here; I haven't
strayed from it. This isn't about the workflow or the language used or
anything else. It is about false claims and buzzwords that make people think
it is solving rendering problems that have never existed, like 'accuracy' and
'big data' (in the context of these visualizations).

~~~
eggie
They are touting it specifically in the context of the visualization of very
large datasets.

The fact that their software exists is itself a breakthrough. It enabled me to
do things that equivalent tools (such as those in statistical packages) could
not. I would have been reduced to directly implementing my own rendering
pipelines, and I would also have had to make many of the same design decisions
they made, such as doing things out of core.

------
IanCal
Datashader is a great project. Very fast, very easy to use. You can throw a
lot of data at it in a notebook and get back a zoomable interactive pane.

Here's a 2016 talk on it:
[https://www.youtube.com/watch?v=fB3cUrwxMVY](https://www.youtube.com/watch?v=fB3cUrwxMVY)

There have likely been many improvements since then, but that should help show
some of the core parts and explain why it's a useful tool.
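
The zoomable interactive pane typically comes from pairing Datashader with
Bokeh through HoloViews, which re-rasterizes on every pan or zoom; a minimal
sketch (the dataframe and columns are made up):

    import pandas as pd
    import holoviews as hv
    from holoviews.operation.datashader import datashade
    hv.extension('bokeh')

    df = pd.DataFrame({'x': [0.1, 0.5, 0.9], 'y': [0.2, 0.8, 0.4]})
    points = hv.Points(df, ['x', 'y'])
    datashade(points)  # displayed in a notebook, re-renders as you pan/zoom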

------
24gttghh
[https://anaconda.org/jbednar/gerrymandering/notebook](https://anaconda.org/jbednar/gerrymandering/notebook)

I wish more people were outraged at this kind of election tampering. Great
visualizations though! Zoom in on some of those tight masses of black
outlines. The shapes are ridiculous. Maryland 3rd? Come on.

------
itodd
I've used datashader for plotting NGS (Next Generation Sequencing)
enrichments. At the time I had to hack together the ability to use the polygon
select tool on the data, but it worked and blew my mind.

Very elegant solution to a difficult problem (overplotting).

~~~
abcc8
Do you have any examples of this you could point to online? I am looking at
different visualization tools for various NGS-based analyses currently.

~~~
itodd
I do. I remember posting this to the mailing list. I don't have an example
calculating enrichments though. We simply group by read and divide using the
frequencies and then plot one enrichment vs another. This way we can see how
one sequence enriches between conditions. There's more to it than that, but
this will produce a plot similar to the one in the attachment in the thread
linked below.

[https://groups.google.com/a/continuum.io/forum/m/#!msg/bokeh...](https://groups.google.com/a/continuum.io/forum/m/#!msg/bokeh/BdQV54W0WnY/OHjlMAtqBAAJ)

Edit: here's a link to the plot I refer to:
[https://10826817673355204906.googlegroups.com/attach/e6e58ad...](https://10826817673355204906.googlegroups.com/attach/e6e58ade6248/Screen%20Shot%202017-05-04%20at%202.47.07%20PM.png?part=0.1&view=1&vt=ANaJVrHL6xR_HvljfCodWVnBwALwADyL8wIpWga5D7Q5DzAHYPOsTTpqg193EuAC81xxXb-toP7Nv8y724yMogPSNlLXeU6lZXJ-Trm3r5-zpIVHEe2fpUI)

------
tokyodude
> Turns even the largest data into images, _accurately_

The first image, the image of the USA, seems really misrepresentative to me.
LA and NYC should be way, way brighter, relative to everything else, than the
entire area east of the Mississippi.

At least to my eyes that map makes it look like parts of Denver, Kansas City,
Salt Lake City, Atlanta, and the San Joaquin Valley are just as dense as
Manhattan.

Atlanta's population density: 630 per square mile

Manhattan's population density: 70,826 per square mile

It seems like an accurate data image would have Atlanta's brightness at about
1/100th of Manhattan's (70,826 / 630 ≈ 112). Basically it looks like they
saturated out at around 250 people, so anything over 250 people is the same
brightness.

~~~
jbednar
By default, Datashader accurately conveys the shape of the distribution in a
way that the human visual system can process. If you want a linear
representation, you can do that easily; see the first plot in
[http://datashader.org/topics/census.html](http://datashader.org/topics/census.html),
but you'll quickly see that the resulting plot completely fails to show that
there are any patterns anywhere besides the top few population hotspots,
which is highly unrepresentative of the actual patterns in this data. There
is no saturation here; what it's doing in the homepage image is basically a
rank-order encoding, where the top brightness value is indeed shared by
several high-population pixels, the next brightness value is shared by the
next batch of populations, and so on. Given only 256 possible values, there
has to be some grouping, but it's not saturating.
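
In code terms, this is just the 'how' argument of the shading step; a small
sketch with synthetic data standing in for the census counts:

    import numpy as np
    import pandas as pd
    import datashader as ds
    import datashader.transfer_functions as tf

    # Tiny stand-in aggregate: point counts per pixel
    df = pd.DataFrame({'x': np.random.rand(1000), 'y': np.random.rand(1000)})
    canvas = ds.Canvas(plot_width=100, plot_height=100)
    agg = canvas.points(df, 'x', 'y', agg=ds.count())

    tf.shade(agg, how='linear')   # raw magnitudes: top hotspots dominate
    tf.shade(agg, how='log')      # log-compressed range
    tf.shade(agg, how='eq_hist')  # histogram equalization (the default)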

------
johnmarinelli
Looks like a really cool project. One thing I would be interested in seeing
is Datashader used as a dynamic visualisation library - for example, for
generative art projects. Probably not the main interest of data visualisation
practitioners, but hey, if you've got a sweet pipeline to render all those
points, why not?

~~~
jbednar
The attractors at
[http://datashader.org/topics/strange_attractors.html](http://datashader.org/topics/strange_attractors.html)
are probably closer to art than science...

------
whoisjuan
What license is this project using? The repo has a license but besides the
provisions listed there I don't see any standard license.

~~~
tnvaught
The repo has a standard 3-clause BSD license.

------
burtonator
This actually gave me an interesting idea regarding bitcoin passphrase
mnemonics.

Instead of text we could use the same algorithm to generate images.

So you could have an index of images and generate them. I'm actually wondering
if you could use nouns and verbs to maybe make stories if you could mutate the
nouns reliably.

Like 'bird flying' vs 'bird sleeping' ...

This could help people remember long passphrases visually, which they seem to
be better at.

------
simplyinfinity
Is there anything similar for network graphs?

~~~
lmeyerov
Yep -- at
[https://github.com/graphistry/pygraphistry](https://github.com/graphistry/pygraphistry),
we started by making millions of nodes/edges interactive. If you use
notebooks, can signup on our site and get going. The trick is we connect GPUs
in the browser to GPUs in the cloud, and encapsulate it enough that you can
stick to writing standard SQL/pandas/etc.

We've been curious about server-side static tile rendering for larger graphs,
but it has been on the back burner. (We already connect to GPUs on the
server, so it's not rocket science.) Currently, we're actively increasing how
much can be ingested + computed on, such as for finding influencers,
communities & rings, etc. However, visualizing that hasn't been an
operational priority for our users. It's more useful to generate the
communities and then either inspect individual ones or see how they stitch
together; otherwise you quickly run out of pixels due to too many edges.
Likewise, we're building connectors to gigascale-petascale graph DBs: Titan,
Janus, AWS Neptune, TigerGraph, Spark GraphX, etc.

We're still interested, but more for when we start supporting geographic
maps: you can see that's the primary use for datashader. Also, because data
art is fun :)

~~~
jbednar
Geographic maps aren't the primary use for datashader; those are just easy
examples that people can appreciate without a lot of explanation. In practice
we use it for _any_ large datasets that we don't want to subsample before
visualizing them.

~~~
lmeyerov
Yes, definitely more capable. I've seen them primarily used in scatterplots
(x, y, maybe z) + maps. Curious where else you're seeing the 80/20 breakdown
if not there...

