Any one of these topics can bring a visualization project to a screeching halt, or make the results look misleading or bad.
Even better that they built a tool that works with existing libraries, rather than replacing them. Good work!
WebGL will basically do all of that for you, including the out-of-core part, if you can stream the data in.
Data visualization is at a higher semantic level than rendering; ideally you don't want to deal with pixels and polygons. D3, for instance, binds graphical primitives (usually in SVG) with data representations but requires more programming to do actual data visualization (and that's why a bunch of software layers on top of D3). Bokeh deals with still higher level primitives closer to the data set level (plotting and charts).
And Datashader carves out a niche where there's too much data to have a 1:1 ratio of data element to graphical primitive on the screen. It does that by rasterizing, but then also handling the hard part of mapping backwards from image to data for selection and interactivity (I hope that's right; I got it from watching the 2016 video).
Anyone who has had to do this stuff for a living knows it is hard to do right, and that good modular tools are always welcome.
I don't see how this repeated "it's just like X" line of responses is benefiting the discussion. Datashader is not just like WebGL or basic low-level rendering, any more than D3 is just SVG or web apps are just TCP connections. Completely different levels of abstraction with lots of value add in between (and hard work, I'm sure).
Again, my issue is that they seem to put a lot of focus on this having some sort of sophisticated new rendering when it seems to be marketing of trivial techniques. People seem to like what this library does, but they didn't invent new rendering algorithms and their buzzwords and clever names just show a lack of awareness of what they are doing in the rendering department.
The term 'out of core' rendering comes from raytracing, where you really do need all the geometry available. They are applying it to trivial accumulation where it was never a problem in the first place. That's like me writing a paper on how to make a balloon air tight. That's how it has always worked, why would I take credit for something that was never a problem?
You keep defending the project as a whole while not confronting the fact that they are touting rendering breakthroughs, while I have given a lot of explanation of why there are no rendering breakthroughs and the actual rendering, no matter where it is done and no matter how much data is used, is trivial. I'm not sure what can help you focus in on the point I'm making here, I haven't strayed from it. This isn't about the workflow or the language used or anything else. It is about false claims and buzzwords to make people think that it is solving rendering problems that have never existed like 'accuracy' and 'big data' ( in the context of these visualizations ).
The fact that their software exists is itself a breakthrough. It enabled me to do things that other equivalent tools (such as in statistical packages) could not allow. I would have been reduced to directly implementing my rendering pipelines, and I would also have had to make many of the same design decisions they made, such as doing things out of core.
You imply that accumulative rendering into a framebuffer solves large statistical integration problems. But the framebuffer is not implemented using abstract math over the real or integer domains. You need to consider the numerical effects of adding the smallest value (one sample) into a running sum.
If you use an integer/fixed precision buffer for the running sums, you need enough bits to avoid overflow even if billions of points land in one bin. You might think to use floating point, but that has worse problems for running sums. You are effectively limited to the number of bits in the mantissa when continuously adding small increments.
So, you cannot scale up the naive approach of zeroing the framebuffer and blending/accumulating points from a stream. You need to do some hierarchical aggregation to accurately represent sub-populations and combine them in a numerically robust manner. Most likely, you would also like to precompute some of these results to support better interactive performance, much like mip-mapping is used to provide more accurate texture sampling at multiple rendering scales.
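To make that concrete, here is a minimal sketch (my own illustration, not anything from Datashader itself) of how a single float32 running sum silently drops samples, and how chunked partial sums avoid it:

    import numpy as np

    # Once a float32 running sum reaches 2**24, adding 1.0 no longer changes it:
    # the increment is smaller than the gap between adjacent representable values.
    total = np.float32(2**24)                      # 16,777,216
    print(total + np.float32(1.0) == total)        # True -- the sample is lost

    # Keeping per-chunk partial sums small and combining them at higher precision
    # (the hierarchical-aggregation idea) preserves every sample.
    data = np.ones(30_000_000, dtype=np.float32)
    partials = [chunk.sum(dtype=np.float64) for chunk in np.array_split(data, 30)]
    print(sum(partials))                           # 30000000.0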
It's not about what APIs are being used to render whatever. At that level of analysis, all anybody is ever doing is memcpy and bitblt. Rather, datashader provides a framework for applying semantically meaningful, mathematical transformations on datasets as they're being accumulated, as those accumulations are converted into aesthetic/geom primitives, and as those primitives are rendered into colors. It really is "renderman for data", along with arbitrary vertex/texture shaders, driven by a dynamic rasterizer that can use whatever bins in data-space (not merely physical pixels).
BTW "Out of core" does NOT come from raytracing; in fact its history in computing is a term for anything that exceeds physical memory. We use it all the time in scientific/HPC and data science because datasets are frequently much larger than available memory.
Focus on the workflow refinements; saying there are rendering breakthroughs here is snake oil.
1GB to the browser for rendering... Sure, it's doable. So is eating 1KG of wings for lunch. Doable, but very far from being a good idea.
Here's a 2016 talk on it: https://www.youtube.com/watch?v=fB3cUrwxMVY
There's likely a lot of improvements since then, but that should help show some of the core parts and explain why it's a useful tool.
I wish more people were outraged at this kind of election tampering. Great visualizations though! Zoom in on some of those tight masses of black outlines. The shapes are ridiculous. Maryland 3rd? Come on.
Very elegant solution to a difficult problem (overplotting).
Edit: here's a link to the plot I refer to: https://10826817673355204906.googlegroups.com/attach/e6e58ad...
The first image, the image of the USA, seems really mis-representative to me. LA and NYC should be way way way more bright in relation to everything else than the entire area east of the Mississippi.
At least to my eyes that map makes it look like parts of Denver, Kansas City, Salt Lake City, Atlanta, and the San Joaquin Valley are just as dense as Manhattan.
Atlanta's population density: 630 per square mile.
Manhattan's population density: 70,826 per square mile.
It seems like an accurate data image would have Atlanta's brightness 1/100th of Manhattan's. Basically it looks like they saturated out at around 250 people, so anything over 250 people is the same brightness.
Instead of text we could use the same algorithm to generate images.
So you could have an index of images and generate them. I'm actually wondering if you could use nouns and verbs to maybe make stories if you could mutate the nouns reliably.
Like 'bird flying' vs 'bird sleeping' ...
This could help to remember long passphrases visually which people seem to be better at.
We've been curious about server-side static tile rendering for larger graphs, but it has been on the back-burner. (We already connect to GPUs on the server, so it's not rocket science.) Currently, we're actively increasing how much can be ingested and computed on, such as for finding influencers, communities, rings, etc. However, visualizing that hasn't been an operational priority for our users. It's more useful to generate the communities and then either inspect individual ones or see how communities stitch together; otherwise you quickly run out of pixels due to too many edges. Likewise, we're building connectors to gigascale-petascale graph DBs: titan, janus, aws neptune, tigergraph, spark graphx, etc.
We still are interested, but more for when we start supporting geographic maps: you can see that is the primary use for datashader. Also, because data art is fun :)
I added edge bundling (probably the slowest thing in datashader!), but I know there are examples of flight path rendering in the video I linked in another comment.
How about instead of starting with an insult (I can't believe you didn't already know this), you congratulate them on putting together a full working library with pretty, easy-to-grasp examples, and then offer up some research links that they could use to further refine and improve their system. It's our job to teach people; you can't expect everyone to suddenly know everything.
To the Datashader team: I apologize for the above comment. Good job in building and launching a tool for others to use, and great choices for examples!
Like you said. Starting with “Great job launching. What are the benefits of using Dropbox over, say just rsync + cron?” would go a long way towards improving the environment around here.
Saying 'visualize big data and billions of points' when the buzzwords are just there to sugar coat an accumulation buffer, gets into a territory of reinventing the wheel but naming it 'the flattened infinite curvature hypersphere'.
So, before you get too self righteous at least realize that it is the delivery and lack of context and precedent, not the actual work that is the problem.
I'm also curious (as a fledgling graphics programmer) - what leads you to believe that Datashader uses an accumulation buffer internally? I would think that they use some magic to draw all the points in a single draw call using instanced rendering, but I am very naive :)
D3, Bokeh, and other web-based visualization tools generally render SVG or HTML primitives in the browser. This approach works great for smaller datasets, but doesn't scale to millions or billions of points.
Datashader aggregates (accumulates) graphical representations of data into images, then provides a way to get those to the browser and work well with the other libraries. That high level description leaves out 95% of the critical practical details of visualization, which the creators of datashader handle.
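For anyone who hasn't tried it, the basic pipeline looks roughly like this (a sketch from memory, so exact argument names may differ between versions):

    import numpy as np
    import pandas as pd
    import datashader as ds
    import datashader.transfer_functions as tf

    # Ten million points -- far more than you'd want to ship to the browser as SVG.
    df = pd.DataFrame({'x': np.random.standard_normal(10_000_000),
                       'y': np.random.standard_normal(10_000_000)})

    canvas = ds.Canvas(plot_width=800, plot_height=600)   # screen-sized grid of bins
    agg = canvas.points(df, 'x', 'y', ds.count())         # accumulate per-bin counts
    img = tf.shade(agg, how='log')                        # map aggregates to colors

The output is a fixed-size image no matter how many rows went in, which is what lets it compose with Bokeh and friends on the front end.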
That is literally what OpenGL does. If you mean a histogram per pixel in depth, that's literally voxels in perspective space.
If there are usability benefits here, that's great, but everything seems to be centered around there being new rendering techniques here, when not only are they not new, they're completely trivial, with solidified names and formalized math.
The pipeline is also built in such a way that it permits front-end JS viewers like Bokeh to drive a very dynamic experience.
Their product page is well-written and accurate. It sounds like you want them to purposely describe their product as something that is inferior to what it actually is.
I only actually got my bearings and self-confidence as a programmer when I realized that most of the people pushing blogs with subscriptions about "cutting-edge" tech were literally snake-oil salesmen and shovel merchants.
That coding wasn't actually different from anything else I had learned in my life, and that there were some fundamentals I could latch onto, and grow from there upwards. All this nonsense about the field experiencing a revolution that upends all existing knowledge year-after-year is far more mentally taxing.
I actually have a background in 3D computer graphics, and it's precisely because of my detailed knowledge of raytracing, rasterization, OpenGL, BMRT, photon maps, computational radiometry, BRDFs, computational geometry, statistical sampling, etc., that when I came to the field of data science, and specifically the problem of visualizing large datasets, I realized the total lack of tooling in this space.
The field of information visualization lags behind general "computer-generated imagery" by decades. When I first presented my ideas around Abstract Rendering (which became Datashader) to my DARPA collaborators, even to famous visualization people like Bill Cleveland or Jeff Heer, it was clear that I was thinking about the problem in an entirely different way. I recall our DARPA PM asking Hanspeter Pfister how he would visualize a million points, and he said, "I wouldn't. I'd subsample, or aggregate the data."
Datashader eats a million points for breakfast.
Since you're clearly a computer graphics guy, the way to think about this problem is not one of naive rendering, but rather one of dynamically generating correct primitives & aesthetics at every image scale, so that the viewer has the most accurate understanding of what's actually in the dataset. So it's not just a particle cloud, nor is it nurbs with a normal & texture map; rather, it's a bunch of abstract values from which a data scientist may want to synthesize any combination of geometry and textures.
I chose the name "datashader" for a very specific and intentional reason: we are dynamically invoking a shader - usually a bunch of Python code for mathematical transformation - at every point, within a sampling volume (typically a square, but it doesn't have to be). One can imagine drawing a map of the rivers of the US, with the shading based on some function of all industrial plants in its watershed. Both the domain of integration and the function to evaluate are dynamic for each point in the view frustum.
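As a rough illustration (simplified, and glossing over the general sampling-volume machinery), the "shader" can be any Python function applied to the per-bin aggregate before it is mapped to colors:

    import numpy as np
    import pandas as pd
    import datashader as ds
    import datashader.transfer_functions as tf

    # Aggregate some points, then apply an arbitrary per-bin transform before shading.
    df = pd.DataFrame({'x': np.random.rand(100_000), 'y': np.random.rand(100_000)})
    agg = ds.Canvas(plot_width=300, plot_height=300).points(df, 'x', 'y', ds.count())

    def my_shader(binned):
        # any per-bin math goes here: log-compress, then gamma-correct
        return np.power(np.log1p(binned), 0.5)

    img = tf.shade(my_shader(agg), cmap=['lightblue', 'darkblue'])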
So does OpenGL on a decade-old computer.
They're not claiming to have reinvented the wheel, they're just explaining what it is.
> 'Turning data into images' isn't exactly a new concept.
No, but doing so on large data accurately (the last word is important, and you cut it off) is not something I know how to achieve as easily or as quickly in a different Python library. I'd like to know if I could.
What do you mean by 'large' or 'accurate' ? Where would accuracy be lost in any approach?
We renamed from Abstract Rendering to Datashader for affordances of human cognition.
This is a great paper by Gordon Kindlmann and Carlos Scheidegger that talks about how to gauge the accuracy of a visualization, as part of an effort to come up with an algebraic process for visual design: https://vis.cs.ucdavis.edu/vis2014papers/TVCG/papers/2181_20...
Using their metrics around "confusers" and "hallucinators", Datashader came out as one of the few things that doesn't suffer from such intrinsic limitations.
> Rendering techniques are currently a major limiter since they tend to be built around central processing with all of the geometric data present.
This is completely untrue - OpenGL and virtually all real time rendering is done using z-buffer techniques that were originally used because they don't need all the geometry present. These techniques date back to the 70s and were some of the first hidden surface rendering algorithms.
> This paper presents Abstract Rendering (AR), a technique for eliminating the centralization requirement while preserving some forms of interactivity.
Interactivity might be novel here, so that is what should really be focused on, if anything. I don't think coining a new term and acronym that don't seem to relate to what is happening is going to be a good way to communicate the techniques.
> AR is based on the observation that pixels are fundamentally bins, and that rendering is essentially a binning process on a lattice of bins.
This observation was made in the early 80s and has been the backbone of renderman renderers for almost 40 years. Renderman calls them 'buckets'.
> This approach enables: (1) rendering on large datasets without requiring large amounts of working memory,
Renderman originally rendered film resolution images with high resolution textures with only 10MB of memory.
> (3) a direct means of distributing the rendering task across processes,
Giving different threads their own buckets is standard for any non-toy renderer. Distributing buckets across multiple computers is part of many toolsets.
> high-performance interaction techniques on large datasets
This is the only part that has a chance of being novel, but the paper only shows basic accumulation of density for adjacency matrices. The visualizations are timed at multiple seconds yet look extremely simple, and for some reason are rendered 'out-of-core' on a computer with 144GB of memory, even though it is very unclear that these images couldn't be made with z-buffer rendering in OpenGL.
> This is a great paper from Gordon Kindlmann and Carlos Scheidegger talk about how to gauge the accuracy of a visualization
It looks like that paper is about the transformations of visualizations for higher dimensional data, not rendering accuracy, so these two things are being conflated even though they are completely separate concepts.
Actually, no. The paper may not have been explicitly clear about this, but the ENTIRE point of a "data visualization" system is to transform potentially high-dimensional datasets, with a large number of columns, into meaningful images by a series of steps. You seem to be interpreting this narrowly, and imagining that geometry is already pre-defined in the dataset, so then of course this looks like a fairly trivial 2D accumulator.
That is not the intent, nor is the common use case.
For data visualization, the question of "how do I accurately aggregate or accumulate the 25 - 1million points in this bucket" is a deep one. There is NO data visualization system that programmatically gives access to this step of the viz pipeline to a data scientist or statistician. Most "infoviz" tools gloss over this problem - they do simple Z buffering, or cheesy automatic histograms of color/intensity, etc. These are almost always "wrong" and produce unintended hallucinators.
Your first comment - about "not needing all the geometry present" - indicates that you are not understanding the nature of the problem datashader was designed to solve. There is no simple "cull" function for data science; there is no simple "Z" axis on which to sort, smush, blend, etc. At best, your data points can be projected into some kind of Euclidean space on which you can implement a fast spatial subdivision or parallel aggregation algorithm. But once that's done, you're still left holding millions of partitions of billions of points or primitives, each with dozens of attributes.... what then?
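To be concrete about what "programmatic access to this step" means: the reduction applied inside each bin is a user-level choice, not a hard-wired blend. A rough sketch, assuming a dataframe `df` with numeric columns `x`, `y`, `value` and a categorical column `cat` (all hypothetical names):

    import datashader as ds

    cvs = ds.Canvas(plot_width=800, plot_height=600)

    # Different per-bin reductions over the same points give very different pictures:
    counts   = cvs.points(df, 'x', 'y', ds.count())           # how many points fell here
    means    = cvs.points(df, 'x', 'y', ds.mean('value'))     # average of a column per bin
    by_class = cvs.points(df, 'x', 'y', ds.count_cat('cat'))  # per-category counts per bin

Each of those aggregates then flows into the downstream colormapping and compositing steps.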
I'm not sure why you would coin a term 'Abstract Rendering' and talk about 'out of core rendering' then turn around and say that transforming high dimensional data sets is part of rendering. Rendering is well defined and very established, coming up with transformations and calling that part of rendering is nonsense. You made this mess yourself by trying to stretch the truth.
What library should I use to do this faster in Python without subsampling the data, for a start?
Can't remember where I've seen them before - possibly in a book on Symmetry.