Don't use Scatterplots (chrisstucchio.com)
135 points by yummyfajitas on May 26, 2012 | 73 comments



Great points, but in the author's 2D histograms the use of 'color' to show point density is a little misleading. Color is not a scalar, it's a vector, and simply varying hue doesn't map well onto human perception of lightness or intensity. It also doesn't allow for meaningful interpolation. If you take a look at http://vis4.net/blog/posts/avoid-equidistant-hsv-colors/ you can see a strategy for building perception-aware color palettes/gradients, which would be much more useful in this case.

(actually it looks like this wasn't the author's decision, but something built into the graphing library)


I agree with you. I think it's better to use transparency to show density-- here's an example from ggplot2: http://had.co.nz/ggplot2/graphics/25464f1f2c009d435b862debca... and the page in general at http://had.co.nz/ggplot2/geom_point.html

And if you do this overplotting with partly transparent data points, you can even continue using scatterplots (which have several benefits over binned color density plots, such as showing the actual data, rather than summaries).
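For the matplotlib crowd, here's a minimal sketch of the same idea (the data is made up just to show the effect):

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up data with heavy overplotting in the middle.
    np.random.seed(0)
    x = np.random.randn(20000)
    y = x + np.random.randn(20000)

    # A low alpha lets overlapping points accumulate into visibly darker
    # regions, so the scatterplot itself doubles as a rough density display.
    plt.scatter(x, y, s=5, alpha=0.05, edgecolors='none')
    plt.show()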


Agreed - good article, but terrible heat maps. Another great article about this: http://www.research.ibm.com/people/l/lloydt/color/color.HTM


Interesting article you linked to, and certainly something to consider. That being said, I somewhat disagree with you about the histograms being misleading. His color scheme looks to be modeled on thermal imaging, which I find is quite clear about which are the 'hot' zones (in this case high point density) and which are the 'cold' zones (low point density).


I think zacharyvoase's point was that it might either emphasize or deemphasize how much hotter or colder something was. Because the eye doesn't see blue->yellow->red hue changes consistently, what looks like a "big" change on the graph might actually be fairly small.


Don't use Density Plots ..

... without seriously considering the benefits of a scatterplot.

Here is some data I plotted recently: http://i.imgur.com/lvraM.png

Yes, you need to be careful about overlap (transparency can help), but without a scatterplot, I would not see the sharp edges, or have my attention drawn to the outliers.

Density plots imply a model: by creating bins, square or hex, you are adding a layer of interpretation on top of the 2D data, which can be misleading. Also, the bins in this article have sharp edges (high frequency content) and add artificial structure. I think smooth density plots, not covered by this article, are superior.

Best of both worlds? --> http://www.survey-design.com.au/graphs/density_plot1.png


Having spent quite a bit of time looking at geospatial data in density plots, I'd warn that visualizing overlapping point data as a density plot is not a panacea for this problem. As it turns out, the distribution of colors along the density vector is also critically important.

e.g. Suppose you have lots of data where there are overlaps that aggregate to a value of 2, then a few that overlap to 5, and one or two that overlap to a value of 9.

Depending on how you distribute the colors you may either:

1. Only really see the colors at the low end and the high end of the spectrum, thus missing the 5 values (or they may be so near the color for 2 that you can't perceive them).

2. Clearly delineate between the 2s, the 5s, and the 9s, but without conveying magnitude: it's not clear that the 5s are a bigger multiple of the 2s than the 9s are of the 5s.

3. Use some other distribution that tells a different story about the magnitude of the density, and can be interpreted in wildly different ways.

Different color gradients can also be perceived by our visual systems differently, blue-to-red doesn't always cut it.


Putting the data in log-scale before coloring should improve your ability to visualize magnitude differences. (issue 2 that you raised)
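In matplotlib's hexbin this is just a keyword argument; a quick sketch with made-up data:

    import numpy as np
    import matplotlib.pyplot as plt

    np.random.seed(1)
    x = np.random.randn(50000)
    y = x + 0.5 * np.random.randn(50000)

    # bins='log' colors cells by (roughly) log10 of the count, so a few very
    # dense cells no longer wash out the differences among the moderate ones.
    plt.hexbin(x, y, gridsize=40, bins='log')
    plt.colorbar(label='log10(count)')
    plt.show()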


Another great one is Jenks. It doesn't do much to help with comparing magnitude between colored classification levels, but it can help visualize the fact that the categories exist at all.

http://en.wikipedia.org/wiki/Jenks_Natural_Breaks_Optimizati...


One of his complaints is about truncation and therefore overlapping points. It's very common, when you have multiple points occupying the same space, to add a third dimension of color (a la the hexplot) or of point size... or to just add "jitter", which is easy in R or SAS.

Though, as http://www.statisticalanalysisconsulting.com/scatterplots-de... points out, past 10,000 points, jitter isn't enough and alternative plots are probably best.


(Going off on a tangent here... sorry, Hacker News makes me think like this.) Chris, what do you think about setting up a stats-plotting-as-a-service kind of thing? People would give you data in some acceptable format and you chart it up and send it back. I am thinking of this more as a consumer service. At work, every once in a while, I get data back from testing and then I am scrambling to use gnuplot to chart it, because charts let me quickly visualize patterns and communicate the results to others. But I am not an expert at this, am continually refreshing my rusty knowledge, and have always thought it would be great to send data out to someone who is comfortable doing this and get it back charted.

Note that I am thinking small (every once in a while; small amounts of data). Something like a subscription service for a small monthly fee, or the ability to make a one-time payment for one-off jobs. Things I would look for in such a service are (1) willingness to clean the data a little bit (most times I get Excel spreadsheets, but I always need to do some tweaking to how the data is presented) and (2) willingness to sign an NDA -- this is mostly so the company I am with doesn't get annoyed that I am sending sensitive data to a random outside service.

My needs might be small, but to me it seems like this might be a valuable service for organizations that continually handle large amounts of data.

Just a thought...


We've got a service like this. We're currently working on a new version. See http://prettygraph.com


Wouldn't that be the same idea as the data input side of Wolfram|Alpha Pro?


Send me an email, contact info is in my profile.


The guy who made the original scatter plot is even a math teacher... yikes.

The post author provides some nice python code to create the plots, though the choice of hex bins when the original data was rounded to the nearest integer is a bit strange. Square bins set to some integer size would be better I believe.

If you're using Matlab, I've recently found cloudPlot (http://www.mathworks.com/matlabcentral/fileexchange/23238-cl...) to work rather nicely for "dense scatterplots".


Square bins set to some integer size would be better I believe.

I agree - a Gaussian regularization would also work nicely. But matplotlib comes with a hexbin, so that's what I used.
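If anyone wants the square-bin version, here's a rough sketch with hypothetical grade-like data (not the script behind the article's plots):

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical grade-like data, rounded to integers as in the article.
    np.random.seed(2)
    scores_2008 = np.clip(np.round(np.random.normal(60, 15, 5000)), 0, 100)
    scores_2009 = np.clip(np.round(scores_2008 + np.random.normal(0, 10, 5000)), 0, 100)

    # Square bins with edges on the integers (here 5 units wide), so the bin
    # boundaries line up with the rounding in the underlying data.
    edges = np.arange(0, 101, 5)
    counts, xe, ye, img = plt.hist2d(scores_2008, scores_2009, bins=[edges, edges])
    plt.colorbar(img, label='count')
    plt.show()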


Hexagons do have the benefit of being isotropic, however.


They have a higher order of symmetry than rectangles, but how are they isotropic?


I've probably remembered something wrong here, nevertheless the centres of all adjacent hexagons are of equal distance to the centre of the central hexagon. In contrast, a square sharing one vertex (diagonally adjacent) with the central square is farther than the square sharing one edge.


That's only if you use a scaling factor of 9 instead of 4.

edit: Wikipedia tells me hexagonal tiling is conjectured to be the tiling with the smallest perimeter per cell, and is the densest way to arrange circles on a plane. So that's a big plus when binning.


Clicking through gets you to http://mathworld.wolfram.com/HoneycombConjecture.html, which shows it is not a conjecture anymore. It was proven in 1999-2001 (I do not know whether the arXiv version needed improvements).


See also http://www.mathworks.com/matlabcentral/fileexchange/19280-bi... amongst other density estimators, for a nice smooth plot of the estimated joint density.

(Downside being there's still a somewhat-arbitrary choice to be made for the smoothing bandwidth, and it's not quite as visually obvious as the choice of number of bins in a histogram plot.)
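For the Python side, scipy's gaussian_kde gives the same kind of smooth estimate; a rough sketch with made-up data (the bandwidth, Scott's rule by default, is the somewhat-arbitrary choice mentioned above):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    np.random.seed(3)
    x = np.random.randn(2000)
    y = x ** 2 + 0.3 * np.random.randn(2000)

    # Fit a 2D kernel density estimate to the (x, y) samples.
    kde = gaussian_kde(np.vstack([x, y]))

    # Evaluate on a grid and show as a smooth density image.
    xi, yi = np.meshgrid(np.linspace(x.min(), x.max(), 200),
                         np.linspace(y.min(), y.max(), 200))
    zi = kde(np.vstack([xi.ravel(), yi.ravel()])).reshape(xi.shape)
    plt.imshow(zi, origin='lower', aspect='auto',
               extent=[x.min(), x.max(), y.min(), y.max()])
    plt.show()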


Somewhat arbitrary, but not completely. There's some heavy math that suggests sqrt(n) scaling in the 1-d case.


I've never seen a data visualization tool that is universally applicable, so a blanket edict like "don't use scatterplots" is a bit too simple. This hexagonal plot looks cool for the problem in question, but there are obvious cases where it would be unnecessarily complicated and less informative than a scatter plot. There's a reason why all of the various plot types were invented.


Under what circumstance do you believe a scatterplot is superior to a density plot?


Scatterplots are the tool of choice for displaying time-dependent data that doesn't necessarily make a smooth curve. Often you only have one value per time stamp, so overlap is not a problem.

In fact, the wikipedia article even says "A scatter plot is used when a variable exists that is under the control of the experimenter". https://en.wikipedia.org/wiki/Scatter_plot

If the author had read that part of the wikipedia article, I guess his claim would have been more specific :-)


I guess you are right, graphing y=f(x) by plotting points does meet the definition of a scatterplot. Go ahead and use it for those cases.


If your data can be displayed without points overlapping, a scatterplot can display all the information, while a density plot will always display only a summary of the data. The larger your grid size, the greater the loss of information.


And, by adding noise ("jitter" or "dither") to each point, you can still use a plain scatterplot even for many kinds of overlapping data.

It's simple to do, and it approximately undoes the effect of truncating the data (at least for continuous quantities). Just use uniformly distributed values that are as wide as one bin width.

For most purposes, I prefer adding dither, and then using transparency, to moving to a density plot, for exactly the reason you mention -- the density plot introduces another parameter, the smoothing method, which puts another layer between you and the data.
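A sketch of that recipe in Python, assuming hypothetical data that was rounded to the nearest integer (so the bin width is 1):

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical data that has been rounded/truncated to integers.
    np.random.seed(4)
    x = np.round(np.random.normal(50, 10, 5000))
    y = np.round(x + np.random.normal(0, 5, 5000))

    # Uniform dither, as wide as one bin (here 1 unit), roughly undoes the
    # visual effect of the rounding; a low alpha handles the rest.
    jitter = lambda a: a + np.random.uniform(-0.5, 0.5, size=a.shape)
    plt.scatter(jitter(x), jitter(y), s=5, alpha=0.1, edgecolors='none')
    plt.show()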


Yes, if the data was sparse enough that it could be plotted without overlap and the graph carried an annotation saying so, I could see a scatter plot being better.


With less than 20 non-overlapping points on the graph, I doubt that I would want to use anything else.

Furthermore, if my data has two outliers that are near each other, they may well be indistinguishable in the hexplot from one (or five) clustered outliers.

Your post was very interesting and your examples are great. I'll definitely use hexplots in the future. But I will still default to scatterplots. It's just easier to see if there's something wrong with the data, and they require less interpretation.


Lots... imagine plotting y = f(x) for several different data series on the same axes. A density plot for all the different data series would be really crowded, whereas a scatter plot could (not necessarily would) better show the differences between the series.


A density plot would never have revealed the discrete grouping along the x-axis due to rounding. The density plot hides it, and if he had started from density plots this might have gotten him into trouble (for example, if he had grouped them into [starttime, stoptime) partitions it would skew the results).


A scatterplot will be superior whenever you want to see all your data (and not smooth some of it away).


Small number of points, high variability, accurate numbers.


Or use scatterplots with points that have opacity < 100%. C'mon, there are still a bunch of cases where a scatterplot is much, much clearer than that kind of 2D histogram, for example when there is a nonlinear relation with very high correlation.


Here is a hexbin of a highly correlated nonlinear relationship (y=x^2 + gaussian(0,0.1)), together with a scatterplot with opacity:

opacity=35%: http://i.imgur.com/VB1E0.png

opacity=20%, and 2x bigger bins for the hexbin: http://i.imgur.com/yDR3j.png

Which image do you think better displays the data?

Just to make it interesting, I set it up so that the x values are not uniformly distributed. That's very easy to see in the hexbin, but still hard in the scatterplot.

The hexbin will always have a higher visual resolution than the scatterplot because the hexbin uses multiple colors to differentiate different densities. The scatterplot uses only one color.
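For anyone who wants to play with it, roughly how such a comparison can be set up (not the exact script behind those images):

    import numpy as np
    import matplotlib.pyplot as plt

    # Non-uniform x, and y = x^2 plus gaussian noise, roughly as described above.
    np.random.seed(5)
    x = np.random.beta(2, 5, 20000)          # skewed, so x is not uniform
    y = x ** 2 + np.random.normal(0, 0.1, 20000)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(x, y, s=5, alpha=0.2, edgecolors='none')
    ax1.set_title('scatter, 20% opacity')
    hb = ax2.hexbin(x, y, gridsize=40)
    ax2.set_title('hexbin')
    fig.colorbar(hb, ax=ax2, label='count')
    plt.show()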


Just FYI, the colormap you picked is terrible. There are many experiments to back this claim:

http://gvi.seas.harvard.edu/paper/evaluation-artery-visualiz...

http://www.jwave.vt.edu/~rkriz/Projects/create_color_table/c...

Colormap design is arguably as hard as visualization design. My favorite go-to place for them is http://colorbrewer2.org, but if you need to know only one thing about them, it's that varying hue continuously does not work nearly as well as you might think it does.

In addition, the fundamental reason scatterplots are bad, even with opacity, is essentially that opacity gives rise to an exponential relationship between overplotting and transparency.

There exists an alternative solution, which is to use additive blending and an adaptive linear colorscale, from zero to maximum overdraw. Unfortunately, at present no data visualization toolkit supports this.


>In addition, the fundamental reason scatterplots are bad, even with opacity, is essentially that opacity gives rise to an exponential relationship between overplotting and transparency.

Can you elaborate on this? I don't see why it would not be linear.

>There exists an alternative solution, which is to use additive blending and an adaptive linear colorscale, from zero to maximum-overdraw. Unfortunately, at present there exists no data visualization toolkits which support this.

I think this _might_ be done in Mathematica, since Graphics objects can be manipulated symbolically, but I might be wrong.


It boils down to the (very reasonable) way alpha blending works. Alpha was originally designed to always lie between zero and one, which for compositing makes sense. For scatterplot colormapping, not so much:

If you create a plot with opacity alpha, and which puts N points on top of each other, the remaining 'transparency', that is, the resulting opacity is

1 - (1 - alpha)^N

This is an exponential, which has the unfortunate feature that it's flat for most of the regime, and then spikes in a relatively short scale. The spike is where we get color differentiation (different opacities get different colors). That's bad: color differentiation should be uniform across the scale.

I'm pretty certain Mathematica doesn't do this right either, because it's a pixel-based technique that requires frame buffer manipulation. Instead of rendering with the usual blending operation, you do everything with additive blending, compute the maximum overdraw, and then color-scale linearly.
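A pixel-level sketch of the idea in Python, counting overdraw per pixel with a 2D histogram and then scaling colors linearly from zero to the maximum count (a hand-rolled stand-in, since no toolkit does this natively):

    import numpy as np
    import matplotlib.pyplot as plt

    np.random.seed(6)
    x = np.random.randn(100000)
    y = x + 0.5 * np.random.randn(100000)

    # 'Additive blending' by hand: count how many points land on each pixel.
    counts, xedges, yedges = np.histogram2d(x, y, bins=400)

    # Linear colorscale from zero to the maximum overdraw, so equal count
    # differences get equal color differences everywhere on the scale.
    plt.imshow(counts.T, origin='lower', aspect='auto',
               extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]],
               vmin=0, vmax=counts.max())
    plt.colorbar(label='points per pixel')
    plt.show()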


They suggest the rainbow color map is a poor choice because it hides small details. I'm advocating regularization which ensures that you have no small details.

But after reading that paper, I do agree that rainbow has some significant issues. One thing that might be worth trying: make a rainbow color map, but map values to colors in such a way that |x - y| = C * cielab_dist(color(x), color(y)).
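Roughly what that remapping might look like, using scikit-image for the CIELAB conversion and approximating the constraint by making arc length in CIELAB linear in the data value (illustrative only, names and constants are arbitrary):

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    from skimage.color import rgb2lab

    # Sample the rainbow colormap finely and convert to CIELAB.
    base = plt.get_cmap('jet')
    rgb = base(np.linspace(0, 1, 512))[:, :3]
    lab = rgb2lab(rgb.reshape(1, -1, 3)).reshape(-1, 3)

    # Cumulative perceptual distance along the colormap.
    step = np.sqrt((np.diff(lab, axis=0) ** 2).sum(axis=1))
    cum = np.concatenate([[0.0], np.cumsum(step)])
    cum /= cum[-1]

    # Re-sample so equal steps in the data value correspond to
    # (approximately) equal CIELAB distances between colors.
    positions = np.interp(np.linspace(0, 1, 256), cum, np.linspace(0, 1, 512))
    uniform_jet = ListedColormap(base(positions))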


I believe the 'Spectral' colormap (with a capital S) in matplotlib does exactly that.

Based on the names [1, 2], it seems to be a ColorBrewer [3] colormap (along with all the other capitalized matplotlib colormaps); those are all designed with these perceptual considerations in mind [4].

[1] https://github.com/gka/chroma.js/wiki/Predefined-Colors

[2] http://matplotlib.sourceforge.net/examples/pylab_examples/sh...

[3] http://colorbrewer2.com/

[4] http://vis4.net/blog/posts/avoid-equidistant-hsv-colors/


The superiority of your hexbin follows from setting the density too high for the scatter plot. With points this dense, opacity of 5% or lower is necessary to see the uneven distribution along the x axis. With appropriate opacity, the two plots are pretty similar visually. What's more, the hexbin by definition has lower resolution, since you lump data into discrete bins.

This is an example of bad plotting practice, not a bad plotting method. That said, eyeballing a plot is a weak way to analyze data this dense. That's what statistics are for.


Ok, redid it with 5% opacity. You are correct, at that level, the density distribution in x is qualitatively visible.

http://i.imgur.com/geKbT.png

I disagree that the hexbin has lower resolution. The color dimension allows the human eye to easily differentiate regions having similar densities (e.g., 70 vs 50). The difference between deep red and orange is a lot bigger than the difference between dark blue and slightly less dark blue.

The hexbin has lower spatial resolution, it's true, but I'd argue that the spatial resolution you get in a scatterplot is illusory. It doesn't reflect the underlying probability distribution, only the particular sample.


> The difference between deep red and orange is a lot bigger than the difference between dark blue and slightly less dark blue.

That may be true, but is it better? Does the scatterplot underemphasize different densities, or does the density plot overemphasize them? I think the scatterplot is more intuitive. There are 21 equal steps between 0 and 100% black. Two points are twice as dark as one, four points are twice as dark as two. Darker means more, lighter means less.

Compare that to shifting from blue to red. Does the shift from orange to red indicate the same density difference as the shift from blue to orange? To decide you need to consult the color scale. The scatterplot is intuitive, and requires no scale.

> The hexbin has lower spatial resolution, it's true, but I'd argue that the spatial resolution you get in a scatterplot is illusory. It doesn't reflect the underlying probability distribution, only the particular sample.

The spatial resolution of a scatterplot represents empirical reality. Each point corresponds to a single observation, with no probability distribution implied or imposed. The density plot, in contrast, imposes a probability distribution, which may or may not reflect the true distribution of the population. The larger the bins, the more likely the displayed pattern is 'illusory'.


The spatial resolution of a scatterplot represents empirical reality. Each point corresponds to a single observation, with no probability distribution implied or imposed.

tl;dr: I'm a Bayesian, you are a Frequentist (at least with regards to plotting).


I don't know what this has to do with Bayesian vs. frequentist. I am not arguing that the data do not have a probability distribution. I am arguing that it is better to show all the data when possible, rather than an eye-catching but lossy summary.


Just a note, but you're using awfully small hexes and large points there. The two plots approach each other asymptotically, especially if you switch to a monochromatic colormap like some here have argued for.

Histograms tend to work best when the data is well understood, while scatterplots are better for samples from an unknown distribution (incl. lattices, multimodals or even double exponentials).

Try also small samples. When the piecewise uniform prior (on both the data and the intensity) is approximately accurate, histograms are far better, as they guide the eye away from nonexistent patterns. But the bandwidth needs to be judiciously set, and often the data transformed.

Clustering is hard to automate.


I have 20-70 data points per hex, at least in the high density regions. While I could have used bigger hexes, I think the difference between a hexbin and a scatterplot is well illustrated.


Fair enough. But your point markers are almost half the size of the hex. That makes color the principal difference.


(It also helps to set edgecolor='none' to remove the black line around each point.)


If you have enough point density for this to be a problem there's surely something better you can do though, whether it be binning the data or something a little more sophisticated like plotting a kernel density estimate.


Choosing the right tool for data visualization is important and scatterplots aren't good for everything, but they have their place and can be really useful. For instance this scatterplot from carsabi.com shows a lot of very useful information: http://cl.ly/GvnM

Scatterplots can be a lot more useful when you do things like display the points in different colors or highlight a specific point of interest as in the Carsabi scatterplots.

Do you really think scatterplots should never be used? I think a headline like that takes away from your credibility. The problem isn't with scatterplots, it's choosing the right tool for the job.


Headline: Don't use Scatterplots

Postscript: Some people seem to be interpreting me as making a stronger claim than I intend. There are obviously a few cases when a scatterplot truly is the right tool.

Inflammatory, overstated headlines are a rhetorical device that we're apparently stuck with for a long time to come. They're like television advertisements used to be: only curmudgeons[1] complain about them anymore because people assume the content wouldn't exist without them.

[1] Hello.


> For instance this scatterplot from carsabi.com shows a lot of very useful information: http://cl.ly/GvnM

This graph by itself is arbitrarily lossy in that we don't know how many overlapping samples there are at each point. And we don't know if blue samples are overlapping orange ones or vice-versa. I'm not sure I would be as categorical as the author about never using scatter plots, but he makes a really good point.


"Never" is a strong word, but I wanted a succinct title.

I think the default should be a density plot. It's only in special cases that a scatterplot would be appropriate. For example, that Carsabi plot actually works well, due to the fact that the reader is interested in finding a specific data point rather than understanding the global behavior.


I think the default should be the method that displays the most information. Why hide information if you don't have to? In the case of one-dimensional data, a dotplot shows the reader everything. Using a boxplot reduces the information content; mean plus error bars reduces it further. The mean plus error bars imposes a probability distribution, which may be wrong; it doesn't reveal a hidden truth.

The same holds in two dimensions. Show me all the data, and include a regression line or a spline to highlight a trend. Only start hiding information when the scatterplot becomes misleading. That is, when overplotting prevents me from accurately assessing the actual distribution of the points.

Jumping immediately to a density plot also restricts me to your interpretation. The original data is lost. With a scatterplot, the raw data can be recovered from the plot, so I can do my own analysis should I be interested. This is common in meta-analyses that extract data from multiple published papers. If those original papers had used density plots instead of scatterplots, reanalysis would require direct access to the underlying data. Once the original author dies, or loses the data, all further use of the data is lost.


The original data would be well represented by a 100x100 matrix, since the data (grades 0-100) is already discrete. Basically the first picture in the article, with an alpha setting that reaches 1 (1 = opaque) when multiplied by the maximum number of entries per cell, e.g. max entries = 5 => alpha = 1/5 = 0.2. Alternatively, aggregating into 10x10, 20x20, 25x25 or 50x50 grids would work too if the data is too sparse. There is no need for hex binning in this case!

Best practice: http://www.nytimes.com/interactive/2012/05/09/us/politics/sa...


Alpha doesn't work additively in pretty much any drawing package:

http://en.wikipedia.org/wiki/Alpha_compositing

When overplotting, the usual compositing operator gives a final alpha of

1 - (1-alpha)^N

So your alpha = 1/5 overdrawn 5 times would give a final opacity of ~0.673. By its very nature, there is no alpha < 1 which when composited together a finite number of times gives alpha = 1.


I was aware that this approach to alpha was oversimplified to begin with; I should have pointed that out. Thanks for posting the correct formula.


Oh boy, this is apparently the scattershot approach to data visualization.

Always explore your data first, and then use the visualization that best conveys your intended message. (Yes, there should be one; why else are you making a graphic?)


A lot of us make visualizations during data exploration, where we don't have a message.


What's really weird is how different 2008 and 2009 are for "same teacher, different subjects".

In 2008, it looks like a blob (no real trend). In 2009, some teachers obviously learn how to do well in VAM (excelling at both math and english "teaching"), and others don't. I guess that some teachers learnt what it takes to get good VAM scores.


As a (UK, math(s)) teacher I have to warn you that gaming the scores will inevitably happen when you have scores and those scores can affect careers.

http://www.ofsted.gov.uk/resources/mathematics-made-measure

Long, UK specific, but possibly of interest to those with school age children in educational systems using these metrics.

PS: most of the data I plot on scatter plots has low density and high variability so little stacking.


Other ways to handle this include transparency, smaller points, and jittered data.

I used to like the concept of density plots, but more and more I feel that they can be misleading. Plot your actual data if you can; use summarizing models as a fallback and a second step.


One could hybridise the two - a scatter plot whose points change colour as a function of proximity (the equivalent of a density plot with 1px wide bins and the background colour for zero).

This would preserve the outlier flagging of a scatter plot while alleviating its risk of obfuscation via density. Plus, one need not worry about picking an appropriate bin width any longer.


You still have to pick a proximity radius for coloring. That's really just the bin width (or bandwidth of a density plot) in disguise.


True, but this is primarily to address the decreased ability of a density map to indicate outliers. One could also have the proximity be a power function over the entire field (one could do this with density, too, but it makes more sense with a point).


Sometimes I've added random jitter to scatterplots to get something that represents density better when data is truncated.


Would the stuff you guys are talking about in this thread be statistical analysis techniques or something more along the lines of data visualization? I ask because I'm fascinated by it and would love to find where I can learn more. I'm just not sure where to even start.


When plotting dense, largely unconstrained data (i.e. not on a 0-100 scale), I've found a hybrid scatter/density plot to be nice. Basically, for each point you do a kernel density estimation, then color the points according to that. For sparse data the scatter plot seems just as useful, but in the dense parts you seamlessly switch to color as the useful metric. Something like this: http://www.mathworks.com/matlabcentral/fileexchange/8577-sca...
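A rough Python analogue using scipy's gaussian_kde (an approximation of what that File Exchange script does, not a port of it):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    np.random.seed(7)
    x = np.random.randn(5000)
    y = x + 0.5 * np.random.randn(5000)

    # Estimate the local density at each point, then use it as the color.
    xy = np.vstack([x, y])
    density = gaussian_kde(xy)(xy)

    # Plot the densest points last so they end up on top.
    order = density.argsort()
    plt.scatter(x[order], y[order], c=density[order], s=8, edgecolors='none')
    plt.colorbar(label='estimated density')
    plt.show()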


The point about the scatterplots is valid, but if he thinks a correlation of .26 (same teacher, subject and year, but different grade classroom) is sufficient to justify VAM, which arguably has no validity let alone reliability, then it's impossible to take him seriously.


Just add a jitter.



