(actually it looks like this wasn't the author's decision, but something built into the graphing library)
And if you do this overplotting with partly transparent data points, you can even continue using scatterplots (which have several benefits over binned color density plots, such as showing the actual data, rather than summaries).
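For what it's worth, in matplotlib this is just the alpha keyword on scatter; a minimal sketch on synthetic data (not anyone's actual figure):

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic, heavily overplotted data purely for illustration
    rng = np.random.default_rng(0)
    x = rng.normal(size=50_000)
    y = x + rng.normal(scale=0.5, size=50_000)

    # Partly transparent points: overlapping regions render darker,
    # so density is visible while every observation stays on the plot
    plt.scatter(x, y, s=5, alpha=0.05, edgecolors="none")
    plt.show()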
... without seriously considering the benefits of a scatterplot.
Here is some data I plotted recently:
Yes, you need to be careful about overlap (transparency can help), but without a scatterplot, I would not see the sharp edges, or have my attention drawn to the outliers.
Density plots imply a model: by creating bins, square or hex, you are adding a layer of interpretation on top of the 2D data, which can be bad. Also, the bins in this article have sharp edges (high-frequency content) and add artificial structure. I think smooth density plots, not covered by this article, are superior.
Best of both worlds? --> http://www.survey-design.com.au/graphs/density_plot1.png
Suppose you have lots of data where there are overlaps that aggregate to a value of 2, then a few that overlap to a value of 5, and one or two that overlap to a value of 9.
Depending on how you distribute the colors, you may:
1. Only really see the colors at the low end and the high end of the spectrum, thus missing the 5 values (or they may be so near the color for 2 that you can't perceive them).
2. Clearly delineate between the 2s, the 5s, and the 9s, but without conveying that the 5s are much greater in magnitude than the 2s, while the 9s are not as big a multiple of the 5s.
3. Get some other distribution of colors that tells a different story about the magnitude of the density, but can be interpreted wildly differently.
Different color gradients can also be perceived by our visual systems differently, blue-to-red doesn't always cut it.
Though, as http://www.statisticalanalysisconsulting.com/scatterplots-de... points out, past 10,000 points, jitter isn't enough and alternative plots are probably best.
Note that I am thinking small (every once in a while; small amounts of data). Something like a subscription service for a small monthly fee, or the ability to make a one-time payment for one-off jobs. Things I would look for in such a service are (1) willingness to clean data a little bit (most times I get Excel spreadsheets, but I always need to do some tweaking to how the data is presented) and (2) willingness to sign an NDA -- this is mostly so the company I am with doesn't get annoyed that I am sending sensitive data to a random outside service.
My needs might be small, but to me it seems like this might be a valuable service for organizations that continually handle large amounts of data.
Just a thought...
The post author provides some nice python code to create the plots, though the choice of hex bins when the original data was rounded to the nearest integer is a bit strange. Square bins set to some integer size would be better I believe.
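As a rough illustration of the difference (made-up integer-rounded data, not the post's actual code):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    # Made-up data rounded to the nearest integer, mimicking the situation described
    x = np.round(rng.normal(50, 10, size=20_000))
    y = np.round(x + rng.normal(0, 8, size=20_000))

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Hex bins, as in the post
    ax1.hexbin(x, y, gridsize=30, cmap="viridis")
    ax1.set_title("hexbin")

    # Square bins of integer size, aligned so each bin is centred on an integer
    edges_x = np.arange(x.min() - 0.5, x.max() + 1.5, 1.0)
    edges_y = np.arange(y.min() - 0.5, y.max() + 1.5, 1.0)
    ax2.hist2d(x, y, bins=[edges_x, edges_y], cmap="viridis")
    ax2.set_title("integer-aligned square bins")

    plt.show()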
If you're using Matlab, I've recently found cloudPlot (http://www.mathworks.com/matlabcentral/fileexchange/23238-cl...) to work rather nicely for "dense scatterplots".
I agree - a Gaussian regularization would also work nicely. But matplotlib comes with a hexbin, so that's what I used.
edit: Wikipedia tells me hexagonal tiling is conjectured to be the tiling with the smallest perimeter per cell, and is the densest way to arrange circles on a plane. So that's a big plus when binning.
(Downside being there's still a somewhat-arbitrary choice to be made for the smoothing bandwidth, and it's not quite as visually obvious as the choice of number of bins in a histogram plot.)
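For reference, the smooth alternative could look roughly like this with scipy's gaussian_kde, with the bandwidth exposed via bw_method (a sketch, not anything from the post):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(2)
    x = rng.normal(size=5_000)
    y = x + rng.normal(scale=0.5, size=5_000)

    # Fit a Gaussian kernel density estimate; bw_method is the
    # somewhat-arbitrary smoothing bandwidth mentioned above
    kde = gaussian_kde(np.vstack([x, y]), bw_method=0.2)

    # Evaluate on a grid and show as a smooth density image
    xi, yi = np.mgrid[x.min():x.max():200j, y.min():y.max():200j]
    zi = kde(np.vstack([xi.ravel(), yi.ravel()])).reshape(xi.shape)
    plt.imshow(zi.T, origin="lower", aspect="auto",
               extent=[x.min(), x.max(), y.min(), y.max()])
    plt.show()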
In fact, the wikipedia article even says "A scatter plot is used when a variable exists that is under the control of the experimenter". https://en.wikipedia.org/wiki/Scatter_plot
If the author had read that part of the wikipedia article, I guess his claim would have been more specific :-)
It's simple to do and mimics reversing the effect of truncation of the data (at least for continuous quantities). Just use uniformly distributed values that are as wide as one bin width.
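In code it's a couple of lines; a sketch assuming the data was rounded to the nearest integer, so the bin width is 1:

    import numpy as np

    rng = np.random.default_rng(3)
    # Made-up data rounded to the nearest integer, so the bin width is 1
    x = np.round(rng.normal(50, 10, size=10_000))
    y = np.round(x + rng.normal(0, 8, size=10_000))

    # Add uniform dither as wide as one bin: values stay within their original bin,
    # but no longer stack exactly on top of each other
    x_jit = x + rng.uniform(-0.5, 0.5, size=x.size)
    y_jit = y + rng.uniform(-0.5, 0.5, size=y.size)
    # Then plot x_jit vs y_jit as usual, e.g. with a transparent scatter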
For most purposes, I prefer adding dither, and then using transparency, to moving to a density plot, for exactly the reason you mention -- the density plot introduces another parameter, the smoothing method, which puts another layer between you and the data.
Furthermore, if my data has two outliers that are near each other, they may well be indistinguishable in the hexplot from one (or five) clustered outliers.
Your post was very interesting and your examples are great. I'll definitely use hexplots in the future. But I will still default to scatterplots. It's just easier to see if there's something wrong with the data, and they require less interpretation.
opacity=20%, and 2x bigger bins for the hexbin: http://i.imgur.com/yDR3j.png
Which image do you think better displays the data?
Just to make it interesting, I set it up so that the x values are not uniformly distributed. That's very easy to see in the hexbin, but still hard in the scatterplot.
The hexbin will always have a higher visual resolution than the scatterplot because the hexbin uses multiple colors to differentiate different densities. The scatterplot uses only one color.
Colormap design is arguably as hard as visualization design. My favorite go-to place for them is http://colorbrewer2.org, but if you need to know only one thing about them, it's that varying hue continuously does not work nearly as well as you might think it does.
In addition, the fundamental reason scatterplots are bad, even with opacity, is essentially that opacity gives rise to an exponential relationship between overplotting and transparency.
There exists an alternative solution, which is to use additive blending and an adaptive linear colorscale, from zero to maximum overdraw. Unfortunately, at present there is no data visualization toolkit that supports this.
Can you elaborate on this? I don't see why it would not be linear.
>There exists an alternative solution, which is to use additive blending and an adaptive linear colorscale, from zero to maximum overdraw. Unfortunately, at present there is no data visualization toolkit that supports this.
I think this _might_ be done in Mathematica, since Graphics objects can be manipulated symbolically, but I might be wrong.
If you create a plot with opacity alpha and put N points on top of each other, the resulting opacity is
1 - (1 - alpha)^N
This is a saturating exponential, which has the unfortunate feature that all of the change happens over a relatively short range of N, while the curve is essentially flat for the rest of the regime. That short range is the only place where we get color differentiation (different opacities get different colors). That's bad: color differentiation should be uniform across the scale.
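A quick numerical check of that shape (alpha = 0.05 chosen purely for illustration):

    import numpy as np

    alpha = 0.05                      # per-point opacity, purely illustrative
    N = np.array([1, 2, 5, 10, 20, 50, 100])
    opacity = 1 - (1 - alpha) ** N    # resulting opacity after N overlapping points
    print(opacity)
    # roughly 0.05, 0.10, 0.23, 0.40, 0.64, 0.92, 0.99 --
    # nearly all of the differentiation happens below N ~ 50,
    # so densities of 100 and 1000 overlapping points look essentially identical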
I'm pretty certain Mathematica doesn't do this right either, because it's a pixel-based technique that requires frame buffer manipulation. Instead of rendering with the usual blending operation, you do everything with additive blending, compute the maximum overdraw, and then color-scale linearly.
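Outside a toolkit, one way to approximate the idea is to count overdraw per pixel yourself and map the counts linearly to a colormap. A rough numpy/matplotlib sketch (it treats each point as covering a single pixel, which is where it diverges from true additive blending of larger markers):

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import Normalize

    rng = np.random.default_rng(4)
    x = rng.normal(size=100_000)
    y = x + rng.normal(scale=0.5, size=100_000)

    # "Additive blending" at one point per pixel is just a per-pixel count:
    # histogram the data at roughly screen resolution
    counts, xedges, yedges = np.histogram2d(x, y, bins=500)

    # Linear colorscale from zero to maximum overdraw, as described above
    norm = Normalize(vmin=0, vmax=counts.max())
    plt.imshow(counts.T, origin="lower", norm=norm, cmap="viridis",
               extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]],
               aspect="auto")
    plt.colorbar(label="points per pixel")
    plt.show()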
But after reading that paper, I do agree that rainbow has some significant issues. One thing that might be worth trying: make a rainbow color map, but map values to colors in such a way that |x - y| = C · cielab_dist(color(x), color(y)).
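A rough sketch of that idea, using skimage for the CIELAB conversion. Note this version only enforces the condition locally, by spacing samples so that equal value steps correspond to equal perceptual arc length along the colormap's path:

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    from skimage.color import rgb2lab

    # Sample the rainbow ('jet') colormap densely
    n = 256
    rgb = plt.get_cmap("jet")(np.linspace(0, 1, n))[:, :3]
    lab = rgb2lab(rgb.reshape(1, n, 3)).reshape(n, 3)

    # Cumulative CIELAB distance along the colormap's path
    deltas = np.linalg.norm(np.diff(lab, axis=0), axis=1)
    arclen = np.concatenate([[0], np.cumsum(deltas)])
    arclen /= arclen[-1]

    # Re-sample so equal steps in value correspond to equal perceptual distance
    positions = np.interp(np.linspace(0, 1, n), arclen, np.linspace(0, 1, n))
    uniform_jet = ListedColormap(plt.get_cmap("jet")(positions))

Whether the result actually beats a purpose-built perceptually uniform map is a separate question.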
Based on the names, it seems that [1, 2] (along with all the other capitalized matplotlib colormaps) are ColorBrewer colormaps, which are all designed with these perceptual considerations in mind.
This is an example of bad plotting practice, not a bad plotting method. That said, eyeballing a plot is a weak way to analyze data this dense. That's what statistics are for.
I disagree that the hexbin has lower resolution. The color dimension allows the human eye to easily differentiate regions having similar densities (e.g., 70 vs 50). The difference between deep red and orange is a lot bigger than the difference between dark blue and slightly less dark blue.
The hexbin has lower spatial resolution, it's true, but I'd argue that the spatial resolution you get in a scatterplot is illusory. It doesn't reflect the underlying probability distribution, only the particular sample.
That may be true, but is it better? Does the scatterplot underemphasize different densities, or does the density plot overemphasize them? I think the scatterplot is more intuitive. There are 21 equal steps between 0 and 100% black. Two points are twice as dark as one, four points are twice as dark as two. Darker means more, lighter means less.
Compare that to shifting from blue to red. Does the shift from orange to red indicate the same density difference as the shift from blue to orange? To decide you need to consult the color scale. The scatterplot is intuitive, and requires no scale.
> The hexbin has lower spatial resolution, it's true, but I'd argue that the spatial resolution you get in a scatterplot is illusory. It doesn't reflect the underlying probability distribution, only the particular sample.
The spatial resolution of a scatterplot represents empirical reality. Each point corresponds to a single observation, with no probability distribution implied or imposed. The density plot, in contrast, imposes a probability distribution, which may or may not reflect the true distribution of the population. The larger the bins, the more likely the displayed pattern is 'illusory'.
tl;dr: I'm a Bayesian, you are a Frequentist (at least with regard to plotting).
Histograms tend to work best when the data is well understood, while scatterplots are better for samples from an unknown distribution (incl. lattices, multimodals or even double exponentials).
Try also small samples. When the piecewise uniform prior (on both the data and the intensity) is approximately accurate, histograms are far better, as they guide the eye away from nonexistent patterns. But the bandwidth needs to be judiciously set, and often the data transformed.
Clustering is hard to automate.
Scatterplots can be a lot more useful when you do things like display the points in different colors or highlight a specific point of interest as in the Carsabi scatterplots.
Do you really think scatterplots should never be used? I think a headline like that takes away from your credibility. The problem isn't with scatterplots, it's choosing the right tool for the job.
Postscript: Some people seem to be interpreting me as making a stronger claim than I intend. There are obviously a few cases when a scatterplot truly is the right tool.
Inflammatory, overstated headlines are a rhetorical device that we're apparently stuck with for a long time to come. They're like television advertisements used to be: only curmudgeons complain about them anymore because people assume the content wouldn't exist without them.
This graph by itself is arbitrarily lossy in that we don't know how many overlapping samples there are at each point. And we don't know if blue samples are overlapping orange ones or vice-versa. I'm not sure I would be as categorical as the author about never using scatter plots, but he makes a really good point.
I think the default should be a density plot. It's only in special cases that a scatterplot would be appropriate. For example, that Carsabi plot actually works well, due to the fact that the reader is interested in finding a specific data point rather than understanding the global behavior.
The same holds in two dimensions. Show me all the data, and include a regression line or a spline to highlight a trend. Only start hiding information when the scatterplot becomes misleading. That is, when overplotting prevents me from accurately assessing the actual distribution of the points.
Jumping immediately to a density plot also restricts me to your interpretation. The original data is lost. With a scatterplot, the raw data can be recovered from the plot, so I can do my own analysis should I be interested. This is common in meta-analyses that extract data from multiple published papers. If those original papers had used density plots instead of scatterplots, reanalysis would require direct access to the underlying data. Once the original author dies, or loses the data, all further use of the data is lost.
When overplotting, the usual compositing operator gives a final alpha of
1 - (1-alpha)^N
So your alpha = 1/5 overdrawn 5 times would give a final opacity of ~0.672. By its very nature, there is no alpha < 1 which, when composited together a finite number of times, gives alpha = 1.
Always explore your data first and then use the visualization that best conveys your intended message. (Yes, there should be one; why else are you making a graphic?)
In 2008, it looks like a blob (no real trend). In 2009, some teachers obviously learn how to do well in VAM (excelling at both math and English "teaching"), and others don't. I guess that some teachers learnt what it takes to get good VAM scores.
Long, UK specific, but possibly of interest to those with school age children in educational systems using these metrics.
PS: most of the data I plot on scatter plots has low density and high variability so little stacking.
I used to like the concept of density plots, but more and more I feel that they can be misleading. Plot your actual data if you can; use summarizing models as a fallback and a second step.
This would preserve the outlier flagging of a scatter plot while alleviating its risk of obfuscation via density. Plus, one need not worry about picking an appropriate bin width any longer.