
Don't use Scatterplots - yummyfajitas
http://www.chrisstucchio.com/blog/2012/dont_use_scatterplots.html
======
zacharyvoase
Great points, but in the author's 2D histograms the use of 'color' to show
point density is a little misleading. Color is not a scalar, it's a vector,
and simply varying hue doesn't map well onto the human perception of
lightness or intensity. It also doesn't allow for meaningful interpolation. If
you take a look at <http://vis4.net/blog/posts/avoid-equidistant-hsv-colors/>
you can see a strategy for building perception-aware color palettes/gradients,
which would be much more useful in this case.

(actually it looks like this wasn't the author's decision, but something built
into the graphing library)
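The difference is easy to check numerically. Here's a rough sketch, assuming matplotlib >= 3.5 (for the `matplotlib.colormaps` registry) and using an approximate Rec. 601 luma as a stand-in for perceived lightness; the helper name is mine:

```python
import numpy as np
import matplotlib

def luma(cmap_name, samples=16):
    """Approximate perceived lightness along a colormap (Rec. 601 luma)."""
    cmap = matplotlib.colormaps[cmap_name]
    rgb = np.array([cmap(t)[:3] for t in np.linspace(0, 1, samples)])
    return rgb @ np.array([0.299, 0.587, 0.114])

jet_luma = luma("jet")          # classic blue->cyan->yellow->red hue ramp
viridis_luma = luma("viridis")  # designed for perceptually uniform lightness

# jet's lightness rises into the yellow region and then falls again toward
# dark red, so "hotter" is not consistently "brighter"; viridis's lightness
# is (approximately) monotonically increasing from start to end
```

A hue ramp like jet can make a mid-range density look brighter than a high density, which is exactly the perceptual problem the linked post describes.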

~~~
pacmon
Interesting article you linked to and something to consider certainly. That
being said, I somewhat disagree with you about the histograms being
misleading. His color scheme looks to be based much the same as thermal
imaging. Which I find is quite clear about what are the 'hot' zones (or in
this case high point density) and the 'cold' zones (or low point density).

~~~
harpastum
I think zacharyvoase's point was that it might either emphasize or deemphasize
how _much_ hotter or colder something was. Because the eye doesn't see
blue->yellow->red hue changes consistently, what looks like a "big" change on
the graph might actually be fairly small.

------
ahuibers
Don't use Density Plots ..

... without seriously considering the benefits of a scatterplot.

Here is some data I plotted recently: <http://i.imgur.com/lvraM.png>

Yes, you need to be careful about overlap (transparency can help), but without
a scatterplot, I would not see the sharp edges, or have my attention drawn to
the outliers.

Density plots imply a model: By creating bins, square or hex, you are adding a
layer of interpretation on top of the 2D data, which can be bad. Also the bins
of this article have sharp edges (high frequency content) and add artificial
structure. I think smooth density plots, not covered by this article, are
superior.

Best of both worlds? --> <http://www.survey-design.com.au/graphs/density_plot1.png>

------
bane
Having spent quite a bit of time looking at geospatial data in density plots,
I'd warn that visualizing overlapping point data as a density plot is not a
panacea for this problem. As it turns out, the distribution of colors along
the density vector is also critically important.

E.g. suppose you have lots of data where there are overlaps that aggregate to
a value of 2, then a few that overlap to 5, and one or two that overlap to a
value of 9.

Depending on how you distribute the colors you may either:

1. Only really see the colors at the low end and the high end of the
spectrum, thus missing the 5 values (or they may be so near the color for 2
that you can't perceive them).

2. Clearly delineate between the 2s, the 5s and the 9s, but without conveying
relative magnitude: the 5s are a 2.5x jump from the 2s, while the 9s are a
smaller multiplier over the 5s.

3. Some other distribution that shows a different story on the magnitude of
the density, but can be interpreted wildly differently.

Different color gradients can also be perceived by our visual systems
differently, blue-to-red doesn't always cut it.

~~~
carver
Putting the data in log-scale before coloring should improve your ability to
visualize magnitude differences. (issue 2 that you raised)
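Sketching this with bane's 2/5/9 overlap counts (the helper name is mine): on a linear color scale the 5s sit less than halfway along the gradient between 2 and 9, but after a log transform the three levels are spread more evenly.

```python
import numpy as np

counts = np.array([2.0, 5.0, 9.0])  # bane's overlap counts

def color_positions(values):
    """Normalize values to [0, 1] positions along a color gradient."""
    v = values - values.min()
    return v / v.max()

linear = color_positions(counts)          # the 5s land at ~0.43
logged = color_positions(np.log(counts))  # log first: the 5s land at ~0.61
```

matplotlib's `hexbin` accepts `bins='log'` to do this for you when coloring the hexagons.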

~~~
bane
Another great one is Jenks. It doesn't do much to help with comparing
magnitude between colored classification levels, but it can help visualize the
fact that the categories exist at all.

[http://en.wikipedia.org/wiki/Jenks_Natural_Breaks_Optimizati...](http://en.wikipedia.org/wiki/Jenks_Natural_Breaks_Optimization)
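The 1-D version is small enough to sketch as a dynamic program: minimize the total within-class sum of squared deviations. This is my own minimal implementation, not any particular library's:

```python
def jenks_classes(values, k):
    """Partition 1-D values into k classes minimizing the total
    within-class sum of squared deviations (Jenks natural breaks)."""
    v = sorted(values)
    n = len(v)
    # prefix sums give each class's squared-deviation cost in O(1)
    ps = [0.0] * (n + 1)
    pss = [0.0] * (n + 1)
    for i, x in enumerate(v):
        ps[i + 1] = ps[i] + x
        pss[i + 1] = pss[i] + x * x

    def cost(i, j):  # sum of squared deviations of v[i:j] from its mean
        s, m = ps[j] - ps[i], j - i
        return pss[j] - pss[i] - s * s / m

    INF = float("inf")
    best = [[INF] * (k + 1) for _ in range(n + 1)]
    cut = [[0] * (k + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for i in range(c, n + 1):
            for t in range(c - 1, i):
                b = best[t][c - 1] + cost(t, i)
                if b < best[i][c]:
                    best[i][c], cut[i][c] = b, t
    # walk the cut positions back to recover the classes
    bounds, i = [n], n
    for c in range(k, 0, -1):
        i = cut[i][c]
        bounds.append(i)
    bounds.reverse()
    return [v[a:b] for a, b in zip(bounds, bounds[1:])]
```

For the 2/5/9 example above, `jenks_classes([2, 2, 9, 5, 1, 5, 5, 9], 3)` groups the values into [[1, 2, 2], [5, 5, 5], [9, 9]], making the categories explicit even though the color steps say nothing about magnitude.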

------
mwexler
One of his complaints is about truncation and therefore overlapping points.
It's very common when you have multiple points occupying the same space to add
a 3rd dimension of color ala the hexplot or size of point... or just add
"jitter", which is easy in R or SAS.

Though, as [http://www.statisticalanalysisconsulting.com/scatterplots-
de...](http://www.statisticalanalysisconsulting.com/scatterplots-dealing-with-
overplotting/) points out, past 10,000 points, jitter isn't enough and
alternative plots are probably best.

------
grok2
(Going off-tangent here...sorry, hacker news makes me think like this). Chris,
what do you think about setting up a stats-plotting-as-a-service kind of
thing? People would give you data in some acceptable format and you chart it
up and send it back? I am thinking of this more as a consumer service. At
work, every once in a while, I get data back from testing and then I am
scrambling to use gnuplot to chart data because using charts allows me to
quickly visualize patterns and communicate the result of the data to others.
But, I am not an expert at this and am continually refreshing my rusty
knowledge and have always thought it would be great to send data out to
someone who is comfortable doing this and get it back charted from them.

Note that I am thinking small (every once in a while; small amounts of data).
Something like a subscription service for a small monthly fee. Or the ability
to do one time payment for one-off jobs. Things I would look for in such a
service are (1) willingness to clean data a little bit (most times I get excel
spreadsheets, but I always need to do some tweaking to how the data is
presented) (2) support to do a NDA -- this is mostly so the company I am with
doesn't get annoyed that I am sending sensitive data to a random outside
service.

My needs might be small, but to me it seems like this might be a valuable
service for organizations that continually handle large amounts of data.

Just a thought...

~~~
revorad
We've got a service like this. We're currently working on a new version. See
<http://prettygraph.com>

------
rubidium
The guy who made the original scatter plot is even a math teacher... yikes.

The post author provides some nice python code to create the plots, though the
choice of hex bins when the original data was rounded to the nearest integer
is a bit strange. Square bins set to some integer size would be better I
believe.

If you're using Matlab, I've recently found cloudPlot
([http://www.mathworks.com/matlabcentral/fileexchange/23238-cl...](http://www.mathworks.com/matlabcentral/fileexchange/23238-cloudplot))
to work rather nicely for "dense scatterplots".

~~~
mjw
See also
[http://www.mathworks.com/matlabcentral/fileexchange/19280-bi...](http://www.mathworks.com/matlabcentral/fileexchange/19280-bivariant-
kernel-density-estimation-v2-0/content/html/gkde2test.html) amongst other
density estimators, for a nice smooth plot of the estimated joint density.

(Downside being there's still a somewhat-arbitrary choice to be made for the
smoothing bandwidth, and it's not quite as visually obvious as the choice of
number of bins in a histogram plot.)

~~~
wnoise
Somewhat arbitrary, but not completely. There's some heavy math that suggests
sqrt(n) scaling in the 1-d case.

------
rjdagost
I've never seen a data visualization tool that is universally applicable, so a
simple edict like "don't use scatterplots" is a bit too simple. This hexagonal
plot looks cool for the problem under question but there are obvious cases
where it would be unnecessarily complicated and less informative than a
scatter plot. There's a reason why all of the various plot types were
invented.

~~~
yummyfajitas
Under what circumstance do you believe a scatterplot is superior to a density
plot?

~~~
twstws
If your data can be displayed without points overlapping, a scatterplot can
display all the information, while a density plot will always display only a
summary of the data. The larger your grid size, the greater the loss of
information.

~~~
mturmon
And, by adding noise ("jitter" or "dither") to each point, you can still use a
plain scatterplot even for many kinds of overlapping data.

It's simple to do and mimics reversing the effect of truncation of the data
(at least for continuous quantities). Just use uniformly distributed values
that are as wide as one bin width.

For most purposes, I prefer adding dither, and then using transparency, to
moving to a density plot, for exactly the reason you mention -- the density
plot introduces another parameter, the smoothing method, which puts another
layer between you and the data.
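The recipe above, sketched with numpy (the variable names and sample values are mine): integer-truncated values get uniform noise one bin wide, centered on zero, so each point stays within half a bin of its recorded value.

```python
import numpy as np

rng = np.random.default_rng(0)
grades = np.array([71, 71, 71, 72, 72, 85])  # truncated to integers, so points stack
bin_width = 1.0

# uniform dither as wide as one bin, centered on the recorded value
jittered = grades + rng.uniform(-bin_width / 2, bin_width / 2, size=grades.shape)
```

Combined with alpha < 1 in the scatter call, formerly identical points spread into visibly darker patches where the data stacks.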

------
carlob
Or use scatterplots with points that have opacity < 100%. C'mon, there are
still a bunch of cases where a scatterplot is much, much clearer than that
kind of 2D histogram, for example when there is a nonlinear relation with very
high correlation.

~~~
yummyfajitas
Here is a hexbin of a highly correlated nonlinear relationship (y=x^2 +
gaussian(0,0.1)), together with a scatterplot with opacity:

opacity=35%: <http://i.imgur.com/VB1E0.png>

opacity=20%, and 2x bigger bins for the hexbin: <http://i.imgur.com/yDR3j.png>

Which image do you think better displays the data?

Just to make it interesting, I set it up so that the x values are not
uniformly distributed. That's very easy to see in the hexbin, but still hard
in the scatterplot.

The hexbin will always have a higher visual resolution than the scatterplot
because the hexbin uses multiple colors to differentiate different densities.
The scatterplot uses only one color.
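The comparison is easy to reproduce, assuming matplotlib and the y = x^2 + gaussian noise setup described above (the sample size and style parameters are my guesses, not the ones used for the linked images):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20000)  # deliberately non-uniform x values
y = x ** 2 + rng.normal(scale=0.1, size=x.size)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
scatter = ax1.scatter(x, y, s=3, alpha=0.2, color="k")  # one color only
hexes = ax2.hexbin(x, y, gridsize=40)  # many colors encode density
fig.savefig("compare.png")
```

In the scatter panel every saturated region looks alike; the hexbin's color scale is what exposes the non-uniform x distribution.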

~~~
cscheid
Just FYI, the colormap you picked is terrible. There are many experiments to
back this claim:

[http://gvi.seas.harvard.edu/paper/evaluation-artery-
visualiz...](http://gvi.seas.harvard.edu/paper/evaluation-artery-
visualizations-heart-disease-diagnosis)

[http://www.jwave.vt.edu/~rkriz/Projects/create_color_table/c...](http://www.jwave.vt.edu/~rkriz/Projects/create_color_table/color_07.pdf)

Colormap design is arguably as hard as visualization design. My favorite go-to
place for them is <http://colorbrewer2.org>, but if you need to know only one
thing about them, it's that varying hue continuously does not work nearly as
well as you might think it does.

In addition, the fundamental reason scatterplots are bad, even with opacity,
is essentially that opacity gives rise to an exponential relationship between
overplotting and transparency.

There is an alternative solution, which is to use additive blending and an
adaptive linear colorscale, from zero to maximum overdraw. Unfortunately, at
present no data visualization toolkits support this.

~~~
carlob
>In addition, the fundamental reason scatterplots are bad, even with opacity,
is essentially that opacity gives rise to an exponential relationship between
overplotting and transparency.

Can you elaborate on this? I don't see why it would not be linear.

>There exists an alternative solution, which is to use additive blending and
an adaptive linear colorscale, from zero to maximum-overdraw. Unfortunately,
at present there exists no data visualization toolkits which support this.

I think this _might_ be done in Mathematica, since Graphics objects can be
manipulated symbolically, but I might be wrong.

~~~
cscheid
It boils down to the (very reasonable) way alpha blending works. Alpha was
originally designed to always lie between zero and one, which for compositing
makes sense. For scatterplot colormapping, not so much:

If you create a plot with opacity alpha, and which puts N points on top of
each other, the remaining 'transparency', that is, the resulting opacity is

1 - (1 - alpha)^N

This is an exponential, which has the unfortunate feature that it's flat for
most of the regime and then spikes over a relatively short scale. The spike is
where we get color differentiation (different opacities get different colors).
That's bad: color differentiation should be uniform across the scale.

I'm pretty certain Mathematica doesn't do this right either, because it's a
pixel-based technique that requires frame buffer manipulation. Instead of
rendering with the usual blending operation, you do everything with additive
blending, compute the maximum overdraw, and then color-scale linearly.
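The two regimes are easy to compare numerically (the function names are mine):

```python
def over_compositing_alpha(alpha, n):
    """Resulting opacity after n identical translucent points are
    drawn with the usual 'over' operator: 1 - (1 - alpha)^n."""
    return 1.0 - (1.0 - alpha) ** n

def additive_scale(counts):
    """The alternative: accumulate raw overdraw counts additively,
    then map them linearly onto [0, 1] up to the maximum overdraw."""
    m = max(counts)
    return [c / m for c in counts]

# 'over' compositing saturates: at alpha=0.5, the opacities for 5 and 9
# overlaps (~0.97 and ~0.998) are visually indistinguishable
saturating = [over_compositing_alpha(0.5, n) for n in (2, 5, 9)]

# additive blending + linear colorscale keeps the magnitudes proportional
proportional = additive_scale([2, 5, 9])
```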

------
zefhous
Choosing the right tool for data visualization is important and scatterplots
aren't good for everything, but they have their place and can be really
useful. For instance this scatterplot from carsabi.com shows a lot of very
useful information: <http://cl.ly/GvnM>

Scatterplots can be a lot more useful when you do things like display the
points in different colors or highlight a specific point of interest as in the
Carsabi scatterplots.

Do you really think scatterplots should never be used? I think a headline like
that takes away from your credibility. The problem isn't with scatterplots,
it's choosing the right tool for the job.

~~~
yummyfajitas
"Never" is a strong word, but I wanted a succinct title.

I think the default should be a density plot. It's only in special cases that
a scatterplot would be appropriate. For example, that Carsabi plot actually
works well because the reader is interested in finding a specific data point
rather than understanding the global behavior.

~~~
twstws
I think the default should be the method that displays the most information.
Why hide information if you don't have to? In the case of one-dimensional
data, a dotplot shows the reader everything. Using a boxplot reduces the
information content, and mean-plus-errorbars reduces it further. The mean plus
errorbars imposes a probability distribution, which may be wrong; it doesn't
reveal a hidden truth.

The same holds in two dimensions. Show me all the data, and include a
regression line or a spline to highlight a trend. Only start hiding
information when the scatterplot becomes misleading. That is, when
overplotting prevents me from accurately assessing the actual distribution of
the points.

Jumping immediately to a density plot also restricts me to your
interpretation. The original data is lost. With a scatterplot, the raw data
can be recovered from the plot, so I can do my own analysis should I be
interested. This is common in meta-analyses that extract data from multiple
published papers. If those original papers had used density plots instead of
scatterplots, reanalysis would require direct access to the underlying data.
Once the original author dies, or loses the data, all further use of the data
is lost.

~~~
mxfh
The original data would be well represented in a 100x100 matrix, since the
data (grades 0-100) is already discrete. Basically the first picture in the
article, with an alpha setting chosen so that it reaches 1 (1 = opaque) when
multiplied by the maximum number of entries per cell, e.g. max entries = 5 =>
alpha = 1/5 = 0.2. Alternatively, aggregating to 10x10, 20x20, 25x25 or 50x50
bins would work too if the data is too sparse. There is no need for hex
binning in this case!

Best practice:
[http://www.nytimes.com/interactive/2012/05/09/us/politics/sa...](http://www.nytimes.com/interactive/2012/05/09/us/politics/same-
sex-marriage.html)

~~~
cscheid
Alpha doesn't work additively in pretty much any drawing package:

<http://en.wikipedia.org/wiki/Alpha_compositing>

When overplotting, the usual compositing operator gives a final alpha of

1 - (1-alpha)^N

So your alpha = 1/5 overdrawn 5 times would give a final opacity of ~0.673. By
its very nature, there is no alpha < 1 which when composited together a finite
number of times gives alpha = 1.

~~~
mxfh
I was aware that this approach to alpha was oversimplistic to begin with; I
should have pointed that out. Thanks for posting the correct formula.

------
mxfh
Oh boy, this is apparently the scattershot approach to data visualization.

Always explore your data first and then use the visualization that best
conveys your intended message. (Yes, there should be one; why else are you
making a graphic?)

~~~
hogu
A lot of us make visualizations during data exploration, where we don't have
a message.

------
wisty
What's really weird is how different 2008 and 2009 are for "same teacher,
different subjects".

In 2008, it looks like a blob (no real trend). In 2009, some teachers
obviously learn how to do well in VAM (excelling at both math and english
"teaching"), and others don't. I guess that some teachers learnt what it takes
to get good VAM scores.

~~~
keithpeter
As a (UK, math(s)) teacher I have to warn you that gaming the scores will
inevitably happen whenever you have scores and those scores may impact careers.

<http://www.ofsted.gov.uk/resources/mathematics-made-measure>

Long, UK specific, but possibly of interest to those with school age children
in educational systems using these metrics.

PS: most of the data I plot on scatter plots has low density and high
variability so little stacking.

------
tel
Other ways to handle this include transparency, smaller points, and jittered
data.

I used to like the concept of density plots, but more and more I feel that
they can be misleading. Plot your actual data if you can; use summarizing
models as a fallback and a second step.

------
JumpCrisscross
One could hybridise the two - a scatter plot whose points change colour as a
function of proximity (the equivalent of a density plot with 1px wide bins and
the background colour for zero).

This would preserve the outlier flagging of a scatter plot while alleviating
its risk of obfuscation via density. Plus, one need not worry about picking an
appropriate bin width any longer.

~~~
rcthompson
You still have to pick a proximity radius for coloring. That's really just the
bin width (or bandwidth of a density plot) in disguise.

~~~
JumpCrisscross
True, but this is primarily to address the decreased ability of a density map
to indicate outliers. One could also have the proximity be a power function
over the entire field (one could do this with density, too, but it makes more
sense with a point).

------
PaulHoule
Sometimes I've added random jitter to scatterplots to get something that
represents density better when data is truncated.

------
lbotos
Would the stuff you guys are talking about in this thread be statistical
analysis techniques or something more along the lines of data visualization? I
ask because I'm fascinated by it and would love to find where I can learn
more. I'm just not sure where to even start.

------
rflrob
When plotting dense, largely unconstrained data (i.e. not on a 0-100 scale),
I've found a hybrid scatter/density plot to be nice. Basically, for each point
you do a kernel density estimation, then color the points according to that.
For the sparse data, the scatter plot seems just as useful, but at the dense
parts, you seamlessly switch to color as the useful metric. Something like
this:
[http://www.mathworks.com/matlabcentral/fileexchange/8577-sca...](http://www.mathworks.com/matlabcentral/fileexchange/8577-scatplot)
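A numpy-only sketch of that hybrid (the bandwidth and variable names are my choices; the File Exchange scatplot linked above does something similar):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 2))  # dense in the middle, sparse at the edges
h = 0.3                          # kernel bandwidth, a somewhat-arbitrary choice

# Gaussian kernel density estimate evaluated at every data point
sq_dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
density = np.exp(-sq_dists / (2 * h * h)).mean(axis=1)

# plt.scatter(pts[:, 0], pts[:, 1], c=density) would then color each point
# by its local density: sparse points read as a plain scatter, dense
# regions fade into a color-coded density map
```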

------
littlewheel
The point about the scatterplots is valid, but if he thinks a correlation of
.26 (same teacher, subject and year, but different grade classroom) is
sufficient to justify VAM, which arguably has no validity let alone
reliability, then it's impossible to take him seriously.

------
stewbrew
Just add a jitter.

