

Why you should be wary of relying on a single histogram of a data set - aton
http://stats.stackexchange.com/questions/51718/assessing-approximate-distribution-of-data-based-on-histogram/51753#51753

======
jfim
As mentioned, one should really be using a kernel density plot instead of a
histogram, except when there are already classes in the data.

In R, one can simply do:

      library("ggplot2")
      library("datasets")
      ggplot(faithful, aes(x=eruptions)) + geom_density() + geom_rug()

which gives a chart like this (<http://jean-francois.im/temp/eruptions-
kde.png>). Contrast with:

      ggplot(faithful, aes(x=eruptions)) + geom_histogram(binwidth=1)

which gives a chart like this (<http://jean-francois.im/temp/eruptions-
histogram.png>).

Edit: Other plots mentioned in this discussion:

      ggplot(faithful, aes(x = eruptions)) + stat_ecdf(geom = "step")

Cumulative distribution, as suggested by leot (<http://jean-
francois.im/temp/eruptions-ecdf.png>)

      qqnorm(faithful$eruptions)

Q-Q plot, as suggested by christopheraden (<http://jean-
francois.im/temp/eruptions-qq.png>)

~~~
xfs
But then you would have to choose a certain kernel and assume the data
conforms to that distribution, which isn't always true.

~~~
jfim
Indeed, but that estimate is likely to be less misleading in most cases than a
histogram (which is just a uniform kernel that is always aligned with bin
boundaries).

~~~
xfs
One particular parameter of the kernel, the bandwidth, can produce a highly
misleading visualization if its value is chosen arbitrarily. Here is an example:
[http://en.wikipedia.org/wiki/File:Comparison_of_1D_bandwidth...](http://en.wikipedia.org/wiki/File:Comparison_of_1D_bandwidth_selectors.png)

The smoothing gives unsavvy readers a false sense of accuracy. With a
histogram they can at least tell it's an approximation.
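To make the failure mode concrete, here is a small sketch (the bandwidth values are arbitrary illustrations, not the ones from the linked figure) of the same eruption data under three bandwidths:

```r
library("datasets")

x <- faithful$eruptions
d_narrow  <- density(x, bw = 0.05)  # undersmoothed: spurious bumps appear
d_default <- density(x)             # default "nrd0" rule-of-thumb bandwidth
d_wide    <- density(x, bw = 1)     # oversmoothed: the bimodality is washed out

plot(d_wide, col = "red", ylim = c(0, max(d_narrow$y)),
     main = "Same data, three bandwidths")
lines(d_default, col = "black")
lines(d_narrow, col = "blue")
```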

~~~
noelwelsh
Yup. Luckily there are good methods for choosing the bandwidth:

[http://www.umiacs.umd.edu/labs/cvl/pirl/vikas/Software/optim...](http://www.umiacs.umd.edu/labs/cvl/pirl/vikas/Software/optimal_bw/optimal_bw_code.htm)
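For what it's worth, base R's stats package already ships several data-driven selectors; a sketch using the faithful data from upthread (Sheather-Jones plug-in and least-squares cross-validation shown):

```r
library("datasets")

x <- faithful$eruptions
bw_sj  <- bw.SJ(x)   # Sheather-Jones plug-in selector
bw_ucv <- bw.ucv(x)  # unbiased (least-squares) cross-validation

# density() also accepts a selector name directly:
d <- density(x, bw = "SJ")
plot(d, main = sprintf("KDE, Sheather-Jones bw = %.3f", bw_sj))
```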

------
leot
Yes, probability density estimation might be fun, but the simplest thing to do
when comparing distributions, if you're worried about binning issues, is to
plot their empirical cumulative distribution functions.
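In base R that's just ecdf(). As an illustrative sketch, splitting the faithful eruptions by waiting time (an arbitrary split, chosen only to get two samples to overlay):

```r
library("datasets")

short_wait <- faithful$eruptions[faithful$waiting < 70]
long_wait  <- faithful$eruptions[faithful$waiting >= 70]

# No binning anywhere: every data point contributes a step.
plot(ecdf(short_wait), col = "blue", xlim = range(faithful$eruptions),
     main = "Empirical CDFs of the two samples")
plot(ecdf(long_wait), col = "red", add = TRUE)
```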

~~~
crayola
Completely agree. If your audience gets them, that's the most robust, easiest
to interpret way to visualize a distribution (continuous or discrete). But it
may require a few words of explanation depending on the audience.

------
dude_abides
This is what you should be doing:

      # Annie, Brian, Chris and Zoe are the four samples from the linked post
      plot(density(Annie), col="red")
      lines(density(Brian), col="blue")
      lines(density(Chris), col="green")
      lines(density(Zoe), col="cyan")

This is the plot you get: <http://i.imgur.com/sY2awX7.png>

------
tantalor
Reminds me of <http://en.wikipedia.org/wiki/Simpsons_paradox>

~~~
shardling
Hmm, reminds me much more of Anscombe's quartet:
<http://en.wikipedia.org/wiki/Anscombe%27s_quartet>

------
christopheraden
Interesting paradox. I haven't seen that many statisticians using just a
histogram when determining whether a certain distribution fits data
reasonably. Kernel Density Estimators are a much better choice (for continuous
data, like the data in the post), but they are also affected by your choice of
bandwidth. When it comes down to it, like going to the doctor, sometimes the
best choice is to get a second (or third!) opinion. For what it's worth,
drawing a QQ Plot (something I've seen in every statistical consultation I've
ever done) reveals the dependent structure of the data immediately and
obviously in the form of a perfect linear relationship between any two
variables.
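In base R that two-variable check is qqplot(). The post's Annie/Brian/Chris/Zoe vectors aren't reproduced here, so as a self-contained stand-in, two samples with an exact linear dependence:

```r
set.seed(1)
a <- rnorm(100)
b <- 2 * a + 1  # b is an exact linear function of a

# For equal-length samples qqplot() plots sorted a against sorted b,
# so the points fall exactly on the line y = 2x + 1.
qq <- qqplot(a, b, main = "Two-sample Q-Q plot")
abline(1, 2, col = "red")  # intercept 1, slope 2
```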

~~~
jfim
Indeed, although Q-Q plots are very unlikely to be understood by people who
don't have a good grasp of statistics, whereas a misleading histogram will be
(and probably without knowledge of the caveats behind histograms).

~~~
christopheraden
A great point, but therein lies my biggest complaint with the simplification
of statistics that I see in the startup world--sometimes the technical details
are actually important. As an analogy, while mass-production has given us a
car that anyone can operate, we are largely helpless when one breaks down.
Complications abound when individuals try to leverage an overly-simplistic
view of a subject (raise your hand if you've heard "We are 95% sure the true
[...] lies in this confidence interval").

To the credit of the shadier individuals in my profession, this histogram
subtlety nicely highlights how it can be quite easy to bend the data to your
argument using ad-hoc procedures (KDEs, hists, QQs, boxplots). A carefully
chosen bin width, smoothing parameter, or covariate can present a different
view of the data than some other parameter/covariate. That's why it's nice to
have other statisticians capable of reproducing and disseminating the work.

------
radarsat1
Is this basically just an effect of quantization aliasing?

~~~
msellout
In other words, rounding error.

There's a great story about a histogram of heights in Napoleon's army having
two peaks, eliciting all sorts of theories. In reality, height in the army was
normally distributed, but the data had been collected in centimeters and
people were looking at a histogram binned by inches. The middle bin covered
only two whole-centimeter values while the bins on either side covered three
each, so those neighbors had dramatically more counts.
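The artifact is easy to reproduce with simulated data (the numbers below are illustrative, not Napoleonic records): round normal heights to whole centimeters, then histogram with inch-wide (2.54 cm) bins, and a spurious dip appears wherever a bin straddles only two integer values while its neighbors straddle three.

```r
set.seed(42)
cm <- round(rnorm(1e5, mean = 170, sd = 7))  # heights recorded to the nearest cm

# Inch-wide bins laid over integer-cm data: each bin captures either
# two or three whole-centimeter values, so the counts see-saw.
breaks <- seq(min(cm) - 0.5, max(cm) + 2.54, by = 2.54)
h <- hist(cm, breaks = breaks, main = "Inch-wide bins over cm-rounded data")
```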

~~~
waldrews
Would love to find the citation for that one...

