
Why scientists need to be better at data visualization - Anon84
https://www.knowablemagazine.org/article/mind/2019/science-data-visualization
======
stared
When I was a Ph.D. student I was surprised that in academia (my field was
quantum information theory) the majority of researchers put zero effort, and
zero thought, into presenting data in a clean and clear way. (As a reaction I
devoted a section of my thesis to explaining what data vis is.) One may say
"but it is for other specialists". Nope, that is not the problem. Of course,
every data vis should be tailored for a specific audience. But there is
typically zero thought put there either. See section 3.1.2 from
[https://arxiv.org/abs/1412.6796](https://arxiv.org/abs/1412.6796).

From a more recent one ("Simple diagrams of convoluted neural networks",
[https://medium.com/inbrowserai/simple-diagrams-of-convoluted...](https://medium.com/inbrowserai/simple-diagrams-of-convoluted-neural-networks-39c097d2925b)):

"[In academic] research, visualization is a mere afterthought (with a few
notable exceptions, including the Distill journal,
[https://distill.pub/](https://distill.pub/)). One may argue that developing
new algorithms and tuning hyperparameters are Real Science/Engineering™, while
the visual presentation is the domain of art and has no value. I couldn’t
disagree more! Sure, for computers running a program it does not matter if
your code is without indentations and has obscurely named variables. But for
people — it does. Academic papers are not a means of discovery — they are a
means of communication."

~~~
amitport
In my experience, it's just about cost-effectiveness. Proper visualization is
hard, researchers have a lot of ground to cover, and theory and proper
experiments are just more important (not only for journal reviewers). Sure, if
you have a big research group with man hours to spare, someone should focus on
that, but it's rare to have that capacity.

~~~
stared
Sure, creating a really good custom data visualization takes time and effort.
Using the default color themes of a non-ugly library takes only minimal care.

------
jofer
A key aspect that this misses is the target audience. The general public is
not the target of most scientific visualizations. Instead, it's other
scientists in the field. There are invariably lots of conventions and general
historical trends that it's important to follow to communicate clearly.

For example, in seismology high sonic velocities are always shown in blue/cool
colors and low sonic velocities are always shown in red/warm colors. This is
non-ideal for colorblind users and confusing for other audiences (red != high
value), but it's a near-universal convention. We can use a more perceptually
uniform red to blue colormap, but keeping the warm/cool convention is very
important.

If your audience is used to seeing a certain type of data in a particular way,
you'll confuse people if you don't follow that convention. Clear labels and
legends are great, but people follow convention first, labels second.

When the goal is clear communication, following established convention is much
more important than making a "better" visualization. That's not to say that
these guidelines aren't very important, just that it's vital to keep your
audience's expectations in mind.

> "Scientists also tend to follow convention when it comes to how they display
> data, which perpetuates bad practices."

In my experience, scientists follow convention because it enables clear,
consistent communication with other scientists in the field. Violating
convention is okay, but should be done with utmost care and the knowledge that
you'll need to spend more time explaining things.

~~~
mnky9800n
The article even talks about cultural expectations making visualizations
harder to read when those expectations are violated, yet it completely
ignores your point. Science has a culture, and violating that culture to make
a "better" visualization must be done with care.

------
rflrob
While I absolutely agree that scientists need more formal training in data
visualization, the claim that "few scientists take the same amount of care
with visuals as they do with generating data or writing about it" doesn't ring
true to me about the scientists in my field (genetics/genomics). There is a
widespread recognition that effective figures are the strongest way to
communicate your message; the question is what is the best way to create
those effective figures. Part of the problem is that as datasets get bigger,
it's rarely sustainable to put a lot of care into each and every version of a
plot, but automating creation of figures is really hard.

If I were going to start designing a course in creating scientific figures, I
think I'd have a roughly even split between the psychophysics of visual
perception (e.g. distinguishing between similar quantities of
lengths/angles/colors/etc.; designing for color-blind readers) and hands-on
work in a real programming environment turning data into figures.

~~~
jfim
> Part of the problem is that as datasets get bigger, it's rarely sustainable
> to put a lot of care into each and every version of a plot, but automating
> creation of figures is really hard.

That's actually a good reason to learn R and the ggplot2 package. Whenever I
write a paper, I make a quick shell script that invokes Rscript with a simple
R program that takes a CSV file and outputs a PDF file of the plot, which can
be automatically loaded in LaTeX.

Whenever the data changes, it's just a matter of updating the CSV file and
running the script that rebuilds the figures and the LaTeX document. As an
added bonus, it makes keeping the data with the paper easy, since they're part
of the same source control repository.
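
A minimal sketch of that rebuild step, written in Python rather than shell. It assumes each R script takes the CSV path and the output PDF path as its two command-line arguments; that convention and the helper names `needs_rebuild` and `rebuild_figures` are illustrative and not taken from the comment above.

```python
import os
import subprocess


def needs_rebuild(csv_path, pdf_path):
    """Return True if the figure is missing or older than its data file."""
    if not os.path.exists(pdf_path):
        return True
    return os.path.getmtime(csv_path) > os.path.getmtime(pdf_path)


def rebuild_figures(figures, run=subprocess.run):
    """Regenerate stale figures.

    `figures` is a list of (r_script, csv_path, pdf_path) tuples.
    Returns the list of PDF paths that were rebuilt.
    """
    rebuilt = []
    for r_script, csv_path, pdf_path in figures:
        if needs_rebuild(csv_path, pdf_path):
            # Assumed convention: the R script reads its first argument
            # (the CSV) and writes the plot to its second argument (the PDF).
            run(["Rscript", r_script, csv_path, pdf_path], check=True)
            rebuilt.append(pdf_path)
    return rebuilt
```

Run before the LaTeX build, this regenerates only the figures whose CSV has changed and leaves the rest alone.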

~~~
airstrike
Is there a reason you don't use RMarkdown in RStudio? It's built precisely for
this use case

~~~
jfim
Mostly because I wrote those scripts many years ago and I've been reusing them
since. I'll look into RMarkdown for the next paper, thanks!

------
hinkley
Everyone needs to be better at data visualization. If for no other reason than
to be an informed consumer. If Mark Twain had lived through informatics I'm
sure it would be four ways to lie: "Lies, Damned Lies, Statistics, and Charts"

There are a lot of bullshit techniques used in data viz that are tantamount to
lying and people are often sincerely shocked when you call them on it.

"I didn't do that on purpose," as if they didn't learn when they're 4 that it
doesn't matter if you meant it if you did it. You still have to apologize and
try to make amends.

Things people do from ignorance or malice:

Remove the origin on the graph and the relative height of the lines is skewed.
3d pie charts are 'larger' on the bottom half. Circle charts conflate diameter
with area (humans are bad at judging area). On a log scale plot, a fat enough
line can make anything look like a trend, because the end of the line is
literally orders of magnitude wider than at the origin.

Friends don't let friends use any of these techniques, and the jaded instantly
distrust anyone who is using them.
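
Two of the distortions listed above can be put into numbers. The sketch below computes the ratio Tufte calls the lie factor (relative change drawn divided by relative change in the data) for bars drawn from a truncated baseline, and the radius a circle needs so that its area, not its diameter, tracks the value. The function names and the example numbers are made up for illustration.

```python
import math


def relative_change(first, second):
    """Change from first to second, as a fraction of first."""
    return (second - first) / first


def lie_factor(first, second, baseline):
    """Relative change in drawn bar height divided by relative change in data.

    Bar heights are measured from `baseline`; 1.0 means the picture is honest.
    """
    shown = relative_change(first - baseline, second - baseline)
    actual = relative_change(first, second)
    return shown / actual


def radius_for_area(value):
    """Radius of a circle whose AREA equals `value`."""
    return math.sqrt(value / math.pi)


# Data goes from 98 to 100, a change of about 2%.
print(lie_factor(98, 100, baseline=0))   # 1.0: bars start at zero
print(lie_factor(98, 100, baseline=95))  # about 32.7: axis starts at 95

# Doubling the value should grow the radius by sqrt(2), not by 2.
print(radius_for_area(2.0) / radius_for_area(1.0))
```

If the radius itself is scaled with the value, doubling the value quadruples the drawn area, which is the diameter-versus-area confusion described above.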

The used bookstore on the closest university campus has an entire shelf full
of old editions of Edward Tufte books. I don't know if you'll be so lucky, but
it's well worth a shot.

~~~
mattkrause
Dogmatically insisting on origins-at-zero is silly: it depends too much on the
data, the message, and the audience.

For example, my lab studies the electrical activity of brain cells. A neuron
normally sits at -70 mV relative to the extracellular space. When it fires, it
can briefly cross 0 mV (though it probably won’t reach +70 mV) and sometimes
you might want to show that complete spike. Other times, subthreshold changes
in the cells’ activity are more important to a hypothesis, and then you’ll
zoom in on a smaller range (maybe -90 to -55 mV). Neither of these situations
would benefit from a plot centered at zero; in fact, it would look decidedly
odd.

(Your other comment about log-log plots also seems odd to me, because the
width of the line is almost never meaningful, unless there’s something like a
confidence/credible interval band, in which case that’s the whole point.)

~~~
hmwhy
I don't think the comment on log-log plots is limited to just log-log plots,
and the point is that there are times when the width of the line of best fit,
together with the size of data points, misleads the reader to think that it's
more linear than it actually is.
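
To get a rough sense of scale, here is a small sketch that works out how much of the data range a drawn line covers on a log10 axis. The stroke width and the pixels-per-decade figure are illustrative assumptions, and `stroke_span` is a made-up helper name.

```python
def stroke_span(value, stroke_px, px_per_decade):
    """Data interval hidden under a stroke centred on `value` on a log10 axis.

    stroke_px:      drawn line width in pixels
    px_per_decade:  pixels the axis uses for one factor of 10
    """
    half_width_decades = (stroke_px / 2) / px_per_decade
    factor = 10 ** half_width_decades
    return value / factor, value * factor


# A 6 px line on an axis with 100 px per decade covers the same
# multiplicative band everywhere (roughly -7% to +7%)...
low, high = stroke_span(1.0, stroke_px=6, px_per_decade=100)
print(low, high)

# ...but in absolute units the band at 1000 is 1000 times wider than at 1.
low_k, high_k = stroke_span(1000.0, stroke_px=6, px_per_decade=100)
print(high_k - low_k, high - low)
```

Any data point that falls inside that band is visually indistinguishable from the fit, so thicker strokes and larger markers make a fit look better than it is.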

Problematic presentation of a similar type is particularly pronounced in, for
example, finite-size scaled curves, where the perceived goodness of fit is
heavily influenced by the line widths of the curves, the size of the data
points, and the resolution of the plot.

These misleading data representations are something that can be found even in
recent publications in prestigious journals (I don't think it's always
intentional and therefore won't include any references); also, sometimes it's
difficult to notice there are issues unless you are very familiar with the
methods, or have tried to reproduce the results (and have wasted months or
even years of your life by then).

------
nycticorax
This is like a non-programmer telling programmers they need to comment their
code better. In a sense, it's probably true most of the time. But it's also
sort of hard to take very seriously, because the non-programmer doesn't have
any real understanding of the day-to-day challenges and trade-offs faced by
people who write code for a living.

~~~
setr
One major difference is that code comments aren't at all meant for public
consumption (even in open sourced code, comments have a very constrained
expected audience).

If anything, public documentation is probably a better analogy -- expected
audience of developers, but of potentially wildly different experience levels.

And in that case, it's generally fair for, say, a novice to complain about the
lack of examples, clarity, etc.

And these charts are the same -- this is the primary entry point for a novice
in the subject, and a helpful tool for experts. A poor chart would be harmful
for all expected audiences.

~~~
nycticorax
The fact that code comments aren't meant for public consumption, and figures
are, is completely irrelevant to the analogy. The point is that lots of things
we do could be done better, but having an outsider focus on one of them, and
tell you you could be doing it better, is not super-helpful. It's like being
told by three people that you need to put a cover sheet on your TPS reports.

That said, and having looked at the article more carefully (ahem), most of the
concrete suggestions they offer are sound, and are things that I think are
pretty well-known, at least to the practicing scientists I talk to. "Pie
charts are bad", and "Beware of false color images, especially ones that use
the 'jet' colormap." are both (paraphrases of) statements that I hear from
working scientists.

But then there's this: "And yet few scientists take the same amount of care
with visuals as they do with generating data or writing about it. The graphs
and diagrams that accompany most scientific publications tend to be the last
things researchers do, says data visualization scientist Seán O’Donoghue.
“Visualization is seen as really just kind of an icing on the cake.”"

This bears zero resemblance to my experience. Most scientists of my
acquaintance make the figures first, and then write the text. And certainly
I'd say that more care is taken with the figures than the text.

I guess I just feel like I see a certain amount of fetishization of beautiful
scientific illustrations (Tufte, etc), out of all proportion to its actual
importance to the scientific endeavor. The readers of most scientific articles
are probably not going to be flummoxed by a pie chart. And of course, time
spent focusing on this kind of stuff is time that might very well be better
spent doing experiments.

------
malshe
Alberto Cairo recently released his "How Charts Lie". It's a delightful read.

[https://www.amazon.com/How-Charts-Lie-Getting-Information/dp...](https://www.amazon.com/How-Charts-Lie-Getting-Information/dp/1324001569)

------
archgoon
I remember a decade or so ago, the only way for me to figure out certain
material values (this was material science / electrical engineering /
chemistry) was to open up MSPaint and manually draw lines on chart data to
find intercepts. Has the field gotten better about releasing raw data for
papers?

Granted, this was sometimes because the only team that had measured a
particular substance did it back in the 70s.

~~~
btrettel
There are many programs to automate that, e.g., I use g3data. Unfortunately
raw data is still rarely released today. One tip if the raw data wasn't
released as a computer file: Look for a dissertation associated with the
journal article you want data from. Dissertations often have tabulated data in
an appendix.
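
For what it's worth, the core of any plot digitizer is a two-point axis calibration. The sketch below shows that general idea only; it is not g3data's actual implementation, and `make_axis_mapper` and the pixel values are hypothetical.

```python
import math


def make_axis_mapper(pixel_a, value_a, pixel_b, value_b, log=False):
    """Build a pixel -> data converter for one axis from two known ticks.

    pixel_a, pixel_b: pixel coordinates of two labelled ticks on the axis
    value_a, value_b: the data values printed at those ticks
    log:              True if the axis is logarithmic (base 10)
    """
    if log:
        value_a, value_b = math.log10(value_a), math.log10(value_b)
    slope = (value_b - value_a) / (pixel_b - pixel_a)

    def to_data(pixel):
        v = value_a + slope * (pixel - pixel_a)
        return 10 ** v if log else v

    return to_data


# Linear x axis: pixel column 100 is x = 0, column 500 is x = 10.
x_of = make_axis_mapper(100, 0.0, 500, 10.0)
# Log y axis: image rows grow downward, so row 400 is y = 1, row 100 is y = 1000.
y_of = make_axis_mapper(400, 1.0, 100, 1000.0, log=True)

print(x_of(300), y_of(250))
```

Calibrating each axis separately like this handles flipped image rows and log axes; it does assume the scan is not rotated or skewed.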

------
exergy
I do simulations for a living aka numerical modelling. The results that I
obtain need to be explained using plots. How behaviour changes with time, how
component sizing will affect xyz parameter etc.

What I would dearly love is some way to animate the systems I simulate. A way
to show HOW MUCH flow is going through a pipe, or how much heat transfer is
happening in a heat exchanger. Something that's easier to grasp than dull
plots.

Sadly, a) I do not even know what to google to find solutions for this, and b)
all my primitive searches seem to lead to Blender, which has a steep learning
curve and requires way too much time investment.

~~~
astrophysician
[https://yt-project.org/](https://yt-project.org/) -- yt for Python was used
by a few of my old colleagues for visualizing astrophysical simulations. There
is yet another program that I've since forgotten but will try to come back
here and comment if I remember.

If you can generate a single static figure, you can manually generate
animations by generating each frame separately using your favorite plotting
software. You can then stitch them together using many different tools (I use
matplotlib) to generate an animation (video, gif, etc.).
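
As a dependency-free sketch of the frame-by-frame approach, the code below writes one small PPM image per time step, drawing a "pipe" whose filled length is proportional to the flow at that step. The flow values, colors, image size, and function names are all made up for illustration; in practice each frame would come from your usual plotting software.

```python
import os


def pipe_frame(flow_fraction, width=120, height=20):
    """Return one frame as binary PPM bytes.

    The frame is a horizontal bar: the left part (blue) is filled in
    proportion to flow_fraction (clamped to 0..1), the rest is grey.
    """
    filled = round(max(0.0, min(1.0, flow_fraction)) * width)
    blue = bytes((40, 90, 200))
    grey = bytes((220, 220, 220))
    row = blue * filled + grey * (width - filled)
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + row * height


def write_frames(flows, out_dir):
    """Write one numbered PPM file per flow value and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, flow in enumerate(flows):
        path = os.path.join(out_dir, f"frame_{i:04d}.ppm")
        with open(path, "wb") as f:
            f.write(pipe_frame(flow))
        paths.append(path)
    return paths
```

Zero-padded file names keep the frames in order for whatever tool stitches the sequence into a video or GIF.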

What language do you like to use? What's your background -- are you a Matlab
person?

~~~
mturk
I work on yt, and I just wanted to comment to say that whenever I stumble on a
mention of it somewhere, I genuinely feel warm inside. Thank you for thinking
of us!

One of the things we've tried to do -- especially by leveraging the technology
and innovations in the tools we build on (especially matplotlib!) -- is to
make it easier to make visualizations and plots that are "by-default"
communicative and information-rich.

And, of course we have lots of room for improvement, for picking up and
encouraging "best practices" at the library level, but it has been really
enjoyable and illuminating to have the opportunity to explore that with the
community.

~~~
maxnoe
Thanks for all the work on yt.

Did you consider replacing the rainbow color map images on your front page?

~~~
mturk
Yup, and we don't use a rainbow colormap by default anymore! We do provide a
few custom ones that were developed with viscm, which I believe was used in
the development of viridis, magma, etc. I think the specific examples are all from
the published literature, and we accepted them as submitted by users -- but a
refresh would be good!

I opened an issue:
[https://github.com/yt-project/website/issues/70](https://github.com/yt-project/website/issues/70)

------
martopix
This is very true, don't get me wrong. But we have a problem, which is that
scientists already need to be good (and efficient) at:

- Doing research
- Supervising students
- Teaching
- Giving talks in a clear way
- Writing papers in a clear way
- Writing grants in an enticing way
- Performing administrative tasks

This unfortunately is one of the reasons why, in certain fields, industry
research is much better than academic research. Google is draining brains from
compsci departments all over the world, because a scientist at Google is
mostly a scientist. And there is much, much more teamwork.

------
lancebeet
The "challenges" included in that article are a bit confusing to me, in
particular the pie chart. It asks for the third largest segment and claims the
answer is B. To me, it seems like B is the second largest and H is the third
largest. I did some thresholding of the image and it also shows the order
being C, B, H, D. Maybe I'm missing something?

~~~
knowablemag
We've updated the story and added an editor's note at the end with the
correction. Thank you again for spotting! -Katie, Knowable Magazine

------
datashow
Most charting packages have a terrible default: they set the y-axis origin
around the minimum y value, which is rarely appropriate in data visualization.
In my opinion, the default should always be zero.

Another one is that bars in bar/column charts are horribly wide.

~~~
geoalchimista
> In my opinion, the default should always be zero.

This won't work well for a lot of cases. Imagine plotting the atmospheric CO2
concentration time series over the past 200 years. Setting the origin at
zero ppm would not make sense because it can never happen. I'd say setting the
y-axis origin at y min is a good trade-off since the plotting package is
agnostic about the underlying nature of the plot.

~~~
danso
I have mixed feelings. I agree that "y-axis should start at zero" should not
be seen as a sacred guideline by any means. But I reflexively feel that in the
context of a data viz library, the default should be zero; thus requiring the
user to explicitly set a minimum when needed, and to conscientiously do so.

But I do realize that practically speaking, it is much easier to set a y-axis
to 0 (i.e. a constant value), rather than having to calculate the y-min or
near-y-min for every chart.

~~~
mattkrause
How would that work with negative data?

~~~
datashow
What do you mean? That a zero origin does not work with negative data?

~~~
mattkrause
The OP's suggestion was that the y-axis should start at zero by default, not
min(y). However, this seems like it would lead to empty plots when the data is
all or mostly negative.

~~~
datashow
Not really. The plot can be shown both above and below the x-axis.
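
The two defaults argued over in this subthread can be compared in a few lines. This is only a sketch of the idea, not how any particular plotting library picks its limits; `y_limits` is a hypothetical helper and the sample values are illustrative.

```python
def y_limits(values, include_zero=True):
    """Pick (low, high) y-axis limits for the data.

    include_zero=True  -> the axis always contains 0, extending up or
                          down from it as the data requires
    include_zero=False -> the axis hugs the data range (min to max)
    """
    low, high = min(values), max(values)
    if include_zero:
        low, high = min(low, 0), max(high, 0)
    return low, high


co2_ppm = [280, 315, 370, 410]        # all positive, far from zero
membrane_mv = [-90, -70, -62, -55]    # all negative
mixed = [-70, -20, 30]                # crosses zero

for data in (co2_ppm, membrane_mv, mixed):
    print(y_limits(data, include_zero=True), y_limits(data, include_zero=False))
```

With zero included, all-negative data hangs below the axis instead of producing an empty plot, but a series like the CO2 one gets squeezed into the top quarter of the figure, which is the trade-off both sides describe.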

------
frumiousirc
A web site that talks about information visualization, yet includes an
obscuring top bar and throws up a huge pop-up after loading. Yeah, that's a
tab closing.

------
mongol
Not only scientists. Politicians too. That would hopefully change the
discourse to be more material and fact-based and less rhetorical.

------
taurath
Maybe if they used something prettier than R it would help?

~~~
lallysingh
ggplot2 is pretty damn nice

------
pintxo
Didn't even start reading, with the prominent pie-chart directly above the
article. Not sure the author has any reasonable knowledge about visualization
at all.

~~~
maxnoe
Disadvantages and advantages of pie charts are discussed sensibly further down
in the article

