
Ask HN: What data visualisation tools do data scientists and developers use? - lasryaric
As a data scientist or software engineer working with a "medium / large" amount of data, how do you visualize it?

Let's take a few examples:

- Working on an iPhone app dealing with signal processing, Fourier transforms, etc., how do you visualize your signals and frequencies?

- As a back-end engineer working with a directed graph data structure, how do you quickly visualize your graph? Are you interested in seeing your graph change, step by step?

- How do you quickly visualize a massive number of data points as time series?
======
hglaser
I'll just say upfront that I'm the cofounder of Periscope
([https://www.periscope.io/](https://www.periscope.io/)) which is specifically
marketed at data scientists, so I have a horse in this race. :)

There are two kinds of charts: Charts designed to find information, and charts
designed to sell information. The latter are often gorgeous and many-
dimensional: Heatmaps, animated bubble charts, charts with time sliders, etc.
And by all means, if selling the data is required, then sell it with the best
tool for the job.

As for actually investigating the data, it's usually a lot of tables, lines
and bars. They're simple to understand, and there's no cleverness in the
visualization that might hide critical information.

To answer your questions, at Periscope I've seen:

1. A line graph of amplitude over time. You should see the frequency emerge
clear as day. If you want to calculate frequency explicitly, you could overlay
a second line with its own axis. Again, super simple, but it gives you the
answer directly.
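
The simple approach above can be sketched in a few lines of matplotlib. Everything here is illustrative: the signal is a made-up 50 Hz tone and the sample rate is an assumption, not anything from the thread.

```python
# Minimal sketch: amplitude over time, with an explicit dominant-frequency
# estimate overlaid as a second line on its own axis via twinx().
import numpy as np
import matplotlib.pyplot as plt

fs = 1000                                # sample rate in Hz (assumed)
t = np.arange(0, 1.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 50 * t)      # made-up 50 Hz tone

fig, ax1 = plt.subplots()
ax1.plot(t, signal)
ax1.set_xlabel("time (s)")
ax1.set_ylabel("amplitude")

# Estimate the dominant frequency from the FFT magnitude spectrum.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
dominant = freqs[np.argmax(spectrum)]

ax2 = ax1.twinx()                        # second line with its own axis
ax2.plot(t, np.full_like(t, dominant), "r--")
ax2.set_ylabel("dominant frequency (Hz)")
plt.show()
```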

2. I've seen a lot of fancy graph visualizations, but nothing that makes me
happy. Depending on what you want to know about your graph, maybe a simple
table with a structure like:

    
    
      [node name][node name][weight]
    

Or:

    
    
      [timestamp][node name][node name][weight]
    

A pivot table on top of this data, transforming the second node column into
the table's horizontal axis, can also be useful.
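
One way to sketch that pivot-table step with pandas, using a made-up edge list in the [node name][node name][weight] shape:

```python
# Pivot an edge list so the second node column becomes the horizontal axis.
import pandas as pd

# Hypothetical [node name][node name][weight] rows.
edges = pd.DataFrame({
    "src": ["a", "a", "b", "c"],
    "dst": ["b", "c", "c", "a"],
    "weight": [1.0, 2.5, 0.5, 3.0],
})

# Rows are source nodes, columns are destination nodes, cells are weights.
adjacency = edges.pivot_table(index="src", columns="dst",
                              values="weight", fill_value=0)
print(adjacency)
```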

3. OK, obviously I think Periscope is a great choice here. Loads of data
analysts use it to visualize time series data on many tens/hundreds of
billions of data points.

That said, other good choices are: Excel, R/Stata/Matlab, gnuplot, Apache Pig.
And for the data storage itself, IMO Amazon Redshift is unparalleled.

~~~
lasryaric
Thanks for your input.

While working on signal processing and some graph data structures, I was
looking for a simple code snippet to add to my code so I could _quickly_
visualise the data in my web browser. I was tempted to implement it myself a
bunch of times, but I never got the time to actually do it.

What I ended up doing was dumping my data points into CSV files and using
Excel, which was really bad if I had more than 50k data points (I am
running Mac OS X). Periscope is amazing, btw.

~~~
hglaser
Thanks for the kind words!

For ping-and-visualize, I'd recommend a tool that'll give you raw access to
the underlying data. Some choices are
[https://amplitude.com/](https://amplitude.com/) and
[https://segment.com/](https://segment.com/). Happy to chat about pros and
cons if you're curious -- harry@periscope.io.

------
Lofkin
For interactive exploration of huge and out-of-core datasets, Python
Blaze/Dask offers data manipulation with out-of-core (OOC) parallel
dataframes, and Python Bokeh offers interactive WebGL rendering and
downsampling for visualizing said dataframes.

[http://blaze.pydata.org/en/latest/](http://blaze.pydata.org/en/latest/)
[http://bokeh.pydata.org/en/latest/](http://bokeh.pydata.org/en/latest/)
[http://dask.pydata.org/en/latest/](http://dask.pydata.org/en/latest/)

------
ThePhysicist
For exploratory data visualization I can recommend Matplotlib:

[http://matplotlib.org/](http://matplotlib.org/)

It's flexible, easy to use, and can, with some tweaking, produce publication-
quality output. It supports a wide range of output backends such as PDF, SVG
and PNG, and a range of GUIs like Qt, so it is easy to use for both static and
interactive graphs. More importantly, it plays nicely with the IPython notebook
([http://ipython.org/notebook.html](http://ipython.org/notebook.html)) and
with libraries such as numpy, pandas and scikit-learn. The last point is very
important because when visualizing data, half of the problem is getting the
data in shape, and Python offers some excellent tools for this (much better
than what you would get in JavaScript).
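
For a flavor of that interplay, a minimal sketch (all names and values here are illustrative): numpy generates the data, pandas holds it, matplotlib draws and exports it.

```python
# numpy -> pandas -> matplotlib, with SVG output via savefig.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"t": np.linspace(0, 10, 200)})
df["y"] = np.sin(df["t"])

fig, ax = plt.subplots()
ax.plot(df["t"], df["y"])
ax.set_xlabel("t")
ax.set_ylabel("sin(t)")
fig.savefig("plot.svg")  # the SVG backend; PDF and PNG work the same way
```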

The choice of the right tool depends, of course, on where you want to show
your plots and what kind of plots you want to generate. If you intend to show
your visualizations in a web browser, then D3 or a similar JS library would
probably be a good choice (also have a look at Bokeh, which is becoming a very
good alternative to Matplotlib). For iPhone applications there are probably
native toolkits for this, but unfortunately I have no experience there and
cannot make any recommendations.

Concerning the scalability of your visualizations: to my knowledge there are
not many tools that can handle more than a few tens of thousands of data
points without becoming slow, so you will probably always have to reduce the
number of data points before plotting them. In any case, showing all of your
data points without prior aggregation will not be a good idea in most cases,
performance aside. If you really need to work with a massive number of data
points, have a look at Processing, which to my understanding can handle pretty
large data sets (but is not a data visualization toolkit per se). OpenGL-based
approaches can also be interesting for visualizing large numbers of data
points, especially but not only in 3D.
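
The reduce-before-plotting advice can be sketched with pandas resampling. The series here is synthetic and the one-second bucket size is an arbitrary choice:

```python
# Aggregate a large series into coarser time buckets before plotting.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# One million synthetic points at 1 ms spacing.
idx = pd.date_range("2015-01-01", periods=1_000_000, freq="ms")
series = pd.Series(np.random.randn(1_000_000).cumsum(), index=idx)

# Resample to 1-second means: ~1,000 points, cheap for any plotting tool.
reduced = series.resample("1s").mean()
reduced.plot()
plt.show()
```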

Concerning your examples:

- Gephi is a nice tool for visualizing graphs
([http://gephi.github.io/](http://gephi.github.io/)), but a bit clunky to use
at times.

- R has the excellent ggplot2 package, which can work with pretty large data
sets and has specialized graphs for e.g. time series data.

~~~
lasryaric
It was not clear in my question, but I was mostly asking how people
visualise their data while they are working on it. Let's say you are a
developer working on the well-known Shazam application. You want to visualize
your signals and frequencies for debugging purposes; how do you do that
quickly?

In my case, while working on signal processing using C and Objective-C, I had
no easy way to visualise my data in an iterative loop (code, visualize,
repeat).

I was looking for a simple Objective-C library to include, in order to
visualise my data in a web browser.

By the way, thanks for your input, this is very valuable.

~~~
ThePhysicist
If you can work with Python/Julia/R then I would still recommend the IPython
notebook, since you will not get anything more interactive and better suited
to data analysis and visualization in one package. It basically allows you to
run arbitrary snippets of code and interactively work with your data (code,
visualize, repeat). In general, I would not try to develop your algorithms
directly within your app but instead generate some example data, export that
data and then work with it in IPython (or another tool) to iteratively build
the algorithm. As soon as you have a working solution you can then
integrate/translate the code back into your application.
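
That export-then-iterate loop might look something like the following minimal sketch, where the app's dump step is stood in for by numpy and the file name is hypothetical:

```python
# The app dumps raw samples to a file; a notebook cell reloads and plots
# them on each iteration of the algorithm.
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the app's export step (real data would come from the app).
np.savetxt("samples.csv", np.sin(np.linspace(0, 20, 500)), delimiter=",")

samples = np.loadtxt("samples.csv", delimiter=",")
plt.plot(samples)
plt.title("latest dump from the app")
plt.show()
```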

~~~
lasryaric
Good point!

~~~
pvnick
IPython notebook completely changed the way I do data science, specifically
because its iterative/exploratory capabilities are so powerful. You can also
hook it up to large-scale services like Hadoop/Spark. Embedded Matplotlib
charts have spoiled me. I cringe at the thought that I used to write scripts
to output CSV, run them on the command line, then copy the output to Excel to
graph the data. Never again!

------
rgbrgb
I used Google's SVG charting library [0] this week and I have to say it's
really improved a ton in terms of customizability. Data in, chart out... I only
had to do a bit of custom HTML to fix up the legend:
[https://www.openlistings.co/near/googleplex](https://www.openlistings.co/near/googleplex)

That said, my feeling is that the hard part of dataviz is figuring out what to
visualize. If you're working with big datasets, it's always going to be
aggregate statistics that you're looking at (unless you're one of those people
who can see blondes and brunettes in the logs of the Matrix).

[0]:
[https://developers.google.com/chart/interactive/docs/gallery...](https://developers.google.com/chart/interactive/docs/gallery/linechart)

------
lmeyerov
[Disclaimer: we're doing
[http://www.graphistry.com](http://www.graphistry.com) , which helps with
graph views of big IT infrastructure and log data.]

While doing research at Berkeley, and now building IT sec/ops visibility
tools, I've gone through several phases. Basically:

-- Start simple: Excel :)

-- Advance to Python: pivot tables and plots via IPython/Jupyter notebooks
with Pandas. (I used to do R, but Python has largely caught up, thankfully.)
For bigger stuff, we started playing with PySpark.

-- For IT time series & logs: graphite/ganglia/ELK, and if you can afford it,
significantly more polished tools like Splunk.

-- Network graphs are an evolving story. Small graphs are easy through
something like Neo4j's built-in explorer. Alternatively, the Linkurious team
is friendly, and Keylines is embedded all over the place. Big graphs are the
problem: Gephi can do ~50K nodes and has a _lot_ of features, but it's hard to
use and showing its age. We're building something more scalable and analysis-
focused, particularly for sec/ops, so happy to share beta access. (We use GPUs
and focus hard on smooth integrations, which lets us change the rules a bit.)
Let me know.

------
chrisalbon
At the risk of blatantly tooting my own horn: for visual exploratory data
analysis, I just released Popily ([http://popily.com](http://popily.com)).

It is meant as a no-code solution to get the layout of a dataset and do some
fast EDA. Drag data in, then click around your data. It isn't a replacement
for in-depth analysis, but it is a heck of a lot faster for EDA than making
dozens of charts in Matplotlib.

Here are a bunch of examples
([http://popily.com/examples/](http://popily.com/examples/)) of Popily in
action. Personal favorite: Battles In The Game Of Thrones
([http://popily.com/explore/notebook/battles-in-the-game-of-
th...](http://popily.com/explore/notebook/battles-in-the-game-of-thrones/)).

------
adolgert
There's a sharp divide between what's prepackaged for typical uses and
attempts to customize visualization for any of three-dimensionality, parallel
processing, or interactivity with large data. The latter set of tools means
the Visualization Toolkit (VTK), or ParaView, MayaVi, and VisIt, all of which
sit on top of VTK. There are also commercial applications, such as EnVision
and IDL. Keep a language-agnostic stable of prepackaged visualization
techniques in Python, R, Matlab, Mathematica, Excel (yes, even), Julia, and
JavaScript. Then reach for the big hammers when absolutely necessary.

------
ironchef
In general, most data scientists don't need to see the entirety of a data set
to get an idea. A lot of the initial data exploration is about separating
signal from noise: figuring out which parts of the data are important.

After that (similar to your examples), it completely depends on what you're
trying to see. Am I looking at per-capita data? Then chances are I'll simply
toss it at a choropleth. Time series tend to work pretty well as simple line
graphs (often with several series overlaid), etc.

------
rparrish
I’ll say upfront that I’m a Product Manager at Treasure Data, and we market to
Data Scientists. Specifically to enable you to perform analysis on large
datasets, directly from your local machine. More generally, Treasure Data
enables the collection, storage & analytics of large-scale event data.

For performing preliminary analytics, I'll agree with what the previous
respondents have said: the IPython Notebook is a GREAT tool. It's certainly my
go-to. The libraries I think of using when working within this context are as
follows:

The go-to packages:

> [http://ggplot2.org/](http://ggplot2.org/) (R)

> [http://matplotlib.org/](http://matplotlib.org/) (Python)

Graph visualizations:

> [http://gephi.github.io/](http://gephi.github.io/)

> [http://neo4j.com/](http://neo4j.com/)

Online dashboards:

> [https://github.com/stitchfix/pyxley](https://github.com/stitchfix/pyxley)
(<- I'm particularly excited to try this out)

> [http://bokeh.pydata.org/](http://bokeh.pydata.org/)

Of course, the challenge is you don't have a static dataset! New data is
continuously coming in. Your dataset is growing larger all the time. It may
be too large to fit on your local machine.

That’s why Treasure Data was founded, to enable the easy collection of, and
analytics on, this type of data stream. Treasure enables complete removal of
the engineering & devops for these collection & storage steps.

For example:

> Want a continuously updated dashboard of your incoming data? Treasure Data +
Jupyter Notebooks + Pyxley

> Want to perform graph visualizations on event data? Treasure Data + Jupyter
Notebooks + Neo4j

> Want to create visualizations in R? Treasure Data + R + ggplot

The above is enabled through Treasure Data’s integration with Pandas & R.
([http://docs.treasuredata.com/articles/jupyter-
pandas](http://docs.treasuredata.com/articles/jupyter-pandas)).

Good luck in your work!

~~~
jvilledieu
Disclaimer: I'm a co-founder of Linkurious. Our products are used by data
scientists and less-experienced users.

We have an open-source graph visualization JS toolkit called linkurious.js:
[https://github.com/Linkurious/linkurious.js](https://github.com/Linkurious/linkurious.js)

We also offer a commercial product to visualize graphs stored in Neo4j:
[https://linkurio.us/product/](https://linkurio.us/product/)

------
minimaxir
It depends on the context. Is this data being rendered as a PNG for use in a
blog post? Or as an interactive application on a webpage?

For 2D time series, any 2D application is fine if the data is pre-processed,
even Excel. For 2D plots in general, I'm more biased toward R and ggplot2,
though (see my tutorial: [http://minimaxir.com/2015/02/ggplot-
tutorial/](http://minimaxir.com/2015/02/ggplot-tutorial/) ).

Graph data structures are a bit harder. I know Gephi is used for creating
PNGs, but I have less experience with it.

For interactive web charts, you have to use libraries like d3.js or
Highcharts, but I am not a fan of using interactive charts for static data
unless necessary. (Mostly because they never play well with mobile devices
without significant QA.)

~~~
abannin
IMHO, ggplot2 is the best graphing library I have ever encountered. Because it
works naturally with data, it's very easy to create great visualizations.

~~~
angdis
Yep, very practical for getting stuff displayed cogently and with minimal
fuss. I wish other platforms besides R had something like this.

------
bluesmoon
We use Julia (running in IJulia) with D3 using either iframe communication,
window.postMessage or hooking directly onto the websocket to communicate
between the D3 viz and the Julia backend.

We decided to use D3 because of the interactivity. There are performance
problems with the default SVG examples when the number of visualized nodes
grows so I wrote a hybrid CANVAS + SVG rendering layer. We use CANVAS for the
bulk of the drawing, SVG for text nodes or a few nodes that need mouse
interaction, and event delegation on the document to determine if the mouse
has interacted with anything else.

------
evandrix
For large-scale EDA time-series viz, `gnuplot` has served me well; it even
understands millisecond resolution in timestamps.

------
malux85
It depends how fluid I need to be.

If I have a very specific, quantifiable goal in mind, then I use test driven
development - added bonus of having a test suite at the end.

If I am working with large datasets on servers, then I simply subsample, and
then scale up.
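
A minimal sketch of that subsample-then-scale-up idea with pandas (the frame and the 1% sampling fraction are made up):

```python
# Iterate on a small random sample, then rerun unchanged on the full data.
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1_000_000)})

# 1% sample while developing; drop the sampling step when scaling up.
small = df.sample(frac=0.01, random_state=0)
print(len(small))
```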

------
andegre
Machete is a good one!

[https://www.machete.io/beta](https://www.machete.io/beta)

~~~
bengali3
It appears none of the Featured Products are currently working.

------
ahamino
We use Python a lot, so matplotlib is our tool of choice, combined with NumPy
or pandas!

