Ask HN: What data visualisation tools do data scientists and developers use?
65 points by lasryaric on July 23, 2015 | 34 comments
As a data scientist or software engineer working with "medium / large" amounts of data, how do you visualize it?

Let's take a few examples:

- Working on an iPhone app dealing with signal processing, Fourier transforms, etc., how do you visualize your signals and frequencies?

- As a back-end engineer working with a directed graph data structure, how do you quickly visualize your graph? Are you interested in seeing your graph change, step by step?

- How do you quickly visualize a massive number of data points as time series?

I'll just say upfront that I'm the cofounder of Periscope (https://www.periscope.io/) which is specifically marketed at data scientists, so I have a horse in this race. :)

There are two kinds of charts: Charts designed to find information, and charts designed to sell information. The latter are often gorgeous and many-dimensional: Heatmaps, animated bubble charts, charts with time sliders, etc. And by all means, if selling the data is required, then sell it with the best tool for the job.

As for actually investigating the data, it's usually a lot of tables, lines and bars. They're simple to understand, and there's no cleverness in the visualization that might hide critical information.

To answer your questions, at Periscope I've seen:

1. A line graph of amplitude over time. You should see the frequency emerge clear as day. If you want to calculate frequency explicitly, you could overlay a second line with its own axis. Again, super simple, but gives you the answer directly.
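A sketch of that first approach, assuming NumPy and Matplotlib and using a synthetic 5 Hz tone in place of real signal data (all names here are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # file-output backend; swap for an interactive one when exploring
import matplotlib.pyplot as plt

# Synthetic stand-in for a real signal: a 5 Hz sine sampled at 1 kHz for 1 second
fs = 1000
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t)

fig, (ax_time, ax_freq) = plt.subplots(2, 1)

# Amplitude over time: the frequency is visible directly in the waveform
ax_time.plot(t, signal)
ax_time.set(xlabel="time (s)", ylabel="amplitude")

# Explicit spectrum via FFT, if you want the number itself
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
spectrum = np.abs(np.fft.rfft(signal))
ax_freq.plot(freqs, spectrum)
ax_freq.set(xlabel="frequency (Hz)", ylabel="magnitude", xlim=(0, 20))

dominant = freqs[np.argmax(spectrum)]  # 5.0 Hz for this signal
fig.savefig("signal.png")
```

The same two-panel pattern works for real captured samples; only the arrays change.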

2. I've seen a lot of fancy graph visualizations, but nothing that makes me happy. Depending on what you want to know about your graph, maybe a simple table with a structure like:

  [node name][node name][weight]

  [timestamp][node name][node name][weight]
A pivot table on top of this data, transforming the second node column into the table's horizontal axis, can also be useful.
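That pivot is quick to sketch with pandas (the column names here are just illustrative):

```python
import pandas as pd

# Edge list in the [node name][node name][weight] shape described above
edges = pd.DataFrame({
    "src": ["a", "a", "b"],
    "dst": ["b", "c", "c"],
    "weight": [1.0, 2.0, 3.0],
})

# Move the second node column onto the horizontal axis:
# rows = source node, columns = destination node, cells = edge weight
adj = edges.pivot_table(index="src", columns="dst", values="weight")
print(adj)
```

Missing edges simply show up as empty cells, which makes the structure easy to scan.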

3. OK, obviously I think Periscope is a great choice here. Loads of data analysts use it to visualize time series data on many tens/hundreds of billions of data points.

That said, other good choices are: Excel, R/Stata/Matlab, gnuplot, Apache Pig. And for the data storage itself, IMO Amazon Redshift is unparalleled.

Thanks for your input.

While working on signal processing and some graph data structures, I was looking for a simple code snippet to add in my code and be able to quickly visualise the data in my web browser. I was tempted to implement it myself a bunch of times, but I never got the time to actually do it.

What I ended up doing is dumping my data points into CSV files and using Excel, which was really bad when I had more than 50k data points (I am running Mac OS X). Periscope is amazing btw.

Thanks for the kind words!

For ping-and-visualize, I'd recommend a tool that'll give you raw access to the underlying data. Some choices are https://amplitude.com/ and https://segment.com/. Happy to chat about pros and cons if you're curious -- harry@periscope.io.

For a simple viz of graph data, I'd use linkurious.js[0] (it's a very active fork of sigma.js).

Disclaimer: I work on it ;)


Off-topic: At the bottom of the landing page Ellen Pao is listed as an investor in Periscope and CEO of Reddit. The latter is no longer true.

Off-Topic: Just curious here, are you guys having problems because of the other startup named Periscope?

For interactive exploration of huge and out-of-core datasets, Python's Blaze/Dask handles data manipulation with out-of-core parallel dataframes, and Bokeh provides interactive WebGL rendering and downsampling for visualizing those dataframes.

http://blaze.pydata.org/en/latest/ http://bokeh.pydata.org/en/latest/ http://dask.pydata.org/en/latest/

For exploratory data visualization I can recommend Matplotlib:


It's flexible, easy to use and can (with some tweaking) produce publication-quality output. It supports a wide range of backends such as PDF, SVG and PNG, and a range of GUIs like Qt, so it is easy to use for both static and interactive graphs. More importantly, it plays nicely with the IPython notebook (http://ipython.org/notebook.html) and with libraries such as numpy, pandas and scikit-learn. The last point is very important because when visualizing data half of the problem is getting the data in shape, and Python offers some excellent tools for this (much better than what you would get in JavaScript).

The choice of the right tool depends of course on where you want to show your plots and what kind of plots you want to generate. If you intend to show your visualizations in a web browser, then D3 or a similar JS library would probably be a good choice (also have a look at Bokeh, which is becoming a very good alternative to Matplotlib). For iPhone applications there are probably native toolkits to do this, but unfortunately I have no experience there and cannot make any recommendations.

Concerning the scalability of your visualizations: To my knowledge there are not many tools that can handle more than a few tens of thousands of data points without becoming slow. So you will probably always have to reduce the number of data points before plotting them. In any case, showing all of your data points without prior aggregation will probably not be a good idea in most cases, performance aside. If you really need to work with massive amounts of data points, have a look at Processing, which to my understanding can handle pretty large data sets (but is not a data visualization toolkit per se). OpenGL-based approaches can also be interesting for visualizing large amounts of data points, especially but not only in 3D.
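One way to do that reduction before plotting, sketched with pandas resampling (the sizes here are arbitrary):

```python
import numpy as np
import pandas as pd

# A million noisy samples at 1 ms resolution -- far too many to plot raw
idx = pd.date_range("2015-07-23", periods=1_000_000, freq="ms")
ts = pd.Series(np.random.randn(1_000_000).cumsum(), index=idx)

# Aggregate down to one point per second before handing it to a plotting library
downsampled = ts.resample("1s").mean()
print(len(downsampled))  # 1000 points instead of 1,000,000
```

Swapping `.mean()` for `.max()` or `.min()` keeps spikes visible if that's what you're debugging.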

Concerning your examples:

- Gephi is a nice tool for visualizing graphs (http://gephi.github.io/), but a bit clunky to use at times.

- R has the excellent ggplot2 package, which can work with pretty large data sets and has specialized graphs for e.g. time series data.

The default themes/colors in matplotlib leave something to be desired. I'd recommend checking out seaborn (http://stanford.edu/~mwaskom/software/seaborn/) for setting nice default colors and handling many types of statistical data plots. It reduces a lot of the boilerplate.

For visualizations that need to be shared IPython notebooks are great and work with Julia & R as well. The @interact widgets are fantastic for exploring data. Unfortunately interactive notebooks are difficult to post on a static website.

To answer the OP's other questions:

For that, I've been moving to a mix of HTML5/JavaScript/D3. While it's still a new-ish mix compared to matplotlib, there is something fantastic about being able to save all the data and computations to an HTML file which anyone with a decent browser can open and interact with. I use this method to store my protocol records for experimental tests. One file can contain pretty self-describing data along with JavaScript code to nicely present it.

I don't think I realized how important default colors/styles are for visualization software until using seaborn. Quickly producing sensible plots in terms of aesthetics should not require 10+ lines of extra code. Fortunately, it looks like the matplotlib folks are looking to change default colors/styles with 2.0 [1].


It was not clear in my question, but I was mostly asking how people visualise their data while they are working on it. Let's say you are a developer working on the well-known Shazam application. You want to visualize your signals and frequencies for debugging purposes; how do you quickly visualize them?

In my case, while working on signal processing using C and Objective-C, I had no easy solution to visualise my data in an iterative way (code, visualize, repeat).

I was looking for a simple Objective-C library to include, in order to visualise my data in a web browser.

By the way, thanks for your input, this is very valuable.

If you can work with Python/Julia/R then I would still recommend the IPython notebook, since you will not get anything more interactive and better suited to data analysis and visualization in one package. It basically allows you to run arbitrary snippets of code and interactively work with your data (code, visualize, repeat). In general, I would not try to develop your algorithms directly within your app but instead generate some example data, export that data and then work with it in IPython (or another tool) to iteratively build the algorithm. As soon as you have a working solution you can then integrate/translate the code back into your application.

Seconding the IPython Notebook. I've used matplotlib but also like Plot.ly for its prettier, more sharable charts.

It's also not a huge amount of work to set up a JSON-Flask-D3 visualisation pipeline on top of test data; that's exactly what I did for my current data project, using cubism.js to easily visualise a time series. Again, the benefit is the ability to easily share.
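The server side of such a pipeline can be tiny. A hypothetical sketch with Flask (the endpoint name and data are made up), with D3/cubism.js fetching the JSON on the front end:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical endpoint: the D3 front end fetches this JSON and renders it
@app.route("/data")
def data():
    series = [{"t": i, "value": i * i} for i in range(10)]
    return jsonify(series)

# Run with e.g.: FLASK_APP=viz_server.py flask run
```

From there, a D3 chart is a `fetch("/data")` plus a few lines of rendering, and the whole thing is shareable as a URL.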

Also, sometimes, Google Spreadsheets/Excel can be the easiest solution to throw up quickly - or using built-in data explorations in RStudio. I've also thrown graph data into Neo4j because it gives you a nice visual interface for free.

Good point!

IPython notebook completely changed the way I do data science, specifically because its iterative/exploratory capabilities are so powerful. You can also hook it up to large-scale services like Hadoop/Spark. Embedded matplotlib charts have spoiled me. I cringe at the thought that I used to write scripts to output CSV, run them on the command line, then copy the output to Excel to graph the data. Never again!

I've often wanted to build a command-line tool that would read from stdin and pop up a live interactive chart of the values. This would make debugging/understanding complicated data-dependent algorithms possible by just adding some printf-style logging.
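A bare-bones sketch of that tool, assuming Matplotlib; the script name and file output are made up, and a live window would need an interactive backend instead of Agg:

```python
# plotpipe.py -- pipe numbers in, get a chart out:
#   my_algorithm | python plotpipe.py
import sys

import matplotlib
matplotlib.use("Agg")  # file output; use an interactive backend for a live window
import matplotlib.pyplot as plt

def read_values(stream):
    """Parse one float per line, skipping lines that aren't numbers."""
    values = []
    for line in stream:
        try:
            values.append(float(line.strip()))
        except ValueError:
            continue  # ignore interleaved log output
    return values

def plot_stream(stream, outfile="debug.png"):
    values = read_values(stream)
    plt.plot(values)
    plt.savefig(outfile)
    return values

if __name__ == "__main__" and sys.stdin is not None and not sys.stdin.isatty():
    plot_stream(sys.stdin)
```

Because non-numeric lines are skipped, you can leave your normal log output interleaved with the values you print.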

Agreed, matplotlib is great. We use it all the time to make publication-quality graphs (output to pdf), specifically with time series.

I used google's svg charting library [0] this week and I have to say it's really improved a ton in terms of customizability. Data in, chart out... only had to do a bit of custom html to fix up the legend but: https://www.openlistings.co/near/googleplex

That said, my feeling is that the hard part of dataviz is figuring out what to visualize. If you're working with big datasets, it's always going to be aggregate statistics that you're looking at (unless you're one of those people that can see blondes and brunettes in logs of the matrix).

[0]: https://developers.google.com/chart/interactive/docs/gallery...

[Disclaimer: we're doing http://www.graphistry.com , which helps with graph views of big IT infrastructure and log data.]

While doing research at Berkeley, and now building IT sec/ops visibility tools, I've gone through several phases:


-- Start simple: Excel :)

-- Advance to Python: pivot tables and plots via IPython/Jupyter notebooks with Pandas. (I used to do R but Python largely caught up, thankfully.) For bigger stuff, we started playing with PySpark.

-- For IT time series & logs: graphite/ganglia/ELK, and if you can afford it, significantly more polished tools like Splunk.

-- Network graphs are an evolving story. Small graphs are easy through something like Neo4j's built-in explorer. Alternatively, the Linkurious team is friendly, and KeyLines is embedded all over the place. Big graphs are the problem: Gephi can do ~50K nodes and has a lot of features, but it's hard to use and showing its age. We're building something more scalable and analysis-focused, particularly for sec/ops, so happy to share beta access. (We use GPUs and focus hard on smooth integrations, which lets us change the rules a bit.) Let me know.

To blatantly toot my own horn here: for visual exploratory data analysis, I just released Popily (http://popily.com).

It is meant as a no-code solution to get the layout of a dataset and do some fast EDA. Drag data in, then click around your data. It isn't a replacement for in-depth analysis, but it is a heck of a lot faster for EDA than making dozens of charts in Matplotlib.

Here are a bunch of examples (http://popily.com/examples/) of Popily in action. Personal favorite: Battles In The Game Of Thrones (http://popily.com/explore/notebook/battles-in-the-game-of-th...).

There's a sharp divide between what's prepackaged for typical uses and attempts to customize visualization for any of three-dimensionality, parallel processing, or interactivity with large data. The latter set of tools means the Visualization Toolkit (VTK), or ParaView, MayaVi, and VisIt, all of which sit on top of VTK. Or there are commercial applications, such as EnVision and IDL. Keep a language-agnostic stable of prepackaged visualization techniques in Python, R, Matlab, Mathematica, Excel (yes, even), Julia, and JavaScript. Then reach for the big hammers when absolutely necessary.

In general, most data scientists don't need to see the entirety of a data set to get an idea. A lot of the initial data exploration is about separating signal from noise: figuring out what part of the data is important.

After that (similar to your examples), it completely depends on what you're trying to see. Am I looking at per capita data? Then chances are I'll simply toss it at a choropleth. Time series tends to work pretty well in a simple line graph (often with overlays), etc.

I’ll say upfront that I’m a Product Manager at Treasure Data, and we market to Data Scientists. Specifically to enable you to perform analysis on large datasets, directly from your local machine. More generally, Treasure Data enables the collection, storage & analytics of large-scale event data.

For performing preliminary analytics, I’ll agree with what the previous respondents have said: IPython Notebook is a GREAT tool. It’s certainly my go-to. The libraries I think of using when working within this context are as follows:

The go-to packages:

  http://ggplot2.org/ (R)
  http://matplotlib.org/ (Python)

Graph visualizations:

  http://gephi.github.io/
  http://neo4j.com/

Online dashboards:

  https://github.com/stitchfix/pyxley (I’m particularly excited to try this out)
  http://bokeh.pydata.org/

Of course, the challenge is you don’t have a static dataset! New data is continuously coming in. Your dataset is growing larger all the time. It may be too large to fit on your local machine.

That’s why Treasure Data was founded, to enable the easy collection of, and analytics on, this type of data stream. Treasure enables complete removal of the engineering & devops for these collection & storage steps.

For example:

  Want a continuously updated dashboard of your incoming data? = Treasure Data + Jupyter Notebooks + Pyxley
  Want to perform graph visualizations on event data? = Treasure Data + Jupyter Notebooks + Neo4j
  Want to create visualizations in R? = Treasure Data + R + ggplot2

The above is enabled through Treasure Data’s integration with Pandas & R. (http://docs.treasuredata.com/articles/jupyter-pandas).

Good luck in your work!

Disclaimer: I'm a co-founder of Linkurious. Our products are used by data scientists and less experienced users.

We have an open-source graph visualization JS tookit called linkurious.js: https://github.com/Linkurious/linkurious.js

We also offer a commercial product to visualize Neo4j stored graphs: https://linkurio.us/product/

Linkurious CTO here :)

Our product connects to Neo4j databases and enables full-text search, visualization design (select nodes/edges, apply size/color to nodes/edges according to properties) with a point-and-click interface.

Try it out: http://linkurio.us/demo/ (the "create account" button generates a temporary login/password for the demo)

It depends on the context. Is this data being rendered as a PNG for use in a blog post? Or as an interactive application on a webpage?

For 2D time series, any 2D application is fine if the data is pre-processed, even Excel. For 2D plots in general, I'm more biased toward R and ggplot2, though (see my tutorial: http://minimaxir.com/2015/02/ggplot-tutorial/ ).

Graph data structures are a bit harder. I know Gephi is used for creating PNGs, but I have less experience with it.

For interactive web charts, you have to use libraries like d3.js or Highcharts, but I am not a fan of using interactive charts for static data unless necessary. (Mostly because they never play well with mobile devices without significant QA.)

IMHO, ggplot2 is the best graphing library I have ever encountered. Because it works naturally with data, it's very easy to create great visualizations.

Yep, very practical for getting stuff displayed cogently and with minimal fuss. I wish other platforms besides R had something like this.

We use Julia (running in IJulia) with D3 using either iframe communication, window.postMessage or hooking directly onto the websocket to communicate between the D3 viz and the Julia backend.

We decided to use D3 because of the interactivity. There are performance problems with the default SVG examples when the number of visualized nodes grows so I wrote a hybrid CANVAS + SVG rendering layer. We use CANVAS for the bulk of the drawing, SVG for text nodes or a few nodes that need mouse interaction, and event delegation on the document to determine if the mouse has interacted with anything else.

For large-scale EDA time-series viz, `gnuplot` has served me well; it even understands millisecond timestamps.

It depends how fluid I need to be.

If I have a very specific, quantifiable goal in mind, then I use test driven development - added bonus of having a test suite at the end.

If I am working with large datasets on servers, then I simply subsample, and then scale up.

Machete is a good one!


It appears none of the Featured Products are currently working.

We use Python a lot, so matplotlib is our tool of choice, combined with NumPy or pandas!

