

Visualising Correlations using Graphs - wslh
https://medium.com/@JavierBurroni/visualising-correlations-using-graph-2169c6415427

======
taliesinb
The most well-known practitioners of this sort of 'topological' approach are
Ayasdi, who have some slick demos [1]. The general name for this idea is
topological data analysis [2].

I replicated this particular experiment in WL, of course, because it's a
5-minute thing to do [3], and I could actually do the community detection the
author alluded to.

But I noticed that the correlation matrix itself is much more suggestive than
the graph ends up being, with or without community detection. Take a look at
the correlation matrix (note that MatrixPlot does some clever combination of
rank and absolute value to get high dynamic range):

[http://imgur.com/WKn029o](http://imgur.com/WKn029o)

The tri-diagonal structure is because the original dataset is derived from the
pixel counts from successive 4x4 tiles on NIST written-digit images [4].

That 8x8 matrix of tiles is flattened onto the 64 random variables, so the
large correlation with tiles on the left and right explains the 1-off-diagonal
orange lines. The other two diagonals are offset by 8 and correspond to the
high correlation with the tiles above and below. That's the 'connectivity
kernel' of a 2D manifold, so to speak.
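
To make the offsets concrete, here is a minimal Python sketch of the index
arithmetic. It assumes the 8x8 grid is flattened row by row (which is what
offsets of 1 and 8 imply); the function names are hypothetical and not taken
from the notebook in [3].

```python
GRID = 8  # the 8x8 grid of tiles described above


def flat_index(row, col, width=GRID):
    """Row-major position of tile (row, col) among the flattened variables."""
    return row * width + col


def neighbour_offsets(row, col, width=GRID):
    """Index differences between a tile and its in-grid 4-neighbours."""
    here = flat_index(row, col, width)
    offsets = []
    for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):
        r, c = row + dr, col + dc
        if 0 <= r < width and 0 <= c < width:
            offsets.append(flat_index(r, c, width) - here)
    return sorted(offsets)


print(neighbour_offsets(3, 4))  # interior tile: [-8, -1, 1, 8]
print(neighbour_offsets(0, 0))  # corner tile: [1, 8]
```

Left and right neighbours sit 1 position away in the flattened ordering, and
the tiles above and below sit 8 away, which is where the four off-diagonal
bands come from.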

The curious squiggles in all the other blocks of this matrix are unusual. I
don't know what's going on there. Maybe something interesting.

[1] [http://www.ayasdi.com/](http://www.ayasdi.com/)

[2]
[http://en.wikipedia.org/wiki/Topological_data_analysis](http://en.wikipedia.org/wiki/Topological_data_analysis)

[3]
[https://www.wolframcloud.com/objects/c7927909-448d-4502-9c1a...](https://www.wolframcloud.com/objects/c7927909-448d-4502-9c1a-ff65243c1f5a)

[4]
[https://archive.ics.uci.edu/ml/machine-learning-databases/op...](https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.names)

~~~
taliesinb
It turns out that the graph is a bit more interesting when you can understand
what part of the original image each node comes from. I've color coded each
variable, along with a little legend image:

[http://imgur.com/TeOspAj](http://imgur.com/TeOspAj)

~~~
jburroni
This idea is actually very interesting!

------
boombard
Nice article. You could consider maximum spanning trees as a way to prune your
correlation graph; they are very effective at suggesting underlying structure
or kinetics of a system. Just use the minimum spanning tree algorithm with the
inverse of your correlation.

[1]
[http://en.wikipedia.org/wiki/Spanning_tree](http://en.wikipedia.org/wiki/Spanning_tree)
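
A rough sketch of the spanning-tree suggestion in plain Python, in case it
helps. It uses Kruskal's algorithm with the distance 1 - |r| as one possible
reading of "inverse of your correlation" (that choice is an assumption; any
transform that reverses the ordering of |r| picks the same edges when there
are no ties). The toy matrix is made up for illustration.

```python
def maximum_spanning_tree(corr):
    """Kruskal's algorithm on distances 1 - |r|; returns a list of (i, j, r)."""
    n = len(corr)
    # Smallest distance first, i.e. strongest correlation first.
    edges = sorted(
        (1.0 - abs(corr[i][j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))  # union-find forest used to detect cycles

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # edge joins two separate components
            parent[root_i] = root_j
            tree.append((i, j, corr[i][j]))
    return tree


# Hypothetical 4-variable correlation matrix.
corr = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
tree = maximum_spanning_tree(corr)
print(tree)  # [(0, 1, 0.9), (0, 2, 0.8), (2, 3, 0.3)]
```

In this toy case the tree keeps the weak 0.3 edge because it is the only way
to reach variable 3, and drops the 0.7 edge between variables 1 and 2 because
it would close a cycle.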

Another approach is to use PCA on the adjacency matrix. This can generate
interesting clusters based on the latent variables. At the risk of
self-promotion, I co-authored a paper on this technique which validated known
pathways in a metabolic network.

[2]
[http://www.biomedcentral.com/1471-2105/13/197](http://www.biomedcentral.com/1471-2105/13/197)
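
For a feel of what PCA on an adjacency matrix can look like, here is a small
sketch in plain Python. It is not the procedure from the paper in [2]:
treating each row of the matrix as an observation, centring the columns, and
finding the first component by power iteration are all assumptions made for
illustration, and the two-block matrix is invented.

```python
def leading_component(adj, iterations=100):
    """First principal axis of the column-centred matrix, plus row scores."""
    n = len(adj)
    means = [sum(adj[i][j] for i in range(n)) / n for j in range(n)]
    centred = [[adj[i][j] - means[j] for j in range(n)] for i in range(n)]

    def cov_times(v):
        # (X^T X) v computed as X^T (X v); the 1/(n-1) factor is dropped
        # because it does not change the direction of the eigenvector.
        xv = [sum(centred[i][j] * v[j] for j in range(n)) for i in range(n)]
        return [sum(centred[i][j] * xv[i] for i in range(n)) for j in range(n)]

    v = [float(j + 1) for j in range(n)]  # arbitrary non-constant start
    for _ in range(iterations):
        w = cov_times(v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # start vector had no overlap with any component
            break
        v = [x / norm for x in w]

    # Project each node (row) onto the leading axis.
    scores = [sum(centred[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v, scores


# Hypothetical weighted adjacency: two tight groups, weak links between them.
strong, weak = 1.0, 0.1
adj = [
    [strong if (i < 3) == (j < 3) else weak for j in range(6)]
    for i in range(6)
]
axis, scores = leading_component(adj)
groups = [score > 0 for score in scores]
print(groups)
```

Nodes in the same group land on the same side of zero along the first
component, so the sign of the score already separates the two blocks here.
Real data would need more components and a proper linear algebra library.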

Anyway this is a great field to explore, glad to see it getting traction on
HN!

~~~
isani
A maximum spanning tree might be misleading, as it's easy to interpret a
missing edge as no correlation. When building a tree, weak correlations may be
included out of necessity, while stronger ones that lead to cycles are
omitted.

If several dimensions are correlated just about equally strongly, you can get
very different trees based on small random variation. There's no guarantee
that all significant correlations are displayed, or that correlated dimensions
are visually close to one another.

~~~
boombard
I agree, it's not perfect - just a useful abstraction. Just the same as
arbitrary thresholds for correlation or a p<0.05 significance level - often
you lose information but gain insight. From personal experience I've seen
MSTs map out underlying structures that validate classical chemical kinetics
of a system in a logical path: something that would not have been apparent in
ordinary thresholding approaches.

Basically IMO it's good to use all of these techniques together to get a good
picture of your system. In the end the greatest limitation is our human
cognition to interpret the results, which frankly needs all the help it can
get.

~~~
jburroni
Thank you for the feedback. I prefer to use a graph instead of a tree because
I want to spot clusters of relations.
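
As a rough illustration of the thresholded-graph approach (not the code from
the article), the sketch below keeps every edge with |r| at or above a cutoff
and reports connected components as a crude stand-in for clusters. The
function names, the toy matrix, and the use of connected components instead of
a real community detection algorithm are all assumptions.

```python
def threshold_graph(corr, threshold):
    """Adjacency lists keeping an edge wherever |r| >= threshold."""
    n = len(corr)
    return {
        i: [j for j in range(n) if j != i and abs(corr[i][j]) >= threshold]
        for i in range(n)
    }


def connected_components(graph):
    """Groups of nodes joined by some path, found by depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(group))
    return components


# Hypothetical 5-variable correlation matrix.
corr = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.3, 0.2],
    [0.1, 0.2, 0.3, 1.0, 0.6],
    [0.0, 0.1, 0.2, 0.6, 1.0],
]
results = {
    t: connected_components(threshold_graph(corr, t))
    for t in (0.25, 0.5, 0.85)
}
for t, groups in results.items():
    print(t, groups)
# 0.25 [[0, 1, 2, 3, 4]]
# 0.5 [[0, 1, 2], [3, 4]]
# 0.85 [[0, 1], [2], [3], [4]]
```

Unlike a tree, the graph at threshold 0.5 keeps all three edges inside the
first group, so the cluster shows up as a triangle instead of a chain.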

------
jcheng
Nicely explained. I'm new to igraph but this was such a perfect opportunity to
build a Shiny app, I couldn't resist:

[https://jcheng.shinyapps.io/corgraph/](https://jcheng.shinyapps.io/corgraph/)

~~~
jburroni
This is very cool. I've used IPython with its interactive capabilities to
analyse the graph in the same way you did.

------
dmourati
Fix the position of the red circles and then increment the threshold.

