Why Topological Data Analysis Works

devilsdounut · on May 5, 2015

I'd love to see a success story of this type of analysis outside of their canned examples. I keep seeing them use the same datasets over and over again without any real benchmarks to state of the art. Its amazing how a data product is being sold without any empirical studies or benchmark datasets.

Ayasdi seems successful to me in that it has a lot of flash and their results make intuitive sense, but I don't understand how a practicing data scientist would use this.

cvsv · on May 5, 2015

I've seen it used "in the wild" in this publication: "Topographical transcriptome mapping of the mouse medial ganglionic eminence by spatially resolved RNA-seq" http://genomebiology.com/2014/15/10/486/abstract (pdf: http://linnarssonlab.org/pdf/Genome%20Biology%202014.pdf ) to cluster gene expression samples from mouse brain.

I'm working on a project where I'm using similar methods, but not from Ayasdi, to study cyclic phenomena in high-dimensional data.

devilsdounut · on May 5, 2015

Ok, so here it's used to cluster. There are tons of benchmark clustering datasets. Never seen it used on any of those.

anthony_bak · on May 7, 2015

I'm a data scientist and mathematician at Ayasdi.

For a variety of reasons we don't participate in competitive data science - however - we do a lot of actual science that you can use to get some idea on the power of the technique:

http://www.ayasdi.com/resources/#publications

Work using related techniques but not directly Ayasdi's implementations of TDA can be found here (although for some reason the page hasn't been updated in the last year):

http://appliedtopology.org/preprints/

Although you'll have to take my word for it - we are frequently brought in when people have tried the standard techniques and have run out of ideas. In those cases we have to show that we can improve on those models. We do that by discovering how some outcome of interest is spread (localized) in the data. You can think of this as finding different local descriptions of a phenomena. Many methods will fine the dominant description but miss more subtle variations - we find a more complete picture.

For example, if you're looking at fraud (credit card, login etc) velocity-based measures may catch machine-assisted fraud but miss manual and more sophisticated fraud. Different kinds of fraud have different (local) descriptions.

From a data scientist workflow this means that you (1) select features. (2) create topological summaries.(3) find regions where fraud is localized (4) find underlying reasons why the fraud is being localized in these groups (5) use this information to build better models (exactly how you use this information is another long post).

It's important to note that I didn't have to think of the reasons for all of the different types of fraud ahead of time - the groupings and shape summary guide my feature search (to be fair this is usually an iterative process initial features --> shape localizations --> more/different features etc.).

I ansered a similar question as part of a kdnuggets interview. I'll excerpt it here:

Q3. What are the unique benefits of Topological Data Analysis (TDA) over other approaches?

Topology is the study and description of shape. In Big Data problems, shape arises because you have a notion of similarity or distance between data points. This can be something like Euclidean distance, correlations, a weighted graph distance or even something more esoteric. Shape is exploited in machine learning by bringing in some additional information such as “my data has well defined clusters or classes”; “this outcome is linear “; or “my signal is periodic”. Then you would use specialized tools to apply models based on this information.

Topology adds the ability to understand and describe the shape without imposing additional model information, which can be biased and misleading. This leads to a number of concrete benefits, such as predictive model improvement and a better understanding of your data.

This seems like a trivial point, but can be key to solving complex problems with a high degree of accuracy. A simple example of this comes from hospital predictive models. Hospitals want to measure how sick people are and collect a variety of clinical information (blood pressure, heart rate, temperature, breathing rate, oxygen levels etc.) or genetic information (gene expression levels). Typically, they fit a linear regression model that predicts how “sick” patients are. The underlying assumption is that there is a near linear relationship between symptoms and “sickness”.

One of Ayasdi’s academic collaborators took gene expression data for people at different stages of malaria. When examined using TDA, he found patients all lying on a circle sitting inside of a high dimensional space (~1000 features). While in retrospect the circle is obvious, your path from being healthy to sickness and back to healthy does not track up and down through the same set of symptoms, and yet nobody had thought to look for the circle.

Most real world data sets that I look at are larger and more complicated than this example and we find a variety of structures — cluster, flares, loops and higher dimensional structures — all appearing in a single data set. It is nearly impossible to guess or hypothesize the right structures ahead of time, and TDA is a tool to understand your data in an unbiased way, revealing its true complexity.

The complete interview is here:

http://www.ayasdi.com/blog/topology/kdnuggets-interview-anth...

fizixer · on May 5, 2015

The key paragraph "The projection is visualized as ... pictured as below." is very ambiguous, and completely missing the explanation of how the data was split into red, blue, and indigo clusters.

IndianAstronaut · on May 5, 2015

Isn't response surface methodology a form of topological data analysis? IIRC, it isn't used much since it has poor predictive power.

avani · on May 5, 2015

The key with Ayasdi's work is that they manage to layer the TDA with different ML filters, which does stunningly well for the datasets they like talking about.

I also second the notion to read papers from the Carlsson lab, particularly one of the more application-oriented papers such as this one: http://www.nature.com/srep/2013/130207/srep01236/pdf/srep012...

rch · on May 5, 2015

The company puts a lot of energy into making Gunnar's work accessible - maybe too much. Dig into his papers sometime.

omaranto · on May 5, 2015

Could you explain what you mean by "maybe too much"?

rch · on May 6, 2015

Just that the material featured most prominently on the site might be too imprecise for mathematicians and scientists with significant technical expertise. It would be nice if there was a convenient link I could pass along to informed, but highly skeptical, people with a genuine interest in understanding the platform on a deeper level.