Ayasdi's notion of topological data analysis has to be the most overhyped piece ...

abak · on July 21, 2014

Disclaimer: I am in the research group at Ayasdi.

What we do is simple in many ways. But before I make specific comments I'd like to point out that simplicity is a virtue not a demerit. To pull from another area of Mathematics - the derivative is a simple idea - it's just the slope of a tangent line. Any yet, much of modern physics is the result of understanding and applying this idea. With simplicity the difficulty is understanding how and why you use the technique - rather than describing the technique itself.

When you say "simplistic notion of topological invariance" I think you may be talking about persistent homology which is not currently the primary product sold by Ayasdi. I disagree that homology is simplistic - it's one of the central tools in modern mathematics.

In any case we sidestep the whole notion of inferring the neighborhood structure focus on creating meaningful (open) covers of the data.

Instead of finding a neighborhood structure (which you can think of as a particular choice of covering of your data by metric balls) we create (open) covers that summarize some important aspect of the data. I mean summary in a technical sense that is beyond the scope of this comment. (I have some videos on youtube that address this and other issues. I think the most recent is the best but there is some important material in the earlier videos that I don't repeat in the later ones https://www.youtube.com/results?search_query=anthony+bak I apologize for the vanity post)

I can briefly describe one technique for generating covers that we use instead of neighborhood structure but the details of how this fits into the bigger picture are best seen from materials on the Ayasdi resources page or from the videos above.

In the simplest (and most common) case we use a function on the data to get a map from the data to the real line. Using the function we induce a cover on our data from a cover on the real line by taking inverse images of the sets in the real line cover. This data cover is too coarse for most purposes so we break large sets into smaller sets by clustering within each inverse image set. In this way we build a useful cover of the space. To finish, we calculate the "nerve" of this cover to convert our (complex, high dimensional) space into a combinatorial object called a simplicial complex that "remembers" geometric and topological features of the original space (while forgetting others).

Why this is the right thing to do is covered in the video and on the resources page. It's ok to be confused about the "why?".

I don't see how this is a parameter sweep and I suspect again that you're talking about persistent homology not the nerve construction described here.

Stepping back for a moment, topological spaces comprise a far richer set of spaces than manifolds. In my experience real world data almost never looks like it's sampled from a manifold and the tools we need to describe what is happening need to far less rigidity than those coming from manifold learning. In particular, we want tools that make as few assumptions (manifold, homogeneity, statistical) as possible. As far as shape goes topology has (arguably) the most relaxed notion of shape in mathematics - so the fewest assumptions are needed to study the shape of the data.

As an aside I'll mention that in the video I show an example of using Topology a la Ayasdi to do manifold learning. We find a Klein bottle glued inside a Sphere along a singular set. One of the reasons this is a nice example is that it was also solved using tools from manifold learning - but the methods required knowing the local structure of the singularity. None of those assumptions were used in the topological reconstruction - we didn't need to know ahead of time what we were looking for.

I go through a bunch of examples in the videos of using these ideas on actual data. I believe I show some examples of telco churn data, insurance fraud, and mobil phone parkinsons detection. These comprise a small selection of what you can find poking around the Ayasdi resources page but go well beyond the NKI cancer data set you refer to (although, I personally find that example compelling).

Finally, I also like the other approaches you mention but I see them as complimentary not oppositional to the topological approach - and I think generally speaking there is a fair bit of overlap between the various communities. In particular, the Hodge theoretic analysis on simplicial complexes is really nice and Yuan Yao, who is a coauthor on the Hodge Ranking paper, was a postdoc with Gunnar Carlsson (Ayasdi cofounder) and I just was visiting him in Beijing. I regularly talk with collaborators of Larry Wasserman and his graduate students. Looking further a field, Vin de Silva, one of the inventors of isomap, was also a postdoc with Gunnar Carlsson and is very active in the Topological Data community. In fact, on a technical level, like I mentioned above, the topological framework can use manifold learning techniques (such as isomap) to help create the topology used as part of the nerve construction. So the fields cross fertilize technical results as well. Yes, like you say, there is exciting things going on, and we are constantly integrating those ideas into our product.

What all of these methods share is a desire to bring richer geometric (broadly speaking) toolset to modern data problems. I see particular value in the topological approach but support other work on bringing geometry to data.

micro_cam · on July 21, 2014

Thanks for the answer. I'll take a look at more ayasdi papers and things but part of my distaste is that so much of the public material is non technical making marketing stuff that claims revolutionary broad applicability and makes it hard to tell what you guys are actually selling. Telling that you can prove it should work and I confused about the why is total BS and ensures I will never pay for your product...your just using advanced math to obfuscate.

The history of data analysis indicates exactly the opposite. Methods are shown often not shown to work theoretically for years after they are accepted to work well in real world data analysis. Lots of popular methods may not even come with theoretical guarantees and lots of theoretical guarantees are useless or misleading because they depend on assumptions about the data which are rarely true or have other issues.

But I have to say your still just sidestepping how you actually do the neighborhood learning now by calling it open covers (I just used the term neighborhood structure to keep it less jargony for this audience). How do you map the text documents mentioned in the article to real space? If you are just integrating isomaps and other standard techniques and the added value is the simplicial complex vis that is fine but you aren't developing any new math.

The procedure you are describing is similar to hierarchical clustering and will suffer from similar sensitivity to selection of the initial splits and any parameters. Manifold forests, for example, are also hierarchical but used a bagged ensemble to partially addresses this. I'd like to see more public work from you guys on combating this sort of overfitting and sensitivity...I just picked on persistent homology because it is one of the only things topology seems to add in this area.

This stuff is really cool and has generated useful results. If you guys just want to be the main consulting company for doing manifold learning that is great but marketing articles like this that try to claim you're the exclusive purveyors of some new math is turning a large portion of the community away.

abak · on July 22, 2014

The process I outlined is sometimes called "Mapper" and is from published literature:

http://comptop.stanford.edu/u/preprints/mapperPBG.pdf

The more general concept of calculating the nerve of a covering is standard topology material where the technique is used to create combinatorial representations of topological spaces. http://en.wikipedia.org/wiki/Nerve_of_an_open_covering

TDA is a framework not an example or a method. It applies anywhere that you can define a notion of distance or similarity along with a function (not nec. continuous) on your data. It's hard to think of a data situation where that doesn't apply. That's why we can handle so many different data types. The method has very few assumptions or requirements.

If you are saying that for some specific example you saw somewhere (or maybe it's in the article?) we didn't tell you the metric and function I'm afraid I can't help you - I am not myself familiar with the details of the text analysis you're referring to. But generally speaking we are open about exactly what metric and function we've used.

I am not obfuscating how we create covers (what you're calling neighborhood learning) - in fact I spelled it out exactly. We pull back the cover from the real line (period!). On the real line we typically cover so that either all covers are the same size or contain the same number of points (In the software you pick the number of sets in the cover, their percentage overlap, and if you want to have them same size on the real line or contain the same number (approx) of data points). How we cover the real line is a choice and isn't part of what TDA is. There are other reasonable choices.

I think compared to most companies we are in fact open and transparent. The part of the software that isn't open primarily deals with how we've scaled this algorithm to deal with large datasets. The "mathy" parts are all either published or we give customers the formulas in our documentation.

I agree that in the real world we use things that aren't proven to work. Typically there is some range of applicability but we can't pin down the requirements for when a technique will work and when it won't - for instance, we can prove something for gaussian noise but in practice it works more broadly.

We don't use advanced math to obfuscate this issue and we certainly confront this issue on a daily basis. The details of exactly where our methods apply is complicated (what function?, how general a space can we use?) and to some degree unknown, but since topology itself has very flexible foundations we believe that lots (most) data problems lend themselves to being understood using these methods. This is clear without needing a formal proof of applicability.

I think one of the reasons that there's confusion is that TDA doesn't neatly fit into existing analytical boxes. It's not clustering, dimensionality reduction, manifold learning or feature selection, It's different enough that when you say it sounds like hierarchical clustering I think you're missing the point of what a topological summary is. We are also not doing manifold learning (we do not need underlying manifold assumptions)

The way I think of it is as a framework for analytics. In the simplest case It takes a metric space and a real valued function and produces a geometric summary. You can create the topology using any method you like (similarity spaces, standard metrics, manifold learning, metric learning, social network graph distance, some other method you know or invent) and you can use any function you might know or invent (mean of point coordinates, distance from separating hyperplane in a SVM, local curvature estimates, age of person, page rank of a node in a graph etc.). As long as the metric and the function were "sensible" for your data you'll extract some geometric/topological truth about your data.

It's not that we have a new method like a new kind of SVM or a different kind of neural network. We have a very generic way for extracting the (up till now) overlooked geometric/topological content of existing methods. This geometric information adds fidelity to existing methods (like all good frameworks, it not only uses but improves existing methods). This one way to think about what Ayadi does.

(As an aside I'll point out that higher fidelity with existing methods also means that you can build more accurate models and we are currently automating the process of model improvement using TDA).

The kinds of methods you describe (manifold forests, hierarchical clustering etc.) are all methods and for the most part you can input them into the topological framework as well. TDA doesn't fix all of your problems (parameter selection, over fitting etc.) but instead gives you more fidelity and geometric information about what you've chosen to do. We've built in many standard methods into the software but allow you to extend them with your own custom ones if you choose.

In terms of "new math" or not I guess I don't really understand the complaint. Articles on TDA are published in peer reviewed math and science journals and are done by mathematicians (Some prominent mathematicians doing and publishing on TDA: Gunnar Carlsson, Stanford Math. Robert Ghrist, UPenn Math, Shmuel Weinberger UChicago Math, John Harer, Duke Math. Robert MacPherson, Institute for Advanced Study. Konstantin Mischaikow, Rutgers Math.). People from Ayasdi publish and present results to other mathematicians who consider it new. For example, Gunnar just published a review article describing TDA in Acta Numerica: http://journals.cambridge.org/action/displayJournal?jid=ANU (subscription required). Some of this he invented, some others published elsewhere, but people in the community consider it new.

Using topology to understand point cloud data is new - methods aren't copied from existing mathematical literature even if we are inspired by results in geometry, topology, algebraic topology, and algebraic geometry. Just because we work by analogy doesn't make this not new math - in fact - I would argue that's how most "new math" is created across all of mathematics.

As a postscript I would add that we also come up with new metrics and functions to solve problems we come across. But this is not the core of what TDA is.

Edit: Added point on data types in 3rd paragraph.

barakm · on July 21, 2014

Any links for the curious?

micro_cam · on July 21, 2014

If you want to read up on ayasdi go straight to the Source: http://www.ayasdi.com/resources/ (click on publications)

There are probably some talks by Gunnar Carlsson around on the web too.

There is also a great overview from Larry Wassermann on his outstanding blog: http://normaldeviate.wordpress.com/2012/07/01/topological-da...

Other examples of manifold inspired math in data analysis that I find more useful as they describe ways to solve complex problems with simple systems:

http://web.stanford.edu/~yyye/hodgeRank2011.pdf http://www.cs.jhu.edu/~misha/Fall09/Hirani03.pdf

A really interesting method for learning manifolds from heterogeneous data (decision trees can be extended for categorical data, text data, almost anything) that doesn't boil down to just choosing a metric: http://research.microsoft.com/apps/pubs/default.aspx?id=1555...

Edit: Also they do have a really slick UI and I like the math involved. I wanted to like their product however I worry that the math just serves to lend legitimacy when the results aren't as exciting/promising as advances being made in other areas of data analysis on similar datasets.

Also them keeping it proprietary makes it hard for people to do legitimate comparisons. We implemented some of our own versions of their stuff to tory on other cancer datasets but it just didn't work very well because of numerous issues with lack normalization, batch effect, bias and noise.

mathgenius · on July 21, 2014

Here is the mother load: http://www.math.upenn.edu/~ghrist/notes.html

I would recommend diving in with wild abandon, Ie. don't be afraid of the heavy sounding math, Ghrist does a great job of holding your hand.

This field is called "Applied Topology" and is distinct from manifold learning, IIUC.

The photo from the article has a mention of "barcodes": http://www.math.upenn.edu/~ghrist/preprints/barcodes.pdf

The fascinating this about all of this is how topological (connectivity) information is extracted from messy data.

rgrieselhuber · on July 21, 2014

I'm usually not that guy but thought it might be helpful: the correct phrase is "mother lode."