
Scientific Data Has Become So Complex, We Have to Invent New Math - gurupradhan
http://www.wired.com/2013/10/topology-data-sets/all/
======
micro_cam
Ayasdi's notion of topological data analysis has to be the most overhyped
piece of "math" yet this century. Take a simplistic notion of topological
invariance, gloss over how you actually infer things like neighborhood
structure to get the topology, slap a slick UI on it, show it works on one
out-of-date breast cancer data set that is tiny by modern standards, charge
through the nose, and see how far you can ride it.

It is kind of shocking that they are now selling it as beneficial for
heterogeneous data. Actually learning any structure from high-dimensional,
heterogeneous, messy data in a way that doesn't just overfit is the crux, and,
last time I looked, their public papers and other materials suggest they are
doing nothing beyond some standard distance metrics. There is lots of other
exciting work going on in this area. Topological invariance may provide some
assurance that what you are seeing is real, but it kind of just becomes a
fancy name for doing parameter sweeps.

There are really exciting things going on in manifold learning and the
application of (smooth and topological) manifold theory to modern data. This
includes applications of discrete exterior calculus to abstract simplicial
complexes that may not even represent a manifold, allowing efficient
algorithms for problems like ranking from graph flows (e.g. PageRank) and
other cool things. Maybe Ayasdi has some new proprietary magic up their
sleeves, but everything I've seen from them so far is just a showy wrapper
around really basic math.
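To make the ranking-from-graph-flows bit concrete: the PageRank example boils
down to a power iteration on a transition matrix. A toy sketch in NumPy (my
own illustration, nothing DEC-specific):

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10):
    """Power iteration for PageRank. adj[i, j] = 1 if there is an edge
    from node j to node i."""
    n = adj.shape[0]
    out = adj.sum(axis=0)
    # Column-stochastic transition matrix; dangling nodes jump uniformly.
    P = np.where(out > 0, adj / np.where(out == 0, 1, out), 1.0 / n)
    r = np.full(n, 1.0 / n)
    while True:
        r_next = damping * (P @ r) + (1 - damping) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

# Tiny 3-node cycle: 0 -> 1 -> 2 -> 0; by symmetry all ranks are 1/3.
adj = np.array([[0, 0, 1],
                [1, 0, 0],
                [0, 1, 0]], dtype=float)
ranks = pagerank(adj)
print(ranks)
```

The real DEC/Hodge-theoretic ranking work generalizes this kind of thing to
flows on simplicial complexes, but the flavor is the same.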

~~~
barakm
Any links for the curious?

~~~
mathgenius
Here is the mother load:
[http://www.math.upenn.edu/~ghrist/notes.html](http://www.math.upenn.edu/~ghrist/notes.html)

I would recommend diving in with wild abandon, i.e. don't be afraid of the
heavy-sounding math; Ghrist does a great job of holding your hand.

This field is called "Applied Topology" and is distinct from manifold
learning, IIUC.

The photo from the article has a mention of "barcodes":
[http://www.math.upenn.edu/~ghrist/preprints/barcodes.pdf](http://www.math.upenn.edu/~ghrist/preprints/barcodes.pdf)

The fascinating thing about all of this is how topological (connectivity)
information is extracted from messy data.
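To give a flavor of the simplest case: the 0-dimensional part of a barcode
just records when clusters of points merge as you grow a distance threshold,
which a plain union-find can compute. A toy sketch (my own, not from Ghrist's
notes):

```python
from itertools import combinations

def h0_barcode(points, dist):
    """0-dimensional persistence: every point's component is born at
    scale 0; a component 'dies' when it merges into another one. Returns
    the sorted death scales of the finite bars."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            deaths.append(d)  # one component dies at scale d
    return deaths

euclid = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
# Two well-separated pairs: two early merges, then one late merge.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(h0_barcode(pts, euclid))  # -> [1.0, 1.0, 10.0]
```

The long gap between the deaths at scale 1 and scale 10 is the "persistent"
signal that the data has two clusters; higher-dimensional barcodes detect
loops and voids the same way.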

~~~
rgrieselhuber
I'm usually not that guy but thought it might be helpful: the correct phrase
is "mother lode."

------
dalke
"Today’s big data is noisy, unstructured, and dynamic rather than static."

As was yesterday's big data. I've been reading through the literature from the
1960s and 1970s, and it complains about almost exactly the same issues,
including coming up with new mathematical techniques.

Actually, "10 million words recorded during just under 200,000 trials" was
completely within the techniques of the 1960s, which was a combination of
human curation and computer data management. See for example the work at
Chemical Abstracts and ISI in handling chemical publication data. I happen to
know that the ISI organization in turn was based on citation techniques first
developed in the late 1800s for tracking legal judgements - which is the topic
discussed here.

------
j2kun
This is a really fascinating subject, and as of last year I had read through
essentially all the major papers in the area, implemented the central
algorithms, and given a few more-or-less casual talks on it.

There are two things that strike me as very different about topological data
analysis. The first and primary thing is that it's not a silver bullet by any
means. It's not like most machine learning and data mining, where you pick
some parameters and wham-bam-thank-you-ma'am you have 80% accuracy. No, this
kind of analysis gives you _qualitative_ features of your data, and after all
the topology is done there's still years of work before you arrive at a
mathematical model that admits an algorithm that has accuracy comparable to
the state of the art.

An interesting case study in this is Carlsson's work on texture
classification. They ran their 3x3 image-patch database through their
topological analysis algorithms and it said (essentially) "your data looks
like a Klein bottle!" That sounds interesting and fun, but how can you
actually use that to do anything? This follow-up paper [1] then gave an actual
model of image patches as a Klein bottle, but even then there was still a ton
of math to trudge through (specifically, functional analysis and differential
geometry) before they actually got to an algorithm, and it still wasn't
strictly better than the leading methods. The real benefit seems to be
entirely in the science part of everything. They have a novel hypothesis for
how the world acts which is demonstrably accurate. It's not like a support
vector machine with well-chosen features where it's as useful as a black box
in terms of understanding the world.

The second thing is that this field has legitimate and nontrivial results
that come from category theory, eventually leading either to algorithms or to
proofs that no algorithm can be hoped for. Carlsson has a nice survey for
starters [2]. So
everyone who loves to talk about category theory and programming on the
internet now has a less abstract quiver of arrows (pun intended, if you're
familiar with this research area).

[1]:
[http://comptop.stanford.edu/u/preprints/KleinBottleTextureAnalysis.pdf](http://comptop.stanford.edu/u/preprints/KleinBottleTextureAnalysis.pdf)
[2]:
[http://www.ayasdi.com/_downloads/Topology_and_Data.pdf](http://www.ayasdi.com/_downloads/Topology_and_Data.pdf)

~~~
cr4zy
Sensitivity analysis -
[http://en.wikipedia.org/wiki/Sensitivity_analysis](http://en.wikipedia.org/wiki/Sensitivity_analysis)
- is an interesting tradeoff between generating a specific algorithm and a
black box, in that it tells you which parameters had the biggest impact on
the output, given some model like an SVM.
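A toy one-at-a-time version, with a made-up linear model standing in for the
SVM:

```python
def sensitivity(model, baseline, delta=1e-3):
    """One-at-a-time sensitivity: bump each input by delta and report the
    resulting rate of change in the model's output."""
    base = model(baseline)
    effects = {}
    for k, v in baseline.items():
        bumped = dict(baseline, **{k: v + delta})
        effects[k] = (model(bumped) - base) / delta
    return effects

# Hypothetical model: output depends strongly on x, weakly on y.
model = lambda p: 10 * p["x"] + 0.1 * p["y"]
print(sensitivity(model, {"x": 1.0, "y": 1.0}))
```

For a nonlinear model you would sweep each parameter over a range (or use a
variance-based method like Sobol indices) rather than a single bump, but the
idea is the same.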

------
skierscott
I'm an undergrad in a lab that focuses on compressed sensing (and we're
getting more into sparse machine learning).

Sparsity is a way of saying that most of the data you collect is related in
some way. One advantage of sparsity (and there are many more) is sampling at
sub-Nyquist rates[0] and still being able to reconstruct the signal precisely.
Why? Because the signal is sparse in the Fourier domain -- there are only a
few frequencies present in the signal.
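As a quick illustration (my own toy example): a two-tone signal looks dense
in the time domain but has only two nonzero Fourier coefficients:

```python
import numpy as np

# A signal built from two sinusoids is "sparse" in the Fourier domain
# even though every time-domain sample is nonzero.
n = 256
t = np.arange(n) / n
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

spectrum = np.fft.rfft(signal)
active = np.flatnonzero(np.abs(spectrum) > 1e-6)
print(active)  # only bins 5 and 12 carry energy
```

Compressed sensing exploits exactly this: if only k of the n Fourier
coefficients are nonzero, you can get away with on the order of k·log(n)
measurements instead of n.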

Reading this article, I had some issues with this quote:

> According to Candes, you could take half the samples (16) and rerun the
> test. If it is positive, the infected person is in this group; if negative,
> the culprit is in the other half. You can continue to whittle down the
> possibilities by once again dividing the group in half and running the test
> again.

That's just a simple binary search[1]. Instead, why not collect other
information? Collect some genetic information, lifestyle habits, living
location, etc. Then I believe (that's a strong I believe) you could optimize
even further by testing patients that have some of these traits in common --
you'd have groups of unequal sizes based on who you think has the disease.
This corresponds to sparsity. Most traits don't matter but a select few do.
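For reference, the halving procedure from the quote is easy to sketch (toy
code; `test` stands in for the pooled blood test, and I assume exactly one
infected sample):

```python
def find_infected(samples, test):
    """Locate the single positive sample by repeatedly pooling and halving.
    `test(group)` returns True iff the pooled group contains the positive."""
    lo, hi = 0, len(samples)
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if test(samples[lo:mid]):
            hi = mid  # positive is in the first half
        else:
            lo = mid  # positive is in the second half
    return samples[lo], tests

samples = list(range(32))
infected = 23
found, n_tests = find_infected(samples, lambda g: infected in g)
print(found, n_tests)  # finds sample 23 with 5 pooled tests, not 32 individual ones
```

With multiple positives or noisy tests this simple halving breaks down, which
is where the sparsity-based group-testing designs come in.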

To see real-world results of sparsity, we're able to reconstruct precise
approximations after we assume our input is sparse in the wavelet basis
(corresponding to many areas of the same color). We're able to push this
sampling rate down to ~6% for simple images by doing a "sample, compute"
method. The reconstructed approximation and sampled locations are shown in
this image[2].

[0]:[https://en.wikipedia.org/wiki/Nyquist_rate](https://en.wikipedia.org/wiki/Nyquist_rate)

[1]:[https://en.wikipedia.org/wiki/Binary_search](https://en.wikipedia.org/wiki/Binary_search)

[2]:[http://imgur.com/PmYpze6](http://imgur.com/PmYpze6)

------
jlewis_st
I'm the lead frontend developer at Ayasdi, and I figure I should take this
opportunity to let the HN community know that we're actively hiring in
engineering :)

If you're a frontend engineer with an interest in machine learning and data
visualization, Ayasdi is a great place to build those skills. We use Backbone
and D3 as our core stack on the client side, and we're pushing at the edge of
what's possible when building rich data analysis applications for the web.
(Incidentally, we're talking about our approach at the next Bay Area D3 User
Group meeting [http://www.meetup.com/Bay-Area-d3-User-
Group/events/19268574...](http://www.meetup.com/Bay-Area-d3-User-
Group/events/192685742/))

Feel free to contact me directly (contact info in profile) if you're
interested in learning more about Ayasdi!

~~~
j2kun
I'm a PhD student in pure mathematics (studying theoretical computer science),
and I have a list of industry companies I might want to work at in the event I
don't get a satisfactory post-doc.

To what extent do the folks at Ayasdi engage in the research side of the
picture? Would you be interested in hiring someone with both strong
programming skills and mathematical knowhow to work on both the research and
development sides?

~~~
abak
I'm in the research group at Ayasdi and can say that we actively do research
in both TDA and machine learning, and we have developed an unpublished but
cutting-edge fusion of the two fields: using TDA to enhance traditional
machine learning models (not yet publicly released).

I regularly attend academic conferences, give university colloquia, and speak
at industry events (e.g., in the last month I gave two workshop sessions at
ICML in Beijing, gave a lecture at a drug design conference in NJ, and
presented at a Big Data conference in NYC).

We do basic research, but as a small company (and small group) we have to
keep our eyes on the commercial applications and financial implications of
our work. For myself, I've found this environment stimulating, and if
anything it has helped my research productivity.

It's worth pointing out that Gunnar Carlsson - who is mentioned in the article
- is both one of the inventors of TDA and someone who participates in the
daily development of the company and product. We have an open seating plan,
and he currently sits diagonally from my desk, next to one of our most junior
employees (not in the research group). In that sense, there's opportunity for
people to step up as they desire, with access to people who aren't just
leaders in TDA but who invented it.

As a company we'd consider someone like yourself for a variety of roles -
Data Science, Research, or Engineering. In all of those groups people develop
novel techniques that could be qualified as "research", and all groups
contain people with PhDs in technical fields - some coming straight out of
grad school, some after postdocs, and some after holding faculty positions.
We are broadminded regarding academic backgrounds, and there are people with
undergraduate degrees making novel contributions as well. One of our presales
engineers (who has a PhD) just had a paper published in Nature - which I
mention to show how broadly research capability is spread through the
organization.

If you want to talk more feel free to reach out to me.

~~~
j2kun
Sounds great! I actually met Gunnar briefly at a conference at UChicago two
years ago (doubt he remembers me). I'll reach out to you on LinkedIn.

------
zeratul
I would say that this is true for unstructured data where the experts who
participated in creating the data are no longer available.

I've noticed that by talking with domain experts - for example, radiologists
who read head MRIs - you can greatly simplify the problem. It usually comes
at the cost of having experts annotate the data, which might be too expensive
for some businesses. But then, I think, the data gets analyzed only by
scholars until they build the expert knowledge into their algorithms.

That's at least the pattern in healthcare data. The first one to notice a
paper or source code like that dominates the market - this is just my
conjecture.

------
thaumasiotes
This identical headline could have been run hundreds of years ago. The process
has never stopped. We always need new math.

~~~
MaysonL
Heck, the Egyptians originally developed geometry to help confirm the
boundaries of the fields after the Nile's annual floods.

~~~
rotten
That is certainly an early application of geometry. I'm not sure we know
whether geometry was first developed as an applied math, a pure one, or a
mystical one. In any case, geometry demonstrably came in handy, which
assuredly helped fund further research and development.

------
gone35
_Recht and Candes may champion approaches like compressed sensing, while
Carlsson and Coifman align themselves more with the topological approach, but
fundamentally, these two methods are complementary rather than competitive._

Minor nitpick, but I think the article (and this quote in particular) might
be mischaracterizing the difference. Every form of manifold learning relies
first and foremost on the ability to compute some kind of low-dimensional
embedding that preserves structure, and compressed sensing -- along with its
more recent generalizations; see for instance the work of Wakin _et al._ [1]
-- clarifies the conditions and transformations that make such embeddings
feasible.

[1]
[http://papers.nips.cc/paper/3191-random-projections-for-manifold-learning.pdf](http://papers.nips.cc/paper/3191-random-projections-for-manifold-learning.pdf)
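As a quick illustration of the random-projection idea (my own toy sketch, not
the construction from [1]): a Gaussian random projection approximately
preserves pairwise distances, which is the structure-preservation property
everything else builds on:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 points on a low-dimensional curve embedded in 200 ambient dimensions.
t = np.linspace(0, 1, 100)
X = np.stack([np.sin(3 * t), np.cos(3 * t)] + [t ** k for k in range(2, 200)],
             axis=1)

# Random Gaussian projection down to d = 20 dimensions, scaled so that
# squared distances are preserved in expectation.
d = 20
P = rng.normal(size=(X.shape[1], d)) / np.sqrt(d)
Y = X @ P

def pdist(A):
    """All pairwise Euclidean distances."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

iu = np.triu_indices(len(X), 1)
ratio = pdist(Y)[iu] / pdist(X)[iu]
print(ratio.mean(), ratio.std())  # mean near 1, modest spread
```

This is just the Johnson-Lindenstrauss phenomenon; the manifold-learning
results sharpen it by showing how few random projections suffice when the
data lies near a low-dimensional manifold.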

------
spingsprong
"10 million words recorded during just under 200,000 trials"

So an average of 50 words per trial? I would have expected a higher number.

------
gurupradhan
Any suggestions, links, or research on how TDA or compressed sensing can be
used to make sense of spatiotemporal datasets (gridded climatology time series
data, change/anomaly detection in time-lapsed satellite imagery etc.)?

