

Making Text Mining Accessible to Any Developer & Non-Expert - wfaler
http://blog.recursivity.com/post/13108173847/making-text-mining-accessible-to-any-developer

======
law
I have a good amount of experience in natural language processing and machine
learning, and I don't think offering an API that provides easy access to the
algorithms is the right solution. The major algorithms in text classification
aren't that complex to implement, and can be done in a few hundred lines.
Moreover, all of the most widely used, widely tested, and reliable algorithms
have public implementations that are readily adaptable to your needs. And
that's the problem: _understanding your needs_.
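
For a sense of scale: a multinomial naive Bayes classifier, one of the classic text classification algorithms, fits comfortably in a few dozen lines of plain Python. This is a minimal sketch on made-up toy data, not anyone's production implementation:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)      # label -> document count
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        words = doc.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.label_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(doc_count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = NaiveBayesTextClassifier().fit(
    ["cheap pills buy now", "limited offer buy cheap",
     "meeting at noon", "project status meeting"],
    ["spam", "spam", "ham", "ham"],
)
print(clf.predict("buy cheap pills"))        # spam
print(clf.predict("status of the meeting"))  # ham
```

The hard part, as the comment says, isn't this code; it's knowing whether naive Bayes is even the right model for your data.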

Understanding your needs (or your company's needs) is where people with PhDs
make their money. Machine learning isn't a panacea, and we won't be seeing a
one-size-fits-all approach for a while. Even though data has become more
accessible, it might be noisy, incomplete, streaming, partially labeled, etc.
This is why understanding exactly what you're trying to model with these
algorithms is crucial and why "just applying" them is impractical at best and
misleading at worst.

~~~
PaulHoule
Well. Here's my take.

There are a number of text analysis SaaS offerings such as OpenCalais,
AlchemyAPI, Zemanta, and OpenAmplify. They've all got impressive science under
the hood, but none of them are accurate enough to be useful.

I spend most of my time these days thinking about why that is and what to do
about it.

For systems to do better, they'll need to incorporate world knowledge; they'll
need to test different interpretations of a text and select the ones that
"make sense". This is likely to be a form of statistical inference rather than
Cyc style logic.

Based on some systems I've worked with, I'd estimate that a space-optimized
"background" knowledge base that can estimate satisfiability in the common
sense domain is on the order of 10-100 GB. It will puff out to at least an
order of magnitude beyond that in the process of creating it.

Few users will have the ability to create a KB of that type, and it would be a
serious thing to download and install.

Hosting the services of that kind of system in a SaaS manner makes a lot of
sense.

~~~
zeratul
Without looking under the hood I'd say there could be at least four reasons
why they fail (based on what most of the NLP literature is lacking):

\- did not remove contradicting information from the training sets (two very
similar vectors having contradicting labels)

\- did not try enough feature selection algorithms

\- did not estimate ALL learner parameters using the training sets with
internal CV

\- did not include domain knowledge

The last one refers to Paul Houle's comment. Besides using tools like OpenCyc,
WordNet, or UMLS, there are many other ways to embed domain expertise in an
automated classification process. Injecting semantically related features into
a vector representation of a document is extremely difficult, and forward
feature selection doesn't work well for sparse and noisy data.
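
The first bullet — near-identical vectors carrying conflicting labels — is cheap to check before training. A minimal sketch for the exact-duplicate case (a real version would compare vectors by a similarity threshold rather than strict equality):

```python
from collections import defaultdict

def find_contradictions(vectors, labels):
    """Group training examples by identical feature vectors and
    flag any group that carries more than one distinct label."""
    groups = defaultdict(set)
    for vec, label in zip(vectors, labels):
        groups[tuple(vec)].add(label)
    return {vec: lbls for vec, lbls in groups.items() if len(lbls) > 1}

X = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1]]
y = ["pos", "neg", "neg", "pos"]
print(find_contradictions(X, y))  # flags the duplicated (1, 0, 1) vector
```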

~~~
PaulHoule
The curse of dimensionality is the worst problem affecting machine learning.

Customers don't want to create training sets large enough to train text
classifiers; often the number of documents they have for a category is too
small to train on.

As for semantic indexing, it was hard to do in 2005. In 2011 it's easy.
DBpedia and Freebase are a chromosome map for the human memome. With large
amounts of instance information, it's possible to do things that a big rulebox
can't.

These tools are aiming for the market segment that Cyc aimed for, but will use
very different methodologies.

------
mark_l_watson
Text mining is one of my specialties and I have had similar ideas for a
business. One thing that has stopped me is the awesome (and free for about 50K
API calls a day) OpenCalais service that does entity extraction and
identifies some relationships between entities in input text.

For document clustering there are many good open source tools that people and
companies can use. The commercial LingPipe product does a good job at
sentiment analysis.

Obtaining, scrubbing, and generally curating the data is a pain point that
users of this system may still need to worry about.

I wish this new business good luck, but there are definitely some real
problems to work around. Perhaps we should go into business together :-)

~~~
wfaler
True, but problems are there to be solved. :) We've done a lot of work around
data normalization/scrubbing from a multitude of sources as part of a sister
project, so I'm fairly confident about this aspect. Curation and
classification is another issue, but we have a few ideas.

As for business, you never know, just let me try to get off this Ramen-based
diet first. ;)

------
zeratul
Text mining: most of the time is spent on gathering the data, curating the
data, and working with your annotators (domain experts). After that, you try
_a dozen or more_ ways to convert documents into a matrix format. Then, you try
_a dozen or more_ feature selection algorithms. Finally, the icing on the
cake: you get to try _a dozen or more_ machine learning algorithms, each
having _a dozen or more parameters_ to be estimated.
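
Those multiplicative choices add up fast. A toy illustration of the size of that search space (the stage menus here are made up and shortened; real ones would each hold a dozen or more entries, as described above):

```python
from itertools import product

# Hypothetical (shortened) menus for each stage of the pipeline.
stages = {
    "vectorization":     ["bag-of-words", "tf-idf", "lsa"],
    "feature_selection": ["chi-squared", "mutual-info", "none"],
    "learner":           ["naive-bayes", "linear-svm", "random-forest"],
    "hyperparameters":   ["grid-a", "grid-b", "grid-c"],
}
configs = list(product(*stages.values()))
print(len(configs))  # 3 ** 4 = 81 candidate pipelines, even with tiny menus
print(12 ** 4)       # 20736 with "a dozen" options at every stage
```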

Yep, it would be very nice to have an API that would do all that for you. But
that would require a group of at least 10 ML experts + 10 NLP experts + 20
domain experts. Still, I think it's doable and worth chipping away at in
small steps.

Marginal thoughts: _decision trees_ are very bad for large p >> n problems -
random forest might work, though. If TextMinr doesn't have radial SVM with
auto-tuning then it will not cope with more difficult problems.

~~~
wfaler
Appreciate your comment. And you're right about decision trees, though they
can be useful for simpler problems, such as classifying documents into
categories, where you have "sub-categories" and the parent categories are
mutually exclusive.

Decision trees are really only useful for problems where there is mutual
exclusion between the different options, so they are definitely no silver
bullet.

~~~
law
Fully grown decision trees are notorious for their risk of overfitting your
training set. To mitigate that, you have to decide whether to grow them out
completely and then prune them, stop growing after a specific depth, train the
trees using a random subset of
features in the feature space (and then how many do you select? Do you use the
square root? Logarithm?), etc. Even then, what are you using to choose when a
node splits? Information gain? Information gain ratio? Gini index? What about
when you have a feature like credit card numbers, which are unique?

These are all choices that the user has to make. For something as seemingly
simple as a decision tree, you can see why _some_ knowledge is required before
embarking on any machine learning mission.
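
The split criteria above differ only in the impurity function being minimized. A minimal sketch of two of them — Gini index and entropy (the basis of information gain) — scored on a made-up candidate split:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: probability two random picks disagree."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; information gain = parent minus this."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_impurity(left, right, impurity):
    """Size-weighted impurity of a candidate split under a criterion."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

left, right = ["spam", "spam", "ham"], ["ham", "ham", "ham"]
print(round(split_impurity(left, right, gini), 3))     # 0.222
print(round(split_impurity(left, right, entropy), 3))  # 0.459
```

The credit-card-number pathology falls out of this: a unique-valued feature drives every branch's impurity to zero while generalizing to nothing, which is why corrections like information gain ratio exist.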

------
vyrotek
Great! I was just investigating AlchemyAPI and OpenCalais this weekend. I look
forward to trying TextMinr.

TextMinr seems like a combination of those services along with the idea of
80legs.com? Is that correct?

~~~
wfaler
I'm not familiar with 80legs, but the main idea is to "democratise" access to
this technology and make it easily accessible to anyone who wants to build
something on it.

The initial few beta releases will probably be aimed at people who want to
build applications themselves, by providing them with APIs, but hopefully we'll
build out the analytics side of things soon enough so it becomes accessible to
non-techies as well.

------
tomwalsham
I'm really happy to see more people moving into this space.

I've used a number of different systems (OpenCalais, AlchemyAPI, Zemanta...)
in a variety of projects (Sentiment analysis, document classification...), and
what I've found thus far is that while each system works extremely well within
some restricted application classes, none come close to being general purpose
APIs for the myriad applications developers try to throw at them.

A couple of pain points I've encountered: needing a larger-than-expected
corpus to generate meaningful data, because the scope of the platform's
analysis is overly broad, and having no way to apply negative signals from
external sources. I find that a large quantity of logic tends to remain,
sitting rather redundantly on the application end, to post-filter what's
generated.

I don't pretend to understand the level of complexity involved or what's being
worked on currently (not an NLP guy), but I do think there's a huge space to
create publicly available text-mining services that can be applied more
effectively to narrow domains.

------
itmag
Machine learning seems to be a meme on the rise right now. What kind of
startups are possible in that domain?

~~~
marshallp
self driving cars

autonomous helicopters

automated analysis of satellite imagery

search engines

visual search engine (google goggles)

virtual assistant (siri)

speech recognition

document classification (spam detectors)

question answering systems (ibm watson)

ad placement (google adsense)

computer guided surgery

high throughput imaging (chemo/bioinformatics)

product recommendation (amazon/netflix)

actuarial science

industrial automation (inspection systems)

intelligent video surveillance

~~~
ralphc
I got most of these right away, but could you expand on what you mean by "high
throughput imaging (chemo/bioinformatics)"?

~~~
polyfractal
The high throughput imaging that I'm familiar with is in regards to cell
imaging. Cells are cultured in high-density plates (96-well or 384-well
dishes), each well is given a different experimental condition (drugs, RNAi,
etc) and then imaged on an automated microscope.

As you can imagine, this generates tons of data. Our lab did a
high-throughput screen of genetic mutants in neurons, and then used software
to quantify basic morphology such as neurite length, arborization, and cell
death.

Crystallographers will use a similar system to bathe their protein in billions
of compounds to find the right combination for crystallizing. Automated
cameras will capture images and try to identify which ones have crystallized
so the researcher doesn't have to do it by hand.

------
ggwicz
This would really be useful on many, many levels. Why has this access to such
data been so roped-off?

------
danso
I've found in my data-mining experience that the most interesting data (at
least on the Web) is not particularly easy to parse, even if you write
something that automates a form's POST submissions. The second difficult part
is normalizing it, as much web/text data is formatted for _display_ to humans,
which is quite different from data in an easily analyzable form.

So given that, it's just worth learning enough programming to do loops,
conditionals, and regexes to get what you want.
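
In that spirit, a small sketch of the loop-plus-regex normalization being described — stripping display markup and pulling a machine-readable number out of human-formatted text (the markup and price pattern are illustrative):

```python
import re

def normalize(html_row):
    """Turn a display-oriented HTML fragment into analyzable fields."""
    text = re.sub(r"<[^>]+>", " ", html_row)      # strip tags
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    m = re.search(r"\$([\d,]+(?:\.\d+)?)", text)  # price formatted for humans
    price = float(m.group(1).replace(",", "")) if m else None
    return text, price

print(normalize("<td><b>Widget</b></td><td>$1,299.99</td>"))
# → ('Widget $1,299.99', 1299.99)
```

Trivial on its own, but as the comment notes, real scraped data needs dozens of these small, source-specific rules.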

~~~
St-Clock
This is so true. I've been mining mailing lists and framework documentation
for my Ph.D. and most of my effort and time was spent normalizing the data.
Once that was done, classifying content and linking concepts was relatively
easy...

------
marshallp
This probably won't work, as Google found out with its Prediction API (it
hasn't been used much). There's already enough open source software out there
that's state of the art and easy to use.

There's good business to be had in selling data though which is where these
folks should probably divert their effort.

~~~
jfxberns
I don't know; ML has a lot of applications and the bar for most people to be
able to implement it is rather high. Lowering the bar so "mere mortals" can
have some serious infrastructure and data that's a mere API call away seems
pretty huge.

~~~
marshallp
The problem is that if you're not confident enough to get these systems
working yourself, you're probably not going to be confident enough in your
business to pay by the sip for someone else's api.

------
suivix
I honestly don't think this is feasible without regular expressions. There
are many minor details that make data mining work, and they have to be
custom-tailored to different problems.

