
Ask HN: How to use Machine Learning to extract facts from the text? - dartwing
Machine learning is evolving very quickly. What are the state-of-the-art techniques for automatically extracting specific facts from text?

Are there any open source projects focused on this task that you could recommend?
======
grizzles
Facts are simply assertions that have met some burden of proof. Determining
that threshold is a subjective exercise, not an objective one. I know you want
an algorithm to do this, but there is no sentient algorithm smart enough to do
this. So, from an epistemological perspective, you are basically asking - what
are the facts as determined by someone else?

The tragedy of subjectivity is that, for most people, some random person
ranting into a YouTube video for 15 minutes about, e.g., Hillary Clinton
constitutes "evidence" sufficient to determine fact.

~~~
dartwing
I define a fact as a structured piece of information that I need to extract
automatically for a specific domain. I do hope the extraction will work
automatically, since the project only makes sense if it does.

Eventually it will be measured for precision and recall using human judgement.
The quality of that judgement would greatly affect the improvement and
sustainability of the algorithm overall.
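
A minimal sketch of that measurement, assuming extracted facts and the
human-judged reference facts can both be represented as sets of triples. The
`Fact` alias, the function name, and the toy data are all invented here for
illustration:

```python
from typing import Set, Tuple

# Hypothetical representation: a fact is a (subject, predicate, object) triple.
Fact = Tuple[str, str, str]


def precision_recall(extracted: Set[Fact], gold: Set[Fact]) -> Tuple[float, float]:
    """Compare extracted facts against human-judged gold facts."""
    true_positives = len(extracted & gold)
    # Precision: share of extracted facts that the judges accepted.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of the judged facts that the extractor found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data with made-up company names, purely for illustration.
gold = {
    ("Acme", "acquired", "Globex"),
    ("Acme", "founded", "Initech"),
    ("Globex", "acquired", "Hooli"),
    ("Hooli", "founded", "Umbrella"),
}
extracted = {
    ("Acme", "acquired", "Globex"),
    ("Globex", "acquired", "Hooli"),
}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here every extracted fact is correct but only half of the judged facts were
found, which is the high-precision, low-recall situation.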

------
BjoernKW
What exactly is a fact? There's no easy answer to that question, particularly
with natural rather than formal languages. 'Facts' and statements depend on
context. The meaning of a natural language statement usually is derived from
these layers building on each other:

\- syntax (the structure of a sentence)

\- semantics (the isolated meaning of a sentence)

\- pragmatics (the meaning of a sentence in context)

Anaphora (references to previous sentences or concepts) can be particularly
nasty in this context.

Depending on the task at hand, chunk parsing could be a good first pass at
finding relevant phrases in unstructured textual data. There are numerous
libraries to accomplish that, for English and other Indo-European languages at
least.
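
To illustrate the idea, here is a rough sketch of a noun-phrase chunker in
plain Python. It assumes the tokens already carry Penn Treebank part-of-speech
tags (supplied by hand below; a real pipeline would get them from a tagger),
and the function name and chunk rule are made up for this example. Libraries
such as NLTK offer far more complete chunkers.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, Penn Treebank POS tag)


def noun_phrase_chunks(tagged: List[TaggedToken]) -> List[str]:
    """Greedy chunker: optional determiner, any adjectives, then one or more nouns."""
    chunks: List[str] = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        if tagged[i][1] == "DT":
            i += 1
        while i < n and tagged[i][1].startswith("JJ"):
            i += 1
        noun_start = i
        while i < n and tagged[i][1].startswith("NN"):
            i += 1
        if i > noun_start:
            chunks.append(" ".join(word for word, _ in tagged[start:i]))
        else:
            # No noun followed, so this was not a noun phrase; advance one token.
            i = start + 1
    return chunks


# Hand-tagged example sentence (tags are an assumption, not tagger output).
sentence = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumped", "VBD"), ("over", "IN"),
    ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]
print(noun_phrase_chunks(sentence))
```

The chunks are only candidate phrases; deciding which of them matter for a
given domain is a separate step.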

~~~
dartwing
A fact, for me, is structured information extracted from a document. My task
is to extract what I can from the documents of a specific domain. I am fine
with starting at high precision and low recall, I think. I need to try it in
practice and see whether the relevance of domain-specific search and automatic
validation can be improved with this approach.

~~~
BjoernKW
In that case the information extraction frameworks Apache UIMA and GATE might
be helpful, too.

~~~
dartwing
Thank you! Will take a look.

------
PaulHoule
This is a commercially oriented fact extraction system

[https://github.com/machinalis/iepy](https://github.com/machinalis/iepy)

that can be trained to get the kind of performance you would see in a text
extractor customized by the likes of BBN or Booz Allen Hamilton. You need
20,000 training samples to start getting good results.

~~~
dartwing
Thank you! Looks very interesting.

------
brad0
How do you define a fact?

As far as I understand it, symbolic AI back in the 80s was about building a
massive web of facts or "truths" that would be used to create a general AI.
They eventually ended up generating a bunch of contradictions.

~~~
dartwing
Very good question. I don't yet know enough about how to model this
correctly.

Currently I imagine that for a given domain I can create a text parser that
would extract facts in a standard format. An example could be: "object
predicate subject". I would then use the facts, mapped to their documents, for
relevant domain search and for validation of some basic statements in other
documents.

Not all statements require validation; I can focus only on those that were
parsed correctly with high confidence.
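
A rough sketch of how that could look, assuming a pattern-based parser. The
patterns, the confidence scores, the `Fact` class, and the document contents
are all hypothetical, and the triple is stored in the conventional
subject-predicate-object order:

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    confidence: float


# Hypothetical patterns: a strict one scored high, a loose one scored low.
PATTERNS = [
    (re.compile(r"^(?P<s>[A-Z]\w+) (?P<p>acquired|founded) (?P<o>[A-Z]\w+)\.$"), 0.9),
    (re.compile(r"(?P<s>[A-Z]\w+) .*?(?P<p>acquired|founded) .*?(?P<o>[A-Z]\w+)"), 0.4),
]


def extract_facts(sentence: str) -> List[Fact]:
    """Return the fact from the first matching pattern, or nothing."""
    for pattern, confidence in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return [Fact(match["s"], match["p"], match["o"], confidence)]
    return []


def build_index(
    docs: Dict[str, List[str]], min_confidence: float = 0.8
) -> Dict[Tuple[str, str, str], Set[str]]:
    """Map each high-confidence triple to the documents that state it."""
    index: Dict[Tuple[str, str, str], Set[str]] = defaultdict(set)
    for doc_id, sentences in docs.items():
        for sentence in sentences:
            for fact in extract_facts(sentence):
                if fact.confidence >= min_confidence:
                    index[(fact.subject, fact.predicate, fact.obj)].add(doc_id)
    return dict(index)


# Made-up documents, already split into sentences.
docs = {
    "doc1": ["Acme acquired Globex."],
    "doc2": ["Acme reportedly acquired Initech last year.", "Acme acquired Globex."],
}
index = build_index(docs)
print(index)
```

With the default threshold only the strictly matched statement survives, and
it is mapped to both documents; the loosely matched one is left out of
validation.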

------
DrNuke
I am looking for something similar for academic papers in PDF in my field, but
nothing really useful exists to automatise the extraction process, so the best
path is still to extract the data manually, homogenise it into a standard
protocol, and feed it to ML algos. Once a data protocol becomes a widespread
standard, maybe an ISO standard or similar, there is a chance automated
extraction will work at the fine level of detail that complex information
requires.

~~~
lwhsiao
One system for extracting information from PDFs is Fonduer[1], which is built
on the Snorkel framework from Stanford. It may be worth checking out for your
use case. Here's a blog post introducing it [2].

Disclosure: I worked on the project.

[1] [https://arxiv.org/abs/1703.05028](https://arxiv.org/abs/1703.05028)

[2]
[https://hazyresearch.github.io/snorkel/blog/fonduer.html](https://hazyresearch.github.io/snorkel/blog/fonduer.html)

~~~
dartwing
Thank you! Will look through.

------
dartwing
I am looking at SyntaxNet from Google. If there are other candidates worth
looking at, please let me know.

[https://github.com/tensorflow/models/tree/master/syntaxnet](https://github.com/tensorflow/models/tree/master/syntaxnet)

~~~
gtani
Depends on the corpus. If your problem fits a CoNLL task, you can read lots of
papers about it. If you can build on an existing Wikipedia entity/relation
graph, dictionary, or gazetteer, that's a big boost. For academic research
papers, look at citations for your input stream, then SVM on tf-idf bigrams.
If it's sentiment/quality analysis, that's another tack.
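
As a rough illustration of the gazetteer idea, the sketch below tags entities
by longest-match lookup against a tiny hand-made dictionary. The entries, the
entity types, and the function name are invented for the example; a real
gazetteer would be built from an existing resource.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer: lowercase surface form -> entity type.
GAZETTEER: Dict[str, str] = {
    "new york": "LOCATION",
    "new york times": "ORGANIZATION",
    "stanford": "ORGANIZATION",
}
MAX_SPAN = max(len(name.split()) for name in GAZETTEER)
PUNCTUATION = ".,;:!?"


def tag_entities(text: str) -> List[Tuple[str, str]]:
    """Scan left to right, preferring the longest gazetteer match at each position."""
    tokens = text.split()
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        for size in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size]).strip(PUNCTUATION)
            if candidate.lower() in GAZETTEER:
                found.append((candidate, GAZETTEER[candidate.lower()]))
                i += size
                break
        else:
            # Nothing in the gazetteer starts at this token.
            i += 1
    return found


print(tag_entities("She left Stanford for the New York Times."))
```

Trying the longest span first is what keeps "New York Times" from being
tagged as the shorter "New York".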

~~~
dartwing
Right now I'm hoping to extract entities and the relations between them to use
as facts for relevant domain-specific search and validation.

Reading up the articles, YCombinator included:
[https://blog.ycombinator.com/how-to-get-into-natural-languag...](https://blog.ycombinator.com/how-to-get-into-natural-language-processing/)

~~~
gtani
I think these PIs at UIUC communicate clearly about all the processes
necessary (Han wrote a good text on data mining, but probably outdated from
2011):

[http://xren7.web.engr.illinois.edu/www17-StructNet-part1.pdf](http://xren7.web.engr.illinois.edu/www17-StructNet-part1.pdf)

[http://xren7.web.engr.illinois.edu/cikm16-profile.pdf](http://xren7.web.engr.illinois.edu/cikm16-profile.pdf)

Also, Solr/Lucene/Elasticsearch indexes are good tools for filtering your
inputs and deciding what the unit doc will be (sentence, paragraph, numbered
section of a research paper, etc.).
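
To make the "unit doc" choice concrete, here is a toy in-memory stand-in for
such an index, not a substitute for the real tools. It assumes naive
regex-based splitting, and all function names and the sample text are made up
for this sketch:

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def split_units(text: str, unit: str = "sentence") -> List[str]:
    """Split text into the chosen unit doc: 'sentence' or 'paragraph'."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    else:
        # Naive split after ., ! or ? followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def build_inverted_index(units: List[str]) -> Dict[str, Set[int]]:
    """Map each lowercase term to the ids of the units that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for unit_id, unit in enumerate(units):
        for term in re.findall(r"\w+", unit.lower()):
            index[term].add(unit_id)
    return index


def search(index: Dict[str, Set[int]], units: List[str], terms: List[str]) -> List[str]:
    """Return the units that contain every query term."""
    ids = None
    for term in terms:
        hits = index.get(term.lower(), set())
        ids = hits if ids is None else ids & hits
    return [units[i] for i in sorted(ids or [])]


# Made-up text for illustration.
text = "Acme acquired Globex in 2010. The weather was mild. Globex later acquired Initech."
units = split_units(text, unit="sentence")
index = build_inverted_index(units)
print(search(index, units, ["acquired", "globex"]))
```

Only the units that pass the filter would then go on to the more expensive
extraction step.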

~~~
dartwing
Thank you! Reading through the papers.

