
Knowledge Extraction from Unstructured Texts (2016) - homarp
https://blog.heuritech.com/2016/04/15/knowledge-extraction-from-unstructured-texts/
======
lettergram
Interesting! I use a similar technique on
[https://hnprofile.com](https://hnprofile.com)

It differs in a few key ways, largely because I’m trying to solve a different
problem.

[https://hnprofile.com/learn-more](https://hnprofile.com/learn-more)

I’m curious how this approach to knowledge extraction, versus the goal of
summarization, holds up in practice. I suspect (as they mention in the
conclusion) that even though data is extracted, timing is everything. Language
and information change over time, and I’m curious how multiple similar-but-
different statements in the corpus would be represented. For instance, “I live
in Japan”, vs “I lived in Japan”, vs “we live in Japan”, but “I visit New
York”.

Where do I live? Who are “we”? These are still very difficult problems.
Summarization is a bit easier in that, for the most part, you’re just removing
superfluous information.
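
For illustration, here is a toy, regex-based extractor (nothing like a real
parser-based system, and not what either site actually uses) showing how tense
and subject could be carried along with each extracted triple instead of being
flattened away:

```python
import re

# Toy rule-based extractor: maps sentences like "I live in Japan" to
# (subject, predicate, object, tense) tuples. Purely illustrative --
# real systems use parsers and coreference resolution, not regexes.
PATTERN = re.compile(
    r"^(?P<subj>I|We|She|He|They)\s+"
    r"(?P<verb>live[ds]?|visit(?:ed|s)?)\s+(?:in\s+)?"
    r"(?P<obj>[A-Z][\w ]*)$"
)

def extract(sentence):
    m = PATTERN.match(sentence.strip().rstrip("."))
    if not m:
        return None
    verb = m.group("verb")
    tense = "past" if verb.endswith("ed") else "present"
    lemma = "live" if verb.startswith("live") else "visit"
    return (m.group("subj"), lemma, m.group("obj"), tense)

print(extract("I live in Japan."))   # ('I', 'live', 'Japan', 'present')
print(extract("I lived in Japan."))  # ('I', 'live', 'Japan', 'past')
print(extract("We live in Japan."))  # ('We', 'live', 'Japan', 'present')
print(extract("I visit New York."))  # ('I', 'visit', 'New York', 'present')
```

Even at this toy level you can see the problem: the four sentences share a
surface shape but differ in subject and tense, and any representation that
drops those qualifiers conflates them.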

~~~
miket
Here is how a relation extraction system would represent these two concepts
"live" (i.e. place of residence) and "visit" (a temporary location), as well
as the temporality of the statements.

[http://relex.diffbot.com:8085/?text=I%20live%20in%20Japan.%2...](http://relex.diffbot.com:8085/?text=I%20live%20in%20Japan.%20I%20visited%20New%20York).
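
Conceptually, such a system emits qualified relations rather than bare
triples: each (subject, predicate, object) carries qualifiers such as tense.
A minimal sketch of that kind of representation (the field and predicate names
here are illustrative, not Diffbot's actual output schema):

```python
from dataclasses import dataclass, field

@dataclass
class Relation:
    """A qualified relation: a bare (subject, predicate, object)
    triple plus qualifiers such as tense. Names are illustrative."""
    subject: str
    predicate: str
    object: str
    qualifiers: dict = field(default_factory=dict)

# "I live in Japan. I visited New York." might come out as:
facts = [
    Relation("I", "place_of_residence", "Japan", {"tense": "present"}),
    Relation("I", "visited_location", "New York", {"tense": "past"}),
]

# Downstream consumers can then filter on the qualifiers, e.g. keep
# only relations asserted in the present tense:
current = [r for r in facts if r.qualifiers.get("tense") == "present"]
```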

~~~
haddr
Nice! What model do you use?

------
nl
It's worth noting that this is a pretty old article.

The last 3 years have seen significant development of neural models for text
(and graph) processing, and almost all the state-of-the-art results listed
there are outdated.

Notably, modern large language models turn out to be very good on their own at
large parts of the knowledge extraction problem. OpenAI's GPT is state of the
art on the SNLI task[1], and GPT-2[2] is approaching human performance on
tasks similar to knowledge extraction.

[1] [https://openai.com/blog/language-unsupervised/](https://openai.com/blog/language-unsupervised/)

[2] [https://openai.com/blog/better-language-models/](https://openai.com/blog/better-language-models/)

~~~
riku_iki
SNLI is also not a knowledge extraction task...

~~~
nl
It's listed in this article, and inference is usually necessary for knowledge
extraction.

(And that previous sentence is a perfect example - you had to infer I was
speaking about SNLI, and can extract the knowledge that article is partially
about it)

------
dlkf
The article was a disappointment: only one sentence was dedicated to
motivating the business problem, and there was no description of the
application.

People act like the utility of this stuff is a priori obvious, but to me it's
really not. Unless you are trying to build a knowledge graph, what do you do
with these triples? (And even if you are, you need humans to independently vet
them for noise induced by the source material, as well as your algo. My
understanding is that information architects and annotators are essential to
building and maintaining Google's KG.)

Article summarization is often cited, and I find this unconvincing. Most
articles can't be reduced to a trite collection of triples. They have a
higher-level thesis that would be AI-complete to parse.

What are some other examples?

~~~
polm23
I think that in many cases this is genuinely useless, but there are some good
and even important applications of the process.

An example is health records. Doctors enter short notes in a text editor for a
lot of these, and being able to discover symptom associations using
information extraction is much faster than manually going over the data.

Here's a good presentation that covers some of this (PDF):

[https://people.csail.mit.edu/regina/talks/CNLP.pdf](https://people.csail.mit.edu/regina/talks/CNLP.pdf)

~~~
reubens
Thank you for the link. That was enlightening.

------
miket
You can try a live demo of Diffbot's knowledge extraction from text here:
[http://relex.diffbot.com:8085](http://relex.diffbot.com:8085)

~~~
wyldfire
First test that I tried went poorly IMO.

[http://relex.diffbot.com:8085/?text=Donny%20Trump%20lived%20...](http://relex.diffbot.com:8085/?text=Donny%20Trump%20lived%20in%20Manhattan%20and%20often%20travels%20to%20Kennebunkport%20and%20Palm%20Beach).

EDIT: Downvoters: Sorry this was unintentionally vaguely political.

------
hartator
Interesting. This is a hard subject. Not sure if it can still be useful to you
guys, but if you want free training data from our API
([https://serpapi.com/knowledge-graph](https://serpapi.com/knowledge-graph)),
we'll be happy to hook you up. Hit me up at julien - at - serpapi.com.

------
gauravphoenix
This is amazing. I can see that there will be an app or service which
automatically provides a summary or the key insights of lengthy articles or
books. That would be nice, because I hate clickbait articles.

I am looking at you, BuzzFeed.

------
mark_l_watson
Good survey article. This is also the problem I am working on [1].

[1] [http://kgcreator.com](http://kgcreator.com)

------
DecoPerson
I didn't read on, as the article is beyond me, but I believe the initial
example of extracting structured facts from a paragraph about Marie Curie is
partially incorrect.

The text doesn't say she was born in Poland. It says she was Polish. It also
doesn't say her nationality at death was French, it says she was naturalized-
French at some point. It also states she conducted _pioneering_ research on
radioactivity, which is not captured by the example output.

The example also shows an inference that her job is "researcher." This is a
questionable inference. Imagine this conversation between two humans: "He's a
hard-surface texturing artist, but he coded sometimes when he needed to." "Oh,
so his jobs are art and coding?" "No, his job is art, but he can code."

As humans, we are thinking about role assignments and expectations vs people
committing acts. What ultimately defines a "job"?

The point I'm trying to make is that "She was a researcher." and "She did
research." should not result in the same output.

There's obviously a lot of inference required to discern any structure from
the text (like assuming "she" refers to Marie Curie), but I believe these
inferences should be recognisable -- captured in the output in a way that they
can be queried and reasoned about.
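
One way to make those inferences recognisable would be to store each extracted
fact with a flag recording whether it was stated directly or derived, along
with the source span, so downstream queries can filter on it. A hypothetical
sketch (not any real system's schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    object: str
    inferred: bool  # True if derived by the system, not directly stated
    source: str     # text span the fact was extracted from

facts = [
    # "She was a researcher." asserts the occupation directly:
    Fact("Marie Curie", "occupation", "researcher", False,
         "She was a researcher."),
    # "She did research." only licenses an inference:
    Fact("Marie Curie", "occupation", "researcher", True,
         "She did research."),
]

# A consumer can now query for directly asserted facts only:
asserted = [f for f in facts if not f.inferred]
```

With this shape, "She was a researcher." and "She did research." no longer
collapse into the same output, even though both yield the same bare triple.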

~~~
miket
Much of the knowledge that humans derive from reading text is implicit rather
than explicit. The derived knowledge is also context-dependent and
probabilistic, i.e. these are not binary facts; we assign a degree of
confidence to them.

In the context of that sentence "She was a .. physicist and chemist who
conducted .. research on radioactivity.", I think most people would say a
physicist or a chemist who conducts research _is_ a researcher. In other
contexts, such as in your example, that would be a questionable inference.
What you're describing is why natural language understanding is hard--it's
context-dependent and not syntactic.

"She did research."
[http://relex.diffbot.com:8085/?text=She%20did%20research](http://relex.diffbot.com:8085/?text=She%20did%20research).
"She was a researcher."
[http://relex.diffbot.com:8085/?text=She%20was%20a%20research...](http://relex.diffbot.com:8085/?text=She%20was%20a%20researcher).

These two sentences return different outputs from a state-of-the-art relation
extraction system.

The initial example:
[http://relex.diffbot.com:8085/?text=Marie%20Curie%20was%20bo...](http://relex.diffbot.com:8085/?text=Marie%20Curie%20was%20born%20on%20November%201867%2C%207.%20She%20was%20a%20Polish%20and%20naturalized%20French%20physicist%20and%20chemist%20who%20conducted%20pioneering%20research%20on%20radioactivity).

In your example, being a coder is not inferred:
[http://relex.diffbot.com:8085/?text=He%27s%20a%20hard-surfac...](http://relex.diffbot.com:8085/?text=He%27s%20a%20hard-surface%20texturing%20artist%2C%20but%20he%20coded%20sometimes%20when%20he%20needed%20to).

------
gumby
[2016]

------
PaulHoule
I wish I could vote this one up more than once.

