Knowledge Extraction from Unstructured Texts (2016) (heuritech.com)
272 points by homarp 55 days ago | 32 comments



Interesting! I use a similar technique on https://hnprofile.com

It differs in a few key ways, largely because I’m trying to solve a different problem.

https://hnprofile.com/learn-more

I’m curious how this solution, which targets knowledge extraction rather than summarization, holds up in practice. I suspect (as they mention in the conclusion) that even though data is extracted, timing is everything. Language and information change over time, and I’m curious how multiple sets of similar but different data in the corpus would be represented. For instance, “I live in Japan” vs “I lived in Japan” vs “we live in Japan”, but “I visit New York”.

Where do I live, who are “we”, etc. are still very difficult problems. Summarizations are a bit easier in that for the most part you’re just removing superfluous information.


Here is how a relation extraction system would represent these two concepts "live" (i.e. place of residence) and "visit" (a temporary location), as well as the temporality of the statements.

http://relex.diffbot.com:8085/?text=I%20live%20in%20Japan.%2....
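
For anyone who wants to picture it without clicking through, here is a minimal Python sketch of the idea. This is not Diffbot's actual output format (that isn't shown in this thread); the `Relation` class, its field names and the predicate labels are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Relation:
    """One extracted fact. Every name here is an illustrative assumption."""
    subject: str
    predicate: str    # e.g. "place_of_residence" vs "visited_location"
    obj: str
    tense: str        # "present" or "past", taken from the verb
    source_text: str  # the sentence the fact was extracted from


def current_residence(relations: List[Relation], subject: str) -> Optional[str]:
    """Return where the subject lives now, ignoring past residences and visits."""
    for r in relations:
        if (
            r.subject == subject
            and r.predicate == "place_of_residence"
            and r.tense == "present"
        ):
            return r.obj
    return None


# The four sentences from the parent comment, encoded by hand.
relations = [
    Relation("I", "place_of_residence", "Japan", "present", "I live in Japan"),
    Relation("I", "place_of_residence", "Japan", "past", "I lived in Japan"),
    Relation("we", "place_of_residence", "Japan", "present", "we live in Japan"),
    Relation("I", "visited_location", "New York", "present", "I visit New York"),
]

print(current_residence(relations, "I"))
```

Note that the sketch stores the subject exactly as written. Working out who "I" and "we" actually refer to is a separate (and harder) coreference problem that this does not attempt.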


Nice! What model do you use?


Can't say I'm enthused to see your algorithmic estimation of my 'mood' for the past year offered as a commodity.


I share your sentiment. But I believe his website is an expression of free speech and should not be suppressed.

Perhaps I will use throw away accounts more frequently in future or just post less regularly.


Free speech has never before included a history of moods over a period of time for any person.

It's uncomfortable.

Besides, don't the posts belong to either the user or to HN, so that those mood graphs can be GDPR'd?


You might want to look into how to make your website GDPR compliant. I don’t think you can create a database over random EU citizens’ interests, mood and activities without permission. (Edited)


Would be an interesting exercise to determine what exposure gp has to GDPR here. There's no direct relationship to the user, just analysis of publicly accessible content.

Wonder how archive.org fares.


From what I understand, it doesn’t matter that it is publicly accessible content, you still need consent from each individual that is a GDPR data subject before you process their personal information. It also doesn’t seem to matter whether it’s a US company that doesn’t do business in Europe, although you have to wonder what teeth the law has under such circumstances. IANAL though.


I usually just let the GDPR people say what they will without debate. People seem to think GDPR makes a difference; part of the point of my website was to show that it doesn’t. The fact is, I’m a small company, based solely outside EU jurisdiction, and the accounts I monitor (and make public) are anonymous.

For reference, I’m based out of the U.S. and don’t officially conduct business in the EU. As it stands, I’m completely unbound by the laws. Even EU customers would have to conduct business with me via U.S. dollars on a U.S. hosted server. I.e. they and myself would be bound by U.S. law, as we’d be in U.S. jurisdiction.

That being said, I’m also not hosting “data from an EU citizen”. I have no way of knowing where these anonymous users are posting from. If I applied this to people whose identities I can confirm, then sure. However, as it stands, I have no way of knowing who “jcims” or anyone else is.


https://gdpr.eu/companies-outside-of-europe/

Article 3.1 states that the GDPR applies to organizations that are based in the EU even if the data are being stored or used outside of the EU. Article 3.2 goes even further and applies the law to organizations that are not in the EU if either of two conditions is met: the organization offers goods or services to people in the EU, or the organization monitors their online behavior. (Article 3.3 refers to more unusual scenarios, such as in EU embassies.)

You definitely are covered by the GDPR - it's just a matter of how and when someone will take action against you.


Except (from same page)

The second exception is for organizations with fewer than 250 employees. Small and medium-sized enterprises (SMEs) are not totally exempt from the GDPR, but the regulation does free them from record-keeping obligations in most cases (see Article 30.5).


In the linked page: s/moral/morale


The article was a disappointment: only one sentence was dedicated to motivating the business problem, and there was no description of the application.

People act like the utility of this stuff is a priori obvious, but to me it's really not. Unless you are trying to build a knowledge graph, what do you do with these triples? (And even if you are, you need humans to independently vet them for noise induced by the source material, as well as your algo. My understanding is that information architects and annotators are essential to building and maintaining Google's KG.)

Article summarization is often cited and I find this unconvincing. Most articles can't be reduced to a trite collection of triples. They have a higher level thesis that would be AI complete to parse.

What are some other examples?


Extracting triplets is useful for explainability. You know where you took the fact from, so it helps build a consistent view of the world. In fact this consistency can be used as a signal for training your algorithms in an unsupervised way. It is useful for building a search engine because it helps create a relevant index automatically.

Once you have a way to construct the knowledge graph automatically, you can make better recommendations. One recommendation that usually works well is to infer the graph query that would have produced the results the user was interested in. Then you can run the query and offer the user an exhaustive set of similar results. Basically it's a great way to intuit the intent of the user: it helps you get the "Why".

Another application, a small extension of this, is to use quad stores instead of triplets, so you can have facts about the facts. This is very useful for highlighting fake news and fake users, or for identifying the point of view of various individuals for further targeting.

The most promising aspect of this type of work is that graph query languages are a kind of easy programming language, and are a good stepping stone toward more useful programming languages. This is probably one step on the path toward program induction.
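
To make the "facts about the facts" point concrete, here is a rough Python sketch, assuming the fourth element of each quad is simply the source the statement came from. The `Quad` type, the `conflicting_values` helper and the `doc_a`/`doc_b`/`doc_c` source names are all hypothetical, and the data is a toy example, not output from any real system.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source: str  # fourth element: where the statement was found (assumed here)


def conflicting_values(
    quads: List[Quad], subject: str, predicate: str
) -> Dict[str, Set[str]]:
    """Map each claimed object to the set of sources asserting it."""
    claims: Dict[str, Set[str]] = defaultdict(set)
    for q in quads:
        if q.subject == subject and q.predicate == predicate:
            claims[q.obj].add(q.source)
    return dict(claims)


# Toy data: two sources agree, one disagrees.
quads = [
    Quad("Marie Curie", "born_in", "Warsaw", "doc_a"),
    Quad("Marie Curie", "born_in", "Warsaw", "doc_b"),
    Quad("Marie Curie", "born_in", "Paris", "doc_c"),
]

for value, sources in conflicting_values(quads, "Marie Curie", "born_in").items():
    print(value, sorted(sources))
```

Because every claim keeps its source, you can always answer "where did this come from?", and a source that keeps landing in the minority across many facts is at least a candidate for closer inspection.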


I think that in many cases this is genuinely useless, but there are some good and even important applications of the process.

An example is health records. Doctors enter short notes in a text editor for a lot of these, and being able to extract information and discover symptom associations automatically is much faster than manually going over the data.

Here's a good presentation that covers some of this (PDF):

https://people.csail.mit.edu/regina/talks/CNLP.pdf


Thank you for the link. That was enlightening.


You might not be the intended audience.


The main problem these days is information is generated much more quickly than it can be classified. Short of forcing everyone in the world to adopt a standard for semantic markup for everything they publish, which will almost certainly never happen, the alternative is to use algorithms to extract and classify entities and relationships in unstructured data.

Logistically it would be near impossible to manually vet every single automatically classified piece of data, but it would be possible to assume a % accuracy based on known training sets representing a realistic cross section of the type of data being classified. Once the data is categorised and related across numerous taxonomies, within a certain degree of error, a great deal of useful actions are possible.
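
As a sketch of what "assume a % accuracy" could look like in practice: compare the classifier's labels against a hand-labelled sample and report the accuracy with an interval around it. The Wilson score interval used below is my choice for the illustration, not something from the article, and the function name is made up.

```python
import math
from typing import Sequence, Tuple


def accuracy_with_interval(
    predicted: Sequence[str], gold: Sequence[str], z: float = 1.96
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using the Wilson score interval.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if len(predicted) != len(gold) or len(gold) == 0:
        raise ValueError("predicted and gold must be non-empty and the same length")

    n = len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    p_hat = correct / n

    # Wilson score interval for a binomial proportion.
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom

    return p_hat, max(0.0, centre - half_width), min(1.0, centre + half_width)


# Toy hand-labelled sample; real use needs far more than four items.
predicted = ["person", "place", "person", "person"]
gold = ["person", "place", "place", "person"]
acc, low, high = accuracy_with_interval(predicted, gold)
print(f"accuracy={acc:.2f}, interval=({low:.2f}, {high:.2f})")
```

The interval is only meaningful if the labelled sample really is a realistic cross section of the data being classified, which is the hard part.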

Say you want to fetch all opinion pieces where the subject is a certain political party, then run a sentiment analysis to see if they are largely favourable or unfavourable, then drill down to see what aspects cause opinion to sway (could it be the gender of the author, the age, the country) etc. Even with a certain degree of error, this kind of analysis would be very valuable for strategists, reporters, researchers, etc.

The same could be done for stock markets with companies, shareholders and traders all being able to delve into the masses of data being poured daily into the web. Through network analysis (traversing the knowledge graph) you'd be able to identify clusters of similar information, to spot a smear campaign or a sudden surge of bot-generated "fake news" in the making or otherwise identify risks and opportunities in near real-time. To be able to do this, you need some way of extracting meaning from random blobs of text.

There's already a raft of services out there that offer some form of knowledge extraction and sentiment analysis - improving the accuracy of the underlying algorithms gives a competitive edge and makes the services more valuable over time. This includes developing AI that can interpret the higher level thesis with reasonable accuracy (keep in mind many humans fail miserably at this too) and how to avoid purposeful attempts to fool or spam the system. These are all problems that can be tackled as data research continues.

Even for a researcher in any field, being able to upload your thesis, have it analysed and classified, then using the metadata to explore similar research would be advantageous. You may find connections between unlikely fields of research that might have been overlooked with standard keyword relevancy searches, which could go on to spawn new areas to study or new considerations for existing research. Quite often advancements come when disparate fields are bridged, and having tools to help automate the process (even if they aren't 100% accurate) would be very useful.

I doubt we're close to replacing the human element just yet, but having at our disposal better tools to sift through the petabytes of crap on the internet is a highly sought after and valuable outcome, and much more than an intellectual exercise.


> People act like the utility of this stuff is a priori obvious, but to me it's really not. Unless you are trying to build a knowledge graph, what do you do with these triples?

I have been very interested in this subject for years, the motivation to me seems clear, so perhaps I can shed some light on this for you. Being able to extract information from raw text would have a huge number of business applications, and could completely change the way human beings interact with machines. Besides that, there are a number of questions which are unanswered which bear on fundamental research topics in deep learning, artificial intelligence, and natural language processing.

The idea behind the triples is they provide additional data points which can be used to improve the efficiency & accuracy of information extraction.

This is a super hard subject at the forefront of modern research, and at this point it's just a hobby of mine to keep up to date on the state of affairs, having long ago given up the side project I had been working on. I must not have been keeping up very well, because I had not seen the link before despite it being from 2016.


It's worth noting that this is a pretty old article.

The last 3 years have seen significant development of neural models for text (and graph) processing and almost all the state-of-the-art results listed there are outdated.

Notably, modern large language models turn out to be very good on their own at large parts of the knowledge extraction problem. OpenAI's GPT is state of the art on the SNLI task[1], and GPT-2[2] is approaching human performance on tasks similar to knowledge extraction.

[1] https://openai.com/blog/language-unsupervised/

[2] https://openai.com/blog/better-language-models/


SNLI is also not a knowledge extraction task.


It's listed in this article, and inference is usually necessary for knowledge extraction.

(And that previous sentence is a perfect example - you had to infer I was speaking about SNLI, and can extract the knowledge that the article is partially about it)


You can try a live demo of Diffbot's knowledge extraction from text here: http://relex.diffbot.com:8085


First test that I tried went poorly IMO.

http://relex.diffbot.com:8085/?text=Donny%20Trump%20lived%20....

EDIT: Downvoters: Sorry this was unintentionally vaguely political.


Interesting. This is a hard subject. Not sure if it can be useful anymore to you guys, but if you want free training data from our API (https://serpapi.com/knowledge-graph), we'll be happy to hook you up. Hit me at julien - at - serpapi.com.


This is amazing. I can see that there will be an app or service which will automatically provide a summary or the key insights of lengthy articles or books. That will be nice because I hate clickbait articles.

I am looking at you, BuzzFeed.


Good survey article. This is also the problem I am working on [1]

[1] http://kgcreator.com


I didn't read on, as the article is beyond me, but I believe the initial example of extracting structured facts from a paragraph about Marie Curie is partially incorrect.

The text doesn't say she was born in Poland. It says she was Polish. It also doesn't say her nationality at death was French, it says she was naturalized-French at some point. It also states she conducted _pioneering_ research on radioactivity, which is not captured by the example output.

The example also shows an inference that her job is "researcher." This is a questionable inference. Imagine this conversation between two humans: "He's a hard-surface texturing artist, but he coded sometimes when he needed to." "Oh so his jobs are art and coding?" "No his job is art, but he can code."

As humans, we are thinking about role assignments and expectations vs people committing acts. What ultimately defines a "job"?

The point I'm trying to make is that "She was a researcher." and "She did research." should not result in the same output.

There's obviously a lot of inference required to discern any structure from the text (like assuming "she" refers to Marie Curie), but I believe these inferences should be recognisable -- captured in the output in a way they can be queried and reasoned about.
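
Something like the following Python sketch is what I have in mind. It is only an illustration: the `Fact` class, the predicate names and the confidence numbers are invented, and no real extraction system is being described.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    inferred: bool     # False: stated in the text; True: derived by the system
    confidence: float  # hypothetical score in [0, 1]
    evidence: str      # the sentence the fact is based on


def stated_only(facts: List[Fact]) -> List[Fact]:
    """Keep only facts the text states outright."""
    return [f for f in facts if not f.inferred]


def inferred_above(facts: List[Fact], threshold: float) -> List[Fact]:
    """Keep inferred facts whose confidence is at least the threshold."""
    return [f for f in facts if f.inferred and f.confidence >= threshold]


facts = [
    # "She was a researcher." states the occupation directly.
    Fact("she", "occupation", "researcher", False, 1.0, "She was a researcher."),
    # "She did research." states an activity...
    Fact("she", "performed_activity", "research", False, 1.0, "She did research."),
    # ...and the occupation is at best an inference from it (0.6 is made up).
    Fact("she", "occupation", "researcher", True, 0.6, "She did research."),
]

for f in stated_only(facts):
    print(f.predicate, f.obj, "<-", f.evidence)
```

With the inference kept as its own flagged record, the two sentences no longer collapse into the same output, and a consumer can decide whether to trust the derived occupation or ignore it.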


Much of the knowledge that humans derive from reading text is implicit rather than explicit. The derived knowledge is also context-dependent and probabilistic, i.e. the facts are not binary; we assign a degree of confidence to them.

In the context of that sentence "She was a .. physicist and chemist who conducted .. research on radioactivity.", I think most people would say a physicist or a chemist who conducts research is a researcher. In other contexts, such as in your example, that would be a questionable inference. What you're describing is why natural language understanding is hard--it's context-dependent and not syntactic.

These two sentences return different outputs from a state-of-the-art relation extraction system:

"She did research." http://relex.diffbot.com:8085/?text=She%20did%20research.

"She was a researcher." http://relex.diffbot.com:8085/?text=She%20was%20a%20research....

The initial example: http://relex.diffbot.com:8085/?text=Marie%20Curie%20was%20bo....

In your example, being a coder is not inferred: http://relex.diffbot.com:8085/?text=He%27s%20a%20hard-surfac....


[2016]


I wish I could vote this one up more than once.



