It differs in a few key ways, largely because I’m trying to solve a different problem.
I’m curious how this solution for knowledge extraction, as opposed to summarization, holds up in practice. I suspect (as they mention in the conclusion) that even though data is extracted, timing is everything. Language and information change over time, and I’m curious how multiple similar-but-different statements in the corpus would be represented. For instance, “I live in Japan” vs. “I lived in Japan” vs. “we live in Japan”, but “I visit New York”.
Where I live, who “we” are, etc. are still very difficult problems. Summarization is a bit easier in that, for the most part, you’re just removing superfluous information.
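To make the Japan/New York point concrete, here's a minimal sketch in plain Python (the schema and data are invented, not from any real extraction system) of why a bare subject–relation–object triple loses exactly the information those examples hinge on, unless you qualify it:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str           # "I", "we" -- coreference still unresolved
    relation: str          # normalized predicate, e.g. "live_in"
    obj: str
    tense: str = "present" # "past" facts may no longer hold

facts = [
    Fact("I", "live_in", "Japan"),                # "I live in Japan"
    Fact("I", "live_in", "Japan", tense="past"),  # "I lived in Japan"
    Fact("we", "live_in", "Japan"),               # "we live in Japan"
    Fact("I", "visit", "New York"),               # "I visit New York"
]

# "Where do I live?" -- querying subject and relation alone conflates
# current and former residences:
naive = [f.obj for f in facts if f.subject == "I" and f.relation == "live_in"]

# Qualifying by tense recovers the distinction:
current = [f.obj for f in facts
           if f.subject == "I" and f.relation == "live_in"
           and f.tense == "present"]
```

Even with the tense qualifier, the "who are we?" problem remains: nothing here links "I" and "we" to the same household, which is the coreference resolution part of the difficulty.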
Perhaps I will use throw away accounts more frequently in future or just post less regularly.
Besides, don't the posts belong to either the user or to HN, so that those mood graphs can be GDPR'd?
Wonder how archive.org fares.
For reference, I’m based out of the U.S. and don’t officially conduct business in the EU. As it stands, I’m completely unbound by those laws. Even EU customers would have to conduct business with me via U.S. dollars on a U.S.-hosted server. That is, both they and I would be bound by U.S. law, as we’d be in U.S. jurisdiction.
That being said, I’m also not hosting “data from an EU citizen”. I have no way of knowing from where these anonymous users are posting. If I applied this to people whose identities I could confirm, then sure. However, as it stands, I have no way of knowing who “jcims” or anyone else is.
Article 3.1 states that the GDPR applies to organizations that are based in the EU even if the data are being stored or used outside of the EU. Article 3.2 goes even further and applies the law to organizations that are not in the EU if two conditions are met: the organization offers goods or services to people in the EU, or the organization monitors their online behavior. (Article 3.3 refers to more unusual scenarios, such as in EU embassies.)
You definitely are covered by the GDPR - it's just a matter of how and when someone will take action against you.
The second exception is for organizations with fewer than 250 employees. Small and medium-sized enterprises (SMEs) are not totally exempt from the GDPR, but the regulation does free them from record-keeping obligations in most cases (see Article 30.5).
People act like the utility of this stuff is a priori obvious, but to me it's really not. Unless you are trying to build a knowledge graph, what do you do with these triples? (And even if you are, you need humans to independently vet them for noise induced by the source material, as well as by your algorithm. My understanding is that information architects and annotators are essential to building and maintaining Google's KG.)
Article summarization is often cited, and I find this unconvincing. Most articles can't be reduced to a trite collection of triples. They have a higher-level thesis that would be AI-complete to parse.
What are some other examples?
Once you have a way to construct the knowledge graph automatically, you can make better recommendations. One usually good recommendation is inferring the graph query which would have produced the results the user was interested in. Then you can run that query and propose to the user an exhaustive set of similar results. Basically it's a great way to intuit the intent of the user: it helps you get the "why".
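A toy sketch of that query-induction idea, in plain Python over an invented triple set: the inferred "query" is just the intersection of the liked entities' property/value pairs, and running it surfaces the exhaustive set of matches, including items the user hasn't seen yet.

```python
triples = [
    ("dune", "genre", "scifi"),        ("dune", "medium", "book"),
    ("foundation", "genre", "scifi"),  ("foundation", "medium", "book"),
    ("neuromancer", "genre", "scifi"), ("neuromancer", "medium", "book"),
    ("dracula", "genre", "horror"),    ("dracula", "medium", "book"),
]

def properties(entity):
    """All (predicate, object) pairs asserted about an entity."""
    return {(p, o) for (s, p, o) in triples if s == entity}

def induce_query(liked):
    # The inferred query: properties shared by every liked entity.
    return set.intersection(*(properties(e) for e in liked))

def run_query(pattern):
    # Every entity whose properties satisfy the inferred pattern.
    subjects = {s for (s, _, _) in triples}
    return sorted(s for s in subjects if pattern <= properties(s))

liked = ["dune", "foundation"]
pattern = induce_query(liked)   # {("genre", "scifi"), ("medium", "book")}
results = run_query(pattern)    # picks up "neuromancer" as a new suggestion
```

A real system would induce richer patterns (multi-hop paths, not just attribute intersections), but the "recover the query, then run it exhaustively" shape is the same.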
Another application, a small extension, is to use quad stores instead of triples, so you can have facts about the facts. This is very useful for highlighting fake news/fake users, or for identifying the point of view of various individuals for further targeting.
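A minimal sketch of what the fourth element buys you, with invented data: here the extra slot records the source of each statement, so you can state things about the statements themselves, such as that two sources contradict each other.

```python
# Quads: (subject, predicate, object, source). A triple store would have
# to collapse the two conflicting claims below into one.
quads = [
    ("acme", "revenue", "up",   "press_release"),
    ("acme", "revenue", "down", "sec_filing"),
]

def claims(subject, predicate):
    """What each source asserts about (subject, predicate)."""
    return {src: obj for (s, p, obj, src) in quads
            if s == subject and p == predicate}

# A fact about the facts: the sources disagree on acme's revenue.
conflicting = len(set(claims("acme", "revenue").values())) > 1
```

In RDF terms this is the named-graph / reification territory; the fourth slot could equally hold a timestamp or a confidence score.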
A more promising angle for this type of work is that graph query languages are fairly simple programming languages, and are a good stepping stone toward more useful ones. This is probably one step on the path toward program induction.
An example is health records. Doctors enter short notes in a text editor for a lot of these, and being able to extract information and discover symptom associations using information extraction is much faster than manually going over the data.
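A minimal sketch of the symptom-association idea, with invented note text and naive keyword matching (a real system would use a proper extraction model, negation handling, and a clinical vocabulary, not substring search):

```python
from itertools import combinations
from collections import Counter

notes = [
    "pt reports fever and cough, mild fatigue",
    "fever, cough. advised rest",
    "fatigue, headache",
]
symptoms = ["fever", "cough", "fatigue"]

# Count how often each pair of symptoms co-occurs in the same note.
pairs = Counter()
for note in notes:
    present = [s for s in symptoms if s in note]
    pairs.update(combinations(present, 2))

# Most frequently co-occurring pair across the notes:
top = pairs.most_common(1)[0]
```

Even this crude version illustrates the payoff: associations fall out of aggregation over many notes, rather than anyone reading them all.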
Here's a good presentation that covers some of this (PDF):
Logistically it would be near impossible to manually vet every single automatically classified piece of data, but it would be possible to assume a % accuracy based on known training sets representing a realistic cross-section of the type of data being classified. Once the data is categorised and related across numerous taxonomies, within a certain degree of error, a great deal of useful actions are possible.
Say you want to fetch all opinion pieces where the subject is a certain political party, then run a sentiment analysis to see if they are largely favourable or unfavourable, then drill down to see what aspects cause opinion to sway (could it be the gender of the author, the age, the country) etc. Even with a certain degree of error, this kind of analysis would be very valuable for strategists, reporters, researchers, etc.
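A toy sketch of that drill-down, with invented records standing in for the classified corpus: select opinion pieces whose subject is a given party, then break average sentiment down by an author attribute such as country.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical output of the classification pipeline described above.
articles = [
    {"type": "opinion", "subject": "party_x", "sentiment":  0.7, "country": "UK"},
    {"type": "opinion", "subject": "party_x", "sentiment": -0.4, "country": "US"},
    {"type": "opinion", "subject": "party_x", "sentiment":  0.5, "country": "UK"},
    {"type": "news",    "subject": "party_x", "sentiment":  0.9, "country": "US"},
]

def sentiment_by(attr, subject):
    """Mean sentiment of opinion pieces about `subject`, grouped by `attr`."""
    groups = defaultdict(list)
    for a in articles:
        if a["type"] == "opinion" and a["subject"] == subject:
            groups[a[attr]].append(a["sentiment"])
    return {k: mean(v) for k, v in groups.items()}

breakdown = sentiment_by("country", "party_x")
```

Swapping `"country"` for any other author attribute gives the gender/age cuts for free, which is the point: once text is reduced to structured records, the drill-downs are ordinary queries.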
The same could be done for stock markets with companies, shareholders and traders all being able to delve into the masses of data being poured daily into the web. Through network analysis (traversing the knowledge graph) you'd be able to identify clusters of similar information, to spot a smear campaign or a sudden surge of bot-generated "fake news" in the making or otherwise identify risks and opportunities in near real-time. To be able to do this, you need some way of extracting meaning from random blobs of text.
There's already a raft of services out there that offer some form of knowledge extraction and sentiment analysis - improving the accuracy of the underlying algorithms gives a competitive edge and makes the services more valuable over time. This includes developing AI that can interpret the higher level thesis with reasonable accuracy (keep in mind many humans fail miserably at this too) and how to avoid purposeful attempts to fool or spam the system. These are all problems that can be tackled as data research continues.
Even for a researcher in any field, being able to upload your thesis, have it analysed and classified, and then use the metadata to explore similar research would be advantageous. You may find connections between unlikely fields of research that might have been overlooked with standard keyword-relevancy searches, which could go on to spawn new areas to study or new considerations for existing research. Quite often advancements come when disparate fields are bridged, and having tools to help automate the process (even if they aren't 100% accurate) would be very useful.
I doubt we're close to replacing the human element just yet, but having at our disposal better tools to sift through the petabytes of crap on the internet is a highly sought after and valuable outcome, and much more than an intellectual exercise.
I have been very interested in this subject for years, and the motivation seems clear to me, so perhaps I can shed some light on this for you. Being able to extract information from raw text would have a huge number of business applications, and could completely change the way human beings interact with machines. Besides that, there are a number of unanswered questions that bear on fundamental research topics in deep learning, artificial intelligence, and natural language processing.
The idea behind the triples is they provide additional data points which can be used to improve the efficiency & accuracy of information extraction.
This is a super hard subject at the forefront of modern research, and it's just a hobby of mine to keep up to date on the state of affairs at this point, having long ago given up the side project I had been working on. I must not have been keeping up very well, because I had not seen the link before despite it being from 2016.
The last 3 years have seen significant development of neural models for text (and graph) processing, and almost all the state-of-the-art results listed there are outdated.
Notably, modern large language models turn out to be very good on their own at large parts of the knowledge extraction problem. OpenAI's GPT is state of the art on the SNLI task, and GPT-2 is approaching human performance on tasks similar to knowledge extraction.
(And that previous sentence is a perfect example - you had to infer I was speaking about SNLI, and can extract the knowledge that article is partially about it)
EDIT: Downvoters: Sorry this was unintentionally vaguely political.
I am looking at you, BuzzFeed.
The text doesn't say she was born in Poland. It says she was Polish. It also doesn't say her nationality at death was French, it says she was naturalized-French at some point. It also states she conducted _pioneering_ research on radioactivity, which is not captured by the example output.
The example also shows an inference that her job is "researcher." This is a questionable inference. Imagine this conversation between two humans: "He's a hard-surface texturing artist, but he coded sometimes when he needed to." "Oh so his jobs are art and coding?" "No his job is art, but he can code."
As humans, we are thinking about role assignments and expectations vs people committing acts. What ultimately defines a "job"?
The point I'm trying to make is that "She was a researcher." and "She did research." should not result in the same output.
There's obviously a lot of inference required to discern any structure from the text (like assuming "she" refers to Marie Curie), but I believe these inferences should be recognisable -- captured in the output in a way that they can be queried and reasoned about.
In the context of that sentence "She was a .. physicist and chemist who conducted .. research on radioactivity.", I think most people would say a physicist or a chemist who conducts research is a researcher. In other contexts, such as in your example, that would be a questionable inference. What you're describing is why natural language understanding is hard--it's context-dependent and not syntactic.
"She did research." http://relex.diffbot.com:8085/?text=She%20did%20research.
"She was a researcher." http://relex.diffbot.com:8085/?text=She%20was%20a%20research....
Return different outputs from a state-of-the-art relation extraction system.
The initial example: http://relex.diffbot.com:8085/?text=Marie%20Curie%20was%20bo....
In your example, being a coder is not inferred: http://relex.diffbot.com:8085/?text=He%27s%20a%20hard-surfac....