That's really the TL;DR I also got from the computational linguistic courses I attended.
There's probably the Pareto principle at work. Having no solution is worse than having an 80% solution that works well enough, especially when the 100% solution is much harder to achieve (and some of the problems not even humans would be able to solve properly).
Recently I wrote a web-extension for Firefox that displays funny "Deep thought" quotes.
I wanted to analyse the quote text and fetch relevant images to animate in the background of the quote text. After reading several NLP tutorials, guess what I did as a first PoC: pick the 3 longest words in a quote text and run an image search with those 3 words.
I get relevant images in the search results 99/100 times. The quirks of searching often result in the image adding to the funny-ness of the "Deep Thought" on display.
It's so effective that I ended up publishing the "Deep Thought Tabs" web-extension with this approach itself:
Later I tried using the nlp-compromise js library to identify "topics" of interest within a quote text - typically nouns, verbs, and adjectives. Comparing the results with my "3-longest-words" approach, I found that the longest words were anyways almost always the "topic" words that NLP identified for any given quote text.
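For anyone curious, the "3-longest-words" trick fits in a few lines. This is a minimal Python sketch rather than the extension's actual code (which is a JavaScript web-extension); the function names `longest_words` and `image_search_query` are made up for illustration, and the image search itself is left out.

```python
import re


def longest_words(quote: str, count: int = 3) -> list[str]:
    """Return the `count` longest distinct words in `quote`, longest first."""
    words = re.findall(r"[A-Za-z']+", quote.lower())
    # dict.fromkeys drops duplicates while keeping first-seen order;
    # sorted() is stable, so equally long words keep their order of appearance.
    unique = list(dict.fromkeys(words))
    return sorted(unique, key=len, reverse=True)[:count]


def image_search_query(quote: str) -> str:
    """Join the longest words into a query string for some image search."""
    return " ".join(longest_words(quote))
```

Calling `image_search_query(...)` on a quote gives the string you would hand to whatever image search you use.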
Back in games we'd do all sorts of tricks in networking to make it look like things were happening (sound effects, decals, etc.) in response to local events until we could have the server provide the definitive call on some game state.
Most players thought we had a much higher fidelity sim than we actually did. It's a pretty common technique across a lot of games. You can get away with quite a bit by being smart about what you "fake" and what you actually make work end-to-end.
So, in summary, your point is not wrong. But it's no reason for bashing computational linguistics. It is common across many disciplines to use not-yet-perfect solutions as long as you don't know how to do better.
That said, I don't fully agree with the notion that "hacking a solution" is the suggested way of doing things. Computational linguistics is a pretty wild field with a lot of sub-disciplines. In a lot of those, the state of the art consists of quite sophisticated approaches that are the result of years of research. Take speech recognition, for instance. Currently, deep learning approaches take the cake, but there is also a plethora of insights that have been gained from improving the traditional methods over decades.
I think a more nuanced point of view is called for here.
It's surprising how often you can get very far with imperfect solutions. ELIZA is the classic example. A simple program with very little code could convince people that they were talking to another human, or at least to a machine with an understanding of their feelings.
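To give a feel for how little code that takes, here is a toy ELIZA-style responder in Python. It is only a sketch of the pattern-and-template idea: the three rules and the fallback line are invented for illustration and are not taken from Weizenbaum's original script.

```python
import re

# Hypothetical rules: (pattern, reply template). The first matching rule wins.
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (\w+)", re.IGNORECASE), "Tell me more about your {0}."),
]
FALLBACK = "Please go on."


def reply(message: str) -> str:
    """Echo part of the user's message back inside a canned template."""
    text = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(match.group(1))
    return FALLBACK
```

There is no understanding anywhere in there, just string matching, which is exactly the point.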
ELIZA was coded completely by humans. Of course, nowadays we have more sophisticated ways of doing that. We can throw a few topic-tagged example sentences with connected replies at a computer and it will mostly reply with the right answers to similar sentences. This is only possible because computational linguistics provided the foundation for that.
Still, many solutions are hacky to this day, but that is because computational linguistics is more concerned with interaction with imperfect humans than most of the other disciplines in computer science.
I wrote that in 2003 (I think?) based on @pg's "A plan for spam" essay, and then "invented" the summarization approach (I'm sure others had done similar, but I thought it up myself anyway).
Turns out it was rather well tuned. The 2003 implementation, presumably downloaded from sourceforge(!) still wins comparisons on datasets which didn't even exist when I wrote it.
I much prefer the Python implementation though, which I hadn't seen before.
Also, Textacy on top of Spacy is awesome for any kind of text work.
- Improving the readability of a text. The article only mentions "understanding how difficult to read is a text".
- Establishing relationships between entities in a body of text. E.g. we could build a fact-graph from sentences like "Burning coal increases CO2", and "CO2 increase induces global warming". Useful also in medical literature where there are millions of pathways.
- Answering a question, using a large body of facts. Like search, but now it gives a precise answer.
- Finding and correcting spelling/grammatical errors.
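To make the fact-graph item concrete, here is a small Python sketch. It assumes the hard part (turning sentences into subject/relation/object triples) has already been done, so the triples are written by hand, with "CO2 increase" simplified to the node "CO2". `build_graph` and `find_chain` are illustrative names, not from any library.

```python
from collections import defaultdict, deque

# Hand-written (subject, relation, object) triples standing in for extracted facts.
TRIPLES = [
    ("burning coal", "increases", "CO2"),
    ("CO2", "induces", "global warming"),
]


def build_graph(triples):
    """Index the triples by subject so facts can be chained together."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def find_chain(graph, start, goal):
    """Breadth-first search: return the triples linking start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, relation, nxt)]))
    return None
```

`find_chain(build_graph(TRIPLES), "burning coal", "global warming")` returns the two facts in order, which is the kind of pathway the medical-literature use case would need at a much larger scale.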
That's a simple example because with 'CO2' you at least have the same string that can serve as a keyword connecting those two facts. Usually in natural language we make frequent use of anaphora to refer to people, objects and concepts previously mentioned in the text by name.
Anaphora resolution is one of the really hard problems not only in NLP but in linguistics in general. The simplest anaphoric devices in languages like English are pronouns, and even with those it can be quite difficult to determine what a 'he' or 'she' refers to in context.
This was one of the most frustrating parts of studying Latin rhetoric. The speakers would keep referring to "That thing I was talking about," and it's a noun from a subordinate clause 2 and a half paragraphs ago.
That is essentially a Natural Language Interface. There are simple ways to implement one for bots that receive simple commands. The problem is that it quickly becomes very hard if you are trying to do something more open-ended than a bot. So, there was simply no room to include it.
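For the simple-command case, a sketch in Python could look like the following. The two commands belong to an imaginary home-automation bot and are purely hypothetical; the idea is just one regular expression per intent, with named groups as the parameters.

```python
import re
from typing import Optional

# Hypothetical intents for an imaginary bot: (intent name, pattern with named slots).
COMMANDS = [
    ("set_timer", re.compile(r"set (?:a )?timer for (?P<minutes>\d+) minutes?", re.I)),
    ("turn", re.compile(r"turn (?P<state>on|off) the (?P<device>[\w ]+)", re.I)),
]


def parse_command(utterance: str) -> Optional[dict]:
    """Return the first matching intent and its slots, or None if nothing matches."""
    for intent, pattern in COMMANDS:
        match = pattern.search(utterance)
        if match:
            return {"intent": intent, **match.groupdict()}
    return None
```

Anything outside the listed patterns returns `None`, which shows how quickly this approach runs out once the input becomes open-ended.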
> - Improving the readability of a text. The article only mentions "understanding how difficult to read is a text".
The issue is that the formulas to measure the readability of a text cannot really be used to suggest improvements. That's because the user ends up focusing on improving the score instead of improving the text. To suggest improvements you need a much more sophisticated system.
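As an illustration of why the score is easy to game, here is a rough Python sketch of one well-known formula, Flesch Reading Ease: 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). Choosing this formula is an assumption for the example, and the syllable counter is a crude vowel-group heuristic, so treat the numbers as approximate.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Higher scores mean the text is (supposedly) easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

Since only sentence length and syllable counts enter the formula, chopping a sentence in two raises the score whether or not the text actually got clearer.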
> - Establishing relationships between entities in a body of text. E.g. we could build a fact-graph from sentences like "Burning coal increases CO2", and "CO2 increase induces global warming". Useful also in medical literature where there are millions of pathways.
This is one of the things that were axed, because in some sense it is simple if you just want to link together concepts without any causality, i.e. stuff that happens together. To do that you could combine named entity recognition (to find entities) with a simple way to find a relationship between words (i.e., they appear in the same phrase, therefore they are related). However, a more sophisticated form of the process, like the one that results in the Knowledge Graph, would be quite hard to do.
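The simple version described here can be sketched in Python roughly as follows. To keep it self-contained, the entity list is supplied by hand instead of coming from a named entity recognizer, and entities are found by plain substring matching, which is an assumption that would be too naive for real text.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrences(text: str, entities: list[str]) -> Counter:
    """Count how often each pair of known entities appears in the same sentence."""
    pairs = Counter()
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        # Sorting gives each pair one canonical order, so (a, b) and (b, a) merge.
        present = sorted(e for e in entities if e.lower() in lowered)
        pairs.update(combinations(present, 2))
    return pairs
```

The result only says that two entities showed up together, not how they are related or in which direction, which is the gap between this and a real knowledge graph.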
> - Finding and correcting spelling/grammatical errors.
That's a great idea, we will add how to detect spelling errors.
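A very basic detector is just a dictionary lookup plus fuzzy matching for suggestions. The Python sketch below uses `difflib.get_close_matches` from the standard library; the eight-word vocabulary is a placeholder, and a real checker would load a full word list.

```python
import re
from difflib import get_close_matches

# Tiny illustrative vocabulary; a real checker would load a full word list.
VOCABULARY = {"the", "parser", "recognition", "of", "named", "entities", "is", "simple"}


def find_misspellings(text: str, vocabulary=VOCABULARY) -> dict[str, list[str]]:
    """Map each out-of-vocabulary word to up to three close in-vocabulary suggestions."""
    issues = {}
    candidates = sorted(vocabulary)
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in vocabulary:
            issues[word] = get_close_matches(word, candidates, n=3, cutoff=0.8)
    return issues
```

This catches non-words only; a correctly spelled word used in the wrong place slips through, which is where grammar checking gets much harder.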
However, we are thinking about creating a more advanced article at a later date.
However, if you already have experience in the topic we would be happy if you would like to write a guest post for us.
But you do need a use case and an economic reward to justify the substantial increase in cost over what a pre-trained, vanilla, off-the-shelf parser (model) can give you. Yet, if your domain is technical enough (pharma, finance, law, ... - essentially, all but parsing news, blogs, and tweets...) it might be the only way to get an NLP system that really works.
Some examples of use-cases: are you searching for "semantically similar", or "near duplicate"? You can compare documents under different metrics and different _representations_. Some representations are: LSA, PLSA, LDA, TF-IDF, and Set representations, along with metrics such as Jaccard Distance, Cosine Distance, Euclidean distance, etc.
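To show how representation and metric combine, here is a dependency-free Python sketch with two of the pairings: a set representation with Jaccard distance, and TF-IDF vectors with cosine distance. The smoothed IDF variant, log((1 + N) / (1 + df)) + 1, is one common choice among several and is an assumption of this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9']+", text.lower())


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over the two documents' token sets."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def tfidf_vectors(documents: list[str]) -> list[dict[str, float]]:
    """One sparse vector per document: term frequency times smoothed IDF."""
    token_lists = [tokenize(doc) for doc in documents]
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(u: dict[str, float], v: dict[str, float]) -> float:
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

Both pairings only compare surface tokens, so they suit "near duplicate" detection better than "semantically similar" search, where representations like LSA or LDA come in.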
Doc2vec is the Word2vec analog for documents.
There is an implementation in Textacy.
There is a video by the creator of Gensim on word2vec and friends: https://www.youtube.com/watch?v=wTp3P2UnTfQ
We didn't include it, simply because it relies on machine learning and we wanted to show simpler methods.
Take the famous example of [king] and [queen] being close neighbors in vector space after generating the word vectors ("embedding"). If you then use these vectors to represent the words in your text, a sentence about kings will also add information about the concept of queens, and vice versa. To a far lesser degree, such a sentence will also add to your knowledge of [ceo], and, further down, [mechanical engineer]. But it will not change the system's knowledge of [stereo].
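The neighbourhood idea can be illustrated with a toy Python example. The three-dimensional vectors below are invented by hand purely to mimic the ordering described; real embeddings are learned from a corpus and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors (NOT learned embeddings); the values are invented.
TOY_VECTORS = {
    "king": [0.9, 0.4, 0.0],
    "queen": [0.9, 0.35, 0.0],
    "ceo": [0.2, 0.9, 0.0],
    "mechanical engineer": [0.0, 0.9, 0.3],
    "stereo": [0.0, 0.0, 1.0],
}


def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def neighbours(word: str, vectors=TOY_VECTORS) -> list[str]:
    """Other words sorted from most to least similar to `word`."""
    others = [w for w in vectors if w != word]
    return sorted(
        others,
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
        reverse=True,
    )
```

With these made-up numbers, `neighbours("king")` ranks queen first, then ceo, then mechanical engineer, with stereo last at a similarity of zero.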
Can't get it to go away, can't read the article.
- bAbI https://research.fb.com/downloads/babi/ and https://github.com/facebook/bAbI-tasks
- SQuAD https://rajpurkar.github.io/SQuAD-explorer/
- WebQuestions https://github.com/brmson/dataset-factoid-webquestions
Edit: there's also a great list of datasets on the ParlAI project page https://github.com/facebookresearch/ParlAI