
Ask HN: Natural language processing to identify grammar in a text? - davearms
Apologies for the clunky question. We have a growing base of adult English students. Our teaching methodology is content-first: basically we find any material of interest to our students, based on their interests, sector, and language goals, and build lesson plans and discussions around that 'centerpiece'. A lot of the work is curation and creating reusable discussion questions.

I have been searching for a tool that can scan a paragraph and extract the grammar tenses and features (past simple, present continuous, passive voice, indirect question), as it's a recurring question with our students. We have tools to tell us the approximate level, suggested vocabulary, and word count, but does this even exist (yet)? Thank you in advance.
======
jll29
Sadly, this doesn't exist yet.

You will find that most Natural Language Processing (NLP) tools conceptualize
linguistic categories differently from how teachers do (language teaching
isn't linguistics; simplifications often happen, and schoolbooks get updated
more slowly than linguistics evolves).

Examples:

* English verbs have only two tenses: PAST or NONPAST. They can have PERFECTIVE aspect or not. They can have PROGRESSIVE aspect or not. Since these are 3 binary choices, there are 2 × 2 × 2 = 8 different ways English verbs can be realized. I think there'd be less confusion in school if a more linguistically accurate version were taught that separates out tense and aspect.

* "Future" or "Present Perfect" (labels I was still taught in school) don't exist as tenses for a proper linguist.
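
The eight combinations (2 × 2 × 2) can be enumerated mechanically. Here is a minimal Python sketch for the verb "write"; the form table is hand-written for illustration, not produced by any NLP library:

```python
# Enumerate the 2 x 2 x 2 tense/aspect combinations for the verb "write".
# The form table is a hand-written assumption, not generated by any library.
from itertools import product

FORMS = {
    # (tense, perfective, progressive) -> example realization
    ("nonpast", False, False): "writes",
    ("nonpast", True,  False): "has written",
    ("nonpast", False, True):  "is writing",
    ("nonpast", True,  True):  "has been writing",
    ("past",    False, False): "wrote",
    ("past",    True,  False): "had written",
    ("past",    False, True):  "was writing",
    ("past",    True,  True):  "had been writing",
}

for tense, perf, prog in product(("nonpast", "past"), (False, True), (False, True)):
    label = tense + ("+PERF" if perf else "") + ("+PROG" if prog else "")
    print(f"{label:22} {FORMS[(tense, perf, prog)]}")
```

Note how school labels map onto this scheme: "present perfect continuous" is just NONPAST+PERF+PROG.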

To build what you suggest, existing tools could be combined, but there would
have to be a mapping layer on top of syntactic parsers like the Charniak
parser, the Collins parser, or MaltParser. Another mismatch between grammar in
school and linguistics is single versus multiple theories: in school, people
usually teach constituent trees, whereas in linguistics phrase structure
(constituent) grammar is one theory among many. One alternative (valency and
dependency grammar), which does not rely on trees but focuses on the relations
between words, has recently gained a lot of traction in linguistic circles.

~~~
ampdepolymerase
Considering the success of Grammarly, it is most likely possible to do this
with traditional rule-based/expert systems combined with a tiny bit of modern
ML. But there is nothing available out of the box for it.

------
pattusk
I don't think there's an out-of-the-box library for things like the detection
of passive voice or indirect questions. But you should be able to build
something that would let you do it based on the basic NLP toolkit: dependency
parsing, POS tagging, lemmatization, named entity recognition.

I suggest you check out spaCy [0], a quick and easy-to-use Python library
providing the above features. The software produced by the Stanford NLP Group
is also great [1].

If you do not want to get your hands dirty with code, a number of API
providers will offer you the same as the above libraries (TextRazor, Rosette
Text Analytics...).

[0] [https://spacy.io/](https://spacy.io/)

[1] [https://nlp.stanford.edu/software/](https://nlp.stanford.edu/software/)
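
As a sketch of the kind of shallow rule you could layer on top of a tagger's output, here is a minimal passive-voice heuristic over hand-supplied (word, tag) pairs. Everything here is an illustrative assumption; in practice the tags would come from spaCy or NLTK:

```python
# Minimal passive-voice heuristic over (word, Penn-Treebank-tag) pairs.
# The tags and example sentences are hand-supplied assumptions; in
# practice they would come from a tagger such as spaCy or NLTK.

BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}

def looks_passive(tagged):
    """Flag a form of 'be' followed by a past participle (VBN),
    allowing intervening adverbs (RB, RBR, RBS)."""
    for i, (word, _tag) in enumerate(tagged):
        if word.lower() in BE_FORMS:
            for w2, t2 in tagged[i + 1:]:
                if t2.startswith("RB"):   # skip adverbs: "was not signed"
                    continue
                if t2 == "VBN":           # past participle -> likely passive
                    return True
                break                     # anything else ends this candidate
    return False

active = [("She", "PRP"), ("signed", "VBD"), ("the", "DT"), ("paper", "NN")]
passive = [("The", "DT"), ("paper", "NN"), ("was", "VBD"),
           ("not", "RB"), ("signed", "VBN"), ("by", "IN"), ("me", "PRP")]
print(looks_passive(active), looks_passive(passive))  # False True
```

Like any purely morphological rule, this will over-flag adjectival participles if the tagger labels them VBN; that is the kind of case where a real dependency parse earns its keep.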

------
dpezely
There are several topics intertwined with the solutions you seek:
part-of-speech (PoS) tagging, reducing words to their lemma form, identifying
sentence boundaries, etc.

After having faced a similar learning curve, I put what I know into a lengthy
document [0], written in 2018 based upon explorations over 2016-17. It will
get you deployed and operational quickly if you follow just the final section.
The first section explains key concepts, using conventional ideas as the means
of introducing NLP jargon. The part in between covers theory and practice for
getting the most out of whatever tool you're likely to use in the end.

More general tools are probably available today, such as add-ons for
Elasticsearch; I'd start looking there. Interesting items came up when
searching DuckDuckGo for: NLP elasticsearch.

[0] [http://play.org/articles/introduction-to-natural-language-pr...](http://play.org/articles/introduction-to-natural-language-processing)

~~~
bryanrasmussen
Maybe also try searching for: NLP Lucene.

There's also a pretty good book: [https://lingpipe-blog.com/2008/06/12/book-building-search-ap...](https://lingpipe-blog.com/2008/06/12/book-building-search-applications-lucene-lingpipe-and-gate/)

Of course, the note above about more general tools being available today
applies to this book as well.

------
nyxtom
I actually (sort of) wrote one of these a while back, though I don't think I
ever got around to implementing tenses. It might be somewhat easy to implement
on top of what I already built there, but maybe not, I don't know. In any
case:

> copulae verbs, linking verbs, terms that are often filtered (i.e. stop
> terms), question terms, time sensitive nouns, amplifiers, clauses,
> coordinating conjunctions, negations, conditionals (ORs), and contractions

[https://github.com/nyxtom/salient/](https://github.com/nyxtom/salient/)

------
calebkaiser
As someone else alluded to, this is a task for multiple models. Fortunately,
there are a lot of great NLP libraries that combine multiple pre-trained
language models into a single pipeline you can interface with, like Stanza.
From their docs, their vanilla pipeline breaks down the sentence "Barack Obama
was born in Hawaii. He was elected president in 2008." as (output shown for
the first sentence):

    ('Barack', '4', 'nsubj:pass')
    ('Obama', '1', 'flat')
    ('was', '4', 'aux:pass')
    ('born', '0', 'root')
    ('in', '6', 'case')
    ('Hawaii', '4', 'obl')
    ('.', '4', 'punct')

It should be very easy to deploy Stanza's pipeline as an API endpoint. Here is
an example of such an NLP-library-as-API endpoint, albeit with Hugging Face's
Transformers, deployed via Cortex:
[https://github.com/cortexlabs/cortex/blob/master/examples/py...](https://github.com/cortexlabs/cortex/blob/master/examples/pytorch/sentiment-analyzer/predictor.py)

~~~
hiddencost
A language model is a model that predicts the probability of a given text;
that is all. It should not be conflated with other NLP tasks like
part-of-speech tagging. I'm guessing the popularity of transformer-based
models, which are built around the LM task and then adapted to other domains,
is leading to this confusion.
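
To make "predicts the probability of a given text" concrete, here is a toy bigram model; the corpus is invented for illustration, and real LMs use smoothing and far larger contexts:

```python
# Toy illustration of "a language model assigns a probability to a text":
# a bigram model with maximum-likelihood estimates over a tiny made-up corpus.
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the cat .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def prob(text):
    """P(w1..wn) ~= product of P(w_i | w_{i-1}) under the bigram model."""
    words = text.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(prob("the cat sat"))  # 0.25 -- both bigrams were observed
print(prob("the mat sat"))  # 0.0  -- "mat sat" never occurs in the corpus
```

Nothing in this model knows anything about parts of speech; it only scores strings, which is the distinction being drawn above.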

------
wyldfire
Example of passive voice detection using spacy[1]:
[https://gist.github.com/armsp/30c2c1e19a0f1660944303cf079f83...](https://gist.github.com/armsp/30c2c1e19a0f1660944303cf079f831a)

[1] [https://spacy.io/](https://spacy.io/)

------
mci
My IOCCC entry [0] detects English passive constructions. Do not feel
discouraged by its looks; it is a solid tool. Since both ioccc.org and its
official mirrors (in the same domain) are down now, you can look at its
Wayback Machine cache [1].

[0] [https://www.ioccc.org/2018/ciura/](https://www.ioccc.org/2018/ciura/) [1]
[https://web.archive.org/web/20200224040340/https://www.ioccc...](https://web.archive.org/web/20200224040340/https://www.ioccc.org/2018/ciura/)

------
GistNoesis
Some parsers like spaCy can give additional tense information for verbs, but
it's probably not customizable enough for what you want.

Maybe you can give GPT-3 a try.

If you want to go the custom route, the easy way, though it consumes a lot of
processing power, is to use a neural network; it requires a boring
dataset-construction phase.

You build a dataset corresponding to your problem, and you train the neural
network on it.

For inspiration, you can look at my Colorify browser extension, which uses a
neural network that learns at the same time to split sentences, predict POS
tags, predict the root of the sentence, and predict the parse tree, all of
which are then used to decorate the webpage.

What I did was programmatically build a dataset from the spaCy parser in
order to build a custom JavaScript parser that does what I want. If I wanted
to add information that spaCy doesn't provide, like grammar tenses and
features, I could complete my dataset manually and have the network predict
all the decorations at the same time, which keeps the number of samples
needed low because the layers are shared.

You can probably build your dataset faster by interacting with your neural
network (active learning).

For the model, you can start with something simple like a convolutional
residual network architecture, and later use transformers when you want to
reach the state of the art.

------
Rotten194
You should look into the English Resource Grammar:

[http://moin.delph-in.net/ErgTop](http://moin.delph-in.net/ErgTop)

Online demo:

[http://erg.delph-in.net/logon](http://erg.delph-in.net/logon)

It has all that information in the generated feature structure -- even more
than you can view in the web interface. There's a development environment you
can download, as well as a headless linux tool called ACE you can use on a
server. The ERG is complex, but far and away the most sophisticated tool in
this space.

------
3131s
Check out UDify for dependency parsing with universal parts of speech and
features.

[https://github.com/Hyperparticle/udify](https://github.com/Hyperparticle/udify)

------
Timpy
This isn't really a programming solution but you should check out this webapp:

[http://www.hemingwayapp.com/](http://www.hemingwayapp.com/)

It may not be the exact thing you're looking for, but it can probably be
helpful to your students.

I would also look at Python's NLTK. I've only dabbled in the toolkit, so I'm
not sure if it has exactly what you're looking for, but it's worth a look.

[http://www.nltk.org/](http://www.nltk.org/)

------
d_burfoot
To do the grammar-tense analysis, you can use spaCy or another syntactic
parser. The parse tree won't directly give you the exact grammar tense; you
will need to do some simple analysis of the conjugational form of the root
verb and the auxiliary verbs that are attached to it.
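
A minimal sketch of that analysis step, mapping an auxiliary chain plus main-verb form to the tense labels teachers use. The morphological checks here are hand-coded string tests for illustration; in practice a parser like spaCy would supply the verb group and its features:

```python
# Sketch: map an auxiliary chain + main-verb form to traditional tense
# labels. The morphology checks are crude hand-coded assumptions; a real
# system would read them off a parser's output instead.

def classify(verb_group):
    """verb_group: auxiliaries and main verb, e.g. ['has', 'been', 'eating']."""
    words = [w.lower() for w in verb_group]
    tense = "present"
    if words[0] in {"had", "was", "were", "did"} or (
            len(words) == 1 and words[0].endswith("ed")):
        tense = "past"
    if words[0] in {"will", "shall"}:
        tense = "future"
        words = words[1:]
    perfect = any(w in {"have", "has", "had"} for w in words[:-1])
    continuous = words[-1].endswith("ing")
    label = tense
    if perfect:
        label += " perfect"
    if continuous:
        label += " continuous"
    return label if (perfect or continuous) else tense + " simple"

print(classify(["walked"]))                 # past simple
print(classify(["has", "been", "eating"]))  # present perfect continuous
print(classify(["will", "walk"]))           # future simple
```

Irregular simple pasts like "wrote" would defeat the `endswith("ed")` check; a parser's morphological analysis is what supplies that information for real.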

I've done extensive work in this area, including developing my own statistical
parser from scratch. I'd be happy to chat more about this project, my email is
daniel dot burfoot at gmail.com.

------
psahgal
Have you taken a look at Google's Natural Language API? Try out the demo and
switch to the "Syntax" tab to see the output. More info:
[https://cloud.google.com/natural-language](https://cloud.google.com/natural-language)

------
laurieg
I have played around with similar projects. A good starting point is google's
NLP sentence parsing API. Be warned: the accuracy may not be good enough for
your application.

------
RandyRanderson
You can get most of the way there with Stanford CoreNLP.

------
dglass
I'm not sure if it has an API or any kind of integrations, but I know the
Hemingway App[0] can detect passive voice, and possibly other features you're
looking for.

[0]: [http://www.hemingwayapp.com/](http://www.hemingwayapp.com/)

------
unhammer
On recent debians/ubuntu, PoS tagging is just one apt away:

    
    
        $ sudo apt install -y apertium-eng
    
        $ echo "I have been searching for a tool that can scan a paragraph" |apertium eng-disam|grep -v '^;'
        "<I>"
                "prpers" prn subj p1 mf sg
        "<have>"
                "have" vbhaver inf
                "have" vbhaver pres
        "<been>"
                "be" vbser pp
        "<searching for>"
                "search# for" vblex ger SELECT:177
        "<a>"
                "a" det ind sg
        "<tool>"
                "tool" n sg
        "<that>"
                "that" cnjsub
                "that" prn dem mf sg
                "that" prn rel an mf sp
        "<can>"
                "can" vbmod pres SELECT:281
        "<scan>"
                "scan" vblex inf SELECT:140
        "<a>"
                "a" det ind sg
        "<paragraph>"
                "paragraph" n sg
        "<.>"
                "." sent
    

(We grep out lines starting with ; since they show what _was_ removed by the
disambiguator, while the SELECT/REMOVE tags are just trace info saying which
rules applied. If multiple indented lines remain, the disambiguator didn't
manage to fully disambiguate the analysis.)

If you want to e.g. mark passive, it's easy to write a Constraint Grammar rule
to do this. Put the following into rules.cg3:

    
    
        DELIMITERS = sent ;
        ADD (&PASSIVE) ("be") # Add the tag "&PASSIVE" to the word with lemma "be"
        IF
        (1* (pp)             # There is a participle to the right
         BARRIER (*) - (adv) # with nothing in between except perhaps adverbs
        );
    

and pipe it in after the above pipeline:

    
    
        $ echo "The paper is not signed by me" |apertium eng-disam |grep -v '^;'|vislcg3 -g rules.cg3
        "<The>"
                "the" det def sp
        "<paper>"
                "paper" n sg
        "<is>"
                "be" vbser pres p3 sg &PASSIVE
        "<not>"
                "not" adv
        "<signed>"
                "sign" vblex pp
                "sign" vblex past
                "signed" adj
        "<by>"
                "by" pr SELECT:470
        "<me>"
                "prpers" prn obj p1 mf sg
        "<.>"
                "." sent
    

(
[https://wiki.apertium.org/wiki/Constraint_Grammar](https://wiki.apertium.org/wiki/Constraint_Grammar)
for more info on CG )

~~~
Tainnor
That rule is really rather limited. More generally, I doubt you can properly
recognise passive constructions without doing constituent parsing; passive is
really a syntactic construction, not a morphological one.

~~~
unhammer
Or dependency parsing. But unfortunately the English package in Apertium
doesn't have a syntactic CG (there are proprietary English syntax CGs out
there, with syntactic-function labelling and dependency relations, while in
Apertium there are syntax CGs for some other languages).

OTOH, what's the goal? If you just want to flag some high-frequency
constructions, you can get quite far with very little depth.
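
To illustrate how far "very little depth" can go, here is a purely surface-level flagger for one such construction, the indirect question. The reporting-verb and wh-word lists are my own assumptions, not from Apertium or any other library:

```python
# Shallow surface-pattern flagger for one high-frequency construction:
# indirect questions introduced by a reporting verb. The verb and wh-word
# lists are illustrative assumptions, not drawn from any library.
import re

REPORTING = r"(?:ask(?:ed|s)?|wonder(?:ed|s)?|know(?:s)?|tell (?:me|us))"
WH = r"(?:what|where|when|why|how|whether|if|who)"
INDIRECT_Q = re.compile(rf"\b{REPORTING}\s+{WH}\b", re.IGNORECASE)

def flag_indirect_question(sentence):
    """True if the sentence contains 'reporting verb + wh-word'."""
    return bool(INDIRECT_Q.search(sentence))

print(flag_indirect_question("She asked where the station was"))  # True
print(flag_indirect_question("Where is the station?"))            # False
```

It has no notion of syntax at all, so it will miss anything with material between the verb and the wh-word, but for flagging the common cases in a lesson text it may be depth enough.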

------
giantg2
I think Grammarly is doing something like this for text messages.

------
PaulHoule
Some tough love:

There will always be a gap between "your judgement" and the "judgement baked
into a model" -- worse yet, if the model is very general and oriented towards
cheap computation and away from expensive people, it will have vague and
contradictory judgements inside it that make the results meaningless.

That is the language of failure: the structure of success looks like the
following.

(1) The system works like a "magic magic marker": that is, you mark up a lot
of text (say 20,000 sentences) the way you think it should be marked up. This
might be character-at-a-time or word-at-a-time. Character-at-a-time is real
and eternal; word-at-a-time is not real, because there is not really such a
thing as a "word". (E.g. "red ball" can fill slots that take "ball", you can
smash together subwords to make words, people violate punctuation rules
("Amazon.com announced that..."), people call themselves n3pg34r, ...) So if
you segment the text up front and segment it the wrong way, you may throw out
essential information and choose to fail.

(2) You need some system to mark up the text manually and efficiently. It is a
lot of work. A typical person can make about 2000 or so up/down judgements a
day; if a sentence counts for 10 decisions then maybe you can annotate 200
sentences a day. If you can get students to do it and get teachers to review
it you might make short work of it.

This annotator

[http://brat.nlplab.org/](http://brat.nlplab.org/)

ticks the requirements, but most people find it terribly hard to use and wind
up building "easy to use" systems that don't align things right at level (1)
and... fail.

Assuming you do (1) and (2) the odds are in your favor, but you have to now

(3) build models; it does not matter whether the model is a bunch of rules
you cobbled together, a hidden Markov model, an LSTM, or a convolutional
network. Off the top of my head, I would train an LSTM to 'predict the next
character' on maybe 100M characters of text, then stick a simple model on top
that takes the LSTM state as input and labels characters at the output (it
could be an SVM, a Random Forest, a logit, or a 3-layer NN).

(4) Accept that the system is not going to be perfect, but have the ability
to manually patch wrong results and improve the training data over time. I'd
say this is a more important practice than any particular approach to (3).

Some tool could give you (1)-(4) tied up in a bow;

[https://www.tagtog.net/](https://www.tagtog.net/)

claims to. But (2) involves elbow grease that 90% of people aren't going to
do. Some of the 10% of people who do that elbow grease will succeed; the
other 90% will fail.

