
Ask HN: How to implement an NLP grammar parser for a new natural language? - alnitak
As a novice with NLP, having tinkered before with some basic and naive models to do NLP, I would like to learn it properly this time by creating a grammar parser for a language, for which no current model is available publicly.
I can easily access a corpus of sentences for this language, and speaking it myself, I am motivated enough to produce any data needed for this.<p>Where would you recommend to start for such a project, both in terms of minimal theoretical and practical knowledge, but also the engineering aspect of it?
What open source libraries and software are available out there to speed up this process?
======
amirouche
Nobody mentioned SyntaxNet or Link Grammar. If you have not read Chomsky's
article about the two ways of doing AI, you should. Basically it says there
are statistical methods and logic methods in AI. Most NLP libraries today use
the statistical approach; the logic-based, rule-driven approach was the most
popular one before. Anyway, that's what Link Grammar does. I recommend you
start with the introduction
[https://amirouche.github.io/link-grammar-website//documentation/dictionary/introduction.html](https://amirouche.github.io/link-grammar-website//documentation/dictionary/introduction.html)
to get a feeling for what it means to assign meanings to sentences.

Also, nowadays word2vec is unrelated to the understanding of grammatical
constructs in natural languages. It is, simply put, about coincidence or
co-occurrence of words. A grammatical interpretation of a sentence must be
seen as a general graph, whereas word2vec operates on the linear structure of
sentences (one word after the other). If word2vec had to work on grammatical
constructs, it would need to be able to ingest graph data. word2vec works on
matrices, whereas the graphical representation of the grammar of a sentence
(POS tagging, dependencies, anaphora, probably others) is a graph, in other
words a sparse matrix or a matrix with a very large number of dimensions. (It
seems to me machine learning is always about dimensionality reduction with
some noise.)
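To make the co-occurrence point concrete, here is a toy sketch of the linear
window statistics that word2vec-style models are built on (plain counting,
not word2vec itself; the sentences are made up):

```python
from collections import Counter

def cooccurrence(sentences, window=2):
    """Count word pairs appearing within `window` positions of each other:
    the linear, order-based signal word2vec-style models learn from
    (no graph structure, just adjacency in the sentence)."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            # Only look at the next `window` words to the right.
            for v in tokens[i + 1 : i + 1 + window]:
                counts[tuple(sorted((w, v)))] += 1
    return counts

sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]
pairs = cooccurrence(sents)
```

A dependency graph, by contrast, would need edges between arbitrary word
positions, which this sliding window never sees.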

I am quite ignorant about the literature on machine learning applied to graph
data structures.

~~~
curuinor
word2vec is the latest in a long line of connectionist claims that what we
call grammar does not exist. Instead of thinking of a set of deterministic
rules, you think of a dynamical system which flays off language as it toddles
on.

------
Bitcoincadre
First, North African languages are called Arabic. The proper written form of
Arabic is the same in every country. The Berber language never had a written
form or letters and only confuses the matter. It is a tool used to divide the
people. Can you imagine Palestinians demanding Canaanite be included as an
official language? The most common Modern Standard Arabic would be found in
Syria, Lebanon, Jordan and Palestine, with the Egyptian and Iraqi dialects
also well understood. The North African dialects need a major overhaul. In
Morocco, they have borrowed even German words and the pace is so fast half
the words are mumbled. Use Modern Standard Arabic as your focus, and perhaps
Latin letters to make it easier on non-natives while being able to translate
it back to Arabic letters.

------
web64
I haven't tried it yet, but spaCy has a guide[1] for adding a new language to
their Python NLP framework. Maybe it can be of use to you.

[1] [https://spacy.io/docs/usage/adding-languages](https://spacy.io/docs/usage/adding-languages)
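Whichever framework you pick, "adding a language" largely means supplying
language-specific data rather than code: stop words, tokenizer exceptions,
and so on. A framework-independent toy of a tokenizer-exception table (the
entries are hypothetical, not taken from spaCy):

```python
import re

# Hypothetical tokenizer exceptions: surface forms that should not be
# split the default way. This is the kind of per-language data a
# framework like spaCy asks you to provide.
EXCEPTIONS = {
    "l'farmassian": ["l'", "farmassian"],
}

def tokenize(text):
    """Whitespace-split, then expand any known exception forms."""
    tokens = []
    for word in re.findall(r"\S+", text.lower()):
        tokens.extend(EXCEPTIONS.get(word, [word]))
    return tokens
```

A real language module would add stop words, abbreviations, punctuation
rules, etc. in the same declarative spirit.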

~~~
amirouche
This is a supervised algorithm: it requires that you annotate a lot of
existing text so that the algorithm learns how to annotate "new" sentences.

It is statistics-based.

------
probably_wrong
If you want to go directly into coding, the Stanford NLP Parser lists in point
5 of their FAQ[1] some starting instructions for parsing a new language.

If you can deal with the math, some papers such as [2] use corpora for
existing languages as a tool to parse new languages, for which there are not
too many resources available.

In both cases, you can always contact the authors. They might know how to help
with your project, and/or direct you to the right people.

[1] [http://nlp.stanford.edu/software/parser-faq.shtml#d](http://nlp.stanford.edu/software/parser-faq.shtml#d)

[2]
[https://www.aclweb.org/anthology/Q/Q16/Q16-1022.pdf](https://www.aclweb.org/anthology/Q/Q16/Q16-1022.pdf)

------
franciscop
Stanford's NLP course is a good place to start learning about the theoretical
knowledge:
[https://youtube.com/watch?v=nfoudtpBV68](https://youtube.com/watch?v=nfoudtpBV68)

Then it highly depends on the language; for instance, tokenization (splitting
a sentence into words) is really easy in English, Spanish, etc. compared to
Japanese, Chinese, etc. So I would say a good starting point would be to try
using an NLP parser for a _similar_ language. What language is it? What kind
of NLP analysis do you want to perform?

~~~
alnitak
I specifically want to address North African languages (Algerian, Moroccan,
Tunisian), commonly known as dialects of Arabic, written here in a Latin
alphabet, with heavy influences from Romance languages (French, Spanish) and
Berber (an endemic regional language) as well. My intent is to be able to
parse a sentence written in this language and extract the subject, the
action, the object. I would also like to build a semantic model of the
language (e.g. word2vec, if I am not abusing the term) relating similar
words.

I have tried models for Arabic but they fall short because of the borrowings
from other languages, but also a different grammar in many cases. Also, on one
hand, transliterating the Latin written sentences in North African Arabic into
a regular Arabic alphabet is a challenging task (many transliterations are
possible for a given word), on the other, North African Arabic (NAA) is not
standardized, so words are commonly written phonetically using a loose set of
transcription rules, also borrowed from other languages. An example of this
last aspect would be 'the pharmacist', which in NAA could be written in any
of these forms (combinations of either change are possible):

- L'farmassian
- Lfarmassian
- Lpharmassian
- Lpharmacian
- l frmsian
- ...

Thanks for the reference, I will check it out!

~~~
Eridrus
Not an expert in NLP, but my understanding is that many models are at least
partly statistical these days so you will probably need a set of labeled data.

First you should figure out what type of parsing you want. I would recommend
looking at Stanford's CoreNLP library to see which task you actually want.
There are multiple ways to parse grammar. Once you can name the actual problem
you want to solve it should be googleable.

The downside of classical NLP is that you need to learn some amount of
linguistics to create labeled parse trees for your data or even interpret
them.
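To give a flavour of what "interpreting" a parse means: once you have
dependency triples, pulling out subject/action/object is just reading
relations off the tree. The parse below is hand-made and the NAA words are
hypothetical; relation names follow the Universal Dependencies convention:

```python
# Hand-made dependency parse: (index, word, head_index, relation) rows.
parse = [
    (1, "lfarmassian", 2, "nsubj"),  # hypothetical: "the pharmacist"
    (2, "fta7", 0, "root"),          # hypothetical verb: "opened"
    (3, "l7anut", 2, "obj"),         # hypothetical noun: "the shop"
]

def svo(rows):
    """Read subject, action (root verb) and object off the relations."""
    verb = next(w for _, w, _, rel in rows if rel == "root")
    subj = next((w for _, w, _, rel in rows if rel == "nsubj"), None)
    obj = next((w for _, w, _, rel in rows if rel == "obj"), None)
    return subj, verb, obj
```

The hard part is producing those triples correctly, and labeling enough of
them to train a parser; that is where the linguistics comes in.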

So, if your goal is to build an application, rather than a library, you may
want to learn about neural nets/LSTMs. They can let you go from language to
the actual information you want without you needing to encode and interpret
parse trees.

The downside of neural nets is that they tend to need more data, but the data
is much simpler, so you could farm this out to Mechanical Turk if you wanted.

~~~
karambahh
A pointer to potential data:

Cotterell, R., & Callison-Burch, C. (2014). A Multi-Dialect, Multi-Genre
Corpus of Informal Written Arabic. In LREC (pp. 241-245).

[http://www.lrec-conf.org/proceedings/lrec2014/pdf/641_Paper.pdf](http://www.lrec-conf.org/proceedings/lrec2014/pdf/641_Paper.pdf)

It uses Arabic characters, but at least it's labelled data of Maghrebi
Arabic. So you'd "only" have to perform one or several transliterations
between that corpus and your data.
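A toy of the transliteration direction (the character map is tiny and
invented for illustration; real NAA transliteration is one-to-many, so a
production system would emit several candidate Latin spellings per word):

```python
# Tiny, incomplete Arabic-script to Latin ("chat alphabet") mapping.
ARABIC_TO_LATIN = {
    "ف": "f", "ر": "r", "م": "m", "ا": "a",
    "س": "s", "ي": "i", "ن": "n", "ل": "l",
}

def transliterate(word):
    # Unknown characters pass through unchanged.
    return "".join(ARABIC_TO_LATIN.get(ch, ch) for ch in word)
```

Combined with a normalization step on the Latin side, this could let the
labelled Arabic-script corpus and your Latin-script data meet in the middle.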

