Hacker News
Ask HN: How to implement an NLP grammar parser for a new natural language?
48 points by alnitak on Feb 20, 2017 | 13 comments
As a novice in NLP who has tinkered with some basic, naive models before, I would like to learn it properly this time by building a grammar parser for a language for which no model is currently publicly available. I can easily access a corpus of sentences in this language, and since I speak it myself, I am motivated enough to produce any data needed for this.

Where would you recommend starting with such a project, both in terms of the minimal theoretical and practical knowledge required and the engineering side of it? What open-source libraries and software are available out there to speed up this process?

Nobody has mentioned SyntaxNet or Link Grammar. If you haven't read Chomsky's article about the two ways of doing AI, you should. Basically, it says there are statistical methods and logic-based methods in AI. Most NLP libraries today use the statistical approach; the logic, rule-based approach was the more popular one before now. Anyway, that's what Link Grammar does. I recommend you start with the introduction https://amirouche.github.io/link-grammar-website//documentat... to get a feeling for what it means to assign meaning to sentences.

Also, nowadays word2vec is unrelated to the understanding of grammatical constructs in natural languages. It is, simply put, coincidence or co-occurrence of words. The grammatical interpretation of a sentence must be seen as a general graph, whereas word2vec operates on the linear structure of sentences (one word after the other). If word2vec had to work on grammatical constructs, it would need to be able to ingest graph data. word2vec works on matrices, whereas the graphical representation of a sentence's grammar (POS tagging, dependencies, anaphora, probably others) is a graph — otherwise said, a sparse matrix, or a matrix with a large number of dimensions. (It seems to me machine learning is always about dimension reduction with some noise.)
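The "linear window" point can be made concrete with a toy sketch (plain Python, not the actual word2vec training code): the only structure a word2vec-style model ever sees is which words co-occur within a fixed window over the token sequence — it never sees a parse graph.

```python
from collections import Counter

def cooccurrence_counts(sentence, window=2):
    """Count word pairs within a fixed linear window.

    This is the only 'structure' a word2vec-style model sees:
    left/right neighbours in the token sequence, never a parse graph.
    """
    tokens = sentence.split()
    pairs = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs[(word, tokens[j])] += 1
    return pairs

counts = cooccurrence_counts("the cat sat on the mat")
```

A dependency relation like cat→sat survives here only because the two words happen to be adjacent; any long-distance grammatical relation falls outside the window entirely.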

I am quite ignorant of the literature on machine learning applied to graph data structures.

word2vec is the latest in a long line of connectionist claims that what we call grammar does not exist. Instead of thinking of a set of deterministic rules, you think of a dynamical system which flays off language as it toddles on.

First, North African languages are called Arabic. The proper written form of Arabic is the same in every country. The Berber language never had a written language or letters, and it only confuses the matter. It is a tool used to divide the people. Can you imagine Palestinians demanding that Canaanite be included as an official language? The most common Modern Standard Arabic would be found in Syria, Lebanon, Jordan, and Palestine, with the Egyptian and Iraqi dialects also well understood. The North African dialects need a major overhaul. In Morocco, they have even borrowed German words, and the pace is so fast that half the words are mumbled. Use Modern Standard Arabic as your focus, and perhaps Latin letters to make it easier on non-natives, while being able to translate it back to Arabic letters.
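If one did go the route of Latin letters that can be mapped back to Arabic script, the core of it is a reversible letter table. A toy sketch covering only a handful of letters (real schemes such as Buckwalter transliteration cover the full alphabet plus diacritics):

```python
# Toy sketch of a reversible Latin<->Arabic letter mapping.
# Only a few letters; real transliteration schemes handle the
# whole alphabet, diacritics, and ambiguous digraphs.
LATIN_TO_ARABIC = {
    "b": "\u0628",  # ba
    "t": "\u062a",  # ta
    "s": "\u0633",  # sin
    "f": "\u0641",  # fa
    "l": "\u0644",  # lam
    "m": "\u0645",  # mim
    "n": "\u0646",  # nun
    "r": "\u0631",  # ra
    "a": "\u0627",  # alif
}
ARABIC_TO_LATIN = {v: k for k, v in LATIN_TO_ARABIC.items()}

def to_arabic(word):
    return "".join(LATIN_TO_ARABIC.get(ch, ch) for ch in word.lower())

def to_latin(word):
    return "".join(ARABIC_TO_LATIN.get(ch, ch) for ch in word)
```

Because the table is one-to-one, a word round-trips losslessly — which is exactly the property a Latin working alphabet would need.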

I haven't tried it yet, but spaCy has a guide[1] for adding a new language to their Python NLP framework. Maybe it can be of use to you.

[1] https://spacy.io/docs/usage/adding-languages

This is a supervised algorithm: it requires that you annotate a lot of existing text so that the algorithm learns how to annotate "new" sentences.

It's statistics-based.

If you want to go directly into coding, the Stanford NLP Parser's FAQ[1] lists, in point 5, some starting instructions for parsing a new language.

If you can deal with the math, some papers such as [2] use corpora for existing languages as a tool to parse new languages, for which there are not too many resources available.

In both cases, you can always contact the authors. They might know how to help with your project, and/or direct you to the right people.

[1] http://nlp.stanford.edu/software/parser-faq.shtml#d

[2] https://www.aclweb.org/anthology/Q/Q16/Q16-1022.pdf

Stanford's NLP course is a good place to start learning about the theoretical knowledge: https://youtube.com/watch?v=nfoudtpBV68

Then it highly depends on the language; for instance, tokenization (splitting a sentence into words) is really easy in English, Spanish, etc. compared to Japanese, Chinese, etc. So I would say a good starting point would be to try using an NLP parser for a similar language. What language is it? What kind of NLP analysis do you want to perform?
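To illustrate the tokenization gap: a naive regex tokenizer (a sketch, not what production parsers use) works for space-delimited languages but returns unsegmented text as a single blob, because there are no spaces to split on.

```python
import re

def naive_tokenize(text):
    # Split on runs of "word characters" -- fine for languages
    # that delimit words with spaces and punctuation.
    return re.findall(r"\w+", text)

naive_tokenize("The cat sat.")  # -> ['The', 'cat', 'sat']
# Japanese has no spaces, so the same rule yields one unsplit token:
naive_tokenize("猫が座った")
```

Languages like Japanese or Chinese need a dedicated word-segmentation step before anything downstream (tagging, parsing) can run.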

I specifically want to address North African languages (Algerian, Moroccan, Tunisian), commonly known as dialects of Arabic, written in a Latin alphabet, with heavy influences from Romance languages (French, Spanish) and Berber (the endemic regional language) as well. My intent is to be able to parse a sentence written in this language and extract the subject, the action, and the object. I would also like to build a semantic model of the language (e.g. word2vec, if I am not abusing the term) relating similar words.

I have tried models for Arabic, but they fall short because of the borrowings from other languages, and also because the grammar differs in many cases. Also, on the one hand, transliterating Latin-script sentences in North African Arabic into the regular Arabic alphabet is a challenging task (many transliterations are possible for a given word); on the other hand, North African Arabic (NAA) is not standardized, so words are commonly written phonetically using a loose set of transcription rules, also borrowed from other languages. An example of this last aspect would be 'the pharmacist', which in NAA could be written in any of these forms (combinations of the changes are possible):

- L'farmassian
- Lfarmassian
- Lpharmassian
- Lpharmacian
- l frmsian
- ...
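A workable first step for variants like those (a sketch using Python's standard difflib, not a full solution — a real system would learn weighted, phonetically informed edit costs) is to map each phonetic spelling onto its closest form in a known lexicon:

```python
import difflib

# Hypothetical lexicon of attested spellings of 'the pharmacist'.
LEXICON = ["lfarmassian", "lpharmassian", "lpharmacian"]

def canonicalize(word, lexicon, cutoff=0.6):
    """Return the closest known spelling, or None if nothing is close.

    difflib ranks candidates by SequenceMatcher ratio, so small
    phonetic variations (apostrophes, f/ph swaps) still match.
    """
    matches = difflib.get_close_matches(word.lower(), lexicon,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Clustering a raw corpus this way would let you pick one canonical spelling per cluster before training anything downstream.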

Thanks for the reference, I will check it out!

Not an expert in NLP, but my understanding is that many models are at least partly statistical these days so you will probably need a set of labeled data.

First you should figure out what type of parsing you want. I would recommend looking at Stanford's CoreNLP library to see which task you actually want. There are multiple ways to parse grammar. Once you can name the actual problem you want to solve it should be googleable.

The downside of classical NLP is that you need to learn some amount of linguistics to create labeled parse trees for your data or even interpret them.
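As a sketch of what "interpreting a parse" means in practice (toy data, Universal Dependencies-style relation names; any real parser's output format differs in its details): once tokens are annotated with a head index and a relation label, extracting the subject, action, and object is just a lookup.

```python
# Toy dependency parse of "the cat chased the mouse".
# Each entry is (word, head index, relation); head 0 = root,
# indices are 1-based. Relation names follow UD conventions.
PARSE = [
    ("the",    2, "det"),
    ("cat",    3, "nsubj"),
    ("chased", 0, "root"),
    ("the",    5, "det"),
    ("mouse",  3, "obj"),
]

def extract_svo(parse):
    """Pull (subject, verb, object) out of a dependency parse."""
    verb = next(w for w, h, r in parse if r == "root")
    subj = next((w for w, h, r in parse if r == "nsubj"), None)
    obj = next((w for w, h, r in parse if r == "obj"), None)
    return subj, verb, obj
```

The hard part is not this lookup — it's producing the annotations in the first place, which is where the linguistics comes in.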

So, if your goal is to build an application, rather than a library, you may want to learn about neural nets/LSTMs. They can let you go from language to the actual information you want without you needing to encode and interpret parse trees.

The downside of neural nets is that they tend to need more data, but the data is much simpler so you could farm this out to mechanical turk if you wanted.

A pointer to potential data:

Cotterell, R., & Callison-Burch, C. (2014). A Multi-Dialect, Multi-Genre Corpus of Informal Written Arabic. In LREC (pp. 241-245).


It uses Arabic characters, but at least it's labelled data of Maghrebi Arabic. So you'd "only" have to perform a transliteration (or multiple transliterations) between that corpus and your data.

My hobby involves free-form Internet text, which has similar normalisation problems. I found that for my purposes, character-level models were quite effective, partly because they were easy to train on not much data and usually quite robust to minor misspellings[1]. I also developed a part-of-speech tagger[2] which might be useful to you, assuming your corpus has verb and noun tags available. If you've got some more questions, my email's in my profile.

[1] http://dracula.sentimentron.co.uk/sentiment-demo/

[2] https://github.com/Sentimentron/Dracula
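One reason character-level features cope with the spelling variation discussed above can be shown with a toy character-trigram similarity (a sketch; the tagger linked above learns its character representations rather than using fixed n-grams):

```python
def char_trigrams(word):
    """Character trigrams, padded so word boundaries count too."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def jaccard(a, b):
    """Overlap of two words' trigram sets, in [0, 1]."""
    ga, gb = char_trigrams(a), char_trigrams(b)
    return len(ga & gb) / len(ga | gb)

# Two spellings of 'the pharmacist' share most of their trigrams,
# while an unrelated word shares none.
jaccard("Lfarmassian", "Lpharmassian")
jaccard("Lfarmassian", "bonjour")
```

A word-level model treats each variant as an unrelated token, whereas character features keep the variants close together for free.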

As there is no contact information on your profile, a public message:

As far as I know, no one has actually succeeded in doing NLP for NAA or Amazigh...

It's a topic of great interest to me but unfortunately I don't have time to invest in that subject. Please keep me informed of your progress!

I'd love to see a system doing NLP in the Latin alphabet for Amazigh, and NLG to Tifinagh...

Me too. But I haven't dived into it yet. I know how to speak Tamazight but not how to spell it. I am under the impression that I could simply start by translating a dictionary into code...
