I specifically want to address North African languages (Algerian, Moroccan, Tunisian), commonly considered dialects of Arabic, written in the Latin alphabet, with heavy influences from Romance languages (French, Spanish) and Berber (an indigenous regional language) as well.
My intent is to be able to parse a sentence written in this language and extract the subject, the action, and the object. I would also like to build a semantic model of the language (e.g. word2vec, if I am not abusing the term) relating similar words.
I have tried models for Arabic, but they fall short because of the borrowings from other languages, and also because the grammar differs in many cases.
Also, on the one hand, transliterating sentences written in North African Arabic (NAA) in the Latin alphabet into the regular Arabic alphabet is a challenging task (many transliterations are possible for a given word). On the other hand, NAA is not standardized, so words are commonly written phonetically using a loose set of transcription rules, also borrowed from other languages.
An example of this last aspect is 'the pharmacist', which in NAA could be written in any of these forms (combinations of the changes are also possible):
- L'farmassian
- Lfarmassian
- Lpharmassian
- Lpharmacian
- l frmsian
...
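To make the variation concrete: one cheap way to collapse such spellings into a single key is a handful of loose orthographic rewrite rules. The rules below are only illustrative guesses picked to fit this one example, not an actual NAA transcription standard:

```python
import re

def normalize(word):
    """Collapse NAA spelling variants into a rough canonical key.
    The rules are illustrative assumptions, not a real standard."""
    w = word.lower()
    w = re.sub(r"[' ]", "", w)        # drop apostrophes and spaces (l' / l / "l ")
    w = w.replace("ph", "f")          # French-style 'ph' -> 'f'
    w = re.sub(r"c(?=[ie])", "s", w)  # soft 'c' before i/e -> 's'
    w = re.sub(r"[aeiou]", "", w)     # drop vowels, which vary the most
    w = re.sub(r"(.)\1+", r"\1", w)   # collapse doubled consonants
    return w

variants = ["L'farmassian", "Lfarmassian", "Lpharmassian",
            "Lpharmacian", "l frmsian"]
print({normalize(v) for v in variants})  # all five collapse to one key
```

A rule list like this will never cover everything, but it can serve as a normalization step before feeding the text to a statistical model.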
I'm not an expert in NLP, but my understanding is that many models are at least partly statistical these days, so you will probably need a set of labeled data.
First, you should figure out what type of parsing you want. I would recommend looking at Stanford's CoreNLP library to see which task you actually need. There are multiple ways to parse grammar; once you can name the actual problem you want to solve, it should be googleable.
The downside of classical NLP is that you need to learn some amount of linguistics to create labeled parse trees for your data or even interpret them.
So, if your goal is to build an application, rather than a library, you may want to learn about neural nets/LSTMs. They can let you go from language to the actual information you want without you needing to encode and interpret parse trees.
The downside of neural nets is that they tend to need more data, but the data is much simpler, so you could farm this out to Mechanical Turk if you wanted.
It uses Arabic characters, but at least it's labelled data of Maghrebi Arabic. So you'd "only" have to perform a transliteration, or multiple transliterations, between that corpus and your data.
My hobby involves free-form Internet text, which has similar normalisation problems. I found that for my purposes, character-level models were quite effective, partly because they were easy to train on not much data and usually quite robust to minor misspellings[1]. I also developed a part-of-speech tagger[2] which might be useful to you, assuming your corpus has verb and noun tags available. If you've got some more questions, my email's in my profile.
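To illustrate why character-level representations tolerate misspellings (a minimal sketch, not the tagger mentioned above): compare words by the overlap of their character n-gram sets, so that variants sharing most of their letter sequences score as similar even when no single spelling matches.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams, with '#' marking word boundaries."""
    w = "#" + word.lower() + "#"
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def similarity(a, b, n=3):
    """Jaccard overlap of character n-gram sets: 0 (disjoint) to 1 (identical)."""
    A, B = char_ngrams(a, n), char_ngrams(b, n)
    return len(A & B) / len(A | B)

# Spelling variants of the same word score far higher than unrelated words:
print(similarity("Lfarmassian", "Lpharmassian"))  # ~0.53
print(similarity("Lfarmassian", "doctor"))        # 0.0
```

The same idea scales up: character n-gram features (as in fastText-style subword embeddings) give a model partial credit for shared substrings instead of treating every spelling as a distinct token.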
Me too. But I've not dived into it yet. I know how to speak Tamazight but not how to spell it. I am under the impression that I could simply start by translating a dictionary into code...
Thanks for the reference, I will check it out!