
This is exciting to see. I am a Semitic philologist (Ph.D.) now breaking into the IT industry, and this sort of work is on my radar, though mostly with Hebrew and Aramaic. Arabic, being a Semitic language, has a non-linear morphology, which means that extracting the root has to be done by extracting non-inflectional consonants from all possible positions in a word. If you train an NN on full conjugation paradigms over a data set, it should be able to begin to recognize what the various inflectional morphemes are. In other words, instead of looking for the root, look for everything that is not the root, and the root is what is left over. For example, the NN should be able to recognize that mu-, ya-, ta-, 'āC-, -ā-, -Ct-, -unna, etc. are all inflectional morphemes. It should also begin to recognize the various matres lectionis, or letters indicating long vowels, such as alif, waw, and ha. (I'm including vowels in my analysis, because I think like a philologist, not a typical reader of Arabic. Using unvowelled text might be more difficult for the NN.) Anyway, these are just some off-the-cuff thoughts. I look forward to digging deeper into your code and methodology sometime soon.
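To make the "remove everything that is not the root, and keep what is left over" idea concrete, here is a minimal rule-based sketch in Python. It is not the neural approach described above and not the code from the repo; it only shows the leftover logic on vowelled text. The Latin transliteration, the function names, and the affix lists are all assumptions for illustration: mu-, ya-, ta-, and -unna come from the comment above, while na-, -ūna, -u, and -a are added here just to make the examples work.

```python
# Hypothetical affix lists on a simplified Latin transliteration.
# A trained model would have to learn these; here they are hard-coded.
PREFIXES = ("mu", "ya", "ta", "na")
SUFFIXES = ("unna", "ūna", "u", "a")  # ordered longest first
VOWELS = set("aiuāīū")                # short and long vowels


def strip_affixes(word: str) -> str:
    """Remove at most one known prefix and one known suffix."""
    for prefix in PREFIXES:
        # Keep at least three characters so the stem is not emptied out.
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    return word


def leftover_consonants(word: str) -> str:
    """Return the consonants that remain after affixes and vowels are removed."""
    stem = strip_affixes(word)
    return "".join(ch for ch in stem if ch not in VOWELS)


if __name__ == "__main__":
    for form in ("kataba", "yaktubu", "taktubu", "yaktubūna", "kātib"):
        print(form, "->", leftover_consonants(form))
```

Each of the five forms in the demo loop reduces to "ktb". The sketch also shows why fixed rules are not enough: a quadriliteral verb such as "tarjama" comes out as "rjm", because the initial "ta" is mistaken for a prefix. Telling a root consonant apart from an inflectional one in context is exactly the part a model trained on full paradigms would have to learn.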

Thanks, that's awesome! I am a software engineer and longtime student of Arabic. You're pretty much on the mark with the capabilities of the model at this point. It can recognize the simple morphemes and long vowels but stumbles on more complex constructions. Definitely ping me on GitHub if you have any questions about anything in the repo or if you just want to talk shop about linguistics / data science.
