
Show HN: I trained a neural network to learn Arabic morphology - tboyd47
https://github.com/tb0yd/rootfinder
======
habeanf
Nice work! If I'm not mistaken, the root requires morphological
disambiguation, which may change depending on the context/phrase in which the
word is observed.

This is an active area of research in Morphologically Rich Languages (MRLs),
since this problem also appears in other Semitic languages like Hebrew, as
well as Turkish. There's a nice body of work to learn from, both with and
without neural nets. For example, this paper from 2017
([http://aclweb.org/anthology/D17-1073](http://aclweb.org/anthology/D17-1073))
uses a neural model for morphological disambiguation. You can see a nice
comparison of tools in the recent 2018 Universal Dependencies Shared Task
results: [http://universaldependencies.org/conll18/results-lemmas.html](http://universaldependencies.org/conll18/results-lemmas.html)
(look for ar_padt).

If you're looking for training data, the Arabic treebanks in
[http://universaldependencies.org](http://universaldependencies.org) could
help. I think some of them contain surface tokens with lemmas. I'm quite sure
they also have roots.

Also, you might want to take a look at the CoNLL-SIGMORPHON shared tasks (2017
[https://sites.google.com/view/conll-sigmorphon2017/](https://sites.google.com/view/conll-sigmorphon2017/) and 2018
[https://sigmorphon.github.io/sharedtasks/2018/](https://sigmorphon.github.io/sharedtasks/2018/))
on morphological reinflection, which IIRC is a similar task - taking an
inflected form and reinflecting it with other morphological properties. They
also have a nice data set to train on.

~~~
tboyd47
Wow, thanks! A wealth of resources here.

------
ilimilku
This is exciting to see. I am a Semitic philologist (Ph.D.) now breaking into
the IT industry, and this sort of work is on my radar, though mostly with
Hebrew and Aramaic. Arabic, being a Semitic language, has a non-linear
morphology, which means that extracting the root has to be done by extracting
non-inflectional consonants from all possible positions in a word. If you
train a NN with full conjugation paradigms, over a data set, it should be able
to begin to recognize what the various inflectional morphemes are. In other
words, instead of looking for the root, look for everything that is not the
root, and the root is what is left over. For example, the NN should be able to
recognize that mu-, ya-, ta-, 'āC-, -ā-, -Ct-, -unna, etc. are all
inflectional morphemes. It should also begin to recognize the various matres
lectionis, or letters indicating long vowels, such as alif, waw, and ha. (I'm
including vowels in my analysis, because I think like a philologist, not a
typical reader of Arabic. Using unvowelled text might be more difficult for
the NN.) Anyway, these are just some off-the-cuff thoughts. I look forward to
digging deeper into your code and methodology sometime soon.
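
To make the "root is what is left over" idea concrete, here is a toy,
hand-written sketch (not the neural approach described above, and not code
from the repo). It assumes simple romanized input, and the affix lists are a
small illustrative subset taken from the morphemes mentioned in this comment.
The function name and the vowel set are made up for the example.

```python
# Toy illustration: strip known inflectional material, keep what remains.
# PREFIXES/SUFFIXES are a tiny, assumed subset, not a real morpheme inventory.
PREFIXES = ("mu", "ya", "ta")
SUFFIXES = ("unna",)
VOWELS = set("aiuāīū")


def leftover_consonants(word: str) -> str:
    """Return the consonants left after removing one prefix, one suffix,
    all vowels, and doubled (geminated) consonants."""
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix):
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix):
            stem = stem[:-len(suffix)]
            break
    root = []
    for ch in stem:
        if ch in VOWELS:
            continue
        # Collapse gemination (e.g. "rr" in mudarris) into one radical.
        if not root or root[-1] != ch:
            root.append(ch)
    return "-".join(root)


print(leftover_consonants("mudarris"))    # d-r-s
print(leftover_consonants("yaktubu"))     # k-t-b
print(leftover_consonants("taktubunna"))  # k-t-b
```

This breaks quickly on real data: a root that itself begins with m, y, or t
gets over-stripped, roots with two identical adjacent radicals are wrongly
collapsed, and romanized digraphs such as "sh" or "kh" are not handled. That
gap is presumably where a learned model has room to do better.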

~~~
tboyd47
Thanks, that's awesome! I am a software engineer and long time student of
Arabic. You're pretty much on the mark with the capabilities of the model at
this point. It can recognize the simple morphemes and long vowels but stumbles
on more complex constructions. Definitely ping me on GitHub if you have any
questions about anything in the repo or if you just want to talk shop about
linguistics / data science.

------
ztravis
This is an interesting project, although as others have mentioned the space of
Arabic words is (reasonably) bounded and an explicit parsing approach or
something that uses known language data might prove to be more efficient and
accurate.

Along those lines, I might be able to provide some useful json (the basis for
[http://www.arabicreference.com](http://www.arabicreference.com)) in case you
are interested. I've been meaning to do some fun investigations using this
data (e.g. predicting broken plurals, masadir, form I internal vowelling) but
haven't yet had time.

~~~
tboyd47
Wow, really nice app! I love the simple-yet-highly-responsive interface. I'm
definitely going to be sharing this around to my translator friends. Just
curious, are you a translator?

------
nafizh
This is really cool. Interestingly, Bayyinah (www.bayyinah.com) Institute has
a site [1] for the opposite problem where you can generate the 10
morphological families from root letters.

1\. [http://oldsite.bayyinah.com/wp-
content/uploads/2015/12/sarfG...](http://oldsite.bayyinah.com/wp-
content/uploads/2015/12/sarfGenerator.html)

------
ZainRiz
Can you please explain what the inputs and outputs are supposed to be? I'm
familiar with the Arabic alphabet but I don't know the words

~~~
tboyd47
Sure. I pulled the input set from the arabeyes wordlist and used an online
service called aratools to get the answer for each.

~~~
pavel_lishin
But what are the inputs? Are they words? And what are the outputs?

I know nothing about Arabic, so to my eyes certain wiggles - that aren't a 1:1
match - are "correctly" matched, and others aren't.

~~~
ziftface
Some languages, like Arabic, have "roots" which are used to form words. These
roots usually have an abstract or ambiguous meaning. You can take those roots
and turn them into words, which have a defined meaning, by using a form, kind
of like a template. So one root for example is BDL. The kind of vague meaning
is to exchange or replace something. One template you can use is TaXXeeX, with
the root letters going in place of the Xs. So this results in the word
"tabdeel", which means "an exchange".

So what the NN does is take a word as input and try to find its root.
"Tabdeel" was the first input listed, and the output was "BDL".

Some more information on this:

[https://en.wikipedia.org/wiki/Semitic_root](https://en.wikipedia.org/wiki/Semitic_root)
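
The root-plus-template idea above can be sketched in a few lines of Python.
This is only an illustration using the romanized example from this comment;
the function names and the "X" slot convention are assumptions for the
example, not anything from the repo.

```python
def apply_template(root: str, template: str, slot: str = "X") -> str:
    """Drop the root letters, in order, into the slot positions."""
    if template.count(slot) != len(root):
        raise ValueError("template slots and root letters do not line up")
    letters = iter(root.lower())
    return "".join(next(letters) if ch == slot else ch.lower()
                   for ch in template)


def extract_root(word: str, template: str, slot: str = "X"):
    """Inverse direction: read the letters sitting in the slot positions.
    Returns None if the word does not fit the template."""
    if len(word) != len(template):
        return None
    root = []
    for w, t in zip(word.lower(), template):
        if t == slot:
            root.append(w)
        elif w != t.lower():
            return None
    return "".join(root).upper()


print(apply_template("BDL", "TaXXeeX"))    # tabdeel
print(extract_root("tabdeel", "TaXXeeX"))  # BDL
```

The second function is the direction the NN is trying to learn, except that
the network is not told which template a word uses.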

~~~
pavel_lishin
Oh, thanks!

------
rvense
I took some Arabic at university. It's a fascinating language. My impression
was that the morphology is quite regular; I wonder how complicated an
old-fashioned, hand-written parser with comparable accuracy would be.

~~~
tboyd47
It should be possible, because Arabic morphology follows a logical set of
rules (mostly). For example, given a word like متوسط, you could match it for
the standard conjugations for the 10 standard verb forms, of which it would
match for #5 (متفعل), giving و-س-ط as the root. Verb form + conjugation would
probably get you 50% of the way there (depending on how you count the number
of valid words...) and I wouldn't be surprised if it's possible with just a
regex. It would get a little harder once you go past the 10 verb forms and
into plurals and adjective forms, which are usually shorter words, but a
little less regular in their construction. It seems like it would be
cumbersome to catch all those forms algorithmically. But someone has probably
taken the time to do it.
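
A rough sketch of what the regex idea might look like, assuming unvowelled
input and covering only three participle patterns. The pattern table, the
labels, and the function name are made up for illustration; a real matcher
would need many more patterns plus handling for weak roots.

```python
import re

# Any basic Arabic letter (hamza through ya); an assumption for this sketch.
L = "[\u0621-\u064A]"

# Hypothetical, very partial pattern table: label -> compiled regex.
# Each capture group stands for one root letter (ف, ع, ل).
PATTERNS = {
    "form V": re.compile(f"مت({L})({L})({L})"),     # متفعل
    "form VIII": re.compile(f"م({L})ت({L})({L})"),  # مفتعل
    "form X": re.compile(f"مست({L})({L})({L})"),    # مستفعل
}


def candidate_roots(word: str):
    """Return (pattern label, dash-joined root) for every pattern that fits."""
    results = []
    for label, pattern in PATTERNS.items():
        match = pattern.fullmatch(word)
        if match:
            results.append((label, "-".join(match.groups())))
    return results


print(candidate_roots("متوسط"))   # [('form V', 'و-س-ط')]
print(candidate_roots("مستخدم"))  # [('form X', 'خ-د-م')]
```

Returning a list rather than a single answer is deliberate: without vowels,
more than one pattern can fit the same string, so something else would still
have to pick among the candidates.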

~~~
aogaili
Knowing Arabic, I'm not very surprised by this result. I actually don't think
there is an underlying pattern; most of the words are memorized and created by
social conventions.

~~~
rvense
It's been a while since I read up on this, but as I remember the (Western)
description is that there are the roots and the derived forms (which are
numbered one to ten/twelve), and then for each derived form there are one or
more patterns corresponding to a word class.

So the root d-r-s has derived verbs darasa and darrasa, and to each of those
correspond, say, one or more patterns for the verbal noun. But I don't think
there is exactly one pattern for verbal nouns derived from the form 2 verb
(e.g. from darrasa we get tadris, as I recall, but not all verbs that go like
fa33ala will necessarily have a masdar of the form taf3il, right?).

You're right of course, that even though the forms have prototypical
systematic semantic variation (like form 2 is usually a causative, "to teach"
versus "to learn"), it's not predictable which derived forms of a given root
enter into actual usage and with which exact meaning, and the patterns
obviously predict a lot of words that don't actually exist, and of course
Arabic speakers learn words just like speakers of any other language.

~~~
schoen
> and the patterns obviously predict a lot of words that don't actually exist

I think I remember that there are a handful of cases where speakers started
using some of the previously unattested forms in modern times to refer to new
concepts... is that right?

~~~
rvense
Well, new words have to come from somewhere... new roots can be introduced
(and be made to act like Arabic ones - like how the plural of film is aflam),
but the derivation patterns are also productive like in any other language. As
an example, though the loan word 'computer' is apparently common, there's also
the word 'hasuub' which I learned in my Modern Standard/Media Arabic course.
The exact form is not listed in Wehr's 1968 dictionary, so maybe it wasn't
used at that time, but it is straightforwardly derived from a root with same
meaning as the English word 'to compute' and has the form fa3uul which
(according to Wolfdietrich Fischer's Grammar of Classical Arabic) is an
'abstract or verbal noun', so basically a calque of the English.

(What a great topic - I miss this stuff!)

------
anonu
I like where this is going. But I'm still a bit skeptical about using neural
networks for this use case.

How big is the problem space? There's a limited set of roots and
morphologies...

Would a more rules-based approach work (more accurately)?

~~~
tboyd47
It really depends on what dialects of Arabic you are looking at and what your
goals are. My starting dataset is very small. I study classical Arabic, so I
am only looking at _fus-ha_ or MSA (not quite the same but very close), and I
only pulled a small subset of the total number of words because this started
out as just a toy project. The total number of _fus-ha_ words in use today is
probably in the low millions. But if you extend to all dialects of Arabic and
all time periods then you may reach half a billion to a billion words. If you
go from written Arabic to spoken Arabic, then there's no telling how big the
input set is. Practically infinite at that point.

For the purpose of just using the language, the morphological rules are well-
understood. One of the most popular dictionaries (Hans Wehr) is arranged not
in alphabetical order, but by root. And there are many online lexicons with
morphological metadata as listed in other comments here. So you're right,
machine learning is not necessary here (but that's not to say it couldn't be
done). This is mainly just for fun and learning.

Now, if you were to add _all_ dialects of Arabic, then you may have a use case
for ML...

------
trhway
Doesn't seem that deep, I mean the NN itself :) Any particular reason for
that? I mean it is pretty hard to imagine that this net can build a model of
the problem area.

Edit: thanks for the response, it is an interesting result that more layers
don't improve the situation.

~~~
tboyd47
I tried more than one hidden layer and it didn't seem to improve the accuracy.

------
csomar
> Correct: ['جملة', 'جمل']

جملة means a "sentence"

جمل can mean a camel (pronounced "jamel"; you can guess the origin of the
English word)

جمل can also mean "sentences" (pronounced "jomal")

~~~
msfellag
I don't see your point. Either way the root is still correct. And you failed
to mention the most relevant iteration:

جَمَلَ means to sum up, to summarize[1], or to concatenate, which is what a
sentence does to words.

[1]
[https://en.wiktionary.org/wiki/%D8%AC%D9%85%D9%84](https://en.wiktionary.org/wiki/%D8%AC%D9%85%D9%84)

~~~
csomar
My point is that it needs punctuation in order to make sense?

~~~
tboyd47
I think the formatting on the output is a bit confusing. The input is jumla
جملة (sentence) and the output is J-M-L or ج-م-ل, not the word جمل, which yes,
is a word for camel. I think it would be easier to read if I just added dashes
between each output letter to emphasize that it is not a word, but a
collection of letters.

------
ulucs
Great work! Are there any plans for Latin character support? I’m Turkish and
I’d love to be able to play with the words we’ve borrowed from Arabic but
unfortunately I don’t know the Arabic script.

~~~
tboyd47
I added some basic transliteration in the output formatting stage so people
who don't know Arabic script can understand the results. Accepting romanized
input would require a little more work, so I'd have to tinker a bit there.

------
caio1982
Congrats on such a nice non-English project! It is always lovely to see
computer linguistics hitting the front page :-)

------
senatorobama
How does Arabic compare to Indic, Latin and Chinese scripts?

~~~
rvense
It's always written cursive, i.e. the letters are joined together. The letters
represent sounds. A letter has the same sound regardless of context. Vowels
are divided into long and short ones, and the short ones aren't written. For
example, the word for "(he) wrote" is pronounced kataba, with three short
a-sounds, but is written ktb.

You can write out the short vowels but it's only done in special contexts
(like children's books, books for foreign language learners, or the Qur'an).
It's easy enough to read if you know the language, but it makes it a little
harder to learn.
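
For anyone who wants to see the difference in code: the optional short-vowel
marks are separate Unicode code points, so removing them is a one-line regex.
A small sketch follows; the function name is made up, and the range used is
the block of Arabic marks from fathatan (U+064B) through sukun (U+0652).

```python
import re

# Short-vowel and related marks: tanwin, fatha, damma, kasra, shadda, sukun.
HARAKAT = re.compile("[\u064B-\u0652]")


def strip_harakat(text: str) -> str:
    """Remove the optional vowel marks, leaving the usual unvowelled spelling."""
    return HARAKAT.sub("", text)


vowelled = "كَتَبَ"  # kataba, written with all three short a-marks
print(strip_harakat(vowelled))  # كتب
print(len(vowelled), len(strip_harakat(vowelled)))  # 6 3
```

Going the other way, from unvowelled text back to vowelled text, is the hard
direction, since several vowelled words can share one consonant skeleton.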

~~~
senatorobama
Any idea why it ended up so differently to scripts found in Eurasia?

~~~
rvense
It's actually not very different. Hebrew works in the same way, except it's
not cursive. I think they share a common ancestor in the Phoenician script,
which was also used for a Semitic language, for which it works well. The Greek
alphabet (and thence ours) was derived from the Phoenician script, which
worked in a similar fashion, by adding vowels.

