
Agglutinative Language - cubecul
https://en.wikipedia.org/wiki/Agglutinative_language
======
bonoboTP
I often wonder how much of a head start the isolating nature of English gave
for computing. It allowed ignoring a lot of inflectional and agglutinative
complexity.

Concretely I mean it's very easy to generate text using sentence templates.
Just plug in words and it works out. "The $process_name has completed
running." "Like $username's comment" "Ban $username".

Relatedly, I think focusing NLP efforts on English masks a lot of interesting
phenomena, because English text already comes in a reasonably tokenized,
chunked up and pre-digested, easy to handle form. For example speech
recognition systems started out with closed vocabularies, with larger and
larger numbers of words, and even in their toy forms you could recognize some
proper English sentences. To do that in Hungarian for example, the "upfront
costs" to a "somewhat usable" system are much higher, because closed
vocabulary doesn't get you anywhere. (Similarly, learning basic English is
very easy, you can build 100% correct sentences on day 1, you learn "I",
"you", "see" and "hear" and can say "I see" and "You see" and "I see you" and
"I hear Peter" which are all 100% correct. In Hungarian these are "nézek",
"nézel", "nézlek", "hallom Pétert" requiring learning several suffixes and
vowel harmony and definite/indefinite conjugation. The learning curve till
your first 100% correct 3-5 word sentences is just steeper.)

I don't mean it's impossible to handle agglutinative languages in NLP, I just
mean the "minimum viable model" is much simpler and attainable for English,
which on the one hand was able to kickstart and propel the early research
phases and on the other hand perhaps fueled a bit too much optimism.

English can seem very well structured, and that can tempt one to think of
language in a very symbolic, within-the-box, rule-based way: in terms of
syntax trees, sets of valid sentences, etc., instead of the "fuzzy
probabilistic mess" that it really is. Certainly, the syntax-tree,
generative-grammar approach (Chomsky and others) gave us a lot of computer
science, but this kind of "clean" and pure symbolic parsing doesn't seem to be
what drives today's NLP progress.

In summary, I wonder how linguistics and especially computational linguistics
and NLP would have evolved in a non-Anglo culture, e.g. Slavic or Hungarian.

~~~
romwell
>I often wonder how much of a head start the isolating nature of English gave
for computing.

That's like saying "I wonder when you stopped beating your wife"; you assume
there _was_ a head start, when, in fact, the world's first commercial computer
was German[1].

And until recently, natural languages had a near-zero effect on computing.
Worst case, users ended up seeing messages which weren't grammatically
perfect, and it wasn't a big deal.

>I wonder how linguistics and especially computational linguistics and NLP
would have evolved in a non-Anglo culture, e.g. Slavic

 _Would have_? NLP has only started to matter recently, at a time when it has
to work in all languages from the get-go. The current evolution includes
contributions of people from many languages and cultures.

And for that matter, English makes a lot of things harder.

[1][https://en.wikipedia.org/wiki/Z4_(computer)](https://en.wikipedia.org/wiki/Z4_\(computer\))

~~~
bonoboTP
> you assume there was a head start, when, in fact, the world's first
> commercial computer was German

Did the Z4 do a lot of German-language text generation, or German-language
input parsing? But anyway, German is not agglutinative either, though it does
have complexities like gendered declension of articles and adjectives.

> And until recently, natural languages had a near-zero effect on computing.

Seems like we're talking past each other; I packed multiple things into my
comment. I meant user-facing messages there. I've done some software
internationalization (translation) work some years ago and in many cases the
format was just templates. You were often expected to translate templates with
pluggable strings. Whereas what you would actually need is to write a function
that looks at the word that you want to plug in, extracts the vowels,
categorizes them with some branching logic, looks at the last consonant,
decides if you need a linking vowel, decides on the vowel harmony based on the
vowels, looks up whether the word is an exception, and then applies the
suffix.
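That kind of branching can be sketched like this. The vowel classes are the real Hungarian back/front harmony classes and -nak/-nek is the real dative suffix, but this is a minimal illustration: real morphology has stem changes, neutral vowels in mixed words, and plenty of exceptions this function ignores.

```python
# Minimal sketch of vowel-harmony-based suffix selection for Hungarian.
# Not a complete morphology: stem changes and exceptional words are ignored.

BACK_VOWELS = set("aáoóuú")
FRONT_VOWELS = set("eéiíöőüű")

def dative(word: str) -> str:
    """Append the dative suffix -nak/-nek, chosen by vowel harmony."""
    # Harmony is usually decided by the last back/front vowel in the word.
    for ch in reversed(word.lower()):
        if ch in BACK_VOWELS:
            return word + "nak"
        if ch in FRONT_VOWELS:
            return word + "nek"
    return word + "nek"  # fallback for vowel-less strings, e.g. initialisms

print(dative("Péter"))  # Péternek
print(dative("János"))  # Jánosnak
```

Even this toy version is more logic than a `%s` template slot can express, which is the whole problem with template-only i18n formats.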

In English you can generate the message "Added %s to the %s." These are
usually translated to Hungarian as if it was "%s has been added to the
following: %s". Or instead of "with %s" they must write "with the following:
%s", because applying "with" to a word or personal name requires non-trivial
logic. Whenever the translators resort to "... the following: %s", you can
tell they weren't able to fit it into the sentence with proper grammar,
because the internationalization relied on overly primitive string
interpolation.

Until recently, Facebook was not able to apply declension to people's names,
as it is quite complicated. Normally "$person_name likes this post." would
require putting $person_name into dative case, requiring determination of
vowel harmony. To avoid it, they picked a rarer verb form which doesn't need
the dative case but doesn't sound as natural. They've only transitioned to the
dative case in the last year or so.

A lot of this stuff is just not even on the mind of English speaking devs,
because template-based string interpolation is a good enough solution in
English for the vast majority of cases. The only exceptions that would need a
little bit of branching logic are choosing "a" or "an" before a word, and
pluralization, but these don't come up too often.
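For comparison, here is roughly all the branching English templates ever need. Both rules are deliberately naive; the comments note the well-known exceptions.

```python
# The little bit of branching English UI templates need, sketched naively.

def indefinite_article(noun: str) -> str:
    # Naive vowel-letter check; words like "hour" or "university" break it.
    return ("an " if noun[0].lower() in "aeiou" else "a ") + noun

def pluralize(noun: str, count: int) -> str:
    # Naive "+s"; irregular plurals (child/children) need a lookup table.
    return noun if count == 1 else noun + "s"

print(indefinite_article("error"))  # an error
print(pluralize("file", 3))         # files
print(pluralize("file", 1))         # file
```

Two tiny functions cover most of it, which is exactly why English-speaking devs rarely notice that template interpolation is a language-specific shortcut.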

Again, my point was that dynamically generating user-facing messages and UI
elements is easy in English, while doing it properly in other languages is
much harder.

> Would have? NLP has only started to matter recently, at a time when it has
> to work in all languages from the get-go. The current evolution includes
> contributions of people from many languages and cultures.

Most of the research outside of explicit machine translation is still based on
English. How many papers are out there, e.g., on visual question answering
(VQA) systems in Polish or Finnish? In many cases I feel less impressed by
such systems because I feel English is too easy: the word order is very
predictable, the words are easily separable, the whole thing is much more
machine-processable. Maybe that's not so; it would be interesting to see
empirical results.

~~~
romwell
Ah. On that note, I guess my point was that language was never an impediment
to UI.

Sure, some things will be easier in English. In other languages, the
programmers would just roll with whatever is easier to code; the users would
gobble it up as long as it's usable.

Back in the '90s, I saw pirated software "internationalized" by running the UI
keywords through machine translation into Russian. Knowing English was an
advantage: if you translated the UI back into English, you could figure out
what some of those things did. Still, it existed.

The complexity of language wasn't an impediment, it just lowered expectations
for the quality of user interfaces.

------
sansnomme
Turkish is probably strict enough to be used as a programming language. The
only downside is that its vocabulary is utterly alien for most speakers of
Latin/Anglo-Saxon languages aside from some borrowed words from French and
Arabic.

~~~
yabadabadoes
It's actually quite a bit easier to learn, since it has few false friends with
Latin languages. I've often thought, though, that search engines written by
English speakers and focused on bags of words can't work very well in Turkish.
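The bag-of-words problem can be seen in miniature. The Turkish forms below are real inflections of "ev" (house): evde "in the house", evden "from the house", evlerimizden "from our houses". A whitespace tokenizer treats each one as an unrelated term, so a query for the bare lemma misses them all without a morphological analyzer.

```python
from collections import Counter

# A plain whitespace bag-of-words, as an early English-centric engine
# might index text. Each inflected Turkish surface form of "ev" (house)
# becomes a separate, unrelated term.

def bag_of_words(text: str) -> Counter:
    return Counter(text.lower().split())

index = bag_of_words("ev evde evden evlerimizden")
print(index["ev"])  # 1  -- a query for "ev" misses the other three forms
print(len(index))   # 4  -- four distinct "words" for one lemma
```

In English the same trick works tolerably well because most lemmas have only a handful of surface forms.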

~~~
rolleiflex
Not only search, but also autocorrect. Turkish autocorrect on iOS is a flaming
disaster even after a decade.

Here’s a real (if unlikely) word in Turkish and how this whole agglutination
business works:
[https://twitter.com/languagecrawler/status/62385880386859827...](https://twitter.com/languagecrawler/status/623858803868598272?s=21)

I don’t blame Apple though - it might actually be just impossible to do
Turkish autocorrect in the same way English autocorrect works, because the
beginning of the word indicates the actual word but the end indicates
everything else (direction, modifiers etc.). So it’s about as easy/hard as
English to guess the beginning of the word, but impossible to guess the
modifiers that get added because the moment the modifier sequence starts,
every single letter starts to change the meaning, thus there are almost no
incorrect paths. A correct Turkish autocorrect implementation would
autocomplete the word root, but stop at the halfway-complete word where the
modifier suffixes start, so that the user can complete the modifier sequence
on their own.
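The complete-the-root-then-stop strategy might look something like this. The two-entry root list is purely illustrative (both are real Turkish words around "success", the stem of the famously long word in the linked tweet); a real implementation would use a full root lexicon, likely in a trie.

```python
from typing import Optional

# Sketch: complete the typed prefix up to the longest known dictionary
# root, then stop where the suffix chain would begin. Tiny root list is
# illustrative only.

ROOTS = ["muvaffak", "muvaffakiyet"]  # "successful", "success"

def complete_root(prefix: str) -> Optional[str]:
    """Return the longest known root extending the prefix, if any."""
    candidates = [r for r in ROOTS if r.startswith(prefix)]
    return max(candidates, key=len) if candidates else None

print(complete_root("muvaf"))  # muvaffakiyet -- user then types the suffixes
print(complete_root("xyz"))    # None
```

Everything after the root is left to the user, since, as described above, nearly every suffix path is a valid word, so the software has no signal to correct against.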

~~~
bonoboTP
Seems like you're talking about autocompletion, not autocorrect. In
autocorrect you have completed the word, hit space and then the software fixes
your typos. In autocomplete you get a list of suggested words while typing and
you can tap them if your intended word is shown.

~~~
rolleiflex
No - while auto completion is also broken, I’m talking about autocorrect. It’s
a very common occurrence in Turkish iOS that something that you typed in
_correctly_ gets autocorrected to something else that makes no sense.

------
romwell
Sumerian, an agglutinative language, is an important plot point in a famous
cyberpunk novel, Snow Crash by Neal Stephenson[1] (which also popularized the
word "avatar" as we use it today).

If you find the concept interesting, you will enjoy reading the novel.

[1][https://en.wikipedia.org/wiki/Snow_Crash](https://en.wikipedia.org/wiki/Snow_Crash)

------
Bootwizard
Can someone here explain this in an easier to understand way? This was a bit
too dense for my understanding...

~~~
bonoboTP
Agglutinative just means you glue (the -glu- refers to this) pieces (suffixes)
at the end of words to express lots of things. This exists in English as well,
but in restricted forms. For example blue+ish, quick+ly, blue+ness, look+ed.
In an agglutinative language, this is how most things are expressed.

For example a totally normal Hungarian word is: szolgáltatásaiért =
szolgá+l+tat+ás+a+i+ért = for his/her/its services. Szolga means servant, from
Slavic origin. Szolgál is a verb meaning to serve. Szolgáltat means to provide
service. Szolgáltatás means service (as in "goods and services", "internet
service", etc.). Szolgáltatása means his/her/its service. Szolgáltatásai means
his/her/its services. Szolgáltatásaiért means "for his/her/its services".

~~~
niftich
More on this:

szolga = servant

+ l = verb-forming suffix [1]

+ tat = causative suffix [2]

+ ás = gerund-forming suffix, makes a verbal noun from a verb [3]

+ ai = marker for plural possession [4]

+ ért = causal-final suffix, denotes the reason for the action [5]

These suffixes are morphemes you can (with some simple rules) just add to
words to achieve the corresponding change in meaning. In CS terms, they're
functions that take the input word and make a new one. Each one of the above,
used somewhere else:

winter -> to spend the winter: tél + l -> telel

to read -> to make them read: olvas + tat -> olvastat

to read -> the reading: olvas + ás -> olvasás

house -> his houses: ház + ai -> házai

house -> because of, affecting the house: ház + ért -> házért
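The "functions that take the input word and make a new one" view can be taken literally. These toy suffix functions skip the stem changes real Hungarian applies (e.g. tél -> tel- before -el), so only the regular cases come out right, but they do compose:

```python
# Suffixes as composable word-to-word functions, per the CS analogy above.
# No stem changes or vowel-harmony variants: regular cases only.

def ai(word: str) -> str:
    return word + "ai"    # plural possession: his/her Xs

def ert(word: str) -> str:
    return word + "ért"   # causal-final: for / because of X

print(ai("ház"))       # házai   -- his houses
print(ert("ház"))      # házért  -- for the house
print(ert(ai("ház")))  # házaiért -- for his houses
```

Function composition mirrors suffix stacking: ért(ai("ház")) builds "házaiért" the same way the suffix chain builds the word.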

[1]
[https://en.wiktionary.org/wiki/-l#Hungarian](https://en.wiktionary.org/wiki/-l#Hungarian)
[2]
[https://en.wiktionary.org/wiki/-tat#Hungarian](https://en.wiktionary.org/wiki/-tat#Hungarian)
[3]
[https://en.wiktionary.org/wiki/-%C3%A1s#Hungarian](https://en.wiktionary.org/wiki/-%C3%A1s#Hungarian)
[4]
[https://en.wiktionary.org/wiki/-ai#Hungarian](https://en.wiktionary.org/wiki/-ai#Hungarian)
[5]
[https://en.wiktionary.org/wiki/-%C3%A9rt](https://en.wiktionary.org/wiki/-%C3%A9rt)

------
beefman
More broadly, synthetic languages are like statically-typed programming
languages, whereas analytic languages[1] are like dynamically-typed
programming languages.

Also, intransitive verbs[2] are like thunks.

[1]
[https://en.wikipedia.org/wiki/Analytic_language](https://en.wikipedia.org/wiki/Analytic_language)

[2]
[https://en.wikipedia.org/wiki/Intransitive_verb](https://en.wikipedia.org/wiki/Intransitive_verb)
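Taking the thunk analogy literally: an intransitive verb needs no object, like a zero-argument callable, while a transitive verb takes one. The verb choices below are just illustrations.

```python
from typing import Callable

def sleep() -> str:        # intransitive: "Peter sleeps."
    return "zzz"

def see(obj: str) -> str:  # transitive: "Peter sees the house."
    return "looking at " + obj

thunk: Callable[[], str] = sleep  # an intransitive verb "type-checks" as a thunk
print(thunk())            # zzz
print(see("the house"))   # looking at the house
```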

------
monkeycantype
I was just reading this yesterday after the term came up in a Japanese grammar
book.

------
foobar_
Forth is probably the only agglutinative programming language, in a way.

