I often wonder how much of a head start the isolating nature of English gave computing. It allowed a lot of inflectional and agglutinative complexity to simply be ignored.
Concretely I mean it's very easy to generate text using sentence templates. Just plug in words and it works out. "The $process_name has completed running." "Like $username's comment" "Ban $username".
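(In code terms this really is just plain substitution; a toy Python sketch using the templates above, with nothing else assumed.)

    from string import Template

    # Plain substitution is all English needs here: no case, no agreement, no vowel harmony.
    like_tpl = Template("Like $username's comment")
    ban_tpl = Template("Ban $username")
    done_tpl = Template("The $process_name has completed running.")

    print(like_tpl.substitute(username="Peter"))        # Like Peter's comment
    print(ban_tpl.substitute(username="Peter"))         # Ban Peter
    print(done_tpl.substitute(process_name="backup"))   # The backup has completed running.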
Relatedly, I think focusing NLP efforts on English masks a lot of interesting phenomena, because English text already comes in a reasonably tokenized, chunked-up, pre-digested, easy-to-handle form. For example, speech recognition systems started out with closed vocabularies, with larger and larger numbers of words, and even in their toy forms they could recognize some proper English sentences. To do that in Hungarian, for example, the "upfront costs" of a "somewhat usable" system are much higher, because a closed vocabulary doesn't get you anywhere. (Similarly, learning basic English is very easy: you can build 100% correct sentences on day one. You learn "I", "you", "see" and "hear" and can say "I see", "You see", "I see you" and "I hear Peter", which are all 100% correct. In Hungarian these are "nézek", "nézel", "nézlek", "hallom Pétert", which require learning several suffixes, vowel harmony and the definite/indefinite conjugation. The learning curve to your first 100% correct 3-5 word sentences is just steeper.)
I don't mean it's impossible to handle agglutinative languages in NLP, I just mean the "minimum viable model" is much simpler and more attainable for English, which on the one hand kickstarted and propelled the early research phases, and on the other hand perhaps fueled a bit too much optimism.
English can seem very well structured, and it can tempt one to think of language in a very symbolic, within-the-box, rule-based way: in terms of syntax trees, sets of valid sentences etc., instead of the "fuzzy probabilistic mess" that it really is. Sure, the syntax-tree, generative-grammar approach (Chomsky and others) gave us a lot of computer science, but this kind of "clean", pure symbolic parsing doesn't seem to be what drives today's NLP progress.
In summary, I wonder how linguistics and especially computational linguistics and NLP would have evolved in a non-Anglo culture, e.g. Slavic or Hungarian.
>I often wonder how much of a head start the isolating nature of English gave for computing.
That's like saying "I wonder when you stopped beating your wife"; you assume there was a head start, when, in fact, the world's first commercial computer was German[1].
And until recently, natural languages had a near-zero effect on computing. Worst case, users ended up seeing messages which weren't grammatically perfect, and it wasn't a big deal.
>I wonder how linguistics and especially computational linguistics and NLP would have evolved in a non-Anglo culture, e.g. Slavic
Would have? NLP has only started to matter recently, at a time when it has to work in all languages from the get-go. The current evolution includes contributions of people from many languages and cultures.
And for that matter, English makes a lot of things harder.
> you assume there was a head start, when, in fact, the world's first commercial computer was German
Did the Z4 do a lot of German-language text generation, or German-language input parsing? In any case, German is not agglutinative either, though it does have complexities like gendered declension of articles and adjectives.
> And until recently, natural languages had a near-zero effect on computing.
Seems like we're talking past each other; I packed multiple things into the comment. I meant user-facing messages there. I did some software internationalization (translation) work some years ago, and in many cases the format was just templates: you were expected to translate templates with pluggable strings. Whereas what you would actually need is to write a function that looks at the word you want to plug in, extracts the vowels, categorizes them with some branching logic, looks at the last consonant, decides if you need a linking vowel, decides on the vowel harmony based on the vowels, looks up whether the word is an exception, and only then applies the suffix.
In English you can generate the message "Added %s to the %s." These are usually translated to Hungarian as if they were "%s has been added to the following: %s". Or instead of "with %s" the translators must write "with the following: %s", because applying "with" to a word or personal name requires non-trivial logic. Whenever they resort to "... the following: %s", you can tell they weren't able to fit the value into the sentence with proper grammar, because the internationalization relied on string interpolation that was too primitive.
Until recently, Facebook was not able to apply declension to people's names, as it is quite complicated. Normally "$person_name likes this post." would require putting $person_name into the dative case, which requires determining vowel harmony. To avoid that, they picked a rarer verb form which doesn't need the dative case but doesn't sound as natural. They only transitioned to the dative case in the last year or so.
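(To make that concrete, here is a deliberately oversimplified Python sketch of just the dative suffix choice; real Hungarian also has mixed and neutral-vowel words, stem changes and exception lists, all of which this ignores.)

    # Toy dative (-nak/-nek) chooser based on vowel harmony.
    # Oversimplified: no mixed/neutral-vowel handling, no stem changes, no exceptions.
    BACK_VOWELS = set("aáoóuú")
    FRONT_VOWELS = set("eéiíöőüű")

    def dative(name):
        vowels = [ch for ch in name.lower() if ch in BACK_VOWELS or ch in FRONT_VOWELS]
        # In this toy version the last vowel decides the harmony class.
        suffix = "nak" if vowels and vowels[-1] in BACK_VOWELS else "nek"
        return name + suffix

    print(dative("Gábor"))   # Gábornak
    print(dative("Péter"))   # Péternek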
A lot of this stuff is just not even on the mind of English-speaking devs, because template-based string interpolation is a good-enough solution in English for the vast majority of cases. The only exceptions that need a little branching logic are choosing "a" or "an" before a word, and pluralization, but these don't come up too often.
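(Roughly all the branching an English UI template ever needs, as a toy Python sketch; the naive vowel test and "+s" pluralization aren't fully correct either, think "hour" or "box".)

    # About the most branching an English UI string tends to need.
    def with_article(noun):
        article = "an" if noun[:1].lower() in "aeiou" else "a"
        return f"{article} {noun}"

    def pluralize(noun, count):
        return noun if count == 1 else noun + "s"   # naive, but fine for most UI strings

    print(f"Added {with_article('item')} to the list.")   # Added an item to the list.
    print(f"You have 3 {pluralize('comment', 3)}.")       # You have 3 comments.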
Again, my point was that dynamically generating user-facing messages and UI elements is easy in English, while doing it properly in other languages is much harder.
> Would have? NLP has only started to matter recently, at a time when it has to work in all languages from the get-go. The current evolution includes contributions of people from many languages and cultures.
Most of the research outside of explicit machine translation is still based on English. How many papers are out there, e.g., on visual question answering (VQA) systems in Polish or Finnish? In many cases I'm less impressed by such systems because English feels too easy: the word order is very predictable, the words are easily separable, the whole thing is much more machine-processable. Maybe it isn't so; it would be interesting to see empirical results.
Ah. On that note, I guess my point was that language was never an impediment to UI.
Sure, some things will be easier in English. In other languages, the programmers would just roll with whatever is easier to code; the users would gobble it up as long as it's usable.
Back in the '90s, I saw pirated software "internationalized" by running the UI keywords through machine translation into Russian. Knowing English was an advantage: if you translated the UI back into English, you could figure out what some of those things did. Still, it existed.
The complexity of language wasn't an impediment, it just lowered expectations for the quality of user interfaces.
Agglutinative languages would probably work as well as isolating languages, since they tend to work by just shoving things on the end of words rather than inflecting them. It does potentially raise a segmentation problem, but I'm not really sufficiently familiar with any agglutinative language to know how hard a problem it is in practice.
The difficult languages are inflectional languages, where you make things completely different instead of just tacking something on the end.
It's worth pointing out that all whitespace is completely optional in Fortran, the first programming language--doi=0,10 is exactly the same as DO I = 0, 10. So it's not like early computing relied heavily on gratuitous whitespace.
Possibly less than you think. (I'm not addressing the NLP part)
For example, every language community is already used to math notation. Programming languages draw more inspiration from math notation than from English.
That leaves naming. Here agglutinative languages should have an advantage: you can have more natural ways to describe roles, like how in English we may have caller and callee, rather than more clumsily camel-casing something like sumOfLists.
> linguistics and especially computational linguistics and NLP would have evolved in a non-Anglo culture, e.g. Slavic or Hungarian.
Probably not much different, except that more elements of morphology would be treated together with syntax.
> Concretely I mean it's very easy to generate text using sentence templates. Just plug in words and it works out. "The $process_name has completed running." "Like $username's comment" "Ban $username".
If computing were primarily championed by a fusional language (agglutinative languages usually have somewhat "clean" morphology), I imagine that libraries for inflection would be more prominently used, like how in English the more professional apps use a pluralizer library. One natural shape for an inflection library is a fluent API.
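(A purely hypothetical sketch of what such a fluent inflection API could look like, in Python; the class, method names and naive concatenation are invented for illustration, not taken from any real library.)

    # Hypothetical fluent inflection API; everything here is made up for illustration.
    class Inflect:
        def __init__(self, stem):
            self.form = stem

        def plural(self):
            self.form += "s"           # a real library would apply language-specific rules
            return self

        def possessive(self):
            self.form += "'s"
            return self

        def __str__(self):
            return self.form

    # Chaining keeps the morphology out of the message template itself.
    print(f"Deleted {Inflect('user').plural()} from the group.")   # Deleted users from the group.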
Certainly English's morphosyntactic simplicity helped out NLP; your phrase "minimum viable model" hits the nail on the head. But increasingly over the last 5-10 years there has been a lot of progress on techniques for handling morphological complexity. Some of the unsupervised tokenization methods that first saw use for English (e.g. Goldsmith's work) now see play for agglutinative languages: see here for example[0]. So it's not clear to me whether NLP in a non-Anglo culture would just use the same techniques (arriving at practical achievements a decade later) or whether there would be fundamentally different techniques that are totally unobvious to me now.
Re your point on language being a "[f]uzzy probabilistic mess": language is absolutely NOT a fuzzy probabilistic mess, and it's a damn shame that NLP based its success on black-box models, because it means no one bothers to realize that language isn't a mess at all. See Jelinek's law of speech recognizer accuracy [1]. Simply because we get results using messy black-box models doesn't mean that's how things work under the hood.
Turkish is probably strict enough to be used as a programming language. The only downside is that its vocabulary is utterly alien to most speakers of Latin/Anglo-Saxon languages, aside from some words borrowed from French and Arabic.
It's actually quite a bit easier to learn, since it has few false friends with Latin languages. I've often thought, though, that search engines written by English speakers and focused on bags of words can't work very well in Turkish.
All languages with synthetic morphology (both agglutinative languages, which glue chains of morphemes together, and fusional languages, which inflect morphemes) struggle with the language modelling techniques used for English.
A big issue is that in synthetic languages individual word forms are much rarer (because there are many more possible morpheme combinations per word). So if you're building something like a bag-of-words or an n-gram model, your input data is likely to be very sparse, which translates into poor modelling of the language itself, i.e. of which words speakers would judge as grammatical.
With agglutinative languages like Turkish, a technique that has been used with considerable success is just considering each morpheme a distinct token, but it has many of the same problems as word-level tokenization. I was looking at a paper recently that claimed to have found a good way to do smoothing so that unseen n-grams could be assigned a non-zero probability in a way that conformed to the rules of the language, but we'll have to see if that can work in practice.
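(A toy way to see the sparsity point in Python; the stems and endings below are only illustrative, and vowel harmony plus stem changes are ignored, so some generated forms aren't real words. The point is just the combinatorics.)

    from itertools import product

    # Toy combinatorics of a synthetic language: every stem+possessive+case combination
    # is a distinct surface word a word-level model would have to see in training.
    stems = ["ház", "kert", "hajó"]
    possessive = ["", "am", "ad", "a"]        # (rough) my/your/his-her
    case = ["", "ban", "ból", "ért", "nak"]   # (rough) in/from/for/to

    surface_words = {s + p + c for s, p, c in product(stems, possessive, case)}
    morphemes = set(stems) | {m for m in possessive + case if m}

    print(len(surface_words))   # 60 distinct word types for a word/n-gram model
    print(len(morphemes))       # 10 units for a morpheme-level tokenizer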
I don't blame Apple though - it might actually be just impossible to do Turkish autocorrect in the same way English autocorrect works, because the beginning of the word indicates the actual word but the end indicates everything else (direction, modifiers etc.). So it's about as easy/hard as in English to guess the beginning of the word, but impossible to guess the modifiers that get added, because the moment the modifier sequence starts, every single letter changes the meaning, so there are almost no incorrect paths. A correct Turkish autocorrect implementation would autocomplete the word root, then stop at the half-complete word where the modifier suffixes start, so that the user can finish the suffix sequence on their own.
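(A sketch of that "complete the root, then stop" idea in Python; the root list and behaviour are illustrative assumptions, not how any real keyboard works.)

    # Toy "complete the root, then hand control back" autocomplete.
    ROOTS = ["kitap", "kalem", "kapı"]   # book, pen, door (made-up mini lexicon)

    def complete_root(typed):
        for root in ROOTS:
            if root.startswith(typed) and typed != root:
                return root      # offer the completed root...
        return None              # ...but never guess or "fix" the suffix chain

    print(complete_root("kita"))     # kitap; the user then types the suffixes: kitaplar, kitaplarımızdan, ...
    print(complete_root("kalemle"))  # None; past the root, every letter matters, so stay out of the way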
Seems like you're talking about autocompletion, not autocorrect. In autocorrect you have completed the word, hit space and then the software fixes your typos. In autocomplete you get a list of suggested words while typing and you can tap them if your intended word is shown.
No - while autocompletion is also broken, I'm talking about autocorrect. It's a very common occurrence on Turkish iOS that something you typed correctly gets autocorrected to something else that makes no sense.
Sumerian, an agglutinative language, is an important plot point in a famous cyberpunk novel, Snow Crash by Neal Stephenson[1] (which also popularized the word "avatar" as we use it today).
If you find the concept interesting, you will enjoy reading the novel.
The smallest unit in language that has meaning is called a morpheme. Some languages use relatively few morphemes per word, like English: for example, the word "cats" can be broken into two morphemes, "cat" and "-s", while a word like "two" can't be broken down any further, so it maps one-to-one between morphemes and words.
Other languages use a lot of morphemes-per-word. One strategy to create words from morphemes is called agglutination (meaning to glue things together). An agglutinative language takes all the morphemes that are going to go into a word, and with minimal or no changes, glues them together to form a word.
For example, the Yupik word "tuntussuqatarniksaitengqiggtuq" means "He had not yet said again that he was going to hunt reindeer".
It is formed by taking the following morphemes and agglutinating them:
- tuntu- (reindeer)
- -ssur- (to hunt)
- -qatar- (future, "going to")
- -ni- (to say)
- -ksaite- (negation, "not")
- -ngqiggte- (again)
- -uq (third person singular indicative)
Agglutinative just means you glue (the -glu- refers to this) pieces (suffixes) onto the end of words to express lots of things. This exists in English as well, but in restricted forms, for example blue+ish, quick+ly, blue+ness, look+ed. In an agglutinative language, this is how most things are expressed.
For example a totally normal Hungarian word is: szolgáltatásaiért = szolgá+l+tat+ás+a+i+ért = for his/her/its services. Szolga means servant, from Slavic origin. Szolgál is a verb meaning to serve. Szolgáltat means to provide service. Szolgáltatás means service (as in "goods and services", "internet service", etc.). Szolgáltatása means his/her/its service. Szolgáltatásai means his/her/its services. Szolgáltatásaiért means "for his/her/its services".
+ l = verb-forming suffix, makes a verb from a noun
+ tat = causative suffix, to make someone do the action
+ ás = gerund-forming suffix, makes a verbal noun from a verb [3]
+ ai = marker for plural possession [4]
+ ért = causal-final suffix, denotes the reason for the action [5]
These suffixes are morphemes you can (with some simple rules) just add to words to achieve the corresponding change in meaning. In CS terms, they're functions that take the input word and make a new one (there's a toy sketch of this after the examples below). Each one of the above, used somewhere else:
winter -> to spend the winter: tél + l -> telel
to read -> to make them read: olvas + tat -> olvastat
to read -> the reading: olvas + ás -> olvasás
house -> his houses: ház + ai -> házai
house -> for the house, because of the house: ház + ért -> házért
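(Taking the "suffixes are functions" framing literally, here's a toy Python sketch built from the exact forms above; it's plain concatenation, ignoring the vowel harmony, linking vowels and stem changes a real implementation would need.)

    # Each suffix as a function from word to word, composed in order.
    def causative(v):    return v + "tat"   # szolgál -> szolgáltat (to provide service)
    def gerund(v):       return v + "ás"    # szolgáltat -> szolgáltatás (service)
    def plural_poss(n):  return n + "ai"    # szolgáltatás -> szolgáltatásai (his/her services)
    def causal_final(n): return n + "ért"   # szolgáltatásai -> szolgáltatásaiért (for ...)

    word = "szolgál"   # to serve
    for suffix in (causative, gerund, plural_poss, causal_final):
        word = suffix(word)
    print(word)        # szolgáltatásaiért = "for his/her/its services"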
Simply put: a lot of grammar is based on appending to words. E.g. the Turkish word for book is kitap (a loanword shared, as kitab, with a bunch of other Middle Eastern languages). My book is kitabım. Your book is kitabın. (Note these are vastly simplified; a proper Turkish speaker should correct them.)
It allows for a lot of really short sentences; here's a nonsensical example:
His book is on fire - kitabı yanıyor.
The word endings are sufficient to provide context and meaning.
If you find Turkish to be too difficult to learn, try Malay. It's also agglutinative and used by ~300 million people (Malay and Indonesian are for all practical purposes the same language).
The article notes that in some languages it is possible to form sentences by chaining suffixes onto words.
An example in Finnish:
- juosta (the base form of the verb "to run")
- juoksen (I run)
- juoksentelen (I run around)
- juoksentelisinkohan (I wonder should I run around)
- juostaankohammekohaan (I wonder do we run)
The last two forms are very rarely used, and I have no idea whether the last one is even correct, though I have some friends who insist on talking like this. Usually, people express the same things with more words; for example, juoksentelisinkohan is roughly equivalent to:
- I wonder., that. should. my (in this context, me). run. around.
The dots are there to separate the words.
Yet, it would be perfectly fine to just append a question mark to juoksentelisinkohan or juostaankohammekohaan and it would be a one-word sentence. An interesting remark: in practice the question mark is redundant in both cases, as the -ko- part of the word already makes a question the only possible interpretation.
I have absolutely no idea how one would formalize all this.
"Juostaankohammekohaan" is not right. It should probably be either "juoksemmekohan" "I wonder if we will run" or "juostaankohan" "I wonder if it will be run" (passive voice).
The frequentative forms would be "juoksentelemmekohan" and "juoksennellaankohan" respectively.
More broadly, synthetic languages are like statically-typed programming languages, whereas analytic languages[1] are like dynamically-typed programming languages.