
Finnish breaks natural language processors - imartin2k
https://twitter.com/joose_rajamaeki/status/1096397000520749056
======
hpaavola
Finn here. There are countless memes about oddities of our language. The one I
like the most is "kuusi palaa". It can mean:

    
    
       the spruce is on fire
       the spruce returns
       the number six is on fire
       the number six returns
       six of them are on fire
       six of them return
       your moon is on fire
       your moon returns
       six pieces
    

Good luck to everyone implementing NLP :)

~~~
Whatitat90
I wonder what oddities lie in Linus Torvalds' "perkeleen vittupää"
([https://lkml.org/lkml/2013/7/13/132](https://lkml.org/lkml/2013/7/13/132)).

~~~
jacobush
I'm afraid it very unambiguously just means fucking cunts.

~~~
Ezku
In singular genitive and nominative cases no less, so, absolutely no
grammatical fancy going on there.

------
DocG
There was a joke that Estonian is closer to Japanese than to any other
European language. That is not to say it's related to Japanese, but to show how
much it differs from the European ones. No future tense, no genders, but 14 cases.

This illustrates nicely where Finno-Ugric languages (Finnish, Hungarian,
Estonian) reside compared to other European languages.

[https://images.mentalfloss.com/sites/default/files/196.jpg](https://images.mentalfloss.com/sites/default/files/196.jpg)

The benefit is that people are somewhat safe from scammers, as no translator
does a good job. The downside is that none of the translators work, and it's
way easier to just translate to English, for example.

~~~
a_imho
What is the benefit of genders anyway? Especially when gender neutral
alternatives are available. It never made much sense to me, they introduce a
lot of complexity for very little. Worse, nowadays you have all the problems
of misaddressing and offending folks.

~~~
iainmerrick
There was an interesting/amusing post here the other day about learning German
and how it relates to programming concepts. They suggested that gendered words
work as a kind of parity bit for error correction.

~~~
wolfgke
> There was an interesting/amusing post here the other day about learning
> German and how it relates to programming concepts. They suggested that
> gendered words work as a kind of parity bit for error correction.

Slightly off topic (not about grammatical genus, but about German's case
system): I often compare German's case system to a type system in a
programming language: the verb expects the sentence's object in the suitable
case. If you use a wrong type, it gives a compile error. In most cases, the
case (type) that you have to use makes sense, but there are rare cases where
you simply have to learn the case. The latter is like having to use a
(programming) library that was designed by some other programmer with an
interface that you would design differently.

Just like you have to write a computer program that gives no compile error
(e.g. because of typing), in German, you also have to formulate your sentences
in a way that gives no compile error to its case (type) system.

~~~
bravura
I am trying to learn German and gender is infuriating there. Why? Because you
can't simply scan an existing sentence and color-code the words based upon
simple heuristics to guess the gender.

The first thing you learn is that three genders are indicated in the
nominative case using der (masculine) / die (feminine) / das (neuter).

So you see "der" in a sentence and that thing must be masculine, right? Nope,
because "der" is feminine in the dative case.

The gender system in German wouldn't be such garbage if the articles weren't
overloaded. Imagine if in your type system "float" could mean two different
types, and you needed more information to figure out which type it is (like,
for example, memorizing the type of the object it's referring to). It breaks
the idea of locality.

~~~
wolfgke
> The first thing you learn is that three genders are indicated in the
> nominative case using der (masculine) / die (feminine) / das (neuter).

 _I_ personally would never teach it that way. To me the four cases are like
different faces of, say, a cube (I know a cube has six faces - sorry), where
the cube represents the noun. Just like looking at a cube from different
perspectives (in the direction of some face) gives you a different view on it,
each case gives a different "perspective" of a way that the word relates to
some different clause in a sentence.

How the articles and endings "rotate" when the case changes ("as the cube
rotates") is something that I developed a really smart scheme for (which can
be found in no book on German grammar that I am aware of) that every
"mathematical-minded person" immediately grasped, told me it perfectly made
sense and wished that someone had taught it this way in school. On the other
hand, nearly every "humanities-minded person" told me that they cannot
understand what I am even talking about and what advantage this perspective is
supposed to have.

Typical language teachers are "humanities-minded". :-(

Unfortunately, I have far too little time to write this all down in sufficient
detail that other people can learn from it; I have far too many other things
to do. :-(

> Imagine if in your type system "float" could mean two different types, and
> you needed more information to figure out which type it is (like, for
> example, memorizing the type of the object it's referring to).

You are describing declarations and definitions in C++ and the problems of
parsing C++? ;-)

------
plq
These are common problems for languages with complex morphology. Finnish,
Japanese, Turkish and Hungarian are closely related in this sense and share a
lot of NLP research.

While implementing spell checkers for these languages needs a bit more effort
than just compiling a list of words, it's far from an unsolved problem. AFAIK
Zemberek[1] is a Turkish spellchecker that implements two-level finite state
morphology.

Also, a small terminology fix: Verbs are "conjugated" but names/objects are
"declined".

[1]: [https://github.com/ahmetaa/zemberek-nlp](https://github.com/ahmetaa/zemberek-nlp)

~~~
szatkus
What's the problem with Japanese? It's a highly regular language, so it should
be easy to tokenize.

AFAIR the whole language has maybe a few irregular verbs, compared to a few
hundred in English.

~~~
isani
Japanese is usually written without spaces. Words and sentences just run into
each other. When writing in hiragana (syllabic characters), word boundaries
are often ambiguous.

Englishwouldbemuchhardertoparseifwrittenlikethis.

~~~
saagarjha
I have no stake in natural language processing, but it looks to me like a
computer might be able to do a pretty good job at splitting that given a
dictionary.

~~~
isani
Sure, you can get pretty far with a fairly simple solution. But a lot of the
time, you get two (or more) ways to split the string into dictionary words.
For a simple English example, is it "justice was served" or "just ice was
served"?
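
To make the ambiguity concrete, here is a minimal sketch of dictionary-based splitting that returns every possible segmentation rather than just the first one. The function name `segmentations` and the toy `VOCABULARY` are made up for illustration; a real segmenter would use a large lexicon and some scoring to rank the candidates.

```python
def segmentations(text, vocabulary):
    """Return every way to split `text` into words from `vocabulary`."""
    if not text:
        return [[]]  # exactly one way to split the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocabulary:
            # Keep this prefix and recursively split whatever is left.
            for rest in segmentations(text[end:], vocabulary):
                results.append([prefix] + rest)
    return results


# Hypothetical toy vocabulary, just large enough to show the ambiguity.
VOCABULARY = {"just", "ice", "justice", "was", "served"}

for words in segmentations("justicewasserved", VOCABULARY):
    print(" ".join(words))
```

Both readings come back as valid splits, so the dictionary alone cannot choose between them.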

~~~
saagarjha
I guess that’s where context will have to be considered. Those two are valid
sentences, so presumably humans are using context to distinguish between them,
right?

~~~
MereInterest
The murderer came to my dinner party, and I had it all planned. In one of the
ice cubes, I had frozen arsenic. The murderer would eat the same food, drink
the same drink, and nobody would guess that they would die on leaving. When
the evening was over, I knew what I would tell people.

Justicehadbeenserved.

~~~
woliveirajr
Please, share this with the world on Twitter.

~~~
MereInterest
If you would like to, feel free. For myself, I think that the comment's
context of showing how ambiguity may not be resolved merely by contextual
information is important, and that it would not stand as strongly without it.

------
frosted-flakes
It's interesting that Nokia's auto-completion worked well for Finnish, but
modern solutions don't work.

> Joose Rajamäki 🇫🇮🇪🇺 @joose_rajamaeki Feb 20
>
> Yes, autocompletion [on phones] exists. But I hardly ever manage to compose
> a message where it wouldn't encounter new words. Also, it doesn't know the
> inflections, so it needs to encounter each word in each possible form before
> suggesting them.

> Joose Rajamäki 🇫🇮🇪🇺 @joose_rajamaeki Feb 20
>
> Old Nokia phones were good. They let you indicate that the word root was
> finished and you wanted to add agglutination and/or inflections.

~~~
kaitai
On their Windows phones Nokia's Finnish autocomplete was actually even a bit
better -- it suggested suffixes and was pretty accurate. I wonder how they
trained it.

------
pakitan
So, does that mean that if OpenAI's text generator turns out to be as
"dangerous" as it's being marketed, we can all just switch to Finnish as
lingua franca and Skynet won't have a chance? :)

~~~
breakingcups
That implies we can all "just" switch. I have a feeling Skynet will have a
better chance than I do!

~~~
sergioisidoro
I've been trying to learn and "switch" to Finnish for 5 years. I can confirm
this.

------
miki123211
Polish has some of those problems too, at least the ones with declensions.
The words mom (appearing in your contact list), mom (as in call mom) and mom
(as in send money to mom) are different. That applies to proper names too, and
it isn't regular. It's not easy to figure out who the user refers to if you
just have the contacts list and the command.

Polish had problems way before NLP, for example "Annie has sent you a message"
and "John has sent you a message" would be translated differently (because
male/female), same for "2 minutes ago" and "5 minutes ago". Also
programmatically formulating natural messages like "Dear John" is nigh on
impossible.

~~~
ajuc
All Slavic languages are quite quirky in this regard. Most native speakers
don't realize it, but when you try to write a correct message parametrized with
numbers in Polish, you need 3 or 4 special cases:

    
    
         - 1
         - 2-4, 22-24, 32-34, ...
         - 5-21, 25-31, 35-41, ...
    
    

If you say "out of X" you also need to handle numbers 100-199, 100 000-199
000, 100 000 000-199 000 000, ... separately (because 100 = sto, and if the
number begins with "s", "out of X" is "ze" instead of "z"). This is in
combination with the previous 3 cases, so "199 out of 199" is different from
"99 out of 99", and different from "122 out of 122".

So "Copied X file(s) out of Y" would be:

    
    
         - "Skopiowano 1 plik"
         - "Skopiowano 2 pliki z 3"
         - "Skopiowano 5 plików z 5"
         - "Skopiowano 1 plik ze 100"
         - "Skopiowano 2 pliki ze 100"
         - "Skopiowano 5 plików ze 100".
    

Almost no software handles this correctly :) But spell checkers work OK (they
just ignore relationships between words usually).
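
As a rough sketch of what handling this takes, the three count cases and the "z"/"ze" rule from the examples above can be written out as below. The function names are invented for illustration. The count rule matches the listed ranges (1; 2-4, 22-24, ...; everything else). The preposition rule only encodes the 100-199 pattern as stated in this comment, so treat it as an assumption, not a complete account of Polish.

```python
FORMS = ("plik", "pliki", "plików")


def plural_form(n):
    """Index into FORMS for count n: 0 = plik, 1 = pliki, 2 = plików."""
    if n == 1:
        return 0
    # 2-4, 22-24, 32-34, ... but not 12-14
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1
    return 2


def out_of_preposition(total):
    """'ze' when the leading three-digit group is 100-199, else 'z'.

    Assumption taken from the comment above: such numbers are spoken
    starting with "sto", which triggers "ze".
    """
    leading = total
    while leading >= 1000:
        leading //= 1000
    return "ze" if 100 <= leading <= 199 else "z"


def copied_message(copied, total):
    noun = FORMS[plural_form(copied)]
    return f"Skopiowano {copied} {noun} {out_of_preposition(total)} {total}"


print(copied_message(2, 3))
print(copied_message(5, 100))
```

Note that the noun form depends only on the first number and the preposition only on the second, which is why the message has to be assembled from two independently chosen pieces.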

~~~
babuskov
> Almost no software handles this correctly

Gettext solved it 20 years ago. The problem is that people try to reinvent the
wheel instead of looking at existing solutions. So, they make their own
inferior versions.

The rule for plural has 3 cases.

1\. When n mod 10 = 1

2\. When n mod 10 is 2,3,4

3\. Everything else

And there's a special rule that the first 2 cases are not used when n mod 100
is between 11 and 29. For such numbers, the 3rd case is used instead.

This is actually not that complex. Compare to Arabic, where you have 9 cases
IIRC.

~~~
ajuc
Doesn't solve the problem with "X out of Y".

~~~
babuskov
It does. You have two numbers, so you have to split it into X and Y part. When
I look at your example, 1 plik is always 1 plik regardless of the Y quantity
and 100 always goes with "ze" regardless of X quantity. It's a little bit tricky
to set up, but it works. "Copied X of Y files" needs to split into two
translation units: "Copied X" and "of Y files".

~~~
ajuc
That's "not supported" in my book. How should an English or Chinese developer
know to split this? What about translating to other languages with different
splits?

There's no system to handle this, so each special case must be embedded in the
software itself, instead of properly doing it in the internationalization package.

So - in effect - nobody even tries.

------
gio2j
The morphology of the Finnish language has of course been studied, and spell
checkers for it exist.

E.g. open-source: [https://voikko.puimula.org/](https://voikko.puimula.org/)
(and its online version
[https://oikofix.com/contact?lang=en](https://oikofix.com/contact?lang=en),
where you can also play with the analysis/parsing capabilities)

Of course, the more complicated morphology makes it less straightforward than
English where simple word list based approaches are sufficient.

------
stonewhite
Turkish, being a member of the Ural-Altaic language family, just like Finnish,
suffers from similar problems. Google Translate has improved much over the
years, yet it still generates laughable text at best. Stemmers used to create
show-stopper word stems (I don't really know the current situation).

That said, it has imported a lot of technical terms from various European
languages, which makes technical texts more legible due to the lack of
compound words in these contexts.

Similarly, from the tweets:

    yiyecek = food
    yiyecek miydi = will he/she eat that (2nd word is not a separate word,
                    just a conjunction that is written separately)

    göz = eye
    gözcü = scout
    gözlük = glasses
    gözlükçü = glass salesman
    gözcülük = scouting
    gözlükçüler = glass salesmen
    gözlükçüydüler = they were glass salesmen
    gözlükçü müydüler = were they glass salesmen?

also an all time classic: çekoslavakyalılaştıramadıklarımızdan mısınız = are
you one of those people whom we tried unsuccessfully to assimilate into a
Czechoslovakian citizen?
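
For a feel of why word lists fall short here, the göz examples above can all be covered by one root plus a handful of suffixes. The sketch below is a toy: the `ROOTS` and `SUFFIXES` inventories are simply read off those examples, the function names are made up, and it ignores vowel harmony and everything else a real morphological analyzer has to model.

```python
# Toy inventories read off the examples above, not a real Turkish lexicon.
ROOTS = {"göz"}
SUFFIXES = {"cü", "çü", "lük", "ler", "ydü"}


def split_suffixes(rest):
    """Return one way to split `rest` into known suffixes, or None."""
    if not rest:
        return []
    for suffix in SUFFIXES:
        if rest.startswith(suffix):
            tail = split_suffixes(rest[len(suffix):])
            if tail is not None:
                return [suffix] + tail
    return None


def segment(word):
    """Split `word` into a known root followed by known suffixes, or None."""
    for root in ROOTS:
        if word.startswith(root):
            suffixes = split_suffixes(word[len(root):])
            if suffixes is not None:
                return [root] + suffixes
    return None


for word in ("gözcü", "gözlük", "gözlükçüler", "gözlükçüydüler"):
    print(word, "->", segment(word))
```

One root and five suffixes already account for every single-word form in the list, while a plain word list would need a separate entry for each.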

~~~
Lasokki
Ural-Altaic language family is today considered an obsolete concept [1] and
the families are considered unrelated

[1]
[https://en.wikipedia.org/wiki/Ural%E2%80%93Altaic_languages](https://en.wikipedia.org/wiki/Ural%E2%80%93Altaic_languages)

~~~
plq
The models / software created for one language mostly fit others. E.g.
two-level finite state morphology was invented for Finnish and was very
successfully adopted to parse Turkish words.

So the opinions of linguists aside, the languages that make up the Ural-Altaic
family are not _that_ far apart from each other.

------
raverbashing
My suspicion is that most NLP techniques were invented by native speakers of
"easy" languages

Though Romance and Germanic languages (and even English's ancestors) have
their quirks, they're not at Finno-Ugric levels

(Not that English has no quirks - especially in pronunciation - but it is
"easy" to deal with most of the weird exceptions)

------
Lorkki
> tietokone = knowledge machine (literally) = computer

This is somewhat beside the point, but although in everyday speech "tieto"
usually means knowledge in the sense of knowing a fact or skill, it can also
be taken as "information", which seems more likely to have been the original
intent of the term.

Nevertheless, pretty much every elementary computing course in Finnish begins
by stating that "the computer doesn't actually _know_ anything". :)

------
toolslive
What about Hungarian?

[https://en.wikipedia.org/wiki/Finno-Ugric_languages](https://en.wikipedia.org/wiki/Finno-Ugric_languages)

~~~
ahoka
It's hard to spellcheck too, but there's hunspell which is pretty good and is
used by a lot of open source products:
[https://en.m.wikipedia.org/wiki/Hunspell](https://en.m.wikipedia.org/wiki/Hunspell)

------
wodenokoto
Seems like they are splitting their tokens wrong. Spaces are just a
suggestion.

But seriously, the problems outlined reminded me very much of Japanese and
Korean, just turned up to 11.

Japanese has endless amounts of homonyms and at least in theory words can go
on and on and on with added conjugations. They have a lot of compound nouns,
often built from abbreviations of the words compounded.

The author mentions that these compounds are a problem in Finnish NLP because
they explode the size of the vocabulary.

Written Japanese does not contain any word boundaries at all. They split on
morphemes for NLP tasks, which helps against exploding vocabularies, but also
separates the parts into their own meaning units.

For written Japanese you have characters which can guide you in meaning for a
lot of compound words, but they blow up your character space. That is not the
case for Finnish, but I can't really decide if that is an advantage or
disadvantage.

Also, a spellchecker and a chatbot would rely on completely different
techniques, so you can have one without the other. Japanese doesn't even have
spellcheckers, but they have a ton of other tools that millions of people rely
on everyday for writing faster and better text, like kana-kanji conversion and
word suggestions.

------
jeromebaek
This is interesting. Especially the bit about compounding taking the meaning
from figurative to literal. Modern NLP embeddings still assume a canonical
meaning for each word and therefore cannot (immediately, anyway) distinguish
between a figurative or literal meaning. Not that such efforts don't exist,
but they all seem to be missing something fundamental - namely, is the
"canonical" meaning, the vector, supposed to be literal, or figurative? This
is a question with potentially no right answer, yet we all assume that the
answer is "literal".

At the same time, I do wonder if part of the problem with NLP algorithms and
Finnish isn't so much its complexity but the fact that there is very little
Finnish data to train on.

------
hibbelig
Why can't the NLP split the compounds? I think this problem of splitting
happens in multiple languages -- the submission mentions German, and then you
have Chinese which doesn't have spaces, so you have to split words. (I
understand that splitting Japanese into words is simpler.)

~~~
htns
Another complication not mentioned in the tweets is consonant gradation, in
which the stem changes in dozens of ways when suffixed (examples:
[http://users.jyu.fi/~pamakine/kieli/suomi/vaihtelu/astevaiht...](http://users.jyu.fi/~pamakine/kieli/suomi/vaihtelu/astevaihtelu.html)),
which makes Finnish a lot less cleanly agglutinative than most other
languages. It's not like it can't be done, but it's yet another step (find the
likely split points, then try to undo the consonant gradation) and I'd imagine
casual foreign researchers aren't going to bother.
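
A minimal sketch of the "undo the consonant gradation" step, assuming a toy lexicon and only a few weak-to-strong alternations (real Finnish has many more patterns, and this handles just the genitive singular ending -n). The names `LEXICON`, `WEAK_TO_STRONG` and `base_form_candidates` are invented for illustration.

```python
# Hypothetical toy lexicon of base (nominative) forms.
LEXICON = {"katu", "kukka", "kauppa", "lupa", "talo"}

# A few weak -> strong grade alternations; a real analyzer needs many more.
WEAK_TO_STRONG = {"d": "t", "k": "kk", "p": "pp", "v": "p"}


def base_form_candidates(word):
    """Guess lexicon base forms for a genitive singular ending in -n."""
    if not word.endswith("n"):
        return []
    stem = word[:-1]
    candidates = [stem]
    # Try restoring the strong grade at the letter before the final vowel.
    if len(stem) >= 2 and stem[-2] in WEAK_TO_STRONG:
        candidates.append(stem[:-2] + WEAK_TO_STRONG[stem[-2]] + stem[-1])
    return [c for c in candidates if c in LEXICON]


for form in ("talon", "kadun", "kukan", "luvan"):
    print(form, "->", base_form_candidates(form))
```

Stripping the suffix alone finds "talo" from "talon", but "kadun", "kukan" and "luvan" only resolve to "katu", "kukka" and "lupa" once the stem change is reversed.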

------
Griceraae50100
Really interesting! I was wondering a few months ago how autocompletion on
phones works for agglutinating languages like Finnish, but I didn't have a
Finn to ask =/ If you wouldn't mind enlightening me, does it even exist for
Finnish? If so, how does it work?

~~~
sershe
I can tell you based on Russian that it works poorly compared to English, esp.
for verbs. I'd expect it to be even worse for Turkish (which I used to know at
a beginner level), or Finnish. When the word you are typing becomes obvious, it
would offer 3 (3 is the default on my keyboard app) random forms that do not
appear to take context into account, or if they do they're often wrong. You
can either just finish typing, or choose one of those, erase the
ending/suffixes and fix them.

------
celticninja
In a similar vein, submitted on HN this morning:

[https://linustechtips.com/main/topic/72936-english-swedish-g...](https://linustechtips.com/main/topic/72936-english-swedish-german-and-finnish-decline-dog/)

------
jasonhansel
Latin has similar issues. Thankfully we have Whittaker's Words and the Perseus
Word Study tool to help. I'm surprised that Latin is (apparently) better off
than Finnish in this respect.

------
aasasd
I still haven't figured out how Germans use swipe keyboards on smartphones
with the monster compound sentence-words. If they do.

~~~
hibbelig
On iOS, the keyboard just suggests compounds. This works for those you use
often. Another option is to choose a prefix, then backspace (over the space),
then add to the prefix.

Example compound word: Schleifmaschinenverleih (word boundaries:
Schleif'maschinen'verleih)

You begin with "Schl" which might suggest "Schleifer", then you delete "er",
add "m", and if you're lucky, you get "Schleifmaschine".

"Schleifen" means to grind. "Maschine" means machine. So "Schleifmaschine" is
grinding machine, and the disappearance of "en" is a grammatical artifact.
"Verleih" means rental (as in a company that does rentals).

------
dandare
I don't understand case #4, can someone explain please?

------
mettamage
Natural Language Processor broken by Finnish language.

:D

Okay, I'll see myself out now.

