Finnish breaks natural language processors (twitter.com)
185 points by imartin2k 23 days ago | 128 comments

Finn here. There are countless memes about the oddities of our language. The one I like the most is "kuusi palaa". It can mean:

   the spruce is on fire
   the spruce returns
   the number six is on fire
   the number six returns
   six of them are on fire
   six of them return
   your moon is on fire
   your moon returns
   six pieces
Good luck to all implementing NLP :)
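For the NLP-curious, the ambiguity can be made concrete with a toy analyzer. The mini-lexicon below is purely illustrative (a real Finnish analyzer would be vastly larger); it just shows how per-word analyses multiply into whole-phrase readings:

```python
# Toy morphological "analyzer" for the phrase "kuusi palaa".
# The mini-lexicon is illustrative, not real NLP tooling: it just
# shows how per-word analyses multiply into phrase readings.
LEXICON = {
    "kuusi": [
        ("kuusi", "N", "spruce"),               # the tree
        ("kuusi", "NUM", "six"),                # the numeral
        ("kuu+si", "N+POSS.2SG", "your moon"),  # kuu 'moon' + -si 'your'
    ],
    "palaa": [
        ("palaa", "V", "is on fire"),           # palaa 'burns'
        ("palata", "V", "returns"),             # 3sg present of palata
        ("pala+a", "N+PART", "piece(s)"),       # partitive of pala 'piece'
    ],
}

def analyses(phrase):
    """Every combination of per-word analyses for the phrase."""
    combos = [[]]
    for word in phrase.split():
        combos = [c + [a] for c in combos for a in LEXICON.get(word, [])]
    return combos

print(len(analyses("kuusi palaa")))  # 9 readings, all spelled identically
```

Three analyses per word give 3 x 3 = 9 phrase readings, matching the list above.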

I like the "dog example" too, which highlights the different forms/cases:


I'm a recent immigrant to Finland, and while the regular grammar makes some things simple, the cases and suffixes are enough to drive me crazy! The simple things are easy to handle if you're reading a textbook, but remembering them in real-time when speaking is very hard.

No wonder Tolkien used it as a basis for Elvish.

This is a specifically constructed corner case, not a general Finnish thing:

- kuu = moon

- kuusi = spruce

- palaa = returns

- pala = piece

From those you can see how one can interpret the compound word (with or without a suffix) however you wish.


- kuu+si = your moon (where 'si' is the suffix which means 'your')

- pala+a = pieces (where 'a' is the partitive suffix, used after numerals)


- kuusi = six

- palaa = burning

I wonder what oddities lie in Linus Torvalds' "perkeleen vittupää" (https://lkml.org/lkml/2013/7/13/132).

I'm afraid it very unambiguously just means fucking cunts.

In singular genitive and nominative cases no less, so absolutely no grammatical fancy going on there.

Its literal translation is:

Satan's cunt-head

> "kuusi palaa"

Can we get a bit more detail here? Are those variations in pronunciation? How is a native Finn able to infer the correct meaning?

Finnish is pronounced extremely regularly. The way it's written is the way it's pronounced (in almost all cases).

No, there is no variation on pronunciation. They sound exactly the same when spoken.

The only way to infer what's meant is from context.

My favourite homophone example is a 92-character poem in Chinese[0], where every syllable is a slight variation on "shi".


Interestingly when I showed the actual text to my wife, who is Chinese, she didn't understand what was funny about it until I asked her to read it to me, when she burst out laughing. That's when I found out she doesn't sub-vocalise, where I do all the time.

These aren't really homophones though, right? Because they differ in tone?

There are only 4 different tones, so by the pigeonhole principle there must be lots of homophones.

On the other hand, most of the characters used to be pronounced differently when they were still commonly used as individual words, so the poem only works because it mixes modern pronunciation with classical grammar.

There was a joke that Estonian is closer to Japanese than to any other European language. Not to say it's related to Japanese, but to show how much it differs from the European ones. No future tense, no genders, but 14 cases.

This illustrates nicely where the Finno-Ugric languages (Finnish, Hungarian, Estonian) reside compared to other European languages.


The benefit is that people are somewhat safe from scammers, as no translator does a good job. The downside is that none of the translators work, and it's way easier to just translate to English, for example.

What is the benefit of genders anyway? Especially when gender neutral alternatives are available. It never made much sense to me, they introduce a lot of complexity for very little. Worse, nowadays you have all the problems of misaddressing and offending folks.

There was an interesting/amusing post here the other day about learning German and how it relates to programming concepts. They suggested that gendered words work as a kind of parity bit for error correction.

> There was an interesting/amusing post here the other day about learning German and how it relates to programming concepts. They suggested that gendered words work as a kind of parity bit for error correction.

Slightly off topic (not about grammatical genus, but about German's case system): I often compare German's case system to a type system in a programming language: the verb expects the sentence's object in the suitable case. If you use a wrong type, it gives a compile error. In most cases, the case (type) that you have to use makes sense, but there are rare cases where you simply have to learn the case. The latter is like having to use a (programming) library that was designed by some other programmer with an interface that you would design differently.

Just like you have to write a computer program that gives no compile error (e.g. because of typing), in German, you also have to formulate your sentences in a way that gives no compile error to its case (type) system.
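The analogy can be rendered playfully in code: each verb "expects" its object in a particular case, and a wrong case is a "type error". The verb/case pairings are real German (sehen governs the accusative, helfen governs the dative); the Python classes and runtime checks are just a stand-in for the "compiler".

```python
# Case-as-type sketch: wrong case -> "compile error" (here: TypeError).
class Accusative(str):
    pass

class Dative(str):
    pass

def sehen(obj):
    # "to see" takes an accusative object
    if not isinstance(obj, Accusative):
        raise TypeError("sehen expects an accusative object")
    return f"Ich sehe {obj}"

def helfen(obj):
    # "to help" takes a dative object
    if not isinstance(obj, Dative):
        raise TypeError("helfen expects a dative object")
    return f"Ich helfe {obj}"

print(sehen(Accusative("den Mann")))  # Ich sehe den Mann
print(helfen(Dative("dem Mann")))     # Ich helfe dem Mann
# helfen(Accusative("den Mann")) would be the "compile error"
```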

I am trying to learn German and gender is infuriating there. Why? Because you can't simply scan an existing sentence and color-code the words based upon simple heuristics to guess the gender.

The first thing you learn is that three genders are indicated in the nominative case using der (masculine) / die (feminine) / das (neuter).

So you see "der" in a sentence and that thing must be masculine, right? Nope, because "der" is feminine in the dative case.

The gender system in German wouldn't be such garbage if the articles weren't overloaded. Imagine if in your type system "float" could mean two different types, and you needed more information to figure out which type it is (like, for example, memorizing the type of the object it's referring to). It breaks the idea of locality.
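The overloading is easy to see by treating the (singular) definite-article paradigm as a lookup table: the same surface form fills several (gender, case) cells, so a form like "der" cannot be decoded locally.

```python
# Singular definite articles in German, keyed by (gender, case).
# Several cells share a surface form: that's the "overloaded type".
ARTICLES = {
    ("masc", "nom"): "der", ("masc", "acc"): "den",
    ("masc", "dat"): "dem", ("masc", "gen"): "des",
    ("fem",  "nom"): "die", ("fem",  "acc"): "die",
    ("fem",  "dat"): "der", ("fem",  "gen"): "der",
    ("neut", "nom"): "das", ("neut", "acc"): "das",
    ("neut", "dat"): "dem", ("neut", "gen"): "des",
}

def readings(form):
    """All (gender, case) cells a given article form could encode."""
    return sorted(cell for cell, f in ARTICLES.items() if f == form)

print(readings("der"))
# [('fem', 'dat'), ('fem', 'gen'), ('masc', 'nom')]
```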

> The first thing you learn is that three genders are indicated in the nominative case using der (masculine) / die (feminine) / das (neuter).

I personally would never teach it that way. To me the four cases are like different faces of, say, a cube (I know a cube has six faces - sorry), where the cube represents the noun. Just like looking at a cube from different perspectives (in the direction of some face) gives you a different view on it, each case gives a different "perspective" of a way that the word relates to some different clause in a sentence.

How the articles and endings "rotate" when the case changes ("as the cube rotates") is something that I developed a really smart scheme for (which can be found in no book on German grammar that I am aware of) that every "mathematical-minded person" immediately grasped, told me it perfectly made sense and wished that someone had taught it this way in school. On the other hand, nearly every "humanities-minded person" told me that they cannot understand what I am even talking about and what advantage this perspective is supposed to have.

Typical language teachers are "humanities-minded". :-(

Unluckily, I have far too little time to write this all down in sufficient detail for other people to learn from it; I have far too many other things to do. :-(

> Imagine if in your type system "float" could mean two different types, and you needed more information to figure out which type it is (like, for example, memorizing the type of the object its referring to).

You are describing declarations and definitions in C++ and the problems of parsing C++? ;-)

I don't think you're wrong, but my experience has some nuanced differences maybe.

I have known plenty of German speakers, including natives and often myself, who make case/gender mistakes with no real impact on comprehension.

There certainly can be some ambiguity, but context and word order can often mitigate that, no? I feel like if you used "die" for everything and just a neuter nominative case, I could still figure most things out. My experience is that many German speakers (especially when speaking a dialect) just mush articles and suffixes together and it's not entirely clear that they use specific genders and cases most of the time anyway. For example, my relatives near Frankfurt pretty much say "d- Milch", "d- Hund", "d- Mauer", etc. SOMETIMES, you can hear a bit of an "s" when it's supposed to be das, des. YMMV of course.

I don't really find German to be less forgiving of mistakes than English or the handful of other languages I've studied.

At least, this is what I take away from your comment about compile errors. Maybe -Wcase or -Wgender is a closer analogy?

Anyway, I like your analogy overall. :)

> There was an interesting/amusing post here the other day about learning German and how it relates to programming concepts.

This one: https://news.ycombinator.com/item?id=19109371

> Worse, nowadays you have all the problems of misaddressing and offending folks.

Languages are tightly inherited, even more so than genes. How languages are structured varies highly. Some have gender-like tenses for the cardinal directions, where things are not feminine but 'northy'. Some have no genders, but do have time-like tenses, where things are 'future-y' or 'present-y'. Just look at sign language, with its use of height, similar to an exclamation point in English at times. Human communication is weird.

Even in English, our sense of gender has changed over time. It used to be that we cared more about the number of items/people than about their gender, but now we care about the gender; i.e., Thee/Thou/You became Him/Her/You.


English is, herself, a pidgin language. The irregularity of the verbs and the reliance on context make it very accommodating to new terms and nouns, but VERY difficult to learn fluently. It is standardizing, though; the Afro/Urban-American vernacular use of 'be' is a perfect example (I be going to the store, not I am going to the store). This change of 'be' is very similar to the Romance languages. English is very slowly being wrenched into standardization. It will be very interesting to see how Chinese and computer programming languages affect English.

The Story of English is a great look into this language and its story:


To make things 100x more complicated in our "modern" world :)

For example you can select a gender pronoun in English to match your gender. https://uwm.edu/lgbtrc/support/gender-pronouns/

In Hungarian we just say "Ő" to refer to someone, privacy is baked into the language :)

A pragmatic advantage is that you can use different pronouns to refer to multiple different persons (or things, with grammatical gender) in the same sentence without ambiguity. In programming terms, you have a "local variable" slot for each pronoun, instead of just a single one with gender neutrality.

Like having separate registers for even and for odd numbers. Now try building a good compiler for that... (no bitshift trickery allowed)

I don't think that analogy works. You're not forced to use the gendered forms - you can just say "the person", "them", etc - and (at least as far as the language is concerned), it's only about communication, not about reasoning. So your "calculations" are unaffected.

I don't know what you mean exactly, but genders can be something intrinsic to a language and I don't think they have been introduced as such. If you wanted to call a table in Italian, you could choose the masculine or feminine noun for it (tavolo - tavola), no neuter form though. Ask the table first.

Why would it be intrinsic? Somewhere in their evolution they must have been introduced for some reason I would guess. Also, they could be removed the same way if there is no benefit distinguishing between a masculine and feminine table.

Could it mean that gender was something so obvious and useful as a concept that early people, while trying to figure out the world around them and how it all fits together, just naturally extended the idea to just about everything?

It doesn't really make sense to anthropomorphise things once you're an adult, but I bet everyone can remember trying to figure out the world as a kid and just assuming things are of similar nature as people.

I think that even languages which currently don't have gender forms of words, like English, still have the concept psychologically firmly embedded in the culture. The generic "cat" is more female than male. The "tom-cat" is only male. A "rock", I'll bet, is also subconsciously male, and a "river" more female than male. It's all there, even if not explicit in language. OTOH, it's not really universal, different languages assign gender pretty much randomly - but they do assign it.

Social determinism and radical neutrality go against the grain of this "making sense of the world" phase of development, and probably don't result in a practical world-view.

I believe the intrinsicality of it has to do with alliterations/rhymes/rhythm (more in the vowel than in the consonantal sounds)

Similar to effects of (past) umlauts in English or the a/an difference.

PIE (Proto-Indo-European) was most likely a gendered language. So the concept goes deep into prehistory. English lost its gender system fairly recently, during the Middle English period.

  "Tommy, Frank, and Susan burned their homework and fed it to the dog."
  "Who fed it to the dog?"
  "That person."
Overhearing this conversation, you don't know who they were pointing at. If they said "She did.", you know who they were pointing at. There are many other situations where gender provides extra data, which is convenient.

Our languages keep having gender, so it's clearly been advantageous from an evolutionary perspective. The problem is when we let gender become more important than the data it's trying to express, like when a society lets gender dictate your social class. That shit needs to stop.

But that's human gender, not noun gender. In grammar, gender is a class of noun and generally unrelated to human gender.

In German, words like "library" (Bibliothek) and "street" (Strasse/Straße) are feminine, for example. It's true that some words for female people, such as "woman" (Frau) and "mother" (Mutter), are feminine. But then you have words like "girl" (Mädchen), which are not.

Your example ("she did") could actually have ambiguity in languages that extend noun gender to pronouns, such as French. "She" doesn't need to refer to a woman, just a feminine noun. For example, "car" (voiture) is feminine, and you can refer to "it" as "elle" (she).

Noun classes are advantageous. Gender happens to be a useful way to categorize nouns because mostly we like to talk about other people. Gender (and animacy -- "it" versus "s/he") split things up efficiently for many conversations, where, say, paper versus non-paper wouldn't.

>No future tense

Many European languages also lack a future tense. English, for example. (https://www.quora.com/Is-it-true-that-English-has-no-future-...)

> English has no inflected future tense like Latin and Romance languages. In linguistics, “tense” refers generally to how a verb is inflected

English has no inflected future tense because of the way words are usually defined. Verbs in English only have 4 types of inflections because words are defined as whatever is surrounded by spaces in writing. If we define words as the unit that carries a single beat in speaking, so "He is-to-see-it tomorrow" is 3 words, then verbs also have prefixed inflections. English then does have an inflected future tense.

The "is to" future form isn't mentioned in discussions on future tense very often but seems to have the most predetermination, whereas "is going to" and "will" indicate more volition by the doer.

No, it isn't just a matter of orthography. English has no future tense on any reasonable analysis. There is no unit of English syntax which (i) behaves like a verb and (ii) has a distinct future form. This is a completely uncontroversial point. See e.g. http://languagelog.ldc.upenn.edu/nll/?p=897 for further discussion.

> If we define words as the unit that carries a single beat in speaking

Note that this definition clearly won't do what you want it to do (even ignoring the imprecision of "single beat"), as the units you want to treat as verbs can be separated to an arbitrary degree. E.g. "Is he really, in your opinion, to see her tomorrow?"

There is a standard analysis of English in generative grammar where verbs and tense are separate nodes at deep structure, and we get inflected verbs through some form of V-T movement [0]. Under this analysis, English clearly has a future tense. It just gets pronounced as its own word (will) instead of being involved in V-T movement. A more popular analysis is to replace "T" with "I", and do everything I talked about above. (The question here is really whether other features share a d-structure node with tense or whether they all have their own functional nodes. I haven't seen a clear argument either way.)

Edit: not clearly. There are still some additional assumptions to make with no clear evidence either way (notably, the category of "will"), and the definition of "tense" is ambiguous.

[0] English's rules for V-T movement are relatively complicated compared to most languages

>There is a standard analysis of English in generative grammar where verbs and tense are are seperate nodes at deep structure

Yes, there is.

>Under this analysis, English clearly has a future tense.

No, this does not follow. You can just as easily (and more correctly) analyze "will" as a modal auxiliary like "can" or "would". Supposing it's true that "will" is the daughter of I/T/whatever, it doesn't follow that the I/T/whatever head has to have a future tense feature.

You could argue, and some people have, that at some level, the interpretation of "will" involves future tense in some semantic sense, which conceivably might be realized as a feature on a head somewhere in the inflectional spine of the tree. But that tense feature would not have the same kind of straightforward syntactic and morphological motivation as e.g. the +past feature does in English. In the context of a general audience, this isn't a notion of "tense" that's relevant. After all, in this highly abstract sense, even a language such as Chinese might have tense.

FWIW, I have a PhD in (generative) syntax. I don't think that any generative syntactician addressing a general audience would describe English as having a future tense.

I just have an undergrad degree in linguistics.

I am looking at this question from the perspective of X-bar theory, and defining "tense" as a feature of the syntactic category that populates the "T" node at deep structure (or a feature of the I category). [0]

As I alluded to in my comment, I have never seen a convincing argument in either direction for whether "will" is a T or a V; this was an active discussion in my syntax class about 3 years ago, where the professor (a syntactician) had also said that she never saw a compelling argument either way. If you know of one, please share.

Edit: I think you edited your post after I responded. I would argue that all languages have the same featureset (including tense) at d structure, but that is going far afield.

Edit2: there is also a notion of tense at the semantic level. And English seems to offer a grammaticalized way of providing the semantic future tense, regardless of how you look at the corresponding syntax/morphology.

If we accept that "will" is a syntactic "T" and carries a semantic meaning of "future", then how is it unreasonable to call it a future tense? Sure, there are debatable parts to the above assumptions, but it is a stretch to say that no analysis has a future tense. Heck, there is a universal grammar analysis that would say English (and Chinese) has all the tenses, and the only question is which ones surface as distinct pronunciations.

[0] This leads to a notationally weird case where "to" is a tense with a -tense feature.

Putting "will" in T doesn't mean that T necessarily has to bear a future tense feature. It could e.g. have a present tense feature.

Are you getting at the distinction between "will" being the T and "will" being the daughter of T? Regardless, under either analysis "will" would have just as much claim to being a tense as "run" has to being a verb. You can still argue over whether there is a syntactic feature of future tense (which is a different question), which I have never seen a solid argument for either way.

>Regardless, under either analysis "will" would have just as much claim to being a tense as "run" has to being a verb.

If you're punning on "tense" to refer to a lexical category, then sure. But we're talking about tense in the sense of past, present and future. In many textbook analyses, lots of things go in T. For example, "can" and "might" go in T. It doesn't follow that, say, "I might go to the store" and "I can go to the store" have a different tense.

I would say that "I can go to the store" and "I might go to the store" having different tenses is what we mean when we say "can" and "might" are lexically T.

Even restricting ourselves to the syntactic features, I have still never seen an argument for or against English having a syntactic future tense feature that did not come down to UG, or a disagreement about which model is simpler, and neither side is convoluted enough for me to confidently invoke Occam's razor.

The two sentences have different words occupying the "T" head, but they don't have different tenses. What you're saying is just a pun on "tense".

> There is no unit of English syntax which [...]

Here you're defining syntax around your orthographic definition of what a word is.

> "Is he really, in your opinion, to see her tomorrow?"

The opinion words "really" and "in your opinion" added into that syntactic structure parallel the manner that opinion word "bloody" is added into a lexical structure to become "abso-bloody-lutely".

>Here you're defining syntax around your orthographic definition of what a word is.

No, I'm not. As I said, there is no unit in English (regardless of whether or not it is written as a single word) that both behaves like a verb and has a distinct future form.

>The opinion words "really" and "in your opinion" added into that syntactic structure parallel the manner that opinion word "bloody" is added into a lexical structure to become "abso-bloody-lutely".

You seem to have missed the fact that even if you remove "really" and "in your opinion", the unit you identified as a verb is still broken up in the example I gave. In "Is he to see her tomorrow?", "is-to-see" is not a contiguous unit. Although the idea of treating the subject pronoun as a verbal infix might be charming (English as Mohawk!), the subject could be a much larger phrase, as in e.g. "Is the man in the red hat to see her tomorrow?"

It is simply not true that adjunction of adverbials has the same properties as expletive infixation. Even if there is some loose analogy between these two processes, I fail to see how it's relevant to units defined in terms of a single "beat in speaking".

> "is-to-see" is not a contiguous unit

You can insert stuff between "is" and "to-see" (and between "to" and "see", I hope you agree) because English mostly parses into left-branching trees, which is the only reason why you can't insert stuff between "see" and "-ing". Natural languages don't liberally mix left-branching structures and right-branching ones because they're hard for people to understand. E.g. Pinker's "The rapidity that the motion that the wing that the hummingbird has has has is remarkable" is grammatically correct but not easily understandable. The contiguity or lack of it operates on a different conceptual plane to whether a "word" can be inflected or not.

> loose analogy

To take some example language from your reply, some people speak this as two words, others (e.g. some Americans) as a compound word. Because the concept of a word is not well defined, the boundary between morphology and syntax is blurred. Putting "is" and "to" noncontiguously before a verb is the same as putting "-ed" after a verb. They are both "inflections".

Look, if you have a totally radical reanalysis of English syntax according to which "is...to" can function as some kind of discontinuous inflectional affix, I wish you luck developing that.

It's clearly thousands of miles away from being a generally accepted analysis of English syntax. So unless you've written up your analysis in detail somewhere, there's not much further to discuss.

> There is no unit of English syntax which (i) behaves like a verb and (ii) has a distinct future form.

This seems like an arbitrary and peculiar way to define tense specifically to exclude periphrastic languages like English. If these are the rules by which we are arguing, the argument doesn't seem that interesting. By this argument colloquial French and German have no past tense, because they use periphrastic constructions to express this distinction.

It is indeed the case that (spoken) French and German have no past tense, strictly speaking. (I don't speak German, so I don't know to exactly what extent the German imperfect is unused in everyday speech. I think it may not be as unused as the passé simple in French.)

Tense is a piece of technical terminology with a particular definition. There's nothing very peculiar about the definition that I can see. The definition denies English a future tense because English doesn't have periphrastic units which (i) behave like verbs and (ii) have a distinct future form.

For whatever reason, people are really resistant to taking linguists' word for it on what technical terms within linguistics mean. You don't find people on here trying to argue with physicists about whether they should maybe adopt a different definition of "force" or "mass" or "electron".

It's not a question of taking linguists' word for it. It's a question of taking particular linguists' word for it. The discussion in that language log is interesting. You will notice that it isn't simply people calling "amen".

Certain things are settled: "will" in English behaves like other English auxiliary verb (setting aside those cases where it doesn't, as in contractions). Saying "English has no future tense", though, is settled only within the confines of a particular linguistic model with a particular technical definition of the term "tense", of which there are many.

>is settled only within the confines of a particular linguistic model with a particular technical definition of the term "tense", of which there are many.

Could you then provide a reference to a linguist arguing that English has a future tense (in the usual syntactic/morphological sense of the term)?

I'm baffled by the reference to the "confines of a particular linguistic model". Linguists who can agree on hardly anything else (and linguists are pretty good at disagreeing!) can agree that English lacks a future tense.

That's a great graphic, but where are Arabic and Georgian?

These are common problems for languages with complex morphology. Finnish, Japanese, Turkish and Hungarian are closely related in this sense and share a lot of NLP research.

While implementing spell checkers for these languages needs a bit more effort than just compiling a list of words, it's far from an unsolved problem. AFAIK Zemberek[1] is a Turkish spellchecker that implements two-level finite-state morphology.

Also, a small terminology fix: verbs are "conjugated", but nouns are "declined".

[1]: https://github.com/ahmetaa/zemberek-nlp
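To give a sense of what such an analyzer does, here is a drastically simplified sketch. Real systems like Zemberek compile two-level phonological rules into finite-state transducers; this toy just peels known suffixes off recursively. The single stem and two (real) Finnish suffixes are chosen only for illustration.

```python
# Toy dictionary + suffix analysis for an agglutinative language.
# talo 'house' + -ssa 'in' (inessive) + -ni 'my' = talossani 'in my house'
STEMS = {"talo": "house"}
SUFFIXES = {"ssa": "INESSIVE (in)", "ni": "POSS.1SG (my)"}

def analyze(word, tags=()):
    """Yield (stem gloss, tag sequence) analyses by stripping suffixes."""
    if word in STEMS:
        yield STEMS[word], tags
    for suf, tag in SUFFIXES.items():
        if word.endswith(suf) and len(word) > len(suf):
            yield from analyze(word[: -len(suf)], (tag,) + tags)

print(list(analyze("talossani")))
# [('house', ('INESSIVE (in)', 'POSS.1SG (my)'))]
```

With a real lexicon, many words get several such parses, which is exactly the ambiguity problem described above.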

But a spell checker is just a tiny part of the NLP domain.

When people talk about NLP, they are usually referring to understanding the meaning of the entire text being analyzed, rather than just providing meanings for individual words. A spell checker can't tell you the intent or subtleties of what is being conveyed.

> But a spell checker is just a tiny part of the NLP domain.

Sure, but the linked tweet claims that not even a working spellchecker exists for Finnish. I'm sure Joakim Nivre would think otherwise!

What's the problem with Japanese? It's a highly regular language, so it should be easy to tokenize.

AFAIR the whole language has maybe a few irregular verbs, compared to a few hundred in English.

You can do it for dictionary words with a relatively high degree of accuracy, but it's very hard. I have developed the most accurate tokenizer that currently exists [1], but there are still sentences where it fails. Words written in hiragana create an inherent ambiguity because almost any combination of hiragana can be split into existing words in several ways.

Another problem is proper names. This one is inherently unsolvable, because anything can be a word. You can have a dictionary of proper names, but there will still be people, companies, fictional characters who aren't in the dictionary.

[1] https://github.com/tshatrov/ichiran

How do you test the accuracy of your tokenizer? Is there a database of ground-truth tokenized sentences that you evaluate on? It would be nice to see some kind of benchmark results comparing with MeCab.

It's not a matter of benchmarks really. It's more of a matter of MeCab simply not knowing that certain grammar exists. For example 〜ちゃった for 〜てしまった is a pretty common form. MeCab doesn't understand the word 忘れちゃった because it only knows 忘れて and しまった as completely separate words and can't combine them in any way, but Ichiran can because my word-form generator is much more sophisticated. This is also why I cannot compare the results of segmentation directly: I combine words and their suffixes into a single "compound word" which has a tree-like recursive structure [1]. MeCab and other tokenizers can only detect basic forms and leave suffixes as a completely separate "word". Which makes them automatically inferior.

[1] https://ichi.moe/cl/qr/?q=%E5%BF%98%E3%82%8C%E3%81%A1%E3%82%...

If I do echo "忘れちゃった" | mecab, I get

  忘れちゃ 動詞,*,母音動詞,タ系連用チャ形,忘れる,わすれちゃ,代表表記:忘れる/わすれる 付属動詞候補(基本) 反義:動詞:覚える/おぼえる
  った 接尾辞,動詞性接尾辞,子音動詞ワ行,タ形,う,った,代表表記:う/う
So MeCab most definitely knows about that grammar. It's special-cased even! Admittedly, the way った is split off as a separate suffix is a bit ugly.

That's the thing though, 忘れちゃ is a completely different, unrelated form that means 忘れて+は. So yes, this split is in fact incorrect and if it wasn't past tense, but dictionary form 忘れちゃう, the result would be even worse with う being split off as "rain"? or something.

Japanese is usually written without spaces. Words and sentences just run into each other. When writing in hiragana (syllabic characters), word boundaries are often ambiguous.


I have no stake in natural language processing, but it looks to me like a computer might be able to do a pretty good job at splitting that given a dictionary.

Sure, you can get pretty far with a fairly simple solution. But a lot of the time, you get two (or more) ways to split the string into dictionary words. For a simple English example, is it "justice was served" or "just ice was served"?
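This ambiguity is easy to reproduce with a toy dictionary-based segmenter (the five-word dictionary is obviously illustrative):

```python
# Enumerate every way an unspaced string splits into dictionary words.
WORDS = {"just", "ice", "justice", "was", "served"}

def segmentations(text):
    """All ways to split `text` into dictionary words."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in WORDS:
            results += [[head] + rest for rest in segmentations(text[i:])]
    return results

for seg in segmentations("justicewasserved"):
    print(" ".join(seg))
# just ice was served
# justice was served
```

Both splits are made of valid words, so the dictionary alone cannot pick a winner; that is where context (or a statistical model) has to come in.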

I guess that’s where context will have to be considered. Those two are valid sentences, so presumably humans are using context to distinguish between them, right?

The murderer came to my dinner party, and I had it all planned. In one of the ice cubes, I had frozen arsenic. The murderer would eat the same food, drink the same drink, and nobody would guess that they would die on leaving. When the evening was over, I knew what I would tell people.


Please, share this with the world on Twitter.

If you would like to, feel free. For myself, I think that the comment's context of showing how ambiguity may not be resolved merely by contextual information is important, and that it would not stand as strongly without it.

The stochastic strategy is to 1. enumerate every possible tag combination 2. assign a probability to each one 3. choose the parse with highest probability.

1. can be done either deterministically or stochastically.

2. requires you to have a language model trained with either human-tagged or semi-human-tagged corpus

3. was just the Viterbi algorithm last time I looked.
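The three steps can be sketched with a toy bigram HMM tagger. The tags, transition probabilities, and emission probabilities below are invented for illustration; the `viterbi` function is the textbook dynamic program for step 3:

```python
# Toy HMM tagger: score tag sequences with (made-up) bigram transition
# and emission probabilities, and pick the best path with Viterbi.
TAGS = ["NOUN", "VERB"]
TRANS = {("<s>", "NOUN"): 0.7, ("<s>", "VERB"): 0.3,
         ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
         ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}
EMIT = {("NOUN", "kuusi"): 0.5, ("VERB", "kuusi"): 0.0,
        ("NOUN", "palaa"): 0.4, ("VERB", "palaa"): 0.6}

def viterbi(words):
    """Most probable tag sequence under the toy model."""
    # best[tag] = (probability of best path ending in tag, that path)
    best = {tag: (TRANS[("<s>", tag)] * EMIT[(tag, words[0])], [tag])
            for tag in TAGS}
    for word in words[1:]:
        best = {tag: max(
                    (p * TRANS[(prev, tag)] * EMIT[(tag, word)], path + [tag])
                    for prev, (p, path) in best.items())
                for tag in TAGS}
    return max(best.values())[1]

print(viterbi(["kuusi", "palaa"]))  # ['NOUN', 'VERB']
```

A real tagger differs mainly in scale: the probabilities come from a tagged corpus (step 2), and the tag set is much larger, but the recursion is the same.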

Implementing 1 and 2 requires broad domain knowledge in two very different domains (linguistics and machine learning respectively)

So while nowadays sentence segmentation can be considered a solved problem, it's far from trivial to implement one that can compete with the state of the art against real-world data.
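A minimal sketch of step 3, assuming a toy two-tag HMM; the tag set and all probabilities below are invented for illustration:

```python
import math

# Minimal Viterbi decoding for POS tagging (step 3 of the recipe above).
TAGS = ["NOUN", "VERB"]

# P(tag | previous tag), with "START" as the initial state
TRANS = {
    ("START", "NOUN"): 0.7, ("START", "VERB"): 0.3,
    ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
    ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4,
}
# P(word | tag), as would come from a tagged corpus (step 2)
EMIT = {
    ("dogs", "NOUN"): 0.4, ("dogs", "VERB"): 0.1,
    ("bark", "NOUN"): 0.1, ("bark", "VERB"): 0.5,
}

def viterbi(words):
    """Return the most probable tag sequence for `words`."""
    # best[tag] = (log probability, best path) for the prefix ending in tag
    best = {t: (math.log(TRANS[("START", t)] * EMIT[(words[0], t)]), [t])
            for t in TAGS}
    for w in words[1:]:
        best = {
            t: max(
                (lp + math.log(TRANS[(prev, t)] * EMIT[(w, t)]), path + [t])
                for prev, (lp, path) in best.items()
            )
            for t in TAGS
        }
    return max(best.values())[1]

print(viterbi(["dogs", "bark"]))  # ['NOUN', 'VERB']
```

The dynamic programming keeps only the best path per tag at each position, so the runtime stays linear in sentence length instead of enumerating every tag combination.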

There is also a nice body of deterministic (rule-based) literature that is practically ignored nowadays.

But Japanese is not written as character soup. It mixes two (actually 3) types of characters, with the "grammatical" sounds being written in hiragana and most content sounds being written in kanji. Since the grammatical sounds are a closed class, and tend to occur at word boundaries, it turns out to be relatively simple to separate words.

Isn't that also the case when parsing other languages from speech? Are there any audible cues between words when we speak?

Turkish is also highly regular and has only one irregular verb. The main problem arises from morphological complexity. Because of well-defined rules that alter a rich set of suffixes (>100 different ones), analyzing a word often yields many possible parses, and this ambiguity is hard to resolve.

I can't speak Japanese, but if it is also morphologically rich, it should face similar problems.

In Japanese I think it's pretty straightforward to trace the original word. But there is a fair share of homophones, so if your input is speech-to-text you already have many possibilities without much grammar.

It's interesting that Nokia's auto-completion worked well for Finnish, but modern solutions don't work.

> Joose Rajamäki 🇫🇮🇪🇺 @joose_rajamaeki Feb 20 > Yes, autocompletion [on phones] exists. But I hardly ever manage to compose a message where it wouldn't encounter new words. Also, it doesn't know the inflections, so it needs to encounter each word in each possible form before suggesting them.

> Joose Rajamäki 🇫🇮🇪🇺 @joose_rajamaeki Feb 20 > Old Nokia phones were good. They let you indicate that the word root was finished and you wanted to add agglutination and/or inflections.

On their Windows phones Nokia's Finnish autocomplete was actually even a bit better -- it suggested suffixes and was pretty accurate. I wonder how they trained it.

Same goes for Polish!

So, does that mean that if OpenAI's text generator turns out to be as "dangerous" as it's being marketed, we can all just switch to Finnish as lingua franca and Skynet won't have a chance? :)

That implies we can all "just" switch. I have a feeling Skynet will have a better chance than I do!

I've been trying to learn and "switch" to Finnish for 5 years. I can confirm this.

Polish has some of those problems too, at least the ones with conjugations. The words mom (appearing in your contact list), mom (as in call mom) and mom (as in send money to mom) are different. That applies to proper names too, and it isn't regular. It's not easy to figure out who the user refers to if you just have the contacts list and the command.

Polish had problems way before NLP. For example, "Annie has sent you a message" and "John has sent you a message" would be translated differently (because male/female), same for "2 minutes ago" and "5 minutes ago". Also, programmatically formulating natural messages like "Dear John" is nigh on impossible.

All Slavic languages are quite quirky in this regard. Most native speakers don't realize it, but when you try to write a correct message parametrized with numbers in Polish, you need 3 or 4 special cases:

     - 1
     - 2-4, 22-24, 32-34, ...
     - 5-21, 25-31, 35-41, ...

If you say "out of X" you also need to handle numbers 100-199, 100 000-199 000, 100 000 000-199 000 000, ... separately (because 100 = sto, and if the number begins with an "s" sound, "out of X" uses "ze" instead of "z"). This combines with the previous 3 cases, so "199 out of 199" is different from "99 out of 99", and different from "122 out of 122".

So "Copied X file(s) out of Y" would be:

     - "Skopiowano 1 plik"
     - "Skopiowano 2 pliki z 3"
     - "Skopiowano 5 plików z 5"
     - "Skopiowano 1 plik ze 100"
     - "Skopiowano 2 pliki ze 100"
     - "Skopiowano 5 plików ze 100".
Almost no software handles this correctly :) But spell checkers work OK (they usually just ignore relationships between words).
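The cases above can be sketched in code. This is a toy formatter (function names are made up, and the z/ze rule is simplified to the "leading 1xx group" test described above, so it is not a complete treatment of spoken Polish numerals):

```python
def plik_form(n):
    """Pick the Polish plural form of 'plik' (file) for count n."""
    if n == 1:
        return "plik"
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "pliki"
    return "plików"

def z_or_ze(n):
    """'z' becomes 'ze' before numbers whose spoken form starts with an
    s-cluster ('sto...'): 100-199, 100 000-199 999, and so on.
    Simplified here to: a leading '1' in a number whose digit count is a
    multiple of three."""
    s = str(n)
    return "ze" if len(s) % 3 == 0 and s[0] == "1" else "z"

def copied_message(x, y):
    """'Copied X file(s) out of Y' with the agreement rules above."""
    return f"Skopiowano {x} {plik_form(x)} {z_or_ze(y)} {y}"

print(copied_message(2, 3))    # Skopiowano 2 pliki z 3
print(copied_message(1, 100))  # Skopiowano 1 plik ze 100
```

Note that the two numbers are handled independently: X picks the noun form, Y picks the preposition.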

> Almost no software handles this correctly

Gettext solved it 20 years ago. The problem is that people try to reinvent the wheel instead of looking at existing solutions. So, they make their own inferior versions.

The rule for plural has 3 cases.

1. When n mod 10 = 1

2. When n mod 10 is 2,3,4

3. Everything else

And there's a special rule that the first 2 cases are not used when n mod 100 is between 11 and 14. For such numbers, the 3rd case is used instead.

This is actually not that complex. Compare to Arabic, where you have 9 cases IIRC.
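For reference, gettext expresses this as a Plural-Forms header. The expression commonly shipped for Polish, transcribed to Python (the forms list is just for illustration):

```python
# The usual gettext Plural-Forms header for Polish:
#
#   Plural-Forms: nplurals=3; plural=(n==1 ? 0 :
#       n%10>=2 && n%10<=4 && (n%100<12 || n%100>14) ? 1 : 2);
#
# The same expression, transcribed to Python:
def polish_plural_index(n):
    if n == 1:
        return 0  # "plik"
    if 2 <= n % 10 <= 4 and not (12 <= n % 100 <= 14):
        return 1  # "pliki"
    return 2      # "plików"

forms = ["plik", "pliki", "plików"]
print([forms[polish_plural_index(n)] for n in (1, 2, 5, 12, 22, 112, 122)])
# ['plik', 'pliki', 'plików', 'plików', 'pliki', 'plików', 'pliki']
```

The translator supplies this expression once per language; the application just asks for form 0, 1 or 2 given n.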

Doesn't solve the problem with "X out of Y".

It does. You have two numbers, so you have to split it into an X part and a Y part. Looking at your example, "1 plik" is always "1 plik" regardless of the Y quantity, and 100 always goes with "ze" regardless of the X quantity. It's a little bit tricky to set up, but it works. "Copied X of Y files" needs to be split into two translation units: "Copied X" and "of Y files".

That's "not supported" in my book. How should an English or Chinese developer know to split this? What about translating to other languages with different splits?

There's no system to handle this, so each special case must be embedded in the software itself, instead of properly doing it in internationalization package.

So - in effect - nobody even tries.

Almost no software handles this correctly because almost no software gives us the tools to handle it correctly. I've translated some software to Polish once or twice, and it used a simple templating system, like "message from {name}: {content}" or "{time} minutes ago". I didn't have access to the source code. As a consequence of that design, some messages were very malformed. There were also some overused strings, like "liked" for "You've just liked this", "liked" for "things you've liked", and "liked" as in "someone liked your post". Also untranslatable.

The Qt framework has a system in place for several plural forms, so the plik/pliki/plików problem is solved, and quite elegantly. If the application programmer gives a damn, that is.

There's also a way to add context to the strings for translation, so that you can see this "like" is used in this meaning, and that other "like" is used in another meaning.

But there's no generic solution to "z/ze" problem as far as I know, you need to do it by hand if you want correct messages. And most of the time programmers don't even realize they need to parametrize such stuff, so the systems don't do any good :)

I imagine that before we have a technical solution that works well enough to be used, natural language (guided by the technological limits) will have dropped such a rule in ordinary use.

Unlikely. The rule is dictated by ease of pronunciation. It's hard to say "z stu", much easier to say "ze stu".

The rule isn't hard to implement, it's just not expected to be a thing by framework/library creators.

I started learning Finnish and it looks much worse: twice as many cases, consonant gradation and puhekieli. Everything else seems to be on a similar difficulty level as Polish (except maybe genders, but that's pretty regular).

But I may not have a proper perspective as Polish is my native language and I don't work with NLP (although I created a small tool to determine a word based on its inflected form with data from Wiktionary).

Puhekiäli voi olla vähä vaikiaa, mutta sitten on viälä eriksensä murtehet. (Written in dialect: "Spoken language can be a bit hard, but then there are also the dialects on top of that.")

Croatian is the same. Noun declension makes some sense, as it sometimes allows you to omit prepositions (although I sure am glad English doesn't do that). But gendered verbs are just stupid.

"gendered" verbs are a thing in Italian too: "I fell" becomes "sono caduto" or "sono caduta" depending on the gender of the speaker. Transitive verbs reflect the gender of the objects when using clitics, e.g. "I ate the apple, I ate it really" -> "Ho mangiato la mela, l'ho mangiata davvero".

What's odd in Croatian is that all the verbs behave like intransitive Italian verbs, reflecting the gender of the speaker: "I fell" -> "pao sam" or "pala sam", but also "pojeo sam jabuku" vs "pojela sam jabuku" if the speaker is a woman.

I'm not a linguist, but as a native speaker of both, I always thought it had something to do with the fact that in Croatian the verb "to be" is the only auxiliary verb used to build past tenses, while in Italian transitive verbs use the verb "to have".

The past participles in many languages behave very similarly to adjectives and thus I thought it might not be surprising that you'd decline them according to number and gender as you'd do with any adjective.

So if when you fall you "are fallen", you decline the "fallen" participle/adjective according to the subject, namely you in this case.

Thus instead of "have eaten", Croatian says "am eaten" (not meaning the passive voice, but just using the verb "to be" as the auxiliary instead of "have"); as a consequence, you notice the "gendered verbs" more often than in other languages.

As to why Italian declines participles when clitics are involved, I assume it adds some redundancy so you can more easily guess what the clitic refers to.

> The words mom (appearing in your contact list), mom (as in call mom) and mom (as in send money to mom) are different.

Each time I tried the WP8.1 voice assistant with "zadzwoń do mama", it felt extremely awkward to me due to the nominative case of the word "mama". It should be "zadzwoń do mamy", which is the genitive case if I'm not mistaken; that would make the whole phrase sound natural.

Interesting, Polish and Czech are so alike and so different :D I guess the word Polish "mama" is pretty much the Czech "mamka" which similarly declines to "mamky" in the genitive case, but "do mamky" here is "into mum" which is ... rather odd (I think Czech would use the Dative "mamce" - so “zavolej mamce” or “zavolejte mamce” depending on your relationship with your phone!)

I get tripped up when, for example, talking about getting presents from people and never seem to use the correct preposition - "od" or "z". I seem to use the one that would implied "came out of" not "came from"

It's the same West Slavic language group, along with Slovak, so they are similar to a certain degree but sometimes differ way too much, like for example the infamous "szukać" (look for, seek; hledat) and "šukat" (to have sexual intercourse, to put it mildly), or "czerstwy" (stale) and "čerstvý" (fresh). I'm partially able to understand my Czech friend when he speaks his language, and the same goes for him.

As for "mama", there's a quite good explanation of why it looks similar (if not the same) in many languages: https://www.mother.ly/parenting/mama-is-most-universal-word

The morphology of the Finnish language has of course been studied, and spell checkers for it exist.

E.g. open-source: https://voikko.puimula.org/ (and its online version https://oikofix.com/contact?lang=en, where you can also play with the analysis/parsing capabilities)

Of course, the more complicated morphology makes it less straightforward than English, where simple word-list-based approaches are sufficient.

Turkish, being a member of the Ural-Altaic language family just like Finnish, suffers from similar problems. Google Translate has improved much over the years, yet it still generates laughable text at best. Stemmers used to produce show-stopper word stems (I don't really know the current situation).

It has, though, imported a lot of technical terms from various European languages, making technical texts seem more legible due to the lack of compound words in those contexts.

Similarly, from tweets:

     yiyecek = food
     yiyecek miydi = will he/she eat that
(the 2nd word is not a separate word, just a conjunction that is written separately)

     göz = eye
     gözcü = scout
     gözlük = glasses
     gözlükçü = glass salesman
     gözcülük = scouting
     gözlükçüler = glass salesmen
     gözlükçüydüler = they were glass salesmen
     gözlükçü müydüler = were they glass salesmen?

also an all time classic: çekoslavakyalılaştıramadıklarımızdan mısınız = are you one of those people whom we tried unsuccessfully to assimilate into a Czechoslovakian citizen?
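A toy analyzer for the examples above might just strip known suffixes recursively. The stem and suffix inventories here are a tiny hand-picked subset; real Turkish morphotactics and vowel harmony variants are far richer:

```python
# Recursively strip known suffixes until a dictionary stem remains.
STEMS = {"göz"}
SUFFIXES = {"cü", "çü", "lük", "ler", "ydü"}  # -CI, -lIk, plural, past copula

def analyze(word):
    """Return one stem+suffix segmentation of `word`, or None."""
    if word in STEMS:
        return [word]
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            parsed = analyze(word[: -len(suffix)])
            if parsed:
                return parsed + [suffix]
    return None

print(analyze("gözlükçüler"))     # ['göz', 'lük', 'çü', 'ler']
print(analyze("gözlükçüydüler"))  # ['göz', 'lük', 'çü', 'ydü', 'ler']
```

With the full suffix inventory and its harmony/consonant variants, many words get multiple candidate parses, which is exactly the ambiguity problem mentioned above.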

The Ural-Altaic language family is today considered an obsolete concept [1], and the two families are considered unrelated.

[1] https://en.wikipedia.org/wiki/Ural%E2%80%93Altaic_languages

The models/software created for one language mostly fit others. E.g. two-level finite state morphology was invented for Finnish and was very successfully adapted to parse Turkish words.

So, the opinions of linguists aside, the languages that make up the Ural-Altaic family are not that far apart from each other.

Yes, but both languages share some important structural similarities: rich morphology, extreme agglutination and vowel harmony.

My suspicion is that most NLP techniques were invented by native speakers of "easy" languages

Though romance languages/germanic languages (and even English ancestors) have their quirks they're not at Finno-Ugric levels

(Not that English has no quirks - especially in pronunciation - but it is "easy" to deal with most of the weird exceptions)

> tietokone = knowledge machine (literally) = computer

This is somewhat beside the point, but although in everyday speech "tieto" usually means knowledge in the sense of knowing a fact or skill, it can also be taken as "information", which seems more likely to have been the original intent of the term.

Nevertheless, pretty much every elementary computing course in Finnish begins by stating that "the computer doesn't actually know anything". :)

It's hard to spellcheck too, but there's hunspell which is pretty good and is used by a lot of open source products: https://en.m.wikipedia.org/wiki/Hunspell
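To illustrate the affix-compression idea hunspell uses, here is a minimal, hypothetical dictionary/affix pair for two gradation-free Finnish stems. This is only a sketch (the file names, flag letter and case selection are made up; the comment lines are annotations, not literal file content):

```
# words.dic: stem count, then stems with affix flags
2
talo/A
auto/A

# words.aff: suffix class A attaches three locative case endings
SFX A Y 3
SFX A 0 ssa .
SFX A 0 sta .
SFX A 0 lla .
```

With this, the checker accepts talossa ("in the house"), autolla ("by car"), etc., without listing every inflected form. Full Finnish support also needs consonant gradation and many more case/possessive/clitic combinations, which is why dedicated tools like Voikko exist.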

Seems like they are splitting their tokens wrong. Spaces are just a suggestion.

But seriously, the problems outlined reminded me very much of Japanese and korean, just turned up to 11.

Japanese has endless amounts of homonyms, and at least in theory words can go on and on with added conjugations. It has a lot of compound nouns, often built from abbreviations of the words compounded.

The author mentions that these compounds are a problem in Finnish NLP because they explode the size of the vocabulary.

Written Japanese does not contain any word boundaries at all. Text is split on morphemes for NLP tasks, which helps against exploding vocabularies but also separates the parts into their own meaning units.

For written Japanese you have characters which can guide you in meaning for a lot of compound words, but they blow up your character space. That is not the case for Finnish, but I can't really decide if that is an advantage or disadvantage.

Also, a spellchecker and a chatbot would rely on completely different techniques, so you can have one without the other. Japanese doesn't even have spellcheckers, but they have a ton of other tools that millions of people rely on everyday for writing faster and better text, like kana-kanji conversion and word suggestions.

This is interesting. Especially the bit about compounding taking the meaning from figurative to literal. Modern NLP embeddings still assume a canonical meaning for each word and therefore cannot (immediately, anyway) distinguish between figurative and literal meanings. Not that such efforts don't exist, but they all seem to be missing something fundamental - namely, is the "canonical" meaning, the vector, supposed to be literal, or figurative? This is a question with potentially no right answer, yet we all assume that the answer is "literal".

At the same time, I do wonder if part of the problem with NLP algorithms and Finnish isn't so much its complexity as the fact that there is very little Finnish data to train on.

Why can't the NLP split the compounds? I think this problem of splitting happens in multiple languages -- the submission mentions German, and then you have Chinese which doesn't have spaces, so you have to split words. (I understand that splitting Japanese into words is simpler.)

Another complication not mentioned in the tweets is consonant gradation, in which the stem changes in dozens of ways when suffixed (examples: http://users.jyu.fi/~pamakine/kieli/suomi/vaihtelu/astevaiht...), which makes Finnish a lot less cleanly agglutinative than most other languages. It's not like it can't be done, but it's yet another step (find the likely split points, then try to undo the consonant gradation) and I'd imagine casual foreign researchers aren't going to bother.

Really interesting! I was wondering a few months ago how autocompletion on phones works for agglutinating languages like Finnish, but I didn't have a Finn to ask =/ If you wouldn't mind enlightening me, does it even exist for Finnish? If so, how does it work?

I can tell you based on Russian that it works poorly compared to English, esp. for verbs. I'd expect to be even worse for Turkish (which I used to know on beginner level), or Finnish. When the word you are typing becomes obvious, it would offer 3 (3 is the default on my keyboard app) random forms that do not appear to take context into account, or if they do they're often wrong. You can either just finish typing, or choose one of those, erase the ending/suffixes and fix them.

Writing Finnish on a phone is terrible until it learns all of your most commonly used words and their most common forms, at which point it becomes merely bad. Whenever I write Finnish using my phone I basically have to work around the keyboard not doing what I want. I'll still swipe to form words, but I have to go back to "fix" them because the system usually can't even handle simple compounds.

In a similar vein, submitted on HN this morning:


Latin has similar issues. Thankfully we have Whitaker's Words and the Perseus Word Study tool to help. I'm surprised that Latin is (apparently) better off than Finnish in this respect.

I still haven't figured out how Germans use swipe keyboards on smartphones with the monster compound sentence-words. If they do.

On iOS, the keyboard just suggests compounds. This works for those you use often. Another option is to choose a prefix, then backspace (over the space), then add to the prefix.

Example compound word: Schleifmaschinenverleih (word boundaries: Schleif'maschinen'verleih)

You begin with "Schl" which might suggest "Schleifer", then you delete "er", add "m", and if you're lucky, you get "Schleifmaschine".

"Schleifen" means to grind. "Maschine" means machine. So "Schleifmaschine" is grinding machine, and the disappearance of "en" is a grammatical artifact. "Verleih" means rental (as in a company that does rentals).
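Splitting such compounds can be sketched with a dictionary plus optional linking elements (Fugenelemente). Everything below (stems, linkers, function name) is a made-up toy:

```python
# Toy decompounder: split a German compound into dictionary stems,
# allowing an optional linking element ("Fugenelement") between parts.
STEMS = {"schleif", "maschine", "verleih"}
LINKERS = ["", "n", "en", "s"]

def decompound(word, parts=()):
    """Return one split of `word` into stems, or None."""
    if not word:
        return list(parts)
    for stem in STEMS:
        if word.startswith(stem):
            rest = word[len(stem):]
            for link in LINKERS:
                if rest.startswith(link):
                    found = decompound(rest[len(link):], parts + (stem,))
                    if found:
                        return found
    return None

print(decompound("schleifmaschinenverleih"))
# ['schleif', 'maschine', 'verleih']
```

This finds "Schleif + maschine(n) + verleih", treating the "en" as a linker rather than part of a stem; a real decompounder would also have to rank competing splits.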

I just write the words separately and remove the spaces afterwards. An "I'm about to write a compound word" button would be an easy solution, but it shouldn't be hard for the keyboard to automatically join the words after you've finished your sentence; it's mostly rule-based.

There's the same problem in Swedish. I usually swipe the words separately or, if they're short, I type them manually and then go back to swiping. The decision to type them usually comes when I've tried a few times to swipe it before realising it's just not in the dictionary.

It's actually really frustrating. I absolutely love swiping, but it doesn't feel like it works very well with Swedish.

It just means phones ship with larger dictionaries.

And I would guess this isn't even the proper way to handle German. The long words are just sequences of words not split by spaces when writing. In a parallel universe the German writing system would use spaces and the language wouldn't need to change a bit.

Words are features of writing systems, not of languages.

I often just write separate words. Not correct, strictly speaking, but understandable.

I still prefer regular keyboards anyway (desktop and laptop) -- not sure if that's related.

I don't understand case #4, can someone explain please?

Natural Language Processor broken by Finnish language.


Okay, I'll see myself out now.
