the spruce is on fire
the spruce returns
the number six is on fire
the number six returns
six of them are on fire
six of them return
your moon is on fire
your moon returns
I'm a recent immigrant to Finland, and while the regular grammar makes some things simple, the cases and suffixes are enough to drive me crazy! The simple things are easy to handle if you're reading a textbook, but remembering them in real-time when speaking is very hard.
- kuu = moon
- kuusi = spruce
- palaa = returns
- pala = piece
From those you can see how one can interpret the compound word (with or without suffix) however you wish.
- kuu+si = your moon (where 'si' is the suffix which means 'your')
- pala+a = pieces (where 'a' suffix is the plural)
- kuusi = six
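To make the combinatorics concrete, here is a toy sketch that enumerates the readings of "kuusi palaa". The `ANALYSES` table and the `readings` function are made up for illustration and only contain the glosses given in this thread; a real analyzer would need a full lexicon and morphology.

```python
from itertools import product

# Toy analyses copied from the glosses in this thread; not a real Finnish lexicon.
ANALYSES = {
    "kuusi": [
        ("kuusi", "spruce"),
        ("kuusi", "six"),
        ("kuu+si", "your moon"),
    ],
    "palaa": [
        ("palaa", "is on fire"),
        ("palaa", "returns"),
        ("pala+a", "pieces"),
    ],
}


def readings(sentence: str) -> list[str]:
    """Return every gloss combination for a whitespace-separated sentence."""
    per_word = [ANALYSES[word] for word in sentence.split()]
    return [
        " / ".join(gloss for _, gloss in combination)
        for combination in product(*per_word)
    ]


if __name__ == "__main__":
    for reading in readings("kuusi palaa"):
        print(reading)
```

Three analyses per word give nine candidate readings, and nothing in the string itself says which one is meant.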
Can we get a bit more detail here? Are those variations in pronunciation? How is a native Finn able to infer the correct meaning?
The only way to infer what's meant is from context.
Interestingly when I showed the actual text to my wife, who is Chinese, she didn't understand what was funny about it until I asked her to read it to me, when she burst out laughing. That's when I found out she doesn't sub-vocalise, where I do all the time.
On the other hand, most of the characters used to be pronounced differently when they were still commonly used as individual words, so the poem only works because it mixes modern pronunciation with classical grammar.
This nicely illustrates where the Finno-Ugric languages (Finnish, Hungarian, Estonian) sit compared to other European languages.
The benefit is that people are somewhat safe from scammers, as no translator does a good job. The downside is that none of the translators work, and it's way easier to just translate to English, for example.
Slightly off topic (not about grammatical genus, but about German's case system): I often compare German's case system to a type system in a programming language: the verb expects the sentence's object in the suitable case. If you use a wrong type, it gives a compile error. In most cases, the case (type) that you have to use makes sense, but there are rare cases where you simply have to learn the case. The latter is like having to use a (programming) library that was designed by some other programmer with an interface that you would design differently.
Just like you have to write a computer program that gives no compile error (e.g. because of typing), in German, you also have to formulate your sentences in a way that gives no compile error to its case (type) system.
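A rough sketch of the analogy, in case it helps: each verb declares the case it expects for its object, and a mismatch raises an error. All names here (`Case`, `NounPhrase`, `OBJECT_CASE`, `check_object`, `CaseError`) are invented for illustration, and the two verbs are just examples: "sehen" takes an accusative object, while "helfen" is one of those verbs where you simply have to learn that it takes the dative.

```python
from dataclasses import dataclass
from enum import Enum


class Case(Enum):
    NOMINATIVE = "nominative"
    ACCUSATIVE = "accusative"
    DATIVE = "dative"
    GENITIVE = "genitive"


@dataclass(frozen=True)
class NounPhrase:
    text: str
    case: Case


# The "signature" of each verb: which case it expects for its object.
# Only two illustrative verbs; this is not a grammar of German.
OBJECT_CASE = {
    "sehen": Case.ACCUSATIVE,   # "Ich sehe den Mann."
    "helfen": Case.DATIVE,      # "Ich helfe dem Mann."
}


class CaseError(TypeError):
    """The 'compile error' of the analogy: object has the wrong case."""


def check_object(verb: str, obj: NounPhrase) -> None:
    """Raise CaseError if the object's case is not what the verb expects."""
    expected = OBJECT_CASE[verb]
    if obj.case is not expected:
        raise CaseError(
            f"{verb!r} expects a {expected.value} object, "
            f"got {obj.case.value}: {obj.text!r}"
        )


if __name__ == "__main__":
    check_object("sehen", NounPhrase("den Mann", Case.ACCUSATIVE))  # passes
    try:
        check_object("helfen", NounPhrase("den Mann", Case.ACCUSATIVE))
    except CaseError as error:
        print("case error:", error)
```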
The first thing you learn is that three genders are indicated in the nominative case using der (masculine) / die (feminine) / das (neuter).
So you see "der" in a sentence and that thing must be masculine, right? Nope, because "der" is feminine in the dative case.
The gender system in German wouldn't be such garbage if the articles weren't overloaded. Imagine if in your type system "float" could mean two different types, and you needed more information to figure out which type it is (like, for example, memorizing the type of the object it's referring to). It breaks the idea of locality.
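The overloading can be written down as a lookup table. The sketch below uses the standard table of German definite articles (16 gender/case slots shared by 6 forms); the names `DEFINITE_ARTICLES` and `possible_cases` are made up for this example.

```python
# Each definite article maps to every (gender-or-plural, case) slot it can fill.
DEFINITE_ARTICLES = {
    "der": [("masculine", "nominative"), ("feminine", "dative"),
            ("feminine", "genitive"), ("plural", "genitive")],
    "die": [("feminine", "nominative"), ("feminine", "accusative"),
            ("plural", "nominative"), ("plural", "accusative")],
    "das": [("neuter", "nominative"), ("neuter", "accusative")],
    "den": [("masculine", "accusative"), ("plural", "dative")],
    "dem": [("masculine", "dative"), ("neuter", "dative")],
    "des": [("masculine", "genitive"), ("neuter", "genitive")],
}


def possible_cases(article: str, noun_gender: str) -> list[str]:
    """Cases the article can mark once the noun's gender is already known."""
    return [
        case
        for gender, case in DEFINITE_ARTICLES[article]
        if gender == noun_gender
    ]


if __name__ == "__main__":
    print(possible_cases("der", "masculine"))  # ['nominative']
    print(possible_cases("der", "feminine"))   # ['dative', 'genitive']
```

This is the non-local part: the article alone does not determine the slot, so you need the noun's memorized gender (and sometimes more context) to resolve it.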
I personally would never teach it that way. To me the four cases are like different faces of, say, a cube (I know a cube has six faces - sorry), where the cube represents the noun. Just like looking at a cube from different perspectives (in the direction of some face) gives you a different view on it, each case gives a different "perspective" of a way that the word relates to some different clause in a sentence.
How the articles and endings "rotate" when the case changes ("as the cube rotates") is something that I developed a really smart scheme for (which can be found in no book on German grammar that I am aware of) that every "mathematical-minded person" immediately grasped, told me it perfectly made sense and wished that someone had taught it this way in school. On the other hand, nearly every "humanities-minded person" told me that they cannot understand what I am even talking about and what advantage this perspective is supposed to have.
Typical language teachers are "humanities-minded". :-(
Unfortunately, I have far too little time to write this all down in sufficient detail for other people to learn from it; I have far too many other things to do. :-(
> Imagine if in your type system "float" could mean two different types, and you needed more information to figure out which type it is (like, for example, memorizing the type of the object it's referring to).
You are describing declarations and definitions in C++ and the problems of parsing C++? ;-)
I have known plenty of German speakers, including natives and often myself, who make case/gender mistakes with no real impact on comprehension.
There certainly can be some ambiguity, but context and word order can often mitigate that, no? I feel like if you used "die" for everything and just a neuter nominative case, I could still figure most things out. My experience is that many German speakers (especially when speaking a dialect) just mush articles and suffixes together and it's not entirely clear that they use specific genders and cases most of the time anyway. For example, my relatives near Frankfurt pretty much say "d- Milch", "d- Hund", "d- Mauer", etc. SOMETIMES, you can hear a bit of an "s" when it's supposed to be das, des. YMMV of course.
I don't really find German to be less forgiving of mistakes than English or the handful of other languages I've studied.
At least, this is what I take away from your comment about compile errors. Maybe -Wcase or -Wgender is a closer analogy?
Anyway, I like your analogy overall. :)
This one: https://news.ycombinator.com/item?id=19109371
Languages are tightly inherited, even more so than genes. How languages are structured varies highly. Some have gender-like tenses for the cardinal directions, where things are not feminine but 'northy'. Some have no genders, but do have time-like tenses, where things are 'future-y' or 'present-y'. Just look at sign language, with its use of height similar to an exclamation point in English at times. Human communication is weird.
Even in English, our sense of gender has changed over time. It used to be that we cared more about the number of items/people than about their gender, but now we care about the gender. IE: Thee/Thou/You became Him/Her/You.
English is, herself, a pidgin language. The irregularity of the verbs and the reliance on context make it very accommodating to new terms and nouns, but VERY difficult to learn fluently. It is standardizing though, the Afro/Urban-American vernacular use of 'be' is a perfect example (I be going to the store, not, I am going to the store). This change of 'be' is very similar to the romance languages. English is very slowly being wrenched into standardization. It will be very interesting to see how Chinese and Computer Programming languages affect English.
The Story of English is a great look into this language and its story:
For example you can select a gender pronoun in English to match your gender. https://uwm.edu/lgbtrc/support/gender-pronouns/
In Hungarian we just say "Ő" to refer to someone, privacy is baked into the language :)
It doesn't really make sense to anthropomorphise things once you're an adult, but I bet everyone can remember trying to figure out the world as a kid and just assuming things are of similar nature as people.
I think that even languages which currently don't have gender forms of words, like English, still have the concept psychologically firmly embedded in the culture. The generic "cat" is more female than male. The "tom-cat" is only male. A "rock", I'll bet, is also subconsciously male, and a "river" more female than male. It's all there, even if not explicit in language. OTOH, it's not really universal, different languages assign gender pretty much randomly - but they do assign it.
Social determinism and radical neutrality go against the grain of this "making sense of the world" phase of development, and probably don't result in a practical world-view.
Similar to effects of (past) umlauts in English or the a/an difference.
"Tommy, Frank, and Susan burned their homework and fed it to the dog."
"Who fed it to the dog?"
Our languages keep having gender, so it's clearly been advantageous from an evolutionary perspective. The problem is when we let gender become more important than the data it's trying to express, like when a society lets gender dictate your social class. That shit needs to stop.
In German, words like "library" (Bibliothek) and "street" (Strasse/Straße) are feminine, for example. It's true that some words for female persons, such as "woman" (Frau) and "mother" (Mutter), are feminine. But then you have words like "girl" (Mädchen), which are not.
Your example ("she did") could actually be ambiguous in languages that extend the noun gender to pronouns, such as French. "She" doesn't need to refer to a woman, just a feminine noun. For example, "car" (voiture) is feminine, and you can refer to "it" as "elle" (she).
Many European languages also lack a future tense. English, for example. (https://www.quora.com/Is-it-true-that-English-has-no-future-...)
English has no inflected future tense because of the way words are usually defined. Verbs in English only have 4 types of inflections because words are defined as whatever is surrounded by spaces in writing. If we define words as the unit that carries a single beat in speaking, so "He is-to-see-it tomorrow" is 3 words, then verbs also have prefixed inflections. English then does have an inflected future tense.
The "is to" future form isn't mentioned in discussions on future tense very often but seems to have the most predetermination, whereas "is going to" and "will" indicate more volition by the doer.
> If we define words as the unit that carries a single beat in speaking
Note that this definition clearly won't do what you want it to do (even ignoring the imprecision of "single beat"), as the units you want to treat as verbs can be separated to an arbitrary degree. E.g. "Is he really, in your opinion, to see her tomorrow?"
Edit: not clearly. There are still some additional assumptions to make with no clear evidence either way (notably, the category of "will"), and the definition of "tense" is ambiguous.
 English's rules for V-T movement are relatively complicated compared to most languages
Yes, there is.
>Under this analysis, English clearly has a future tense.
No, this does not follow. You can just as easily (and more correctly) analyze "will" as a modal auxiliary like "can" or "would". Supposing it's true that "will" is the daughter of I/T/whatever, it doesn't follow that the I/T/whatever head has to have a future tense feature.
You could argue, and some people have, that at some level, the interpretation of "will" involves future tense in some semantic sense, which conceivably might be realized as a feature on a head somewhere in the inflectional spine of the tree. But that tense feature would not have the same kind of straightforward syntactic and morphological motivation as e.g. the +past feature does in English. In the context of a general audience, this isn't a notion of "tense" that's relevant. After all, in this highly abstract sense, even a language such as Chinese might have tense.
FWIW I have a PhD in (generative) syntax. I don't think that any generative syntactician addressing a general audience would describe English as having a future tense.
I am looking at this question from the perspective of X-bar theory, and defining "tense" as a feature of the syntactic category that populates the "T" node at deep structure (or a feature of the I category).
As I alluded to in my comment, I have never seen a convincing argument in either direction for whether "will" is a T or a V; and this was an active discussion in my syntax class about 3 years ago where the professor (a syntactician) had also said that she never saw a compelling argument either way. If you know of one, please share.
Edit: I think you edited your post after I responded. I would argue that all languages have the same featureset (including tense) at d structure, but that is going far afield.
Edit2: there is also a notion of tense at the semantic level. And English seems to offer a grammaticalized way of providing the semantic future tense, regardless of how you look at the corresponding syntax/morphology.
If we accept that "will" is a syntactic "T" and carries a semantic meaning of "future" then how is it unreasonable to call it a future tense? Sure, there are debatable parts to the above assumptions, but it is a stretch to say that no analysis has future tense. Heck, there is a universal grammar analysis that would say English (and Chinese) has all the tenses, and the only question is which ones surface as distinct pronunciations.
 This leads to a notationally weird case where "to" is a tense with a -tense feature.
If you're punning on "tense" to refer to a lexical category, then sure. But we're talking about tense in the sense of past, present and future. In many textbook analyses, lots of things go in T. For example, "can" and "might" go in T. It doesn't follow that, say, "I might go to the store" and "I can go to the store" have a different tense.
Even restricting ourselves to the syntactic features, I have still never seen an argument for or against English having a syntactic future tense feature that did not come down to UG, or a disagreement about which model is simpler, and neither side is convoluted enough for me to confidently invoke Occam's razor.
Here you're defining syntax around your orthographic definition of what a word is.
> "Is he really, in your opinion, to see her tomorrow?"
The opinion words "really" and "in your opinion" added into that syntactic structure parallel the manner that opinion word "bloody" is added into a lexical structure to become "abso-bloody-lutely".
No, I'm not. As I said, there is no unit in English (regardless of whether or not it is written as a single word) that both behaves like a verb and has a distinct future form.
>The opinion words "really" and "in your opinion" added into that syntactic structure parallel the manner that opinion word "bloody" is added into a lexical structure to become "abso-bloody-lutely".
You seem to have missed the fact that even if you remove "really" and "in your opinion", the unit you identified as a verb is still broken up in the example I gave. In "Is he to see her tomorrow?", "is-to-see" is not a contiguous unit. Although the idea of treating the subject pronoun as a verbal infix might be charming (English as Mohawk!), the subject could be a much larger phrase, as in e.g. "Is the man in the red hat to see her tomorrow?"
It is simply not true that adjunction of adverbials has the same properties as expletive infixation. Even if there is some loose analogy between these two processes, I fail to see how it's relevant to units defined in terms of a single "beat in speaking".
You can insert stuff between "is" and "to-see" (and between "to" and "see", I hope you agree) because English mostly parses into left-branching trees, which is the only reason why you can't insert stuff between "see" and "-ing". Natural languages don't liberally mix left-branching structures and right-branching ones because they're hard for people to understand. E.g. Pinker's "The rapidity that the motion that the wing that the hummingbird has has has is remarkable" is grammatically correct but not easily understandable. The contiguity or lack of it operates on a different conceptual plane to whether a "word" can be inflected or not.
> loose analogy
To take some example language from your reply, some people speak this as two words, others (e.g. some Americans) as a compound word. Because the concept of a word is not well defined, the boundary between morphology and syntax is blurred. Putting "is" and "to" noncontiguously before a verb is the same as putting "-ed" after a verb. They are both "inflections".
It's clearly thousands of miles away from being a generally accepted analysis of English syntax. So unless you've written up your analysis in detail somewhere, there's not much further to discuss.
This seems like an arbitrary and peculiar way to define tense specifically to exclude periphrastic languages like English. If these are the rules by which we are arguing, the argument doesn't seem that interesting. By this argument colloquial French and German have no past tense because they use periphrastic constructions to express this distinction.
Tense is a piece of technical terminology with a particular definition. There's nothing very peculiar about the definition that I can see. The definition denies English a future tense because English doesn't have periphrastic units which (i) behave like verbs and (ii) have a distinct future form.
For whatever reason, people are really resistant to taking linguists' word for it on what technical terms within linguistics mean. You don't find people on here trying to argue with physicists about whether they should maybe adopt a different definition of "force" or "mass" or "electron".
Certain things are settled: "will" in English behaves like other English auxiliary verbs (setting aside those cases where it doesn't, as in contractions). Saying "English has no future tense", though, is settled only within the confines of a particular linguistic model with a particular technical definition of the term "tense", of which there are many.
Could you then provide a reference to a linguist arguing that English has a future tense (in the usual syntactic/morphological sense of the term)?
I'm baffled by the reference to the "confines of a particular linguistic model". Linguists who can agree on hardly anything else (and linguists are pretty good at disagreeing!) can agree that English lacks a future tense.
While implementing spell checkers for these languages need a bit more effort than just compiling a list of words, it's far from an unsolved problem. AFAIK Zemberek is a Turkish spellchecker that implements two-level finite state morphology.
Also, a small terminology fix: verbs are "conjugated" but nouns are "declined".
When people talk about NLP, they are usually referring to understanding the meaning of the entire text being analyzed, rather than just providing meanings for individual words. A spell checker can't tell you the intent or subtleties of what is being conveyed.
Sure, but the linked tweet claims that not even a working spellchecker exists for Finnish. I'm sure Joakim Nivre would think otherwise!
AFAIR the whole language has maybe a few irregular verbs, compared to a few hundred in English.
Another problem is proper names. This one is inherently unsolvable, because anything can be a word. You can have a dictionary of proper names, but there will still be people, companies, fictional characters who aren't in the dictionary.
忘れちゃ 動詞,*,母音動詞,タ系連用チャ形,忘れる,わすれちゃ,代表表記:忘れる/わすれる 付属動詞候補（基本） 反義:動詞:覚える/おぼえる
1. can be done either deterministically or stochastically.
2. requires you to have a language model trained with either human-tagged or semi-human-tagged corpus
3. was just the Viterbi algorithm last time I looked.
Implementing 1 and 2 requires broad domain knowledge in two very different domains (linguistics and machine learning respectively).
So while nowadays sentence segmentation can be considered a solved problem, it's far from trivial to implement one that can compete with the state of the art against real-world data.
There is also a nice body of deterministic (rule-based) literature that is practically ignored nowadays.
I can't speak Japanese, but if it is also morphologically rich, it should face similar problems.
> Joose Rajamäki 🇫🇮🇪🇺
> Yes, autocompletion [on phones] exists. But I hardly ever manage to compose a message where it wouldn't encounter new words. Also, it doesn't know the inflections, so it needs to encounter each word in each possible form before suggesting them.
> Joose Rajamäki 🇫🇮🇪🇺
> Old Nokia phones were good. They let you indicate that the word root was finished and you wanted to add agglutination and/or inflections.
Polish had problems way before NLP. For example, "Annie has sent you a message" and "John has sent you a message" would be translated differently (because male/female), and the same goes for "2 minutes ago" and "5 minutes ago". Also, programmatically formulating natural messages like "Dear John" is nigh on impossible.
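A tiny sketch of the male/female part, to show why a single "{name} has sent you a message" template is not enough. The verb forms "wysłał" (masculine) and "wysłała" (feminine) are standard Polish but are my own addition, not from the comment above; the names `SENT` and `sent_message` are hypothetical.

```python
# Past-tense verb form depends on the grammatical gender of the sender.
SENT = {"m": "wysłał", "f": "wysłała"}


def sent_message(name: str, gender: str) -> str:
    """Build '<name> has sent you a message' in Polish; gender is 'm' or 'f'."""
    return f"{name} {SENT[gender]} ci wiadomość"


if __name__ == "__main__":
    print(sent_message("Annie", "f"))
    print(sent_message("John", "m"))
```

The catch is that the software needs to know the sender's grammatical gender at all, which most user databases don't store.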
- 2-4, 22-24, 32-34, ...
- 5-21, 25-31, 35-41, ...
So "Copied X file(s) out of Y" would be:
- "Skopiowano 1 plik"
- "Skopiowano 2 pliki z 3"
- "Skopiowano 5 plików z 5"
- "Skopiowano 1 plik ze 100"
- "Skopiowano 2 pliki ze 100"
- "Skopiowano 5 plików ze 100".
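The messages above can be generated with a short sketch like the following. The plural selection follows the grouping in the list (1; then 2-4, 22-24, ...; then everything else). The "z"/"ze" choice is a loud assumption on my part: it picks "ze" only when the total is 100-199 (spoken form starting with "st-"), and it is only meant for totals below 1000. Function and variable names are hypothetical.

```python
FILE_FORMS = ("plik", "pliki", "plików")


def polish_plural_index(n: int) -> int:
    """0 for exactly 1, 1 for 2-4 / 22-24 / 32-34 / ..., 2 for everything else."""
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1
    return 2


def out_of(total: int) -> str:
    """Pick 'z' or 'ze' before the total.

    Assumption (not a general rule): 'ze' only for 100-199; valid for
    totals below 1000 only.
    """
    return "ze" if 100 <= total <= 199 else "z"


def copied_message(copied: int, total: int) -> str:
    noun = FILE_FORMS[polish_plural_index(copied)]
    return f"Skopiowano {copied} {noun} {out_of(total)} {total}"


if __name__ == "__main__":
    for copied, total in [(2, 3), (5, 5), (1, 100), (2, 100), (5, 100)]:
        print(copied_message(copied, total))
```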
Gettext solved it 20 years ago. The problem is that people try to reinvent the wheel instead of looking at existing solutions. So, they make their own inferior versions.
The rule for plural has 3 cases.
1. When n mod 10 = 1
2. When n mod 10 is 2,3,4
3. Everything else
And there's a special rule that the first 2 cases are not used when n mod 100 is between 11 and 19. For such numbers, the 3rd case is used instead.
This is actually not that complex. Compare to Arabic, where you have 9 cases IIRC.
There's no system to handle this, so each special case must be embedded in the software itself, instead of being handled properly in an internationalization package.
So - in effect - nobody even tries.
There's also a way to add context to the strings for translation, so that you can see this "like" is used in this meaning, and that other "like" is used in another meaning.
But there's no generic solution to "z/ze" problem as far as I know, you need to do it by hand if you want correct messages. And most of the time programmers don't even realize they need to parametrize such stuff, so the systems don't do any good :)
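For reference, this is roughly what the plural and context mechanisms look like from Python's standard `gettext` module. The sketch uses `NullTranslations`, so no catalog is loaded and the calls just fall back to the English source strings; with a real Polish catalog, the `Plural-Forms` header in the .po file would pick one of three forms instead. The context strings and function names are made up for illustration.

```python
import gettext

# No catalog loaded: NullTranslations returns the source strings unchanged
# and applies the English "n == 1" rule for plurals.
translations = gettext.NullTranslations()


def copied_label(n: int) -> str:
    """Plural-aware message: the catalog, not the code, decides the form."""
    template = translations.ngettext("Copied %d file", "Copied %d files", n)
    return template % n


def like_labels() -> tuple[str, str]:
    """Same English word, two contexts, so translators can tell them apart."""
    as_verb = translations.pgettext("button: approve a post", "like")
    as_preposition = translations.pgettext("comparison: similar to", "like")
    return as_verb, as_preposition


if __name__ == "__main__":
    print(copied_label(1))
    print(copied_label(5))
    print(like_labels())
```

Note that neither mechanism helps with "z"/"ze": the preposition depends on the second number, and `ngettext` only takes one count.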
The rule isn't hard to implement, it's just not expected to be a thing by framework/library creators.
But I may not have a proper perspective as Polish is my native language and I don't work with NLP (although I created a small tool to determine a word based on its inflected form with data from Wiktionary).
What's odd in Croatian is that all the verbs behave as intransitive Italian verbs, reflecting the gender of the speaker: "I fell" -> "pao sam" or "pala sam" but also "pojeo sam jabuku" vs "pojela sam jabuku" if the speaker is a woman.
I'm not a linguist, but as a native speaker of both, I always thought it had something to do with the fact that in Croatian the verb "to be" is the only auxiliary verb used to build past tenses while in Italian transitive verbs use the verb "to have".
The past participles in many languages behave very similarly to adjectives and thus I thought it might not be surprising that you'd decline them according to number and gender as you'd do with any adjective.
So if when you fall you "are fallen", you decline the "fallen" participle/adjective according to the subject, namely you in this case.
Thus instead of "have eaten" Croatian says "am eaten" (not meaning the passive voice, but just using the verb "to be" as the auxiliary instead of "to have"). As a consequence, you notice the "gendered verbs" more often than in other languages.
As to why Italian declines participles when clitics are involved, I assume it adds some redundancy so you can more easily guess what the clitic refers to.
Each time I told the WP8.1 voice assistant to "zadzwoń do mama" it felt extremely awkward to me because of the nominative case of the word "mama". It should be "zadzwoń do mamy", which is the genitive case if I'm not mistaken, and that would make the whole phrase sound natural.
I get tripped up when, for example, talking about getting presents from people and never seem to use the correct preposition - "od" or "z". I seem to use the one that would imply "came out of", not "came from".
As for "mama", there's quite a good explanation of why it looks similar (if not the same) in many languages: https://www.mother.ly/parenting/mama-is-most-universal-word
E.g. open-source: https://voikko.puimula.org/ (and its online version https://oikofix.com/contact?lang=en, where you can also play with the analysis/parsing capabilities)
Of course, the more complicated morphology makes it less straightforward than English where simple word list based approaches are sufficient.
Although, it imported a lot of technical terms from various European languages, which makes technical texts seem more legible due to the lack of compound words in these contexts.
similarly from tweets:
yiyecek = food
yiyecek miydi = will he/she eat that (2nd word is not a separate word, just a conjunction that is written separately)
göz = eye
gözcü = scout
gözlük = glasses
gözlükçü = glass salesman
gözcülük = scouting
gözlükçüler = glass salesmen
gözlükçüydüler = they were glass salesmen
gözlükçü müydüler = were they glass salesmen?
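The list above is why a plain word list doesn't work as a spellchecker for such languages: the checker has to accept a root plus a chain of suffixes. Below is a toy sketch that splits the single-word forms above into root and suffixes. It is a stand-in for illustration only: the root and suffix inventory is just what appears in this list, it ignores vowel harmony and suffix ordering, and it will happily accept nonsense chains, so it is nowhere near real finite-state morphology. The name `segmentations` is hypothetical.

```python
ROOTS = ("göz",)
SUFFIXES = ("cü", "çü", "lük", "ler", "ydü")


def segmentations(word: str) -> list[list[str]]:
    """Return every way to read word as a known root plus known suffixes."""
    results: list[list[str]] = []

    def strip_suffixes(rest: str, parts: list[str]) -> None:
        if not rest:
            results.append(parts)
            return
        for suffix in SUFFIXES:
            if rest.startswith(suffix):
                strip_suffixes(rest[len(suffix):], parts + [suffix])

    for root in ROOTS:
        if word.startswith(root):
            strip_suffixes(word[len(root):], [root])
    return results


if __name__ == "__main__":
    for word in ("gözcülük", "gözlükçüler", "gözlükçüydüler", "gözx"):
        print(word, segmentations(word))
```

A word is "known" if it has at least one segmentation, so five suffixes already cover far more forms than could sensibly be listed one by one.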
also an all time classic:
çekoslovakyalılaştıramadıklarımızdan mısınız = are you one of those people whom we tried unsuccessfully to turn into a Czechoslovakian citizen?
So the opinions of linguists aside, the languages that make up the Ural-Altaic family are not that far apart from each other.
Though Romance/Germanic languages (and even English's ancestors) have their quirks, they're not at Finno-Ugric levels
(Not that English has no quirks - especially in pronunciation - but it is "easy" to deal with most of the weird exceptions)
This is somewhat beside the point, but although in everyday speech "tieto" usually means knowledge in the sense of knowing a fact or skill, it can also be taken as "information", which seems more likely to have been the original intent of the term.
Nevertheless, pretty much every elementary computing course in Finnish begins by stating that "the computer doesn't actually know anything". :)
But seriously, the problems outlined reminded me very much of Japanese and Korean, just turned up to 11.
Japanese has endless numbers of homonyms, and at least in theory words can go on and on and on with added conjugations. They have a lot of compound nouns, often built from abbreviations of the words compounded.
The author mentions that these compounds are a problem in Finnish NLP because they explode the size of the vocabulary.
Written Japanese does not contain any word boundaries at all. For NLP tasks the text is split on morphemes, which helps against exploding vocabularies but also separates the parts into their own units of meaning.
For written Japanese you have characters which can guide you in meaning for a lot of compound words, but they blow up your character space. That is not the case for Finnish, but I can't really decide if that is an advantage or disadvantage.
Also, a spellchecker and a chatbot would rely on completely different techniques, so you can have one without the other. Japanese doesn't even have spellcheckers, but they have a ton of other tools that millions of people rely on every day for writing faster and better text, like kana-kanji conversion and word suggestions.
At the same time, I do wonder if part of the problem with NLP algorithms and Finnish isn't so much its complexity but the fact that there is very little Finnish data to train on.
Example compound word: Schleifmaschinenverleih (word boundaries: Schleif'maschinen'verleih)
You begin with "Schl" which might suggest "Schleifer", then you delete "er", add "m", and if you're lucky, you get "Schleifmaschine".
"Schleifen" means to grind. "Maschine" means machine. So "Schleifmaschine" is grinding machine, and the disappearance of "en" is a grammatical artifact. "Verleih" means rental (as in a company that does rentals).
It's actually really frustrating. I absolutely love swiping, but it doesn't feel like it works very well with Swedish.
And I would guess this isn't even the proper way to handle German. The long words are just sequences of words not split by spaces when writing. In a parallel universe the German writing system would use spaces and the language wouldn't need to change a bit.
Words are features of writing systems, not of languages.
I still prefer regular keyboards, though (desktop and laptop) -- not sure if that's related.
Okay, I'll see myself out now.