Hacker News
Japanese Grammar in EBNF notation (learnjapanese.best)
135 points by sova 7 months ago | 54 comments



Their grammar lacks a crucial component: the recursive structure of natural languages, which was originally pointed out by Chomsky et al.

Here's how I (Japanese native) would do it:

  SENTENCE = S 。
  S = BUN* verb | BUN* NP be-verb
  BUN = NP pp | adv  (pp = postposition, or 助詞) 
  NP = noun | adj NP | S NP
Warning: this is a highly simplistic version. But this way, things like "今日 築地 で 寿司 を 食った 。 (Today Tsukiji [place-mark] sushi [object-mark] ate .)" could be parsed as follows:

  (S (BUN (adv 今日))
     (BUN (NP (noun 築地)) (pp で))
     (BUN (NP (noun 寿司)) (pp を))
     (verb 食った)) 。
Then, this can be further extended to a compound sentence: "築地 で 食った 寿司 は うまい 寿司 だった 。 (Tsukiji [place-mark] ate sushi [subject-mark] good sushi was .)"

  (S (BUN (NP (S (BUN (NP (noun 築地)) (pp で))
                 (verb 食った))
              (NP (noun 寿司))) (pp は))
     (NP (adj うまい)
         (NP (noun 寿司)))
     (be-verb だった)) 。
p.s. Yoda has nothing to do with this grammar.
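
For anyone who wants to try the toy grammar above, here is a minimal sketch in Python. It assumes the words are already segmented and tagged by hand (the tag names mirror the grammar; the function name `parse_all` and the nested-tuple tree format are illustrative choices, not something from the comment), and the sentence-final 。 is left out of the input. It enumerates every tree the four rules allow for a span of tokens.

```python
from functools import lru_cache

def parse_all(tagged):
    """Return every parse tree of a list of (word, tag) pairs under the toy grammar."""
    toks = tuple(tagged)

    def tag(i):
        return toks[i][1]

    def leaf(i):
        return (toks[i][1], toks[i][0])

    @lru_cache(maxsize=None)
    def np(i, j):
        # NP = noun | adj NP | S NP
        out = []
        if j - i == 1 and tag(i) == "noun":
            out.append(("NP", leaf(i)))
        if j - i >= 2 and tag(i) == "adj":
            out += [("NP", leaf(i), rest) for rest in np(i + 1, j)]
        for k in range(i + 1, j):
            out += [("NP", clause, head) for clause in s(i, k) for head in np(k, j)]
        return tuple(out)

    @lru_cache(maxsize=None)
    def bun(i, j):
        # BUN = NP pp | adv
        out = []
        if j - i == 1 and tag(i) == "adv":
            out.append(("BUN", leaf(i)))
        if j - i >= 2 and tag(j - 1) == "pp":
            out += [("BUN", head, leaf(j - 1)) for head in np(i, j - 1)]
        return tuple(out)

    @lru_cache(maxsize=None)
    def buns(i, j):
        # Every way to split tokens[i:j] into zero or more BUNs.
        if i == j:
            return ((),)
        return tuple((first,) + rest
                     for k in range(i + 1, j + 1)
                     for first in bun(i, k)
                     for rest in buns(k, j))

    @lru_cache(maxsize=None)
    def s(i, j):
        # S = BUN* verb | BUN* NP be-verb
        out = []
        if j > i and tag(j - 1) == "verb":
            out += [("S",) + front + (leaf(j - 1),) for front in buns(i, j - 1)]
        if j > i and tag(j - 1) == "be-verb":
            for k in range(i, j - 1):
                out += [("S",) + front + (head, leaf(j - 1))
                        for front in buns(i, k) for head in np(k, j - 1)]
        return tuple(out)

    return list(s(0, len(toks)))

sushi = [("築地", "noun"), ("で", "pp"), ("食った", "verb"), ("寿司", "noun"),
         ("は", "pp"), ("うまい", "adj"), ("寿司", "noun"), ("だった", "be-verb")]
for tree in parse_all(sushi):
    print(tree)
```

Because every rule consumes a strictly shorter span, the recursive `NP = S NP` rule terminates without any special handling. Note that the grammar as written is ambiguous for the second example: besides the nested tree shown above, this sketch also finds a reading where 築地 で attaches to the main clause and only 食った modifies 寿司.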


My (non-native) criticism :-)

The be-verb is optional in S. "S = NP" is completely valid grammatically. Also "adj = BUN* verb | BUN* NP be-verb" if you allow "i-adjectives" to be classed as verb phrases (as they were historically) and "na-adjectives" to be "noun-phrase ni aru". So you would actually get:

  SENTENCE = VP | NP
  VP = BUN* verb | BUN* NP be-verb
  BUN = NP pp | adv
  NP = VP? noun
I'm also not sure about the fact that "adv NP be-verb" is allowed by this grammar...

Edit: spelling


Yes, that’s my criticism as well: a single adj category has no meaning because there are two kinds of adjectives in Japanese, one with verbal traits, the other acting more like nouns. The difference can and needs to be addressed in a grammar.

The grandparent's point about recursive rule(s) is valid, however, and their ease is one of my favorite things in Japanese grammar.

Anyway, the problem seems easy at first, but it’s totally possible to waste hours perfecting a grammar that still doesn’t capture frequently occurring constructions (been there, done that).


The original article mentions something that I wish I had known when I began learning Japanese ~20 years ago: there aren’t really adjectives in Japanese, just nouns and verbs which have a few different inflection rules. Generally, any noun or verb (phrase) can modify another so there’s no need to create a separate category called adjective.


This is substantially true of English as well; we can construct a grammatically valid phrase such as "the sleepy gorillaesque lamppost".

With, of course, some exceptions. I'm confident there are exceptions in Japanese as well, being as exceptions are the rule in natural languages.


You can think of them as verb-like adjectives and noun-like adjectives, or you can think of them as adjective-like verbs and adjective-like nouns, but it really makes no difference one way or the other, and either way they are going to be their own separate categories.


... except for the fact that 形容詞 (i-adjectives) and 形容動詞 (na-adjectives) are used differently from 名詞 (nouns) and 動詞 (verbs). So there needs to be some distinction at some point, even if at the high level, 名詞 and 形容動詞 are similarish, and 形容詞 and 動詞 are similarish. But that doesn't go beyond -ish, and it's important to know.


Thanks. Using Lisp or a list structure helps quite a bit.


I fail to see how your example goes beyond the "boring" version.

モンタナにいるひとはうつくしいひとだ。

This sentence is accepted by the parser shown in the article and is conceptually the same as 「築地 で 食った 寿司 は うまい 寿司 だった 。」 (save for the use of past tense, which, for this purpose, is irrelevant).


The grammar in the article would parse your example as

  (S (JAR (noun-no モンタナ) (LID に))
     (JAR (M (verb-u いる)) (noun-no ひと) (LID は))
     (SFF (M (verb-i うつくしい)) (noun-no ひと) (da だ)))
Note how Montana is at the same level of nesting as everything else, rather than being part of the modifier describing the person.

The parser just turns every sentence into a flat list. Imagine a parser for a programming language that parses every single line, but doesn't group lines in the same function, loop, conditional etc. together. Unless the language is assembly, you wouldn't be able to get away with that.


Very good observation. I didn't pay attention to nesting. It didn't even occur to me that this interpretation of the sentence was also possible. (And as far as I can tell it is a perfectly valid interpretation too!)

It all comes down to the precedence levels of the LID particles (に, は, で etc.), doesn't it? I'm not sure if those are well-defined, but I think it would feel more natural for a native speaker to evaluate に before は in nearly all cases. The order between で and は seems more elusive. I'll definitely bring this up in our conversation group next time, though I'm sure I'll get some funny looks for it.


I don't see this as a case of precedence of に vs は but rather as a case of に being associated with いる, and that without it, いる人 is incomplete.


いる人 by itself means "a person who exists". Not something that would come up in everyday speech, but the grammar checks out, I think. Might sound more natural to replace いる人 with 生きてる人 though.


I believe the nesting you wish can be realized by an addition to the Modifier rule, allowing modifiers to be complete Jars:

    M = Jar | verb-u | verb-i | noun-na na | noun-no no


My algorithm ( https://ichi.moe/ ) assumes almost nothing about the grammar and simply divides the sentences into the most valuable chunks (based on the word's length and how common it is), and then some sequences get bonus scores (for example noun followed by a particle). It performs surprisingly well in real life cases (where strict grammar is unlikely to be followed). I think it's the best algorithm that's currently available.

Example text from the blog post:

https://ichi.moe/cl/qr/?q=%E3%81%B2%E3%81%A8%E3%81%8C%E3%81%...
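
To make the scoring idea concrete, here is a minimal sketch in Python. It is not ichi.moe's actual code: the five-word lexicon, the `len(word) ** 2 * commonness` score, and the flat noun-then-particle bonus are all made-up assumptions for illustration. It uses dynamic programming over character positions, keeping the best-scoring segmentation for each (position, tag of last word) pair so that the sequence bonus can be applied.

```python
# Hypothetical toy lexicon: word -> (tag, commonness weight).
LEXICON = {
    "すもも": ("noun", 1),
    "もも": ("noun", 1),
    "うち": ("noun", 1),
    "も": ("particle", 1),
    "の": ("particle", 1),
}
BONUS = 2  # assumed bonus for a noun immediately followed by a particle

def segment(text):
    """Return the highest-scoring split of text into lexicon words, or None."""
    # best[i] maps the tag of the last word ending at position i
    # to the best (score, words) pair found so far.
    best = [dict() for _ in range(len(text) + 1)]
    best[0][None] = (0, [])
    for i in range(len(text)):
        for prev_tag, (score, words) in best[i].items():
            for word, (tag, commonness) in LEXICON.items():
                if not text.startswith(word, i):
                    continue
                gain = len(word) ** 2 * commonness  # longer, commoner words score higher
                if prev_tag == "noun" and tag == "particle":
                    gain += BONUS
                j = i + len(word)
                if tag not in best[j] or score + gain > best[j][tag][0]:
                    best[j][tag] = (score + gain, words + [word])
    if not best[-1]:
        return None  # no way to cover the whole text with lexicon words
    return max(best[-1].values(), key=lambda cand: cand[0])[1]

print(segment("すもももももももものうち"))
```

With these particular weights the noun-particle bonus is what tips the result toward すもも / も / もも / も / もも / の / うち rather than a run of three もも; with a bonus of 0 the longer-words-only split would win instead.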


What you are doing is essentially different: you are doing word sequencing and possibly part-of-speech tagging from a character stream. What is being done in this post is parsing, which imposes a tree-like structure on the input sequence. These two techniques complement each other, and it might be beneficial to first do the word sequencing (the Viterbi algorithm is a good choice) and parsing after that.


...However, for real Japanese from the wild there would need to be many many rules, and any slight deviations from the rules would not result in a full parse.

Much like the rest of biological systems, natural language "laws" and formalisms are more or less guaranteed to have an "except when..." clause after them.

I tried to think of a recursive "except when" to the above, but couldn't come up with anything appropriately witty.


... Except when the formalism is so powerful as to be consistent with anything.


I have a small town in Italy with a barber problem I’d like to tell you all about...


... except when it's not.


Context-free grammars were originally invented by linguists to describe sentence structure. Ultimately the field moved away from them, but they are still the first model of sentence structure taught in introductory linguistics classes (at least the one I took).


We know theoretically that language in general is non-context-free. However, proven examples of non-context-free-ness are rare. The field has not really moved away from CFGs as the basis for formalisms (otherwise we can't really parse anything), but the complexity of attributes that need to be passed around to account for even basic phenomena (agreement, question or relative-clause formation, etc.) is such that CFG trees become very complex. Instead people (HPSG and CCG people, for instance) now prefer to use lexicalist models with everything stuffed inside lexical units with complex bottom-up derivation/unification rules.


Imagine wanting to learn Japanese and learning compiler theory just to learn Japanese more easily.


A bit off-topic, but you mean parser. Compiler is so much more than parser that I consider equating them harmful.


This is not terribly dissimilar to the kind of language LARPing that you encounter regularly in Japanese learning communities.


Not the worst decision I ever made to be honest.


I wonder if https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal... has a Japanese equivalent, or perhaps just 100% hiragana will easily confuse this, thus making context-dependency a necessity.

The content is good, but unfortunately if I came across this domain name in a search engine result page I would think it was yet more SEO spam --- over two decades of Internet use has made me wary of domains containing words like "best" in them. (For similar examples, think of domains like best-antivirus-2020.online --- sounds like a fake AV scam.) IMHO the proliferation of such TLDs has only made things easier for phishing and such.


> I wonder if https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal.... has a Japanese equivalent, or perhaps just 100% hiragana will easily confuse this, thus making context-dependency a necessity.

I'm not aware of Japanese equivalents (probably due to my lack of Japanese knowledge), but the Gyeongsang dialect of the Korean language allows a sentence that contains only a single syllable 가 (ga) and thus can only be distinguished by intonation. "가가 가가가" is a commonly cited example; it is pronounced (denoted here using X-SAMPA) and analyzed as follows:

    가  ka:     The kid (short for "그 아", itself short for "그 아이")
    가  ga      [topic marker]

    가  ka      The family name "Ga"
    가  <R>ga   A suffix for family (collective, unlike "씨" for individuals; e.g. "김가" for Kims)
    가  <F>ga   Copula, weak question


The dialect in Toyama, Japan has the famous "かか、かーかーかかーか" which decomposes/translates to:

  かか、 mother (お母さん in standard Japanese)
  かー this (これ)
  かー this way (こう)
  かかー write (書く)
  か ? (か)
In proper English, that would be "Mom, is it spelled this way?"


For repetitive sounds, there used to be a fun English example on Wikipedia:

> Ted and Ed edited it.

I'm vaguely aware of two meant for Japanese, though since I know no Japanese, there are almost certainly errors in my report here:

> "Let's cover Eastern Europe", to-o-o-o-o-o-o-o-o-o.

> "Two chickens in the front yard and two in the back yard", niwaniwaniwaniwaniwaniwatori.


I don't know if this is a widely used description, but there is a subreddit for this kind of thing, called "word avalanches". Some are purer than others.

https://www.reddit.com/r/WordAvalanches/


It is indeed SEO spam, although atypically well written and informative. At the bottom of the post, they encourage you to pay $2/day (!!) for their learning service hosted at a real .com domain.


The spam pattern becomes clearer if you look at the posting history for articles from this site: https://news.ycombinator.com/from?site=learnjapanese.best It's a pretty consistent pattern of once every couple of days.

If you look at the post history for the spammer's main account, they try to disguise their spamming by posting a few miscellaneous Wikipedia articles and the like between posting their site: https://news.ycombinator.com/submitted?id=sova

You can also see them experimenting with spamming their other sites as well; the poster is always the same individual.

https://news.ycombinator.com/from?site=japanesecomplete.com

https://news.ycombinator.com/from?site=learnjapanesebest.wor...


There are, but I could not find many details about them in English. Some of the well-known ones are:

すもももももももものうち or ははははははははのははははははとわらった

Please see MeCab (https://en.wikipedia.org/wiki/MeCab) segmentation output below:

http://www.edrdg.org/~jwb/cgi-bin/mecabdisp?sent=%E3%81%99%E...

http://www.edrdg.org/~jwb/cgi-bin/mecabdisp?sent=%E3%81%AF%E...


MeCab fails miserably on the second. It doesn't find the hidden "母は", but it's hard to blame it.


The German Wikipedia page for buffalo has some examples in other languages:

https://de.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal...

In Japanese: 子子子子子子子子子子子子

In Swedish: Bar barbar-bar-barbar bar bar barbar-bar-barbar

In German: Weichen Weichen weichen Weichen, weichen Weichen weichen Weichen


子子子子子子子子子子子子 is a different category: in Japanese, Chinese characters usually have multiple possible readings. In the case of 子, it can be read as ko, ne, shi, or ji. 子子子子子子子子子子子子 is a word play that switches between them. It reads neko(no)kokoneko shishi(no)kokojishi. Note the no are omitted, which would be unusual here (although there are cases of all Chinese character sentences, but they are plain Chinese read as Japanese, with Japanese grammar made up on the spot). The actual correct way to write that sentence is 猫の子仔猫、獅子の子仔獅子.


The closest thing is にわにはにわにわとりがいる (庭には二羽にわとりがいる。)


> if I came across this domain name in a search engine result page I would think it was yet more SEO spam

Their main landing page doesn't inspire much confidence either: https://japanesecomplete.com/purchase The combination of layout and font gives that sketchy feel. I'm sure the content itself is great, but I think it would benefit from a more carefully constructed site along with some sort of trial (à la Wanikani), since almost every resource claims to be the panacea you've been looking for.


This is awesome! I've always wanted to try this. The only real complaint I have is that "da" is not actually a copula in the strictest sense. It's a contraction of "de aru". Similarly "na" is not a modifier. It's a contraction of "ni aru". "aru" is the verb which is the closest you get to a copula in Japanese - it means "it exists" for non-animate noun-phrases.

So if you say "sakana da", it does mean "It is fish", but so does just "sakana". The copula is implied. The "da" is completely optional and is actually only added for emphasis -- the literal translation is kind of like "That it is fish exists". In literary Japanese you would say "sakana de aru", "de" being the particle that links a verb the the means with which the verb is executed. For example "basu de iku" means "will go by bus" -- bus is the means by which we will go. In "sakana de aru" or "sakana da" we are basically saying that "fish" is the means of its existence.

The "na" modifier is also interesting. It is really "ni aru" where "ni" is essentially the "direction" in which something exists. "Something like a fish" would be "sakana no you". If you want to say "It is a fragrance something like a fish" you could say "sakana no you na kaori". Although I'm not aware of any modern Japanese that would express it like this, this is equivalent to "sakana no you ni aru kaori" -- "It is a fragrance that exists in the direction like a fish". Hopefully you can understand.

The interesting part of this is that adding "ni aru" to the end of a noun phrase just turns it into a verb phrase. And the even more interesting bit is that the only thing that can modify a noun phrase is a verb phrase.

But, you may have heard of "i-adjectives" -- these are adjectives that end in i. In actuality, these are not adjectives! They are verb phrases! So the word "cute" is "kawaii". However, the actual word is "kawai" and the inflection is "i". That's why when you want to say "not cute" it becomes "kawaiku nai" -- the "i" turns into "ku" because you are inflecting a verb.

This in turn is why you modify nouns directly with "i-adjectives". "kawai sakana", or "cute fish". Other adjectives are actually noun phrases in Japanese. "yumei na sakana" or "famous fish". This is, again, exactly the same as "yumei ni aru sakana" -- "The fish exists in the direction of fame".

So the rules are even simpler than presented in this blog post.

By the way, for anyone trying to learn Japanese and who wants to go beyond phrase-book level: learn plain form first and polite form later (if ever). Japanese makes absolutely no sense if you learn polite form first. It's incredibly logical (even the polite form extensions) if you start with plain form.


Another one of your posts I remember learning a lot from: https://news.ycombinator.com/item?id=13906535

Do you have any recommendations for material that teaches Japanese in this manner? I've found few resources that actually go into etymology like you have shown.


Although not taught in this manner, I learned all of my basic grammar from http://www.guidetojapanese.org/learn/grammar (Tae Kim's Guide to Japanese Grammar). I've seen a lot of textbooks, but for me this was by far the best. There is a famous Japanese grammar book that's written in Japanese and starts out with simple grammar to kind of bootstrap you, but I can't remember what it's called.

However, most of the stuff like the above that I learned actually comes from an NHK programme, "Kotoba man". It went into the history of vocabulary and grammar, and after I watched as many episodes as I could find, the structure of the language really started to make sense to me.

I think part of the problem is that most prescriptive grammars for language are subtly incorrect. For example, what part of language is blue in "I am blue"? What part of language is it in "Blue is my favorite colour"? If they are different parts of language, what is the difference in meaning between the two? What if you said, "I am painting"? Or if you said "Painting is my favourite hobby"? What is the difference in meaning between the 2 uses (if any)? Is there a difference in the meaning of "am" between "I am blue" and "I am painting"? I think if you follow the prescriptive grammar of English, it will force you to answer the questions in different ways than your intuitive (internal) ideas. Or at least it did for me. Studying that sort of sentence in English helped me to study similar sentences in Japanese and puzzle out similar insights (or at least as imagined by me). YMMV ;-)


>Kotoba man

My google-fu isn't turning up anything with this title. Do you happen to have a link to the page?


Sorry. It's an old NHK TV programme (I should have been more clear). It was 5 or 10 minutes long and explored obscure Japanese vocabulary and grammar that even most Japanese people don't know.


BTW, the NHK web site has some interesting Japanese language resources. https://www.nhk.or.jp/bunken/research/kotoba/index.html


Not GP, but I was in that thread. The most extensive non-Japanese book I know of is Grammaire japonaise systématique by Reiko Shimamori. As you may guess from the title, it's in French.


>As you may guess from the title, it's in French.

Unfortunately that would leave me with the issue of now learning 2 languages.

Interestingly enough the best (and only) resource I've been able to find for content like GP is https://www.japanesewithanime.com/ which – far from what I originally imagined given the domain name – actually has in-depth explanations of both grammar and etymology (see: [1], [2]) replete with references to research articles in JP linguistics.

[1] https://www.japanesewithanime.com/2019/12/rentaikei.html

[2] https://www.japanesewithanime.com/2019/12/kanou-doushi.html#...


Not the OP, but my own experience was that I finally started learning Japanese for real when I stumbled upon AJATT and started using some of the techniques he recommends. In short: learning a language is more like learning a martial art than learning mathematics. Repetition, mimicry and passive exposure are keys to mastery. Textbooks and pedagogical methods can provide interesting cognitive exercise but are very inefficient and ineffective at developing native-level language skill.


You have a few typos that make your otherwise interesting post very hard to follow. With great fear of introducing new typos, I've added corrections in square brackets below, as well as comments.

> In literary Japanese you would say "sakana de aru", "de" being the particle that links a verb [with] the means with which the verb is executed.

´de´ has a few more uses than just the instrumental one you explain in that sentence. The instrumental shows what instrument (noun) is used to do an action (verb). This ´de´ is often translated as "using", as in "I ate sushi using chopsticks". The somewhat broader interpretation of ´de´ you present kind of makes sense, but I'm not clear why "de aru" is more appropriate/meaningful than "wo aru" (other than "because that is how it is").

> However, the actual word is "kawai" and the inflection is "I".

I disagree. The word is kawaii. The word stem is kawai. But I think the logic holds either way.

> This in turn is why you modify nouns directly with "i-adjectives". "kawai[i] sakana", or "cute fish".

It is "kawaii sakana" with two I's.

> That's why when you want to say "not cute" it becomes "kawaiku nai" -- the "i" turns into "ku" because you are inflecting a verb.

But inflecting a verb to its negative form means turning a "u" into an "a" or simply removing the "ru" ending: "aruku -> arukanai", "taberu -> tabenai". Saying that the "i" turns into a "ku" because you are inflecting a verb has no more explanatory power than saying that the change is because you are inflecting an adjective.

I think the argument for calling i-adjectives verb phrases is that they have tense (the same tenses as regular verbs).

The argument against is that you can say "kawaii de aru", but you cannot add "de aru" to a verb phrase.
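
The three negative patterns mentioned above can be put side by side in a few lines of Python. This is only a sketch over romaji strings: the function name `negative` and the class labels are illustrative, the caller has to supply the word class, and it deliberately ignores everything outside the quoted examples (verbs ending in a bare vowel plus "u" or in "tsu", irregular verbs, and so on).

```python
def negative(word, kind):
    """Plain negative of a romaji word, for the three patterns discussed above."""
    if kind == "godan":
        # aruku -> arukanai: the final "u" becomes "a", then "nai"
        return word[:-1] + "anai"
    if kind == "ichidan":
        # taberu -> tabenai: drop the "ru" ending, then "nai"
        return word[:-2] + "nai"
    if kind == "i-adjective":
        # kawaii -> kawaiku nai: the final "i" becomes "ku", then "nai"
        return word[:-1] + "ku nai"
    raise ValueError(f"unknown word class: {kind}")

for word, kind in [("aruku", "godan"), ("taberu", "ichidan"), ("kawaii", "i-adjective")]:
    print(word, "->", negative(word, kind))
```

Laid out like this, the i-adjective rule is a third stem change sitting next to the two verb rules, which is the point being argued: the "i" to "ku" change does not by itself show which category the word belongs to.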


> In actuality, these are not adjectives! They are verb phrases!

I wish more learning material acknowledged this fact. This is why they can form entire sentences by themselves. I also agree with your comment about plain form first. That’s why, for the Japanese method I’m writing for a relative, I start with those predicative adjectives, then explain how は changes the topic, etc.

Language methods should use the underlying logic and natural use of a language. I cringe each time I see something with lesson #1 beginning with "watasi ha gakusei desu" or the like, because this is teaching bad habits from day one.


Interesting. I wish it were more complete, or at least covered some negatives. But I do not want to pay up.


Japanese is quite structured, in the sense that it doesn't form a very hairy AST data structure. But, as someone who knows both Japanese and a variety of programming languages, I've always seen the "seams" of Japanese a bit differently. I've always thought of Japanese as a stack-based language, like Forth.

A refresher on Forth: in Forth, you run a lexer over text input to get a stream of tokens; but you don't run a parser (or at least not much of one) over the tokens to get an AST. Instead, each lexeme—each "word" in Forth—self-describes as either a literal or a symbol representing a function call. All words the lexer encounters are immediately "run" using the runtime. Structured programming (like defining functions and then later calling them) is enabled by having the runtime itself be a finite-state machine, that can be put in different states by the execution of certain words, such that all words that are executed in the new state are executed to different effect (e.g. the word `[` will make all words up to a matching `]` have the execution semantics of pushing their symbolic representation onto the stack, sort of like a Lisp quote; the `]` itself then captures what's on the stack and builds a function, sort of like the `defmodule` macro does in Elixir.)

Analogously, in Japanese, most words are literals, that just push themselves onto the stack; while there are two types of "active" words: grammatical particles and verbs. Each particle is a construction function, which attempts to match and consume a certain shape of existing words on the stack, pushing back a tagged structure in its place; and then verbs consume a particular set of tagged structures from the stack (varying by verb), in any order, leaving on the stack anything they didn't expect, and pushing back a representation of the structured meaning of the sentence-as-a-whole up to that point.

I think this is well-represented by showing how Japanese does quotation:

上司は「私は彼が好きじゃない!」と言った。

To transliterate and rewrite this in a Forth-like grammar, with grammatical particles as lower-case keyword symbols, verbs as upper-case bare symbols, and everything else as quoted literal words:

"Boss" :subject [ "I" :subject "him" :referent DISLIKE NEGATION ] :cons SAID.

Note that, in this particular case, the brackets are sugar: the sentence would have the same semantics without them. (They're helpful to visually disambiguate where you should mentally backtrack to, but it's clear from the fact that there are two :subject-particles that the inner verb DISLIKE is only going to capture the last-pushed one, leaving the previously-pushed one on the stack for SAID to later consume.)

One could also describe this as what a shift-reduce parser does internally, but with the reduction edges triggered as explicit command-words in the input lexeme stream, rather than being triggered implicitly by non-conflicting pattern-matches.

And this is also, as it happens, the core of any pure-functional abstract-machine bytecode ISA (e.g. Erlang's BEAM bytecode ISA.) You've got ops that push literals; ops that take patterns of stuff off the stack and push back new product-types containing those same things; and ops for calls to (maybe-built-in) functions named by a pushed symbol term.
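
The stack model described above is small enough to sketch in Python. Everything here is an illustrative assumption rather than a real Forth or a real grammar of Japanese: the function name `run`, the `FRAMES` table of which roles each verb consumes, and the representation of tagged items as `(tag, value)` tuples are all invented for the example. Literals push themselves, a particle wraps the top of the stack in a tag, and a verb pulls the most recently pushed item for each role it expects, leaving everything else in place.

```python
# Hypothetical verb frames: the tagged roles each verb consumes from the stack.
FRAMES = {
    "DISLIKE": ("subject", "referent"),
    "SAID": ("subject", "cons"),
}

def run(tokens):
    """Execute a token stream and return the final stack."""
    stack = []
    for tok in tokens:
        if tok in ("[", "]"):
            continue  # brackets are treated as pure sugar
        elif tok.startswith(":"):
            # Particle: tag whatever is on top of the stack.
            stack.append((tok[1:], stack.pop()))
        elif tok == "NEGATION":
            # Wraps the structure produced by the preceding verb.
            stack.append({"not": stack.pop()})
        elif tok in FRAMES:
            args = {}
            for role in FRAMES[tok]:
                # Search from the top so the last-pushed match wins.
                for idx in range(len(stack) - 1, -1, -1):
                    item = stack[idx]
                    if isinstance(item, tuple) and item[0] == role:
                        args[role] = stack.pop(idx)[1]
                        break
            stack.append({"verb": tok, **args})
        else:
            stack.append(tok)  # literal word
    return stack

sentence = ["Boss", ":subject", "[", "I", ":subject", "him", ":referent",
            "DISLIKE", "NEGATION", "]", ":cons", "SAID"]
print(run(sentence))
```

Running the same token list with the brackets removed gives the same final stack, which matches the claim that they are only sugar: DISLIKE takes the nearer :subject ("I") and leaves "Boss" for SAID.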


Nice analogy, I think.

I'll just point out two things odd with your sentence example though:

- はvs.が is a difficult topic, but I think 上司が would be more appropriate

- you're supposed to use keigo when talking about a superior, not only when talking to them. So it should be 仰った(おっしゃった) rather than 言った.


Very nice description of Forth BTW.



