
Japanese Grammar in EBNF notation - sova
https://learnjapanese.best/2020/01/04/japanese-grammar-in-ebnf-notation/
======
euske
Their grammar lacks the crucial component that is the recursive structure of
natural languages, which was originally pointed out by Chomsky et al.

Here's how I (Japanese native) would do it:

    
    
      SENTENCE = S 。
      S = BUN* verb | BUN* NP be-verb
      BUN = NP pp | adv  (pp = postposition, or 助詞) 
      NP = noun | adj NP | S NP
    

Warning: this is a highly simplistic version. But this way, things like "今日 築地
で 寿司 を 食った 。 (Today Tsukiji [place-mark] sushi [object-mark] ate .)" could be
parsed as follows:

    
    
      (S (BUN (adv 今日))
         (BUN (NP (noun 築地)) (pp で))
         (BUN (NP (noun 寿司)) (pp を))
         (verb 食った)) 。
    

Then, this can be further extended to a compound sentence: "築地 で 食った 寿司 は うまい
寿司 だった 。 (Tsukiji [place-mark] ate sushi [subject-mark] good sushi was .)"

    
    
      (S (BUN (NP (S (BUN (NP (noun 築地)) (pp で))
                     (verb 食った))
                  (NP (noun 寿司))) (pp は))
         (NP (adj うまい)
             (NP (noun 寿司)))
         (be-verb だった)) 。
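
The grammar above can be tried out with a small span-based parser. This is only a sketch: the `LEXICON` tags, the whitespace tokenization, and the function names `parse_all` and `parse_sentence` are assumptions made for illustration. Because `NP = S NP` makes the grammar indirectly left-recursive, the sketch works over token spans with memoization instead of naive recursive descent.

```python
from functools import lru_cache

# Toy lexicon: these part-of-speech tags are assumptions for this example only.
LEXICON = {
    "今日": "adv", "築地": "noun", "寿司": "noun",
    "で": "pp", "を": "pp", "は": "pp",
    "食った": "verb", "うまい": "adj", "だった": "be-verb",
}

def parse_all(tokens):
    """Return every S tree that covers all tokens, as nested tuples."""
    toks = tuple(tokens)

    def tag(i):
        return LEXICON.get(toks[i])

    @lru_cache(maxsize=None)
    def np(i, j):
        # NP = noun | adj NP | S NP
        trees = []
        if j - i == 1 and tag(i) == "noun":
            trees.append(("NP", ("noun", toks[i])))
        if j - i >= 2 and tag(i) == "adj":
            trees += [("NP", ("adj", toks[i]), t) for t in np(i + 1, j)]
        for k in range(i + 1, j):
            trees += [("NP", s, t) for s in sent(i, k) for t in np(k, j)]
        return tuple(trees)

    @lru_cache(maxsize=None)
    def bun(i, j):
        # BUN = NP pp | adv
        trees = []
        if j - i == 1 and tag(i) == "adv":
            trees.append(("BUN", ("adv", toks[i])))
        if j - i >= 2 and tag(j - 1) == "pp":
            trees += [("BUN", t, ("pp", toks[j - 1])) for t in np(i, j - 1)]
        return tuple(trees)

    @lru_cache(maxsize=None)
    def buns(i, j):
        # BUN*: every way to split toks[i:j] into zero or more BUNs
        if i == j:
            return ((),)
        seqs = []
        for k in range(i + 1, j + 1):
            seqs += [(b,) + rest for b in bun(i, k) for rest in buns(k, j)]
        return tuple(seqs)

    @lru_cache(maxsize=None)
    def sent(i, j):
        # S = BUN* verb | BUN* NP be-verb
        trees = []
        if j - i >= 1 and tag(j - 1) == "verb":
            last = ("verb", toks[j - 1])
            trees += [("S",) + seq + (last,) for seq in buns(i, j - 1)]
        if j - i >= 2 and tag(j - 1) == "be-verb":
            last = ("be-verb", toks[j - 1])
            for k in range(i, j - 1):
                trees += [("S",) + seq + (noun_phrase, last)
                          for seq in buns(i, k) for noun_phrase in np(k, j - 1)]
        return tuple(trees)

    return list(sent(0, len(toks)))

def parse_sentence(text):
    """SENTENCE = S 。 (tokens are assumed to be separated by spaces)."""
    tokens = text.split()
    if not tokens or tokens[-1] != "。":
        return []
    return parse_all(tokens[:-1])
```

With this sketch, the first example sentence yields exactly the tree shown above. The second one yields two trees, because 築地 で can attach either to 食った (the reading shown above) or directly to the outer だった clause, so the grammar as written is ambiguous.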
    

p.s. Yoda has nothing to do with this grammar.

~~~
mikekchar
My (non-native) criticism :-)

The be-verb is optional in S. "S = NP" is completely valid grammatically. Also
"adj = BUN* verb | BUN* NP be-verb" if you allow "i-adjectives" to be classed as
verb phrases (as they were historically) and "na-adjectives" to be "noun-
phrase ni aru". So you would actually get:

    
    
      SENTENCE = VP | NP
      VP = BUN* verb | BUN* NP be-verb
      BUN = NP pp | adv
      NP = VP? noun
    

I'm also not sure about the fact that "adv NP be-verb" is allowed by this
grammar...

Edit: spelling

~~~
tasogare
Yes, that’s my criticism as well: a single adj category has no meaning because
there are two kinds of adjectives in Japanese, one with verbal traits, the other
acting more like nouns. The difference can and needs to be addressed in a
grammar.

The grandparent point about recursive rule(s) is valid however and their
easiness is one of my favorite things in Japanese grammar.

Anyway, the problem seems easy at first but it’s totally possible to waste
hours perfecting a grammar that still doesn’t capture frequently occurring
constructions (been there done that).

~~~
jotakami
The original article mentions something that I wish I had known when I began
learning Japanese ~20 years ago: there aren’t really adjectives in Japanese,
just nouns and verbs which have a few different inflection rules. Generally,
any noun or verb (phrase) can modify another so there’s no need to create a
separate category called adjective.

~~~
samatman
This is substantially true of English as well; we can construct a
grammatically valid phrase such as "the sleepy gorillaesque lamppost".

With, of course, some exceptions. I'm confident there are exceptions in
Japanese as well, being as exceptions are the rule in natural languages.

------
Grue3
My algorithm ( [https://ichi.moe/](https://ichi.moe/) ) assumes almost nothing
about the grammar and simply divides the sentences into the most valuable
chunks (based on the word's length and how common it is), and then some
sequences get bonus scores (for example noun followed by a particle). It
performs surprisingly well in real life cases (where strict grammar is
unlikely to be followed). I think it's the best algorithm that's currently
available.
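
To make the scoring idea concrete, here is a toy sketch. It is not the actual ichi.moe implementation: the `DICTIONARY` entries, the commonness weights, the `len(word) ** 2` formula, and `PARTICLE_BONUS` are all made-up assumptions. It picks the split with the highest total score by dynamic programming and adds a bonus when a particle follows a noun.

```python
# Hypothetical mini-dictionary: word -> (part of speech, commonness weight).
DICTIONARY = {
    "ひと": ("noun", 3), "ひ": ("noun", 2), "と": ("particle", 3),
    "が": ("particle", 3), "すごい": ("adj", 2), "す": ("noun", 1),
    "ごい": ("noun", 1),
}
PARTICLE_BONUS = 5  # assumed bonus for the sequence "noun, then particle"

def word_score(word):
    """Longer and more common words are worth more (assumed formula)."""
    _, commonness = DICTIONARY[word]
    return len(word) ** 2 * commonness

def segment(text):
    """Return the highest-scoring split of text into dictionary words, or None."""
    # best[i] maps "part of speech of the last word" -> (score, words) for text[:i].
    best = [dict() for _ in range(len(text) + 1)]
    best[0][None] = (0, [])
    for start in range(len(text)):
        for prev_pos, (score, words) in list(best[start].items()):
            for end in range(start + 1, len(text) + 1):
                word = text[start:end]
                if word not in DICTIONARY:
                    continue
                pos, _ = DICTIONARY[word]
                total = score + word_score(word)
                if prev_pos == "noun" and pos == "particle":
                    total += PARTICLE_BONUS
                if pos not in best[end] or total > best[end][pos][0]:
                    best[end][pos] = (total, words + [word])
    if not best[len(text)]:
        return None
    return max(best[len(text)].values(), key=lambda item: item[0])[1]

print(segment("ひとがすごい"))  # start of the example text linked below
```

The state keeps the part of speech of the last word because the bonus depends on the previous chunk. A real system would need a full dictionary and carefully tuned scores.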

Example text from the blog post:

[https://ichi.moe/cl/qr/?q=%E3%81%B2%E3%81%A8%E3%81%8C%E3%81%...](https://ichi.moe/cl/qr/?q=%E3%81%B2%E3%81%A8%E3%81%8C%E3%81%99%E3%81%94%E3%81%84%E3%80%82%E3%81%92%E3%82%93%E3%81%8D%E3%81%AA%E3%81%B2%E3%81%A8%E3%81%8C%E3%81%84%E3%82%8B%E3%80%82%E3%83%A2%E3%83%B3%E3%82%BF%E3%83%8A%E3%81%8C%E3%81%86%E3%81%A4%E3%81%8F%E3%81%97%E3%81%84%E3%80%82%E3%81%92%E3%82%93%E3%81%8D%E3%81%AA%E3%81%B2%E3%81%A8%E3%81%8C%E3%81%86%E3%81%A4%E3%81%8F%E3%81%97%E3%81%84%E3%83%A2%E3%83%B3%E3%82%BF%E3%83%8A%E3%81%AB%E3%81%84%E3%82%8B%E3%80%82%E3%81%84%E3%82%8B%E3%81%AE%E3%81%8C%E3%81%86%E3%81%A4%E3%81%8F%E3%81%97%E3%81%84%E3%83%A2%E3%83%B3%E3%82%BF%E3%83%8A%E3%81%A0%E3%80%82&r=htr)

~~~
GolDDranks
What you are doing is essentially different: you are doing word sequencing and
possibly part-of-speech tagging from a character stream. What is being done in
this post, is parsing, which imposes a tree-like structure on the input
sequence. These two techniques complement each other, and it might be
beneficial to first do the word sequencing (the Viterbi algorithm is a good
choice) and parsing after that.

------
hprotagonist
_...However, for real Japanese from the wild there would need to be many many
rules, and any slight deviations from the rules would not result in a full
parse._

Much like the rest of biological systems, natural language "laws" and
formalisms are more or less guaranteed to have an "except when..." clause after
them.

I tried to think of a recursive "except when" to the above, but couldn't come
up with anything appropriately witty.

~~~
gizmo686
... Except when the formalism is so powerful as to be consistent with
anything.

~~~
hprotagonist
I have a small town in Italy with a barber problem I’d like to tell you all
about...

------
gizmo686
Context-free grammar was originally invented by linguists to describe
sentence structure. Ultimately the field moved away from them, but they are
still the first model of sentence structure taught in introductory linguistics
classes (at least the one I took).

~~~
macleginn
We know theoretically that language in general is non-context-free. However,
proven examples of non-context-free-ness are rare. The field has not really
moved away from CFGs as the basis for formalisms (otherwise we can't really
parse anything), but the complexity of attributes that need to be passed
around to account for even basic phenomena (agreement, question or relative-
clause formation, etc.) is such that CFG trees become very complex. Instead
people (HPSG and CCG people, for instance) now prefer to use lexicalist models
with everything stuffed inside lexical units with complex bottom-up
derivation/unification rules.

------
haecceity
Imagine wanting to learn Japanese and learning compiler theory just to learn
Japanese easier.

~~~
sanxiyn
A bit off-topic, but you mean parser. A compiler is so much more than a parser
that I consider equating them harmful.

------
userbinator
I wonder if
[https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal...](https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo)
has a Japanese equivalent, or perhaps just 100% hiragana will easily confuse
this, thus making context-dependency a necessity.

The content is good, but unfortunately if I came across this domain name in a
search engine result page I would think it was yet more SEO spam --- over two
decades of Internet use has made me wary of domains containing words like
"best" in them. (For similar examples, think of domains like best-
antivirus-2020.online --- sounds like a fake AV scam.) IMHO the proliferation
of such TLDs has only made things easier for phishing and such.

~~~
lifthrasiir
> I wonder if
> [https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal...](https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal..).
> has a Japanese equivalent, or perhaps just 100% hiragana will easily confuse
> this, thus making context-dependency a necessity.

I'm not aware of Japanese equivalents (probably due to my lack of Japanese
knowledge), but the Gyeongsang dialect of Korean language allows a sentence
that only contains a single syllable 가 (ga) and thus can only be distinguished
by intonation. "가가 가가가" is a commonly cited example and pronounced (here
denoted using X-SAMPA) and analyzed as follows:

    
    
        가  ka:     The kid (short for "그 아", itself short for "그 아이")
        가  ga      [topic marker]
    
        가  ka      The family name "Ga"
        가  <R>ga   A suffix for family (collective, unlike "씨" for individuals; e.g. "김가" for Kims)
        가  <F>ga   Copula, weak question

~~~
glandium
The dialect in Toyama, Japan has the famous "かか、かーかーかかーか" which
decomposes/translates to:

    
    
      かか、 mother (お母さん in standard Japanese)
      かー this (これ)
      かー this way (こう)
      かかー write (書く)
      か ? (か)
    

In proper English, that would be "Mom, is it spelled this way?"

~~~
thaumasiotes
For repetitive sounds, there used to be a fun English example on Wikipedia:

> Ted and Ed edited it.

I'm vaguely aware of two meant for Japanese, though since I know no Japanese,
there are almost certainly errors in my report here:

> "Let's cover Eastern Europe", to-o-o-o-o-o-o-o-o-o.

> "Two chickens in the front yard and two in the back yard",
> niwaniwaniwaniwaniwaniwatori.

~~~
tasty_freeze
I don't know if this is a widely used description, but there is a subreddit
for this kind of thing, called "word avalanches". Some are purer than others.

[https://www.reddit.com/r/WordAvalanches/](https://www.reddit.com/r/WordAvalanches/)

------
mikekchar
This is awesome! I've always wanted to try this. The only real complaint I
have is that "da" is not actually a copula in the strictest sense. It's a
contraction of "de aru". Similarly "na" is not a modifier. It's a contraction
of "ni aru". "aru" is the verb which is the closest you get to a copula in
Japanese - it means "it exists" for non-animate noun-phrases.

So if you say "sakana da", it does mean "It is fish", but so does just
"sakana". The copula is implied. The "da" is completely optional and is
actually only added for emphasis -- the literal translation is kind of like
"That it is fish exists". In literary Japanese you would say "sakana de aru",
"de" being the particle that links a verb to the means with which the verb is
executed. For example "basu de iku" means "will go by bus" -- bus is the
means by which we will go. In "sakana de aru" or "sakana da" we are basically
saying that "fish" is the means of its existence.

The "na" modifier is also interesting. It is really "ni aru" where "ni" is
essentially the "direction" in which something exists. "Something like a fish"
would be "sakana no you". If you want to say "It is a fragrance something like
a fish" you could say "sakana no you na kaori". Although I'm not aware of any
modern Japanese that would express it like this, this is equivalent to "sakana
no you ni aru kaori" -- "It is a fragrance that exists in the direction like
a fish". Hopefully you can understand.

The interesting part of this is that adding "ni aru" to the end of a noun
phrase just turns it into a verb phrase. And the even more interesting bit is
that the only thing that can modify a noun phrase is a verb phrase.

But, you may have heard of "i-adjectives" -- these are adjectives that end in
i. In actuality, these are not adjectives! They are verb phrases! So the word
"cute" is "kawaii". However, the actual word is "kawai" and the _inflection_
is "i". That's why when you want to say "not cute" it becomes "kawaiku nai"
-- the "i" turns into "ku" because you are inflecting a verb.
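
The inflection described here is mechanical enough to sketch in a few lines. This is a minimal illustration on romanized input; the function name `negate_i_adjective` is made up, and the irregular "ii" (good), which inflects from "yo", is handled as a special case that the comment above does not mention.

```python
def negate_i_adjective(romaji):
    """Turn a plain-form i-adjective (romanized) into its plain negative."""
    if romaji == "ii":
        # Irregular: "ii" (good) inflects from the older stem "yo".
        return "yoku nai"
    if not romaji.endswith("i"):
        raise ValueError(f"{romaji!r} does not end in the 'i' inflection")
    # Swap the final "i" for "ku" and append "nai".
    return romaji[:-1] + "ku nai"

print(negate_i_adjective("kawaii"))
print(negate_i_adjective("umai"))
```

The sketch cannot tell a real i-adjective from a word that merely ends in "i" (for example "kirei", which takes "na"), so a dictionary lookup would still be needed in practice.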

This in turn is why you modify nouns directly with "i-adjectives". "kawaii
sakana", or "cute fish". Other adjectives are actually noun phrases in
Japanese. "yumei na sakana" or "famous fish". This is, again, exactly the same
as "yumei ni aru sakana" -- "The fish exists in the direction of fame".

So the rules are even simpler than presented in this blog post.

By the way, for anyone trying to learn Japanese and who wants to go beyond
phrase-book level: learn plain form first and polite form later (if ever).
Japanese makes absolutely no sense if you learn polite form first. It's
incredibly logical (even the polite form extensions) if you start with plain
form.

~~~
krackers
Another one of your posts I remember learning a lot from:
[https://news.ycombinator.com/item?id=13906535](https://news.ycombinator.com/item?id=13906535)

Do you have any recommendations for material that teaches Japanese in this
manner? I've found few resources that actually go into etymology like you have
shown.

~~~
mikekchar
Although not taught in this manner, I learned all of my basic grammar from
[http://www.guidetojapanese.org/learn/grammar](http://www.guidetojapanese.org/learn/grammar)
(Tae Kim's Guide to Japanese Grammar). I've seen a lot of text books, but for
me this was by far the best. There is a famous Japanese grammar book that's
written in Japanese and starts out writing with simple grammar to kind of
bootstrap you, but I can't remember what it's called.

However, most of the stuff like the above that I learned actually comes from an
NHK programme, "Kotoba man". It went into the history of vocabulary and
grammar and after I watched as many episodes as I could find, the structure of
the language really started to make sense to me.

I think part of the problem is that most prescriptive grammars for language
are subtly incorrect. For example, what part of speech is "blue" in "I am
blue"? What part of speech is it in "Blue is my favorite colour"?
If they are different parts of speech, what is the difference in meaning
between the two? What if you said, "I am painting"? Or if you said "Painting
is my favourite hobby"? What is the difference in meaning between the 2 uses
(if any)? Is there a difference in the meaning of "am" between "I am blue" and
"I am painting"? I think if you follow the prescriptive grammar of English, it
will force you to answer the questions in different ways than your intuitive
(internal) ideas. Or at least it did for me. Studying that sort of sentence in
English helped me to study similar sentences in Japanese and puzzle out
similar insights (or at least as imagined by me). YMMV ;-)

~~~
krackers
>Kotoba man

My google-fu isn't turning up anything with this title. Do you happen to have
a link to the page?

~~~
mikekchar
Sorry. It's an old NHK TV programme (I should have been more clear). It was 5
or 10 minutes long and explored obscure Japanese vocabulary and grammar that
even most Japanese people don't know.

~~~
glandium
BTW, the NHK web site has some interesting Japanese language resources.
[https://www.nhk.or.jp/bunken/research/kotoba/index.html](https://www.nhk.or.jp/bunken/research/kotoba/index.html)

------
ngcc_hk
Interesting. I wish it were more complete, or at least had some negatives. But
I do not want to pay up.

------
derefr
Japanese _is_ quite structured, in the sense that it doesn't form a very hairy
AST data structure. But, as someone who knows both Japanese and a variety of
programming languages, I've always seen the "seams" of Japanese a bit
differently. I've always thought of Japanese as a stack-based language, like
Forth.

A refresher on Forth: in Forth, you run a lexer over text input to get a
stream of tokens; but you don't run a parser (or at least not much of one)
over the tokens to get an AST. Instead, each lexeme—each "word" in Forth—self-
describes as either a literal or a symbol representing a function call. All
words the lexer encounters are immediately "run" using the runtime. Structured
programming (like defining functions and then later calling them) is enabled
by having the runtime itself be a finite-state machine, that can be put in
different states by the execution of certain words, such that all words that
are executed in the new state are executed to different effect (e.g. the word
`[` will make all words up to a matching `]` have the execution semantics of
pushing their symbolic representation onto the stack, sort of like a Lisp
quote; the `]` itself then captures what's on the stack and builds a function,
sort of like the `defmodule` macro does in Elixir.)

Analogously, in Japanese, most words are literals, that just push themselves
onto the stack; while there are two types of "active" words: grammatical
particles and verbs. Each particle is a construction function, which attempts
to match and consume a certain shape of existing words on the stack, pushing
back a tagged structure in its place; and then verbs consume a particular set
of tagged structures from the stack (varying by verb), in any order, leaving
on the stack anything they didn't expect, and pushing back a representation of
the structured meaning of the sentence-as-a-whole up to that point.

I think this is well-represented by showing how Japanese does quotation:

上司は「私は彼が好きじゃない！」と言った。

To transliterate and rewrite this in a Forth-like grammar, with grammatical
particles as lower-case keyword symbols, verbs as upper-case bare symbols, and
everything else as quoted literal words:

"Boss" :subject [ "I" :subject "him" :referent DISLIKE NEGATION ] :cons SAID.

Note that, in this particular case, the brackets are sugar here: the sentence
would have the same semantics without them. (They're helpful to visually
disambiguate where you should mentally backtrack to, but it's clear by the
fact that there's two :subject-particles that the inner verb DISLIKE is only
going to capture the last-pushed one, leaving the previously-pushed one on the
stack for SAID to later consume.)
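
The stack-machine reading can be sketched as a tiny interpreter. Everything here is an assumption made for illustration: the `VERB_FRAMES` table, the tuple shapes pushed onto the stack, and the choice to treat NEGATION as a wrapper around the topmost item. Literals push themselves, particles re-tag the top of the stack, and verbs pull the most recently pushed items carrying the tags they expect.

```python
PARTICLES = {":subject", ":referent", ":cons"}
# Hypothetical verb frames: which tagged items each verb consumes.
VERB_FRAMES = {"DISLIKE": ("subject", "referent"), "SAID": ("subject", "cons")}

def take(stack, tag):
    """Remove and return the value of the most recently pushed item with `tag`."""
    for index in range(len(stack) - 1, -1, -1):
        if stack[index][0] == tag:
            return stack.pop(index)[1]
    raise ValueError(f"no {tag} on the stack")

def run(program):
    """Execute a whitespace-separated program and return the final stack."""
    stack = []
    for token in program.split():
        if token in ("[", "]"):
            continue  # brackets are visual sugar only
        if token.startswith('"'):
            stack.append(("word", token.strip('"')))  # literal pushes itself
        elif token in PARTICLES:
            _, value = stack.pop()                    # particle re-tags the top item
            stack.append((token[1:], value))
        elif token == "NEGATION":
            _, value = stack.pop()                    # wrap the topmost clause
            stack.append(("clause", ("NOT", value)))
        elif token in VERB_FRAMES:
            args = {tag: take(stack, tag) for tag in VERB_FRAMES[token]}
            stack.append(("clause", (token, args)))
        else:
            raise ValueError(f"unknown word: {token}")
    return stack

PROGRAM = '"Boss" :subject [ "I" :subject "him" :referent DISLIKE NEGATION ] :cons SAID'
print(run(PROGRAM))
```

Running the same program with the brackets removed leaves the same single item on the stack, which matches the point that the brackets are only a visual aid: DISLIKE takes the last-pushed subject and leaves "Boss" for SAID.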

One could also describe this as what a shift-reduce parser does internally,
but with the reduction edges triggered as explicit command-words in the input
lexeme stream, rather than being triggered implicitly by non-conflicting
pattern-matches.

And this is _also_ , as it happens, the core of any pure-functional abstract-
machine bytecode ISA (e.g. Erlang's BEAM bytecode ISA.) You've got ops that
push literals; ops that take patterns of stuff off the stack and push back new
product-types containing those same things; and ops for calls to (maybe-built-
in) functions named by a pushed symbol term.

~~~
glandium
Nice analogy, I think.

I'll just point out two things odd with your sentence example though:

- はvs.が is a difficult topic, but I think 上司が would be more appropriate

- you're supposed to use keigo when talking about a superior, not only when
talking to them. So it should be 仰った(おっしゃった) rather than 言った.

