
Fast Markov chains in ~20 lines of sh, grep, cut and Awk - 0x0f0f0f
https://0x0f0f0f.github.io/posts/2019/11/really-fast-markov-chains-in-~20-lines-of-sh-grep-cut-and-awk/
======
jstanley
> At first, mrkwords.sh will pick a random line from the model and pick the
> first word of the pair as the first word of our output message.

In my opinion, this is not the best way to do it. This will generate sentences
that start with words that can never start sentences.

The best thing to do is to define a "sentinel" to go at the start and end of a
sentence. For example, a "$" sign, or a NUL byte. Then to generate a sentence
you set the initial state to your sentinel value, and generate forwards from
there until you have reached another sentinel, at which point the sentence
ends. When building your dictionary, you prepend (and append) a sentinel value
to each sentence you add.

It might not sound like it, but this is a big improvement in the quality of
sentence generation. For example, it will work out on its own that sentences
start with capital letters and end with full stops, and it also tends to
reduce the number of obvious "fragment" sentences that get generated. An
example of a "fragment" sentence would be "own that sentences start with
capital letters and": all of the state transitions are valid, but the start
and end of the sentence are not actually start and end states.

(I acknowledge that the linked project actually does _end_ sentences at valid
end states, but failing to do so is another common mistake of the same type).
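In the thread's own sh/awk terms, the sentinel scheme might look something
like this toy sketch (not the linked mrkwords.sh; corpus.txt, model.txt and
the "^" token are all made up, and a real corpus would need a sentinel that
can never appear as a word):

```shell
#!/bin/sh
# Toy sketch of the sentinel approach. One sentence per line in the corpus;
# "^" marks the start and end of each sentence.

printf '%s\n' 'The cat sat.' 'The dog ran.' > corpus.txt

# Build the pair model, wrapping every sentence in sentinels.
awk '{
    n = split($0, w, " ")
    w[0] = "^"; w[n+1] = "^"
    for (i = 0; i <= n; i++) print w[i], w[i+1]
}' corpus.txt > model.txt

# Generate: start at the sentinel, walk the chain, stop at the next sentinel.
state="^"; out=""
while :; do
    next=$(awk -v s="$state" '$1 == s' model.txt | shuf -n 1 | cut -d' ' -f2)
    [ "$next" = "^" ] && break
    out="$out$next "
    state="$next"
done
echo "$out"
```

With this tiny corpus every generated sentence starts with a capital "The"
and ends with a full stop, because those really are the only valid start and
end transitions in the model.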

~~~
beefhash
> The best thing to do is to define a "sentinel" to go at the start and end of
> a sentence. For example, a "$" sign, or a NUL byte.

The $ sign would get in the way if people actually use it in conversation,
and maybe other things don't like the NUL byte.

Maybe '\x1F' (ASCII Unit Separator) or '\x1E' (ASCII Record Separator) are
better candidates?

~~~
korpiq
Or just upper case first letter and full stop?

~~~
dredmorbius
Because, as Mr. Obvious notes, capitalisation and stops never appear elsewhere
in sentences, etc., etc.

------
jnellis
The best read for something like this is The Practice of Programming which is
a great little book to have in general. This link to a sample chapter covers
the entire analysis and creation of a markov chain program, none of which
require many lines of code.

[https://ptgmedia.pearsoncmg.com/images/9780201615869/samplepages/020161586X.pdf](https://ptgmedia.pearsoncmg.com/images/9780201615869/samplepages/020161586X.pdf)

One thing I've noticed from playing with these types of programs is the number
of words to use as a hash. Start with two, of course, but you will quickly
reproduce the sample text once you get to only five or six prefix words.
Whereas two prefix words usually generate nonsense, the sweet spot for
believable quality is three or four words as a prefix, with five or more
reproducing the original text. The larger and more varied the sample text,
the better the results. Furthermore, only breaking words on whitespace
creates even better quality output than assuming you need to tinker with the
punctuation.
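As a rough sketch of where the prefix length comes in when building the model,
here is a small awk fragment (the variable k and the " -> " separator are
illustrative, not from the linked script) that emits every k-word prefix
together with its successor word:

```shell
#!/bin/sh
# Sketch: build "k-word prefix -> next word" lines for a configurable
# prefix length k.
k=3
pairs=$(printf '%s\n' 'a b c d e f' |
awk -v k="$k" '{
    for (i = 1; i + k <= NF; i++) {
        key = $i
        for (j = 1; j < k; j++) key = key " " $(i + j)
        print key " -> " $(i + k)
    }
}')
echo "$pairs"
```

Bumping k up is the only change needed to trade variety for coherence.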

~~~
ngrammaton
The question of how many words are used is discussed here:
[https://en.wikipedia.org/wiki/N-gram#n-gram_models](https://en.wikipedia.org/wiki/N-gram#n-gram_models)

See also
[https://en.wikipedia.org/wiki/Google_Ngram_Viewer](https://en.wikipedia.org/wiki/Google_Ngram_Viewer)
which recently popularized the term n-gram.

~~~
drongoking
> which recently popularized the term n-gram.

Eh? It's well known in Info Retrieval; it goes back at least to Salton's IR
group at Cornell in the 1970s.

~~~
ngrammaton
I know "popularized" is often used colloquially as a synonym for "introduced",
but I think I'm well within the accepted meaning of the word if I use it to
mean "made popular with or accessible to a larger audience". In fact that
meaning is closer to the dictionary definition (and the etymology) than the
colloquial sense of "introduced".

------
szemet
For big enough models this could probably be made even faster by using _look_
(binary search) instead of grep... It is better in big-O terms, though maybe
with a large constant factor due to file seeking...

[http://man7.org/linux/man-pages/man1/look.1.html](http://man7.org/linux/man-pages/man1/look.1.html)
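Concretely, the idea would be something like the following (look is not POSIX,
it ships with util-linux/bsdmainutils; model.txt here is a made-up stand-in
for the script's model file):

```shell
#!/bin/sh
# Sketch of the look(1) idea: sort the model file once, then binary-search
# it instead of grep's linear scan.
printf '%s\n' 'the cat' 'a mouse' 'the dog' > model.txt
sort model.txt > model.sorted        # look requires sorted input
matches=$(look 'the ' model.sorted)  # all lines starting with "the "
echo "$matches"
```

The sort is paid once up front; every lookup afterwards is O(log n) seeks
instead of a full pass over the file.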

------
commandersaki
I like this, it is sensible and goes against the status quo.

The usual thought is that shell scripts are hacky and you need an ostensibly
more powerful language like Python or Perl.

------
mkoryak
This is awesome because it is a real shell script that does something
interesting AND it's well documented.

I'm bookmarking this as a bash script reference.

~~~
m31415
Not being pedantic, but OP's script isn't Bash, it's POSIX sh. POSIX sh can be
executed in Bash, but there are several differences. The only non-POSIX
command I can spot in that script is shuf.

~~~
AlEinstein
Which makes it a valid reference for bash.

~~~
m31415
That's like saying a well-written C program is a good reference for C++. In
any case, writing POSIX-only shell scripts is much harder than writing Bash
scripts, which can often be much faster (e.g., one can use {1..10}, which
doesn't need an external call, instead of `seq 1 10` to iterate through a list
of numbers). For POSIX-only shell scripts, the ones used in Git [1] are some
of the best-written I've seen.

[1]:
[https://github.com/git/git/search?l=shell](https://github.com/git/git/search?l=shell)
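For example (assuming both bash and coreutils' seq are installed), the two
forms produce the same list, but only one forks an external process:

```shell
#!/bin/sh
# Bash brace expansion vs the external seq call: identical output, but the
# expansion happens inside bash with no extra process.
a=$(bash -c 'printf "%s\n" {1..10}')  # bash-only builtin expansion
b=$(seq 1 10)                         # external command, works in plain sh
echo "$a"
echo "$b"
```

In a tight loop the saved fork/exec per iteration is where the speed
difference comes from.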

------
arminiusreturns
This is super cool. I love things like this being done in gnuland.

~~~
ketralnis
What makes this gnuland?

~~~
arminiusreturns
bash, grep, and awk are all gnu tools

~~~
jandrese
grep is from AT&T Unix version 6. awk is from AT&T Unix version 7. Bash is
from Gnu but the script is actually plain old Bourne Shell from AT&T Unix
version 7.

~~~
pard68
`shuf` is not POSIX, so that would make this BASH.

~~~
shakna
It wouldn't necessarily make it Bash. That would mean using actual Bash
features. It's still non-portable sh, because it relies on an extra program
being available, but it isn't relying on Bash itself - it uses none of the
Bash expansions and so on.

This is what makes it Bash:

    file="${1:-~/.mrkdb}"

Not shuf.

~~~
kiwidrew
Actually, "${name-word}" has been part of the Bourne shell since the
beginning, and the "${name:-word}" extension is specified in recent versions
of the POSIX standard...
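The difference between the two forms is easy to see in any POSIX shell: "-"
substitutes the default only when the variable is unset, while ":-" also
substitutes when it is set but empty.

```shell
#!/bin/sh
# POSIX parameter expansion: ${v-word} vs ${v:-word}.
unset v
echo "${v-default}"    # unset: both forms substitute -> default
echo "${v:-default}"   # unset:                       -> default
v=""
echo "${v-default}"    # set but empty: plain "-" keeps it -> (empty line)
echo "${v:-default}"   # ":-" still substitutes            -> default
```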

------
incompatible
Recursively executing shell scripts? I can imagine alternative implementations
that would probably be quite a bit faster.

------
thestoicattack
Isn't this linear in the size of the vocabulary for every token you want to
generate?


