> At first, mrkwords.sh will pick a random line from the model and pick the first word of the pair as the first word of our output message.
In my opinion, this is not the best way to do it. This will generate sentences that start with words that can never start sentences.
The best thing to do is to define a "sentinel" to go at the start and end of a sentence. For example, a "$" sign, or a NUL byte. Then to generate a sentence you set the initial state to your sentinel value, and generate forwards from there until you have reached another sentinel, at which point the sentence ends. When building your dictionary, you prepend (and append) a sentinel value to each sentence you add.
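A minimal sketch of the idea in awk (first-order for brevity, i.e. one word of state rather than the script's pairs; "^" is an arbitrary sentinel assumed never to occur as a word, and corpus.txt is a placeholder for one-sentence-per-line input):

    awk '
        BEGIN { srand(); SENT = "^" }
        {
            # Treat each input line as one sentence; wrap it in sentinels.
            prev = SENT
            for (i = 1; i <= NF; i++) {
                chain[prev, ++n[prev]] = $i   # record transition prev -> $i
                prev = $i
            }
            chain[prev, ++n[prev]] = SENT     # each sentence ends at the sentinel
        }
        END {
            w = SENT
            while (1) {
                w = chain[w, int(rand() * n[w]) + 1]  # pick a random successor
                if (w == SENT) break                  # reached the end sentinel
                printf "%s ", w
            }
            print ""
        }' corpus.txt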
It might not sound like it, but this is a big improvement in the quality of sentence generation. For example, it will work out on its own that sentences start with capital letters and end with full stops, and it also tends to reduce the number of obvious "fragment" sentences that get generated. An example of a "fragment" sentence would be "own that sentences start with capital letters and" - all of the state transitions are valid, but the start and end of the sentence are not actually start and end states.
(I acknowledge that the linked project actually does end sentences at valid end states, but failing to do so is another common mistake of the same type).
The $ sign would only get in the way if used as a standalone word, but I agree. If the sentinel is ever encountered in actual content, it should be escaped somehow.
The first random state should be picked according to the start probabilities, where each word's probability is the frequency with which it begins a sentence - which is exactly equivalent to picking the first word of a random sentence.
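In shell terms the equivalence is a one-liner (assuming a placeholder corpus.txt with one sentence per line):

    # Pick a random sentence and take its first word: this samples start
    # words in proportion to how often they actually begin a sentence.
    shuf -n 1 corpus.txt | awk '{ print $1 }'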
The best read for something like this is The Practice of Programming, which is a great little book to have in general. This link to a sample chapter covers the entire analysis and creation of a Markov chain program, none of which requires many lines of code.
One thing I've noticed from playing with these types of programs is how much the number of words used as the hash key matters. Two prefix words usually generate nonsense, and by five or six you are quickly reproducing the sample text; the sweet spot for believable quality is three or four words as a prefix. The larger and more varied the sample text, the better the results. Furthermore, simply breaking words on whitespace produces better-quality output than assuming you need to tinker with the punctuation.
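For anyone who wants to experiment with the prefix length, here is a rough order-k sketch in awk (my own, not the linked script; assumes k >= 2 and a corpus of at least k words, and caps output at an arbitrary 100 words):

    awk -v k=3 '
        BEGIN { srand() }
        { for (i = 1; i <= NF; i++) words[++nw] = $i }
        END {
            # Build the model: each k-word prefix maps to its possible successors.
            for (i = 1; i + k <= nw; i++) {
                key = words[i]
                for (j = 1; j < k; j++) key = key " " words[i + j]
                suffix[key, ++count[key]] = words[i + k]
            }
            # Seed with the first k words, then emit up to 100 more.
            key = words[1]
            for (j = 2; j <= k; j++) key = key " " words[j]
            printf "%s", key
            for (o = 0; o < 100 && count[key]; o++) {
                w = suffix[key, int(rand() * count[key]) + 1]
                printf " %s", w
                sub(/^[^ ]+ /, "", key)   # slide the window: drop the oldest word,
                key = key " " w           # append the new one
            }
            print ""
        }' corpus.txt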
I know "popularized" is often used colloquially as a synonym for "introduced", but I think I'm well within the accepted meaning of the word if I use it to mean "made popular with or accessible to a larger audience". In fact that meaning is closer to the dictionary definition (and the etymology) than the colloquial sense of "introduced".
For large enough models, this could possibly be made even faster by using look (binary search) instead of grep... It is better in big-O terms, though perhaps with a large constant factor due to file seeking...
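Something along these lines (model.txt is a placeholder; look(1) requires its input to be sorted):

    sort model.txt > model.sorted
    # Binary-search for lines starting with the current prefix rather
    # than having grep scan the whole file linearly.
    look "the quick " model.sorted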
Not being pedantic, but OP's script isn't Bash, it's POSIX sh. POSIX sh can be executed in Bash, but there are several differences. The only non-POSIX command I can spot in that script is shuf.
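For what it's worth, shuf -n 1 is easy to approximate in pure POSIX with reservoir sampling in awk (my sketch, not from OP's script; note srand() seeds from the time of day, so two calls in the same second repeat):

    random_line() {
        awk 'BEGIN { srand() }
             rand() * NR < 1 { line = $0 }   # keep line i with probability 1/i
             END { print line }' "$1"
    }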
That's like saying a well-written C program is a good reference for C++. In any case, writing POSIX-only shell scripts is much harder than writing Bash scripts, which can often be much faster (e.g., one can use {1..10}, which doesn't need an external call, instead of `seq 1 10` to iterate through a list of numbers). For POSIX-only shell scripts, the ones used in Git [1] are some of the best-written I've seen.
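To make that example concrete (the second form is POSIX and still avoids the external call):

    # Bash brace expansion: no external process.
    for i in {1..10}; do echo "$i"; done

    # POSIX sh: arithmetic expansion instead of seq.
    i=1
    while [ "$i" -le 10 ]; do echo "$i"; i=$((i + 1)); done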
grep is from AT&T Unix version 6. awk is from AT&T Unix version 7. Bash is from GNU, but the script is actually plain old Bourne shell from AT&T Unix version 7.
It wouldn't necessarily make it Bash. That would mean using actual Bash features. It's still non-portable sh, because it relies on an extra program being available, but it isn't relying on Bash itself - none of the expansions and so on.
Actually, "${name-word}" has been part of the Bourne shell since the beginning, and the "${name:-word}" extension is specified in recent versions of the POSIX standard...
"AWK was significantly revised and expanded in 1985–88, resulting in the GNU AWK implementation written by Paul Rubin, Jay Fenlason, and Richard Stallman, released in 1988.[8] GNU AWK may be the most widely deployed version" AWK wikipedia entry
If you are on Linux, you are using GNU grep, not Unix grep.
Others have addressed bash... so wrong on all counts.