In my opinion, this is not the best way to do it. This will generate sentences that start with words that can never start sentences.
The best thing to do is to define a "sentinel" value to go at the start and end of each sentence - for example, a "$" sign or a NUL byte. To generate a sentence, you set the initial state to the sentinel and generate forward from there until you reach another sentinel, at which point the sentence ends. When building your dictionary, you prepend (and append) the sentinel to each sentence you add.
It might not sound like it, but this is a big improvement in the quality of sentence generation. For example, the chain works out on its own that sentences start with capital letters and end with full stops, and it also tends to reduce the number of obvious "fragment" sentences that get generated. An example of a "fragment" sentence would be "own that sentences start with capital letters and" - all of the state transitions are valid, but the start and end of the sentence are not actual start and end states.
(I acknowledge that the linked project actually does end sentences at valid end states, but failing to do so is another common mistake of the same type).
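To make this concrete, here is a minimal sketch in Python (my own illustration, not code from the linked project) of a word-level bigram chain that uses a NUL byte as the sentinel:

    import random
    from collections import defaultdict

    SENTINEL = "\0"  # NUL byte, one of the sentinel options mentioned above

    def build_chain(sentences):
        chain = defaultdict(list)
        for sentence in sentences:
            # Bracket each training sentence with sentinels so the chain
            # learns which words can legitimately start and end a sentence.
            words = [SENTINEL] + sentence.split() + [SENTINEL]
            for a, b in zip(words, words[1:]):
                chain[a].append(b)
        return chain

    def generate(chain):
        word, out = SENTINEL, []
        while True:
            word = random.choice(chain[word])
            if word == SENTINEL:  # hit an end state: the sentence is complete
                return " ".join(out)
            out.append(word)

    chain = build_chain(["The cat sat on the mat.", "The dog slept on the rug."])
    print(generate(chain))

Because generation starts from the sentinel, the first word is always one that actually began a sentence in the training text, and the loop only stops at a word that actually ended one.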
The $ sign would get in the way if people actually use it in conversation, and maybe other things won't like the NUL byte.
Maybe '\x1F' (ASCII Unit Separator) or '\x1E' (ASCII Record Separator) are better candidates?
This is strictly a toy; no one serious would use it for anything anyway.
Right, but the sentence start is taken from the first word of a random state transition pair, not the first word of a random sentence.
One thing I've noticed from playing with these kinds of programs is how much the number of prefix words used as the hash key matters. Two prefix words usually generate nonsense, while five or more will quickly reproduce the sample text verbatim; the sweet spot for believable quality is three or four words of prefix. The larger and more varied the sample text, the better the results. Furthermore, breaking words only on whitespace produces better-quality output than assuming you need to tinker with the punctuation.
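The usual shape is something like this (a rough sketch of my own, keyed on tuples of whitespace-split words; try order=2 versus order=4 on the same corpus to see the nonsense-versus-plagiarism tradeoff):

    import random
    from collections import defaultdict

    def build_chain(text, order=3):
        words = text.split()  # split on whitespace only; punctuation stays attached
        chain = defaultdict(list)
        for i in range(len(words) - order):
            key = tuple(words[i:i + order])   # 'order' consecutive words form the state
            chain[key].append(words[i + order])
        return chain

    def generate(chain, length=50):
        state = random.choice(list(chain))    # start from a random prefix
        out = list(state)
        for _ in range(length):
            choices = chain.get(state)
            if not choices:                   # dead end: prefix only occurs at the tail
                break
            nxt = random.choice(choices)
            out.append(nxt)
            state = state[1:] + (nxt,)        # slide the window forward one word
        return " ".join(out)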
See also https://en.wikipedia.org/wiki/Google_Ngram_Viewer which recently popularized the term n-gram.
Eh? It's well known in information retrieval; it goes back at least to Salton's IR group at Cornell in the 1970s.
The usual thinking is that shell scripts are hacky and that you need an ostensibly more powerful language like Python or Perl.
I'm bookmarking this as a bash script reference.
This is what makes it Bash:
If you are on Linux, you are using GNU grep, not Unix grep.
Others have addressed bash... so wrong on all counts.