Hacker News new | past | comments | ask | show | jobs | submit login
Fast Markov chains in ~20 lines of sh, grep, cut and Awk (0x0f0f0f.github.io)
369 points by 0x0f0f0f on Nov 9, 2019 | hide | past | favorite | 39 comments



> At first, mrkwords.sh will pick a random line from the model and pick the first word of the pair as the first word of our output message.

In my opinion, this is not the best way to do it. This will generate sentences that start with words that can never start sentences.

The best thing to do is to define a "sentinel" to go at the start and end of a sentence. For example, a "$" sign, or a NUL byte. Then to generate a sentence you set the initial state to your sentinel value, and generate forwards from there until you have reached another sentinel, at which point the sentence ends. When building your dictionary, you prepend (and append) a sentinel value to each sentence you add.

It might not sound like it, but this is a big improvement in the quality of sentence generation. For example, it will work out on its own that sentences start with capital letters and end with full stops, and it also tends to reduce the amount of obvious "fragment" sentences that get generated. An example of a "fragment" sentence would be, e.g. "own that sentences start with capital letters and" - all of the state transitions are valid, but the start and end of the sentence are not actually start and end states.

(I acknowledge that the linked project actually does end sentences at valid end states, but failing to do so is another common mistake of the same type).


'^' and '$' are already the regex convention for start and end of string. With a corpus already broken into sentences, this would be readily applied.


> The best thing to do is to define a "sentinel" to go at the start and end of a sentence. For example, a "$" sign, or a NUL byte.

The $ sign would get in the way if people actually use it in conersation, and maybe other things don't like the NUL byte.

Maybe '\x1F' (ASCII Unit Separator) or '\x1E' (ASCII Record Separator) are better candidates?


The $ sign would only get in the way if used as a standalone word, but I agree. If the sentinel is ever encountered in actual content, it should be escaped somehow.


Or just upper case first letter and full stop?


Because, as Mr. Obvious notes, capitalisation and stops never appear elsewhere in sentences, etc., etc.


In his dataset, all sentences start at the beginning of the line and end at it's end.

This is strictly a toy, no one serious would use this for anything anyways.


> In his dataset, all sentences start at the beginning of the line and end at it's end.

Right, but the sentence start is taken from the first word of a random state transition pair, not the first word of a random sentence.


The first random state should be picked with start probability, which are equal to the frequency of a word beginning a sentence, which is exactly equal to picking the first word of a random sentence


The best read for something like this is The Practice of Programming which is a great little book to have in general. This link to a sample chapter covers the entire analysis and creation of a markov chain program, none of which require many lines of code.

https://ptgmedia.pearsoncmg.com/images/9780201615869/samplep...

One thing I've noticed from playing with these types of programs is the number of words to use as a hash. Two to start with of course, will quickly reproduce the sample text once you get to only five or six prefix words. Where as two prefix words usually generate nonsense, the sweet spot to a believable quality is only three or four words as a prefix with five or more reproducing the original text. The larger the varied sample text, the much better the results. Furthermore, only breaking words on whitespace creates even better quality output than assuming you need to tinker with the punctuation.


The question of how many words are used is discussed here: https://en.wikipedia.org/wiki/N-gram#n-gram_models

See also https://en.wikipedia.org/wiki/Google_Ngram_Viewer which recently popularized the term n-gram.


> which recently popularized the term n-gram.

Eh? It's well known in Info Retrieval; it goes back at least to Salton's IR group at Cornell in the 1970's.


I know "popularized" is often used colloquially as a synonym for "introduced", but I think I'm well within the accepted meaning of the word if I use it to mean "made popular with or accessible to a larger audience". In fact that meaning is closer to the dictionary definition (and the etymology) than the colloquial sense of "introduced".


I also recommend everyone read PoP.


In case of huge enough models it possibly can be made even faster using look (binary search), instead of grep... It is better in O terms, with maybe some large constant factor due to file seeking...

http://man7.org/linux/man-pages/man1/look.1.html


I like this, it is sensible and goes against the status quo.

The usual thought is shell scripts are hacky and you need an ostensibly more powerful language like Python or Perl.


This is awesome because it is a real shell script that does something interesting AND it's well documented.

I'm bookmarking this as a bash script reference.


Not being pedantic, but OP's script isn't Bash, it's POSIX sh. POSIX sh can be executed in Bash, but there are several differences. The only non-POSIX command I can spot in that script is shuf.


Which makes it a valid reference for bash.


That's like saying a well-written C program is a good reference for C++. In any case, writing POSIX-only shell scripts is much harder than writing Bash scripts, which can often be much faster (e.g., one can use {1..10}, which doesn't need an external call, instead of `seq 1 10` to iterate through a list of numbers). For POSIX-only shell scripts, the ones used in Git [1] are some of the best-written I've seen.

[1]: https://github.com/git/git/search?l=shell


Not being pedantic, but * something pedantic *.


What I meant is that I'm not being pedantic for the sake of being pedantic.


This is super cool. I love things like this being done in gnuland.


What makes this gnuland?


bash, grep, awk, are all gnu tools


grep is from AT&T Unix version 6. awk is from AT&T Unix version 7. Bash is from Gnu but the script is actually plain old Bourne Shell from AT&T Unix version 7.


`shuf` is not POSIX, so that would make this BASH.


It wouldn't necessarily make it Bash. That would mean using actual Bash features. It's still non-portable sh, because it relies on an extra program being available, but it isn't relying on Bash itself - none of expansions or so on.

This is what makes it Bash:

    file="${1:-~/.mrkdb}"
Not shuf.


Actually, "${name-word}" has been part of the Bourne shell since the beginning, and the "${name:-word}" extension is specified in recent versions of the POSIX standard...


There is a free `shuf` for OpenBSD


"AWK was significantly revised and expanded in 1985–88, resulting in the GNU AWK implementation written by Paul Rubin, Jay Fenlason, and Richard Stallman, released in 1988.[8] GNU AWK may be the most widely deployed version" AWK wikipedia entry

If you are on linux, you are using GNU grep, not unix grep.

Others have addressed bash... so wrong on all counts.


None of it is specific to GNU and you can easily do it without any GNU tools on MacOS, if you've heard about that niche operating system.


Did MacOS finally get up to bash v4 or do you still have to upgrade manually?


I am doing this just fine with a base OpenBSD and this ISC licensed `shuf` counterpart:

https://github.com/ibara/shuf


I use BSD ksh, awk, and shuf compiled from

https://github.com/ibara/shuf


he's using sh not bash. Grep, sh, and awk predate gnu by a long-time. It's not like they invented these tools, merely made their own implementations.


Recursively executing shell scripts? I can imagine alternative implementations that would probably be quite a bit faster.


Isn't this linear in the size of the vocabulary for every token you want to generate?


hello everybody




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: