As a non-high-energy physicist, it's surprisingly hard! I usually do _worse_ than random chance, sometimes substantially so.
The author also has perhaps my favorite definition of a CFG: "The snarXiv is based on a context free grammar (CFG) — basically a set of rules for computer-generated mad libs."
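That "mad libs" description can be made concrete with a toy grammar expander. The snarXiv's actual grammar isn't shown here; the rule names and productions below are invented for illustration, but the mechanism (pick a random production for each nonterminal, recurse until only terminals remain) is what any CFG-based generator does.

```python
import random

# Toy grammar in the snarXiv spirit: each nonterminal (angle-bracketed name)
# maps to a list of alternative productions; plain strings are terminals.
GRAMMAR = {
    "<title>": [["<adjective>", "<object>", "in", "<theory>"]],
    "<adjective>": [["Holographic"], ["Non-Abelian"], ["Supersymmetric"]],
    "<object>": [["Instantons"], ["Black Branes"], ["Anomalies"]],
    "<theory>": [["String Theory"], ["M-Theory"], ["QCD"]],
}

def expand(symbol, rng):
    """Recursively expand a symbol into a list of terminal strings."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit as-is
    production = rng.choice(GRAMMAR[symbol])
    out = []
    for s in production:
        out.extend(expand(s, rng))
    return out

def generate_title(seed=None):
    return " ".join(expand("<title>", random.Random(seed)))
```

Every call produces a grammatical-looking title, because the structure is fixed by the rules and only the word choices are random; that's exactly the mad-libs property.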
If you’re using a small corpus with long Markov chains, you’ll end up reproducing actual strings from the corpus rather than generating novel ones. If this happens, experiment with the second parameter to the constructor for the class “MarkovGenerator.”
For the authors you are using, the corpus is too small.
Since Paul Erdos is one of the most prolific mathematicians ever, the problem is more likely that the default chain length (4) is too long for even a relatively large corpus. A chain length of 2 does much better.
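The trade-off being discussed can be sketched directly. The course's actual "MarkovGenerator" class isn't shown in this thread, so the following is a minimal word-level version where `order` plays the role of the chain-length parameter: with a high `order` on a small corpus, almost every n-gram has exactly one continuation, so the generator just replays the corpus; a lower `order` gives each state multiple continuations and produces novel text.

```python
import random

def build_model(corpus, order):
    """Map each n-gram of `order` words to the words that follow it."""
    words = corpus.split()
    model = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        model.setdefault(key, []).append(words[i + order])
    return model

def generate(corpus, order, length, seed=None):
    rng = random.Random(seed)
    words = corpus.split()
    model = build_model(corpus, order)
    state = tuple(words[:order])  # start from the beginning of the corpus
    out = list(state)
    for _ in range(length - order):
        choices = model.get(state)
        if not choices:  # dead end: this n-gram only occurs at the corpus tail
            break
        out.append(rng.choice(choices))
        state = tuple(out[-order:])
    return " ".join(out)
```

With `order=4`, most 4-grams in even a large single-author corpus occur once, so `model[state]` has one entry and the walk is deterministic; with `order=2`, common word pairs branch many ways, which is why a chain length of 2 does much better here.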
Also see http://pdos.csail.mit.edu/scigen/, an automatic CS paper generator. One of their papers was accepted to a conference, and they gave some hilarious talks there (also documented at that link).