
A kinda okay text generator - pboyd2
https://blog.ftql.pw/posts/random-text/
======
crazygringo
POS tagging is a nice touch, but I've long wondered if there's a way to
combine Markov sentence generation with actual grammatical rules -- that if
there's an opening parenthesis there has to be a closing one, dependent
clauses, etc. It would need to both grammatically parse the inputs and
produce grammar trees for the outputs that would then be filled in by Markov
chains that fit the trees...

Heck, even do it at a level larger than sentences too, so questions are
followed by answers, a line of dialog is followed by a response, paragraphs
follow a realistic distribution of lengths...

~~~
TazeTSchnitzel
Hmm. A Markov chain where you actually do some kind of search? You build up a
chain then backtrack and retry as necessary until you get something meeting
the requirements.
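
Something like this, maybe (a rough Python sketch; the toy bigram table and
the balanced-parentheses constraint are just made-up examples, and it retries
whole sentences rather than doing real token-level backtracking):

    import random

    # Toy bigram table: word -> possible next words (stand-in for a real model).
    CHAIN = {
        "the": ["cat", "dog"],
        "cat": ["(", "sat"],
        "dog": ["sat"],
        "(": ["quietly", ")"],
        "quietly": [")", "sat"],
        ")": ["sat"],
        "sat": ["."],
    }

    def balanced(tokens):
        """Constraint: every '(' must eventually be closed."""
        depth = 0
        for t in tokens:
            if t == "(":
                depth += 1
            elif t == ")":
                depth -= 1
                if depth < 0:
                    return False
        return depth == 0

    def generate(start="the", max_len=20, tries=100):
        """Build a chain; if it violates the constraint, throw it away and retry."""
        for _ in range(tries):
            tokens = [start]
            while tokens[-1] != "." and len(tokens) < max_len:
                tokens.append(random.choice(CHAIN[tokens[-1]]))
            if balanced(tokens):
                return " ".join(tokens)
        return None

    print(generate())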

~~~
sova
Sounds like getting some sort of context-free grammar out of it. There is an
LR parsing algorithm that is very fast; you could use a Markov chain to
generate and then a CFG to verify.

~~~
reificator
Or a CFG to generate at a high level and a Markov chain to fill in the details
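
Roughly like this (a toy Python sketch; the grammar, the word lists, and the
bigram counts are all invented just to show the shape of the idea):

    import random

    # Toy CFG over part-of-speech slots.
    GRAMMAR = {
        "S":  [["NP", "VP", "."]],
        "NP": [["DET", "N"], ["DET", "ADJ", "N"]],
        "VP": [["V", "NP"], ["V"]],
    }

    # Candidate words per slot.
    WORDS = {
        "DET": ["the", "a"],
        "ADJ": ["old", "quiet"],
        "N":   ["garden", "hussar"],
        "V":   ["crossed", "found"],
        ".":   ["."],
    }

    # Bigram counts (previous word -> next word -> count), as a Markov chain stores them.
    BIGRAMS = {
        "<s>": {"the": 4, "a": 2},
        "the": {"old": 2, "garden": 3, "hussar": 1},
        "old": {"garden": 2},
        "garden": {"crossed": 1, ".": 2},
        "hussar": {"found": 2, ".": 1},
        "crossed": {"the": 2, "a": 1},
        "found": {"a": 1},
    }

    def expand(symbol):
        """Expand a nonterminal into a flat list of POS slots (the CFG's job)."""
        if symbol not in GRAMMAR:
            return [symbol]
        slots = []
        for child in random.choice(GRAMMAR[symbol]):
            slots.extend(expand(child))
        return slots

    def fill(slots):
        """Fill each slot, weighting candidates by bigram counts (the chain's job)."""
        words, prev = [], "<s>"
        for slot in slots:
            candidates = WORDS[slot]
            weights = [BIGRAMS.get(prev, {}).get(w, 0) + 1 for w in candidates]  # +1 smoothing
            prev = random.choices(candidates, weights=weights)[0]
            words.append(prev)
        return " ".join(words)

    print(fill(expand("S")))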

~~~
sova
Excellent idea. Then it would actually read like language! But uh, how to
choose sentence length and which grammars you want?

------
piceas
Children's poetry makes for fun source text for these projects. In this case,
dumping a bunch of old poems generated mostly gibberish, but I found the
following examples amusing.

While here on my deathbed I try to relate My many misfortunes and miseries
great. Poor thoughtless young thing! If I recollect right, I began life in
March, on a clear frosty night; And before I could see or was half a week old,
I nearly had perished, the barn was so cold. But this chilly spring I got
pretty well over, But there wasn't a tree for miles around, They were too
frightened to stay on the ground, And moused in the stable, or played in the
clover, Or till I was weary, which seldom occurred, Ran after my tail, which I
took for a bird

The wind did blow, the cloak did fly, Like streamer long and gay, Till loop
and button failing both, At last it flew away. Then might all people well
discern The bottles he had slung, A bottle swinging at each side, As hath been
said or sung. The dogs did bark, the children screamed, Up flew the windows
all, And ev'ry soul cried out, Well done!

When Betty screaming came down stairs, The wine is left behind! Good lack!
Quoth he yet bring it me, My leathern belt likewise, In which I bear my trusty
sword When I do exercise. Now Mistress Gilpin, careful soul!

There as the mother sits all day, On business from their houses, And late at
night returning home, To cheer their babes and spouses; While you and I have
oft-times heard How men are killed and undone, By overturns from carriages, By
thieves, and fires in London. We know what risks these landsmen run, From
noblemen to tailors.

~~~
aidenn0
Lewis Carroll is actually really fun in just a pure Markov chain; usually the
diction is sufficient to put someone in the mood for Carroll, which means
nobody really expects it to make sense anyway.

~~~
adzm
One of my favorite parts of Gödel, Escher, Bach was the description of
translating Lewis Carroll's Jabberwocky into various languages -- particularly
troublesome since the original poem contains few English words to begin with.

~~~
aidenn0
I think that just highlights how hard poetry is to translate in general.
Carroll picked those words with thought to how they sound, no less than poets
who restrict themselves to English words.

Translating poetry well, where both meaning and sound are vital, can be harder
than writing the poem was in the first place.

------
karanlyons
You’re going to want to look at
[https://en.wikipedia.org/wiki/Kneser–Ney_smoothing](https://en.wikipedia.org/wiki/Kneser–Ney_smoothing)
for further improvements on an n-gram-based approach.
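
For the bigram case, the interpolated version works out to roughly this (a
quick Python sketch of the formula as I understand it from that page; the
0.75 discount is just the usual textbook default, and it hasn't been checked
against a reference implementation):

    from collections import Counter

    def kneser_ney_bigram(tokens, discount=0.75):
        """Interpolated Kneser-Ney estimate of P(word | prev) from a token list."""
        bigrams = Counter(zip(tokens, tokens[1:]))
        unigrams = Counter(tokens[:-1])               # counts of each history word
        followers = Counter(v for (v, w) in bigrams)  # distinct words following v
        histories = Counter(w for (v, w) in bigrams)  # distinct words preceding w
        total_types = len(bigrams)                    # distinct bigram types overall

        def p(word, prev):
            continuation = histories[word] / total_types
            c_prev = unigrams[prev]
            if c_prev == 0:
                return continuation                   # unseen history: pure continuation prob.
            discounted = max(bigrams[(prev, word)] - discount, 0) / c_prev
            backoff_weight = discount * followers[prev] / c_prev
            return discounted + backoff_weight * continuation

        return p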

~~~
unhammer
Or even just Stupid Backoff[1]: If you can't find anything of length N, try
N-1 but lower the probability by a fixed ratio (0.4).

[1] Page 2, equation 5 of
[https://www.aclweb.org/anthology/D07-1090.pdf#page=2](https://www.aclweb.org/anthology/D07-1090.pdf#page=2)
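
In code, that's roughly this (a minimal Python sketch of that equation,
assuming whitespace-tokenized input; not the paper's implementation):

    from collections import Counter

    def stupid_backoff(tokens, max_n=4, alpha=0.4):
        """Score(word | context): relative frequency if the full n-gram was seen,
        otherwise alpha times the score with a shorter context."""
        counts = Counter()
        for n in range(1, max_n + 1):
            counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        total = len(tokens)

        def score(word, context):
            context = tuple(context[-(max_n - 1):])
            if not context:
                return counts[(word,)] / total
            ngram = context + (word,)
            if counts[ngram] > 0:
                return counts[ngram] / counts[context]
            # Nothing of this length: back off to a shorter context, scaled by alpha.
            return alpha * score(word, context[1:])

        return score

    tokens = "the cat sat on the mat the cat slept".split()
    s = stupid_backoff(tokens)
    print(s("sat", ["the", "cat"]))    # seen: plain relative frequency (0.5)
    print(s("slept", ["the", "dog"]))  # unseen: backs off twice, 0.4 * 0.4 * 1/9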

------
hantusk
Markov chains are cool, and text generation is fun for experiments. For a more
fruitful investment of time in improving further, I would definitely recommend
looking at deep learning language models. (The courses on
[http://www.fast.ai/](http://www.fast.ai/) are a good starting place.)

------
cmroanirgo
I too played around with War and Peace... but in the Russian.

Shameless plug: My password generator that takes text inputs and generates
wordish results:
[https://cmroanirgo.github.io/wordish/](https://cmroanirgo.github.io/wordish/)

(It's much simpler and less correct than OP's... but that's the point anyhow)

The site is preconfigured to use a few different source text excerpts:

- The Happy Prince, Oscar Wilde

- Matthew 5, King James Bible

- War and Peace, Tolstoy (Russian)

- Klingon, anonymous web source

- Lorem Ipsum, lorem ipsum generator

------
donquichotte
A nice experiment! I am always a bit surprised to see that CNNs & company
produce much less coherent and natural-sounding text than the comparatively
simple Markov chain generators.

That said, I Markov-generated some text with Nietzsche's "Also sprach
Zarathustra" as a source some time ago, and it made more sense and was more
coherent than the original.

~~~
kastnerkyle
I don't think this is necessarily true. CNNs can produce incredible results,
see for example [0]. A snippet from the paper (their code is also available):

""" Prompt: The Mage, the Warrior, and the Priest

Story: A light breeze swept the ground, and carried with it still the distant
scents of dust and time-worn stone. The Warrior led the way, heaving her mass
of armour and muscle over the uneven terrain. She soon crested the last of the
low embankments, which still bore the unmistakable fingerprints of haste and
fear. She lifted herself up onto the top the rise, and looked out at the scene
before her. [...] """

RNN/LSTM type models can also produce really coherent results on the scales
shown in the blog.

Markov chains can be quite useful when you can manually create conditioning,
but some of the key power of neural methods is learning to process
conditioning and blend it in ways that may not be obvious (or would require
significant domain expertise to grok) when handcrafting. YeGoblynQueenne below
discusses a lot of the core issues for Markov-chain-based approaches.

In particular, plagiaristic sequences are incredibly common in Markovian
generation, and though there are some papers on how to deal with this [1] it
is not straightforward to decide what plagiarism really means in many
contexts. This problem also arises in neural models, but isn't nearly as
extreme due to the nature of both learning and the sampling process.
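
A crude way to see how much of this is happening with a given model (nothing
to do with the approach in [1], just a naive check): find the longest run of
generated tokens that also appears verbatim in the training text, e.g. in
Python:

    def longest_verbatim_run(generated, source):
        """Length of the longest contiguous run of tokens in `generated` that
        also appears contiguously in `source`. Naive and slow, but fine for
        eyeball-scale experiments."""
        source_text = " " + " ".join(source) + " "
        best = 0
        for i in range(len(generated)):
            run = 0
            for j in range(i + 1, len(generated) + 1):
                if " " + " ".join(generated[i:j]) + " " in source_text:
                    run = j - i
                else:
                    break
            best = max(best, run)
        return best

    src = "the cat sat on the mat and the dog slept".split()
    gen = "the dog sat on the mat and purred".split()
    print(longest_verbatim_run(gen, src))  # -> 5 ("sat on the mat and")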

One _huge_ bonus of Markov chains is that controlling exactly what you want or
using _hard_ rules is pretty easy, which is definitely NOT the case with
neural generation...

This blog post is a really nice run through of how to get decent results from
Markov chains, but don't discount neural methods either - they are getting
better all the time, if you are willing to deal with the headaches.

[0] Hierarchical Neural Story Generation,
[https://arxiv.org/abs/1805.04833](https://arxiv.org/abs/1805.04833)

[1] Max Order [http://www.flow-machines.com/maxorder/](http://www.flow-machines.com/maxorder/)

------
YeGoblynQueenne
One problem I see in the article above is that the result is evaluated only by
eyeballing, which _is_ actually a perfectly legitimate evaluation method; the
human language faculty is, after all, the only process we know of that can
correctly recognise and generate natural language expressions (and other,
automatic methods, like BLEU scores, often simply automate the eyeballing
process).

However, if you're working with statistical language modelling, you should
probably try to check the statistical properties of the resulting model! For
example, measure perplexity, cross-entropy, etc. This will give you a more
concrete, and faster, way to evaluate your model than just generating some
text and looking at it, trying to figure out where it came from. I'd suggest
the author give that a try: they seem to have dipped a toe into statistical
language modelling but be unwilling to go all the way with more advanced
approaches (e.g. the cited paper, by Kassarnig, uses "n-grams, Justeson & Katz
POS tag filter, recurrent neural networks, and latent Dirichlet allocation" -
but the article author doesn't seem to have tried any of those on their own,
let alone in combination, other than n-grams). The more advanced material can
be daunting, but it's not actually such a huge leap from Markov chains. And
the results can _sometimes_ be worth the effort.
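
For the bigram/Markov case, that evaluation can be as little as this (a
minimal Python sketch; the add-one smoothing is only there so unseen bigrams
in the held-out text don't send the log to minus infinity):

    import math
    from collections import Counter

    def bigram_perplexity(train, held_out):
        """Perplexity of `held_out` under an add-one-smoothed bigram model
        estimated from `train` (both token lists). Lower means a tighter fit."""
        bigrams = Counter(zip(train, train[1:]))
        unigrams = Counter(train)
        vocab = len(unigrams) + 1  # +1 slot for unseen words

        log_prob = 0.0
        for prev, word in zip(held_out, held_out[1:]):
            p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(held_out) - 1))

Hold out a chunk of the source text, compare models on it, and the one with
the lower perplexity is (by this metric, at least) the better fit.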

. . .

Btw, it sounds to me like the approach taken in the "N-Grams, take 2" section
basically boils down to a PCFG (a probabilistic CFG). It sounds like the
probability of the next token is calculated conditional on all preceding
_strings_ of tokens in a sentence, which is what a PCFG does (which is to say,
PCFGs are not Markov). With an N high enough, that will, indeed, copy your
text verbatim :)

. . .

The author should keep in mind that no matter what you do, Markov chains are
always going to either look completely incoherent or produce exact copies of
their training text.

In fact, that's a bit of a problem with most statistical language modelling
techniques. Even state-of-the-art systems, which can produce very grammatical
text, fail badly when it comes to producing _coherent_ text (e.g. a perfectly
reasonably structured recipe where the listed ingredients are not used in the
preparation instructions, etc.). And that's something that no statistical
measure of fit can capture, unfortunately. So you're left again with
eyeballing the end results and hoping for the best.

Edit: sorry, I'm babbling a bit. The way to use statistical evaluation
metrics, like perplexity, is to first automatically evaluate your generated
text, then eyeball it to see how the metric matches your own "sense" of the
text's grammaticality. So then, if you figure out that a low perplexity score
corresponds to more grammatical results, you favour models that produce low
perplexity, etc.

~~~
pboyd2
> they seem to have dipped a toe into statistical language modelling but be
> unwilling to go all the way with more advanced approaches

That's fair.

Thanks for your suggestions. I clearly have some studying to do.

------
jerrre
I feel that, for fooling a quick skim, getting punctuation/capitalization
correct/believable was the biggest improvement.

~~~
pboyd2
Ha... yeah, I left that bit out. There were several dead ends, and I finally
ended up with something that was stupid simple. It seemed like more trouble to
explain than it was worth.

------
jwilk
What does "cweep" mean?

~~~
pboyd2
If it helps, here's the full paragraph:

      “We’ll send the infantwy down by the swamps,” Denísov continued.
      “They’ll cweep up to the garden; you’ll wide up fwom there with the
      Cossacks”—he pointed to a spot in the forest beyond the village—“and I
      with my hussars fwom here. And at the signal shot...”

~~~
aidenn0
What are "sramps"?

~~~
jwilk
[https://en.wiktionary.org/wiki/swamp#Noun](https://en.wiktionary.org/wiki/swamp#Noun)

------
firefoxd
Your prose is too prolix.

------
honzajde
Was fun to read. Thanks!

------
sova
Very impressed and delighted with your example results.

------
huhtenberg
This is going to be used for building better link farms in 3, 2, 1...

