

Generating random text - whiskers
http://www.cs.bell-labs.com/cm/cs/pearls/sec153.html
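The linked section builds an order-k Markov text generator. A minimal word-level sketch of the idea in Python (my own illustration, not the book's code — Bentley's own implementation differs):

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each k-word prefix to the list of words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        chain[prefix].append(words[i + order])
    return chain

def generate(chain, length=20):
    """Walk the chain, picking a random successor at each step."""
    prefix = random.choice(list(chain))
    out = list(prefix)
    for _ in range(length):
        followers = chain.get(tuple(out[-len(prefix):]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

text = "the cat sat on the mat and the cat ran on the mat"
print(generate(build_chain(text, order=1), length=10))
```

Raising `order` makes the output more coherent but closer to verbatim quotation of the training text.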

======
thristian
Another practical example of Markov chaining is <http://www.x11r5.com/>, a
robot trained on a mix of IRC, Twitter and Identica content.

There's even a weekly podcast generated from news headlines:
<http://www.x11r5.com/radio/>

~~~
DieBuche
I got an awesome one: “Wtf are you doing in 2008?" obama gave them to read it
as a utility to basically throw random hacked kexts at the corner store stocks
mexican coke.”

------
NathanKP
I accidentally reinvented this algorithm many years ago as a C++ project. Here
is an example of the passages it created:

sections of ice fell through the. invectives in which he had been wondering
how. roman scales was in readiness. occasional murmur of pain that continued
to torment. desavanchers dicksen dickson dochard du chaillu duncan durand was
a. waging a war of extermination against. lively about it no snap or bite.
chairs cane sofas carved wood pillars rose pillars. skirting an acclivity
covered with woods and dotted with trees of very deep water. scratching as
though he'd tear his nails out and sharp bite out. jerked by a sudden stoppage
of the sled dogs barked. mentioned a cemetery on the south the four brilliants
of the sky great and. ranks plunging into the flames would extinguish them
beneath their mass and the rest were seen in numerous flocks hovering about
the borders of some beautiful river until it fell. fridays well i hope that we
shall overcome. emphatically an' i make free the. profitable the captains of
the sea and consequently the balloon remained.

You can see more info about it or download my source code at:

[http://experimentgarden.blogspot.com/2009/11/software-tool-f...](http://experimentgarden.blogspot.com/2009/11/software-tool-for-low-order.html)

------
IsaacL
Markov chains are always good fun to play with... a few months ago I worked on
a class project which generated Markov models from Final Fantasy SNES tracks
(<https://github.com/IsaacLewis/MidiMarkov>). I should blog about it at some
point.

I hadn't seen Shannon's algorithm before though, which looks a bit more memory
efficient than the approach I used.
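Shannon's procedure, roughly, keeps no table at all: given the last k characters, it finds occurrences of that k-gram in the raw text and emits the character following a randomly chosen one. A rough Python sketch of that idea (my own, trading CPU time for memory):

```python
import random

def shannon_step(text, seed, k):
    """Pick uniformly among all occurrences of the current k-gram
    and return the character that follows one of them."""
    followers = [text[i + k] for i in range(len(text) - k)
                 if text[i:i + k] == seed]
    return random.choice(followers) if followers else None

def shannon_generate(text, k=3, length=40):
    """Start from a random k-gram and repeatedly extend it."""
    start = random.randrange(len(text) - k)
    out = text[start:start + k]
    for _ in range(length):
        nxt = shannon_step(text, out[-k:], k)
        if nxt is None:
            break
        out += nxt
    return out

corpus = "the quick brown fox jumps over the lazy dog and the quick red fox"
print(shannon_generate(corpus, k=2, length=30))
```

Each step rescans the whole corpus, so it's slow, but the only state held is the text itself — which is presumably the memory advantage.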

------
juretriglav
I remember reading about that in "The Information", where it is described how
Claude Shannon did it.

Now I'll have to tweak my spam detection even more. Joking aside, and somebody
correct me if I'm wrong, spam filtering probably runs on a simple wordlist-type
algorithm.

What, then, is the usefulness (I define usefulness extremely broadly) of such a
generator?

~~~
imurray
If you have a good generator for text, you have a useful language model that
can be plugged into applications such as speech recognition, OCR, predictive
text entry systems and compression.

~~~
juretriglav
Could you somehow use it in reverse? What I mean is, is it possible to get a
random text generator for a certain language and then use it to determine
whether a given text is in that language or not?

~~~
imurray
Yes.

Given the string so far, see how probable it is that the generator would
generate the next character, p(x_n | x_<n). Running through the whole string
you can build up the log probability of the whole string: log p(x) = \sum_n
log p(x_n | x_<n). Comparing the log probabilities under different models
gives you a language classifier. For a first stab at the one-class problem,
compare the log probability to what the model typically assigns to strings it
randomly generates.
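As a toy illustration of that (my own sketch, not from the comment): train a smoothed character-bigram model per language, score the string under each by summing log p(x_n | x_{n-1}), and pick the higher total:

```python
import math
from collections import Counter

def train(text):
    """Character-bigram counts plus the unigram totals needed for smoothing."""
    pairs = Counter(zip(text, text[1:]))
    singles = Counter(text[:-1])
    vocab = set(text)
    return pairs, singles, vocab

def log_prob(text, model):
    """log p(x) = sum_n log p(x_n | x_{n-1}), with add-one smoothing."""
    pairs, singles, vocab = model
    v = len(vocab) + 1  # +1 leaves mass for unseen characters
    return sum(math.log((pairs[(a, b)] + 1) / (singles[a] + v))
               for a, b in zip(text, text[1:]))

english = train("the cat sat on the mat near the hat " * 20)
germanish = train("der hund und die katze und der vogel " * 20)

s = "the cat on the mat"
print("en" if log_prob(s, english) > log_prob(s, germanish) else "de")
```

Real classifiers use higher-order models and better smoothing, but even bigrams separate languages surprisingly well.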

For more on information theory, modelling and inference you might like:
<http://www.inference.phy.cam.ac.uk/mackay/itila/book.html>

------
cstavish
Bentley's collection of Programming Pearls radically changed my outlook (when
I read it as a novice programmer, I considered programming itself to be an
end, not a means to an end). I still read and re-read selections to this day,
and I always seem to learn something new each time.

Would anyone care to share books of similar quality or importance?

------
StavrosK
Is it possible to generate text by a sort of reverse LDA, where you have
topics (per sentence or per paragraph, ideally) and estimate the probability
of a word appearing in a given topic?

You could then use these topics to generate more realistic-looking text as
this ostensibly wouldn't have the wild jumps from one topic to another that
naive Markov chains have.

Has anyone done anything like this, or should I give it a shot?

~~~
mostlycarbon
That sounds possible, but gathering all the data for topic-specific training
might be somewhat maddening. The problem you get with larger groupings of
words is lack of cohesion. If you trained it at a sentence level, you might be
able to produce coherent sentences. But as you generated more sentences to
produce a paragraph, it would likely meander.

I created a kind of Mad Lib generator using CFGs. A paragraph consisted of:
[Intro] [Supporting sentence 1] [Supporting sentence 2] [Supporting sentence
3] ... [Conclusion]. All the sentences had placeholders for various nouns and
adjectives that could later be filled in programmatically, and I extended the
grammar spec to both permute sets of sentences and generate null productions
with certain probabilities.
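A toy sketch of that kind of template grammar (the rule names, seed sentences, and drop probability here are my own guesses, not the commenter's actual framework):

```python
import random

def expand(template, slots):
    """Fill [noun]-style placeholders from the slot dictionary."""
    for name, choices in slots.items():
        while f"[{name}]" in template:
            template = template.replace(f"[{name}]", random.choice(choices), 1)
    return template

def paragraph(grammar, slots):
    intro = random.choice(grammar["intro"])
    # permute the supporting sentences, dropping each with some
    # probability (a null production)
    support = [s for s in random.sample(grammar["support"], len(grammar["support"]))
               if random.random() > 0.3]
    conclusion = random.choice(grammar["conclusion"])
    return " ".join(expand(s, slots) for s in [intro, *support, conclusion])

grammar = {
    "intro": ["Every [noun] deserves a [adj] review."],
    "support": ["The [noun] is remarkably [adj].",
                "Critics call this [noun] [adj]."],
    "conclusion": ["In short: a [adj] [noun]."],
}
slots = {"noun": ["gadget", "widget"], "adj": ["sturdy", "quirky"]}
print(paragraph(grammar, slots))
```

With ~15 seed sentences per rule, permutation plus null productions multiplies the template count quickly, which is where a one-in-a-billion repeat rate could come from.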

The base sentences were created by humans, about 15 per grammar rule. A single
person could create a topic-based paragraph/grammar in less than two work
days. The chances of it creating the same template twice were about one in a
billion. Of course, the probability varied depending on how many seed
sentences were present.

If the person writing the seed sentences is literate and has passed the 6th
grade, then everything the program generates is indistinguishable from human
text.

It works marvelously.

~~~
StavrosK
That's interesting, do you have any code or examples?

My thought about topic detection is to have it learn which words go together,
and then augment the Markov chain model by some method that would weigh the
Markov chain probability with the topic probability to select the next word,
so it would at least generally stick to topic-relevant words. Perhaps you
could even select one topic (a sample sentence, really) in the beginning and
have it generate sentences based on that for the entire document.
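One crude way to sketch that weighting (my own illustration, with a made-up boost factor, not a tested method): multiply each Markov follower count by a bonus when the candidate word belongs to the topic vocabulary, then sample from the reweighted distribution:

```python
import random
from collections import defaultdict, Counter

def next_word(chain, topic_words, prev, topic_boost=3.0):
    """Sample the next word from Markov follower counts,
    boosting candidates that are in the topic vocabulary."""
    followers = chain.get(prev, Counter())
    if not followers:
        return None
    words = list(followers)
    weights = [followers[w] * (topic_boost if w in topic_words else 1.0)
               for w in words]
    return random.choices(words, weights=weights)[0]

# tiny order-1 chain built from follower counts
text = "the cat saw the dog saw the cat".split()
chain = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    chain[a][b] += 1

picks = Counter(next_word(chain, {"cat"}, "the") for _ in range(2000))
print(picks)  # "cat" should win roughly 6:1 over "dog"
```

Picking the topic vocabulary from a sample sentence up front, as suggested, would just mean fixing `topic_words` once for the whole document.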

~~~
mostlycarbon
I wish I could publish it, but my company isn't very much into open source.
It's a standard context free grammar framework modified to generate output in
a stochastic manner. So it's basically a stochastic context free grammar
(SCFG). I can go more into depth in private if you like.

The phrase for finding word pairs in text corpora is "cohort analysis". I was
on a research team that did studies of that; mostly finding them, not
generating anything with them.

It's an interesting subject area.

~~~
StavrosK
That gives me a good idea for further research, thank you.

------
caustic
"Dive into Python" contains a chapter on XML processing[1] that uses a grammar
defined in XML to generate random English passages. A very funny and
informative read.

[1] <http://diveintopython.org/xml_processing/>

------
mirkules
Markov chains are really handy. You can also use Hidden Markov Models to do
speech recognition. <http://en.wikipedia.org/wiki/Speech_recognition>

------
leviathan
Shameless plug: this is very similar to what I use on <http://wordum.net/>,
but instead of letters I use whole words to generate the text.

~~~
MasterScrat
That's exactly the idea that popped into my mind when I started reading the
article (except I would have done it for free).

How successful is it? What are the advantages compared to, e.g.,
randomtextgenerator.com?

------
pbewig
I did this exercise on my blog at
<http://programmingpraxis.com/2009/02/27/mark-v-shaney/>.

------
agentq
<http://www.jwz.org/dadadodo/> DadaDodo is a fairly old implementation that
may be of interest.

------
Stormbringer
Weren't they looking for help data-mining the Palin emails? That'd be a better
selection of random gibberish, wouldn't it?

~~~
hugh3
Please, no, just don't.

