
Tofu – Twitter bot - unheaped
http://tofuproduct.net/
======
adamb_
Without viewable source code, or a solid explanation of the algorithm at play,
this feels a bit like the Turk.[1]

[1] [http://en.wikipedia.org/wiki/The_Turk](http://en.wikipedia.org/wiki/The_Turk)

~~~
Falling3
I'm not so sure it works well enough for us to be at all suspicious of that...

I just got my tofu tweet: "As a few words. rocking wearing? I injured my
syntax highlighting =D? I gotta!"

------
HeyChinaski
I made a very similar bot a few years ago.
[http://twitter.com/markovator](http://twitter.com/markovator)

------
wgx
Um.. so Tofu just seemed to snip a few words from my recent tweets, and string
them into nonsense:
[https://twitter.com/tofu_product/status/409054279328993280](https://twitter.com/tofu_product/status/409054279328993280)

~~~
geekam
It seems like a simple generator like
[http://www.wisdomofchopra.com/](http://www.wisdomofchopra.com/)

------
robitor
How is something like this made? What's the math going on behind the scenes?

~~~
delluminatus
These kinds of systems almost always use an n-gram model to generate text.
It's a fascinating and surprisingly effective tool (the surprise is that
implementing one is extremely simple).

Basically, it works by collecting data from a corpus about the probabilities of
word transitions -- the probability of the "next word" given the preceding n-1
words. So n-gram models often produce sentences that look correct locally, over
a span of a few words, but don't make much sense as a whole. The higher the n,
the more the generated sentences come to resemble the writing in the corpus.
Once you get to quadrigrams (4-grams), the model usually just reproduces exact
sentences verbatim from the text, because the data becomes very sparse (how
many times do you see "then I turned" in a corpus of tweets, for instance?),
unless you have a very large corpus.
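
To make the sparsity point concrete, here's a rough sketch of the trigram
version (assuming "words" is just a tokenised list of the corpus; Tofu's actual
code isn't public, so this is only illustrative). The context key is the
previous two words, so with a small corpus most keys only ever accumulate a
single count:

    from collections import defaultdict

    trigram = defaultdict(lambda: defaultdict(int))
    for i in range(2, len(words)):
        # key on the two preceding words; counts thin out fast as n grows
        trigram[(words[i-2], words[i-1])][words[i]] += 1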

Like many NLP models, the n-gram model has seen plenty of variations and tweaks
with differing levels of effectiveness; different tweaks often produce more
believable results for different corpora.

The math is quite simple. Let's take n=2, also called "bigrams" -- this is the
classic first-order Markov chain. In Pythonic pseudocode (because math notation
doesn't work on HN) you could create a frequency distribution with:

    
    
    from collections import defaultdict

    # nested counts: model[prev][next] = number of times `next` followed `prev`
    model = defaultdict(lambda: defaultdict(int))
    for i in range(1, len(words)):
        model[words[i-1]][words[i]] += 1

where "model" is a nested dictionary whose counts default to 0 (that's what the
defaultdict is for). Then normalise each row of counts into probabilities:

    
    
    # convert counts into probabilities: P(next | prior)
    for prior, nexts in model.items():
        total = sum(nexts.values())
        for w in nexts:
            nexts[w] /= total
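
Once you have those probabilities, generating a tweet is just a random walk
over the model. A rough sketch of the sampling step (this assumes the "model"
built above plus a seed word; I have no idea whether Tofu does exactly this,
but it's the standard trick):

    import random

    def generate(model, start, length=15):
        word = start
        out = [word]
        for _ in range(length):
            nexts = model.get(word)
            if not nexts:
                break  # dead end: this word never appears as a prior
            # sample the next word in proportion to its probability
            word = random.choices(list(nexts), weights=list(nexts.values()))[0]
            out.append(word)
        return " ".join(out)

Seed it with a word from the target user's recent tweets and you get exactly
the kind of locally-plausible nonsense people are quoting elsewhere in this
thread.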

~~~
robitor
Thanks! I'm taking an information retrieval course right now and I'm
interested in applying what I've learned to a cool pet project. I don't think
we ever touched on n-gram models, for some reason.

~~~
saraid216
This isn't information retrieval. This is data processing. Information
retrieval is a subset of data processing.

Retrieval specifically needs an algorithm to determine document relevance.
Everything you're learning is to understand how different parts of that
algorithm affect the results. It's a very difficult problem, even if you
assume that the corpus isn't sapient.

Stuff like n-grams is more about reshuffling data in order to expose patterns.
It's a little bit like running a regression on noisy data to see the underlying
trend.

