She uses the phrase "spam robot". Is that semantically equivalent to automated? If so, for the love of god, can someone tell me how I can build something that auto-generates beautiful existential poetry like that? (I am only half joking here)
Have you played around with Markov chains? They're really easy to build and produce very entertaining results - a few years ago I tried feeding all of the SXSW Interactive session titles into one and got results like "Participatory budgeting crowdsourcing for real time marketing growing a digital culture".
* read in words
* keep track of the frequency of "wordA wordB", "wordB wordC", etc.
* use that frequency, e.g.:
{ "wordA" : {"wordB" : 0.5,
"wordC" : 0.25, ...},
"wordB" : {"wordC" : 0.75, ...} }
combined with a current word, say "wordA": pick a
random next word from its frequencies, weighted by frequency.
* your word is now the word you just picked. repeat that
last step until you get bored.
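The steps above can be sketched in a few lines of Python (a minimal first-order chain; the tiny corpus at the bottom is just a placeholder for whatever text you feed it):

```python
import random
from collections import defaultdict

def train(words):
    """Count, for each word, how often each following word appears."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, length=10):
    """Walk the chain: pick each next word weighted by its observed frequency."""
    word, out = start, [start]
    for _ in range(length - 1):
        followers = counts.get(word)
        if not followers:
            break  # dead end: this word only appeared at the end of the text
        word = random.choices(list(followers), weights=list(followers.values()))[0]
        out.append(word)
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the cat".split()
model = train(corpus)
print(generate(model, "the"))
```

Storing raw counts and weighting the random pick by them is equivalent to storing the normalized frequencies, and saves a pass over the data.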
You can go 'up' and 'down' in how many layers of frequency you want to collect, like the frequency of individual word pairs, vs word triples {"wordA wordB" : {"wordC" : 0.25}, etc}, vs sentences, anything.
The more context you collect, the more 'real' your generation will tend to be, but it takes more data to train it effectively, since you want lots of possibilities for each 'key' so it doesn't repeat itself. Otherwise it might think that the only thing that comes after a given key is a single phrase, so it keeps selecting it, and the result doesn't sound like something anyone would actually say.
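Going 'up' a layer just means making the key longer. A sketch of a second-order chain, where the keys are word pairs (any key tuple found in the training text works as a seed):

```python
import random
from collections import defaultdict

def train_ngrams(words, order=2):
    """Keys are tuples of `order` consecutive words; values count what follows."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(words) - order):
        counts[tuple(words[i:i + order])][words[i + order]] += 1
    return counts

def generate_ngrams(counts, seed, length=12):
    """`seed` must be an `order`-word tuple that appears in the training text."""
    key, out = tuple(seed), list(seed)
    while len(out) < length:
        followers = counts.get(key)
        if not followers:
            break  # this key was only ever seen at the end of the text
        nxt = random.choices(list(followers), weights=list(followers.values()))[0]
        out.append(nxt)
        key = key[1:] + (nxt,)  # slide the window forward one word
    return " ".join(out)
```

With longer keys each one is rarer, which is exactly why you need a lot more training data before the output stops parroting single phrases.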
I only have half an answer: Markov chains. It's probably going through text and using Markov chains to create "similar" text.
The question then becomes what text to go through, how long the Markov chains should be, how you vary them, etc.
Not long ago I took a workshop on algorithmic music where the lead professor had published a book of haiku. The vast majority of the haiku were algorithmically generated via Markov chains, while a very slim minority were actual haiku from 17th century Japan (iirc) translated into English. Guessing which haiku were "real" and which algorithmic was very hard to do.
As the name vaguely alludes to, a lot of the text is extracted from ebooks, including public domain ones from Project Gutenberg. That said, I find it hard to believe that just picking random sentence fragments would give such a high signal-to-noise ratio, so there's probably something else going on.
If I were the one pulling the strings of Horse_ebooks, I would probably focus on building a statistical model that uses the number of favorites per tweet as its training data, and looks for linguistic features that are highly correlated with getting lots of favorites.
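One crude way to sketch that idea (everything here is hypothetical - made-up tweets, made-up favorite counts, and average-favorites-per-word standing in for real linguistic features):

```python
from collections import defaultdict

def word_scores(history):
    """history: list of (tweet_text, favorite_count) pairs.
    Returns the average favorite count observed for each word."""
    totals, seen = defaultdict(float), defaultdict(int)
    for text, favs in history:
        for w in set(text.lower().split()):
            totals[w] += favs
            seen[w] += 1
    return {w: totals[w] / seen[w] for w in totals}

def rank_candidates(candidates, scores):
    """Prefer candidate tweets whose words historically drew more favorites."""
    def score(text):
        ws = text.lower().split()
        return sum(scores.get(w, 0.0) for w in ws) / max(len(ws), 1)
    return sorted(candidates, key=score, reverse=True)
```

A real version would use richer features than bag-of-words, but the loop is the same: score candidates against past engagement, post the winners, retrain.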
> If I were the one pulling the strings of Horse_ebooks, I would probably focus on building a statistical model that uses the number of favorites per tweet as its training data, and looks for linguistic features that are highly correlated with getting lots of favorites.
Yeah, but the problem with that approach is that you will always hit the cold-start problem. What happens at the beginning, when there is nothing and no one likes you? If this was a robot, I can't imagine significant time was spent initially on custom-designing these tweets.
Its original purpose is affiliate marketing spam - every so often it tweets links to buy crappy ebooks, and the random text was presumably originally an attempt to evade spam detection. The fact that the amusement it provides is entirely accidental in origin (we don't even know for sure whether the operator knows about it now) only adds to the brilliance.
(This is briefly mentioned in the article, but since you asked...)
I found Firefox caches the page on refresh instead of reloading it, so if you want other generated phrases you need to fool the browser by appending some gibberish to the query string (à la ?foo=1234).
"1000_quotes|2441|2049|13240
1 - that matter, of National Review - there is not much reason to pay it attention. OK
0 - If a church offers no truth that is not available in the general culture - in, for instance, the editorials of the New York Times or, for OK
Search by tweets: fun & follow:
1559500496
158361802
375521629
391547065
Unfollowing...
ok 56349379
ok 113315835
delete in messages:
delete out messages: 352898710100922368, 352898708427382785, 352898707064225792,
exit check new followers
Send thank you message to: 50860829, 414537821, 479947786,"
Lately I've been tweaking and curating the output before posting it. It makes it a bit less magical but much funnier. It turns out that seeding your brain with random starting points like this is a potent creativity-boosting technique.
I tried using Bayesian spam filtering to classify the results as Funny or Not Funny, but it was unable to detect funniness just from which words were present in the message.
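For anyone curious, the kind of classifier I tried is roughly this (a minimal word-presence naive Bayes sketch in the spam-filter style, not my actual code):

```python
import math
from collections import Counter

class NaiveBayes:
    """Bag-of-words naive Bayes, the classic spam-filter approach."""
    def __init__(self, labels=("funny", "not_funny")):
        self.word_counts = {lbl: Counter() for lbl in labels}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def classify(self, text):
        vocab = len({w for c in self.word_counts.values() for w in c})
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for w in text.lower().split():
                # Laplace smoothing so unseen words don't zero out the score
                score += math.log((counts[w] + 1) / (total + vocab))
            scores[label] = score
        return max(scores, key=scores.get)
```

The trouble is exactly what I found: funniness lives in word order and context, which word-presence features throw away.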
Another phenomenon inspired by @Horse_ebooks is the proliferation of non-spam "Horse" accounts. As examples, @Horse_JS, @Horse_Recruiter, and @Horse_iOS each have their own flavour of topical absurdity.