Happy third birthday, Horse_ebooks (medium.com/language-lingustics)
70 points by wallflower on Aug 7, 2013 | 20 comments



She uses the phrase "spam robot". Is that semantically equivalent to automated? If so, for the love of god, can someone tell me how I can build something that auto-generates beautiful existential poetry like that? (I am only half joking here)


Have you played around with Markov chains? They're really easy to build and produce very entertaining results - a few years ago I tried feeding all of the SXSW Interactive session titles into one and got results like "Participatory budgeting crowdsourcing for real time marketing growing a digital culture".


Can you suggest some good resources for newbies/laymen on the topic?

Edit - For anyone else who might be looking, this page gives a nice pythonic example:

http://agiliq.com/blog/2009/06/generating-pseudo-random-text...


Well, as a very rough overview, all they are is:

    * read in words
    * keep track of the frequency of "wordA wordB", "wordB wordC", etc.
    * use that frequency, e.g.:
      { "wordA" : {"wordB" : 0.5,
                   "wordC" : 0.25, ...},
        "wordB" : {"wordC" : 0.75, ...} }
      then, given a current word, say "wordA", pick the
      next word at random, weighted by those frequencies.
    * your word is now the word you just picked.  repeat that
      last step until you get bored.
You can go 'up' and 'down' in how much context you collect for each key: single words, word pairs ({"wordA wordB" : {"wordC" : 0.25}, etc}), whole sentences, anything.

The more you collect, the more 'real' your generation will tend to be, but it takes more data to train it effectively since you want lots of possibilities for each 'key' so it doesn't repeat itself. Otherwise it might think that the only thing that comes after 'key' is a single phrase, so it keeps selecting it and it doesn't sound like something anyone would actually say.
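
In Python, the whole thing fits in a few lines. A minimal sketch of exactly the recipe above (assuming a plain-text corpus.txt; the function names and the order-2 keys are just my choices):

    import random
    from collections import defaultdict

    def train(words, order=2):
        # map each `order`-word key to every word seen right after it;
        # repeats in the list act as the frequency weighting
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, length=30):
        key = random.choice(list(chain))   # random starting key
        out = list(key)
        for _ in range(length):
            followers = chain.get(key)
            if not followers:              # dead end: key only seen at the end
                break
            out.append(random.choice(followers))
            key = tuple(out[-len(key):])   # slide the key forward one word
        return " ".join(out)

    words = open("corpus.txt").read().split()
    print(generate(train(words)))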


I only have half an answer: Markov chains. It's probably going through text and using Markov chains to create "similar" text.

The question then becomes what text to go through, how long the Markov chains are, how you vary them, etc.

Not long ago I took a workshop on algorithmic music where the lead professor had published a book of haiku. The vast majority of the haiku were algorithmically generated via Markov chains, while a very slim minority were actual haiku from 17th century Japan (iirc) translated into English. Guessing which haiku were "real" and which algorithmic was very hard to do.


Do you happen to have more information on that book?


As the name vaguely alludes to, a lot of the text is extracted from ebooks, including public domain ones from Project Gutenberg. That said, I find it hard to believe that just picking random sentence fragments would give such a high signal-to-noise ratio, so there's probably something else going on.

If I were the one pulling the strings of Horse_ebooks, I would probably focus on building a statistical model that uses the number of favorites per tweet as its training data, and looks for linguistic features that are highly correlated with getting lots of favorites.
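
(Sketching what I mean, with scikit-learn and invented numbers; assuming you'd already scraped tweets with their favorite counts, a bag-of-words regression would be the simplest starting point. None of this is claimed to be what the real operator does.)

    # hypothetical sketch -- requires scikit-learn; the tweets,
    # counts, and the >= 50 "popular" cutoff are all made up
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    tweets = ["Worms - oh my god WORMS",
              "Buy this ebook now",
              "Everything happens so much",
              "Click here for great deals"]
    favs = [340, 0, 512, 2]          # favorites per tweet (invented)

    vec = CountVectorizer(ngram_range=(1, 2))   # word + word-pair features
    X = vec.fit_transform(tweets)
    y = [f >= 50 for f in favs]      # label: did it get "lots" of favorites?

    model = LogisticRegression().fit(X, y)

    # rank new random fragments, tweet only the most promising one
    candidates = ["Leg muscles. Actual leg muscles", "50% off all ebooks"]
    scores = model.predict_proba(vec.transform(candidates))[:, 1]
    print(sorted(zip(scores, candidates), reverse=True)[0])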


> If I were the one pulling the strings of Horse_ebooks, I would probably focus on building a statistical model that uses the number of favorites per tweet as its training data, and looks for linguistic features that are highly correlated with getting lots of favorites.

Yeah, but the problem with that approach is that you will always hit the cold start problem. What happens at the beginning, when there is nothing and no one likes you? If this was a robot, I can't imagine significant time was spent initially on custom-designing these tweets.


Its original purpose is affiliate marketing spam - every so often it tweets links to buy crappy ebooks, and the random text was presumably originally an attempt to evade spam detection. The fact that the amusement it provides is entirely accidental in origin (we don't even know for sure whether the operator knows about it now) only adds to the brilliance.

(This is briefly mentioned in the article, but since you asked...)


Some ideas from the guy who created @nytimes_ebooks: http://harrisj.tumblr.com/post/23737140672/nytimes-ebooks


half joking... have a look at http://www.polygen.org. AFAIK it's only in Italian, but it can generate pretty much anything given a small grammar...

-- EDIT --

on http://www.polygen.org/it/grammatiche/tecnologie/eng/manager... you can find an online example in English.

I found Firefox caches the page on refresh instead of reloading it, so if you want other generated phrases you need to fool the browser by appending some gibberish to the GET request (à la ?foo=1234).


horse-ebooks.com now goes to quotestatusjoke.com.

What does this iframe on quotestatusjoke.com do? http://quotestatusjoke.com/twitter/tw.php

The contents say:

> Now 07:27 - we work from 14:00

And a cached version says:

http://webcache.googleusercontent.com/search?q=cache:M0v3DQU...

"1000_quotes|2441|2049|13240 1 - that matter, of National Review - there is not much reason to pay it attention. OK 0 - If a church offers no truth that is not available in the general culture - in, for instance, the editorials of the New York Times or, for OK

Search by tweets: fun & follow: 1559500496 158361802 375521629 391547065

Unfollowing... ok 56349379 ok 113315835

delete in messages:

delete out messages: 352898710100922368, 352898708427382785, 352898707064225792,

exit check new followers

Send thank you message to: 50860829, 414537821, 479947786,"

Is this a site visitor-triggered cron job?


Because everything happens so much, have a look at the most popular horse_ebooks tweets http://favstar.fm/users/horse_ebooks


Here is some code to create your own _ebooks twitter account: https://github.com/mispy/twitter_ebooks


The day I found Horse_ebooks is the day I "got" Twitter.

Thank you, Horse_ebooks, and happy birthday.


Here's more code for running your own: https://github.com/longears/horsey-books It's running the (NSFW) bots @fetlife_ebooks, @obscuregenres, and @snes_games.

Lately I've been tweaking and curating the output before posting it. It makes it a bit less magical but much funnier. It turns out that seeding your brain with random starting points like this is a potent creativity-boosting technique.

I tried using Bayesian spam filtering to classify the results as Funny or Not Funny, but it was unable to detect funniness just from which words were present in the message.
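
In rough shape, the experiment was: train a bag-of-words classifier on hand-labeled examples and see if it separates them. A minimal sketch of that kind of setup, using scikit-learn's MultinomialNB with invented training data (not my exact code):

    # naive Bayes "funny or not" sketch -- requires scikit-learn
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts = ["Worms - oh my god WORMS",          # funny
             "Limited time offer, act now",      # not
             "Everything happens so much",       # funny
             "Subscribe for daily deals"]        # not
    labels = ["funny", "not", "funny", "not"]    # hand-labeled

    vec = CountVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

    print(clf.predict(vec.transform(["Dear reader, you are a horse"])))

Funniness, it turns out, lives somewhere other than the word counts.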


Another phenomenon inspired by @Horse_ebooks is the proliferation of non-spam "Horse" accounts. As examples, @Horse_JS, @Horse_Recruiter, and @Horse_iOS each have their own flavour of topical absurdity.


Thanks. The @Horse_ebooks account is spammy and the bizarro tweets are far too rare IMO. I'll check out the other ones.


Hmm, nobody's mentioned Horse_eComics, the web comic that takes Horse_eBooks tweets and turns them into comics:

http://horseecomics.tumblr.com


It reminds me of the Cornell boxes created by Neuromancer and Wintermute.



