The shapes of stories are a very good pedagogical tool for understanding narrative, a nice complement to the Hero's Journey. The idea makes narrative more intuitive and easier to apply to other story structures, and even helps you see the different approaches taken by different cultures.
If this isn't nice, I don't know what is.
There is this atomic purity when you come across someone who ruthlessly wants to drive at the truth. Not be right. Not have power. Not project their aesthetic. Nothing but the truth. That is the kind of person I aspire to be.
You can see the emotional story arc -- the shapes of the stories -- for more than 16,000 books.
I train a Word2Vec model on the vocabulary of all those books (almost 1.5 billion words) and then I use a clustering algorithm to score all those words on a sentiment scale of 1 to 10 (where 1 is the most negative and 10 is the most positive). Then I break the books into 50 equal-sized chunks and aggregate the positive and negative scores for each chunk.
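The chunk-and-aggregate step described above can be sketched in a few lines. This is just an illustration of the idea, not the site's actual code: `word_scores` stands in for the per-word sentiment lookup (1 to 10) that the post derives from Word2Vec plus clustering, and here it's simply a hypothetical dictionary.

```python
def emotional_arc(text, word_scores, n_chunks=50):
    """Split a text into n_chunks equal word spans and average the
    sentiment score (1-10 scale) of the scored words in each span."""
    words = text.lower().split()
    chunk_size = max(1, len(words) // n_chunks)
    arc = []
    for i in range(n_chunks):
        chunk = words[i * chunk_size:(i + 1) * chunk_size]
        scored = [word_scores[w] for w in chunk if w in word_scores]
        # Fall back to the neutral midpoint (5.5) if no words were scored.
        arc.append(sum(scored) / len(scored) if scored else 5.5)
    return arc
```

Plotting the returned list gives the rising-and-falling emotional arc for the book.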
You can click on any of the chart segments to see a word cloud of all the words that contributed to the positive and negative sentiment of that chunk. You can really see the ups and downs of the stories, as the protagonists struggle to overcome their obstacles, when you look at those charts!
Here are a few of my favorite example books to show people:
Harry Potter and the Deathly Hallows
I first encountered this method not through Vonnegut but through the "Hedonometer" project at the University of Vermont Computational Story Lab. They use this technique on the Twitter firehose to measure the overall emotional arc of the world, as expressed in social media.
There's an excellent episode of the podcast Lexicon Valley where they discuss the hedonometer project, with the researchers at UVM who developed it...
The "most passive page" feature also does not seem to be working. Passive as in passive voice? If so, it's also pretty off the mark.
It's easy to imagine exceptions to the idea of a simple numerical word-scoring algorithm...
Of course, a word like "bad" might be used ironically, or in some other slang-sense, with a different literal meaning on the page...
But that's totally fine. In principle, the word2vec algorithm is designed to cope with ambiguities like that.
When you analyze billions of words of prose, you can build a model of word-associativity that captures the superposition of all those different word-senses, and the contexts where they tend to appear on the page.
After a big crazy machine-learning process, each word is modeled as a vector in 300-dimensional space, with a vast network of associations and relationships to the other words in the vector space, based on the way those words are used together in typical English prose.
When we score the emotional valence of a particular word, we use a "word-vector" technique where those ambiguities are basically already priced into the scoring calculation. Words with a "less ambiguous" sentiment score (joy, paradise, ..., agony, depression) have their lack-of-ambiguity baked into the formula already.
Extreme scores are reserved for words with unambiguous intensity.
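Here's a toy illustration of that scoring idea: place a word on the 1-to-10 scale by comparing its vector to "seed" words of known valence. Real systems use 300-dimensional Word2Vec embeddings; these 2-d vectors, the seed words, and the scaling formula are all made up for the sketch, but they show how an ambiguous word naturally lands near the neutral midpoint.

```python
import math

def cosine(a, b):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def valence(word_vec, pos_vec, neg_vec):
    """Map the similarity difference (range -2..2) onto a 1-10 scale."""
    diff = cosine(word_vec, pos_vec) - cosine(word_vec, neg_vec)
    return 5.5 + diff * 2.25

vectors = {  # hypothetical embeddings, invented for illustration
    "paradise": (0.9, 0.1),
    "agony":    (0.1, 0.9),
    "ironic":   (0.5, 0.5),  # equally close to both seeds -> neutral score
}
```

A word like "ironic", pulled toward both seed words by its mixed contexts, scores the neutral 5.5, while "paradise" and "agony" land well above and below it.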
But the important thing is: we're not really as concerned about the numerical scores of individual words as we are with the shifting balance of those sentiment scores over the course of a long document.
It's not a perfect way of scoring sentiment of individual words, but it's REALLY reliable for estimating the basic structure of a narrative.
Of the books you've analyzed, it's interesting - but not necessarily surprising - to see that a Palahniuk book has the least "passive voice" usage (1).
I threw a curveball at it: http://prosecraft.io/library/mark-z-danielewski/house-of-lea...
It would be interesting to see if Prosecraft would ever correlate "similar books" with Borges since Danielewski said that was an influence.
So books are more likely to be similar if they're roughly in the same genre and discuss similar kinds of topics (dragons, computers, romance, spies, war, shopping, time-travel, magic, hunting, etc).
Someday I hope the "similar books" feature will be a bit more sophisticated, where other kinds of "similarity" will also be relevant, beyond just the topic-model... Other things like: story structure, narrative voice, irony, vocabulary, sense-of-humor, lyricism, etc...
Maybe I'm simplifying and applying a template to the "happy ending" cliche.
Once you read this, you'll start seeing that every popular movie today follows this guide, almost precisely to-the-minute. Book says "by page X of screenplay, Y should happen" and you'll see that it does.
The original pattern had the Shaman travel into the world of the supernatural to complete a kind of quest to solve a problem back home. In the Western world, the "supernatural realm" stopped catching on after a while, so we got first the Hero's Journey (via Campbell) and then the modern version of that myth, the MIH/BMG story where the protagonist gets into trouble and out again, possibly while staying in the same town all along.
"A hero ventures forth from the world of common day into a region of supernatural wonder: fabulous forces are there encountered and a decisive victory is won: the hero comes back from this mysterious adventure with the power to bestow boons on his fellow man."
It turns the typical detective story on its head. You see the murder, and who commits it, as the first thing in the episode. The rest is how the detective intuits it, and then, most importantly, the psychological battle and clever tricks required to corner the murderer and prove it was him.
https://youtu.be/aaPQsYqNDnY (NSFW, language)
Perhaps, at this level of shape, it’s a bit like saying that all software is just a variation on CRUD.
BMG: Sleepless in Seattle
MIH: A Christmas Carol
BMG + MIH: Groundhog Day
What shape would you use to graph the Monkey's Paw stories? (where you think you are getting a great wish, but the good things come at a terrible cost)
E.g. the Simpsons episode with the Monkey's Paw is a roller coaster of ups and downs