Show HN: Five Thousand Novels, Ranked by Vividness (prosecraft.io)
54 points by benjismith 8 months ago | 27 comments

This will come across as very mean-spirited but I find the idea of measuring a novel’s “vividness” based on the kind of vocabulary it uses (or the voice of the verb, etc.) to be completely bogus.

The purpose of this metric is to give authors a practical tool they can use to measure the degree of direct sensory language they use in their writing, and compare it with authors they admire. It's not a value judgement.

There are other admirable qualities of writing (emotional, conceptual, etc) that are orthogonal to "vividness", and in the long-run, I plan on developing metrics for those qualities as well.

For example, take a look at the works of Jane Austen. Here's the analysis of "Pride and Prejudice":


She's a brilliant writer, and her prose is highly emotional, but it isn't especially vivid, according to my definition of "vividness", which is: prose that evokes a sensory experience (with colors, textures, flavors, aromas, sounds, and bodily sensations).

There are plenty of ways to write a brilliant novel, and some of them involve vivid sensory writing. But Jane Austen's brilliance comes from her handling of emotional relationships.

For a modern example of the same phenomenon, take a look at one of my favorite authors, Nick Hornby:



I've read every one of his novels, and they're all about human relationships, but the prose itself isn't very vivid. Nothing wrong with that, though. It's just a measurement.

As an author, it's helpful to be mindful of these kinds of measurements. The same thing is true of "passive voice". Using a lot of passive voice is still a legitimate way of writing, but it's helpful for an author to be aware of the literary voice they're crafting:


My impression is that your "vividness" metric is closed source. [0]

Your metric is wholly subjective without a derivation and formula. We have no clue what's being measured. Your results are susceptible to "Yeah, well, that's just like your opinion man".

[0] https://blog.shaxpir.com/writing-vivid-prose-33283e861358

In these criticisms, I find a laudable impulse to protect language from being captured by formal analysis, with an ethical impulse something like not wanting to see an elephant caged at a zoo.

But in outcome, I always seem to see so much more detail in the positive efforts to analyze than I do in the defenses of language as being beyond analysis. Effort at making analysis comes with deep engagement on the different dimensions upon which language can be expressive, and the gist of challenges to these analyses is "it's subjective. You just can't do it!"

And without even commenting on the substance of these respective arguments, the types of output they tend to produce make me more sympathetic to those making the effort to analyze.

I don't think this is about "protect[ing] language from being captured by formal analysis", it's about calling out bogus analysis when you see it.

This "vividness" analysis is bogus for two reasons: (1) The idea that individual words have objectively different levels of "vividness" completely divorced from context (when and where the novel was written, which character is speaking, etc.) is extremely debatable; (2) The idea that the "vividness" of individual words makes the novel as a whole "vivid" is a logical fallacy (compositional fallacy).

I'm 100% behind the use of "formal analysis" to extract new insights into literature and language – there are examples where it has been done very well [1] – but analysis has to be robust, which I don't think it is here.

[1] http://jonreeve.com/2016/07/paradise-lost-macroetymology/

I don't fear machine analysis draining the life of literature. Its limit would simply be a functioning similar to a human's. Rather than algorithmic analysis, we'd lump it with literary criticism. Here the computer's personal context rules on how 'vivid' or 'good' a text is.

This aside, I am as decidedly pro-analysis as your comment. My claim that OP's method may as well be subjective was a criticism of his process' insubstantial description.

The formula is easy, as explained in the article:

1) For any word in the 10,000-word vividness dictionary, add its score to the sum.

2) Divide by the total word count.
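The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tokenizer, the function name `vividness`, and the three-word `toy_scores` dictionary are all hypothetical, and the real 10,000-word dictionary and its scores are not published.

```python
import re

def vividness(text, scores):
    """Sum the scores of dictionary words, then divide by the total word count."""
    # Assumed tokenizer: lowercase words, keeping internal hyphens and apostrophes.
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    if not words:
        return 0.0
    # Words missing from the dictionary contribute 0 to the sum
    # but still count toward the denominator.
    return sum(scores.get(w, 0.0) for w in words) / len(words)

# Hypothetical scores, invented for this example only.
toy_scores = {"acrid": 8.0, "crimson": 7.0, "smoke": 4.0}

print(vividness("The acrid smoke was crimson.", toy_scores))
```

Because the denominator is the full word count, a long passage with a handful of high-scoring words can still score lower than a short, dense one.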

The complete word list in the vividness dictionary, and the scores of each word, are in constant flux, based on the results of a massive machine-learning algorithm, driven by the set of novels in the corpus.

Every time we add new novels, the results change slightly, but as the corpus grows, the word list and scores will eventually converge, and perhaps then we'll publish the dataset :)

The formula involves:

1) A mystery heuristic to arrive at your 10,000 'vivid' words

2) A mystery 'vivid' word numerical rating "...depending on the intensity of the sensory experience it invokes"

3) A mystery "linguistic algorithm"/"massive machine-learning algorithm"

Imagine clicking on an HN post titled "I made a raytracer in Python". Imagine the post contains pretty examples of the renders. Unfortunately it describes the raytracer as "cutting edge" but has no code for repeatability. Worse yet, the post doesn't recount any critical reasoning employed in the process of building a raytracer.

The algorithm for building the lexicon and scoring the vividness of each word is still evolving pretty rapidly, so I'm not quite ready to publish the exact details yet, but I'll eventually write about it in-depth...

Until then, here's a high-level overview:

1) Start with a human-curated list of several hundred vivid words. Be sure to include words that invoke all the senses: sight, sound, touch, smell, taste, and bodily sensation. These are the "seed words".

2) Scan through the entire corpus and find all instances of those seed words.

3) Vivid words tend to occur in clusters, so find all words that tend to occur in close proximity to the original seed words.

4) Create a list of new candidate words, and suggest them to a human reviewer.

5) The human reviewer accepts or rejects each of the candidate words.

6) The accepted candidate words become new seed words for the next iteration. Eventually, you'll have somewhere in the neighborhood of 10,000 words :)

7) The score of each word is based on a modified TF/IDF metric, with a few extra finishing touches (like incorporating sentiment scores from the "Hedonometer" project at the University of Vermont Computational Story Lab).
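Steps 2 through 6 amount to a bootstrapping loop, which might look roughly like the sketch below. It is not the actual prosecraft code: the function names, the fixed token window, the raw co-occurrence count, and the `accept` callback standing in for the human reviewer are all assumptions made for illustration.

```python
from collections import Counter

def propose_candidates(tokens, seeds, window=3, min_count=2, top_n=10):
    """Rank non-seed words by how often they fall within `window` tokens of a seed."""
    near = Counter()
    for i, tok in enumerate(tokens):
        if tok in seeds:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Count every neighbor of this seed occurrence, skipping known seeds.
            for neighbor in tokens[lo:i] + tokens[i + 1:hi]:
                if neighbor not in seeds:
                    near[neighbor] += 1
    return [w for w, c in near.most_common(top_n) if c >= min_count]

def expand_lexicon(tokens, seeds, accept, rounds=3, **kwargs):
    """Grow the lexicon: propose candidates, let `accept` review them, repeat."""
    lexicon = set(seeds)
    for _ in range(rounds):
        accepted = {w for w in propose_candidates(tokens, lexicon, **kwargs) if accept(w)}
        if not accepted:
            break  # nothing new was accepted, so stop iterating
        lexicon |= accepted
    return lexicon

tokens = "the acrid smoke rose and the acrid smoke fell".split()
reviewer = lambda w: w != "the"  # stand-in for the human accept/reject step
print(expand_lexicon(tokens, {"smoke"}, reviewer, window=1, min_count=2))
```

Raw proximity counts surface common function words such as "the" alongside genuinely vivid neighbors, which is one reason a human review step is useful in a loop like this.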

That's where the algorithm is at right now. I've been iterating on this basic premise for over a year, and I'm pretty happy with the current results. It's not perfect, but it's useful. As far as I'm concerned, it's a pretty mature "version one" of the vividness metric.

But it has several weaknesses I want to address in a "version two" sometime soon.

The biggest weakness is that TF/IDF is only a shallow proxy for intensity of vividness.

For example, looking through the 500 million words in the prosecraft corpus, the word "blood-red" occurs 972 times, yielding a vividness score of 6.1. But the word "magenta" occurs only 515 times, yielding a vividness score of 8.3. It's more rare, so the model thinks it's more vivid.

To me, that seems wrong. Because the word "blood" has special bodily meaning, beyond just a prefix for a color-word, my intuition says the word "blood-red" should be scored as significantly more vivid than "magenta".
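The rarity effect in that example can be reproduced with a plain log inverse-frequency score. This is an assumption-laden stand-in, not the site's "modified TF/IDF": it uses corpus-wide token counts instead of per-document frequencies, so it will not reproduce the 6.1 and 8.3 values quoted above, only their ordering. The name `rarity_score` is hypothetical.

```python
import math

CORPUS_WORDS = 500_000_000                    # corpus size quoted in the comment
counts = {"blood-red": 972, "magenta": 515}   # occurrence counts quoted in the comment

def rarity_score(count, total=CORPUS_WORDS):
    """Plain log inverse frequency: the rarer the word, the higher the score."""
    return math.log(total / count)

for word, n in counts.items():
    print(f"{word}: {rarity_score(n):.2f}")
```

Under any score of this shape, "magenta" outranks "blood-red" purely because it occurs less often, which is the behavior the comment objects to.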

I have some ideas for addressing that deficiency by using the output of "version one", alongside a new "sensory hierarchy" model to train a new "version two" classifier.

But that new model is still in its very early conceptual stage, and I'm not ready to write about it yet :)

In the meantime, I consider "version one" a legitimately useful tool for working authors.

Glad you asked!

I appreciate the current insight. Hearing about current pitfalls and how you navigate them could boost your followup posts.

One "adversarial" author to the curated list approach might be David Foster Wallace. Partly because of his large vocabulary, which would only partially cluster with the list.

More so because of a trickier pitfall I see with trying to quantify vividness: abstract metaphor can be vivid without using any 'vivid' words. Referring to a baby having a "little lipless hyphen of a mouth" would likely score poorly on vividness.

It's interesting you use Jane Austen as an example, as I have found her books to be filled with more than enough detail to picture the scenes quite distinctly.

I find it particularly interesting that your algorithm has identified the argument between Elizabeth and Lady Catherine as the most passive page, when in fact it's one of the most tense exchanges between any two characters and is dripping with sarcasm on both sides. They use the passive voice to insulate the argument and add a veneer of social acceptability.

Have you done any analysis to see how your tools handle irony? The description makes it seem words are taken at face value, but really good prose usually operates on several levels.

Totally agree

Here's an article I wrote, describing the idea behind the project, defining the idea of "vividness", and explaining how the linguistic analysis works:


You can click around anywhere on the histogram chart, to see the different percentile buckets. And you can click on any of the books, to see detailed linguistics, including a snippet of the most vivid page in the book.

Thanks! I'm enjoying comparing novels I've read and finding surprises at misremembering the prose of some of them. Was it a deliberate decision to not include the ability to search for specific titles? I've been clicking through the percentiles and realized I could save some time finding the titles if they were all on one page (Ctrl-F) or via a search box.

edit: manually editing the URL helps with this :)

edit2: and apparently the home page

You can click on the logo in the upper-left corner, or you can go directly to the homepage, http://prosecraft.io, to search for specific titles :)

I'm disappointed there is no Gene Wolfe. I'd be extremely interested in his metrics.

just discovered this! Thank you.

How'd you assemble the corpus? Only some of these books are public domain, did you have to buy/license the rest?

Apologies if this is mentioned and I missed it, but does this account for changes in word meaning or context over time? Earlier literature, such as Austen, could be considered “not-vivid” unless you’re clued in for particular hints/phrases. I’m thinking of, perhaps, the use of “Et cetera” for pudenda.

Here is the "most vivid" page of the "most vivid" book:

"giants, and they were impaled by spear, lance, and crystal shard. A series of explosive reports echoed across the battlefield as the giants stumbled upon the Mistcloak’s tripwires, sending lethal blossoms of sharpened steel twisting through the air. Fell and his minions moved through the giants like an avalanche. The Under-King shifted his form to a flowing slab of stone and crashed down upon giant flesh, pulverizing it to blue powder and red ash. Even the animals, though weak and weary, tore into the giants with the primal fury of the wild. Claw and fang stood with horn and hoof, wounding with equal enmity. Beak and talon darted and gouged. The entire island of Mistgard stood united against the foul armies of frost and fire. Devastation was rampant on the mountain, but it was nothing compared to the wrath of the Storm Speaker. Even the stoic Under-King was surprised at the power of the Oldest of Cubs. At the back of the Pandyr’s armies, high atop the tallest of Fell’s battlements, stood the lone figure of the Storm Speaker. He called forth and charmed the very storms from the clouds beneath him and sent electric green-and-blue arcs of lightning into the giants’ lines, blasting hundreds of their bodies off of the battlefield and into the mist below. The world above burned. The radiant morning light was blackened by acrid smoke, making the golden skull radiate a brown and bloody glow. The Aesirmyr lay strewn with broken bodies: blue and red"

This is an interesting analysis, but I have a serious problem with the strong implication that high vividness = good.

At the rock bottom of the vividness scale, we find Jane Austen, Isaac Asimov, Agatha Christie, C.J. Cherryh, and Danielle Steel -- all extremely popular authors. And at the very top, we find George R.R. Martin, Roald Dahl, Poul Anderson, Edgar Rice Burroughs, Ray Bradbury, and Kim Stanley Robinson -- also popular, but generally not quite of the same stature as those on the first list.

Possibly the reading public is slightly biased toward low vividness. Meanwhile, I have at least two favorite authors on both lists.

I was naturally curious about outliers, so I went to the most vivid book, which is "Pygmy". I haven't read this book, but its style is described as grammatically incorrect "English" written in a detached scientific tone. I wonder how much this threw off the algorithm and caused such a high score (over 100%, and nearly 25% higher than the second most vivid).

This is great.

I wish we could do our own arbitrary style analysis on the data set sort of the way one can do a factor analysis on a portfolio.

I would look at words that are common between Murakami and McCarthy compared to the rest of the corpus, for instance.

Interesting project. Not surprised to see Chuck Palahniuk at the top of the list, but was a little shocked how much he dominated it.

Incredible concept! On mobile (ios 6s) the website needs some work.

This is very cool. I want to search, can we search?

You can search from the homepage: http://prosecraft.io
