The algorithm for building the lexicon and scoring the vividness of each word is still evolving pretty rapidly, so I'm not quite ready to publish the exact details yet, but I'll eventually write about it in depth.

Until then, here's a high-level overview:

1) Start with a human-curated list of several hundred vivid words. Be sure to include words that evoke all the senses: sight, sound, touch, smell, taste, and bodily sensation. These are the "seed words".

2) Scan through the entire corpus and find all instances of those seed words.

3) Vivid words tend to occur in clusters, so find all words that tend to occur in close proximity to the original seed words.

4) Create a list of new candidate words, and suggest them to a human reviewer.

5) The human reviewer accepts or rejects each of the candidate words.

6) The accepted candidate words become new seed words for the next iteration. Eventually, you'll have somewhere in the neighborhood of 10,000 words :)

7) The score of each word is based on a modified TF/IDF metric, with a few extra finishing touches (like incorporating sentiment scores from the "Hedonometer" project at the University of Vermont Computational Story Lab).
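
To make steps 2 through 6 a little more concrete, here's a minimal sketch in Python. It is not the actual prosecraft code: the function names, the `window`, `min_ratio`, and `min_count` parameters, and the "share of occurrences near a seed word" heuristic are all assumptions chosen for illustration, and the `review` callback simply stands in for the human reviewer.

```python
from collections import Counter


def propose_candidates(tokens, seed_words, window=5, min_ratio=0.5, min_count=2):
    """Rank non-seed words by the share of their occurrences near a seed word.

    Hypothetical heuristic: a word is a candidate if at least `min_ratio` of
    its occurrences fall within `window` tokens of some seed word.
    """
    seeds = set(seed_words)
    total = Counter(tokens)

    # Step 2: find every position where a seed word occurs, and mark the
    # surrounding window of token positions as "near a seed".
    near_indices = set()
    for pos, tok in enumerate(tokens):
        if tok in seeds:
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            near_indices.update(range(lo, hi))

    # Step 3: count how often each non-seed word appears inside those windows.
    near_seed = Counter(tokens[i] for i in near_indices if tokens[i] not in seeds)

    # Step 4: keep words that are frequent enough and mostly occur near seeds.
    candidates = []
    for word, n_near in near_seed.items():
        if total[word] >= min_count:
            ratio = n_near / total[word]
            if ratio >= min_ratio:
                candidates.append((word, ratio))
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))


def expand_lexicon(tokens, seed_words, review, rounds=3, **kwargs):
    """Steps 5 and 6: accepted candidates become seeds for the next round."""
    lexicon = set(seed_words)
    for _ in range(rounds):
        candidates = propose_candidates(tokens, lexicon, **kwargs)
        accepted = {word for word, _ in candidates if review(word)}
        if not accepted:
            break  # nothing new was accepted, so further rounds won't help
        lexicon |= accepted
    return lexicon
```

Calling `expand_lexicon(tokens, seeds, review=ask_human)` with a hypothetical `ask_human(word)` function that returns `True` or `False` would grow the lexicon one round at a time. A real pipeline would also need tokenization, stop-word handling, and some kind of statistical significance test, none of which are shown here.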

That's where the algorithm is at right now. I've been iterating on this basic premise for over a year, and I'm pretty happy with the current results. It's not perfect, but it's useful. As far as I'm concerned, it's a pretty mature "version one" of the vividness metric.

But it has several weaknesses I want to address in a "version two" sometime soon.

The biggest weakness is that TF/IDF is only a shallow proxy for intensity of vividness.

For example, looking through the 500 million words in the prosecraft corpus, the word "blood-red" occurs 972 times, yielding a vividness score of 6.1. But the word "magenta" occurs only 515 times, yielding a vividness score of 8.3. It's more rare, so the model thinks it's more vivid.
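
Here's a tiny sketch of why any rarity-based score behaves this way. The formula below is a generic inverse-frequency score that I'm assuming purely for illustration; it is not the modified TF/IDF metric described above, it uses raw corpus counts rather than per-document frequencies, and it won't reproduce the 6.1 and 8.3 figures. It only shows the ordering.

```python
import math

CORPUS_SIZE = 500_000_000  # approximate corpus size in words, from the text


def rarity_score(count, corpus_size=CORPUS_SIZE):
    """Generic inverse-frequency score (an assumption, not the real metric).

    The fewer times a word occurs, the higher its score.
    """
    return math.log10(corpus_size / count)


# Occurrence counts quoted in the text.
counts = {"blood-red": 972, "magenta": 515}
scores = {word: rarity_score(n) for word, n in counts.items()}

# The rarer word always wins, regardless of what the word actually means.
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these counts, `ranked` puts "magenta" ahead of "blood-red", because the score only ever looks at frequency and knows nothing about meaning.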

To me, that seems wrong. Because the word "blood" has special bodily meaning, beyond just a prefix for a color-word, my intuition says the word "blood-red" should be scored as significantly more vivid than "magenta".

I have some ideas for addressing that deficiency by using the output of "version one" alongside a new "sensory hierarchy" model to train a new "version two" classifier.

But that new model is still in its very early conceptual stage, and I'm not ready to write about it yet :)

In the meantime, I consider "version one" a legitimately useful tool for working authors.
