There are other admirable qualities of writing (emotional, conceptual, etc) that are orthogonal to "vividness", and in the long-run, I plan on developing metrics for those qualities as well.
For example, take a look at the works of Jane Austen. Here's the analysis of "Pride and Prejudice":
She's a brilliant writer, and her prose is highly emotional, but it isn't especially vivid, according to my definition of "vividness", which is: prose that evokes a sensory experience (with colors, textures, flavors, aromas, sounds, and bodily sensations).
There are plenty of ways to write a brilliant novel, and some of them involve vivid sensory writing. But Jane Austen's brilliance comes from her handling of emotional relationships.
For a modern example of the same phenomenon, take a look at one of my favorite authors, Nick Hornby:
I've read every one of his novels, and they're all about human relationships, but the prose itself isn't very vivid. Nothing wrong with that, though. It's just a measurement.
As an author, it's helpful to be mindful of these kinds of measurements. The same thing is true of "passive voice". Using a lot of passive voice is still a legitimate way of writing, but it's helpful for an author to be aware of the literary voice they're crafting:
Your metric is wholly subjective without a derivation and formula. We have no clue what's being measured. Your results are susceptible to "Yeah, well, that's just like your opinion man".
In practice, though, I always seem to see far more detail in the positive efforts to analyze than in the defenses of language as being beyond analysis. Making the attempt comes with deep engagement with the different dimensions along which language can be expressive, while the gist of the challenges to these analyses is "it's subjective; you just can't do it!"
And without even commenting on the substance of these respective arguments, the types of output they tend to produce makes me more sympathetic to those making the effort to analyze.
This "vividness" analysis is bogus for two reasons: (1) The idea that individual words have objectively different levels of "vividness" completely divorced from context (when and where the novel was written, which character is speaking, etc.) is extremely debatable; (2) The idea that the "vividness" of individual words makes the novel as a whole "vivid" is a logical fallacy (compositional fallacy).
I'm 100% behind the use of "formal analysis" to extract new insights into literature and language – there are examples where it has been done very well – but the analysis has to be robust, which I don't think it is here.
This aside, I am as decidedly pro-analysis as your comment. My claim that OP's method may as well be subjective was a criticism of the insubstantial description of his process.
1) For any word in the 10,000-word vividness dictionary, add its score to the sum.
2) Divide by the total word count.
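The two-step scoring above is simple enough to sketch directly. This is a minimal illustration, not the author's actual code: the dictionary here is a toy word→score mapping with made-up scores, and the real tokenizer is presumably more careful than `split()`:

```python
def vividness(text, vivid_scores):
    """Sum the dictionary scores of matching words, then divide by total word count."""
    words = text.lower().split()
    if not words:
        return 0.0
    total = sum(vivid_scores.get(w, 0.0) for w in words)
    return total / len(words)

# Hypothetical toy dictionary -- real scores come from the trained model.
scores = {"crimson": 8.0, "acrid": 7.5, "thunder": 6.0}
print(vividness("The acrid smoke rolled under a crimson sky", scores))
# Two vivid words out of eight: (7.5 + 8.0) / 8
```

Note that this normalization means a long stretch of plain prose dilutes the score of a single vivid page, which is presumably why the tool reports per-page results.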
The complete word list in the vividness dictionary, and the scores of each word, are in constant flux, based on the results of a massive machine-learning algorithm, driven by the set of novels in the corpus.
Every time we add new novels, the results change slightly, but as the corpus grows, the word-list and scores will eventually converge, and perhaps then we'll publish the dataset :)
1) A mystery heuristic to arrive at your 10,000 'vivid' words
2) A mystery 'vivid' word numerical rating "...depending on the intensity of the sensory experience it invokes"
3) A mystery "linguistic algorithm"/"massive machine-learning algorithm"
Imagine clicking on an HN post titled "I made a raytracer in Python". Imagine the post contains pretty examples of the renders. Unfortunately it describes the raytracer as "cutting edge" but has no code for repeatability. Worse yet, the post doesn't recount any critical reasoning employed in the process of building a raytracer.
Until then, here's a high-level overview:
1) Start with a human-curated list of several hundred vivid words. Be sure to include words that invoke all the senses: sight, sound, touch, smell, taste, and bodily sensation. These are the "seed words".
2) Scan through the entire corpus and find all instances of those seed words.
3) Vivid words tend to occur in clusters, so find all words that tend to occur in close proximity to the original seed words.
4) Create a list of new candidate words, and suggest them to a human reviewer.
5) The human reviewer accepts or rejects each of the candidate words.
6) The accepted candidate words become new seed words for the next iteration. Eventually, you'll have somewhere in the neighborhood of 10,000 words :)
7) The score of each word is based on a modified TF/IDF metric, with a few extra finishing touches (like incorporating sentiment scores from the "Hedonometer" project at the University of Vermont Computational Story Lab).
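The proximity-clustering step (steps 2–4) can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the author's pipeline: the window size, the ranking, and the lack of stop-word filtering are all stand-ins:

```python
from collections import Counter

def expand_seed_words(corpus_tokens, seeds, window=5, top_n=50):
    """One bootstrap iteration: count words that occur near known seed words."""
    near = Counter()
    for i, tok in enumerate(corpus_tokens):
        if tok in seeds:
            lo = max(0, i - window)
            hi = min(len(corpus_tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and corpus_tokens[j] not in seeds:
                    near[corpus_tokens[j]] += 1
    # Candidates go to a human reviewer (steps 5-6); accepted ones become new seeds.
    return [w for w, _ in near.most_common(top_n)]

tokens = "the crimson mist drifted and the crimson glow shimmered over wet stone".split()
print(expand_seed_words(tokens, seeds={"crimson"}, top_n=5))
```

Even this toy run shows why the human-review step matters: raw co-occurrence counts are dominated by function words like "the", so a real pipeline would need stop-word filtering or a significance test before anything reaches the reviewer.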
That's where the algorithm is right now. I've been iterating on this basic premise for over a year, and I'm pretty happy with the current results. It's not perfect, but it's useful. As far as I'm concerned, it's a pretty mature "version one" of the vividness metric.
But it has several weaknesses I want to address in a "version two" sometime soon.
The biggest weakness is that TF/IDF is only a shallow proxy for intensity of vividness.
For example, looking through the 500 million words in the prosecraft corpus, the word "blood-red" occurs 972 times, yielding a vividness score of 6.1. But the word "magenta" occurs only 515 times, yielding a vividness score of 8.3. It's more rare, so the model thinks it's more vivid.
To me, that seems wrong. Because the word "blood" has special bodily meaning, beyond just a prefix for a color-word, my intuition says the word "blood-red" should be scored as significantly more vivid than "magenta".
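The rarity effect is easy to reproduce if the score has an IDF-style term in it. The formula below is my assumption about the general shape, not the author's exact metric (which also folds in TF and Hedonometer sentiment):

```python
import math

def rarity_score(occurrences, corpus_words=500_000_000):
    # Plain IDF-style rarity: rarer words score higher,
    # regardless of how much sensory weight they actually carry.
    return math.log10(corpus_words / occurrences)

print(rarity_score(972))  # "blood-red": 972 occurrences in the corpus
print(rarity_score(515))  # "magenta": rarer, so it scores higher
```

Any purely frequency-based score will rank "magenta" above "blood-red" here, which is exactly the weakness described: rarity is a proxy for vividness, not a measure of it.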
I have some ideas for addressing that deficiency by using the output of "version one", alongside a new "sensory hierarchy" model to train a new "version two" classifier.
But that new model is still in its very early conceptual stage, and I'm not ready to write about it yet :)
In the meantime, I consider "version one" a legitimately useful tool for working authors.
Glad you asked!
One "adversarial" author to the curated list approach might be David Foster Wallace. Partly because of his large vocabulary, which would only partially cluster with the list.
More so because of a trickier pitfall I see with trying to quantify vividness: abstract metaphor can be vivid without using any 'vivid' words. Referring to a baby having a "little lipless hyphen of a mouth" would likely score poorly on vividness.
I find it particularly interesting that your algorithm has identified the argument between Elizabeth and Lady Catherine as the most passive page, when in fact it's one of the most tense exchanges between any two characters and is dripping with sarcasm on both sides. They use the passive voice to insulate the argument and add a veneer of social acceptability.
Have you done any analysis to see how your tools handle irony? The description makes it seem words are taken at face value, but really good prose usually operates on several levels.
You can click around anywhere on the histogram chart, to see the different percentile buckets. And you can click on any of the books, to see detailed linguistics, including a snippet of the most vivid page in the book.
edit: manually editing the URL helps with this :)
edit2: and apparently the home page
"giants, and they were impaled by spear, lance, and crystal shard. A series of explosive reports echoed across the battlefield as the giants stumbled upon the Mistcloak’s tripwires, sending lethal blossoms of sharpened steel twisting through the air. Fell and his minions moved through the giants like an avalanche. The Under-King shifted his form to a flowing slab of stone and crashed down upon giant flesh, pulverizing it to blue powder and red ash. Even the animals, though weak and weary, tore into the giants with the primal fury of the wild. Claw and fang stood with horn and hoof, wounding with equal enmity. Beak and talon darted and gouged. The entire island of Mistgard stood united against the foul armies of frost and fire. Devastation was rampant on the mountain, but it was nothing compared to the wrath of the Storm Speaker. Even the stoic Under-King was surprised at the power of the Oldest of Cubs. At the back of the Pandyr’s armies, high atop the tallest of Fell’s battlements, stood the lone figure of the Storm Speaker. He called forth and charmed the very storms from the clouds beneath him and sent electric green-and-blue arcs of lightning into the giants’ lines, blasting hundreds of their bodies off of the battlefield and into the mist below. The world above burned. The radiant morning light was blackened by acrid smoke, making the golden skull radiate a brown and bloody glow. The Aesirmyr lay strewn with broken bodies: blue and red"
At the rock bottom of the vividness scale, we find Jane Austen, Isaac Asimov, Agatha Christie, C.J. Cherryh, and Danielle Steel -- all extremely popular authors. And at the very top, we find George R.R. Martin, Roald Dahl, Poul Anderson, Edgar Rice Burroughs, Ray Bradbury, and Kim Stanley Robinson -- also popular, but generally not quite of the same stature as those on the first list.
Possibly the reading public is slightly biased toward low vividness. Meanwhile, I have at least two favorite authors on both lists.
I wish we could do our own arbitrary style analysis on the data set sort of the way one can do a factor analysis on a portfolio.
I would look at words that are common between Murakami and McCarthy compared to the rest of the corpus, for instance.