What this highlights most to me is the false precision people have come to expect. This is a real problem I've seen as an engineer / physicist working with business people. Things that I would consider "the same" because they are well within the noise of one another get all kinds of "hold on now, I added up the numbers in the columns and got 22.1 but the column total says 22" from people who don't understand rounding, significant figures, experimental error, etc. I know it's important to hide this from people because they end up getting distracted by it (or to occasionally let it slip so someone can feel like they have done a thorough review because they found a column that didn't add up to the rounded total). But I think it's more important overall to educate people about what kind of variation is important and what isn't, vs. getting hung up on the definition of a word.
I have a ton of experience with that too. You have two people try to estimate a thing, and they make different choices and come up with answers that are totally consistent with each other, but aren't the same. Then someone sees that they're different and omg omg omg everything's on fire.
It's amazingly hard to communicate "variability" in an answer. Whenever I communicate experimental results, my business partners always want to home in on the simplest explanation (e.g. "treatment effect somewhere between 1 and 3" becomes "treatment effect = 2"). And that's just the uncertainty part of variability. Then there's the kinds of variability you're talking about.
There are a ton of tricks to mitigating this, like reporting confidence intervals instead of point estimates, or giving the right number of digits to communicate the key message, or making super sharable/linkable results to minimize the game of telephone. I'd love someone to write a book on how to communicate this sort of "variability" in a business setting.
In corporate finance something like an ROI is seen as telling the future. I can’t tell you how common it is to revisit something on a “let’s understand why the ROI is off” basis, and have them not understand when I point out that not a single assumption held true. More often than not, aggressive assumptions were used to validate the strategy and gain approvals, and it was always going to fall short. In my role, CYA is a huge deal.
Then there is the problem of how insanely innumerate society at large is, to the point where something like this doesn't register for huge swaths of people. It might work in the context of engineers or people with some higher education, but not in all contexts.
I don't know, it feels like many people understand ranges (but still not all). They don't need to understand what a CI is to get "somewhere between 1 and 3." The barrier I face is that while they understand ranges, they can't take uncertainty.
I'm sad that the author didn't do a manual word count. If it's only 1660 words or so, it would surely take less time than it took to write this post. It would probably make the author confront the difficulty in precisely defining word count.
It turns out that a lot of things that initially seem trivial to precisely define aren't actually that precisely defined, like the length of the California coastline. This is, in my mind--and as a complete tangent--a great argument for wide programming and math education. When you're forced to be so goddamned precise all the time, it's very clear when an idea isn't fully defined.
I diffed down to understand why OpenOffice (1663, consistent with Microsoft Office) differs from Google Docs (1655).
"ago--never": G docs 1 word vs OpenOffice 2 words.
"tiger-lillies--what": G docs 1 word vs OpenOffice 2 words (IDK what this should "really" be)
"Wanting?--Water": GD 3 words vs OO 2 words
In this case the disagreement springs exclusively from whether the docs engine believes that double hyphens make a compound word (and potentially from how it handles punctuation in the middle of such a compound word).
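The divergence can be reproduced with two toy tokenizers. To be clear, these are assumed rules that merely match the observed behavior, not either vendor's actual algorithm:

```python
import re

def count_whitespace(text):
    # Rule A (Google-Docs-like, assumed): only whitespace separates
    # words, so "ago--never" stays a single word.
    return len(text.split())

def count_dashes_split(text):
    # Rule B (OpenOffice-like, assumed): a double hyphen also acts as
    # a word boundary, so "ago--never" becomes two words.
    return len([t for t in re.split(r"\s+|--", text) if t])

sample = "long ago--never to return"
print(count_whitespace(sample))    # 4
print(count_dashes_split(sample))  # 5
```

One line in the source document with a handful of double hyphens is enough to produce the ~8-word spread seen between the two products.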
This is an excellent answer. The original post, insinuating that engineers are demonstrating "engineering mediocrity" or aren't "correct," is lazy for not understanding why this problem is ill-defined.
The irony is that his title is correct - Boring problems need attention.
But the content is jarringly wrong - The lack of attention didn't lie with the engineers of these companies, but with the author themselves.
For ~1600 words, they could have very easily tried to count it and quickly realized that the answers they got weren't wrong - the question was lazy and lacking attention.
Do you mean dictionary words?
Compound words?
Whitespace separated tokens?
What about stand alone text that isn't a word (ex - page numbers? Title numbers?)
Words that the author invented?
Words using non-traditional spacing or grammar? (And maybe a missing space?)
> "tiger-lillies--what": G docs 1 word vs OpenOffice 2 words (IDK what this should "really" be)
I think 3 words.
The tiger lily is a type of lily; the plural is tiger lillies. Fair to say the hyphen is stylistic (and after a quick web search, I can't find any other usage like that).
That aside, I can't believe how simplistic Google Docs word recognition/count engine is.
I mean.. are numbers words? Is a floating ! or - a word? Are hyphenated words 2 or 1? How about when a word is broken over two lines with a hyphen? Is ... a word? Is an unrecognized string a word? Does a figure caption count towards a word count? How about chapter titles? Or page numbers?
Yeah the issue seems to be the concept of a "word" not being precisely defined. But I don't think that's a problem. I'd be concerned if the numbers were dramatically different but they're pretty close so there isn't an issue.
If exact precision is necessary you probably shouldn't rely on imprecise terms like "word".
Agreed. If you are going to base a salary on imprecise concepts, then the pay will be imprecise. Concepts from the real world don't always fit nicely with the strict rules that we attempt to program them in. Trying to solve this problem by changing the constraints may prove to be easier than programming a precise solution around an imprecise concept.
It's also not just edge cases in English or any specific language. The concept of a "word" doesn't exist in some languages. Chinese, for example, only has something comparable that is contextual and not syntactic. So how do you define "word count" in a document that mixes Chinese and English? Ignoring the Chinese characters altogether seems incorrect in the spirit of the metric, and trying to count using English syntax rules will still give you something incorrect in the spirit of the metric.
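A quick way to see the problem (my own toy example, not any editor's real engine): Python's naive whitespace split can't find word boundaries in Chinese text at all.

```python
# Whitespace-based counting collapses for Chinese, where words are not
# separated by spaces. "我喜欢机器学习" is roughly "I like machine
# learning" -- several words, but one unbroken run of characters.
chinese_only = "我喜欢机器学习"
mixed = "我喜欢machine learning"

print(len(chinese_only.split()))  # 1 "word", however you'd segment it
print(len(mixed.split()))         # 2: the Chinese run fuses onto "machine"
```

Real segmenters for Chinese rely on dictionaries and statistics, and two of them can legitimately disagree, which is exactly the "contextual, not syntactic" point above.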
Look at the waveform of speech and you will see long silent gaps inside "words" as well as there frequently being no gap between "words".
There are phrases like "Skinny Puppy" that can do the same job as a word, there are also structures smaller than words that people smush together to make words. The two even work together:
If you see "words" as the molecules of text there will always be an asymptote you can't overcome because segmenting text into words will sometimes introduce errors that you might not be able to recover from.
One step further, I was doing text processing and my very western education taught me that characters were the smallest relevant construct when dealing with text.
Except that's simply not true for many eastern languages. Devanagari, Arabic, Bengali, Thai and so many other scripts are packed with diacritics and graphemes, conditional rendering based on surrounding letters, and much more that forces you to completely rethink your approach.
I don't understand this. The article showed 4 text processors have a different definition of "word". How does that mean "there is no such thing"? Just because no one agrees on a definition for "word" doesn't mean words don't exist.
GP's point is more about how words are spoken. When we hear speech in a language we understand, we parse the sounds as a collection of distinct units with individual meaning we call "words". But the sound wave itself isn't actually segmented in this fashion, and the separation into words occurs in our perception and interpretation. But I agree, words definitely exist, just as interpretations of sound and written characters.
I don't think it has anything to do with our interpretation: the illiterate have no concept of words. Words come from writing systems and are not a part of languages.
German has very long "words" just because they decided not to add lots of spaces when creating a writing system for it. There is no particular reason why it's "Bundesverfassungsgericht" instead of "Bundes verfassungs gericht".
I remember a casual discussion during my PhD about how figuring out the end of a sentence on its own is surprisingly more complex than you'd think:
"I work at the F.B.I. I like it there."
"I work at the F.B.I., I like it there."
"I work at the F.B.I. I like it there!"
It's not as simple as counting periods.
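The examples above can be made concrete. With the F.B.I. sentences, period count is actually anti-correlated with sentence count, and a naive splitter breaks in the middle of abbreviations (the "U.S. Embassy" line is my own added example):

```python
import re

two_sentences = "I work at the F.B.I. I like it there!"   # 2 sentences
one_sentence = "I work at the F.B.I., I like it there."   # 1 sentence
print(two_sentences.count("."), one_sentence.count("."))  # 3 4

# A naive "period followed by whitespace" splitter breaks mid-sentence:
print(re.split(r"(?<=\.)\s+", "I work at the U.S. Embassy. I like it."))
# ['I work at the U.S.', 'Embassy.', 'I like it.']
```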
So that counting words can have corner cases is definitely understandable. Is "&" a word? It is literally just 'e' + 't' superimposed and "et" is definitely a word.
I'd rather see an explanation behind the issue rather than a complaint about people not working on boring stuff because a bunch of text editors can't agree on a word count.
I remember taking a typing class some 25 years ago and being told that a word count is typically every 5 characters. That way someone doesn't pad out their word count by using lots of small words.
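That convention is easy to sketch (the 5-characters-per-word figure is as the parent describes it; the function name is mine):

```python
def typing_word_count(text):
    # Typing-test convention: every 5 characters (spaces included)
    # counts as one standardized "word".
    return len(text) / 5

# Six short dictionary words, but only 4.4 standardized "words":
print(typing_word_count("the cat sat on the mat"))  # 4.4
```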
IMO the issue here is that everything that is not carefully specified as a precise algorithm will be implemented differently.
This is why each browser used to parse HTML differently.
This is why you'd have compat or even security issues because some software used \r\n for newlines splitting while other used \n.
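A small example of that newline pitfall (standard Python behavior, nothing assumed):

```python
# Splitting on "\n" alone leaves stray carriage returns behind when the
# input uses Windows-style "\r\n" line endings:
text = "first\r\nsecond\r\nthird"
print(text.split("\n"))   # ['first\r', 'second\r', 'third']
print(text.splitlines())  # ['first', 'second', 'third']
```

Those invisible trailing `\r` characters are exactly the kind of thing that breaks comparisons and, in parsers that treat line boundaries as security boundaries, causes worse problems.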
Luckily the browser vendors formed WHATWG which created pretty precise specs which are maybe convoluted but at least everyone parses HTML in the same way, and each browser pretends to be every other browser for compatibility.
2021 is really great for web compat, maybe not all browsers implement every API, but existing APIs are accompanied by very thorough test suites (Web Platform Tests):
https://github.com/web-platform-tests/wpt
Having said that, I don't see vendors aligning on a definition of word count anytime soon due to corporate inertia, lack of incentives and the lack of a "Word editors consortium" (or is there one?)
A truly good function for word count might be pretty complex, and perhaps different for every language.
I think a complement to this article is the essay "Reality has a Surprising Amount of Detail"[1]. The point is that things which superficially seem easy often have a lot of hidden complexity when you drill down to all the low-level details and corner cases.
There's a 2% variation between the lowest and the highest. Is there a need to know a precise word count, where a 2% margin of error is not acceptable? If there's not, there's no problem. The author doesn't attempt to define what a word is, so maybe all of these are correct for their own definition of word? Maybe the definition of what counts as a word is the boring problem that needs a solution, but it doesn't look boring or easy to me. You can't ask people to count correctly something that you haven't even defined.
Also, believing that complex problems are easy is not something only programmers do. I currently work a lot with Excel automation, and most people have no idea of what can be automated easily and what can't. I have people coming to ask for automating a task they've never done manually and don't really know how to do precisely. I think that's the same mechanism of "overabstraction" that leads people to say "WET instead of DRY" (Write Everything Twice instead of Don't Repeat Yourself).
The "boring problem" that always needs more attention than it gets is performance. Plenty of companies ship features that only reduce performance by 10,20,50 milliseconds but it all adds up and applications end up feeling very sluggish. Apple is one of the few where I feel like the perf doesn't regress over time (likely because the features don't really expand much either).
I used to work for a machine shop that made lots of stuff - parts for cars, aircraft, medical devices, all very exciting. But the real money maker? Little clamps for holding sections of drill shafts together. Oil extraction companies go through literal tons of them every day and don't give a damn about cost. It was quite an eye opening experience.
I'm always looking for boring ideas, I'm a full stack web dev, and honestly my dream is to have some running side business that's extremely boring, I never seem to find good ol' boring ideas.
The classic advice is to talk to people. Ask your friends in other industries what the most bullshit part of their job is. The most frustrated friend will probably have something. Especially if their answer is “eh, I have this Excel sheet that kinda does the job”.
Work with the brokers (and reinsurers if possible) and go after the carriers. Follow the Zillow model instead of Zenefits.
Edit, more:
Businesses want to work with established P&C brokers, and brokers have ensured (in most states) that they have to anyway. Provide the right data-driven SaaS tool that gives brokers the power to sell/renew better. Offer it for free unless it converts but require reporting conversions, then charge a percentage. You could call up brokers and they would all try it out. Then focus on solving boring, mostly statistical, problems in an elegant way behind the scenes.
The variance is probably due to a difference in what counts as a word, plus ambiguity around whitespace etc. in each text editor. So it's not that the word-counting problem is unsolved; it's a problem of different companies adopting different definitions.