
An OCR cliche: Into his/her anus (2009) - userbinator
https://wraabe.wordpress.com/2009/03/07/an-ocr-cliche-into-hisher-anus/
======
reaperducer
_The Wesleyan-Methodist Magazine‎ – Page 433 “carried this child in his anus
to Derry”_

I guess so the child could smell the Derry air.

~~~
EForEndeavour
I want to believe that the secret purpose of Wesleyan-Methodist Magazine was
to set up the reading machines of a future civilization for this pun.

------
TipVFL
This reminds me of an eBook of Neuromancer that I read; it was occasionally
missing the letter f. For the most part I just added it back mentally without
really thinking about it, but then sometimes I hit a passage like this: "He
turned, pulled his jacket on, and licked the cobra to full extension." That
one took a moment.

~~~
jobigoud
Probably the original text was using ligatures for fi and fl and they got lost
in conversion.

[https://en.wikipedia.org/wiki/Typographic_ligature#Stylistic...](https://en.wikipedia.org/wiki/Typographic_ligature#Stylistic_ligatures)

~~~
superkuh
Yup. I have to manually detect and correct for all the possible ligatures in
all possible unicode in my text to speech pre-processor scripts. I _hate_
them.
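
For anyone facing the same chore, here is a minimal sketch in Python. It assumes the only ligatures that matter are the Latin ones in the Unicode Alphabetic Presentation Forms block (U+FB00 to U+FB06); `expand_ligatures` is a hypothetical helper name, not something from a library.

```python
import unicodedata

# Latin ligatures in the Alphabetic Presentation Forms block:
# ff, fi, fl, ffi, ffl, long-s-t, st (U+FB00 .. U+FB06).
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}

def expand_ligatures(text: str) -> str:
    """Replace each Latin ligature character with its plain-letter expansion."""
    return "".join(
        # NFKC applies the compatibility decomposition, e.g. U+FB01 -> "fi".
        unicodedata.normalize("NFKC", ch) if ch in LIGATURES else ch
        for ch in text
    )

print(expand_ligatures("e\ufb03cient space-time trade-o\ufb00s"))
```

Running NFKC over the whole string would also expand the ligatures, but it rewrites other compatibility characters too (superscripts, full-width forms), which may or may not be wanted in a text-to-speech pipeline.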

~~~
ahazred8ta
"this gives us e&#64259;cient space-time trade-o&#64256;s" :-(

~~~
dotancohen
Those are HTML entities. Most modern programming languages come with tools to
decode this, e.g. in Python:

    text = urllib.parse.unquote(text)

~~~
jwilk
urllib.parse.unquote() is unrelated to HTML. It undoes URL-encoding:

[https://docs.python.org/3/library/urllib.parse.html#urllib.p...](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.unquote)

In Python ≥ 3.4, you can use html.unescape() to decode HTML entities:

[https://docs.python.org/3/library/html.html#html.unescape](https://docs.python.org/3/library/html.html#html.unescape)
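
A short sketch of the difference, using the string from upthread; the variable names and the URL-encoded sample are made up for illustration:

```python
import html
import urllib.parse

entity_text = "e&#64259;cient space-time trade-o&#64256;s"  # HTML numeric character references
url_text = "space-time%20trade-offs"                        # percent-encoded, as in a URL

# html.unescape() turns character references into the characters they name
# (here, the ffi and ff ligature code points).
decoded_entities = html.unescape(entity_text)

# urllib.parse.unquote() only undoes %xx escapes.
decoded_url = urllib.parse.unquote(url_text)

# Applied to entity text, unquote() finds no %xx escapes and changes nothing.
untouched = urllib.parse.unquote(entity_text)

print(decoded_entities)
print(decoded_url)
print(untouched == entity_text)
```

Note that the decoded result still contains ligature characters, so it would need a separate step to turn them into plain letters.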

~~~
dotancohen
You are 100% correct. I mixed the two encodings up. Thanks.

------
mrob
Another example:

[https://books.google.com/ngrams/graph?content=fuck&year_star...](https://books.google.com/ngrams/graph?content=fuck&year_start=1600&year_end=2000&corpus=15&smoothing=3&direct_url=t1%3B%2Cfuck%3B%2Cc0)

If you don't know what's going on it looks like the word "fuck" was more
common in the 17th century than today. But actually it's the word "suck"
written with a long s ("ſuck"), which you can see is easily OCRed incorrectly.

------
projectramo
After you guys fix this with Markov chains or whatever, I look forward to
reading: the proctologist was thorough but found no sign of blockage in her
arms.

~~~
sthgrau
That would be a clbuttic mistake, indeed.

------
gumoro
Reminds me of "Don't kick a man when he's clown"; Google finds 2 PDFs with
this, due to bad OCR:

[https://www.google.com/search?q="Don't+kick+a+man+when+he's+...](https://www.google.com/search?q="Don't+kick+a+man+when+he's+clown")

Credit:
[https://twitter.com/ObeyComputer/status/1050131788830560258](https://twitter.com/ObeyComputer/status/1050131788830560258)

------
HocusLocus
At the family printshop my mother experienced some typing glitches: she would
sometimes type "simulate" instead of "stimulate" and "muck" instead of "mark".
This led to two disasters, which required us to stop the presses (the pressman
was a great proofreader!).

In a political brochure: "...will introduce bills to simulate progress..." In
a funeral booklet: "...he left his muck upon us all."

She asked me to go into the computer's master dictionary and patch it to disable
the words 'simulate' and 'muck' so it would bring these mistakes to her
attention.

~~~
pxtail
I'm pretty sure that for many political brochures these words could be used
interchangeably

------
qwerty456127
OCR-ed texts should really be proof-read before being published. Pirates
usually do this, that's funny Google doesn't. Also Markov chains can help by
highlighting unusual word combinations, I doubt anal children occur often in
correct texts.
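
A toy sketch of what I mean, assuming a first-order (bigram) model trained on some trusted text. The corpus here is a made-up one-liner and the function names are hypothetical; a real system would need a large corpus and smoothing.

```python
from collections import Counter

def train_bigrams(corpus: str) -> Counter:
    """Count adjacent word pairs in a trusted reference corpus."""
    words = corpus.lower().split()
    return Counter(zip(words, words[1:]))

def flag_unusual(text: str, bigrams: Counter, min_count: int = 1) -> list:
    """Return word pairs from OCR output seen fewer than min_count times in training."""
    words = text.lower().split()
    return [pair for pair in zip(words, words[1:]) if bigrams[pair] < min_count]

# Tiny stand-in for a real corpus (an assumption for illustration only).
corpus = "she carried the child in her arms he held the baby in his arms"
bigrams = train_bigrams(corpus)

print(flag_unusual("he held the child in his anus", bigrams))
```

This only highlights candidates for a human to look at; it does not decide which reading is right.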

~~~
reaperducer
_Pirates usually do this, that's funny Google doesn't._

Pirates use cheap labor to solve the problem.

Google's approach (and Facebook and Twitter and...) is to see every problem as
solvable through an algorithm.

If this approach worked, we wouldn't have so many errors in published OCR'ed
documents. Or social media tearing the world apart, for that matter.

~~~
qwerty456127
> Pirates use cheap labor to solve the problem.

Really? They usually do it themselves for free AFAIK.

~~~
Finnucane
That's pretty cheap.

------
tyingq
Apparently, "feces" where "faces" should be is a thing as well.

_"feces sticking out of large pipes, looking hungrily at the camera"_

[https://archive.org/stream/The-Colonel-Who-Would-Not-Repent/...](https://archive.org/stream/The-Colonel-Who-Would-Not-Repent/The%20Colonel%20Who%20Would%20Not%20Repent%20-%20The%20Bangladesh%20War%20and%20Its%20Unquiet%20Legacy%20-%20Salil%20Tripathi_djvu.txt)

~~~
userbinator
I like how if you read it out of context, the first part of that sentence
seems perfectly fine in something like a text about sewage; and then the
second part catches you by surprise.

------
tilt_error
I have a work injury, acquired from 30+ years of coding :)

As a youngster, I read a lot and at great speed, but after I started coding my
read speed dropped dramatically. Attention to detail while writing or reading
code seems to have re-wired my brain for accuracy instead of speed ;)

What jumps at me in the article is not the misinterpretation of arms, which
results in funny but somewhat "working" language, but rather "[..] hitting
lier feet against stones" where the 'h' is interpreted as 'li'. That brought
me to a full stop.

Also makes me think of "kerning" vs. "keming" :)

~~~
taneq
> Also makes me think of "kerning" vs. "keming" :)

The game Path of Exile once had a line in the patch notes which simply said
"Fixed keming." Made my day.

~~~
chrisweekly
A designer at my former workplace had a full-zip hoodie. To one side of the
zipper: "Ker", to the other: "ning".

------
billfruit
I am reminded of the anecdote of one of the first mass printings of the Bible
in London in the 1600s having a grave misprint: "Thou shalt commit
adultery".

Proofreading is as important as ever.

~~~
crtasm
And now I'm thinking of Rimmer's parents in Red Dwarf. Devout seventh day
advent hoppists, due to a missing letter in "...and the greatest of these is
hope."

------
dfboyd
The Kindle version of "A Game Of Thrones" (first book in the series) has
"Dome" everywhere instead of "Dorne" (the name of the kingdom in the south).
Apparently it was OCR'ed from the printed book.

~~~
kzrdude
How is that acceptable past a month of its release, is nobody correcting it?

~~~
iiiggglll
Welcome to the 21st century, where quality, accuracy, and precision are
sacrificed at the altar of "scale".

------
visarga
Funny

The arms/anus confusion should be fixed with a language model on top of the
letter predicting network.

~~~
duskwuff
Without a very long-range model I don't think that would help. "in his/her
anus" and "in his/her arms" can both be correct in the right circumstances; it
takes quite a bit of surrounding context to tell which one is more likely.
(While doing some research in Google Books I even found a couple that looked
like OCR errors until I read beyond the search snippet.)

~~~
abecedarius
How does OCR software integrate letter and language models? Do they first make
a best guess at the letters and then try to correct it with the language
model?
[https://en.wikipedia.org/wiki/Optical_character_recognition#...](https://en.wikipedia.org/wiki/Optical_character_recognition#Techniques)
gives me that impression, but I'm not sure.

Brains are said to have a lot of feedback from higher levels of sensory
processing to lower. Maybe you don't need as good a language model if its
evidence is integrated more tightly with the rest.
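
To make the "integrated" idea concrete, here is a sketch of one simple scheme: score each candidate word by the recognizer's confidence and the language model's probability together, rather than picking the best-looking letters first. I don't know that any particular OCR engine works this way; the function name and all the probabilities below are invented for illustration.

```python
import math

def rescore(recognizer_probs: dict, context_probs: dict, lm_weight: float = 1.0) -> str:
    """Pick the candidate maximizing log P(image | word) + lm_weight * log P(word | context)."""
    def score(word: str) -> float:
        # Unseen words get a tiny floor probability instead of zero.
        lm_prob = context_probs.get(word, 1e-9)
        return math.log(recognizer_probs[word]) + lm_weight * math.log(lm_prob)
    return max(recognizer_probs, key=score)

# Made-up numbers: the shapes look slightly more like "anus",
# but the surrounding words make "arms" far more likely.
recognizer_probs = {"anus": 0.55, "arms": 0.45}
context_probs = {"anus": 0.001, "arms": 0.2}

print(rescore(recognizer_probs, context_probs))                 # language model tips the balance
print(rescore(recognizer_probs, context_probs, lm_weight=0.0))  # recognizer alone
```

With `lm_weight` at 0 the recognizer's guess stands; raising it lets context override weak visual evidence, which is also where a short-range model could go wrong in the cases duskwuff mentions.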

------
robin_reala
After stopping laughing I went back and checked the Standard Ebooks corpus to
see if any of this mistake had slipped through; luckily it seems that in the
intervening 9 years someone at Gutenberg and / or archive.org has corrected
this particular issue in the source transcriptions.

~~~
mcguire
Gutenberg is designed to avoid this sort of thing, although some slip through:
originally, they didn't use OCR and now they use the distributed proofreader
thing.

Archive.org is kind of a mess, though.

~~~
robin_reala
Yeah, I usually submit about 10-15 corrections to Gutenberg per book I proof;
generally they’re in good shape. The bigger problem with Gutenberg is that
older books omit all accents, which is a huge problem for whole series of books.
I’ve been trying to produce Maurice Leblanc’s series of Arsène Lupin stories
for Standard Ebooks and Gutenberg generally spells the titular protagonist’s
name wrong.

------
robbrown451
You'd think that the OCR process would somehow call attention to words that
have a high probability of being wrong and especially of being wrong in a
problematic way. You don't want to require humans to read and sign off on
everything, but with something like that, it shouldn't be that hard to have
something that is very quick for a human to see the scanned image and compare
it to the transcription, simply on the basis of the word "anus" being in
there.
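
Something as crude as this would do as a first pass. It is only a sketch: the watchlist contents and the function name are assumptions, and the offsets are meant to let a review tool jump to the matching spot in the scan.

```python
import re

# Hypothetical watchlist; a real one would be curated from known OCR mishaps.
WATCHLIST = {"anus", "feces"}

def words_to_review(text: str, watchlist=WATCHLIST) -> list:
    """Return (character offset, word) pairs a human should check against the scanned image."""
    return [
        (match.start(), match.group())
        for match in re.finditer(r"[A-Za-z]+", text)
        if match.group().lower() in watchlist
    ]

print(words_to_review("carried this child in his anus to Derry"))
```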

~~~
lathiat
I was reading “Creative Selection” by Ken Kocienda last week. It goes behind the
scenes of him designing the iPhone keyboard early in its development (good
read).

In any case he mentioned there is a hate word dictionary specifically so that
the autocorrect never suggests such words even if they seem to be a close
match. You basically have to type those words perfectly.

In another related bug, Xerox document centres, which weren’t even technically
doing OCR, were changing numbers from one thing to another in scanned IMAGES
because high-level compression was substituting digits - much more dangerous!
[https://www.theregister.co.uk/2013/08/06/xerox_copier_flaw_m...](https://www.theregister.co.uk/2013/08/06/xerox_copier_flaw_means_dodgy_numbers_and_dangerous_designs/)

------
billfruit
What is the article saying about 'pertistent' vs 'persistent'? Is that a
word? What is its meaning?

~~~
IshKebab
It's not a word. Either it's a pun on 'pert' (impossible to tell without
context) or it was a typo in the original (seems more likely).

~~~
Alex3917
> Either it's a pun on 'pert' (impossible to tell without context) or it was a
> typo in the original (seems more likely).

Haven't read the original, but it was probably meant to add character to the
way the character speaks. Either to make fun of the character for not being
able to pronounce words correctly, or to make them more pitiable, or just as a
matter-of-fact detail.

------
TazeTSchnitzel
Reminds me of how when xkcd looked at which days of the month were most
common, the 1st, 10th, 11th, 21st and 31st were more or less common than they
should have been due to OCR error: [https://drhagen.com/blog/the-missing-11th-of-the-month/](https://drhagen.com/blog/the-missing-11th-of-the-month/)

------
yesenadam
This online version of Miles Davis' Autobiography features a character called
"dark Terry" i.e. Clark Terry.

[http://yanko.lib.ru/books/bio/miles.htm](http://yanko.lib.ru/books/bio/miles.htm)

------
drcongo
Imagine how Clint Eastwood feels.

[https://www.google.com/search?tbm=bks&q=cunt+eastwood](https://www.google.com/search?tbm=bks&q=cunt+eastwood)

------
DoreenMichele
Given the examples in the piece involving children, I wonder if there is any
danger of this resulting in a problem where a site gets accused of child
pornography or gets blocked because of it sounding so wildly inappropriate or
something.

~~~
Tharkun
I don't know, children do have anuses, and they are known for their curiosity.
I'm sure many a parent has had to dig lego bricks out of various orifices.

~~~
esrauch
Children have a few things that depicting or discussing would result in being
blocked from most schools, even when stopping well short of pornographic
depictions.

