I guess so the child could smell the Derry air.
import urllib.parse
text = urllib.parse.unquote(text)  # undoes percent-encoding (%20 -> space)
In Python ≥ 3.4, you can use html.unescape() to decode HTML entities:
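A minimal sketch, assuming text holds entity-escaped input:

import html
text = html.unescape(text)
print(html.unescape("&lt;b&gt;bold&lt;/b&gt;"))  # <b>bold</b>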
At work I sometimes have to copy blocks of text from a PDF into another document. If I do it with Preview, I lose the fi and fl ligatures. It only happens with PDFs created in-house, so I guess it's some kind of stylistic thing that comes from the guy who lays out the PDFs.
I eventually learned to use Adobe's own Acrobat instead, and it works fine.
If Preview can do this automatically, please don't change that feature.
Preview for some reason just drops it entirely.
In Acrobat, the fl ligature comes out as the separate letters f and l, adjacent.
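When the extracted text does keep the ligature codepoint (rather than dropping it the way Preview does), Unicode compatibility normalization will split it into plain letters; a minimal sketch:

import unicodedata
s = "ﬁle under ﬂuid"                         # contains U+FB01 (fi) and U+FB02 (fl)
print(unicodedata.normalize("NFKC", s))      # file under fluid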
> flicked the cobra to full extension
The cobra is a weapon in the universe of Neuromancer, something like an extendable knife/club.
If you don't know what's going on, it looks like the word "fuck" was more common in the 17th century than today. But actually it's the word "suck" written with a long s ("ſuck"), which you can see is easily OCRed incorrectly.
In a political brochure: "...will introduce bills to simulate progress..."
In a funeral booklet: "...he left his muck upon us all."
She asked me to go into the computer's master dictionary and patch it to disable the words 'simulate' and 'muck' so that it would bring these mistakes to her attention.
Pirates use cheap labor to solve the problem.
Google's approach (and Facebook and Twitter and...) is to see every problem as solvable through an algorithm.
If this approach worked, we wouldn't have so many errors in published OCR'ed documents. Or social media tearing the world apart, for that matter.
Really? They usually do it themselves for free AFAIK.
Just because 50% of the internet is porn doesn't mean you should use a 50% porn corpus to train your Markov chains.
Maybe you'd have to worry about graffiti or spam, though. A git PR model would be fine for low-traffic situations, and maybe there's something similar that scales to higher traffic.
I think what I want is something that allows for technical improvement while maintaining authorial intent and "ownership" (in a conceptual if not legal sense) without optimising for consensus-gathering.
It's also not everywhere. Maybe what I'm after is a browser extension or something...
"feces sticking out of large pipes, looking hungrily at the camera"
As a youngster, I read a lot and at great speed, but after I started coding my reading speed dropped dramatically. Attention to detail while writing or reading code seems to have re-wired my brain for accuracy instead of speed ;)
What jumps out at me in the article is not the misinterpretation of arms, which results in funny but somewhat "working" language, but rather "[..] hitting lier feet against stones", where the 'h' is interpreted as 'li'. That brought me to a full stop.
Also makes me think of "kerning" vs. "keming" :)
The game Path of Exile once had a line in the patch notes which simply said "Fixed keming." Made my day.
Proofreading is as important as ever.
The arms/anus confusion should be fixed with a language model on top of the letter-predicting network.
Brains are said to have a lot of feedback from higher levels of sensory processing to lower. Maybe you don't need as good a language model if its evidence is integrated more tightly with the rest.
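A minimal sketch of that rescoring idea, with made-up candidates, confidences, and a toy unigram table standing in for real recognizer output and a real language model:

import math

# Hypothetical per-word OCR candidates with recognizer confidences.
ocr_candidates = [
    [("wrapped", 0.9)],
    [("her", 0.8), ("lier", 0.2)],
    [("arms", 0.4), ("anus", 0.6)],  # the letter recognizer alone picks the wrong word
]

# Toy unigram language model: log relative frequencies (invented numbers).
unigram_logp = {"wrapped": math.log(1e-5), "her": math.log(1e-2),
                "lier": math.log(1e-9), "arms": math.log(1e-4),
                "anus": math.log(1e-7)}

def rescore(candidates, lm_weight=1.0):
    # Per token, pick the candidate maximizing recognizer score + weighted LM score.
    unknown = math.log(1e-12)
    return [max(cands, key=lambda wc: math.log(wc[1])
                + lm_weight * unigram_logp.get(wc[0], unknown))[0]
            for cands in candidates]

print(rescore(ocr_candidates))  # ['wrapped', 'her', 'arms']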
Archive.org is kind of a mess, though.
In any case, he mentioned there is a hate-word dictionary, specifically so that the autocorrect never suggests such words even if they seem to be a close match. You basically have to type those words perfectly.
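A minimal sketch of that rule, with hypothetical names, on top of a plain edit-distance autocorrect: blocklisted words are never offered as corrections, but an exact match is left alone:

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def suggest(word, vocabulary, blocklist, max_distance=1):
    if word in vocabulary:
        return word  # typed perfectly: pass through, blocklisted or not
    candidates = [w for w in vocabulary
                  if w not in blocklist and edit_distance(word, w) <= max_distance]
    return min(candidates, key=lambda w: edit_distance(word, w), default=word)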
In another related bug, Xerox document centres which weren't even technically doing OCR were changing numbers from one thing to another in scanned IMAGES, due to aggressive JBIG2 compression substituting visually similar glyphs. Much more dangerous! https://www.theregister.co.uk/2013/08/06/xerox_copier_flaw_m...
Haven't read the original, but it was probably meant to add colour to the way the character speaks. Either to make fun of the character for not being able to pronounce words correctly, or to make them more pitiable, or just as a matter-of-fact detail.
> If another copy from the same edition has the error corrected, such cues may help to identify early and late printings and contribute to a more comprehensive account of the book’s printing history.
In other words, when transcribing books you want to preserve misspellings that occur in the source text.
It’s actually quite interesting, because it means that automatic spellchecking of OCRed text, while helping to improve the quality of the transcript, could also introduce unwanted corrections. But doing what the OP did and comparing their transcripts with those of Google Books was clever.
But the error does fall into the deliberate rather than unwitting category described here: https://sites.ualberta.ca/~sreimer/ms-course/course/scbl-err...
The transcriber's error is unwitting; he specifically comments on the fact that he didn't want to make it.
Probably my favourite quote from the Inbetweeners.