
Help Make the 19th Century Searchable - polm23
https://blog.archive.org/2020/08/21/can-you-help-us-make-the-19th-century-searchable/
======
bloak
Commercially available OCR is amazingly bad. The errors it makes are crazy.
For example, if the top of a 'd' is missing it might read it as an 'a', even
though a human can see that it's identical to all the other 'd's on the page
except that a bit is missing, and that it looks nothing at all like an 'a' _in
any of the fonts used in that book_. Or the OCR sees a random blotch and
guesses it might be a comma, although it's in completely the wrong place and
has the wrong size to be part of the text.

I think perhaps the developers took a wrong turn when they started trying to
improve OCR with language models rather than font models. Humans can
accurately transcribe a _printed_ text without knowing the language. They do
that with a mental model of font metrics and so on. In fact, a human
transcribes more accurately (though more slowly) when they don't know the
language because they don't erroneously "autocorrect".
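
To make the font-model idea concrete, here is a toy sketch of the damaged 'd'
case. It is only an illustration under my own assumptions, not how any real
OCR engine works: the 5x7 bitmaps, the weights, and the names `mismatch_cost`
and `classify` are all made up. The idea is to compare a glyph against
templates taken from clean letters in the same font, and to treat missing ink
(faded or damaged print) as cheaper than extra ink.

```python
# Toy "font model": templates stand in for clean glyphs seen elsewhere on the page.
# All bitmaps, weights, and names here are hypothetical, for illustration only.

TEMPLATES = {
    "d": [
        "....#",
        "....#",
        ".####",
        "#...#",
        "#...#",
        "#...#",
        ".####",
    ],
    "a": [
        ".....",
        ".....",
        ".###.",
        "....#",
        ".####",
        "#...#",
        ".####",
    ],
}

# A 'd' whose ascender (top two rows) failed to print.
DAMAGED_D = [
    ".....",
    ".....",
    ".####",
    "#...#",
    "#...#",
    "#...#",
    ".####",
]


def ink(bitmap):
    """Return the set of (row, col) positions that contain ink."""
    return {
        (r, c)
        for r, row in enumerate(bitmap)
        for c, ch in enumerate(row)
        if ch == "#"
    }


def mismatch_cost(observed, template, extra_weight=3, missing_weight=1):
    """Weighted pixel mismatch between an observed glyph and a template.

    Ink the template does not predict is penalised more heavily than ink
    that is merely absent, since print damage usually removes ink.
    """
    obs, tmpl = ink(observed), ink(template)
    extra = len(obs - tmpl)      # ink present but not expected
    missing = len(tmpl - obs)    # ink expected but not present
    return extra_weight * extra + missing_weight * missing


def classify(observed, templates=TEMPLATES):
    """Pick the template with the lowest mismatch cost."""
    return min(templates, key=lambda name: mismatch_cost(observed, templates[name]))


print(classify(DAMAGED_D))
```

With these made-up bitmaps the damaged glyph is a strict subset of the 'd'
template, so it only pays for two missing pixels, while matching it to 'a'
needs both extra and missing ink.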

I wonder if Google has in-house OCR that works much better than commercially
available OCR. Google has OCR-ed at least 25 million books. You can't download
the complete texts, but you can see snippets. Perhaps someone would like to
publish a paper assessing the quality of Google's OCR and comparing it with
commercial software. (Probably someone has done that already; I'm just bad at
finding papers.)

~~~
kranner
Have you evaluated the latest version of Tesseract (developed by Google)? They
added an LSTM-based OCR engine starting with version 4 and the improvements
over version 3.x are startling.

~~~
lou1306
Also, ocrmypdf [0] is a great Tesseract-based tool which makes it easy to add
OCR layers to raster PDFs. Plus, it takes care of optimizing the resulting
file (compression, etc.). I used it on several old academic papers and have
been pleased with the results.

[0]:
[https://github.com/jbarlow83/OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF)

------
yeoldegeomag
We recently analysed a dataset from an 1859 geomagnetic observatory in Rome
[1].

We found the document (the observatory yearbook written in 1859) via Google
Books. As I don't speak Italian, I initially typed out passages into Google
Translate in order to find the information I needed. Google Books has a view
text option, but the format of the page and font often made it garbled (when
pushed through translate in any case). Decent OCR likely would have made my
life a lot easier.

There is a recent trend in space weather research to study extreme geomagnetic
storms that happened in the 19th and early 20th centuries, aided partly by all
of the scanned documents from that era available on the likes of Google
Books, The Internet Archive and HathiTrust. Better OCR would be a great help.

Although even having access to all of the documents is already incredible!

[1]
[https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/201...](https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019JA027336)

------
tosser0001
Some months back an article was posted here about the Library of Congress’s
newspaper digitization technology:

[https://arxiv.org/abs/2005.01583](https://arxiv.org/abs/2005.01583)

[https://news-navigator.labs.loc.gov/](https://news-navigator.labs.loc.gov/)

What they have done already is great:

[https://chroniclingamerica.loc.gov/newspapers/](https://chroniclingamerica.loc.gov/newspapers/)

I’ve used it quite a bit and they have some of the best OCR results I’ve ever
encountered when it comes to scanned newspapers. But it’s just the tip of the
iceberg. Historical newspapers seem to be one of the largest corpuses of
material yet to be digitized.

~~~
ncarroll
Thank you so much for the chroniclingamerica link. That is seriously cool. I
heard my personal "Dangerous Rabbit Hole" alert when I clicked on it, but what
the heck, it's Saturday. I'm diving in and definitely won't be out before
dinner.

------
8bitsrule
Sounds like a great project. There are many remarkable documents from that
century ... explorers in many realms, science pioneers making great strides.

Re their statement: "What we do not have is a good way to integrate work on
these projects with the Internet Archive’s processing flow. So we need help
and ideas there as well."

... maybe there are still people from the Gutenberg project around; they used
to handle human-transcribed stuff at quite a volume. (Personally I'd
rather type stuff in than 'repair' bad OCR; that constant back-and-forth is
just aggravating. I'd be glad to dig in on something personally interesting,
like say a little-remembered expedition or research in an area of interest ...
they could arrange stuff by subject that way and then put up wish-lists for
volunteers. History is full of forgotten amazings.)

------
zzleeper
I would start by either improving the OCR, or piggyback on GCV/Amzn, who have
better tools. For instance, what if I drag-and-drop their sample image to this
demo page [https://cloud.google.com/vision](https://cloud.google.com/vision) ?

All text is recognized, and as far as I can tell, there are no errors.

There are a lot of difficulties with older text, but I would start with
low-hanging fruit such as trying to use better tools. On top of that, you have
the problem of making sense of the layout, fixing common typos, etc.

~~~
est31
I tried the GCV demo as well and got one mistake: "prevent:" instead of
"prevents" as it is in the text. But that's the only one I could find, which
puts it several categories ahead of the txt file.
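
For anyone who wants to put a number on "several categories ahead", a common
yardstick is the character error rate: edit distance between the OCR output and
a hand-checked reference, divided by the reference length. A minimal sketch
(the function names are my own, and you would need a trusted transcription of
the page to compare against):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute (free if equal)
                )
            )
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text, reference):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# The one mistake from the demo, scored on that single word.
print(edit_distance("prevent:", "prevents"))
print(char_error_rate("prevent:", "prevents"))
```

Run over the whole page rather than one word, the same function would give a
figure you can compare directly between the GCV output and the txt file.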

I'm not sure though whether the IA can get Google to OCR it for them for a
budget they can afford. Likely they'd want OCR solutions that have a one-time
cost, so volume-based SaaS offerings won't work.

------
analognoise
Why not just do a SETI@HOME style thing where you get a slice of a newspaper
article and a human being types it up? Each transcription is compared to one
or two other people typing the same thing up; when they all match, that piece
is done and you move on.
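
Roughly what I have in mind for the matching step, as a sketch only (the
`consensus` function, the whitespace normalization, and the threshold of three
matching copies are my own assumptions, not anything an existing project does):

```python
from collections import Counter


def normalize(text):
    """Collapse runs of whitespace so spacing and line-break differences
    between typists do not count as disagreement."""
    return " ".join(text.split())


def consensus(transcriptions, required=3):
    """Return the agreed text once `required` transcriptions match exactly
    (after normalization), or None if the slice needs more typists."""
    counts = Counter(normalize(t) for t in transcriptions)
    if not counts:
        return None
    text, votes = counts.most_common(1)[0]
    return text if votes >= required else None


submissions = [
    "The  storm began at noon.",
    "The storm began at noon.\n",
    "The storm began at noon.",
]
print(consensus(submissions))
```

A slice that comes back as None would just go back in the queue for another
volunteer.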

I guess that would take a tremendous amount of time, but it would be really
cool to get an old newspaper article to look at, and I bet a lot of people
would find interesting things to talk about.

You could have a "comment thread" for people who added the piece, kind of like
adding a digital layer to history.

~~~
duskwuff
That sounds exactly like the Distributed Proofreaders project, which is part
of Project Gutenberg:

[https://www.pgdp.net/c/](https://www.pgdp.net/c/)

~~~
analognoise
...I love HN. Thank you for pointing me to this.

------
dynamicwebpaige
Valiant effort! Could you potentially post this on
[https://scistarter.org/](https://scistarter.org/)?

------
pjmlp
While the effort is noble, I think it should be applied to all of mankind's
written documents, not only the 19th century's.

~~~
082349872349872
Don't let the perfect be the enemy of the good. XIX English is pretty close to
XXI English, XIV not so much[1]. I make frequent use of
[https://gallica.bnf.fr/](https://gallica.bnf.fr/) but the older the document
the more unfamiliar the dialect. In German the older scripts (let alone XX
forms such as Sütterlin!) are likely to be even less OCR/search friendly.

[1] and good luck with Beowulf:
[http://www.bl.uk/manuscripts/Viewer.aspx?ref=cotton_ms_vitel...](http://www.bl.uk/manuscripts/Viewer.aspx?ref=cotton_ms_vitellius_a_xv_f132r)

~~~
pjmlp
Just like medieval Portuguese isn't like modern Portuguese, and we also have
stuff like Linear B and whatever else has existed for as long as there have
been written documents.

------
dr_dshiv
I love this!

Related to the overall culture tech: most of the scientific revolution
occurred in Neo-Latin, which isn't taught and mostly remains untranslated and
untranscribed. For instance: Descartes' first book, which dealt with music
theory and human emotion.

------
cs02rm0
I can read the text in the snippet without issue, but I clicked through to
some of the full images in the web viewer and couldn't read the text at all. I
tried downloading a few PDFs and still couldn't read them.

Not sure if I'm just not getting the full quality somehow or if the image
quality just isn't there for OCR to ever work?

~~~
ebg13
Text is readable for me in the viewer. Did you use the magnifying glass icons
in the bottom right to zoom in?

~~~
cs02rm0
Ah. I used pinch to zoom on my mobile - it didn't have the magnifying glass
icons. That works, thanks.

------
mseidl
I remember the first time I visited a website in 1856.

------
nomadtwin
Can someone help me understand the greater vision of this project?

I do like the idea and efforts from a technical point of view: tinkering with
OCR on unusual (or old) languages. But that's not the goal of this project as
far as I'm concerned (it's a byproduct?). Archiving every single news entry for
the sake of completion sounds more like obsession than purpose. We're creating
so much information that it will be even harder to separate garbage from
valuable information (you have to spend time reading the useless stuff before
you can justify whether or not it's valuable to you).

Information overload IS a problem, and I don't see the vision in adding more
information to an already saturated ecosystem, but I would like to understand
:)

It almost seems like a hoarding problem but for the digital natives. People
accumulate a lot of stuff but rarely can they actually appreciate what they
possess, as time & perception are very limiting factors.

An article that would shed some light would be highly appreciated.

~~~
luckylion
They're not generating new information though, they're aiming to make old
information available. I'm sure that it's quite helpful for historians to be
able to search & read old newspapers; it gives more details about what people
read about and often establishes a more specific timeline.

We're probably not that good at recognizing which bits will be of interest to
future generations, so archiving everything (to a point ... but I believe that
newspapers are well within reasonable) sounds like a good idea. Plus you never
know what you discover when you make things available.

~~~
cmehdy
Not to mention that from 1800 to 1930 the world population doubled (1 billion
to 2 billion, passing the 1.5 mark just around 1900).

While this pales in comparison with the 6 billion humans added since, that's a
significant change - particularly for most likely available recorded sources -
at a time of monstrous evolutions to major world powers/empires, expansion
into vast new areas of the world (namely North America) of essentially the
British empire (while at the same time the East India Company ceased to exist
by the end of the century for contrast), some abolition of slavery becoming a
reality in places (1833 for the British), and what arguably kickstarted much
of the mental frameworks for our entire lives: the first two industrial
revolutions (for example: democratization of the once-monastic school system
while adopting the year-of-production type of mental model for its promotions).

1804 is the first locomotive. 1859 is The Origin of Species by Darwin. 1861 is
Maxwell's equations. 1869 is Mendeleev's periodic table. And so on and so
forth[0]. Measurement devices also improve in reliability and efficiency,
leading to many of the early recordings we can now look back at when it comes
to the consequences of the explosion of human activity with regards to the
environment.

It's quite a fantastic century to keep a trace of, frankly.

[0]
[https://en.wikipedia.org/wiki/19th_century#Science_and_techn...](https://en.wikipedia.org/wiki/19th_century#Science_and_technology)

