
(Can't rewatch the video right now as daughter is sleeping and I've misplaced my headphones, but found in my G+ backup a post I'd made about it in 2012; reproducing it below as it's no longer on the internet anywhere. Sorry to reply to myself; made this a separate comment because it's rather long for HN.)

Sanskrit and OCR: A talk by Daniel Ingalls and his son (also) Daniel Ingalls.

The OCR part starts at around 28:00.

Apparently in 1980 a one-programmer hobby project achieved 99.5% accuracy on manually typeset Devanagari (typeset in India)! Gives one hope that it should be possible to write good OCR for Indian languages today.

----

Daniel Ingalls was Professor of Sanskrit at Harvard. He was one of the few Western Sanskritists (IMO) to understand and appreciate Sanskrit poetry on its own terms, by its own standards. He translated the Subhashita-ratna-kosha (excerpts were also published as Sanskrit Poetry from Vidyākara's Treasury, which has a good introduction) and Anandavardhana's Dhvanyaloka with the commentary of Abhinavagupta. He worked on Indian philosophy/logic too (navya-nyāya). A large number of today's prominent American Indologists were his students, including Sheldon Pollock, Robert Thurman, Gary Tubb, and Wendy Doniger.

His son Dan Ingalls was a computer scientist at Xerox PARC and is the main creator (along with Alan Kay) of Smalltalk, from which the discipline of object-oriented programming arose. Apparently he's also responsible for context menus, and something called "bit blit" (which he indeed mentions in this talk).

Quotes:

(30:00) "This is the text that we're working with, and here's a typical page. And this is just to show you that a significant part of the problem is simply locating the lines of text on the page. You've got page headings which you don't care about, and here comes the text that you do care about — it's in two-column format — and there's this little [?] down here and then commentary below that. And in addition just to make things interesting, there's these little squiggles here under things and they relate to footnotes on the page... Now the two-column format also is worse because you get these breaks between chapters where this is column 1 and it completes there, and then the next chapter begins here, column 1 and column 2. So you have to deal with all that page layout. But that's, I mean, you just do it and then it's done."

(33:00) "The page may not be perfectly horizontal. Or even if it is, the type probably won't be because these are all pretty much manually typeset."

(35:00) "It's interesting: you learn about the actual pieces of type that they have."

(39:40) [Handling of matras.] "This part of the program is very heuristic. I did it in a hurry and it's not actually done uniformly, but it does the correct thing."

(41:45) "This is a bha and this is a ma. ... They look almost identical. And they can look identical... or at least you can't tell them apart."

(40:00) "It's really been fun doing this because, you know, computers inside 'em don't have any idea what you're doing. And I'm in that relationship to this project."

:D

His method was something like this:

* Identify horizontal lines of text (by looking at the total blackness of the pixels at each height)

* Within each line, identify words (by looking for gaps between words)

* Within each word, identify bounding boxes for individual glyphs (similar)

* For individual glyphs:

  Training: For each character, take all its variations that appear in the text and find its "skin" (the OR of all the variations) and "bones" (the AND of all the variations).

  Recognition: A known character is a bad match for a glyph-to-be-recognised if the glyph has a dot where the skin doesn't, or has no dot where the bones do. Did not worry about rotations! The known character with the lowest badness wins. (Some reward for good bits also.)

* Did the boxing (identifying a glyph) only up to the baseline, because the under-baseline matras can get kerned below the next letter, which interferes with keeping glyphs separate, etc. Turned out to be enough.

* Always keep in mind that 99.5% is good enough; cuts out a lot of complexity.
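
To make the steps above concrete, here is a rough Python sketch of how I understand them. It is not his code: all the function names are made up, bitmaps are plain lists of 0/1 rows (1 = ink), every sample of a character is assumed to be already cropped to the same size as the glyph being recognised, and the "reward for good bits" is left out.

```python
from typing import Dict, List, Tuple

Bitmap = List[List[int]]  # rows of 0/1 pixels, 1 = ink (an assumed representation)


def find_runs(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where profile > threshold."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a dark band begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the band ended at a gap
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def text_lines(page: Bitmap) -> List[Tuple[int, int]]:
    """Locate lines of text from the total blackness of each pixel row."""
    return find_runs([sum(row) for row in page])


def train(samples: List[Bitmap]) -> Tuple[Bitmap, Bitmap]:
    """Build (skin, bones) from same-sized samples of one character."""
    height, width = len(samples[0]), len(samples[0][0])
    # skin: OR of all variations; bones: AND of all variations
    skin = [[int(any(s[y][x] for s in samples)) for x in range(width)]
            for y in range(height)]
    bones = [[int(all(s[y][x] for s in samples)) for x in range(width)]
             for y in range(height)]
    return skin, bones


def badness(glyph: Bitmap, skin: Bitmap, bones: Bitmap) -> int:
    """Count the pixels of glyph that contradict the skin/bones model."""
    score = 0
    for g_row, s_row, b_row in zip(glyph, skin, bones):
        for g, s, b in zip(g_row, s_row, b_row):
            if g and not s:        # a dot where the skin has none
                score += 1
            elif b and not g:      # no dot where the bones have one
                score += 1
    return score


def recognise(glyph: Bitmap, models: Dict[str, Tuple[Bitmap, Bitmap]]) -> str:
    """Return the name of the known character with the lowest badness."""
    return min(models, key=lambda name: badness(glyph, *models[name]))
```

The same `find_runs` helper would work on column sums within a line to split it into words and glyph boxes, with a larger gap width for words; that part is omitted here.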



