This is not interesting at all because the hard part is to actually segment and classify characters in a real text, and in real life they don't conveniently come as 28x28 images. As someone with a few hours of training in reading that kind of text, I'd say the difficulties are:
- telling apart characters (hentaigana) that look very similar
- segmenting characters because size and length vary a lot
- differentiating between kana and kanji, because the former are derivatives of the latter and they sometimes look quite alike
So the experimenter should have opened his post by saying he didn't know any Japanese instead of closing with it (even if one can guess as much from the fact that he only uses modern kana as classes).
I mean, if you write out "minimum" in cursive, someone who learned English exclusively from printed text might think it was a scribble. But someone familiar with cursive decodes the whole word in a chunk, partly from understanding how Latin letters get transformed into cursive, partly from having seen "minimum" written in cursive before, and maybe also from using nearby words for context. There's a roughly similar thing going on here.
Now, imagine some text in medieval English in a doctor's handwriting. That would be basically indecipherable for most people.
The same is true of many Edo-period (and older) writings, which are written as "scribbles" and use old vocabulary and grammar that most people don't know. They are essentially unreadable to most Japanese, except for trained specialists (e.g. historians).
Comments like this make me reluctant to post things I work on.
I have no problem with people doing fun projects and explaining how they did them. I have a bit of a problem with people apparently making exaggerated claims about their product's capabilities.
This isn't written by a Wolfram employee. It seems to be just someone's fun and educational project, posted on a Wolfram message board.
I mean, it's a fine article for someone interested in how to train a classifier on an annotated dataset in Mathematica. It's just not very interesting for someone interested in how to classify Japanese characters from the Edo period.
I have a collection of 1500 fonts, and finally exported the PNGs for all 75,000 characters. Now I need to pad, crop, and scale them to 28x28 (or 32x32, or 64x64, or another resolution).
Then I want to do the Machine Learning (Classify) step.
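For the pad/crop/scale part, I think something like this should work in Wolfram Language (using "glyphs" as a placeholder for my export folder): ImageCrop trims uniform borders, ImagePad adds a small white margin, and ImageResize scales everything to the target size.

    (* "glyphs" is a placeholder for the export folder *)
    files = FileNames["*.png", "glyphs", Infinity];
    prep[f_] := ImageResize[
      ImagePad[ImageCrop[Import[f]], 4, White], (* trim borders, then pad a margin *)
      {28, 28}]                                 (* scale to 28x28 *)
    images = prep /@ files;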
The article doesn't go into any detail about how to install Classify, how to import the training and testing data, and how to then actually run it. I watched the videos from CS231n because of my boss, but again, I'm still not really sure what to do practically.
If I have lots of folders of images, what should I do to build an OCR program?
Documentation for Classify: http://reference.wolfram.com/language/ref/Classify.html
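For the "lots of folders of images" case, here's a minimal sketch, assuming one subfolder per character class (a hypothetical images/<label>/*.png layout). There's nothing to install; Classify ships with Mathematica.

    (* hypothetical layout: images/<label>/*.png, one folder per class *)
    dirs = Select[FileNames["*", "images"], DirectoryQ];
    data = Flatten[Table[
        Import[f] -> FileNameTake[dir],          (* image -> class label *)
        {dir, dirs}, {f, FileNames["*.png", dir]}]];
    {train, test} = TakeDrop[RandomSample[data], Floor[0.8 Length[data]]];
    c = Classify[train];                         (* train the classifier *)
    ClassifierMeasurements[c, test, "Accuracy"]  (* held-out accuracy *)

Classify picks a method automatically; for thousands of image classes you'd probably want an explicit neural net, but this is enough for a baseline.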
As far as I could see, they listed all the Wolfram Language code needed to set it up in the example.
- put the images (as pixel arrays) in an array called X
- put the labels (character codes, one-hot encoded) in an array called y
- use a convolutional neural network to learn the mapping X->y (training)
- send new images to the network to make predictions
That's how you train all neural nets in supervised learning.
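In Wolfram Language terms, and assuming train is the image -> label list from the sketch above and classes is the list of distinct labels (both placeholder names), that recipe might look like this; the layer sizes are arbitrary, and NetTrain handles the one-hot encoding internally via the "Class" decoder.

    (* a small LeNet-style network; layer sizes are arbitrary *)
    net = NetChain[{
        ConvolutionLayer[32, {3, 3}], Ramp, PoolingLayer[2, 2],
        ConvolutionLayer[64, {3, 3}], Ramp, PoolingLayer[2, 2],
        FlattenLayer[], LinearLayer[Length[classes]], SoftmaxLayer[]},
      "Input" -> NetEncoder[{"Image", {28, 28}, "ColorSpace" -> "Grayscale"}],
      "Output" -> NetDecoder[{"Class", classes}]];
    trained = NetTrain[net, train];   (* supervised training on X -> y *)
    trained[Import["new_glyph.png"]]  (* predict a class for a new image; file name illustrative *)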
As my sibling poster mentioned, these are generally research-level tasks. Most people working on this are either writing papers on very particular issues related to OCR, or they are working in industry, where companies don't open-source their code because it is expensive to produce and/or built on custom datasets they don't want to give away. There might be some tutorials somewhere, though.
Source: previously worked on OCR systems for European and East Asian languages.
But if you are trying to do OCR on complete texts, maybe even handwritten ones, you get into research-level problems very quickly.
Both are supported by Pingtype.
Sorry that it's slow to load; the fonts have to be downloaded from the server every time because I can't be sure they're on the user's device. The default font is 12MB. Once it's loaded, you can re-translate much faster; it's just the first run that's slow.