
Classifying Japanese characters from the Edo period - soofy
http://community.wolfram.com/groups/-/m/t/1221098
======
titanix2
If I understand correctly, the guy downloaded a data set, trained on it, and
was happy because it performs over 95% on the training set from the data set.

This is not interesting at all because the hard part is to actually segment
and classify characters in a real text, and in real life they don’t
conveniently come as 28x28 images. As someone with a few hours of training in
reading that kind of texts the difficulties are:

\- telling apart characters (hentaigana) that look very similar

\- segmenting characters because size and length vary a lot

\- differentiating between kana and kanji, because the former are derivatives
of the latter and they sometimes look quite alike

So the experimenter should have opened his post by saying he didn't know any
Japanese instead of closing with it (even if one can notice it when he just
uses modern kana as classes).

~~~
reustle
> This is not interesting at all because...

Comments like this make me reluctant to post things I work on

~~~
mikekchar
The OP's comment is very good because the discussion in the article is quite
misleading about what they accomplished. It's also apparently self-serving: it
reads like marketing material for Wolfram. As the sibling post makes
extremely clear, classifying handwritten text from the Edo period is a much,
much, much more difficult problem. In fact, the samples they have taken for
training are so unstylized that they might just as well have done an article
about recognising modern hiragana handwriting. That would be very cool too,
but it would have gathered no interest because it is a well-solved problem.

I have no problem with people doing fun projects and explaining how they did
them. I have a bit of a problem with people apparently making exaggerated
claims about their product's capabilities.

~~~
greeneggs
> I have no problem with people doing fun projects and explaining how they did
> them. I have a bit of a problem with people apparently making exaggerated
> claims about their product's capabilities.

This isn't written by a Wolfram employee. It seems to be just someone's fun
and educational project, posted on a Wolfram message board.

------
pornel
The training set contains errors. The に row is especially messy: it's full of
小, ふ, and み.

~~~
ramchip
Aren't these hentaigana?

[http://www.book-seishindo.jp/kana/onjun_2.html#ni](http://www.book-
seishindo.jp/kana/onjun_2.html#ni)

~~~
pornel
Ahh, I see. These aren't supposed to be visual variants of に specifically;
they're just any old forms or arbitrary other characters that were used in
place of に (sort of like "V" being classified as "u" because of Roman texts).

------
peterburkimsher
I want to do the same for Chinese.

I have a collection of 1500 fonts, and finally exported all the PNGs for
75,000 characters. Now I need to pad, crop, and scale to make 28x28 (or 32x32,
or 64x64, or another resolution).
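The pad-and-scale step can be sketched with numpy alone (a minimal sketch
assuming white-on-black grayscale glyph arrays; `glyph` here is a fake stand-in,
and a real pipeline would use PIL or OpenCV for proper resampling rather than
nearest-neighbour indexing):

```python
import numpy as np

def pad_to_square(img):
    """Zero-pad a 2D grayscale array so it becomes square, glyph centered."""
    h, w = img.shape
    side = max(h, w)
    out = np.zeros((side, side), dtype=img.dtype)
    top = (side - h) // 2
    left = (side - w) // 2
    out[top:top + h, left:left + w] = img
    return out

def resize_nearest(img, size=28):
    """Nearest-neighbour resize of a square array to size x size."""
    side = img.shape[0]
    idx = np.arange(size) * side // size  # source row/col for each target pixel
    return img[np.ix_(idx, idx)]

glyph = np.ones((40, 20), dtype=np.uint8) * 255  # fake tall 40x20 glyph
small = resize_nearest(pad_to_square(glyph), 28)
print(small.shape)  # (28, 28)
```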

Then I want to do the Machine Learning (Classify) step.

The article doesn't go into any detail about how to install Classify, how to
import the training and testing data, and how to then actually run it. I
watched the videos from CS231n because of my boss, but again, I'm still not
really sure what to do practically.

If I have lots of folders of images, what should I do to build an OCR program?

~~~
visarga
\- load images, convert into array, call it X

\- put the labels (character codes one hot encoded) in an array called y

\- use a convolutional neural network to learn the mapping X->y (training)

\- send new images to the network to make predictions

That's how you train all neural nets in supervised learning.
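The first two steps above (build X, one-hot encode y) can be sketched in numpy;
the arrays below are synthetic stand-ins for real glyph images, and the CNN
itself would come from a framework such as Keras or PyTorch:

```python
import numpy as np

def one_hot(labels, num_classes):
    """One-hot encode integer class labels into an (N, C) float array."""
    return np.eye(num_classes, dtype=np.float32)[labels]

# Stand-in for "load images, convert into array, call it X":
# ten fake 28x28 grayscale glyphs with labels from three character classes.
X = np.random.rand(10, 28, 28).astype(np.float32)
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
y = one_hot(labels, 3)

print(X.shape, y.shape)  # (10, 28, 28) (10, 3)
# A CNN would then learn the mapping X -> y, e.g. in Keras:
#   model.fit(X[..., None], y)   # add a channel axis for conv layers
```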

~~~
peterburkimsher
Just like the article, your comment is pseudo-code. Isn't there a tutorial for
this?

~~~
mchaver
OCR for handwritten number and Latin characters: [http://opencv-python-
tutroals.readthedocs.io/en/latest/py_tu...](http://opencv-python-
tutroals.readthedocs.io/en/latest/py_tutorials/py_ml/py_knn/py_knn_opencv/py_knn_opencv.html)
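The linked tutorial's approach — k-nearest-neighbour classification of
flattened digit images — can be sketched framework-free in numpy (the toy
4-pixel "images" below are hypothetical; the tutorial uses OpenCV's kNN on
real 20x20 digit crops):

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Classify a flattened image by majority vote among its k nearest
    training images (Euclidean distance in pixel space)."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()

# Toy data: two 4-pixel "image" classes, dark (0) vs. bright (1).
train_X = np.array([[0, 0, 0, 0], [10, 10, 10, 10],
                    [1, 0, 1, 0], [9, 10, 9, 10]], dtype=float)
train_y = np.array([0, 1, 0, 1])

print(knn_predict(train_X, train_y, np.array([0.5, 0, 0, 0.5])))  # 0
```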

As my sibling poster mentioned, these are generally research level tasks. Most
people working on this are either writing papers on very particular issues
related to OCR or they are working in industry and the companies are not open
sourcing their code because it is expensive to produce and/or they have custom
data sets they don't want to give away. There might be some tutorials
somewhere though.

Source: previously worked on OCR systems for European and East Asian
languages.

------
wodenokoto
Anyone know the work he refers to as the inspiration for this article?

