Classifying Japanese characters from the Edo period (wolfram.com)
49 points by soofy 11 months ago | 24 comments

If I understand correctly, the author downloaded a data set, trained a model on it, and was happy because it scores over 95% on the training set from that data set.

This is not interesting at all, because the hard part is actually segmenting and classifying characters in a real text, and in real life they don't conveniently come as 28x28 images. As someone with a few hours of training in reading that kind of text, I'd say the difficulties are:

- telling apart characters (hentaigana) that look very similar

- segmenting characters because size and length vary a lot

- differentiating between kana and kanji, because the former are derivatives of the latter and they sometimes look quite alike

So the experimenter should have opened his post by saying he doesn't know any Japanese instead of closing it that way (even though one can tell as much from his use of modern kana as class labels).

Since the original article doesn't give much context as to what these characters look like in situ, here is an image of a Japanese book [1] and some playing cards [2] to illustrate the segmentation difficulties.

[1] https://wakancambridge.files.wordpress.com/2015/07/e8bbbde58...

[2] http://img03.aucfan.com/item_data/image/20150720/yahoo/u/u80...

Hard enough for a literate human to segment.

It does look like an alien language to me; I wonder what Latin or Cyrillic script would look like to someone for whom that was perfectly legible.

Writing like this is more or less analogous to cursive handwriting in English.

I mean, if you write out "minimum" in cursive, someone who learned English exclusively from printed text might think it was a scribble. But someone familiar with cursive decodes the whole word in a chunk, partly from understanding how Latin letters get transformed into cursive, partly from having seen "minimum" written in cursive before, and maybe also by using nearby words for context. There's a roughly similar thing going on here.

That is absolutely true, but, like cursive handwriting in English, there are cases where it's all gibberish, even to natives. See, for example, how doctors' handwriting is barely readable. With effort, you can make out the words you are familiar with, but not the others.

Now, imagine some text in medieval English written in a doctor's handwriting. That would be basically undecipherable for most people.

The same is true about many Edo period (and older) writings, which are written as "scribbles", and use old vocabulary/grammar that most people don't know. They are essentially unreadable to most Japanese, except the trained ones (e.g. historians).

That's a good point. My last name, when written in Russian cursive, looks like a series of waves, and is effectively indistinguishable from a child's scribble.

> This is not interesting at all because...

Comments like this make me reluctant to post things I work on.

The OP's comment is very good, because the discussion in the article is quite misleading about what it accomplished. It also appears self-serving: it reads like marketing material for Wolfram. As your sibling post makes extremely clear, classifying handwritten text from the Edo period is a much, much, much more difficult problem. In fact, the samples taken for training are so unstylized that the author might as well have written an article about recognizing modern hiragana handwriting. That would be very cool too, but it would have gathered no interest because it is a well-solved problem.

I have no problem with people doing fun projects and explaining how they did them. I have a bit of a problem with people apparently making exaggerated claims about their product's capabilities.

> I have no problem with people doing fun projects and explaining how they did them. I have a bit of a problem with people apparently making exaggerated claims about their product's capabilities.

This isn't written by a Wolfram employee. It seems to be just someone's fun and educational project, posted on a Wolfram message board.

I think the article is more mistitled than anything else.

I mean, it's a fine article for someone interested in how to train a classifier on an annotated dataset in Mathematica. It's just not very interesting for someone interested in how to classify Japanese characters from the Edo period.

The training set contains errors. The に row is especially messy: it's full of 小, ふ, and み.

Ahh, I see. These aren't supposed to be visual variants of に specifically, but are just any old forms/arbitrary other characters that were used in place of に (sort of like "V" being classified as "u" because of Roman texts).

I want to do the same for Chinese.

I have a collection of 1500 fonts, and finally exported all the PNGs for 75,000 characters. Now I need to pad, crop, and scale to make 28x28 (or 32x32, or 64x64, or another resolution).

Then I want to do the Machine Learning (Classify) step.
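As a rough sketch of that pad-crop-scale step, here is one way to do it in Python with Pillow (the function name and the white-background assumption are mine, not from the thread):

```python
# Pad each glyph image to a square canvas, then scale to 28x28.
# Assumes Pillow is installed and glyphs are dark ink on a light background.
from PIL import Image

def to_square_28(source, size=28, background=255):
    img = Image.open(source).convert("L")        # force grayscale
    w, h = img.size
    side = max(w, h)                             # square canvas that fits the glyph
    canvas = Image.new("L", (side, side), background)
    # center the glyph on the canvas instead of stretching it
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas.resize((size, size), Image.LANCZOS)
```

Centering on a square canvas before resizing keeps the aspect ratio, so tall or wide glyphs aren't distorted; the same function works for 32x32 or 64x64 by changing `size`.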

The article doesn't go into any detail about how to install Classify, how to import the training and testing data, and how to then actually run it. I watched the videos from CS231n because of my boss, but again, I'm still not really sure what to do practically.

If I have lots of folders of images, what should I do to build an OCR program?

Classify is a function in Mathematica which returns a ClassifierFunction.

Documentation for Classify: http://reference.wolfram.com/language/ref/Classify.html

As far as I could see they listed all the Wolfram Language code in the example to set it up.

- load images, convert into array, call it X

- put the labels (character codes one hot encoded) in an array called y

- use a convolutional neural network to learn the mapping X->y (training)

- send new images to the network to make predictions

That's how you train all neural nets in supervised learning.
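A minimal runnable sketch of those steps, using random arrays in place of real glyph images and scikit-learn's MLPClassifier as a stand-in for a convolutional network (a real project would use a CNN framework; all shapes and names here are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_classes, n_samples = 5, 200

# X: images flattened into vectors (stand-in for loaded 28x28 glyphs)
X = rng.random((n_samples, 28 * 28))
# y: integer class labels; scikit-learn encodes these internally,
# whereas some frameworks expect them one-hot encoded
y = rng.integers(0, n_classes, n_samples)

# learn the mapping X -> y (training)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=30,
                    random_state=0).fit(X, y)

# send new images to the network to make predictions
preds = clf.predict(rng.random((3, 28 * 28)))
```

With random data the predictions are meaningless, of course; the point is only the shape of the pipeline: arrays in, labels out.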

Just like the article, your comment is pseudo-code. Isn't there a tutorial for this?

OCR for handwritten number and Latin characters: http://opencv-python-tutroals.readthedocs.io/en/latest/py_tu...

As my sibling poster mentioned, these are generally research level tasks. Most people working on this are either writing papers on very particular issues related to OCR or they are working in industry and the companies are not open sourcing their code because it is expensive to produce and/or they have custom data sets they don't want to give away. There might be some tutorials somewhere though.

Source: previously worked on OCR systems for European and East Asian languages.

You could try http://fast.ai. They show all the steps necessary to train an image classifier (for cats vs. dogs, but it should be easy to adapt to other classes), and the explanations are excellent.

But if you are trying to do OCR on complete texts, maybe even handwritten, you're getting close to research-level problems very quickly.

Sounds like a nice collection. I'd be keen to see any decent ones you have for rarer seal scripts.

I recommend FangzhengXiaozhuantiFont-TraditionalChinese.ttf for Small Seal Script and JDFZHUANF.ttf for Large Seal Script.

Both are supported by Pingtype.


Sorry that it's slow to load; the fonts have to be downloaded from the server every time because I can't be sure they're on the user's device. The default font is 12MB. Once it's loaded, you can re-translate much faster - it's just the first run that's slow.

Anyone know the work he refers to as the inspiration for this article?
