Hacker News
Real time numbers recognition (MNIST) on an iPhone with CoreML (liip.ch)
137 points by uberneo 9 days ago | 19 comments





Neat walkthrough!

Last year I actually made an applied-CoreML app to solve sudoku puzzles where MNIST came in very handy.

I wrote about it here: https://blog.prototypr.io/behind-the-magic-how-we-built-the-...


>After I scanned a wide variety of puzzles from each book, my server had stored about 600,000 images

600,000?!? Even divided by 81 that's over 7000! How long did this take?


A couple of afternoons.

I just hacked into my app's flow to upload a "scan" of the isolated puzzle to my server instead of slicing it and sending the component images to CoreML.

Then I sat there and flipped through page after page of Sudoku puzzles and scanned them from a few different angles each, sliced them in bulk on the server, and voila: data!


Sorry, I’m still confused. You took roughly 7,000 pictures in two afternoons? What do you mean by "sliced them in bulk"? If you took them from different angles, how do you slice them in bulk?

Correct.

The app already had the code for "isolate the puzzle and do perspective correction" so the uploaded images all looked something like this: https://magicsudoku.com/example-uploaded-image.png

By "slicing in bulk" I mean the server was the one that split that out into 81 smaller images rather than the app doing the slicing and uploading 81 small images.

Taking them from different angles was done because the perspective correction adds distortions that I didn't want my model to be sensitive to.
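The "slicing in bulk" step above is essentially splitting one square, perspective-corrected scan into a 9x9 grid of cell images. A minimal sketch with NumPy (the function name and cell size are my own illustration, not code from the actual server):

```python
import numpy as np

def slice_puzzle(image: np.ndarray, cells_per_side: int = 9) -> list:
    """Split a square, perspective-corrected puzzle scan into 81 cell images."""
    h, w = image.shape[:2]
    ch, cw = h // cells_per_side, w // cells_per_side
    cells = []
    for row in range(cells_per_side):
        for col in range(cells_per_side):
            cells.append(image[row * ch:(row + 1) * ch, col * cw:(col + 1) * cw])
    return cells

# e.g. a 450x450 grayscale scan yields 81 cells of 50x50 pixels
grid = np.zeros((450, 450), dtype=np.uint8)
cells = slice_puzzle(grid)
assert len(cells) == 81 and cells[0].shape == (50, 50)
```

With ~90 scans per afternoon uploading one image each, the server does the 81-way split, which is how a "couple of afternoons" turns into 600,000 training images.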


Interesting stuff! I’m also a little confused as to how you took so many pictures in only a couple of afternoons.

7000 pictures at 5 seconds per picture is "only" about 10 hours of work, and the per-picture time could probably be even lower. Seems quite doable over 2-4 afternoons.

Props for doing the project end2end, including the non-trivial (and typically skipped) part of collecting training data.


"Apple ... provides a ... helper library called coremltools that we can use to ... convert scikit-learn models, Keras and XGBoost models to CoreML"

Awesome.


As someone without much experience in ML: how do you handle the case where no number is present at all?

Great question! This is actually a surprisingly deep problem in ML, known as "anomaly detection" or "out-of-distribution" (OoD) detection.

Another way to formulate this question: "given training data that only tells you about digits, how do you know whether something is a digit or not?" Given that the training data never actually defines what isn't a digit, how can we ensure that the model actually sees a digit at test time? If we cannot ensure this (e.g. an adversary or the real world supplies inputs), how can we "filter out" bad inputs?

A quick hack solution that works well in practice is to examine the "predictive distribution" across digit classes. Researchers have empirically found that its entropy tends to be higher (i.e. the distribution is flatter, spread across many classes) when the model sees an OoD input. However, the OoD problem is not fully solved.
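The entropy hack can be sketched in a few lines (the cutoff value here is a made-up illustration; in practice you would tune it on held-out data):

```python
import numpy as np

def predictive_entropy(probs) -> float:
    """Shannon entropy of a predictive distribution; higher = flatter = more suspect."""
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log(probs + 1e-12)))  # epsilon avoids log(0)

confident = [0.01] * 9 + [0.91]   # peaked: the model is fairly sure it sees a "9"
flat = [0.1] * 10                 # near-uniform: likely out-of-distribution

assert predictive_entropy(flat) > predictive_entropy(confident)

ENTROPY_CUTOFF = 1.5  # hypothetical threshold, tune on held-out data

def looks_like_a_digit(probs) -> bool:
    return predictive_entropy(probs) < ENTROPY_CUTOFF
```

For a 10-class model the entropy ranges from 0 (one class gets all the mass) to ln(10) ≈ 2.3 (uniform), so the cutoff lives somewhere in between.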

Here's a nice survey paper on the topic: https://arxiv.org/abs/1809.04729

Note that methods that tie OoD to the task at hand (classification) are not actually solving OoD, they are solving "predictive uncertainty" of the task.


You mean to get either 0-9 or 'no number'? Here are two approaches:

1) Integrated. Represent 'no number' as an eleventh class in the original model. Retrain it with this additional class (this needs additional training data).

2) Cascading. Train a dedicated model for 'number' versus 'no number' (binary classifier), and use that in front of the original model.
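The cascading approach (2) can be sketched as follows. The model objects and their interfaces here are placeholders standing in for whatever classifiers you actually train; only the control flow is the point:

```python
def classify_cell(image, gate_model, digit_model, threshold=0.5):
    """Cascade: a binary 'is there a digit?' gate in front of the 10-class model.
    Both models are stand-ins with an assumed interface, not a real API."""
    if gate_model.predict_proba(image) < threshold:
        return None  # gate says 'no number'; skip the digit classifier entirely
    return digit_model.predict(image)

# Tiny stub models so the control flow can be exercised end to end.
class StubGate:
    def predict_proba(self, image):
        return 0.9 if image.get("has_digit") else 0.1

class StubDigits:
    def predict(self, image):
        return image["label"]

gate, digits = StubGate(), StubDigits()
assert classify_cell({"has_digit": True, "label": 7}, gate, digits) == 7
assert classify_cell({"has_digit": False}, gate, digits) is None
```

One nice property of the cascade is that the binary gate is a much easier problem, so it can be a small, fast model that also saves you from running the full classifier on empty cells.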

Note that the MNIST data comes already extracted from the original images, centered in fixed-size 28x28-pixel images. In a practical ML application, these steps also need to happen before classification can be performed.
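A rough sketch of the centering step, assuming the digit has already been cropped and scaled into a 28x28 array (this mimics MNIST's convention, not any specific pipeline from the article):

```python
import numpy as np

def center_digit(img28: np.ndarray) -> np.ndarray:
    """Shift a 28x28 digit image so its bounding box is centered in the frame."""
    ys, xs = np.nonzero(img28)
    if len(ys) == 0:
        return img28  # blank cell: nothing to center
    top, bottom = ys.min(), ys.max()
    left, right = xs.min(), xs.max()
    # Offsets that place the bounding box in the middle of the 28x28 frame.
    dy = (28 - (bottom - top + 1)) // 2 - top
    dx = (28 - (right - left + 1)) // 2 - left
    return np.roll(np.roll(img28, dy, axis=0), dx, axis=1)

# A 4x4 blob in the top-left corner ends up centered at rows/cols 12..15.
img = np.zeros((28, 28), dtype=np.uint8)
img[0:4, 0:4] = 1
ys, xs = np.nonzero(center_digit(img))
assert ys.min() == 12 and ys.max() == 15 and xs.min() == 12
```

(Real MNIST centers by center of mass rather than bounding box, but the idea is the same.)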


In the work shown in the article, the segmentation and centering of digits looks to be done by the user holding the camera. Which can be workable for some applications!

The predictions variable has a confidence value for each digit. You can put a cutoff and say if none is above a certain confidence, assume there's no number at all.
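That cutoff idea is a few lines of code. This is a sketch against an assumed shape for the predictions (a dict of label to confidence; the article's actual variable may differ):

```python
def best_guess(predictions: dict, cutoff: float = 0.8):
    """Return the most confident digit, or None when nothing clears the cutoff.
    `predictions` maps digit label -> confidence; the 0.8 cutoff is arbitrary."""
    digit, conf = max(predictions.items(), key=lambda kv: kv[1])
    return digit if conf >= cutoff else None

assert best_guess({"3": 0.95, "8": 0.03}) == "3"
assert best_guess({"3": 0.40, "8": 0.35}) is None
```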

This could work, but it is important to note that a lot of ML algorithms trained in a closed domain (no "other" class) will be pretty bad at knowing what they don't know. This is an open problem in ML.

Choosing the threshold will be hard. And (as the other poster mentioned) the model is unlikely to generalize well to classes of data it has not seen. I suspect this approach will often get things that look similar to numbers wrong, like handwritten letters (a, b, c). Including these in the training set is much more likely to yield a model that can successfully discriminate them.

You can use a threshold value to detect whether there is a number at all: if the top prediction's confidence is below the threshold, report 'no number'.

The scrollbar distance confirms a suspicion that I've held for some time: that writing a machine learning algorithm is of similar complexity to developing an iOS app in Xcode!

What scrollbar distance are you talking about?

It was a joke - the Xcode section starts about halfway down the page. I was just illustrating that the friction we deal with today is of comparable complexity to what might be thought of as advanced programming (AI, VR, AR, physics, etc etc).


