Like, I've got scans of letters, machine-written, 300dpi, perfect black/white 1-bit depth, no marks or scratches, perfect quality — and OCRmyPDF (using tesseract) absolutely fails at this task, returning only bullshit. Even if I set the language correctly, or even set the wordlist to a manual transcription of the PDF.
I also tried using OCR on screenshots, with the same miserable result.
How does Apple’s Vision API do so much better at this kind of task? Is there some trick I'm missing?
Like, the images I supply are of such high quality that you could literally just split them into characters and search for the nearest match in a database of Latin characters, and even that would return better results than tesseract.
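To make that concrete, here's a toy sketch of the "nearest match in a database of characters" idea. The 3x3 glyph bitmaps below are made up for illustration (real glyph templates would be rendered from actual fonts at the scan's resolution), but the matching logic is the whole trick:

```python
# Toy nearest-neighbor character matcher. The 3x3 glyphs are a made-up
# illustration, not real font data; a real pipeline would binarize the
# scan, segment it into character cells, and match against rendered fonts.

GLYPHS = {
    "T": ("111", "010", "010"),
    "L": ("100", "100", "111"),
    "O": ("111", "101", "111"),
    "I": ("010", "010", "010"),
}

def hamming(a, b):
    """Count differing pixels between two same-size binary bitmaps."""
    return sum(p != q for p, q in zip("".join(a), "".join(b)))

def classify(bitmap):
    """Return the glyph whose template is nearest to the input bitmap."""
    return min(GLYPHS, key=lambda ch: hamming(GLYPHS[ch], bitmap))
```

Even with one flipped pixel of noise, `classify(("111", "010", "011"))` still lands on "T", which is the point: on clean 1-bit scans the problem is almost table lookup.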
My guess is Apple trained on their own massive dataset and uses better architectures, which makes their system better than off-the-shelf OCR.
I am working in this area, and one way I see good improvement is training an object detector to first detect words in an image; then you can pass those crops through tesseract/OCR software. Besides that, finetuning tesseract on the data you want strong performance on would be the next best alternative.
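The detect-then-recognize pipeline above can be sketched crudely. A real system would use a learned text detector (EAST or CRAFT, say); here a simple projection profile splits one binarized text line into word boxes, which you would then crop and feed to tesseract one at a time:

```python
# Crude stand-in for a word detector: split a binary text line into
# word bounding spans using column gaps. Real pipelines use a learned
# detector; this only shows the shape of the detect-then-OCR approach.

def word_boxes(line, min_gap=2):
    """Split a binary text line (list of equal-length '0'/'1' strings)
    into (start_col, end_col) word spans, splitting on runs of at least
    min_gap blank columns."""
    width = len(line[0])
    inked = [any(row[c] == "1" for row in line) for c in range(width)]
    boxes, start, gap = [], None, 0
    for c, has_ink in enumerate(inked):
        if has_ink:
            if start is None:
                start = c
            end, gap = c, 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                boxes.append((start, end))
                start = None
    if start is not None:
        boxes.append((start, end))
    return boxes
```

Each crop can then go through tesseract with `--psm 8` (treat the image as a single word), which sidesteps a lot of its page-layout guessing.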
Here is a cool article from Dropbox about how they engineered their OCR system:
This link below is a really cool project that showcases what building your own OCR engine would look like!
I would also vouch for Google's OCR API, but I opted for Apple's for privacy reasons and because it was easier to implement.
Thank you for respecting user privacy and making it a first class citizen.
This is the script I wrote for that purpose (maybe it helps): https://nopaste.xyz/?ea1e6242a659ce5c#vAMyPpajo0OIeCGQe9Wkdq...
The results aren't perfect, but good enough for my purpose.
I was also left with a feeling that I must have been missing some trick, because people on the web seem to be using it with good results.
It handles OCR, indexing and search
Mac Catalyst 13.0+
> The app does use third party services that may collect information used to identify you.
> I want to inform you that whenever you use my Service, in a case of an error in the app I collect data [...]. This [data] may include information such as your device Internet Protocol (“IP”) address, [...], the configuration of the app when utilizing my Service, [..], and other statistics.
Could an error log dump the strings it detected in my images and send them off to this third-party for instance?
I don't understand why iOS doesn't allow blocking internet access for apps. You can only disallow mobile data access, but not WiFi access.
The frustrating part is that in China they have the functionality, so it's there, but the rest of us aren't allowed to use it for our privacy.
edit: This comment was posted before you edited yours.
I'd rather the program asked whether I want to send the crash log the next time I open it. I have several programs that do that, and I wish it were more widespread.
Well, it's already illegal, it just takes time to prosecute enough data processors until others start being afraid of the fines and start ACTUALLY implementing GDPR.
You can be paranoid, but then you shouldn't have a cell phone at all, let alone an always-connected smartphone.
But the web site is one page, without even an "About" section. All it has is an animation, some text, and an e-mail address. No information about the company at all. Even the whois for the domain is hidden.
For the sort of person for whom "on device" processing is important, information about the company has value.
Took a few minutes to go through my images, but the search was working from the very first indexed photo.
The search is fast and word recognition is freaking impressive!
One thought: get this plugged into the Siri search system so people can search their photos when just doing a general search.
One of the big focuses for us this year is to make that experience much better.
We've already taken some steps (we now just give you a regular React Native project if you need one), but we plan to keep working on this stuff until you can get all the power you need to build something like Screenshot Hero with the ease of use of just making a website.
Also a bit of a tangent, but I'll take the opportunity to say that I wish third-party React Native modules had some sort of standard documentation like the Flutter website, especially versioned documentation.
It's really cool to see how something like this can be made without a huge investment! I expect that future iOS versions will probably just include this natively. There's already some object recognition but not at this level.
Try searching "credit card, [city], september 12th 2019", or "yellow cat" or "screenshot".
It can't transcribe text though.
And transcribe text with Google Lens
I would love to see more developers follow this approach.
As an alternative, Google Photos from the App Store also supports OCR for still images on its web version. I don't have an iDevice on me to check whether it also does OCR in-app. Give GP a run for its money, Screenshot Hero!
And he said that he used React just for the views. SwiftUI isn't that hard at all, and there are exact tutorials about listing things: Ray Wenderlich, for example, or the WWDC videos. There is high-quality information out there, but it seems like Swift and SwiftUI weren't actually a real consideration.
To be clear, I am not criticizing the app, what I am criticizing is the parent comment hoping more developers follow that approach. I happen to think that approach is lazy and aspires to the lowest common device denominator rather than building apps that are platform specific and optimized for each.
Sometimes you just want to get something done rather than do free work for one of the richest corporations in the world.
The latter should be enough fulfillment for anyone, but sometimes you're just too greedy!
It helped that React Native performed great for this use case.
I wonder how well it'll scale to 10s of thousands of screenshots (especially the search).
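I don't know how the app actually indexes things, but for the search side at least, scaling to tens of thousands of screenshots is a solved problem: a hypothetical inverted index from recognized words to screenshot ids keeps a query's cost proportional to the matching posting lists, not the library size.

```python
# Hypothetical sketch (not the app's actual implementation): an inverted
# index mapping each OCR-recognized word to the screenshots containing it.
from collections import defaultdict

class ScreenshotIndex:
    def __init__(self):
        # word -> set of screenshot ids containing that word
        self.postings = defaultdict(set)

    def add(self, screenshot_id, ocr_text):
        """Index the OCR output of one screenshot."""
        for word in ocr_text.lower().split():
            self.postings[word].add(screenshot_id)

    def search(self, query):
        """Return ids of screenshots containing every query word."""
        words = query.lower().split()
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for w in words[1:]:
            result &= self.postings.get(w, set())
        return result
```

With something like this, each new screenshot costs one OCR pass at index time, and searches stay fast no matter how many screenshots you accumulate.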
Still, I'm wondering whether or not I should keep Notion around (only used it for web-clipping). Especially since Notion still can't search within pages and iOS can now do full-page screenshots.
Edit: except that on iOS, Safari's full-page screenshots are PDFs. Darn.
Not affiliated in any way, just a satisfied user. https://readdle.com/scannerpro
That would make it more comfortable to try apps like this in terms of privacy concerns.
I will look into this.
I don't plan to open source this particular project, as I already have enough open-source action going on for me.
I wonder how difficult it would be to hook up the index of text to the spotlight search API so it can search via the built-in iOS search?
Good Notes on the iPad is also shockingly good at OCR for handwriting. At least for the way I write.
I’d like to be able to take a photo of my book shelf and have it show me where the book I want is... or to generate an index of my books, with links I can click that show me where they are.
xcrun simctl io booted recordVideo <filename>.<extension>
Probably a bit harder, but scraping App Store screenshots and the top 5000 Alexa websites and training a classifier on them might be viable?