Based on the video it works amazingly well. I’ve tried to find a usable open-source text recognition tool for the desktop and haven’t found anything even close to this.
Like, I've got scans of letters, machine-written, 300dpi, perfect black/white 1-bit depth, no marks or scratches, perfect quality. And OCRmyPDF (using tesseract) absolutely fails at this task, returning only bullshit, even if I set the language correctly or set the wordlist to a manual transcription of the PDF.
I also tried using OCR on screenshots, with the same miserable result.
How does Apple’s Vision API do so much better at this kind of task? Is there some trick I'm missing?
Like, the images I supply are of such high quality that you could literally just split them into characters and look for the nearest match in a database of Latin characters, and even that would return better results than tesseract.
I think you have to fiddle with the internal settings of the OCR package you are using to get good results. For tesseract and pytesseract, there is a whole article on improving the quality: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuali...
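For example, here is a minimal pytesseract sketch of the kind of settings that tend to matter (the file name and values are just placeholders, not taken from that article):

    # Minimal example of passing Tesseract config through pytesseract.
    # --oem 1 picks the LSTM engine; --psm 6 assumes a single uniform
    # block of text, which is often the wrong choice for complex layouts.
    import pytesseract
    from PIL import Image

    img = Image.open("scan.png")  # placeholder file name
    text = pytesseract.image_to_string(
        img,
        lang="eng",
        config="--oem 1 --psm 6 --dpi 300",
    )
    print(text)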
My guess is that Apple trained on its own massive dataset, with better architectures, which makes its system better than off-the-shelf OCR.
I am working in this area, and one way I see good improvement is training an object detector to first detect words in an image; then you can pass those through tesseract/OCR software. Besides that, finetuning tesseract on the data you want strong performance on would be the next best alternative.
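As a rough illustration, assuming you already have word-level bounding boxes from some detector (the detector and the box coordinates here are hypothetical), the second stage could look something like this:

    # Crop each detected word box and OCR it individually.
    # 'boxes' is assumed to come from your own word detector.
    import pytesseract
    from PIL import Image

    img = Image.open("photo.png")                    # placeholder input
    boxes = [(12, 40, 180, 88), (200, 40, 330, 90)]  # (left, top, right, bottom), made up

    words = []
    for box in boxes:
        crop = img.crop(box)
        # --psm 8 tells Tesseract to treat the crop as a single word.
        words.append(pytesseract.image_to_string(crop, config="--psm 8").strip())
    print(" ".join(words))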
Reading all these comments about others having bad experiences with Tesseract makes me think there is a market for a robust, better alternative to Tesseract.
In my limited comparison, Apple’s API performed way better than Tesseract. You can check your images using this app or their sample Xcode project for Vision; I think the same API works on macOS too.
I would also vouch for Google’s OCR API, but I opted for Apple’s for privacy reasons and because it was easier to implement.
I use Tesseract to extract text from my scanned documents in order to be able to grep through them. What surprised me was that the lower-quality (lower-resolution) scans actually had much better OCR results. So it seems Tesseract works best with some specific font-size/DPI combination.
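If anyone wants to test that on their own scans, a quick-and-dirty way is to OCR the same page at a few scales and compare (just a sketch; the word count is only a crude proxy for quality):

    # OCR the same scan at several scales to see which size works best.
    import pytesseract
    from PIL import Image

    img = Image.open("scan.png")  # placeholder file name
    for scale in (0.5, 0.75, 1.0, 1.5, 2.0):
        resized = img.resize((int(img.width * scale), int(img.height * scale)))
        text = pytesseract.image_to_string(resized)
        print(f"scale={scale}: {len(text.split())} words recognized")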
You get what you pay for with Tesseract; don't judge OCR performance based on it. I would highly recommend checking out FineReader if you're on a supported platform. You can feed it crumpled beer receipts and the results are still decent.
If it makes you feel any better, I had a similarly terrible experience with Tesseract. I spent days studying the docs and I failed to find a setup where it would produce anything useful even with the most perfect input.
I was also left with a feeling that I must have been missing some trick, because people on the web seem to be using it with good results.
I don't know if this is how it works or not, but it is worth noting that searching for text in images is a much easier problem than transcribing images, and if you do the latter as "step 1" you destroy your ability to do the former. If you search for some piece of text, you want to find all images that, if you squint a bit, could maybe be that piece of text; that lets you deal with cases where recognition is forced to make a bad guess (the moral equivalent of rebracketing symbols) that is difficult to work around in a later text-only search pass. If you can at all avoid it, always avoid transforming your input information until you know your query, lest you destroy information. (This is what always made Microsoft Journal so epic for me, though I should be clear that I haven't used it in 15 years, so for all I know they forgot this.)
I had a similar experience working with OCR for a similar project as OP ( https://www.askgoose.com ). What we realized was that tesseract's training data was significantly different from what we were trying to transcribe (screenshots of conversations). The sparseness of the text leads to bad transcriptions. One workaround is to do some image processing beforehand so you can break the image down into data that looks similar to the training data used for tesseract.
I've had success with first using ImageMagick to clean up the image and do some thresholding, and then using the latest version (4.1) of tesseract. With tesseract you can also experiment with different --psm values for better results.
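Roughly the same idea in Python, if you'd rather not shell out to ImageMagick (the threshold value and the --psm choices below are just starting points, not recommendations):

    # Grayscale + simple global threshold, then try a few --psm modes.
    import pytesseract
    from PIL import Image

    img = Image.open("page.jpg").convert("L")        # grayscale
    bw = img.point(lambda p: 255 if p > 150 else 0)  # crude threshold at 150

    for psm in (3, 6, 11):  # auto layout, uniform block, sparse text
        text = pytesseract.image_to_string(bw, config=f"--psm {psm}")
        print(f"--psm {psm}: {text[:80]!r}")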
The paperless project has some tweaks in its config that get decent results out of the box. For general use with decent-quality documents it will work pretty well right now, versus spending hours tweaking Tesseract.
This sounds like an app I’ve wanted forever! But as with all apps I’m particularly sensitive to giving bulk access to my photos (and other data).
The app description says “images are all processed on device and nothing is sent to any server” but the app’s privacy policy talks about first- and third-party collection of potentially PII.
> The app does use third party services that may collect information used to identify you.
> I want to inform you that whenever you use my Service, in a case of an error in the app I collect data [...]. This [data] may include information such as your device Internet Protocol (“IP”) address, [...], the configuration of the app when utilizing my Service, [..], and other statistics.
Could an error log dump the strings it detected in my images and send them off to this third-party for instance?
I assume everything here exists with the best intentions, but I worry about using an app like this to scan and analyze all my images when there are a host of exceptions in the privacy policy.
It doesn’t need to. The OS could download ads and show them in-app and report back impressions through the OS. And, Apple doesn’t make much money from iOS ads.
Apple doesn't have any ad frameworks that are open to third-party developers, so they would have to build one to support this. And that didn't work out so well last time they tried.
The data collected is, as mentioned, used to analyze crashes. In case of an error, some information is collected that can be explained as justified: a unique device ID, to know if an error happened multiple times on one device; the IP address, which is collected in web server access logs anyway; and the configuration of the app, which could mean a lot of things, e.g. the resolution of the viewport.
If the Privacy Policy contains no mention of sending images to the server and storing the images and/or text contents, then ... it's not happening.
edit: This comment was posted before you edited yours.
> If the Privacy Policy contains no mention of sending images to the server and storing the images and/or text contents, then ... it's not happening.
A Privacy Policy is just words in most jurisdictions. Short of a Wireshark analysis, there's no way to know for sure.
I'd rather the program asked whether I want to send the crash log the next time I open it. I have several programs that do that, and I wish it were more widespread.
> A Privacy Policy is just words in most jurisdictions.
Not for long, at least in the EU.
Well, it's already illegal; it just takes time to prosecute enough data processors until others start being afraid of the fines and start ACTUALLY implementing GDPR.
If it makes you more comfortable: I am not sending text (or any other image metadata, including location, camera fingerprint, etc.). I plan to add basic analytics to know which views/features are being used, and also crash detection.
I'm in the same boat. I wish Apple gave us a way to block internet access for specific apps. Currently there is a way to block cellular data for a given app, but Wi-Fi is always unrestricted.
This looks very promising, and the price is reasonable.
But the web site is one page, without even an "About" section. All it has is an animation, some text, and an e-mail address. No information about the company at all. Even the whois for the domain is hidden.
For the sort of person for whom "on device" processing is important, information about the company has value.
This works really well, great work! One thought: if you first let it scan all screenshots and let it finish, but then change it to scan all images, it seems that it recomputes all the text for every image. It might make sense to compute a hash of each individual image and associate the parsed text with that image hash in a table. Then if you have to scan an image again and the hashes match, the text can simply be retrieved from the table rather than computed again. Either way, what a nice free tool.
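To spell out the caching idea (this is just a sketch in Python to show the shape of it; the names, the hash choice, and the in-memory dict are mine, not how the app actually works):

    # Key OCR results by a hash of the image bytes so a rescan can skip
    # images that were already processed.
    import hashlib

    ocr_cache = {}  # image_hash -> recognized text; could be a SQLite table instead

    def get_text(image_bytes, run_ocr):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in ocr_cache:
            ocr_cache[key] = run_ocr(image_bytes)  # OCR only on a cache miss
        return ocr_cache[key]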
I tried writing a native module a while back (2+ years ago) to work with an RN app and failed miserably. Has the process/documentation improved since then? Are there known go-to resources for native module development with clear examples? I wish expo.io (which I have used and is otherwise great) were more robust when it comes to supporting custom native modules, too.
One of the big focuses for us this year is to make that experience much better.
We've already taken some steps (we now just give you a regular React Native project if you need one), but we plan to keep working on this stuff until you can get all the power you need to build something like Screenshot Hero with the ease of use of just making a website.
That's awesome to hear. For what it's worth, I found all of your work simply incredible, and I have used it several times for both personal and professional projects with overwhelmingly fantastic results and a far better experience than anything I was ever able to accomplish natively. The "complaint" about problems really stems from the fact that the bulk of the experience is so fantastic that things like this, which are still a bit rough around the edges, end up really standing out.
Did you try the iOS Calendar tutorial[1]?
Besides this, I find it useful to look into other third-party modules, in particular modules similar to what you are building: whether they have UI or not, whether they use CocoaPods or not ...
Also a bit of a tangent, but I'll take the opportunity to say that I wish third-party react-native modules had some sort of standard documentation like the Flutter website, especially versioned documentation.
I have learned it in the last few months. It's been really intuitive as a native dev switching to work with RN more. A really simple example I made is this native module for showing the iOS 13 context menu when RN views are long-pressed, if you are looking for super simple examples: https://github.com/mpiannucci/react-native-uimenu
I've been using an app called "Memos"[0] with a similar value proposition for several months now; it's become crucial for me. If someone gives me a paper, an appointment reminder, or a note from school, I just take a picture of it and I can find it again any time I want. It's also really nice for finding pictures of buildings or signs (or photos of people taken near buildings or signs), since it can recognize text on those as well.
It's really cool to see how something like this can be made without a huge investment! I expect that future iOS versions will probably just include this natively. There's already some object recognition but not at this level.
Agreed: Cool app and certainly heroic and worth implementing.
As an alternative, Google Photos from the App Store also supports OCR for still images in its web version. I don't have an iDevice on me to check whether it also does OCR in-app. Give GP a run for its money, Screenshot Hero!
Thanks! Yes, Google Photos is awesome and does way more (like searching by things in the photo). I just wanted an on-device solution without having to upload anything to Google's servers.
I wouldn’t. Swift is far more performant than using these React-type pseudo-native systems and the developer mentioned that he “fell back” to React because of an apparent lack of SwiftUI documentation — so why not use this project as a chance to actually learn it and create some documentation? He wrote up about how he built the app, adding yet another article to the boring canon of “I just want to use JavaScript and React and call myself an iOS developer.”
And he said that he used React just for the views. SwiftUI isn't that hard at all, and there are tutorials on exactly this kind of thing, listing items: Ray Wenderlich, for example, or the WWDC videos. There is high-quality information out there, but it seems like Swift and SwiftUI weren't actually a real consideration.
To be clear, I am not criticizing the app, what I am criticizing is the parent comment hoping more developers follow that approach. I happen to think that approach is lazy and aspires to the lowest common device denominator rather than building apps that are platform specific and optimized for each.
This project was supposed to be that, a chance to learn SwiftUI. I didn't want to reverse engineer their API + document it. Frankly, I didn't have that kind of free time :)
It helped that React Native performed great for this use case.
I think that's unnecessarily harsh. RN is powerful and the dev still uses Native modules and views. The performance was probably good enough for this concept and he made a cool app with it.
Oh. This is great. Just tried it out a couple times on my 439 screenshots, and it works. I love it.
I wonder how well it'll scale to tens of thousands of screenshots (especially the search).
Still, I'm wondering whether or not I should keep Notion around (I only used it for web clipping), especially since Notion still can't search within pages and iOS can now do full-page screenshots.
Edit: except that on iOS, Safari's full-page screenshots are PDFs. Darn.
Scanner Pro, by Readdle, also does on-device OCR for documents you scan (along with many other things, including deformation/perspective correction and PDF export).
I really wish iOS had a way to selectively block all internet access for specific apps (like it kinda does for cellular data), and to block the Hidden photos album from all apps at the system/API level.
That would make it more comfortable to try apps like this in terms of privacy concerns.
How does it work with iCloud Photo Library, where most of the photos are not stored on the device? Does it have to download the full library to search? I have over 35k photos, but this app only shows about 4.5k of them, and it processes one image every 2-3 seconds.
What I really want is an app that goes through all my photos/chat history and deletes anything that might contain sensitive information after 30 days, such as from screenshots. I already use the iMessage feature that lets you delete history after 30 days, but I wish I could pretty much do this for everything. There are some things I might want to keep (like photos, for example), but I'd prefer to just have anything private deleted automatically. If it's information I really want to save, I'll make sure to save it somewhere safe.
With photos on iOS you can delete sensitive content on creation i.e. take photo, delete photo. The trash empties after 30 days, so you can access it there if you need to.
Evernote works surprisingly well. I'm always surprised when I search for something and it finds random photos of white boards that contain the word or phrase I'm looking for.
Good Notes on the iPad is also shockingly good at OCR for handwriting. At least for the way I write.
I’d like to be able to take a photo of my book shelf and have it show me where the book I want is... or to generate an index of my books, with links I can click that show me where they are.
Usually yes. But imagine you have a screenshot of a text conversation and you search for "method" (but stop typing at "meth" for whatever reason) and see an unexpected screenshot. Figuring out that something in said conversation contains the text "meth" may be incredibly difficult.
It’s unlikely there’s any way to do that on iOS, but in theory you could do it on a desktop OS. I'm pretty sure there would be such a massive amount of text throughout the day that you couldn't just do a simple filter search; you would need to somehow sort for relevance as well.
I have a bunch of XKCDs in my screenshots and somehow this app was even able to index those! I’m very impressed with the quality of the OCR tech. Great work!