
Show HN: Copyfish – Extract text from images, videos or PDF - a9t9
https://addons.mozilla.org/en-us/firefox/addon/copyfish-ocr-software/
======
gressquel
Is the OCR extraction performed in the client? If it's transferred to a server
then people should be aware of this so sensitive data from documents/PDFs is
not submitted.

~~~
qz_
Yeah apparently it uses [https://ocr.space/](https://ocr.space/), deal-breaker
for me.

~~~
a9t9
I understand that hosted OCR, just like SaaS in general, is not suitable for
every use case.

On the other hand, the OCR.space OCR API has a _very_ strict privacy policy:

[https://ocr.space/privacypolicy](https://ocr.space/privacypolicy) - All
uploaded images and the extracted text are deleted immediately after
processing.

~~~
cantrevealname
> All uploaded images and the extracted text are deleted immediately

Until they are served with a subpoena for a particular client, or a sweeping
subpoena to store everything forever, or the company is sold and the new
parent has different values, or the company decides to mine customer data for
advertising uses, or there's a bug in the software, or there's a long-lived
cache of the data, or it gets into their backups accidentally or deliberately,
or they don't keep the data but keep "just" the meta-data, or they do
statistics or analytics before deleting the data, or they are hacked, or they
simply change their minds.

In terms of privacy, even a non-free non-open-source local app with DRM or
license management is better than a server app with a "strict privacy policy".
With a good firewall setup, you can be pretty sure that the local app won't
betray you.

~~~
bpicolo
It doesn't seem reasonable to blame them for an arbitrary potential future
when they're currently doing the right thing.

~~~
mnem
No, however the description of the plugin should make it clear data will be
uploaded to a third party server for recognition so the user can make a choice
about that.

~~~
bpicolo
It more or less does.

`For developers: Copyfish is published under the GPL open-source license. As
OCR software, it uses the free OCR API from
[https://ocr.space/](https://ocr.space/) .`

~~~
tripzilch
I don't find that clear at all. And this is also important to non-developers.

Also, for nearly all documents I ever need to scan, if they're important
enough to require scanning, they're important enough that a third party should
have nothing to do with them.

The majority of exceptions to the above being, ironically, documents without
text, sketches, doodles, etc.

------
TekMol
What is the business model of free extensions like these? Is it all
spyware/malware?

It looks like many free extensions either have malware in them from the start
or get sold to malware companies later on, who then deploy the malware via
updates:

[http://lifehacker.com/many-browser-extensions-have-become-
ad...](http://lifehacker.com/many-browser-extensions-have-become-adware-or-
malware-1505117457)

~~~
irrational
Why does everything have to have a business model? Sometimes people like to
create things for the sheer enjoyment of creating things or they have an itch
to scratch and think others might have the same need. Not everything is
nefarious.

~~~
Volt
Exactly for the reason they said. The concern isn't that extensions are
necessarily nefarious, but that people often want something in return for
their work, which might be money by whatever means.

------
ImJasonH
Similar Chrome extension I wrote using Google Cloud APIs:
[https://chrome.google.com/webstore/detail/cloud-
vision/nblmo...](https://chrome.google.com/webstore/detail/cloud-
vision/nblmokgbialjjgfhfofbgfcghhbkejac?hl=en)

~~~
mdani
No need to use third-party extensions if you have a Google cloud account. You
can download
[https://github.com/kaneshin/pigeon](https://github.com/kaneshin/pigeon) and
just run it from the command line. It protects privacy and is more secure than
relying on third parties.

~~~
pbhjpbhj
Isn't Google a third-party?

------
angry_octet
We need to evolve a grammar for describing privacy implications, because
proper classification of this software would allow it to be marked as
malware/spyware.

It is beyond irresponsible for mozilla to do nothing to prevent this malware
from being recommended on their platform.

~~~
ghostly_s
How about you explain what on earth you're talking about if you're going to
take the time to disparage this product here?

~~~
angry_octet
Read the page. Yes, it isn't obvious is it?

Look down the bottom.

[https://ocr.space/](https://ocr.space/)

It uploads everything to a commercial OCR service. Which provides these CPU
cycles 'for free'.

Who owns this data? Do you have a privacy agreement with ocr.space? Can you
trust them as far as you could spit?

It doesn't matter that this is documented though. Unless it had a popup banner
EVERY TIME YOU USED IT saying "Your data will be sent to a cloud service for
OCR, which may keep/index/sell your data without restriction."

~~~
cortesoft
I think you are going a bit too far with your requirement for a popup banner
every time you use it. Do you expect a popup banner every time you click a
link on a web page taking you to a third party website, because they are going
to be able to run javascript code on your computer?

As long as the plugin is clear that they are using a third party service that
will receive your images, I think it is fine to leave it at that. Not everyone
feels that is a deal breaker, and they shouldn't be annoyed by a pop up just
because their deal breaker is different than yours.

~~~
jasonkostempski
A link shouldn't get a pop up but, if running JavaScript had required at least
a one-time user approval for each individual script link from the day of its
inception, the web would be a much friendlier place.

~~~
cortesoft
Do you really think so? I think in practice, if every website you visited made
you click 10-50 pop ups the first time you went to the site, people would
start to blindly click without reading, and it would be even easier to slip a
malicious pop up request by a user.

While you might want to believe that a user would actually think about what
they are accepting, reality is almost all don't. Even the more security minded
people among us will start to get numb to the requests. Only the most paranoid
would pay attention to all of them, and those people are probably already
doing things that would make that sort of pop up redundant.

I think this is a very common trap we fall into, where we want to provide MORE
warnings to people and let them use their judgement. However, there is such a
thing as 'alert fatigue'.

In California, companies that produce carcinogens took advantage of this
aspect of human nature; when California wanted to place warning signs about
cancer causing substances, they realized they couldn't win the fight against
the warnings. Instead, they fought for MORE warnings; they wanted warning
signs for even very slight risk carcinogens. They knew that if the signs were
EVERYWHERE, people would stop paying attention to them.

It worked. Basically every building in California has a warning that
'substances known to cause cancer or birth defects are present'. Since every
building has the same warning, I have no way of knowing which ones are
ACTUALLY dangerous.

~~~
jasonkostempski
No, I don't expect most users would care. Most users wouldn't care if sites
could execute native code as root on their machine. I think, if there was a
prompt, content providers that cared, even a little bit, about presentation
would think real hard before introducing that prompt. The way it works now,
providers very rarely think twice about adding it. And I think ad networks,
trackers and all the other useless, JS based, user hostile tools of the web,
would have a much harder time convincing site owners to drop in a snippet of
JS when there were actual consequences for doing so.

However, I don't believe for a second, without some kind of law, punishable by
death, a requirement like that would have lasted. It would take only one
browser to default "Never prompt for permissions to run JavaScript". Typical
users would flock to it (because sites would say they only work with it) and
compliant browsers would have to copy to compete. Users ruin everything.

~~~
cortesoft
If human nature doesn't allow a solution to be viable, it is useless to blame
human nature. You need a different solution.

~~~
jasonkostempski
An argument against building A.I. because the simplest solution to that is to
eliminate humans from the equation.

------
xophishox
Give me an API endpoint to send an image to, and a text response. I'll hand
you cash.

~~~
RhodesianHunter
There are a ton of these now. Google provides OCR as part of their machine
vision API. AWS has similar with Rekognition. As others have mentioned, there
are dozens of others on less well known platforms.

~~~
a9t9
Actually, based on my tests, there are only a few good services:

Abbyy (best recognition rate but by far most expensive), Google Cloud Vision
(second best recognition rate), Microsoft OCR and... our OCR.space service
with a very generous free tier and a competitively priced PRO tier.

------
CM30
Hmm, I've seen a few apps and extensions like this before. I think Project
Naptha was a heavily advertised one that did the same thing a few years back.

But how's the accuracy here? Cause when I used previous plugins for this
functionality, I often found they'd return gibberish if the text was even
slightly ambiguous looking in image form.

How does it compare to the other plugins doing the same thing here?

~~~
maggit
The text on the linked page actually compares this to Project Naptha:

> For extension gurus: You might have heard of Project Naptha, a great addon
> that applies state-of-the-art computer vision algorithms on every image you
> see while browsing the web. Copyfish solves the same problem, but it takes a
> different user interface approach. It does not try to alter the website.
> Instead, it lets you mark the text in the image that you want to extract. As
> a result Copyfish works with every website, even videos and PDF documents.

------
chrischen
Need this for mobile. Somehow it became a defining characteristic of an "App"
to disable text selection.

~~~
yjftsjthsd-h
Should be fixable via accessibility APIs?

------
Zyst
This seems cool, I just tried it in chrome and it has support for pop-up
dictionaries, so I'll be using this for some beginner reading assistance.

Thanks for making this!

------
dontchooseanick
Copyphish ?

------
RhodesianHunter
Semi related: I would love to see someone do a comparison of the various OCR
APIs on speed, accuracy, and cost.

~~~
asenna
Same here. Was just researching on this. Not sure if I should go with an open
source OCR engine or one of these APIs

~~~
RhodesianHunter
Doing it yourself with Tesseract is pretty hard (time consuming, error prone).
It's something I would only consider doing once my project was built, viable,
and the costs of an API were an issue.

------
foota
On my phone so I don't have a chance to give it a shot, but what I find has
been most irritating in the past about ocr is the accuracy. If your extension
has better accuracy you might call that out.

~~~
imron
I saw the heading on HN and thought "I wonder if it works with Chinese".

I saw the first example screenshot on the page was a Chinese movie and thought
"Great, it does"

I saw the enlarged version of the screenshot and the Chinese subtitles contain
multiple mistakes: "Nice try, but maybe not so great after all for the use
case I'd personally be interested in".

~~~
a9t9
Well, at least this confirms that the screenshots are not manipulated ;)

The tricky part for the OCR in this example is the diverse background, as the
Chinese characters are directly inside the movie.

Your comment is interesting, as the original motivation for creating the
Copyfish extension was to help me watch Chinese movies. So I can confirm that
for this purpose, it works fine. Of course, once in a while it gets some
characters wrong but it works ok with many movies.

Here is a screencast of Copyfish doing subtitle OCR:

[https://www.youtube.com/watch?v=YNGkGWj8lA4](https://www.youtube.com/watch?v=YNGkGWj8lA4)

~~~
imron
> as the Chinese characters are directly inside the movie.

Yep, same with TV shows, and soft-copies of transcripts are difficult to come
by, hence my interest in something like this.

I just watched the video. When used on a video does it keep a history of all
OCRed text?

Finally, you might also like to try posting this on [http://www.chinese-
forums.com](http://www.chinese-forums.com) If it mostly works well for TV and
films, I'm sure there will be quite a few people there who are interested in
it.

~~~
a9t9
> When used on a video does it keep a history of all OCRed text?

Not yet - but this feature is already on my todo list ;)

Thanks for the hint about the chinese forums!

~~~
imron
> Not yet - but this feature is already on my todo list ;)

Another interesting feature would be to do some sort of statistical analysis
of Chinese text being OCRed and then combining that with possible characters
suggested by the OCR. This would almost certainly prevent the mistake in the
last two characters of the Chinese movie screenshot.
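A minimal sketch of that idea, assuming the OCR engine can return several candidate characters per position with a confidence for each (nothing in this thread confirms that OCR.space exposes this). The names `train_bigrams`, `rescore` and `lm_weight` are hypothetical, and the character-bigram model with add-one smoothing is just one simple choice of "statistical analysis". It picks the sequence that best balances OCR confidence against how likely each character is to follow the previous one.

```python
import math
from collections import Counter


def train_bigrams(corpus):
    """Count single characters and adjacent character pairs in sample text."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    return unigrams, bigrams


def bigram_logprob(prev, cur, unigrams, bigrams):
    """Add-one smoothed log P(cur | prev)."""
    vocab = len(unigrams) + 1  # +1 leaves room for unseen characters
    return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))


def rescore(candidates, unigrams, bigrams, lm_weight=1.0):
    """Viterbi search over per-position OCR candidates.

    candidates: non-empty list of positions; each position is a list of
    (char, ocr_confidence) pairs with confidence in (0, 1].
    """
    # best[char] = (score, path) for the best path ending in that char
    best = {ch: (math.log(conf), ch) for ch, conf in candidates[0]}
    for position in candidates[1:]:
        nxt = {}
        for ch, conf in position:
            score, path = max(
                (prev_score + lm_weight * bigram_logprob(prev, ch, unigrams, bigrams),
                 prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            nxt[ch] = (score + math.log(conf), path + ch)
        best = nxt
    return max(best.values())[1]


# Hypothetical example: the OCR slightly prefers a visually similar wrong character.
unigrams, bigrams = train_bigrams(["你好", "你好吗", "你好"])
ocr_output = [[("你", 0.9)], [("奸", 0.55), ("好", 0.45)]]
corrected = rescore(ocr_output, unigrams, bigrams)
```

With `lm_weight=0` the function falls back to the raw OCR ranking, which makes it easy to check how much the language model is actually changing.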

------
bondolo
This seems like it could be very useful for accessibility applications.

------
Nimsical
This is cool!

Wondering what you're using for OCR?

~~~
jffry

      For developers: Copyfish is published under the
      GPL open-source license. As OCR software, it uses
      the free OCR API from https://ocr.space/

~~~
whitten
So, to answer the question mentioned above, the document storing the text is
sent to an off-site server ([https://ocr.space/](https://ocr.space/)) which
does the OCR and returns the results.
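For anyone wondering what that round trip looks like, here is a rough sketch using only the Python standard library. It is not the extension's code. The endpoint URL, the form fields (`apikey`, `url`, `language`) and the `ParsedResults`/`ParsedText` response shape are assumptions based on my reading of the OCR.space API docs, so check them before relying on this. The function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; verify against the OCR.space documentation.
OCR_ENDPOINT = "https://api.ocr.space/parse/image"


def build_ocr_request(image_url, api_key, language="eng"):
    """Build (but do not send) a POST request asking the service to OCR an image URL."""
    form = urllib.parse.urlencode({
        "apikey": api_key,      # assumed field name
        "url": image_url,       # assumed field name
        "language": language,   # assumed field name
    }).encode("ascii")
    return urllib.request.Request(OCR_ENDPOINT, data=form, method="POST")


def extract_text(response_body):
    """Pull text out of JSON shaped like {"ParsedResults": [{"ParsedText": "..."}]}."""
    payload = json.loads(response_body)
    results = payload.get("ParsedResults") or []
    return "\n".join(r.get("ParsedText", "") for r in results)


def ocr_image_url(image_url, api_key):
    """Send the image reference to the remote server and return the recognized text."""
    request = build_ocr_request(image_url, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_text(response.read().decode("utf-8"))
```

The point relevant to the privacy discussion is visible in `ocr_image_url`: the image (or a reference to it) leaves your machine before any text comes back.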

~~~
tobltobs
And what lib is ocr.space using for OCR?

~~~
tangue
I suspect they're using Tesseract as they've written a gui for it (
[https://ocr.space/blog/p/free-ocr-
windows.html](https://ocr.space/blog/p/free-ocr-windows.html) ) but there's no
way to find more.

~~~
samfisher83
[https://github.com/A9T9/Free-OCR-Software](https://github.com/A9T9/Free-OCR-
Software)

Based on this GitHub repo they might be using the Microsoft OCR library.

------
ransom1538
Is this using TesseractOCR?

------
sjs382
Could you add an email address to your HN profile so I can contact you?

~~~
a9t9
Done. In addition, the email listed on
[https://github.com/A9T9](https://github.com/A9T9) also reaches me.

~~~
vdRrsithZm
> Done. In addition, the email listed on
> [https://github.com/A9T9](https://github.com/A9T9) also reaches me.

Neat! Brother. +1 =100 Ace

------
zhangkehu
Wonderful tool, we can get text from some PDF files easily.

