I understand that you want to keep your work private and not expose your documents to the internet, but there may be situations where a document isn't that important to you and any online solution would be sufficient. Let's say one of your friends asks you to put a math problem to an AI because they want to learn how to solve it, but the AI only understands text. You'd need to OCR the PDF (which is really a converted JPG) and then paste the text into the AI. You might be on your phone or away from your desktop environment; in that case you might consider an online solution like pdfequips :)
Thank you for your suggestion. I'm considering keeping the basic functions free and adding premium features, but without changing any of the core features, i.e. the website is going to keep working as is.
pdfequips is fast, free, and offers tools that are not available on smallpdf, like pdf-to-csv, pdf-to-pdf-a, and translate-pdf. You can of course use whatever you feel most comfortable with, but I think you should give pdfequips a try :)
The files are deleted immediately after processing. I'm considering implementing WebAssembly (Wasm) to do most of the work on the client's device and enable offline use.
You say that here, but it's not in the privacy policy as far as I can see.
When you get to the stage of monetizing the site, I expect the most obvious starting point is monetizing the information inside the pdfs.
Then there's the obscurity of how much (if any) is passed on to other services (like Google). You may have one policy about the PDFs, they may have another.
So yeah, I'm in the "not for general use" camp myself. (Although there are edge cases where it may be useful.)
Don't get me wrong, I can see the upsides, and your web site looks professional, but alas the downsides are significant enough to outweigh the inconvenience of searching out something local.
I decided to create pdfequips.com when a friend kept sending me PDF files for translation and I realized how widespread the need for PDF solutions is. Now it serves as a central hub for PDF management, offering conversion tools like PDF to Word and CSV, as well as OCR. Over the past year I've developed the website extensively, leveraging a wide range of open-source tools on both the front end and the back end.
The web app, i.e. the front end, is mostly Next.js and TypeScript, and the landing page is built with Astro. The back end is heavily Python and Flask, with some JavaScript for web-to-pdf and markdown-to-pdf; the rest is mostly Python.
Not OP, but I've had good experience with WeasyPrint. I use it for generating PDF invoices: I create an HTML invoice from a template, and WeasyPrint turns it into a PDF document. It handles CSS, images, custom fonts, etc.
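To make the template-to-PDF flow concrete, here's a minimal sketch of how it could look. The template and its field names (`number`, `customer`, `total`) are made up for illustration, and the conversion step assumes the `weasyprint` package is installed; this is not the commenter's actual invoice code.

```python
import html
from string import Template

# Hypothetical invoice template; the placeholders are illustrative only.
INVOICE_TEMPLATE = Template("""<!DOCTYPE html>
<html>
  <head><style>h1 { font-family: sans-serif; }</style></head>
  <body>
    <h1>Invoice $number</h1>
    <p>Customer: $customer</p>
    <p>Total: $total</p>
  </body>
</html>""")


def render_invoice_html(number: str, customer: str, total: str) -> str:
    """Fill the template, HTML-escaping each value before substitution."""
    return INVOICE_TEMPLATE.substitute(
        number=html.escape(number),
        customer=html.escape(customer),
        total=html.escape(total),
    )


def html_to_pdf(html_text: str, output_path: str) -> None:
    """Write the HTML string to a PDF file using WeasyPrint."""
    # Imported lazily so the rendering step works even without WeasyPrint.
    from weasyprint import HTML

    HTML(string=html_text).write_pdf(output_path)
```

Usage would be something like `html_to_pdf(render_invoice_html("2024-001", "Ada & Co", "120.00 EUR"), "invoice.pdf")`. A real setup would probably use a proper template engine instead of `string.Template`.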
A neat trick to convert HTML to PDF in a browser environment is to open a new browser window, load the HTML in it, and call print() on it, like here: https://stackoverflow.com/a/33890644/5821. May be OK for an internal tool.
I used open-source solutions to build it, like LibreOffice, Ghostscript, Google's Tesseract, and a bunch of other tools. Tesseract: https://github.com/tesseract-ocr/tesseract
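For readers who haven't used Tesseract, a rough sketch of calling its command-line tool from Python is below. It assumes the `tesseract` binary is on the PATH and that the input is already an image (a PDF page would have to be rasterized first). The helper names are hypothetical, and this is not how pdfequips necessarily invokes it.

```python
import shutil
import subprocess


def build_tesseract_command(image_path: str, lang: str = "eng") -> list[str]:
    """Build the CLI call; the output base 'stdout' makes tesseract print the text."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on one image file and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return result.stdout
```

Calling `ocr_image("page.jpg")` would return whatever text Tesseract recognizes on that page.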
I’m surprised everyone is using Tesseract. It was the only game in town 10 years ago, and it’s OK on cleaned, aligned data, but there are a few newer ones like EasyOCR [0] that can deal with much less organized text (albeit more slowly).
EasyOCR looks like it's more focused on the mobile use case of extracting text from photos. That's a little bit different from extracting text from scanned documents, where document structure is an important aspect, and Tesseract is the devil we know. In the commercial space ABBYY FineReader still dominates - https://en.wikipedia.org/wiki/ABBYY_FineReader
I think it looks like a nice tool, naysayers notwithstanding. I don't have sensitive PDFs and, though I would probably not use it for my tax return, I'll use it for other stuff. For my level of security, I'm happy enough with your promise to delete the stuff right away.
I appreciate your trust, and yeah, believe me, I'm deleting the files right after processing. The way it works is that I save each uploaded file as a tmp file, process it, then delete it after the response.
This is what the code looks like on the server side for most of the tools:
```python
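import os
from flask import after_this_request

...
# inside the view function, after the upload has been saved to tmp_file
@after_this_request
def remove_file(response):
    # Flask calls this once the view has returned; delete the temporary upload
    os.remove(tmp_file.name)
    return response

return response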
```
I don't have any reason to keep them.
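For anyone who wants to try the save, process, delete pattern outside Flask, here is a framework-free sketch using `try`/`finally` instead of `after_this_request`. It is not the site's actual code: `process_upload` and `count_bytes` are hypothetical names, and the processing step is a stand-in for a real conversion tool.

```python
import os
import tempfile
from typing import Callable


def process_upload(data: bytes, process: Callable[[str], int]) -> tuple[int, str]:
    """Save the upload to a temp file, process it, and always delete the file."""
    tmp_file = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp_file.write(data)
        tmp_file.close()  # flush to disk so the processing step can reopen it
        result = process(tmp_file.name)
    finally:
        tmp_file.close()  # no-op if the file is already closed
        os.remove(tmp_file.name)  # runs even if processing raised
    # The path is returned only so callers can check that the file is gone.
    return result, tmp_file.name


def count_bytes(path: str) -> int:
    """Stand-in for a real conversion step: report the file size."""
    with open(path, "rb") as f:
        return len(f.read())
```

The `finally` block gives the same guarantee the decorator is after: the temporary file is removed whether or not the processing step succeeds.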