Hacker News: sanusihassan's comments

I'm open-sourcing the backend, but not 100% of the code.


Thanks! :)


I understand that you want to keep your work private and not expose your documents to the internet, but there may be situations where the document isn't that important to you and any online solution would be sufficient. Let's say one of your friends asks you to put a math problem to the AI because they want to learn how to solve it, but the AI only understands text. Then you need to OCR the PDF (which is a converted JPG) and copy the text to the AI. You might be on your phone or away from your desktop environment; here you might consider using an online solution like pdfequips :)


Thank you for your suggestion. I'm considering keeping the basic functions free and adding premium features, but without changing any of the core features, i.e. the website is going to work as is.


pdfequips is fast, free, and offers tools that are not available on smallpdf, like pdf-to-csv, pdf-to-pdf-a, and translate-pdf. You can of course use what you feel most comfortable with, but I guess you should give pdfequips a try :)


Thanks for bringing that up, I'll take care of that issue.


The files are deleted immediately after processing. I'm considering implementing WebAssembly (Wasm) to do most of the work on the client's device and enable offline use.


You say that here, but it's not in the privacy policy as far as I can see.

When you get to the stage of monetizing the site, I expect the most obvious starting point is monetizing the information inside the pdfs.

Then there's the obscurity of how much (if any) is passed on to other services (like Google). You may have one policy about the PDFs, they may have another.

So yeah, I'm in the "not for general use" camp myself. (Although there are edge cases where it may be useful.)

Don't get me wrong, I can see the upsides, and your web site looks professional, but alas the downsides are too significant to overcome the inconvenience of searching out something local.


I decided to create pdfequips.com when a friend kept sending me PDF files for translation, and I realized how widespread the need for PDF solutions is. Now it serves as a central hub for PDF management, offering conversion tools like PDF to Word and CSV, as well as OCR technology. Over the past year, I extensively developed the website, leveraging a wide range of open-source tools on both the front-end and back-end.


I'd like more details on the open source software used.


Same here. No (F)OSS licenses to be found on the page itself. Sus. Perhaps it is simply injecting remote root vulnerabilities into the PDFs.


The web app, i.e. the front-end part, is mostly Next.js and TypeScript, and the landing page is built using Astro. The back end is heavily Python and Flask, with some JavaScript for web-to-pdf and markdown-to-pdf; the rest is mostly Python.


just curious: what do you use to convert web pages to pdf?


Not OP, but I've had good experience with WeasyPrint. I use it for generating PDF invoices: I create an HTML invoice from a template, and WeasyPrint turns it into a PDF document. It handles CSS, images, custom fonts, etc.
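A minimal sketch of that template-to-PDF flow, assuming WeasyPrint is installed. The template, the field names, and both function names are made up for illustration; only `HTML(string=...).write_pdf(...)` is WeasyPrint's own API.

```python
from html import escape
from string import Template

# Hypothetical invoice template; the field names are invented for this sketch.
INVOICE_TEMPLATE = Template(
    "<html><body>"
    "<h1>Invoice $number</h1>"
    "<p>Customer: $customer</p>"
    "<p>Total: $total</p>"
    "</body></html>"
)


def render_invoice_html(number: str, customer: str, total: str) -> str:
    """Fill the HTML template, escaping values so they cannot break the markup."""
    return INVOICE_TEMPLATE.substitute(
        number=escape(number),
        customer=escape(customer),
        total=escape(total),
    )


def write_invoice_pdf(html: str, out_path: str) -> None:
    """Turn the rendered HTML into a PDF file with WeasyPrint."""
    # Imported lazily so the template step also works without WeasyPrint.
    from weasyprint import HTML

    HTML(string=html).write_pdf(out_path)
```

Calling `write_invoice_pdf(render_invoice_html("2024-001", "Example Co", "120.00"), "invoice.pdf")` would then produce the file. Keeping rendering and conversion separate makes the HTML easy to inspect in a browser first.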

A neat trick to convert HTML to PDF in a browser environment is to open a new browser window, load the HTML in it, and call print() on it, like here: https://stackoverflow.com/a/33890644/5821. May be OK for an internal tool.


puppeteer


I hope those are FOSS remote root PDF vulns!


If something is Turing complete, don't trust/execute it until you have verified where it comes from, who is behind it and what it does.

Here you have what Adobe has to say about PDFs: https://www.adobe.com/acrobat/resources/can-pdfs-contain-vir...


I used open source solutions to build it, like LibreOffice, Ghostscript, Google's Tesseract (https://github.com/tesseract-ocr/tesseract), and a bunch of other tools.
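For anyone curious how two of those tools can be chained, here is a rough sketch, not the site's actual code: Ghostscript renders each page to a PNG and the Tesseract CLI reads the text back. The function names, the 300 dpi default, and the `page-%03d.png` naming are assumptions; it also assumes the `gs` and `tesseract` binaries are on `PATH` (on Windows the Ghostscript binary has a different name).

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_cmd(pdf: Path, out_pattern: str, dpi: int = 300) -> list[str]:
    """Ghostscript command that renders each PDF page to a PNG image."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={out_pattern}", str(pdf),
    ]


def ocr_cmd(image: Path, lang: str = "eng") -> list[str]:
    """Tesseract command that prints the recognised text to stdout."""
    return ["tesseract", str(image), "stdout", "-l", lang]


def ocr_pdf(pdf: Path, workdir: Path) -> str:
    """Rasterise a PDF into workdir, then OCR every page image in order."""
    for tool in ("gs", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")
    subprocess.run(rasterize_cmd(pdf, str(workdir / "page-%03d.png")), check=True)
    texts = []
    for page in sorted(workdir.glob("page-*.png")):
        result = subprocess.run(
            ocr_cmd(page), check=True, capture_output=True, text=True
        )
        texts.append(result.stdout)
    return "\n".join(texts)
```

Passing the commands as argument lists rather than shell strings avoids quoting problems with uploaded file names.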


I’m surprised everyone is using Tesseract. It was the only game in town 10 years ago, and it’s OK on cleaned, aligned data, but there are a few newer ones like EasyOCR [0] that can deal with much less organized text (albeit more slowly).

[0] https://github.com/JaidedAI/EasyOCR


EasyOCR looks like it's more focused on the mobile use case of extracting text from photos. That's a little bit different from extracting text from scanned documents, where document structure is an important aspect, and Tesseract is the devil we know. In the commercial space ABBYY FineReader still dominates - https://en.wikipedia.org/wiki/ABBYY_FineReader

But perhaps I'm wrong...


ABBYY does indeed dominate, but Google Document AI is making inroads.


Careful with the Ghostscript AGPL licensing if you plan to make a commercial product that uses it.


The PDF metadata says it's PyPDF2
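One quick way to check this kind of thing without a PDF library is to look for the `/Producer` entry in the raw bytes. This is only a rough sketch: it assumes the Info dictionary is stored uncompressed as a plain literal string, and it ignores escaped parentheses, hex strings, and UTF-16 values. `find_producer` is a made-up helper name; a real PDF parser would be more robust.

```python
import re
from typing import Optional

# Matches an uncompressed Info entry such as: /Producer (PyPDF2)
_PRODUCER_RE = re.compile(rb"/Producer\s*\(([^)]*)\)")


def find_producer(pdf_bytes: bytes) -> Optional[str]:
    """Return the Producer string if it appears as a plain literal, else None."""
    match = _PRODUCER_RE.search(pdf_bytes)
    if match is None:
        return None
    # Latin-1 maps every byte to a character, so decoding cannot fail.
    return match.group(1).decode("latin-1")
```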


I used PyPDF2 to implement some tools, but not all of them.


I think it looks like a nice tool, naysayers notwithstanding. I don't have sensitive PDFs and, though I would probably not use it for my tax return, I'll use it for other stuff. For my level of security, I'm happy enough with your promise to delete the stuff right away.


I appreciate your trust, and yeah, believe me, I'm deleting the files right after processing. The way it works is that I save each uploaded file as a tmp file, process it, then delete it after the response.

This is what the code looks like on the server side for most of the tools:

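```python
import os
from flask import after_this_request

    ...
    @after_this_request
    def remove_file(response):
        # Delete the temporary upload once the response has been built.
        os.remove(tmp_file.name)
        return response

    return response
```

I don't have any reason to keep them.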
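The same save, process, delete lifecycle can also be sketched with only the standard library. This is an illustration under assumptions, not the site's code: `process_upload` and its `process` callback are hypothetical names, and the `.pdf` suffix is arbitrary.

```python
import os
import tempfile
from typing import Callable


def process_upload(data: bytes, process: Callable[[str], bytes]) -> bytes:
    """Save an upload to a temp file, process it, and always delete it."""
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        with tmp:
            tmp.write(data)
        # The file is closed here so the processing step can reopen it by path.
        return process(tmp.name)
    finally:
        # Runs whether processing succeeded or raised.
        os.remove(tmp.name)
```

The `try`/`finally` means the temp file is removed even if the processing step raises, so a failed conversion does not leave an upload on disk.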


indentation is not showing correctly, but you get the idea.


This is great, definitely going into bookmarks. The website design lacks some refinement, but overall easy to use.


thanks for the feedback!


The edit-PDF tool is under development.


I'm building a high-performance PDF design tool that feels just as fast as native software.

