I built an online PDF management platform using open-source software

porcoda · on May 13, 2024

Unfortunately, most of the PDF work I do involves things I’m not uploading to a service - ever. I don’t care if they’re “deleted immediately after processing” - they left my control. This sort of software would be great if it were 100% offline.

This isn’t just a niche issue either: this is a very real consideration for any corporate user. More companies are taking data loss and security issues seriously, which often means restricting what cloud services they are willing to use.

harryf · on May 13, 2024

I work at https://www.pdf-tools.com and we hear this again and again.

Despite the proliferation of cloud services, most large enterprises DO NOT want their sensitive documents entering the cloud. And in some cases, e.g. patient medical records, there are strict regulations about how those documents can be stored, which means on-premise is a requirement.

Good news for us, as that's what we specialise in, but also perplexing how trends in the software industry can completely ignore what customers actually want.

thr0waway001 · on May 13, 2024

Looks interesting.

However, the pricing page with no actual numbers and the ambiguous ‘Contact Us’ is a huge turn off.

I cannot stand the dance with business people who want to have a bunch of calls and meetings to know how big a company they’re dealing with is before they decide on a good rate to gouge them.

Pricing pages should be straight forward. Have tiers if you want to cover your rear but only at the limit of usage have the ‘Contact Us’ option.

I’m shopping around for a PDF solution and would’ve recommended this to my manager but I’m not willing to do more meetings to get quotes.

snehk · on May 13, 2024

> the ambiguous ‘Contact Us’ is a huge turn off

Same. About three years ago we introduced a company wide policy to not buy anything where the price is not known. So, so much time (money) being wasted on figuring out the actual costs, the offering would have to be really inexpensive to make up for this. And if that were the case, the price would be right there.

thr0waway001 · on May 13, 2024

Yup.

They usually do high usage volume pricing at high rates that are proportional to the size of the company and make you sign a yearly agreement so they can get a huge payment upfront.

How about building some trust? What if the service sucks? It will be hard to get your money back and you paid a year in advance.

They make you work to get a quote and the quote usually doesn’t work for your needs.

I too will not look at services with this pricing structure anymore unless word of mouth is favorable.

IG_Semmelweiss · on May 13, 2024

very good heuristic. I'll be borrowing. Any others you'd care to share ?

rekabis · on May 15, 2024

> the pricing page with no actual numbers and the ambiguous ‘Contact Us’ is a huge turn off.

It’s also one of the top-10 web usability mistakes as defined by the Nielsen Norman Group.

As in, it drives away far more potential clients than it can possibly convert. It’s a massive anti-pattern.

bluGill · on May 13, 2024

Large enterprises can afford to take things in house and might even save money that way, not to mention the security gains. Medical offices have no choice. However, small companies often don't have anyone in IT (other than the CEO who does everything and only rarely knows what he is doing outside the niche the company is in). These should be the prime market for tools like this: just pay us a little bit and we will worry about the details for you, and everything is backed up. However, if you can get one enterprise account, that is a lot more money than thousands of little accounts, so everyone focuses on them anyway.

gruturo · on May 13, 2024

> Good news for us, as that's what we specialise in, but also perplexing how trends in the software industry can completely ignore what customers actually want.

I initially read this backwards and thought you were lamenting that people insist on on-prem stuff when cloud is clearly The Right Thing.

I certainly don't think the entire software industry is ignoring what customers actually want. Case in point, you. But also lots of other developers who thrive in covering the myriad use cases the myopic behemoths can't see. The behemoths just have very loud PR and marketing and pretend those cases don't exist, so they are the ones you hear about a lot.

isatty · on May 13, 2024

You seem to think that users want everything in the cloud and that’s what’s causing the proliferation of cloud services. You are wrong. Users want _convenience_. They couldn’t care less about the cloud or technical details. If your website can do what they want without uploading their documents to your server, and if it’s faster and cheaper, then that’s what they’ll prefer.

iLoveOncall · on May 13, 2024

No PHP nor JavaScript SDK? You guys don't like money?

harryf · on May 13, 2024

It's a fair point. Most of our customers work with C++, C# and Java in enterprise / back office contexts, which is why there's no PHP or JavaScript right now - we've been tied up with other priorities. That said, we just added Python to our main SDK and PHP is coming.

Plus our enterprise automation product can basically talk to anything via REST API ( https://www.pdf-tools.com/docs/conversion-service/api/conver... ).

But yeah - now you got me fired up to annoy some colleagues ;)

tracker1 · on May 13, 2024

I would think that JS/TS support would be relatively high up... my own bias speaking, but a lot of the development effort that goes into easing cloud apps is JS/TS-centric.

lomase · on May 13, 2024

PHP and Javascript? So you never worked on "enterprise"?

iLoveOncall · on May 13, 2024

I work in a FAANG on stuff that is definitely "enterprise software", a major part of what we develop is written in TypeScript.

I admit PHP will not be as good a candidate, but for smaller companies it is still extremely attractive, and it's probably easier to develop since you can write PHP extensions in C.

m3h · on May 13, 2024

In that case, you can use https://www.pdftool.org/, which runs in the browser but offline and never uploads your files to any server.

IG_Semmelweiss · on May 13, 2024

I wanted to let you know that I disabled uBlock and Badger for your site, but I'm still getting a "please disable adblocker" error.

The site renders fine otherwise. I'm not a technical user, but I do run uBlock with JavaScript completely disabled.

m3h · on May 13, 2024

I didn't create this tool, but I use it frequently. I'm also using uBlock Origin, but I don't see the issue you describe. I'm not sure what Badger is, though.

BizarroLand · on May 15, 2024

Privacy Badger

https://privacybadger.org/

szundi · on May 13, 2024

How can I really know that as a random user

Rinzler89 · on May 13, 2024

Unplug your network cable when you use it.

zo1 · on May 13, 2024

And it stores it in local storage and uploads it using a service worker later when I'm online?

a2800276 · on May 13, 2024

If that's your paranoia level: How do you know the "offline" tool you're using is not uploading to a server? Possibly inadvertently in the course of bug reports, or surreptitiously while contacting the license server...?

Should security concerns really warrant not trusting the (reputable) vendor that the files are not being uploaded, you would need to do some sort of audit and/or run in an isolated environment and wouldn't be the "random user" referred to in OP.

ffpip · on May 13, 2024

You can easily block network access for an app on Windows using Windows Firewall. Same on a few Android skins such as MIUI by Xiaomi

alandarev · on May 13, 2024

The same is true for Chrome: open dev tools and, in the Network tab, set throttling to "Offline".

ffpip · on May 13, 2024

Thanks

Rinzler89 · on May 13, 2024

Use incognito mode then close that window before reconnecting online?

navane · on May 13, 2024

I'd suggest installing a separate browser (there is a myriad by now), unplugging the internet, using the service, uninstalling the separate browser, and rebooting the PC.

Rinzler89 · on May 13, 2024

I suggest a separate VM for that, that you can delete when you're done. And put the VM on a separate PC that you bought with cash off craigslist. Then toss the PC away in a different postcode when you're done. Then you can use the PDF tool safely without fear you're being tracked.

intelVISA · on May 13, 2024

Run it on an air gapped breadboard 8086?

tqwhite · on May 13, 2024

Use 'Developer Tools' and Inspect. Watch the Network tab.

If you also wear a tinfoil hat, delete the local storage, etc, after you are done using it.

sp0ck · on May 13, 2024

Is it open source? Can it be run as docker pull; docker run? If that is an option then users can make sure it will work offline.

m3h · on May 13, 2024

This isn't my tool but based on what I read on the previous thread about it, it doesn't seem to be open-source. However, some folks recommended this tool which does seem to run locally: https://github.com/torakiki/pdfsam

re-thc · on May 13, 2024

> This isn’t just a niche issue either: this is a very real consideration for any corporate user

Very true, but I wish this "common" knowledge were more widespread. Security is a major issue that is commonly overlooked. People do a lot of insecure things for convenience.

sanusihassan · on May 13, 2024

I understand that you want to keep your work private and not expose your documents to the internet, but there might be situations where the document isn't that important to you and any online solution would be sufficient. Let's say one of your friends asks you to put a math problem to an AI because they want to learn how to solve it, but the AI only understands text. Then you need to OCR the PDF (which is really a converted JPG) and copy the result to the AI. You might be on your phone or away from your desktop environment. Here you might consider using an online solution like pdfequips :)

nip · on May 13, 2024

For anyone looking to edit/fill PDFs locally (the data you fill in and the document you load stay in your browser): https://SimplePDF.eu

You can read more in the privacy policy [1]

It can also be embedded in any website [2]

Disclosure: I’m the developer behind it

[1] https://simplepdf.eu/privacy-policy

[2] https://simplepdf.github.io/

thekevan · on May 13, 2024

I'd also not upload any personal or identifying docs to this, but I would use it for fliers, and it would REALLY be useful for converting PDFs I downloaded off the internet to begin with. (I've downloaded stuff in the past that I had to convert in order to enter the data from the PDF into my computer. Geologic data for maps, a list of states with capitals, alphabetized by them--well before ChatGPT, the list goes on.)

pan69 · on May 13, 2024

Sounds to me like that (a desktop app version) is the product to sell (since the online service seems to be free).

chakintosh · on May 13, 2024

docker pull frooodle/stirling-pdf-base

nashashmi · on May 13, 2024

This was on hn a couple of days ago. Stirling pdf is a self hosted docker container and this way you don’t have to worry about files being uploaded. https://news.ycombinator.com/item?id=40242639

I almost thought this hn post was the same service wrapped in a show and tell.

darken · on May 13, 2024

I had just set up "Stirling PDF" on my home NAS a few weeks ago, since my SO needed to merge some documents and I'd recently read that (or a similar) HN thread.

I definitely would recommend it. It was really quick to set up, though my already having a reverse proxy with wildcard TLS certs set up probably helped streamline the networking side of things.

https://github.com/Stirling-Tools/Stirling-PDF

ranger_danger · on May 13, 2024

Stirling-pdf. You can self-host it. Even though it all runs locally anyway

hyuuu · on May 13, 2024

this might be a stupid question, but how do the teams share the documents?

sanusihassan · on May 12, 2024

I decided to create pdfequips.com when a friend kept sending me PDF files for translation and I realized how widespread the need for PDF solutions is. Now, it serves as a central hub for PDF management, offering conversion tools like PDF to Word and CSV, as well as OCR technology. Over the past year, I extensively developed the website, leveraging a wide range of open-source tools on both the front-end and back-end.

czl_my · on May 12, 2024

I'd like more details on the open source software used.

coretx · on May 13, 2024

Same here. No (F)OSS licenses to be found on the page itself. Sus. Perhaps it is simply injecting remote root vulnerabilities into the PDFs.

sanusihassan · on May 13, 2024

The web app, i.e. the front end, is mostly Next.js and TypeScript; the landing page is built with Astro; and the back end is mostly Python with Flask, plus some JavaScript for web-to-pdf and markdown-to-pdf.

deathemperor · on May 13, 2024

just curious: what do you use to convert web pages to pdf?

cuu508 · on May 13, 2024

Not op, but I've had good experience with WeasyPrint. I use it for generating PDF invoices: I create a HTML invoice from a template, WeasyPrint turns it into a PDF document. It handles CSS, images, custom fonts, etc.

A neat trick to convert HTML to PDF in a browser environment is to open a new browser window, load the HTML in it, and call print() on it, like here: https://stackoverflow.com/a/33890644/5821. May be OK for an internal tool.
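As a rough sketch of the template-to-PDF flow described above: the invoice fields, template, and function names here are made up for illustration. Only `HTML(string=...).write_pdf(...)` is WeasyPrint's own API, and it assumes `pip install weasyprint`.

```python
from html import escape
from string import Template

# Hypothetical invoice template; a real one would likely use Jinja2 or similar.
INVOICE_TEMPLATE = Template("""\
<html>
  <body>
    <h1>Invoice $number</h1>
    <p>Billed to: $customer</p>
    <p>Total: $total</p>
  </body>
</html>
""")


def render_invoice_html(number: str, customer: str, total: str) -> str:
    """Fill the HTML template, escaping values so they cannot break the markup."""
    return INVOICE_TEMPLATE.substitute(
        number=escape(number), customer=escape(customer), total=escape(total)
    )


def write_invoice_pdf(markup: str, out_path: str) -> None:
    """Render an HTML string to a PDF file with WeasyPrint."""
    # Imported lazily so the templating step works without WeasyPrint installed.
    from weasyprint import HTML

    HTML(string=markup).write_pdf(out_path)


if __name__ == "__main__":
    markup = render_invoice_html("2024-001", "Example Corp", "EUR 120.00")
    write_invoice_pdf(markup, "invoice.pdf")
```

Keeping the templating separate from the rendering call makes it easy to eyeball the HTML in a browser before turning it into a PDF.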

sanusihassan · on May 16, 2024

puppeteer

aspenmayer · on May 13, 2024

I hope those are FOSS remote root PDF vulns!

coretx · on May 13, 2024

If something is Turing complete, don't trust/execute it until you have verified where it comes from, who is behind it and what it does.

Here you have what Adobe has to say about PDFs: https://www.adobe.com/acrobat/resources/can-pdfs-contain-vir...

sanusihassan · on May 13, 2024

I used open source solutions to build it, like LibreOffice, Ghostscript, Google's Tesseract (https://github.com/tesseract-ocr/tesseract) and a bunch of other tools.

beagle3 · on May 13, 2024

I’m surprised everyone is using Tesseract. It was the only game in town 10 years ago, and it’s OK on clean, aligned data, but there are a few newer ones like EasyOCR [0] that can deal with much less organized text (albeit more slowly).

[0] https://github.com/JaidedAI/EasyOCR

harryf · on May 13, 2024

EasyOCR looks like it's more focused on the mobile use case of extracting text from photos. That's a little bit different from extracting text from scanned documents, where document structure is an important aspect, and Tesseract is the devil we know. In the commercial space ABBYY FineReader still dominates - https://en.wikipedia.org/wiki/ABBYY_FineReader

But perhaps I'm wrong...

ianhawes · on May 13, 2024

ABBYY does indeed dominate, but Google Document AI is making inroads.

racl101 · on May 13, 2024

Careful with the Ghostscript AGPL licensing if you plan to make a commercial product that uses it.

sedro · on May 13, 2024

The PDF metadata says it's PyPDF2

sanusihassan · on May 13, 2024

i used PyPDF2 to implement some tools, but not all of them.

tqwhite · on May 13, 2024

I think it looks like a nice tool, naysayers notwithstanding. I don't have sensitive PDFs and, though I would probably not use it for my tax return, I'll use it for other stuff. For my level of security, I'm happy enough with your promise to delete the stuff right away.

sanusihassan · on May 14, 2024

I appreciate your trust, and yeah, believe me, I'm deleting the files right after processing. The way it works is that I save the uploaded file as a tmp file, process it, then delete it after the response.

This is what the code looks like on the server side for most of the tools:

```python
    ...
    @after_this_request
    def remove_file(response):
        os.remove(tmp_file.name)
        return response

    return response
```

I don't have any reason to keep them.

sanusihassan · on May 14, 2024

indentation is not showing correctly, but you get the idea.
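For readers without a Flask app at hand, here is a minimal standalone sketch of the same save, process, delete pattern. It swaps Flask's `after_this_request` hook for a plain `try`/`finally`, and `process_upload` with its byte-counting "processing" step is a placeholder, not the site's actual code.

```python
import os
import tempfile


def process_upload(data: bytes) -> tuple[int, str]:
    """Save uploaded bytes to a temp file, process it, then always delete it.

    Returns (processed size, temp path) so the caller can confirm the cleanup.
    """
    # delete=False because we close the file before processing and remove it ourselves.
    tmp_file = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp_file.write(data)
        tmp_file.close()
        # Placeholder for the real conversion step: just measure the file.
        size = os.path.getsize(tmp_file.name)
        return size, tmp_file.name
    finally:
        tmp_file.close()  # safe to call a second time
        os.remove(tmp_file.name)  # runs even if processing raised


if __name__ == "__main__":
    size, path = process_upload(b"%PDF-1.4 example bytes")
    print(size, os.path.exists(path))
```

The `finally` clause gives the same guarantee the hook is meant to give: the temp file is removed whether processing succeeds or raises.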

saturn5k · on May 12, 2024

This is great, definitely going into bookmarks. The website design lacks some refinement, but overall easy to use.

sanusihassan · on May 13, 2024

thanks for the feedback!

boffinAudio · on May 13, 2024

I have 85,000 PDF documents, collected over a few decades.

What I really want is a semantic interface to those PDF documents. Find me "all PDF files which mention <subject>", or "show me any PDF with python example code", or "all PDFs before 2011 on the subject of coding standards for SIL-4".

I keep thinking this is out there somewhere, but whenever something new comes along I get bogged down in the details of setting it up. Surely someone has come up with an AI that you can just 'give the folder to' and it figures things out automagically?

timc3 · on May 13, 2024

Have you tried Paperless NGX?

boffinAudio · on May 14, 2024

No I haven't, so thanks for recommending it to me - looks pretty detailed. I will try it out some time this week, maybe it's exactly what I'm looking for. Thanks again!

spiderfarmer · on May 13, 2024

You can do this locally with your favourite LLM and Open WebUI: https://github.com/open-webui/open-webui

boffinAudio · on May 14, 2024

Looks like I've got a few days of hacking ahead of me, thanks for the recommendation - will put it alongside the other suggestions and check it out when I do my "PDF sortout workbench" session ..

andro_dev · on May 13, 2024

This is what I use for that

https://github.com/simon987/sist2

boffinAudio · on May 14, 2024

Looks pretty functional, if not entirely polished - I will try this out (alongside Paperless NGX, also suggested here..) - I appreciate the recommendation, thank you!

torgeros · on May 13, 2024

If you're so keen on the open source aspect, could you make the sources of your website and the tools involved available, too? Otherwise there is no use to it.

kordlessagain · on May 13, 2024

I have a lot of similar tools and it's all Open Source: https://mitta.ai

nicknow · on May 13, 2024

If this is entirely built using open-source software, why not open source the site itself? Especially if you aren't planning to turn it into a commercial service.

sanusihassan · on May 16, 2024

i'm open-sourcing the backend, but not 100% of the code.

zxexz · on May 13, 2024

This is quite nice, but you really ought to have some page accessible with attributions to the open source projects you're using to power this!

mikabasketball · on May 12, 2024

What do people use to perform those pdf tasks without uploading sensitive files to a website?

vikp · on May 13, 2024

For PDF to markdown, I recently released V2 of my tool marker - https://github.com/vikparuchuri/marker

rch · on May 13, 2024

This is very effective - it consistently yields great results.

sanusihassan · on May 13, 2024

The files are deleted immediately after processing. I'm considering implementing WebAssembly (Wasm) to do most of the work on the client's device and enable offline use.

bruce511 · on May 13, 2024

You say that here, but it's not in the privacy policy as far as I can see.

When you get to the stage of monetizing the site, I expect the most obvious starting point is monetizing the information inside the pdfs.

Then there's the obscurity of how much (if any) is passed on to other services (like Google). You may have one policy about the PDFs, they may have another.

So yeah, I'm in the "not for general use" camp myself. (Although there are edge cases where it may be useful.)

Don't get me wrong, I can see the upsides, and your web site looks professional, but alas the downsides outweigh the inconvenience of searching out something local.

lannisterstark · on May 12, 2024

Stirling-pdf. You can self-host it.

senectus1 · on May 13, 2024

using this myself. it's pretty good.

lloydatkinson · on May 13, 2024

Only last week there was a HN thread about how the author said they just used chatgpt to make the entire thing and as a result the code is beyond bad. I don't think I'd trust it.

l8arrival · on May 13, 2024

They didn't say that. They said they wrote the first version in a few days, using ChatGPT. Then worked on it almost another year since then. Something of that nature. Pretty big difference.

lannisterstark · on May 13, 2024

>I don't think I'd trust it.

You can audit the code yourself then. What's stopping you?

lloydatkinson · on May 14, 2024

Nothing is stopping me using something else that isn’t ChatGPT hope and pray code.

senectus1 · on May 13, 2024

Am not a Coder :-P

lannisterstark · on May 14, 2024

That's a fair point lol, my bad.

acidburnNSA · on May 13, 2024

Pandoc, ocrmypdf, libreoffice, pdftk, pypdf2
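A few of these can be scripted. The sketch below only builds the command lines (it does not check that the tools are installed), and the file names and helper names are placeholders. The invocations follow each tool's basic usage, so check `--help` for your installed version before relying on them.

```python
import subprocess


def merge_cmd(inputs: list[str], output: str) -> list[str]:
    """pdftk: concatenate several PDFs into one."""
    return ["pdftk", *inputs, "cat", "output", output]


def ocr_cmd(src: str, dst: str) -> list[str]:
    """ocrmypdf: add a searchable text layer to a scanned PDF."""
    return ["ocrmypdf", src, dst]


def office_to_pdf_cmd(src: str, outdir: str) -> list[str]:
    """LibreOffice: convert an office document to PDF without opening the GUI."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", outdir, src]


def run(cmd: list[str]) -> None:
    """Run a command locally; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Everything here runs on the local machine; nothing is uploaded anywhere.
    run(merge_cmd(["a.pdf", "b.pdf"], "merged.pdf"))
```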

tqwhite · on May 13, 2024

There is a site listed elsewhere in these comments that does the work entirely in-browser. That's what I would use.

brailsafe · on May 13, 2024

I use Preview on mac or even Spotlight for a good portion of these functions

andretti1977 · on May 13, 2024

Am I missing something, or does it lack PDF editing functionality like adding/editing text or adding images? I usually use https://smallpdf.com/edit-pdf because 99% of the time I simply need to fill in text fields, attach a PNG of my signature on some pages, and send the document back to the organization that asked me to fill it in (schools, medical self-certifications, government tax agencies and so on). For those needs, smallpdf is fantastic, but obviously I'd prefer an open source or at least self-hostable solution.

xvfLJfx9 · on May 13, 2024

This would be awesome, if it can be selfhosted. I work with sensitive documents I can't upload to a third party.

epalm · on May 12, 2024

Nice roundup of tools!

Just a small note, on safari mobile if I expand the Edit and then Convert sections, they open on top of each other.

https://i.imgur.com/bSZdRTN.png

sanusihassan · on May 13, 2024

Thanks for bringing that up, I'll take care of that issue.

omegant · on May 13, 2024

At this point a new format should emerge as a replacement of pdf. It’s very useful and easy to publish, but working with pdf documents beyond reading and printing is way too complicated.

laurensr · on May 13, 2024

If anyone knows about a FOSS pdf form editor, please share!

unanimous · on May 13, 2024

You can edit PDF forms by opening the files in Firefox, but maybe that's not exactly what you mean. I'm not sure about other browsers.

hatenberg · on May 13, 2024

90% overlap with the free and selfhostable stirling-pdf?

carte_blanche · on May 15, 2024

I've been using Stirling-PDF as my go-to solution for any pdf needs and have never needed any other service. Open source gold standard for any pdf needs: https://github.com/Stirling-Tools/Stirling-PDF

erremerre · on May 13, 2024

I am always surprised there is absolutely nothing like the Adobe Acrobat Pro on the open source space.

There is a collection of open source tools, each with its own interface and each doing a subset of things.

The alternative, an online service, is not great...

martin_a · on May 13, 2024

PDF is harder than most people think due to its variety. While there might be a tool for every job, "one for all" is hard to do.

xupybd · on May 12, 2024

You should run ads or charge for something. I suspect this is going to get very popular.

sanusihassan · on May 13, 2024

Thank you for your suggestion, I'm considering keeping the basic functions free and adding premium features, but still not changing any of the core features i.e the website is going to work as is.

teeray · on May 13, 2024

Anybody have a good open-source receipt data extraction tool for PDFs?

tappio · on May 13, 2024

We just launched an MVP for PDF data extraction https://excelifier.com/. The service is not open source and relies on OpenAI, which is probably a bit problematic in your case.

However, we understand that privacy concerns are really important for many organizations. Making it self-hostable and depend on a locally running LLM is something that we are looking into.

julianwachholz · on May 16, 2024

It sure sounds interesting, but I'm only getting timeouts. A possible hug-of-death period should be over by now?

madspindel · on May 13, 2024

Any plans to make this available as a docker container?

Refusing23 · on May 14, 2024

I tried converting a PDF to markdown and I just got a large bunch of ... seemingly random numbers and letters.

nilstycho · on May 12, 2024

I am a happy user of smallpdf, which seems quite similar. What advantages do you offer?

sanusihassan · on May 13, 2024

pdfequips is fast, free, and offers tools that are not available on smallpdf, like pdf-to-csv, pdf-to-pdf-a, and translate-pdf. You can of course use what you feel most comfortable with, but I guess you should give pdfequips a try :)

tomthumb · on May 13, 2024

Bookmarked!

sanusihassan · on May 13, 2024

Thanks! :)