I wrote something similar so I could save recipes and web pages for reading offline. If you save as HTML, it inlines images so you end up with a single file; in Markdown, it just creates a link.
It also uses turndown and readability.
It's pretty finicky (readability doesn't always identify the correct content, or misses pieces of it). If you want to charge for it, you'd have to fix some of those edge cases.
Also, I don't think the value of this product is turning web pages into markdown; there are many free web clippers and archive sites that do this already. I see this as more of an "extra" in a product, like how Evernote has a web clipper built into their note-taking product.
Also, it's cool to see other people care about a stripped down web reading experience too!
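The single-file HTML save mentioned above boils down to rewriting each external image reference as an inline `data:` URI. A minimal sketch of the encoding step (fetching the image bytes is left out, and `imageToDataUri` is a name I made up):

```javascript
// Sketch: the single-file HTML trick replaces each external
// <img src="..."> with an inline data: URI. The raw bytes would come
// from fetching the original src; here we only do the encoding.
function imageToDataUri(bytes, mime) {
  // Base64-encode the raw bytes and wrap them in a data: URI.
  return `data:${mime};base64,${Buffer.from(bytes).toString("base64")}`;
}

// e.g. <img src="data:image/png;base64,iVBORw0..."> needs no extra file.
```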
It costs me only several cents to parse an entire page, and I think OP can make some money out of this if they get the pricing right.
Also, some unsolicited feedback on the API:
- An option to enable/disable JavaScript would be great, since not all pages actually need it enabled to be parsable.
- You can probably tweak the headers of the headless browser to bypass the paywalls of some sites. Some are as simple as setting the user agent to a crawler bot (like `googlebot`).
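On the request side, the user-agent trick is just a custom header. A sketch using Node 18+'s global fetch (the UA string is Googlebot's published one; whether a given site honors it varies, and you should check the site's terms first):

```javascript
// Sketch: presenting a crawler User-Agent when fetching a page.
// Some paywalled sites serve the full article to search-engine bots.
const BOT_UA =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

function crawlerHeaders() {
  return new Headers({
    "User-Agent": BOT_UA,
    "Accept": "text/html",
  });
}

// Usage (Node 18+, global fetch):
// const res = await fetch("https://example.com/article", { headers: crawlerHeaders() });
```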
Can you expand on this statement "It costs me only several cents to parse an entire page"? That sounds like quite a lot to me. We're talking _maybe_ a few seconds of compute time (if things are really slow) + a trivial amount of bandwidth.
Are you dividing the monthly hosting costs for a server by total seconds spent actually running this tool? I'm thinking if you did this with an AWS lambda it'd be free (maybe bandwidth cost, but again, trivial) unless you had way, _way_ more use than a single person could reasonably generate. Also, free if you used any of the free hosting services and were just doing it for a small number of users.
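To put a rough number on the Lambda point, a back-of-envelope using AWS's published per-GB-second rate at the time of writing (verify current pricing before relying on it):

```javascript
// Back-of-envelope: cost of one parse on AWS Lambda (x86 list price per
// GB-second at time of writing; there is also a monthly free tier of
// 400,000 GB-seconds, which would make light use effectively $0).
const GB_SECOND_PRICE = 0.0000166667; // USD per GB-second
const memoryGb = 1;  // a 1 GB function
const seconds = 2;   // generous per-page compute estimate
const costPerPage = GB_SECOND_PRICE * memoryGb * seconds;
// ≈ $0.00003 per page: orders of magnitude below "several cents"
```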
OP here, I've added server timing headers to https://content-parser.com/; the total fetch+parse is taking around 0.6-1.2s. The local parsing as a separate step is synchronous, so I expected it to be negligible, but it actually takes a good chunk of that, often 500-700ms! A lot more than I expected. I haven't seen any backend errors yet, but at some point I might have to move this to a different thread or similar.
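For reference, those Server-Timing values are just a comma-separated header the browser devtools know how to display; a sketch of emitting them (the helper name is mine):

```javascript
// Sketch: formatting a Server-Timing header value so browser devtools
// show fetch vs. parse durations as separate entries.
function serverTimingValue(metrics) {
  // metrics: e.g. { fetch: 412, parse: 650 } — durations in milliseconds
  return Object.entries(metrics)
    .map(([name, dur]) => `${name};dur=${dur}`)
    .join(", ");
}

// e.g. res.setHeader("Server-Timing", serverTimingValue({ fetch: 412, parse: 650 }));
// sends: Server-Timing: fetch;dur=412, parse;dur=650
```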
Thanks for the feedback! So far I plan on making this a stepping stone toward a fully integrated HN reader, where you can read the whole thing in-page, and where pages that cannot be parsed (paywalls etc.) just redirect to the original. I prefer not to circumvent any barriers or hide the user agent; in my situation I'd rather just redirect to the original.
I should also look into a better HTML-to-Markdown parser, thanks for the recommendation! From the "example", yes, you guessed "readability" perfectly. For downloading the page it's just fetch() + jsdom.
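The overall shape of that pipeline, with the three pieces (fetch, Readability-style extraction, HTML-to-Markdown conversion) injected as plain functions so the sketch stays self-contained; the real wiring would use jsdom, @mozilla/readability, and an HTML-to-Markdown library like turndown:

```javascript
// Sketch of the fetch() + jsdom + Readability (+ markdown converter)
// pipeline. Dependencies are injected so this runs without the actual
// libraries; the comments note the real calls.
async function pageToMarkdown(url, { fetchHtml, extractArticle, toMarkdown }) {
  const html = await fetchHtml(url);    // real: (await fetch(url)).text()
  const article = extractArticle(html); // real: new Readability(new JSDOM(html).window.document).parse()
  return {                              // Readability yields { title, content, ... }
    title: article.title,
    markdown: toMarkdown(article.content), // real: new TurndownService().turndown(content)
  };
}
```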
Suggestions:
- [JS]: I use fetch+jsdom, so no JS is executed at all! I've found most content-heavy websites (i.e. articles, blog posts, etc.) are server-side rendered; I haven't tested too many, but so far no issues without JS. I might move to puppeteer at some point, either for pages that fail to parse with jsdom or for a domain whitelist if I keep one.
- [header]: Already mentioned
- [Front matter]: Right now I'm actually returning two custom headers, `title` and `url`, and might add more in the future. I did consider front matter, but I want to keep the body as "raw" as possible.
- Edit: what I'm considering next is an endpoint to download articles with basic HTML style, or as pdf/epub.
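For comparison, the front-matter alternative discussed above is just a prefix on the body; a sketch (the helper name is mine) of what the response body could look like versus the current `title`/`url` headers:

```javascript
// Sketch: prepending YAML front matter to the markdown body, as an
// alternative to custom `title`/`url` response headers.
function withFrontMatter(markdown, meta) {
  const yaml = Object.entries(meta)
    .map(([key, value]) => `${key}: ${JSON.stringify(value)}`)
    .join("\n");
  return `---\n${yaml}\n---\n\n${markdown}`;
}

// withFrontMatter("# Title\n\nBody...", { title: "Title", url: "https://..." })
// keeps the metadata in-band, at the cost of a no-longer-"raw" body.
```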
https://github.com/benprew/clippy