Currently, for the test page https://github.com/letmutex/htmd/blob/main/examples/page-to-..., the debug build is slower than turndown.js (~750ms vs ~670ms on my machine); the release build brings that down to ~170ms. It can definitely be faster: at the very least, the debug build shouldn't be slower than turndown.js.
I haven't checked which parts can be improved, so I'm not sure how much time we can save after optimization.
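For anyone who wants to reproduce the comparison, this is a minimal way to time a single conversion (just a sketch: it assumes the crate's top-level convert function, and "page.html" is a placeholder for wherever you saved the test page):

    // Rough timing sketch, not a rigorous benchmark: reads the saved test
    // page and times one conversion. Assumes htmd's top-level `convert`
    // function; the file path is a placeholder.
    use std::time::Instant;

    fn main() {
        let html = std::fs::read_to_string("page.html").expect("read test page");
        let start = Instant::now();
        let markdown = htmd::convert(&html).expect("convert");
        println!("{} bytes of markdown in {:?}", markdown.len(), start.elapsed());
    }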
The comparator, turndown.js, is built on top of domino.js, a mature HTML library written as a more performance-focused take on Mozilla's dom.js (an already mature project in itself). So even as slow as Node.js itself may be, you're running up against some pretty well-crafted libraries in an ecosystem with a long lineage of HTML handling.
I see that you took the test cases from Turndown. However, Turndown isn't actually that accurate. This is especially noticeable when converting entire websites.
The best comparison would be against Pandoc. That is (in my opinion) the best HTML-to-Markdown converter right now.
It is extremely difficult to handle every edge case, though. As an example, this usually causes problems:
<p>nitty<em>-gritty-</em>details</p>
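A quick way to see why (just a sketch, using pulldown-cmark as a reference CommonMark parser, nothing specific to any converter): the naive conversion is nitty*-gritty-*details, and CommonMark refuses to treat those asterisks as emphasis because the opening * sits between a letter and a hyphen, so it isn't a left-flanking delimiter.

    // Render the naive conversion back to HTML with a CommonMark parser and
    // note that the emphasis is gone.
    use pulldown_cmark::{html, Parser};

    fn main() {
        let naive = "nitty*-gritty-*details";
        let mut out = String::new();
        html::push_html(&mut out, Parser::new(naive));
        println!("{out}"); // <p>nitty*-gritty-*details</p> (literal asterisks, no <em>)
    }

A converter has to move the hyphens outside the emphasis markers, escape something, or drop the <em> entirely, and which of those counts as "correct" is a judgment call.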
Note: six years ago I open-sourced a Golang library [1]. I'm currently rewriting it completely with the aim of getting even better than Pandoc, and I wrote about the edge cases I encountered [2].
To add to that, an additional benefit would be that you can compile and release it as a Python package (PyO3/maturin) or compile it to WASM so it runs in the browser (with JavaScript bindings). This makes the code portable while benefiting from Rust's performance and memory safety.
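The Python side really is only a few lines of glue. A minimal sketch (the module and function names here are made up, and it assumes htmd exposes a top-level convert function that returns a Result):

    use pyo3::exceptions::PyValueError;
    use pyo3::prelude::*;

    /// Convert an HTML string to Markdown.
    #[pyfunction]
    fn html_to_markdown(html: &str) -> PyResult<String> {
        htmd::convert(html).map_err(|e| PyValueError::new_err(e.to_string()))
    }

    #[pymodule]
    fn htmd_py(m: &Bound<'_, PyModule>) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(html_to_markdown, m)?)?;
        Ok(())
    }

maturin handles building the wheel from there, and the wasm-bindgen route for the browser is a similar amount of glue.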
My use case involves scraping job boards so that I don't have to doomscroll them myself anymore, and storing them in Markdown makes them smaller while also removing a bunch of extraneous classes and structure.
Further, the side project I'm working on for managing all of this can then render them in a way that makes sense.
I created an RSS reader that has a uniform reader mode, and I use something similar to this to convert each article into a consistent format. I'm sure there are many other use cases too.
I had to do this to recover my personal blog after both it and the backups were lost in two unrelated snafus during COVID. I downloaded the pages from the Internet Archive, used my own shell script to extract the text as Markdown, and republished it with a static site generator.
Not exactly a common use case, I suppose, but it's good to be able to do this.
Advent of Code exercises are almost pure markdown, but rendered to HTML.
I've sometimes been converting them back to Markdown to include the text for each exercise alongside my solutions.
In my case I used a custom HTML-to-Markdown converter built specifically to support only what those Advent of Code exercises needed.
If you want to store Markdown in your database but you want the user to work in a basic WYSIWYG/contenteditable editor, this lets you avoid shipping a full-blown Markdown editor.
Similarly for sanitization: if you want to really reduce the set of allowed tags and normalize input, it's a pretty good intermediate format. I did this for a classified-ads site that included listings from external companies... it was easy enough to shift H1-H3 down to H4-H6, keep the basic formatting elements, and drop the rest (a rough sketch of that heading shift is below).
Markdown in the database is also easier to look at and reason about, and it takes up a lot less space. Especially if the content was pasted from, say, MS Word into a contenteditable field... omg, the level of chaos there.
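Something like this naive pass over the converted Markdown is all it takes (my own rough sketch, not the actual code from that project; it doesn't bother skipping fenced code blocks):

    // Demote ATX headings by `by` levels, capping at h6, after the HTML has
    // already been converted to Markdown.
    fn demote_headings(markdown: &str, by: usize) -> String {
        markdown
            .lines()
            .map(|line| {
                let hashes = line.chars().take_while(|&c| c == '#').count();
                // Only treat "# Title" style lines as headings.
                if hashes > 0 && line[hashes..].starts_with(' ') {
                    let level = (hashes + by).min(6);
                    format!("{} {}", "#".repeat(level), line[hashes..].trim_start())
                } else {
                    line.to_string()
                }
            })
            .collect::<Vec<_>>()
            .join("\n")
    }

    // demote_headings(&md, 3) turns "# Title" into "#### Title" and "### Sub" into "###### Sub".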
I swim around a lot in the "XML High Priesthood" pool, and the latest new thing is this: AI (sucking down unstructured documents) isn't capable of functioning efficiently without a knowledge graph, and donchaknow a complex XML schema and a knowledge graph are practically the same thing.
So they're gluing on some new functionality to try to get writer teams to take the plunge and - same old, same old - buy multimillion-dollar tools to make PDFs with. One sign of a terminal bagholder is seeing the same tech come up every few years with the latest fashionable thing stapled on its face. They went through a "blockchain" phase too, where all the individual document elements would be addressable "through the chain".
Anyway, thing is, there's a teensy shred of truth in what they're saying, but everything else they're suggesting would, I think, either not work at all or make retrieval even less dependable. And to do what they're trying to do, you don't actually need a gigantic, full-on XML schema. Using AsciiDoc roles consistently would get you the same benefit, and would save a hell of a lot of space in an already very limited context window.
Additionally, when you have strict input token limits, it's way easier to chunk Markdown while keeping track of context than it is to chunk HTML at all.
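The heading structure is what makes the bookkeeping easy. A rough sketch of the kind of heading-aware chunker I mean (all names and the character budget are made up):

    // Split on ATX headings, carry the heading trail along as context, and
    // cut a new chunk when the character budget is exceeded.
    fn chunk_markdown(markdown: &str, max_chars: usize) -> Vec<String> {
        fn flush(current: &mut String, path: &[String], chunks: &mut Vec<String>) {
            if !current.trim().is_empty() {
                let context = path.join(" > ");
                let body = current.trim().to_string();
                chunks.push(if context.is_empty() { body } else { format!("{context}\n{body}") });
            }
            current.clear();
        }

        let mut chunks = Vec::new();
        let mut path: Vec<String> = Vec::new(); // current heading trail, e.g. ["# API", "## Auth"]
        let mut current = String::new();

        for line in markdown.lines() {
            let level = line.chars().take_while(|&c| c == '#').count();
            if level > 0 && line[level..].starts_with(' ') {
                // New heading: close the current chunk and update the trail.
                flush(&mut current, &path, &mut chunks);
                path.truncate(level - 1);
                path.push(line.to_string());
            } else {
                if current.len() + line.len() > max_chars {
                    flush(&mut current, &path, &mut chunks);
                }
                current.push_str(line);
                current.push('\n');
            }
        }
        flush(&mut current, &path, &mut chunks);
        chunks
    }

Doing the same thing on raw HTML means walking the DOM and deciding which ancestor elements count as context, which is far more work.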
Okay, maybe I’m way off base here, but is this fast? 1.4MB of Wikipedia page is, what, 20k lines? This doesn’t sound like fast Rust to me.
I would guess that the amount of HTML parsing happening is way more than is actually needed to produce the Markdown.