Htmd: A turndown.js inspired HTML-to-Markdown converter for Rust (github.com/letmutex)
96 points by letmutex 10 months ago | 28 comments



> Fast, it takes less than 200ms to convert a ~1.4MB Wikipedia page on an i5 7th gen CPU

Okay, maybe I’m way off base here, but is this fast? 1.4MB of Wikipedia page is, what, 20k lines? This doesn’t sound like fast Rust to me.

I would guess that the amount of HTML parsing that’s happening is way more than is actually needed to render markdown.


Author here, I'm working on making it faster.

Currently, for the test page https://github.com/letmutex/htmd/blob/main/examples/page-to-..., the debug build is slower than turndown.js (~750ms vs ~670ms on my machine); the release build brings that down to ~170ms. It can definitely be faster. At the very least, the debug build shouldn't be slower than turndown.js.

I haven't checked which parts can be improved, so I'm not sure how much time we can save after optimization.


I'm guessing it's fast compared to the javascript version they ported to rust?


The slow part here is going to be HTML handling.

The comparator - turndown.js - is built on top of domino.js, a mature HTML lib built as a more performance-focused version of Mozilla's DOMJS (an already mature project in itself), so even as slow as NodeJS itself may be, you're running up against some pretty well-crafted libraries in an ecosystem with a long lineage of HTML handling.


Cool to see another library in this space!

I see that you took the test cases from Turndown. However, Turndown isn’t actually that accurate. This is especially noticeable when converting entire websites.

The best comparison would be against Pandoc. That is (in my opinion) the best html to markdown converter right now.

Although it is extremely difficult to handle every edge case. As an example, this usually causes problems:

  <p>nitty<em>-gritty-</em>details</p>

Note: Six years ago I open sourced a Golang library [1]. Currently I am rewriting it completely with the aim of getting even better than Pandoc, and I wrote about the edge cases I encountered [2].

[1] https://github.com/JohannesKaufmann/html-to-markdown

[2] https://html-to-markdown.com/edge-cases
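
The comment does not say why the snippet above causes problems, so the following is an assumption: a naive conversion yields nitty*-gritty-*details, and under CommonMark's flanking rules a * that follows a letter and is followed by punctuation cannot open emphasis. A minimal Rust sketch of that check, ASCII-only, using a hypothetical helper named is_left_flanking (not taken from htmd, Turndown, or Pandoc):

```rust
/// Simplified, ASCII-only take on CommonMark's "left-flanking" test for the
/// delimiter run starting at byte index `i` with length `len`.
/// Hypothetical helper for illustration; real converters handle Unicode too.
fn is_left_flanking(text: &str, i: usize, len: usize) -> bool {
    let bytes = text.as_bytes();
    let before = if i == 0 { None } else { Some(bytes[i - 1] as char) };
    let after = bytes.get(i + len).map(|b| *b as char);

    // The start and end of the text count as whitespace in the spec.
    let is_ws = |c: Option<char>| c.map_or(true, |c| c.is_ascii_whitespace());
    let is_punct = |c: Option<char>| c.map_or(false, |c| c.is_ascii_punctuation());

    // A run followed by whitespace can never open emphasis.
    if is_ws(after) {
        return false;
    }
    // If punctuation follows, the run must be preceded by whitespace
    // or punctuation to count as left-flanking.
    !is_punct(after) || is_ws(before) || is_punct(before)
}
```

With this check, the * at byte 5 of nitty*-gritty-*details is not left-flanking, so a converter that emits it verbatim would likely lose the emphasis; the same run in "nitty *-gritty-* details" is left-flanking because it follows a space.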


Thanks for the information! This is really helpful, glad to know these resources for improving it.


I'm curious why one would use this vs something like pandoc?


Probably because it's rust. Realistically, you would only need it when you have already written a part of your software in rust.


To add to that, an additional benefit would be that you can compile and release it as a Python package (PyO3/maturin) or compile it to WASM so it runs in the browser (with JavaScript bindings). This makes the code portable while benefiting from Rust's performance/memory safety.


The majority of use cases are surely as a separate binary and not integrated into your code?


Nice! I made a CLI in Go for quickly converting markdown to html some time ago: https://github.com/thebigbone/markhtml


Why even convert HTML to markdown? Isn't it usually the other way around?


1. For reading. I simplify all documents to Markdown, then render back to HTML on my readers (e-readers and RSS)

2. Saves context for LLMs, and they are often trained on Markdown, so they work best with it.

3. For search. Can search markdown much better than html with postgres

Wrote https://markdown.download to help me with these


We have HTML templates for sending transactional email in our SaaS applications.

The templates are very basic, eg. <p> <b> <a>, etc. Users can also customise these via a WYSIWYG editor.

We then use turndown.js to convert the rendered HTML email to markdown which we then use for the text version of the email.


Storage.

My use case involves scraping job boards so that I don't have to doomscroll them myself anymore, and storing them in Markdown makes them smaller while also removing a bunch of extraneous classes and structure.

Further, the side project I'm working on for managing all of this can then render them in a way that makes sense.


you might be interested in https://www.kadoa.com/use-cases/jobs if you prefer to fully automate the process of job boards scraping


Only use case I can think of is to save web pages to your Markdown notes. Web links usually break after a year or two. Unfortunately.


I created an RSS reader which has a uniform reader mode. I use something similar to this to parse each RSS article to a similar format. I'm sure there are many other use cases also.


I had to do this to recover my personal blog after both it and the backups had been lost due to two unrelated snafus during covid. I downloaded the pages from the internet archive and used my own shellscript to extract the text as markdown and then republished it using a static site generator.

Not exactly a common use case, I wouldn't think, but it's good to be able to do this.


Advent of Code exercises are almost pure markdown, but rendered to HTML.

I’ve sometimes been converting it back to md to include the text for each exercise alongside my solutions.

In my case I used a custom HTML to Markdown converter that was specifically built to support only what I needed in order to convert those Advent of Code exercises to markdown.

Mine was also written in Rust.


If you want to store markdown in your database but you want the user to use a basic WYSIWYG/contenteditable editor, it lets you skip the full-blown markdown editor.


Similarly for sanitization, if you want to really reduce the subset of allowed tags, and to normalize input it's a pretty good intermediate format. Did this for a classified ads site, that included listings from external companies... it was easy enough to shift H1-H3 to H4-H6, keep basic formatting elements and eliminate the rest.

Markdown in the database is also easier to look at, reason with and takes up a lot less space. Especially if the content was pasted from say MS-Word to a Content-Editable field... omg the level of chaos there.
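
The heading shift described above (H1-H3 down to H4-H6) is simple to do once the content is Markdown. A rough Rust sketch, assuming ATX-style headings ("# Title") at column 0 and no fenced code blocks in the input; shift_headings is a hypothetical name, not an API from any library mentioned in this thread:

```rust
/// Shift ATX headings down by `by` levels, capping at level 6.
/// Assumptions: headings start at column 0 and the input has no fenced
/// code blocks (a "# comment" line inside code would be shifted too).
/// Note: a trailing newline in the input is not preserved.
fn shift_headings(markdown: &str, by: usize) -> String {
    let mut out: Vec<String> = Vec::new();
    for line in markdown.lines() {
        // Count leading '#' characters; '#' is one byte, so the count
        // is also a valid byte offset into the line.
        let level = line.chars().take_while(|c| *c == '#').count();
        let rest = &line[level..];
        // A heading needs 1 to 6 '#' followed by a space (or nothing).
        let is_heading =
            (1..=6).contains(&level) && (rest.is_empty() || rest.starts_with(' '));
        if is_heading {
            let new_level = (level + by).min(6);
            out.push(format!("{}{}", "#".repeat(new_level), rest));
        } else {
            out.push(line.to_string());
        }
    }
    out.join("\n")
}
```

Calling shift_headings(text, 3) maps levels 1, 2, 3 to 4, 5, 6; anything already deeper is clamped to level 6, and lines like "#hashtag" (no space after the #) are left alone.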


To feed content to LLMs


"But it's not structured!"

I swim around a lot in the "XML High Priesthood" pool, and the latest new thing is this: AI (sucking down unstructured documents) isn't capable of efficient functioning without Knowledge Graph, and donchaknow a complex XML schema and a knowledge graph are practically the same thing.

So they're glueing on some new functionality to try and get writer teams to take the plunge and - same old same old - buy multimillion dollar tools to make PDFs with. One sign of a terminal bagholder is seeing the same tech come up every few years with the latest fashionable thing stapled on its face. They went through a "blockchain" phase too, where all the individual document elements would be addressable "through the chain".

Anyway . . .

Anyway, thing is, there's a teensy shred of truth in what they're saying, but everything else about what they're suggesting would, I think, either not work at all, or make retrieval even less dependable. Also, to do what they're trying to do, you don't actually need a gigantic full on XML schema. Using Asciidoc roles consistently would get you the same benefit, and would save a hell of a lot of space in a very limited window.


Yeah, this. Markdown uses fewer tokens than HTML and most LLMs have been trained on large amounts of Markdown.

That's why tools like this exist: https://jina.ai/reader/

Demo: https://r.jina.ai/https://news.ycombinator.com/item?id=40695...


Additionally, when you have strict input token limits: it’s way easier to chunk Markdown while keeping track of context than it is to chunk HTML at all.
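
One way to read "chunking while keeping track of context" is to split on headings and attach the current heading trail to each chunk. A minimal Rust sketch under that assumption; Chunk and chunk_by_heading are hypothetical names, and it ignores token counting, setext headings, and fenced code blocks:

```rust
/// One chunk of Markdown plus the heading trail it appeared under.
#[derive(Debug, PartialEq)]
struct Chunk {
    context: Vec<String>, // e.g. ["Guide", "Install"]
    body: String,
}

/// Split Markdown at ATX headings, remembering the enclosing headings.
fn chunk_by_heading(markdown: &str) -> Vec<Chunk> {
    // Push the accumulated body (if non-empty) as a chunk under `trail`.
    fn flush(trail: &[(usize, String)], body: &mut String, chunks: &mut Vec<Chunk>) {
        let text = body.trim();
        if !text.is_empty() {
            chunks.push(Chunk {
                context: trail.iter().map(|(_, t)| t.clone()).collect(),
                body: text.to_string(),
            });
        }
        body.clear();
    }

    let mut trail: Vec<(usize, String)> = Vec::new(); // (level, title)
    let mut chunks = Vec::new();
    let mut body = String::new();

    for line in markdown.lines() {
        let level = line.chars().take_while(|c| *c == '#').count();
        let is_heading = (1..=6).contains(&level) && line[level..].starts_with(' ');
        if is_heading {
            flush(&trail, &mut body, &mut chunks);
            // Drop headings at the same or a deeper level, then record this one.
            while trail.last().map_or(false, |(l, _)| *l >= level) {
                trail.pop();
            }
            trail.push((level, line[level..].trim().to_string()));
        } else {
            body.push_str(line);
            body.push('\n');
        }
    }
    flush(&trail, &mut body, &mut chunks);
    chunks
}
```

Each chunk can then be sent with its context joined as a prefix (for example "Guide > Install"), so a model sees where the text sits in the document even when the chunks are processed separately. Doing the same on raw HTML would require a DOM walk first.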


fun project, saving pages to Obsidian


Obsidian supports html->md natively. Just copy the text in the browser and paste it into Obsidian.



