gradientDissent's comments

gradientDissent · on July 23, 2024

Nice work. Main content extraction based on the <main> tag won’t work on most web pages these days. Arc90's Readability heuristics could help.

leroman · on July 23, 2024

Thank you! This is exactly why there's support for this specific use case: https://github.com/romansky/dom-to-semantic-markdown/blob/ma... (see `findContentByScoring`)

And if you pass the optional `extractMainContent` flag, it will use some heuristics to find the main content container when there is no such tag.
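
For a rough idea of what a scoring heuristic in this spirit can look like, here is a minimal TypeScript sketch. It is not the library's `findContentByScoring`; the `SketchNode` type, the function names, and the weights are all hypothetical, loosely modeled on Readability-style scoring (reward long, comma-rich paragraphs, penalize link-heavy containers).

```typescript
// Hypothetical simplified node type; real code would walk DOM elements instead.
interface SketchNode {
  tag: string;
  text?: string; // text owned directly by this node
  children?: SketchNode[];
}

// Concatenate all text found in a subtree.
function collectText(node: SketchNode): string {
  const own = node.text ?? "";
  return (node.children ?? []).reduce((acc, c) => acc + collectText(c), own);
}

// Count the characters that sit inside <a> elements of a subtree.
function linkLength(node: SketchNode): number {
  if (node.tag === "a") return collectText(node).length;
  return (node.children ?? []).reduce((sum, c) => sum + linkLength(c), 0);
}

// Score a container by its direct <p> children (assumed weights, not the library's).
function scoreNode(node: SketchNode): number {
  let score = 0;
  for (const child of node.children ?? []) {
    if (child.tag !== "p") continue;
    const text = collectText(child);
    if (text.length < 25) continue; // ignore very short paragraphs
    const commas = text.split(",").length - 1;
    score += 1 + commas + Math.min(Math.floor(text.length / 100), 3);
  }
  const total = collectText(node).length;
  const linkDensity = total === 0 ? 0 : linkLength(node) / total;
  return score * (1 - linkDensity); // link-heavy containers lose score
}

// Return the highest-scoring node, falling back to the root if nothing scores.
function findMainContent(root: SketchNode): SketchNode {
  let best = root;
  let bestScore = scoreNode(root);
  const stack: SketchNode[] = [...(root.children ?? [])];
  while (stack.length > 0) {
    const node = stack.pop()!;
    const score = scoreNode(node);
    if (score > bestScore) {
      best = node;
      bestScore = score;
    }
    stack.push(...(node.children ?? []));
  }
  return best;
}

// Toy page with no <main> tag: navigation, an article container, a footer.
const nav: SketchNode = {
  tag: "nav",
  children: [{ tag: "a", text: "Home" }, { tag: "a", text: "About" }],
};
const article: SketchNode = {
  tag: "div",
  children: [
    { tag: "p", text: "Pages without a main tag need heuristics, such as scoring paragraphs by length." },
    { tag: "p", text: "Containers that hold mostly links, like menus, are unlikely to be the content." },
  ],
};
const footer: SketchNode = { tag: "footer", children: [{ tag: "p", text: "Copyright" }] };
const page: SketchNode = { tag: "body", children: [nav, article, footer] };

const picked = findMainContent(page);
console.log(picked === article);
```

On this toy page the article `div` is the only container with paragraphs long enough to score, so it is the one picked; the nav and footer score zero.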