Hacker News | gradientDissent's comments

Nice work. Extracting the main content based on the <main> tag won't work on most web pages these days, since many of them don't use it. Arc90's Readability heuristics could help.


Thank you! This is exactly why there's support for this specific use case: https://github.com/romansky/dom-to-semantic-markdown/blob/ma... (see `findContentByScoring`).

If you pass the optional `extractMainContent` flag, it uses some heuristics to find the main content container when there is no such tag.
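For anyone curious what that kind of heuristic can look like, here is a rough sketch of Readability-style scoring. To be clear, this is not the code behind `findContentByScoring`; the `SketchNode` shape, the function names, and the thresholds are all made up for illustration, and a real implementation would walk actual DOM elements.

```typescript
// Hypothetical, simplified node shape (an assumption, not the library's type).
interface SketchNode {
  tag: string;
  id?: string;
  text?: string; // text owned directly by this node
  children?: SketchNode[];
}

// Text length of a subtree; with linksOnly, count only text inside <a> tags.
function textLength(node: SketchNode, linksOnly = false, inLink = false): number {
  const here = inLink || node.tag === "a";
  const own = (!linksOnly || here) ? (node.text ?? "").length : 0;
  let sum = own;
  for (const child of node.children ?? []) sum += textLength(child, linksOnly, here);
  return sum;
}

// Points a single paragraph contributes: base 1, plus commas, plus up to 3 for length.
function paragraphPoints(node: SketchNode): number {
  if (node.tag !== "p") return 0;
  const len = textLength(node);
  if (len < 25) return 0; // ignore very short paragraphs (arbitrary threshold)
  const commas = (node.text ?? "").split(",").length - 1;
  return 1 + commas + Math.min(Math.floor(len / 100), 3);
}

// A container gets full points from child paragraphs and half from
// grandchild paragraphs, then is penalized by its link density.
function candidateScore(node: SketchNode): number {
  let raw = 0;
  for (const child of node.children ?? []) {
    raw += paragraphPoints(child);
    for (const grandchild of child.children ?? []) raw += paragraphPoints(grandchild) / 2;
  }
  const total = textLength(node);
  const linkDensity = total === 0 ? 0 : textLength(node, true) / total;
  return raw * (1 - linkDensity);
}

// Collect every node of the tree in document order.
function flatten(node: SketchNode, out: SketchNode[] = []): SketchNode[] {
  out.push(node);
  for (const child of node.children ?? []) flatten(child, out);
  return out;
}

// Return the highest-scoring container, or null if nothing scores above zero.
function findMainContentSketch(root: SketchNode): SketchNode | null {
  let best: SketchNode | null = null;
  let bestScore = 0;
  for (const node of flatten(root)) {
    const score = candidateScore(node);
    if (score > bestScore) {
      best = node;
      bestScore = score;
    }
  }
  return best;
}

// A page with no <main> tag: a link-heavy menu and a text-heavy post.
const page: SketchNode = {
  tag: "body",
  children: [
    { tag: "div", id: "menu", children: [{ tag: "a", text: "Home" }, { tag: "a", text: "About" }] },
    {
      tag: "div",
      id: "post",
      children: [
        { tag: "p", text: "Most pages today ship without a <main> element, so extraction needs a fallback." },
        { tag: "p", text: "Scoring paragraphs by length, commas, and link density is one common fallback." },
      ],
    },
  ],
};

console.log(findMainContentSketch(page)?.id); // logs "post"
```

The half-weight for grandparents is what keeps the outermost wrapper from winning just because it contains everything, and the link-density penalty is what pushes menus and footers down.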

