
> I imagine it starts with "links -dump", but then there's using the title as the filename,

The title may exceed the filename length limit, be identical across nested pages, or contain newlines that must be escaped.
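A rough sketch of how those three problems could be handled, using only the Python standard library. The function name `title_to_filename`, the 255-byte limit (typical for Linux filesystems, but not universal), the `.txt` extension, and the 8-character URL hash suffix are all assumptions for illustration, not part of any of the tools mentioned here.

```python
import hashlib
import re
import unicodedata

# Assumed per-filename limit in bytes; common on Linux, but check your filesystem.
MAX_BYTES = 255


def title_to_filename(title: str, url: str, ext: str = ".txt") -> str:
    """Build a filesystem-safe filename from a page title and its URL."""
    # Normalize Unicode and collapse whitespace runs (including newlines) to one space.
    text = unicodedata.normalize("NFKC", title)
    text = re.sub(r"\s+", " ", text).strip()
    # Replace path separators and remaining control characters.
    text = re.sub(r"[/\\\x00-\x1f]", "_", text)
    text = text or "untitled"
    # A short hash of the URL disambiguates pages that share the same title.
    suffix = "-" + hashlib.sha256(url.encode("utf-8")).hexdigest()[:8] + ext
    # Truncate by bytes, not characters, and drop any partially cut character.
    budget = MAX_BYTES - len(suffix.encode("utf-8"))
    stem = text.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return stem + suffix
```

Calling `title_to_filename(title, url)` on the output of whatever fetches the page gives a name that fits the assumed limit and differs between pages with the same title, as long as their URLs differ. It replaces newlines rather than escaping them, which loses information but keeps the name readable.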

These might be helpful for your use case:

"Newspaper3k: Article scraping & curation" https://github.com/codelucas/newspaper

lazyNLP "Library to scrape and clean web pages to create massive datasets" https://github.com/chiphuyen/lazynlp/blob/master/README.md#s...

scrapinghub/extruct https://github.com/scrapinghub/extruct

> extruct is a library for extracting embedded metadata from HTML markup.

> It also has a built-in HTTP server to test its output as JSON.

> Currently, extruct supports:

> - W3C's HTML Microdata

> - embedded JSON-LD

> - Microformat via mf2py

> - Facebook's Open Graph

> - (experimental) RDFa via rdflib
