The title tag may exceed the filename length limit, be the same for nested pages, or contain newlines that must be escaped.
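If you still want title-based filenames, here is a rough sketch of one way to handle all three problems using only the standard library. The function name `title_to_filename`, the 255-byte limit, and the idea of appending a short hash of the URL are my own assumptions, not something from your setup, so adjust them to your filesystem and naming scheme.

```python
import hashlib
import re

# Assumed limit: many filesystems cap a single filename at 255 bytes.
MAX_BYTES = 255


def title_to_filename(title: str, url: str, ext: str = ".html",
                      max_bytes: int = MAX_BYTES) -> str:
    """Hypothetical helper: turn a page title into a safe, unique filename."""
    # Collapse newlines, tabs, and runs of whitespace into single spaces.
    name = re.sub(r"\s+", " ", title).strip()
    # Replace path separators and other characters that are unsafe in filenames.
    name = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", name)
    name = name or "untitled"
    # A short hash of the URL keeps pages with identical titles apart.
    suffix = "-" + hashlib.sha1(url.encode("utf-8")).hexdigest()[:8] + ext
    budget = max_bytes - len(suffix.encode("utf-8"))
    # Truncate by bytes, silently dropping a cut-off multibyte character.
    truncated = name.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return truncated.rstrip() + suffix
```

Two nested pages that share a title then get different names because their URLs differ, and a very long title is cut down to fit the byte budget rather than failing at write time.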
These might be helpful for your use case:
"Newspaper3k: Article scraping & curation"
"lazyNLP: Library to scrape and clean web pages to create massive datasets"
> extruct is a library for extracting embedded metadata from HTML markup.
> It also has a built-in HTTP server to test its output as JSON.
> Currently, extruct supports:
> - W3C's HTML Microdata
> - embedded JSON-LD
> - Microformat via mf2py
> - Facebook's Open Graph
> - (experimental) RDFa via rdflib