
Ask HN: Scripts/commands for extracting URL article text? (links -dump but) - WCityMike
I'd like to have a Unix script that generates a text file, named with the page title, containing the article text neatly formatted.

This seems to me to be something so commonly desired that it would've been done a hundred times over by now, but I haven't found the magic search terms to dig up people's creations.

I imagine it starts with "links -dump", but then there's using the title as the filename, removing the padded left margin, wrapping the text, and removing all the excess linkage.

I'm a beginner-amateur when it comes to shell scripting, Python, etc. - I can Google well and usually understand script or program logic, but I don't have terms memorized.

Is this exotic enough that people haven't done it, or, as I suspect, does this already exist and I'm just not finding it? Much obliged for any help.
======
westurner
> _I imagine it starts with "links -dump", but then there's using the title as
> the filename,_

The title tag may exceed the filename length limit, be the same for nested
pages, or contain newlines that must be escaped.

These might be helpful for your use case:

"Newspaper3k: Article scraping & curation"
[https://github.com/codelucas/newspaper](https://github.com/codelucas/newspaper)

lazyNLP "Library to scrape and clean web pages to create massive datasets"
[https://github.com/chiphuyen/lazynlp/blob/master/README.md#s...](https://github.com/chiphuyen/lazynlp/blob/master/README.md#step-4-clean-the-webpages)

scrapinghub/extruct
[https://github.com/scrapinghub/extruct](https://github.com/scrapinghub/extruct)

> _extruct is a library for extracting embedded metadata from HTML markup._

> _It also has a built-in HTTP server to test its output as JSON._

> _Currently, extruct supports:_

> _\- W3C's HTML Microdata_

> _\- embedded JSON-LD_

> _\- Microformat via mf2py_

> _\- Facebook's Open Graph_

> _\- (experimental) RDFa via rdflib_
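
As a quick illustration of the kind of output extruct gives (the HTML here is a made-up minimal page with one embedded JSON-LD block):

    
    
      import extruct
    
      html = """<html><head>
      <script type="application/ld+json">
      {"@context": "https://schema.org", "@type": "Article", "headline": "An Example"}
      </script></head><body></body></html>"""
    
      # Extract only the JSON-LD metadata; other syntaxes can be listed too
      data = extruct.extract(html, syntaxes=["json-ld"])
      # data["json-ld"] is a list of the parsed JSON-LD objects
    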

------
WCityMike
Just for the record in case anyone digs this up on a later Google search:
install the newspaper and unidecode Python libraries (pip3 install; re and os
are in the standard library), then:

    
    
      from sys import argv
      from unidecode import unidecode
      from newspaper import Article
      import os
      import re
    
      script, arturl = argv
    
      url = arturl
      article = Article(url)
    
      article.download()
      article.parse()
    
      # Transliterate the title to ASCII, then slugify it for the filename
      title2 = unidecode(article.title)
    
      fname2 = title2.lower()
      fname2 = re.sub(r"[^\w\s]", '', fname2)
      fname2 = re.sub(r"\s+", '-', fname2)
    
      # Collapse runs of blank lines in the article text
      text2 = unidecode(article.text)
      text2 = re.sub(r'\n\s*\n', '\n\n', text2)
    
      # open() does not expand '~', so expand the path explicitly
      with open(os.path.expanduser('~/Desktop/' + fname2 + '.txt'), 'w') as f:
          f.write(title2 + '\n\n')
          f.write(text2 + '\n')
    

I execute it from the shell via:

    
    
      #!/bin/bash
      /usr/local/opt/python3/Frameworks/Python.framework/Versions/3.7/bin/python3 ~/bin/url2txt.py "$1"
    
    

If I want to run it on all the URLs in a text file:

    
    
      #!/bin/bash
      while IFS='' read -r l || [ -n "$l" ]; do
        ~/bin/u2t "$l"
      done < "$1"
      

I'm sure most of the coders here are wincing at one or multiple mistakes or
badly formatted items I've done here, but I'm open to feedback ...

~~~
westurner
There could be collisions where `fname2` is the same for different pages,
resulting in files being unintentionally overwritten. A few possible
solutions: generate a random string and append it to the filename; set
`fname2` to a hash of the URL; or replace unsafe filename characters like
'/', '\', and '\n' with e.g. underscores. IIRC, URLs can be longer than the
max filename length of many filesystems, so hashes as filenames are the
safest solution. You can generate an index of the fetched URLs and store it
with JSON or e.g. SQLite (with Records and/or SQLAlchemy, for example).
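
A minimal sketch of the hash-of-URL approach with a JSON index (the
`safe_filename`/`record_url` names and index layout are made up for
illustration):

    
    
      import hashlib
      import json
    
      def safe_filename(url):
          # SHA-256 of the URL: fixed length, filesystem-safe, collision-resistant
          return hashlib.sha256(url.encode("utf-8")).hexdigest() + ".txt"
    
      def record_url(index_path, url, title):
          # Keep a JSON index mapping hashed filenames back to URLs and titles
          try:
              with open(index_path) as f:
                  index = json.load(f)
          except FileNotFoundError:
              index = {}
          index[safe_filename(url)] = {"url": url, "title": title}
          with open(index_path, "w") as f:
              json.dump(index, f, indent=2)
    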

If or when you want to parallelize (to do multiple requests at once, because
most of the time is spent waiting for responses from the network), write
contention for the index may become an issue that SQLite solves better than a
flatfile locking mechanism like creating and deleting an index.json.lock.
requests3 and aiohttp-requests support asyncio; requests3 also supports
HTTP/2 and connection pooling.
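
Since the work is network-bound, even the standard library's thread pool gets
you most of the way; here's a rough sketch of the pattern, where `fetch_text`
is a placeholder for whatever per-URL work you do (e.g. newspaper's
download/parse):

    
    
      from concurrent.futures import ThreadPoolExecutor
    
      def fetch_text(url):
          # Placeholder: substitute a real download here, e.g.
          # Article(url).download(); Article.parse(); return article.text
          return "text for " + url
    
      def fetch_all(urls, max_workers=8):
          # Threads overlap nicely because each one mostly waits on the network
          with ThreadPoolExecutor(max_workers=max_workers) as pool:
              return dict(zip(urls, pool.map(fetch_text, urls)))
    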

SQLite can probably handle storing the text of as many pages as you throw at
it with the added benefit of full-text search. Datasette is a really cool
interface for sqlite databases of all sorts.
[https://datasette.readthedocs.io/en/stable/ecosystem.html#to...](https://datasette.readthedocs.io/en/stable/ecosystem.html#tools-for-creating-sqlite-databases)
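
For example, the full-text-search idea with the standard library's sqlite3
module and an FTS5 virtual table (the table and column names here are made
up, and FTS5 must be compiled into your SQLite build, which it is in typical
Python distributions):

    
    
      import sqlite3
    
      conn = sqlite3.connect(":memory:")
      # FTS5 virtual table: every column is full-text indexed
      conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body)")
      conn.execute("INSERT INTO pages VALUES (?, ?, ?)",
                   ("https://example.com/a", "First article", "penguins live in antarctica"))
      conn.execute("INSERT INTO pages VALUES (?, ?, ?)",
                   ("https://example.com/b", "Second article", "camels live in the desert"))
      # MATCH runs a full-text query across the indexed columns
      rows = conn.execute("SELECT url FROM pages WHERE pages MATCH 'penguins'").fetchall()
    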

...

Apache Nutch + ElasticSearch / Lucene / Solr are production-proven crawling
and search applications:
[https://en.m.wikipedia.org/wiki/Apache_Nutch](https://en.m.wikipedia.org/wiki/Apache_Nutch)

------
spaceprison
I don't know of a specific script, but you might be able to make something in
Python using the requests, beautifulsoup and markdownify modules.

requests to fetch the page, beautifulsoup to grab the tags you care about
(title info), and then markdownify to take the raw HTML and turn it into
Markdown.
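
A rough sketch of that pipeline (assuming `pip install requests
beautifulsoup4 markdownify`; the HTML is inlined here where a
`requests.get(url).text` call would go, so the example runs offline):

    
    
      from bs4 import BeautifulSoup
      from markdownify import markdownify
    
      # In practice: html = requests.get(url).text
      html = """<html><head><title>An Example Page</title></head>
      <body><h1>Heading</h1><p>Some <em>article</em> text.</p></body></html>"""
    
      soup = BeautifulSoup(html, "html.parser")
      title = soup.title.string      # grab the tag you care about
      markdown = markdownify(html)   # convert the raw HTML to Markdown
    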

