
Show HN: Gutenberg – A simple interface to the Project Gutenberg corpus - c-w
https://bitbucket.org/c-w/gutenberg
======
c-w
Hey all, OP here.

I built this because I think Project Gutenberg is a great resource for
NLP (e.g. stylometry, tracking writing styles over time, authorship detection,
...). I have wanted to use the Project Gutenberg data a number of times in the
past, but always ended up using another corpus because there wasn't an easy way
to access it. Hopefully this library fixes that.

The project is currently "works on my machine" quality, so please do report
any bugs you stumble across.

Also, if you can think of any use-cases for the Project Gutenberg data that
aren't easily doable using the functionality that is currently available in
the library, please let me know (e.g. by filing a ticket on the Bitbucket
repo).

------
sethish
This is fantastic!

I just made a GitHub repo for each Gutenberg book:
[https://github.com/GITenberg](https://github.com/GITenberg)

This will be very helpful; the XML/RDF files are a hassle.

~~~
walterbell
Are the GitHub repos intended to collect errata? Do you know of a database
that has metadata for all the Gutenberg books?

~~~
c-w
There's a database of RDF files that describe the books
([http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2](http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2)),
but it's a bit of a pain to use and doesn't link the books back to the API that
should be used for crawling Project Gutenberg
([http://www.gutenberg.org/robot/harvest](http://www.gutenberg.org/robot/harvest)).

~~~
sethish
I think the previous version of the metadata included a path to the FTP
server. Splitting the book id (4443 -> 4/4/4/4443) works for _most_ books, but
there were somewhere between 800 and 3000 books organized in a different
folder structure that I still need to track down.
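
The ID-splitting rule mentioned above can be sketched in Python roughly as follows. The function name `mirror_path` is hypothetical, and the handling of single-digit IDs (placing them under a `0` directory) is an assumption about the mirror layout rather than something stated in this thread. As noted, some books do not follow this scheme at all, so treat the result as a first guess.

```python
def mirror_path(book_id: int) -> str:
    """Build the directory path for a book ID, e.g. 4443 -> "4/4/4/4443"."""
    if book_id <= 0:
        raise ValueError("book_id must be a positive integer")
    digits = str(book_id)
    # Every digit except the last becomes one directory level.
    # Assumption: single-digit IDs have no leading digits and live under "0".
    parents = list(digits[:-1]) or ["0"]
    return "/".join(parents + [digits])


print(mirror_path(4443))  # 4/4/4/4443
```

A caller would still need a fallback (for example, a lookup table) for the books stored under a different folder structure.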

------
brickcap
Wow, this is just what I have been looking for. A few questions:

1\. Is it possible to download files in HTML format? I really prefer ebooks in
HTML because I can attach my preferred "readSettings.css" to them that way. *

2\. Is it possible to run a custom script after the book has completed
downloading?

3\. I don't understand what you mean by "Making meta-data about the texts
easily accessible through a database" in the description. Can you expand a bit
on this?

4\. Is it possible to specify other download contexts, like "genre"?

Oh, and thanks a lot for this :) I always wanted a command-line utility for
Project Gutenberg.

* I also think that HTML files are a lot easier to read on a phone, as you can style them, whereas with the txt files you have no choice but to scroll horizontally unless you are in landscape mode.

~~~
c-w
Your use-case is not what I built the library for (natural language
processing, not text consumption), but let's see what we can do...

You can download HTML E-Books using the following command:

    
    
      python -m gutenberg.download -vvv --filetypes=html --limit=5mb ./ebooks
    

This will download 5 MB of zipped E-Books that have an HTML version into the
./ebooks directory.

It seems as though the legal disclaimers and copyright notices in the HTML
files are all within <pre> tags, so we can easily clean up the files with a
small shell script:

    
    
      EBOOK_DIR="./ebooks"
    
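      # Unzip every archive in place. The -name patterns are quoted so the shell does not expand them first.
      find "${EBOOK_DIR}" -name '*.zip' -type f -exec unzip -d "${EBOOK_DIR}" {} \;
      # Delete every line from <pre> through </pre> (uses GNU sed's in-place -i option).
      find "${EBOOK_DIR}" -name '*.html' -type f -exec sed -i '/<[pP][rR][eE]>/,/<\/[pP][rR][eE]>/d' {} \;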
    

This will probably not work for all E-Books, but it'll give you something to
work with. Note that removing the copyright notices may or may not be against
the Project Gutenberg terms of service.
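
For anyone who would rather do the clean-up step in Python, here is a minimal sketch of the same idea. It is not part of the library: the names `strip_pre_blocks` and `clean_directory` are made up for illustration. It also behaves slightly differently from the sed command, in that it removes only the `<pre>` element itself (including one with attributes) rather than every full line between the opening and closing tags. It works on raw bytes so that the file's original encoding is left alone.

```python
import re
from pathlib import Path

# Matches a whole <pre>...</pre> element, case-insensitively and across lines.
PRE_BLOCK = re.compile(rb"<pre\b[^>]*>.*?</pre>", re.IGNORECASE | re.DOTALL)


def strip_pre_blocks(html: bytes) -> bytes:
    """Return the HTML with every <pre> element removed."""
    return PRE_BLOCK.sub(b"", html)


def clean_directory(ebook_dir: str) -> None:
    """Rewrite every .html file under ebook_dir without its <pre> elements."""
    for path in Path(ebook_dir).rglob("*.html"):
        path.write_bytes(strip_pre_blocks(path.read_bytes()))
```

Calling `clean_directory("./ebooks")` after unzipping would cover the same files as the second `find` command. The same caveat applies: any book that uses `<pre>` for its actual content will lose that content too.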

Downloading E-Books via genre, author, etc. is not currently supported but is
something that I wanted to implement - so watch this space.

------
tokai
+1 for bitbucket

------
sroerick
This is really awesome.

