I built this because I think that Project Gutenberg is a great resource for NLP (e.g. stylometry, tracking writing styles over time, authorship detection, ...) - I wanted to use the data on Project Gutenberg a number of times in the past but always ended up using another corpus because there wasn't an easy way to access the Project Gutenberg data. Hopefully this library fixes that.
The project currently is "works on my machine" quality, so please do report any bugs you stumble across.
Also, if you can think of any use-cases for the Project Gutenberg data that aren't easily doable using the functionality that is currently available in the library, please let me know (e.g. by filing a ticket on the Bitbucket repo).
I think the previous version of the metadata included a path to the ftp server. Splitting the book id (4443 -> 4/4/4/4443) works for _most_ books, but there were somewhere between 800 and 3000 books organized in a different folder structure that I still need to track down.
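For reference, the id-splitting rule can be sketched in a few lines of Python. The function name `split_id_path` is made up for this illustration, and the handling of single-digit ids (placing them under `0/`) is an assumption that the example above does not cover. As noted, the rule will not match the books that live in a different folder structure.

```python
def split_id_path(book_id: int) -> str:
    """Map a book id to a directory path, e.g. 4443 -> '4/4/4/4443'."""
    if book_id < 0:
        raise ValueError("book id must be non-negative")
    digits = str(book_id)
    # Every digit except the last becomes one directory level.
    # Assumption: single-digit ids have no leading digits and go under "0/".
    parents = list(digits[:-1]) or ["0"]
    # The full id is the final path component.
    return "/".join(parents + [digits])


if __name__ == "__main__":
    for example_id in (4443, 12, 7):
        print(example_id, "->", split_id_path(example_id))
```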
The GitHub repos are intended to collect issues and receive pull requests. Project Gutenberg doesn't have a public bug tracker, nor do they use version control.
Wow, this is just what I have been looking for. A few questions:
1. Is it possible to download files in HTML format? I really prefer ebooks in HTML because I can attach my
preferred "readSettings.css" to them that way. *
2. Is it possible to run a custom script after the book has completed downloading?
3. I don't understand what you mean by "Making meta-data about the texts easily accessible through a database"
in the description. Can you expand a bit on this?
4. Is it possible to specify other download contexts like "genre"?
Oh, thanks a lot for this :) I always wanted a command line utility for Project Gutenberg.
* I also think that HTML files are a lot easier to read on the phone as you can style them, whereas with
the txt files you have no choice but to use horizontal scrolling unless you are in landscape mode.
This will download 5 MB of zipped E-Books that have an HTML version into the ./ebooks directory.
It seems as though the legal disclaimers and copyright notices in the HTML files are all within <pre> tags, so we can easily clean up the files with a small shell script:
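EBOOK_DIR="./ebooks"
# Unpack every downloaded archive; the glob is quoted so the shell hands it to find unexpanded.
find "${EBOOK_DIR}" -name '*.zip' -type f -exec unzip -d "${EBOOK_DIR}" {} \;
# Delete every line from an opening <pre> through its closing </pre> (in-place edit, GNU sed syntax).
find "${EBOOK_DIR}" -name '*.html' -type f -exec sed -i '/<[pP][rR][eE]>/,/<\/[pP][rR][eE]>/d' {} \;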
This will probably not work for all E-Books, but it'll give you something to work with. Note that removing the copyright notices may or may not be against the Project Gutenberg terms of service.
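If you would rather stay in Python, here is a rough equivalent of the shell script as a minimal sketch. The names `strip_pre_blocks` and `clean_ebooks` are made up for this example and are not part of the library. It works on raw bytes so that it does not have to guess each file's encoding, and unlike the sed version it removes only the <pre> block itself rather than whole lines. The same caveats apply: it removes every <pre> block, including any that belong to the book's own text, and the terms-of-service question is unchanged.

```python
import re
import zipfile
from pathlib import Path

# Matches a whole <pre ...>...</pre> block, case-insensitively and across lines.
PRE_BLOCK = re.compile(rb"<pre\b[^>]*>.*?</pre>", re.IGNORECASE | re.DOTALL)


def strip_pre_blocks(html: bytes) -> bytes:
    """Return the HTML with every <pre> block removed."""
    return PRE_BLOCK.sub(b"", html)


def clean_ebooks(ebook_dir: str = "./ebooks") -> None:
    """Unzip all archives under ebook_dir, then strip <pre> blocks in place."""
    root = Path(ebook_dir)
    # Collect the archives first so extraction does not disturb the directory walk.
    for archive in list(root.rglob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(root)
    for page in list(root.rglob("*.html")):
        page.write_bytes(strip_pre_blocks(page.read_bytes()))


if __name__ == "__main__":
    clean_ebooks("./ebooks")
```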
Downloading E-Books via genre, author, etc. is not currently supported but is something that I wanted to implement - so watch this space.
This is a pretty small Python module at the moment. To fetch metadata, it downloads a 230 MB zip from PG and parses it for a few categories. Project Gutenberg has some metadata about Subject, but the info is inconsistent; sometimes there is also a Library of Congress code in the metadata.
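To make the "inconsistent subject info" point concrete, here is a hedged sketch of tolerant parsing. It assumes the catalogue records are RDF/XML using Dublin Core terms, which the comment above does not state, and that Library of Congress codes are marked with a scheme URI ending in `/LCC`. `SAMPLE_RECORD` is hand-written for illustration and is not copied from the real catalogue, and `parse_record` is a made-up name, not the module's API.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NS = {
    "rdf": RDF,
    "dcterms": "http://purl.org/dc/terms/",
    "dcam": "http://purl.org/dc/dcam/",
}

# Hand-written sample record (an assumption about the record layout).
SAMPLE_RECORD = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:dcam="http://purl.org/dc/dcam/">
  <rdf:Description rdf:about="ebooks/4443">
    <dcterms:title>An Example Title</dcterms:title>
    <dcterms:subject>
      <rdf:Description>
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCC"/>
        <rdf:value>PR</rdf:value>
      </rdf:Description>
    </dcterms:subject>
    <dcterms:subject>
      <rdf:Description>
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
        <rdf:value>Example subject heading</rdf:value>
      </rdf:Description>
    </dcterms:subject>
  </rdf:Description>
</rdf:RDF>"""


def parse_record(xml_text: str) -> dict:
    """Extract title, free-form subjects and LoC codes; missing fields stay empty."""
    root = ET.fromstring(xml_text)
    record = {
        "title": root.findtext(".//dcterms:title", namespaces=NS),
        "subjects": [],
        "lcc": [],
    }
    for subject in root.iterfind(".//dcterms:subject", NS):
        value = subject.findtext(".//rdf:value", default="", namespaces=NS).strip()
        if not value:
            continue  # skip empty or malformed subject entries
        scheme = subject.find(".//dcam:memberOf", NS)
        scheme_uri = scheme.get(f"{{{RDF}}}resource", "") if scheme is not None else ""
        # Entries tagged with the LCC scheme are codes; everything else is a subject.
        key = "lcc" if scheme_uri.endswith("/LCC") else "subjects"
        record[key].append(value)
    return record


if __name__ == "__main__":
    print(parse_record(SAMPLE_RECORD))
```

Records with no subject entries simply come back with empty lists, so callers don't need special cases for books with sparse metadata.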
Not all books have an html version. Most PG books are plaintext, and _some_ have a separate html variant. A handful are written in a markup language that can become html or plaintext.