
What it looks like to process 3.5M books in Google’s cloud - doppp
http://googlecloudplatform.blogspot.com/2016/02/what-it-looks-like-to-process-3.5-million-books-in-Googles-cloud.html
======
placeybordeaux
Would have been nice to include a cost estimate.

~~~
mcguire
Did I miss any discussion of what the "processing" is?

Using the Stanford Part-Of-Speech tagger, my goofy project, Ashurbanipal, can
tag every word in a single book in about 8 seconds on one core, or all
~25,000 books from the Project Gutenberg 2010 DVD image in about 8 hours on
my 4-core (hyperthreaded) laptop with a 10 GB JVM heap.
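For the curious, the core of that kind of per-book tagging is just a few
lines around the tagger's MaxentTagger class. A minimal sketch, assuming you
have the stock english-left3words model file from the tagger distribution on
disk (the model path and input-file argument here are placeholders, not how
Ashurbanipal itself is wired up):

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class TagOneBook {
        public static void main(String[] args) throws Exception {
            // Stock English model shipped with the Stanford tagger
            // distribution; adjust to wherever your copy lives.
            MaxentTagger tagger = new MaxentTagger(
                    "models/english-left3words-distsim.tagger");

            // args[0]: path to a plain-text book, e.g. a Gutenberg file.
            String text = new String(Files.readAllBytes(Paths.get(args[0])));

            // tagString appends _POS to each token, producing output like
            // "Call_VB me_PRP Ishmael_NNP ._."
            System.out.println(tagger.tagString(text));
        }
    }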

~~~
dholowiski
Nope, there was almost no mention of what this was actually used for. The
closest I found was a mention of the final output:

"single output files, tab-delimited with data available for each year, merging
in publication metadata and other information about each book"

[edit] More info in a link at the bottom of the article:
http://blog.gdeltproject.org/3-5-million-books-1800-2015-gdelt-processes-internet-archive-and-hathitrust-book-archives-and-available-in-google-bigquery/
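Since that post says the output lives in BigQuery rather than as a direct
download, pulling a few rows with the google-cloud-bigquery Java client
would look roughly like the sketch below. The table name
`gdelt-bq.internetarchivebooks.1905` is my guess from the post's title
(per-year tables), not something I've confirmed; the real dataset and column
names are in the post:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FieldValueList;
    import com.google.cloud.bigquery.QueryJobConfiguration;
    import com.google.cloud.bigquery.TableResult;

    public class PeekAtGdeltBooks {
        public static void main(String[] args) throws InterruptedException {
            // Uses application-default credentials; run
            // `gcloud auth application-default login` first.
            BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

            // NOTE: the table name below is an assumption based on the blog
            // post's title; check the post for the actual dataset names.
            QueryJobConfiguration config = QueryJobConfiguration
                    .newBuilder("SELECT * FROM `gdelt-bq.internetarchivebooks.1905` LIMIT 10")
                    .setUseLegacySql(false)
                    .build();

            // Runs the query synchronously and prints each returned row.
            TableResult result = bigquery.query(config);
            for (FieldValueList row : result.iterateAll()) {
                System.out.println(row);
            }
        }
    }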

------
faizshah
Anybody find a download for the dataset? Would prefer to take a look at it
locally.

