

The Case for a Git-Powered Project Gutenberg - ComputerGuru
http://neosmart.net/blog/2012/the-case-for-a-git-powered-project-gutenberg/

======
felix_faber
Project Gutenberg is a complete mess.

It took them years of discussion and they haven't yet chosen a master file
format for their books.[1] As a consequence, they are hosting thousands of
books in many different file-formats provided by distributed proofreaders[2]
and some other sources. But since no master-format exists and conversion is ..
well .. buggy, they have no leverage on their quality problems. The textual
content may be fine, but the reading experience for the casual user shows how
big the iceberg is under the surface.

Having seen endless flamewars[3] and several unsuccessful approaches to their
formatting and complexity problems, I propose a clear reset.

What about a nice fork with only a few important books that get proper
treatment in regard to

\- master file format

\- conversion (mainly html, pdf and ereader formats)

\- design and page layout (e.g. for pdf versions)

From then on, one could build a growing git repository of _nice_ books and
build an infrastructure around that. Tackling their current mess directly (and
taking on the burden of their internal politics and historical toolchain) will
be completely in vain, as the past has shown.

I don't have much time to spare. But for a really good cause (and the future
of bookreading is a really good cause), I am willing to invest.

Who is in?

[1] This one is the latest candidate:
<http://www.gutenberg.org/wiki/Gutenberg:RST> Before that, there was one guy
working on a simplified TEI format. But that never took off.

[2] <http://www.pgdp.net/c/>

[3] The home of many-a-flamewar:
[http://blog.gmane.org/gmane.culture.literature.e-books.guten...](http://blog.gmane.org/gmane.culture.literature.e-books.gutenberg.volunteers)

~~~
SoftwareMaven
I would be interested. PG is amazing wrt the content, but the experience of
actually reading the books is awful. My email is in my profile.

~~~
danking00
Agreed. Let's do something about our shared cultural history. Email is also in
my profile.

~~~
felix_faber
Wow.

@ComputerGuru, SoftwareMaven, danking00: I will contact you via mail in a bit.

Do you have experience in managing the level of discussion needed to get this
right? What would be best? A public mailing list?

<https://github.com/felix-faber/project-alexandria>

Feedback very welcome.

~~~
danking00
I signed up to troutwine's mailing list and started following the github
project.

I've participated in mailing lists for big projects before. They seem to work
fairly well.

------
jerf
There's no need to wait for central blessing. Just do it. It wouldn't take
much scripting work to treat the existing PG as just another branch that you
can periodically merge on to your trunk. And if they decide they like it, they
can switch to it easily later.
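
A minimal sketch of that workflow in Python, assuming a local Git repository with a hypothetical `upstream-pg` branch that holds unmodified PG files and a `trunk` branch that holds the edited text. The branch names, the `sync_upstream` helper, and the `copy_snapshot` callback are all illustrative assumptions; nothing here is provided by PG.

```python
import subprocess
from typing import Callable


def sync_upstream(repo_dir: str,
                  copy_snapshot: Callable[[str], None],
                  label: str,
                  upstream: str = "upstream-pg",   # hypothetical branch name
                  trunk: str = "trunk",            # hypothetical branch name
                  run: Callable[..., object] = subprocess.run) -> None:
    """Record a fresh PG snapshot on `upstream`, then merge it into `trunk`.

    `copy_snapshot` is a caller-supplied function that copies the latest PG
    files into the working tree; how they are obtained is outside this sketch.
    """
    def git(*args: str) -> None:
        # check=True makes a failing git command raise instead of being ignored.
        run(["git", *args], cwd=repo_dir, check=True)

    git("checkout", upstream)
    copy_snapshot(repo_dir)
    git("add", "--all")
    # --allow-empty keeps the commit from failing when nothing changed upstream.
    git("commit", "--allow-empty", "-m", f"Import PG snapshot {label}")
    git("checkout", trunk)
    git("merge", upstream)
```

Merge conflicts would still have to be resolved by hand; the point is only that the periodic import is a handful of ordinary Git commands.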

~~~
filiwickers
My thoughts exactly. The whole library is available as a torrent. Why ask if
it's okay when asking is not even necessary?

How to be more awesome: do cool stuff for people without needing them to hold
your hand.

------
dmlorenzetti
_We’ll be contacting Project Gutenberg with a link to this article..._

The article makes good points, but makes them in a way that's likely
persuasive only to people who already know about version control. Suppose you
represent Project Gutenberg and you've never heard of distributed version
control. Here's what you get told:

 _It’s downright foolish not to take advantage of the wonders a good VS can
work with this sort of content: versions are revisions, editions are branches,
commit logs preserve integrity and posterity, and an index of all changes is
forever kept. Nothing is ever lost or overwritten, and the changes over time
can be analyzed, indexed, and reviewed... [N]othing could be easier than
forking the original, making your changes, then opening a ticket to propose
that PG merge your changes back into their "official" distribution!_

Some of this will sound interesting, but most will sound like gobbledygook. I
also doubt whether the paragraph that follows-- which describes how a DVCS
will make it easier for people who intentionally corrupt works found on PG to
benefit from the future work of PG members-- will excite the sorts of people
who rise to the level of decision-maker at Project Gutenberg.

In short, this is good advocacy for getting people who know about DVCS to help
out PG, but it probably isn't good advocacy to get PG interested in DVCS.

~~~
ComputerGuru
The email I sent included several paragraphs describing the benefits to PG
and the problems that DVCS would solve.

------
haberman
CPDL (the Choral Public Domain Library), which is like Project Gutenberg but
for choral music, has exactly the same problem. I can't tell you how many
scores I've used from CPDL that have obvious errors in them that we end up
fixing in rehearsal. I have no way of contributing these fixes back except to
send a message to the score's author, which is not worth the hassle, so I
don't do it.

------
jberryman
I think this is a great idea in general. As an aside, when I first discovered
Project Gutenberg I thought it was really exciting, but as I've grown up my
perspective has changed: it feels like a desperate attempt to scrape together
something like what used to be recognizable as culture from cast-off bits of
art that capitalism can no longer commodify.

Same with FLOSS and Creative Commons. They are at the same time wonderful, and
yet seem to be the products of a reality in which every relevant part of our
so-called cultural identity is owned by a corporation.

I have an idea that the "creative commons" used to exist and that sometime in
the 20th century it became hopelessly barren.

~~~
SoftwareMaven
I think your view may be just a little more cynical than reality. I'm not
saying reality isn't bad (it is), but, in the little bit of "free" blue ocean
that exists, incredibly vibrant communities are thriving. Whether that is
people sharing truly open photos on Flickr, entire operating systems being
built and given away or authors and musicians providing their content to the
community, there is still life in non-corporate-controlled culture.

(And, FWIW, I'm truly cynical about our corporate overlords.)

------
breckinloggins
Something else that badly needs DVCS: government documents including
legislation.

But I'm preaching to the choir.

~~~
ehsanu1
Why stop at DVCS?: <http://www-formal.stanford.edu/jmc/future/objectivity.html>

------
RyanMcGreal
Given the requirement that non-technical users should be able to use it, a
wiki seems like a better model: GUI editing and version control.

~~~
ComputerGuru
How well do concepts like forking and branching work with wikis, though? You
can create categories and place the pages in them (to put the different
editions for each title), but you can't really "link" revisions and branches
to one another or indicate where they were derived from. And forking is
absolutely out of the question.

~~~
lazerwalker
It seems that something like GitHub's wikis would be a great compromise: a
simple web-based editing interface, but backed by a Git repo that you can also
access as a vanilla Git repo with plaintext files.

------
zdw
Version control is superior to other "sync" solutions when:

      - The data is modified mainly by humans, at human speeds
      - Mistakes can be made and need to be reverted atomically
      - Data integrity and chains of changes need to be preserved

I agree with the article that PG is a great example of where version control
could work with great effect.

The downside is that teaching non-techies how to make it work is difficult,
and the first time a merge is required they're frequently in too deep. This is
more of a human problem - Word's revision tracking is somewhat inscrutable as
well...

~~~
ComputerGuru
I'm the author of TFA and I must completely agree with you on this point. Is
there a human-friendly VCS out there? I don't care how featured or not it is,
as Git can be configured to pull from it (a la git-svn). Just anything that
non-tech-related persons (or even just non-coders!) can use without wanting to
pull their hair out and feeling very stupid?

For example, one of my software products has translations powered by the
community. The entire translation files (XMLs) are hosted on GitHub, yet
invariably all the volunteers would much prefer to email me their translations
and I commit them on my end with silly commit messages like "Commit by John
Doe jdoe@example.com with fixes for bla bla bla"

\---

I suppose a web interface to git _could_ be written, but it would be very
limited and would have to be strict about what can/cannot be uploaded. But
multiple drop downs with a simple "pick the file you have an update for,"
"pick the version your changes are based on," "browse for the new version of
the file," and "what changes did you make?" inputs would probably be enough
for the vast majority of non-techies.
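
As a rough sketch of what such a form would hand to the server, here are those four inputs modelled in Python. Everything named here (`Submission`, `validate`, `commit_message`, and the extra `contributor` field) is hypothetical and only illustrates the "strict about what can be uploaded" idea; the actual Git commit step is left out.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Submission:
    file_path: str      # "pick the file you have an update for"
    base_version: str   # "pick the version your changes are based on"
    new_content: bytes  # "browse for the new version of the file"
    summary: str        # "what changes did you make?"
    contributor: str    # assumption: name/email collected alongside the form


def validate(sub: Submission, known_files: Set[str],
             known_versions: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the upload is acceptable."""
    errors = []
    if sub.file_path not in known_files:
        errors.append("unknown file")
    if sub.base_version not in known_versions:
        errors.append("unknown base version")
    if not sub.new_content:
        errors.append("empty upload")
    if not sub.summary.strip():
        errors.append("missing change summary")
    return errors


def commit_message(sub: Submission) -> str:
    """Build the commit message the server would use on the volunteer's behalf."""
    return (f"{sub.summary.strip()}\n\n"
            f"Submitted by {sub.contributor}, based on {sub.base_version}")
```

Restricting the file and version to values from drop-downs is what keeps the server side simple: it only ever has to replace one known file on top of one known revision.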

~~~
donatzsky
Mercurial (Hg) with TortoiseHg might be a good option. You could even create a
pre-configured PG version, to make it even easier for novices. Along with a
guide on the basics, I think that could end up working quite well. Hg also has
the benefit, compared to Git, of being strictly cross-platform.

~~~
icebraining
I think Mercurial (and Git) is ridiculously complex for a non-technical user
(say, my mother, who has helped with translation projects online, using Word +
email). TortoiseHg is great as a GUI version, but it doesn't effectively
reduce the complexity - you still need to understand the basic concepts of a
DVCS to use it.

On a graph, they'd look like:

    
    
          |--|------------------------------|-------------------------------|
        git  hg                           wiki                      actually usable
                                                                    by non-techies

~~~
freshhawk
Completely agree, but with this limited use case it should be easy to provide
a tool to abstract away the complexity and give them easy access to the very
few features they need.

------
LoonyPandora
Shove it on GitHub; its in-browser editing functionality means it is
accessible to non-techies. Problem solved.

All it needs is someone to pick a master format for the text and run with it.
That should be someone with enough domain knowledge to choose wisely. I would
say something like markdown with embedded metadata (like how you format a blog
post with jekyll), but I admit I don't know anything about the requirements of
the project.
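
To make the "markdown with embedded metadata" idea concrete, here is a small Python sketch that splits a Jekyll-style front-matter header from the book text. It is an assumption-laden illustration: real Jekyll front matter is YAML, while this hypothetical `split_front_matter` helper only handles flat `key: value` lines, and the field names in the example are made up.

```python
from typing import Dict, Tuple


def split_front_matter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a '---' delimited header from the body.

    Returns (metadata, body). If the text has no header, or the header is
    never closed, the metadata is empty and the text is returned unchanged.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text

    meta: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the book text.
            return meta, "\n".join(lines[i + 1:])
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return {}, text


sample = """---
title: Moby Dick
author: Herman Melville
---
# Chapter 1

Call me Ishmael."""

meta, body = split_front_matter(sample)
```

With the metadata kept in the same plaintext file as the text, a correction to a title or author shows up in the same diff and history as a correction to the prose.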

------
pronoiac
I'm not sure git's the best choice for handling multiple gigabytes of data.
That estimate's based on their DVD releases. Oddly, I'm not positive that
they're compressed.

I've seen a proposal for using Usenet (or something like it) for Wikipedia;
perhaps that would be appropriate here? Surfacing changes would be easier. And
if you didn't want to follow each and every format available, you could.
(Having every format in the same silo seems painfully redundant to me.)

~~~
aiscott
I wouldn't think that the whole of PG would be a single repository. What came
to mind when I read the article was that each book would be in its own
repository, and that seems to me to be well within Git's capabilities.

~~~
pronoiac
Oh, that would probably work well on the individual book level, with the right
plumbing. Dealing with the quantity is another problem, but that might just
require more plumbing.

