The Case for a Git-Powered Project Gutenberg (neosmart.net)
117 points by ComputerGuru on Feb 27, 2012 | 42 comments

Project Gutenberg is a complete mess.

It took them years of discussion and they haven't yet chosen a master file format for their books.[1] As a consequence, they are hosting thousands of books in many different file-formats provided by distributed proofreaders[2] and some other sources. But since no master-format exists and conversion is .. well .. buggy, they have no leverage on their quality problems. The textual content may be fine, but the reading experience for the casual user shows how big the iceberg is under the surface.

Having seen endless flamewars[3] and several unsuccessful approaches to their formatting and complexity problems, I propose a clear reset.

What about a nice fork with only a few important books that get proper treatment in regard to

- master file format

- conversion (mainly html, pdf and ereader formats)

- design and page layout (e.g. for pdf versions)

From then on, one could build a growing git repository of _nice_ books and build an infrastructure around that. Tackling their current mess directly (and taking on the burden of their internal politics and historical toolchain) will be completely in vain, as the past has shown.

I don't have much time to spare. But for a really good cause (and the future of bookreading is a really good cause), I am willing to invest.

Who is in?

[1] This one is the latest candidate: http://www.gutenberg.org/wiki/Gutenberg:RST Before that, there was one guy working on a simplified TEI format. But that never took off.

[2] http://www.pgdp.net/c/

[3] The home of many-a-flamewar: http://blog.gmane.org/gmane.culture.literature.e-books.guten...

Any reboot of PG (if any at all) should from the ground up have a single master format for any given title, these master formats under DVCS, and an integrated automated build system for conversion to auxiliary formats. These are the very basic components of any large scale library being designed in this day and age, and ensure integrity, posterity, and availability.
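As a rough illustration of the "single master format plus automated build" idea, here is a minimal Python sketch. It assumes the master is plain UTF-8 text with paragraphs separated by blank lines; the function names and the converter registry are hypothetical, not anything PG actually uses.

```python
import html
from typing import Callable, Dict, List


def split_paragraphs(master: str) -> List[str]:
    """Split the master text into paragraphs on blank lines."""
    blocks = master.replace("\r\n", "\n").split("\n\n")
    # Collapse internal line breaks and runs of whitespace in each paragraph.
    return [" ".join(b.split()) for b in blocks if b.strip()]


def to_html(master: str) -> str:
    """Derive a bare-bones HTML body from the master text."""
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in split_paragraphs(master))


def to_plain(master: str) -> str:
    """Derive normalized plain text: one paragraph per line."""
    return "\n".join(split_paragraphs(master))


# Hypothetical registry: output extension -> converter function.
CONVERTERS: Dict[str, Callable[[str], str]] = {
    "html": to_html,
    "txt": to_plain,
}


def build_all(master: str) -> Dict[str, str]:
    """Run every registered converter; data only flows master -> derived."""
    return {ext: convert(master) for ext, convert in CONVERTERS.items()}
```

Only the master would live under version control; everything `build_all` returns could be regenerated at any time, so a fix to the master propagates to every derived format.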

I would be interested. PG is amazing wrt the content, but the experience of actually reading the books is awful. My email is in my profile.

Agreed. Let's do something about our shared cultural history. Email is also in my profile.


@ComputerGuru, SoftwareMaven, dankin I will contact you via mail in a bit.

Do you have experience in managing the level of discussion needed to get this right? What would be best? A public mailing list?


Feedback very welcome.

I signed up to troutwine's mailing list and started following the github project.

I've participated in mailing lists for big projects before. They seem to work fairly well.

I'm interested in tossing my hat into the ring as well. I couldn't find a way to contact you in your profile or on github so here's me saying "me too!"

signed up for troutwine's mailing list too

A decent master file system is good. Including conversion is sub-optimal. Have a sister project that creates and maintains good software for conversion.

That would draw 2 crowds; people wanting to help PG and people unwilling to continue using Calibre.

IMO, calibre is two-way. From any file format to any other.

That is _hard_ to do really well. Text is the easy part. What about tables? What about image captions and footnotes?

What such a project would need is one master file format from which other formats are derived. But not the other way around.

Having disjoint repositories for books/software is a good idea though.

I'd like to throw into this. I started a mailing-list: http://groups.google.com/group/prj-alexandria What say we move the conversation there?

There's no need to wait for central blessing. Just do it. It wouldn't take much scripting work to treat the existing PG as just another branch that you can periodically merge on to your trunk. And if they decide they like it, they can switch to it easily later.

My thoughts exactly. The whole library is available as a torrent. Why ask if it's okay when asking is not even necessary?

How to be more awesome: do cool stuff for people without needing them to hold your hand.

We’ll be contacting Project Gutenberg with a link to this article...

The article makes good points, but makes them in a way that's likely persuasive only to people who already know about version control. Suppose you represent Project Gutenberg and you've never heard of distributed version control. Here's what you get told:

It’s downright foolish not to take advantage of the wonders a good VS can work with this sort of content: versions are revisions, editions are branches, commit logs preserve integrity and posterity, and an index of all changes is forever kept. Nothing is ever lost or overwritten, and the changes over time can be analyzed, indexed, and reviewed... [N]othing could be easier than forking the original, making your changes, then opening a ticket to propose that PG merge your changes back into their "official" distribution!

Some of this will sound interesting, but most will sound like gobbledygook. I also doubt whether the paragraph that follows-- which describes how a DVCS will make it easier for people who intentionally corrupt works found on PG to benefit from the future work of PG members-- will excite the sorts of people who rise to the level of decision-maker at Project Gutenberg.

In short, this is good advocacy for getting people who know about DVCS to help out PG, but it probably isn't good advocacy to get PG interested in DVCS.

The email I sent included several paragraphs describing the benefits to PG and the problems that DVCS would solve.

CPDL (the Choral Public Domain Library), which is like Project Gutenberg except for Choral Music, has exactly the same problem. I can't tell you how many scores I've used from CPDL that have obvious errors in them that we end up fixing in rehearsal. I have no way of contributing back these fixes except to send a message to the score's author which is not worth the hassle so I don't do it.

I think this is a great idea in general. As an aside, when I first discovered Project Gutenberg I thought it was really exciting, but as I've grown up my perspective has changed: it feels like a desperate attempt to scrape together something like what used to be recognizable as culture from cast-off bits of art that capitalism can no longer commodify.

Same with FLOSS and Creative Commons. They are at the same time wonderful, and yet seem to be the products of a reality in which every relevant part of our so-called cultural identity is owned by a corporation.

I have an idea that the "creative commons" used to exist and that sometime in the 20th century it became hopelessly barren.

I think your view may be just a little more cynical than reality. I'm not saying reality isn't bad (it is), but, in the little bit of "free" blue ocean that exists, incredibly vibrant communities are thriving. Whether that is people sharing truly open photos on Flickr or entire operating systems being built and given away or authors and musicians providing their content to the community, there is still life in non-corporate-controlled culture.

(And, FWIW, I'm truly cynical about our corporate overlords.)

Something else that badly needs DVCS: government documents including legislation.

But I'm preaching to the choir.

Given the requirement that non-technical users should be able to use it, a wiki seems like a better model: GUI editing and version control.

How well do concepts like forking and branching work with wikis, though? You can create categories and place the pages in them (to put the different editions for each title), but you can't really "link" revisions and branches to one another or indicate where they were derived from. And forking is absolutely out of the question.

It seems that something like GitHub's wikis would be a great compromise: a simple web-based editing interface, but backed by a Git repo that you can also access as a vanilla Git repo with plaintext files.

That is a limitation, but I'm not sure the marginal benefit of forking and merging functionality justifies the significant added complexity. Each edition gets its own page, and the revision history of each edition reflects versioning.

On the rare occasion that changes in one edition need to cross over to another, a non-technical user should be able to copy/paste the changes across the two edit pages.

there are more than a few wikis that use version control for their backends. i'm aware of at least 2 that use git; branching and forking in that case would simply be UI issues, and could be handled the same way that github does.

Version control is superior to other "sync" solutions when:

  - Data is modified mainly by humans, at human speeds
  - Mistakes can be made and need to be reverted atomically
  - Data integrity and chains of changes need to be preserved

I agree with the article that PG is a great example of where version control could work to great effect.

The downside is that teaching non-techies how to make it work is difficult, and the first time a merge is required they're frequently in too deep. This is more of a human problem - Word's revision tracking is somewhat inscrutable as well...

I'm the author of TFA and I must completely agree with you on this point. Is there a human-friendly VCS out there? I don't care how featured or not it is, as Git can be configured to pull from it (a la git-svn). Just anything that non-tech-related persons (or even just non-coders!) can use without wanting to pull their hair out and feeling very stupid?

For example, one of my software products has translations powered by the community. The entire translation files (XMLs) are hosted on GitHub, yet invariably all the volunteers would much prefer to email me their translations and I commit them on my end with silly commit messages like "Commit by John Doe jdoe@example.com with fixes for bla bla bla"


I suppose a web interface to git could be written, but it would be very limited and would have to be strict about what can/cannot be uploaded. But multiple drop downs with a simple "pick the file you have an update for," "pick the version your changes are based on," "browse for the new version of the file," and "what changes did you make?" inputs would probably be enough for the vast majority of non-techies.

Depends on what features you need. I've seen wikis used as "human-friendly" VCSes. There is no branching, forking and automated merging in most implementations... But that might be good things, based on who your target group is.

Mercurial (Hg) with TortoiseHg might be a good option. You could even create a pre-configured PG version, to make it even easier for novices. Along with a guide on the basics, I think that could end up working quite well. Hg also has the benefit, compared to Git, of being strictly cross-platform.

I think Mercurial (and Git) is ridiculously complex for a non-technical user (say, my mother, who has helped with translation projects online, using Word + email). TortoiseHg is great as a GUI version, but it doesn't effectively reduce the complexity - you still need to understand the basic concepts of a DVCS to use it.

On a graph, they'd look like:

    git  hg                           wiki                      actually usable
                                                                by non-techies

Completely agree, but with this limited use case it should be easy to provide a tool to abstract away the complexity and give them easy access to the very few features they need.

Something like Github's "forking with the edit button"[1] functionality would be perfect for this. If you want to make a correction, press a button to see the content you were looking at in a web-based editor. Make your changes, and add your email address to a notification field. Changes go into an approval queue, and you get an email if/when your change goes live. No need to expose the casual user to the perils of DVCS, though of course a "proper" git-based infrastructure would exist in parallel for people who want more control.
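The edit-button workflow described above (edit, leave an email address, wait in an approval queue, get notified) could be modelled roughly as follows. This is only a sketch in Python: the class and field names are hypothetical, texts are held in memory rather than in a git repository, and "notification" is just a list of addresses standing in for real email.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Correction:
    book_id: str
    original: str      # text the reader saw
    replacement: str   # text the reader proposes
    email: str         # where to send the "your change is live" notice


@dataclass
class ApprovalQueue:
    texts: Dict[str, str]                      # book_id -> current text
    pending: List[Correction] = field(default_factory=list)
    notified: List[str] = field(default_factory=list)

    def submit(self, correction: Correction) -> None:
        """Queue a reader's correction; nothing goes live yet."""
        self.pending.append(correction)

    def approve(self, index: int) -> bool:
        """Apply one pending correction and record who to notify.

        Returns False (and drops the correction) if the original text
        is no longer present, e.g. because someone else already fixed it.
        """
        c = self.pending.pop(index)
        text = self.texts[c.book_id]
        if c.original not in text:
            return False
        self.texts[c.book_id] = text.replace(c.original, c.replacement, 1)
        self.notified.append(c.email)  # stand-in for sending an email
        return True
```

In a real system `approve` would presumably create a commit on the reviewer's behalf, so casual readers never touch the DVCS directly.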

[1] https://github.com/blog/844-forking-with-the-edit-button

Indeed, I've been working on this type of tool (without the distributed component) for blog authors who want user-submitted corrections: http://edithuddle.com

When you want a central, de-facto version, it makes sense to have a way of aggregating "similar" corrections (both in type and content) from changes.
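One very simple way to aggregate corrections is sketched below in Python. It is an assumption-laden simplification: corrections count as "similar" only if they match exactly after whitespace normalization, and the tuple layout and function names are made up for illustration, not taken from any existing tool.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# A correction is (book_id, original_text, replacement_text).
Correction = Tuple[str, str, str]


def normalize(c: Correction) -> Correction:
    """Treat corrections as 'the same' if they differ only in whitespace."""
    book_id, original, replacement = c
    return (book_id, " ".join(original.split()), " ".join(replacement.split()))


def aggregate(corrections: Iterable[Correction]) -> List[Tuple[Correction, int]]:
    """Group equivalent corrections, most frequently submitted first."""
    counts = Counter(normalize(c) for c in corrections)
    return counts.most_common()
```

A reviewer could then work through the list from the top, handling the corrections that many readers independently reported before the one-off submissions.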

Why not just go up to github and ask them to make a web interface for non nerds? It's half way there already with editing in webpages and what not.

Oh it's nowhere near halfway there. I think the idea of using git is a fine one, but some software engineering is very much required. Ideally the project would work for all phases including pushing updated copies to consumers. Once you think in those terms you start approaching a product that's going to work for non technical editors and writers.

A customized book reading application that allows critical readers to make marginal notes and corrections, and see and rate other people's marginal notes and corrections would be the ticket.

I would recommend Mercurial for such a project. It's much more graspable by non-nerds than git IMO.

The closest thing to a human friendly VCS for non-programmers would be document management software like Autonomy's WorkSite or EMC's Documentum.

Shove it on GitHub; its in-browser editing functionality means it is accessible to non-techies. Problem solved.

All it needs is someone to pick a master format for the text and run with it. That should be someone with enough domain knowledge to choose wisely. I would say something like markdown with embedded metadata (like how you format a blog post with jekyll), but I admit I don't know anything about the requirements of the project.
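To make the "markdown with embedded metadata" suggestion concrete, here is a small Python sketch that splits a Jekyll-style header from the body. It assumes a simplified format (a `---` line, `key: value` lines, a closing `---`) and is deliberately not a full YAML parser; the function name and the example keys are hypothetical.

```python
from typing import Dict, Tuple


def parse_front_matter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a document into (metadata, body).

    Assumes a simplified Jekyll-like header: a first line of '---',
    then 'key: value' lines, then a closing '---'. Values are kept
    as plain strings; nested YAML is not supported.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[i + 1:])
            return meta, body.lstrip("\n")
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    # No closing delimiter found: treat the whole text as body.
    return {}, text
```

A master file might then start with a header such as `title: Moby-Dick` and `author: Herman Melville`, followed by the markdown text of the book.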

I'm not sure git's the best choice for handling multiple gigabytes of data. That estimate's based on their DVD releases. Oddly, I'm not positive that they're compressed.

I've seen a proposal for using Usenet (or something like it) for Wikipedia; perhaps that would be appropriate here? Surfacing changes would be easier. And if you didn't want to follow each and every format available, you wouldn't have to. (Having every format in the same silo seems painfully redundant to me.)

You wouldn't place the "binary" output in the repository, just whatever source format you're using to compile the rest. Say (for the sake of the example) LaTeX or, heck, plain UTF-8 text. Only that is stored in the git repo; the rest is generated on the fly with a continuous build system.

I'm not sure how big that would be, but it's not going to be a lot. Currently, they have 38,000 titles so that's 38,000 files of, I dunno, 50,000 words on average? Probably less, as a lot of the PG works are rather short (novellas and short stories), though some are also equally monstrously long.
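A back-of-envelope version of that estimate in Python, using the figures above (38,000 titles, 50,000 words each) plus one assumption that is not from the thread: roughly 6 bytes per word of plain English text, counting the trailing space.

```python
TITLES = 38_000           # figure quoted above
WORDS_PER_TITLE = 50_000  # the guess above, probably on the high side
BYTES_PER_WORD = 6        # assumption: ~5 letters plus a space, one byte each

total_bytes = TITLES * WORDS_PER_TITLE * BYTES_PER_WORD
total_gb = total_bytes / 1e9  # decimal gigabytes, uncompressed
print(f"{total_gb:.1f} GB uncompressed")
```

Under these assumptions the upper bound comes to about 11 GB before compression. Since many works are far shorter than 50,000 words and Git compresses stored objects, the real figure would likely be lower.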

I wouldn't think that the whole of PG would be a single repository. What came to mind when I read the article was that each book would be in its own repository, and that seems to me to be well within Git's capabilities.

Oh, that would probably work well on the individual book level, with the right plumbing. Dealing with the quantity is another problem, but that might just require more plumbing.
