It took them years of discussion and they haven't yet chosen a master file format for their books.[1]
As a consequence, they are hosting thousands of books in many different file-formats provided by distributed proofreaders[2] and some other sources.
But since no master-format exists and conversion is .. well .. buggy, they have no leverage on their quality problems. The textual content may be fine, but the reading experience for the casual user shows how big the iceberg is under the surface.
Having seen endless flamewars[3] and several unsuccessful approaches to their formatting and complexity problems, I propose a clear reset.
What about a nice fork with only a few important books that get proper treatment in regard to
- master file format
- conversion (mainly html, pdf and ereader formats)
- design and page layout (e.g. for pdf versions)
From then on, one could build a growing git repository of _nice_ books and build an infrastructure around that.
Tackling their current mess directly (and taking on the burden of their internal politics and historical toolchain) will be completely in vain, as the past has shown.
I don't have much time to spare. But for a really good cause (and the future of bookreading is a really good cause), I am willing to invest.
Who is in?
[1] This one is the latest candidate: http://www.gutenberg.org/wiki/Gutenberg:RST
Before that, there was one guy working on a simplified TEI format. But that never took off.
Any reboot of PG (if any at all) should from the ground up have a single master format for any given title, these master formats under DVCS, and an integrated automated build system for conversion to auxiliary formats. These are the very basic components of any large scale library being designed in this day and age, and ensure integrity, posterity, and availability.
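To make the "integrated automated build system" part concrete, here is a minimal sketch in Python of how one master file could map to its auxiliary outputs and how a build step could decide what needs regenerating. The suffix list and the function names `targets_for` and `stale_targets` are illustrative assumptions, not anything PG uses, and the actual converters are deliberately left out.

```python
from pathlib import Path

# Hypothetical target formats; the thread only mentions html, pdf and ereader formats.
TARGET_SUFFIXES = [".html", ".pdf", ".epub"]


def targets_for(master: Path, out_dir: Path) -> list[Path]:
    """Return one output path per auxiliary format for a single master file."""
    return [out_dir / (master.stem + suffix) for suffix in TARGET_SUFFIXES]


def stale_targets(master: Path, out_dir: Path) -> list[Path]:
    """Return the targets that are missing or older than the master file."""
    master_mtime = master.stat().st_mtime
    stale = []
    for target in targets_for(master, out_dir):
        if not target.exists() or target.stat().st_mtime < master_mtime:
            stale.append(target)
    return stale
```

A build job would call `stale_targets` after each commit and run a converter only for the paths it returns, so the master file stays the single source of truth.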
I'm interested in tossing my hat into the ring as well. I couldn't find a way to contact you in your profile or on github so here's me saying "me too!"
A decent master file system is good. Including conversion is sub-optimal. Have a sister project that creates and maintains good software for conversion.
That would draw two crowds: people wanting to help PG and people unwilling to continue using Calibre.
There's no need to wait for central blessing. Just do it. It wouldn't take much scripting work to treat the existing PG as just another branch that you can periodically merge on to your trunk. And if they decide they like it, they can switch to it easily later.
We’ll be contacting Project Gutenberg with a link to this article...
The article makes good points, but makes them in a way that's likely persuasive only to people who already know about version control. Suppose you represent Project Gutenberg and you've never heard of distributed version control. Here's what you get told:
It’s downright foolish not to take advantage of the wonders a good VS can work with this sort of content: versions are revisions, editions are branches, commit logs preserve integrity and posterity, and an index of all changes is forever kept. Nothing is ever lost or overwritten, and the changes over time can be analyzed, indexed, and reviewed... [N]othing could be easier than forking the original, making your changes, then opening a ticket to propose that PG merge your changes back into their "official" distribution!
Some of this will sound interesting, but most will sound like gobbledygook. I also doubt whether the paragraph that follows-- which describes how a DVCS will make it easier for people who intentionally corrupt works found on PG to benefit from the future work of PG members-- will excite the sorts of people who rise to the level of decision-maker at Project Gutenberg.
In short, this is good advocacy for getting people who know about DVCS to help out PG, but it probably isn't good advocacy to get PG interested in DVCS.
CPDL (the Choral Public Domain Library), which is like Project Gutenberg except for Choral Music, has exactly the same problem. I can't tell you how many scores I've used from CPDL that have obvious errors in them that we end up fixing in rehearsal. I have no way of contributing back these fixes except to send a message to the score's author which is not worth the hassle so I don't do it.
I think this is a great idea in general. As an aside, when I first discovered Project Gutenberg I thought it was really exciting, but as I've grown up my perspective has changed: it feels like a desperate attempt to scrape together something like what used to be recognizable as culture from cast-off bits of art that capitalism can no longer commodify.
Same with FLOSS and Creative Commons. They are at the same time wonderful, and yet seem to be the products of a reality in which every relevant part of our so-called cultural identity is owned by a corporation.
I have an idea that the "creative commons" used to exist and that sometime in the 20th century it became hopelessly barren.
I think your view may be just a little more cynical than reality. I'm not saying reality isn't bad (it is), but, in the little bit of "free" blue ocean that exists, incredibly vibrant communities are thriving. Whether that is people sharing truly open photos on Flickr or entire operating systems being built and given away or authors and musicians providing their content to the community, there is still life in non-corporate-controlled culture.
(And, FWIW, I'm truly cynical about our corporate overlords.)
How well do concepts like forking and branching work with wikis, though? You can create categories and place the pages in them (to put the different editions for each title), but you can't really "link" revisions and branches to one another or indicate where they were derived from. And forking is absolutely out of the question.
It seems that something like GitHub's wikis would be a great compromise: a simple web-based editing interface, but backed by a Git repo that you can also access as a vanilla Git repo with plaintext files.
That is a limitation, but I'm not sure the marginal benefit of forking and merging functionality justifies the significant added complexity. Each edition gets its own page, and the revision history of each edition reflects versioning.
On the rare occasion that changes in one edition need to cross over to another, a non-technical user should be able to copy/paste the changes across the two edit pages.
there are more than a few wikis that use version control for their backends. i'm aware of at least 2 that use git; branching and forking in that case would simply be UI issues, and could be handled the same way that github does.
Version control is superior to other "sync" solutions when:
- The data is modified mainly by humans, at human speeds
- Mistakes can be made and need to be reverted atomically
- Data integrity and chains of changes need to be preserved
I agree with the article that PG is a great example of where version control could work with great effect.
The downside is that teaching non-techies how to make it work is difficult, and the first time a merge is required they're frequently in too deep. This is more of a human problem - Word's revision tracking is somewhat inscrutable as well...
I'm the author of TFA and I must completely agree with you on this point. Is there a human-friendly VCS out there? I don't care how featured or not it is, as Git can be configured to pull from it (a la git-svn). Just anything that non-tech-related persons (or even just non-coders!) can use without wanting to pull their hair out and feeling very stupid?
For example, one of my software products has translations powered by the community. The entire translation files (XMLs) are hosted on GitHub, yet invariably all the volunteers would much prefer to email me their translations and I commit them on my end with silly commit messages like "Commit by John Doe jdoe@example.com with fixes for bla bla bla"
---
I suppose a web interface to git could be written, but it would be very limited and would have to be strict about what can/cannot be uploaded. But multiple drop downs with a simple "pick the file you have an update for," "pick the version your changes are based on," "browse for the new version of the file," and "what changes did you make?" inputs would probably be enough for the vast majority of non-techies.
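As a rough sketch of what such a form could hand to the backend, the four inputs map naturally onto a small record plus a strict validation step. Everything here is hypothetical: the `Submission` class, the `contributor` field, and the commit message layout are assumptions for illustration, and nothing below actually talks to git.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    path: str          # "pick the file you have an update for"
    base_version: str  # "pick the version your changes are based on"
    new_text: str      # "browse for the new version of the file"
    summary: str       # "what changes did you make?"
    contributor: str   # assumed extra field so the commit can credit the volunteer


def validate(sub: Submission, known_paths: set[str], known_versions: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the upload is acceptable."""
    errors = []
    if sub.path not in known_paths:
        errors.append("unknown file")
    if sub.base_version not in known_versions:
        errors.append("unknown base version")
    if not sub.new_text.strip():
        errors.append("empty upload")
    if not sub.summary.strip():
        errors.append("missing description of changes")
    return errors


def commit_message(sub: Submission) -> str:
    """Build a commit message from the form fields instead of writing it by hand."""
    return (
        f"{sub.summary.strip()}\n\n"
        f"File: {sub.path}\n"
        f"Based on: {sub.base_version}\n"
        f"Submitted by: {sub.contributor}"
    )
```

Restricting `path` and `base_version` to values the server already knows is what keeps the interface "strict about what can/cannot be uploaded".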
Depends on what features you need. I've seen wikis used as "human-friendly" VCSes. There is no branching, forking and automated merging in most implementations... But those might be good things, depending on who your target group is.
Mercurial (Hg) with TortoiseHg might be a good option. You could even create a pre-configured PG version, to make it even easier for novices. Along with a guide on the basics, I think that could end up working quite well.
Hg also has the benefit, compared to Git, of being strictly cross-platform.
I think Mercurial (and Git) is ridiculously complex for a non-technical user (say, my mother, who has helped with translation projects online, using Word + email). TortoiseHg is great as a GUI version, but it doesn't effectively reduce the complexity - you still need to understand the basic concepts of a DVCS to use it.
On a graph, they'd look like this:
|--|------------------------------|-------------------------------|
git hg                            wiki                            actually usable
                                                                  by non-techies
Completely agree, but with this limited use case it should be easy to provide a tool to abstract away the complexity and give them easy access to the very few features they need.
Something like Github's "forking with the edit button"[1] functionality would be perfect for this. If you want to make a correction, press a button to see the content you were looking at in a web-based editor. Make your changes, and add your email address to a notification field. Changes go into an approval queue, and you get an email if/when your change goes live. No need to expose the casual user to the perils of DVCS, though of course a "proper" git-based infrastructure would exist in parallel for people who want more control.
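The flow described above (edit, leave an email address, wait for approval, get notified) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ApprovalQueue` class is hypothetical, corrections are modelled as simple find-and-replace pairs, and the `outbox` list stands in for real email delivery.

```python
from dataclasses import dataclass


@dataclass
class Correction:
    original: str   # the passage the reader wants to change
    proposed: str   # the reader's replacement text
    email: str      # where to send the notification
    status: str = "pending"


class ApprovalQueue:
    def __init__(self, text: str):
        self.text = text
        self.pending: list[Correction] = []
        self.outbox: list[tuple[str, str]] = []  # (email, message), stand-in for real mail

    def submit(self, original: str, proposed: str, email: str) -> Correction:
        """Queue a correction without touching the published text."""
        correction = Correction(original, proposed, email)
        self.pending.append(correction)
        return correction

    def approve(self, correction: Correction) -> None:
        """Apply an approved correction and notify the submitter."""
        if correction.original in self.text:
            self.text = self.text.replace(correction.original, correction.proposed, 1)
            correction.status = "live"
        else:
            # The passage changed since submission, so the fix no longer applies.
            correction.status = "stale"
        self.pending.remove(correction)
        self.outbox.append((correction.email, f"Your correction is {correction.status}."))
```

The casual user only ever sees `submit`; an editor (or a git-backed process behind the scenes) calls `approve`.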
Indeed, I've been working on this type of tool (without the distributed component) for blog authors who want user-submitted corrections: http://edithuddle.com
When you want a central, de-facto version, it makes sense to have a way of aggregating "similar" corrections (both in type and content) from changes.
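One simple way to aggregate "similar" corrections is to normalize each submission and count how many readers proposed the same fix for the same passage. The sketch below assumes corrections arrive as (original, proposed) pairs and treats two submissions as the same if they differ only in case or whitespace; a real tool would likely need a fuzzier notion of similarity.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so trivially different edits match."""
    return " ".join(text.split()).casefold()


def aggregate(corrections: list[tuple[str, str]]) -> Counter:
    """Count identical (original, proposed) pairs after normalization."""
    return Counter((normalize(orig), normalize(new)) for orig, new in corrections)
```

An editor could then review `aggregate(...).most_common()` and start with the fixes that the most readers agree on.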
Oh it's nowhere near halfway there. I think the idea of using git is a fine one, but some software engineering is very much required. Ideally the project would work for all phases including pushing updated copies to consumers. Once you think in those terms you start approaching a product that's going to work for non-technical editors and writers.
A customized book reading application that allows critical readers to make marginal notes and corrections, and see and rate other people's marginal notes and corrections would be the ticket.
Shove it on GitHub; its in-browser edit functionality means it is accessible to non-techies. Problem solved.
All it needs is someone to pick a master format for the text and run with it. That should be someone with enough domain knowledge to choose wisely. I would say something like markdown with embedded metadata (like how you format a blog post with jekyll), but I admit I don't know anything about the requirements of the project.
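For anyone who hasn't seen the Jekyll convention, a file starts with a metadata block between two `---` lines, followed by the markdown body. A minimal sketch of reading such a file in Python, assuming flat `key: value` metadata only (real Jekyll front matter is full YAML, which this deliberately does not handle); the function name `split_front_matter` is made up for this example:

```python
def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a document into (metadata, body) using '---' delimiter lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter at all
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the book text.
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat everything as body
```

The body stays plain markdown, so the same file is readable in a text editor, diffable in git, and parseable by a build script.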
I'm not sure git's the best choice for handling multiple gigabytes of data. That estimate's based on their DVD releases. Oddly, I'm not positive that they're compressed.
I've seen a proposal for using Usenet (or something like it) for Wikipedia; perhaps that would be appropriate here? Surfacing changes would be easier. And if you didn't want to follow each and every format available, you wouldn't have to. (Having every format in the same silo seems painfully redundant to me.)
You wouldn't place the "binary" output in the repository, just whatever source format you're using to compile the rest. Say (for the sake of the example) LaTeX or, heck, plain UTF-8 text. Only that is stored in the git repo; the rest is generated on the fly with a continuous build system.
I'm not sure how big that would be, but it's not going to be a lot. Currently, they have 38,000 titles so that's 38,000 files of, I dunno, 50,000 words on average? Probably less, as a lot of the PG works are rather short (novellas and short stories), though some are also equally monstrously long.
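Running the numbers from this estimate as a back-of-envelope calculation: the title count and average word count come from the comment above, while the figure of 6 bytes per word (roughly five letters plus a space in plain text) is an assumption added here. Since the 50,000-word average is described as probably too high, treat the result as an upper bound.

```python
TITLES = 38_000            # from the estimate above
WORDS_PER_TITLE = 50_000   # from the estimate above ("probably less")
BYTES_PER_WORD = 6         # assumption: ~5 letters plus a space, plain text

total_bytes = TITLES * WORDS_PER_TITLE * BYTES_PER_WORD
per_title_bytes = WORDS_PER_TITLE * BYTES_PER_WORD

print(f"whole collection: {total_bytes / 1e9:.1f} GB uncompressed")  # 11.4 GB
print(f"single title:     {per_title_bytes / 1e6:.1f} MB uncompressed")  # 0.3 MB
```

Under these assumptions the whole collection is on the order of gigabytes before compression, while any single title is a fraction of a megabyte.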
I wouldn't think that the whole of PG would be a single repository. What came to mind when I read the article was that each book would be in its own repository, and that seems to me to be well within Git's capabilities.
Oh, that would probably work well on the individual book level, with the right plumbing. Dealing with the quantity is another problem, but that might just require more plumbing.
[2] http://www.pgdp.net/c/
[3] The home of many-a-flamewar: http://blog.gmane.org/gmane.culture.literature.e-books.guten...