After years of discussion, they still haven't chosen a master file format for their books.
As a consequence, they are hosting thousands of books in many different file formats provided by Distributed Proofreaders and some other sources.
But since no master-format exists and conversion is .. well .. buggy, they have no leverage on their quality problems. The textual content may be fine, but the reading experience for the casual user shows how big the iceberg is under the surface.
Having seen endless flamewars and several unsuccessful approaches to their formatting and complexity problems, I propose a clear reset.
What about a nice fork with only a few important books that get proper treatment in regard to:
- master file format
- conversion (mainly html, pdf and ereader formats)
- design and page layout (e.g. for pdf versions)
From then on, one could build a growing git repository of _nice_ books and build an infrastructure around that.
Tackling their current mess directly (and taking on the burden of their internal politics and historical toolchain) will be completely in vain, as the past has shown.
I don't have much time to spare. But for a really good cause (and the future of bookreading is a really good cause), I am willing to invest.
Who is in?
 This one is the latest candidate: http://www.gutenberg.org/wiki/Gutenberg:RST
Before that, there was one guy working on a simplified TEI format. But that never took off.
 The home of many-a-flamewar: http://blog.gmane.org/gmane.culture.literature.e-books.guten...
@ComputerGuru, SoftwareMaven, dankin
I will contact you via mail in a bit.
Do you have experience in managing the level of discussion needed to get this right?
What would be best? A public mailing list?
Feedback very welcome.
I've participated in mailing lists for big projects before. They seem to work fairly well.
That would draw 2 crowds; people wanting to help PG and people unwilling to continue using Calibre.
That is _hard_ to do really well. Text is the easy part. What about tables? What about image captions and footnotes?
What such a project would need is one master file format from which other formats are derived. But not the other way around.
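To make the "one master, many derived formats" idea concrete, here is a rough sketch. It assumes a toy master format (a list of typed blocks) that I made up for illustration; `Block`, `to_html`, `to_text` and `build_all` are hypothetical names, not anything PG actually uses. The point is only that every output is generated from the master and nothing is ever converted back.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Block:
    kind: str  # "heading" or "para" in this toy master format (assumption)
    text: str


def to_html(blocks):
    """Derive HTML from the master blocks."""
    tags = {"heading": "h1", "para": "p"}
    return "\n".join(
        f"<{tags[b.kind]}>{escape(b.text)}</{tags[b.kind]}>" for b in blocks
    )


def to_text(blocks):
    """Derive plain text from the master blocks (headings in upper case)."""
    parts = [b.text.upper() if b.kind == "heading" else b.text for b in blocks]
    return "\n\n".join(parts)


# Every output format is a function of the master; there is no reverse path.
RENDERERS = {"html": to_html, "txt": to_text}


def build_all(blocks):
    return {fmt: render(blocks) for fmt, render in RENDERERS.items()}


master = [Block("heading", "Moby-Dick"), Block("para", "Call me Ishmael.")]
outputs = build_all(master)
```

Tables, captions and footnotes would each need their own block kinds and renderers, which is where the real work is.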
Having disjoint repositories for books/software is a good idea though.
How to be more awesome: do cool stuff for people without needing them to hold your hand.
The article makes good points, but makes them in a way that's likely persuasive only to people who already know about version control. Suppose you represent Project Gutenberg and you've never heard of distributed version control. Here's what you get told:
It’s downright foolish not to take advantage of the wonders a good VS can work with this sort of content: versions are revisions, editions are branches, commit logs preserve integrity and posterity, and an index of all changes is forever kept. Nothing is ever lost or overwritten, and the changes over time can be analyzed, indexed, and reviewed... [N]othing could be easier than forking the original, making your changes, then opening a ticket to propose that PG merge your changes back into their "official" distribution!
Some of this will sound interesting, but most will sound like gobbledygook. I also doubt whether the paragraph that follows-- which describes how a DVCS will make it easier for people who intentionally corrupt works found on PG to benefit from the future work of PG members-- will excite the sorts of people who rise to the level of decision-maker at Project Gutenberg.
In short, this is good advocacy for getting people who know about DVCS to help out PG, but it probably isn't good advocacy to get PG interested in DVCS.
Same with FLOSS and Creative Commons. They are at the same time wonderful, and yet seem to be the products of a reality in which every relevant part of our so-called cultural identity is owned by a corporation.
I have an idea that the "creative commons" used to exist and that sometime in the 20th century it became hopelessly barren.
(And, FWIW, I'm truly cynical about our corporate overlords.)
But I'm preaching to the choir.
On the rare occasion that changes in one edition need to cross over to another, a non-technical user should be able to copy/paste the changes across the two edit pages.
- Modified mainly by humans, at human speeds
- Mistakes can be made and need to be reverted atomically
- Data integrity and chains of changes need to be preserved
The downside is that teaching non-techies how to make it work is difficult, and the first time a merge is required they're frequently in too deep. This is more of a human problem - Word's revision tracking is somewhat inscrutable as well...
For example, one of my software products has translations powered by the community. All the translation files (XML) are hosted on GitHub, yet invariably the volunteers would much rather email me their translations, and I commit them on my end with silly commit messages like "Commit by John Doe email@example.com with fixes for bla bla bla".
I suppose a web interface to git could be written, but it would be very limited and would have to be strict about what can/cannot be uploaded. But multiple drop-downs with a simple "pick the file you have an update for," "pick the version your changes are based on," "browse for the new version of the file," and "what changes did you make?" inputs would probably be enough for the vast majority of non-techies.
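A minimal sketch of what such a form could turn into on the server side. The four form inputs map onto the first four fields; the author name and email fields, the `Submission` class, the `commit_commands` helper and the `submission/...` branch naming are my own assumptions for illustration. The git commands themselves (`checkout -b`, `add`, `commit --author`) are standard, and `--author` would replace the "Commit by John Doe" style of message with proper attribution.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    path: str           # "pick the file you have an update for"
    base_rev: str       # "pick the version your changes are based on"
    new_content: bytes  # "browse for the new version of the file"
    summary: str        # "what changes did you make?"
    author_name: str    # assumption: collected from the contributor's account
    author_email: str   # assumption: same


def commit_commands(sub):
    """Return the git commands a server could run for one submission."""
    author = f"{sub.author_name} <{sub.author_email}>"
    branch = f"submission/{sub.base_rev[:8]}"  # hypothetical naming scheme
    return [
        # Start a branch from the revision the contributor edited.
        ["git", "checkout", "-b", branch, sub.base_rev],
        # (The server would write sub.new_content to sub.path at this point.)
        ["git", "add", "--", sub.path],
        # Record the contributor as author instead of naming them in the message.
        ["git", "commit", f"--author={author}", "-m", sub.summary],
    ]
```

The maintainer still reviews and merges the resulting branch, so the strictness about what can be uploaded lives in one place.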
On a graph, they'd look like:
git hg wiki actually usable
When you want a central, de-facto version, it makes sense to have a way of aggregating "similar" corrections (both in type and content) from changes.
A customized book reading application that allows critical readers to make marginal notes and corrections, and see and rate other people's marginal notes and corrections, would be the ticket.
All it needs is someone to pick a master format for the text and run with it. That should be someone with enough domain knowledge to choose wisely. I would say something like Markdown with embedded metadata (like how you format a blog post with Jekyll), but I admit I don't know anything about the requirements of the project.
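For anyone who hasn't seen the Jekyll convention: the metadata sits in a block between two `---` lines at the top of the file, and the text follows. Here is a rough sketch of reading such a file. Real Jekyll front matter is YAML; this hypothetical `split_front_matter` helper only handles flat `key: value` lines, and the metadata keys in the example are my own guesses, not a proposed schema.

```python
def split_front_matter(source):
    """Split a Jekyll-style document into (metadata dict, body text).

    Simplification: only flat "key: value" lines are understood, not full YAML.
    """
    lines = source.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, source  # no front matter at all
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter found
            body = "\n".join(lines[i + 1:])
            return meta, body.lstrip("\n")
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return {}, source  # no closing delimiter: treat everything as body


book = """---
title: Pride and Prejudice
author: Jane Austen
language: en
---

# Chapter 1

It is a truth universally acknowledged...
"""

meta, body = split_front_matter(book)
```

Whether plain Markdown can carry footnotes, tables and captions well enough for books is a separate question.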
I've seen a proposal for using Usenet (or something like it) for Wikipedia; perhaps that would be appropriate here? Surfacing changes would be easier. And if you didn't want to follow each and every format available, you wouldn't have to. (Having every format in the same silo seems painfully redundant to me.)
I'm not sure how big that would be, but it's not going to be a lot. Currently, they have 38,000 titles so that's 38,000 files of, I dunno, 50,000 words on average? Probably less, as a lot of the PG works are rather short (novellas and short stories), though some are also monstrously long.
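Back-of-the-envelope version of that estimate. The title count and the 50,000 word average are the guesses from above; the 6 bytes per word figure is my own assumption (roughly five letters plus a space in plain ASCII text), so treat the result as an order of magnitude only.

```python
titles = 38_000           # figure quoted above
words_per_title = 50_000  # rough guess from above, probably on the high side
bytes_per_word = 6        # assumption: ~5 letters plus a space, plain ASCII

total_words = titles * words_per_title
total_bytes = total_words * bytes_per_word
total_gb = total_bytes / 1e9

print(f"{total_words:,} words, about {total_gb:.1f} GB uncompressed")
```

That is for plain text only, before compression, and it ignores images and the derived formats.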