BTW, I recently learned that Gutenberg was not his name, which is really a significant historical inaccuracy. His name was Hannes Gensfleisch. "Gutenberg" was just one of the places his family resided.
Alternatively, you could use, and customize, an open-source self-hostable alternative, such as Gitorious: https://gitorious.org/gitorious
> BTW, I recently learned that Gutenberg was not his name, which is really a significant historical inaccuracy.
Do you have any source for this fact that Gutenberg wasn't using "Gutenberg" as his last name?
Wallau adds, "His surname was derived from the house inhabited by his father and his paternal ancestors 'zu Laden, zu Gutenberg'. The house of Gänsfleisch was one of the patrician families of the town, tracing its lineage back to the thirteenth century." Patricians (aristocrats) in Mainz were often named after houses they owned. Around 1427, the name zu Gutenberg, after the family house in Mainz, is documented to have been used for the first time.
Github does have issues with Unicode repo names, so it may be worth moving elsewhere for that reason.
But in the short term, I like Github because they have the most to gain by making git easy to use. If I can get away with people editing books in their browsers on github then we have an editing/rendering toolchain right out of the box!
We need momentum around one integrated OSS toolchain, including illustration (Inkscape) and CMYK color for printing (Scribus). This would help editors and publishers who want to move away from rental pricing for authoring software.
As anyone who has tried to find good books in a sea of free books knows, there isn't a standardized way of collaboratively improving book metadata. This needs to include e-production history, translators, print publication history and content objects _within_ books, viz. Doug Engelbart's purple numbers, http://www.dougengelbart.org/about/ohs.html
re: Metadata. I'm slowly working on that problem. I'm a HUGE fan of purple numbers.
Would you recommend any existing orgs, mailing lists or forums where there's already an active community for discussion?
And there has been some conversation about the project on the OKFN-humanities list:
OKFN is organizing a skype call about the project on the 1st of September 6pm UK time that will be announced on the humanities list in the next few days.
The new project follows the Github programming model, which means if a person wants to contribute to a book she has to clone the book, make local changes, push, issue a pull request. This is far, far more complex than stopping by pgdp for a ten-minute proofing session. Very few of the drive-by proofers at DP could manage that technically, or would want to be that involved.
Most importantly, it lacks the QC inherent in the DP model of having multiple later reviewers catching earlier errors. Who will do line-by-line vetting of the accuracy of pull requests? Besides the inevitable detail mistakes, there are potential problems similar to those faced by Wikipedia: who will even notice if some local zealot decides to insert editorial comments in, or to bowdlerize the language of, some classic?
It's true that DP's work ends when the book is posted to PG, and that PG has only a feeble update method (email to their errata contact). I could certainly see something like this project as an adjunct to PG, dedicated to continually refreshing the library, but with editorial control over which books go into it.
At some point, I would like to investigate DP tools and see if there is something I could contribute.
For example, I contributed enhancements to "Alice in Wonderland" https://github.com/GITenberg/Tenniel-Illustrations-for-Alice... (from a version in mobileread). DP produced one edition; there are in fact many public domain versions of Alice. How to keep them all straight? PG has no answer, but version control systems allow us to use fork and merge processes to start to deal with the way the real world works.
It's the diff-viewing infrastructure that needs to be completely replaced for prose.
I think you're right: if the diff-viewing infrastructure was abstracted (plugin?) it could do this and more (eg word-diff-by-paragraph, image-diff).
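To make the word-diff-by-paragraph idea concrete, here is a minimal sketch using only Python's standard `difflib`. It is not how GitHub or GITenberg renders diffs; the function names (`split_paragraphs`, `word_diff`, `prose_diff`) and the `[-old-] {+new+}` markers are my own choices for illustration, and it assumes paragraphs are separated by blank lines.

```python
import difflib


def split_paragraphs(text):
    """Split prose into paragraphs on blank lines, collapsing inner whitespace."""
    return [" ".join(p.split()) for p in text.split("\n\n") if p.strip()]


def word_diff(old_par, new_par):
    """Return a word-level diff of two paragraphs as one inline-marked string."""
    old_words, new_words = old_par.split(), new_par.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)


def prose_diff(old_text, new_text):
    """Diff two texts paragraph by paragraph, reporting only changed paragraphs."""
    old_pars, new_pars = split_paragraphs(old_text), split_paragraphs(new_text)
    matcher = difflib.SequenceMatcher(a=old_pars, b=new_pars, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # untouched paragraphs are not reported
        old_block = " ".join(old_pars[i1:i2])
        new_block = " ".join(new_pars[j1:j2])
        changes.append(word_diff(old_block, new_block))
    return changes
```

Because the comparison runs on words inside paragraphs rather than on physical lines, re-wrapping a paragraph produces no diff at all, which is most of what makes line-based diffs painful for prose.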
Harpjs is two steps ahead on this one.
What often happens is that editors have one translation of a book, say Les Misérables, and keep reprinting the same translation independently of the quality. So I was thinking that a GitHub-like platform to foster translation would be a great idea. Looks like gitenberg might be the project just for that.
But maybe it should pick a clone (GitLab?), self-host it, and fork/extend that tool to ease the use so that non-developers could use the site without git knowledge. Then again, tailoring for translation might not be needed.
We're on the same wavelength here. I forked GitLab to make a 'github for writers'. Still backed by git repos, but a simplified web UI for less technical users. If you're interested in working together, let's chat.
I'll still reproduce the comment here in case you can get some insights or anyone else is interested.
I've been researching on Markdown+Git book publishing (inspired by 'Markdown to Ebook') and found that there are already three 'GitHubs for writers': GitBook, Penflip and Arturo.io. Each has its own strengths and weaknesses:
GitBook: just a publishing platform backed by Git.
- Standalone app.
- It is a bookstore.
- Publishes to major stores.
- Ugly MSWord-like typesetting.
- There's no "social collaboration" at all, seems like it's just backed by Git. Not sure if small-scale collaboration (couple authors) is supported in the app or you have to deal with Git complexities yourself.
- Seems technical-oriented. No fiction categories.
Penflip seems to fit your idea more (note: now I know why it fits your idea :P), being collaborative like GitHub.
- It has no integration with bookstores.
- It's not a bookstore and you can't discover books easily.
- Looks more like a "free books" platform using 'free' in the FSF sense.
- It's hard to find a complete book to peek into, but the output seems to be just as ugly as GitBook's. AFAIK they let you customize output, but the typesetting doesn't seem to be LaTeX-like and I suspect it won't be up to the job. Defaults are very important; it should be beautiful right out of the box.
Arturo.io's page is currently down (some cert error). It looked just like a bunch of webhooks for GitHub. Seemed immature and not for less technical users (still requires Git/Hub knowledge).
As non-Git alternatives I found Leanpub and Softcover.
Leanpub: social bookstore and publishing platform.
- Lean Publishing.
- The author-reader interaction is awesome.
- Their PDF output is beautiful.
- It's a bookstore (90% royalties!) with social-network aspects.
- Makes it really easy to create bundles, sharing royalties with other authors, etc. Really awesome feature.
- Tools for marketing are awesome. Integrated with Google Analytics.
- I can't download their toolchain (but a local workflow is somewhat reproducible following 'Markdown to Ebook')
- Does not rely on Git, but on Dropbox. No proper version control.
- No collaboration.
- Does not publish to major bookstores (but allows you to do so).
Softcover: major Leanpub competitor. In the publishing aspect it seems to be pretty much the same, but their store philosophy is different. Their aim is not to be a bookstore but just a payment processor. You deal with your own marketing, set up your own domain.
- I can download their toolchain. As far as I can tell, I could self-publish not using their platform without hassle. Not being tied to a provider is a HUGE selling point for me.
- AFAICT still supports Lean Publishing with their generated landing pages.
- It's a book payments processor. 90% royalties!
- Lets you control your own marketing, domain, etc. (has downsides)
- Since you control your own marketing there are no social network aspects. Each book is supposed to be its own page. No bookstore. No way to explore and discover other books.
- A bit too technical.
- DIY version control, no integrated collaboration.
- Does not publish to major bookstores (but allows you to do so AFAICT).
Even though I'd love a Git-backed workflow I'll stick with one of Leanpub or Softcover because of how beautiful they look. I can still Git it myself. Major selling point for me and the non-techie friends I've been talking to.
The bookstore integration in both is a big selling point too!
I still consider Leanpub since I can replicate their toolchain and it seems so easy and powerful for my non-tech friends. Letting users discover your book in the bookstore is really useful.
Now that I know you're from Penflip I will summarize:
I can see you're not a competitor with publishing platforms. As far as I can tell, you're more like GitHub, in the private-repo business instead of taking a cut from sales. Penflip seems great for social collaborative stuff, but I wouldn't choose it if I planned on selling my book.
As I said, typesetting is very important. Your platform is awesome, but the rendering really put off my friends. Penflip books look like HTML rendered to a PDF (which I guess they actually are). Did you consider moving to LaTeX-based rendering for PDFs? Markdown -> LaTeX -> PDF is the way to go.
Git is a great selling point, but secondary. Book authors just don't know it yet, even though it's one of those features that you just love when you try.
I fail to realize how this could be useful for GITenberg though. Do you intend to publish the books or automate the publishing perhaps? If so, as far as I can tell GITenberg files are not structured, and won't lend themselves easily to automated publishing.
I guess the great thing about GITenberg is anyone could do their own structured .md version and request a pull. It would be cool to have some automation that generates and releases nice PDFs whenever an .md file is available.
GitHub should really put some work into improving their feed algorithm so one project can't just clog it all.
For example, known books by established publishers, but with a self-publishing arm http://self.gutenberg.org/
That toolchain is something I effectively have to build for GITenberg anyway.
Pull requests welcome: https://github.com/GITenberg/gitenberg.github.com
I found https://www.penflip.com/ a few months ago... It isn't focused on building a digital library yet but what I like of this project is the good execution. It would be nice to merge them together!
Besides translations, what can people besides the author contribute? Doesn't it, on some level, ruin the character of these books? If you look at a non-fiction book from 80 years ago, is it worth bothering to correct the information when you can probably find it at your fingertips on wikipedia?
But to answer your question, the main area where I've found Project Gutenberg's epubs could be improved is in their navigation outline (the toc.ncx file). For example, they often use top-level headings for each line from the title page, then put the entire book under the last line. Whereas other books are closer to what you'd expect, albeit at inconsistent levels of detail. For my project, I abandoned their TOC's altogether and created a simpler format.
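For anyone who wants to inspect those navigation outlines, here is a rough sketch that flattens a `toc.ncx` into (depth, label, target) rows with the standard library. The NCX namespace is the real one from the spec; `ncx_outline` and the sample document are made up for illustration and are not taken from any actual PG book.

```python
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# Hand-written sample (hypothetical) showing the pattern described above:
# title-page lines as top-level entries, the book nested under the last one.
SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="np-1" playOrder="1">
      <navLabel><text>A BOOK TITLE</text></navLabel>
      <content src="title.html"/>
    </navPoint>
    <navPoint id="np-2" playOrder="2">
      <navLabel><text>BY AN AUTHOR</text></navLabel>
      <content src="title.html#by"/>
      <navPoint id="np-3" playOrder="3">
        <navLabel><text>Chapter 1</text></navLabel>
        <content src="ch1.html"/>
      </navPoint>
    </navPoint>
  </navMap>
</ncx>"""


def ncx_outline(ncx_xml):
    """Return (depth, label, src) tuples for every navPoint, in document order."""
    root = ET.fromstring(ncx_xml)
    outline = []

    def walk(parent, depth):
        # Only direct navPoint children, so nesting depth is tracked correctly.
        for point in parent.findall("ncx:navPoint", NCX_NS):
            label = point.findtext(
                "ncx:navLabel/ncx:text", default="", namespaces=NCX_NS
            )
            content = point.find("ncx:content", NCX_NS)
            src = content.get("src", "") if content is not None else ""
            outline.append((depth, label.strip(), src))
            walk(point, depth + 1)

    nav_map = root.find("ncx:navMap", NCX_NS)
    if nav_map is not None:
        walk(nav_map, 1)
    return outline


for depth, label, src in ncx_outline(SAMPLE_NCX):
    print("  " * (depth - 1) + label, "->", src)
```

Printing the outline this way makes the odd nesting obvious at a glance, and the same rows could be the input for generating a simpler TOC format.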
The images are also at a bare-minimum of resolution. In some cases, higher-quality versions are available in the public domain (such as on Wikimedia Commons). Most of the books are also scanned on archive.org, and so can be referenced there in facsimile. These tend to be higher-resolution scans (although those are all monochrome that I've seen).
For corrections in the works proper, I have occasionally submitted corrections by email but never received a response.
Otherwise, they are perfect, and I thank them for their outstanding work.
EDIT: There are also rare cases (I think Seneca was the one I came across) where the id's are not unique across the book, even if they are within the HTML files. I couldn't find anywhere in the EPUB specification that would require this, yet for practical purposes I think they should be made unique across the book, since the division into HTML files is arbitrary.
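Checking for this is easy to script. A minimal sketch, assuming you have already read the book's HTML files into a dict of {file name: source}; `IdCollector` and `duplicate_ids` are names I made up, and it relies on the lenient standard-library `html.parser` rather than a strict XHTML parser:

```python
from collections import defaultdict
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the value of every id attribute seen in one HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <a id="x"/>.
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


def duplicate_ids(files):
    """Map each id that occurs more than once to the files it occurs in.

    `files` is a dict of {file name: HTML source}.
    """
    seen = defaultdict(list)
    for name, source in files.items():
        collector = IdCollector()
        collector.feed(source)
        collector.close()
        for id_value in collector.ids:
            seen[id_value].append(name)
    return {i: names for i, names in seen.items() if len(names) > 1}
```

An id repeated within a single file shows up with that file listed twice, so the same check covers both the per-file and the whole-book case.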
Further to that, there are some PG books that have a unique (serial) ID on every paragraph. Again, this is not required, but it's extremely helpful when it's there (for anchor referencing). It would make the whole library more usable if this were applied consistently, and the serial id's are apparently mechanically applied.
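I don't know how PG applies those serial ids, but a mechanical pass could look roughly like this. It is a regex-based sketch, not a proper HTML rewrite: `number_paragraphs` and the `p00001` id format are my own invention, it treats any attribute ending in `id=` (e.g. `data-id`) as an existing id, and it does not check that the generated ids are unused elsewhere in the book.

```python
import re

P_TAG = re.compile(r"<p\b([^>]*)>", re.IGNORECASE)  # <p ...> but not <pre>
HAS_ID = re.compile(r"\bid\s*=", re.IGNORECASE)


def number_paragraphs(files, prefix="p"):
    """Give every <p> without an id a serial id that is unique across all files.

    `files` is an ordered dict of {file name: HTML source}; returns a new dict.
    """
    counter = 0

    def add_id(match):
        nonlocal counter
        attrs = match.group(1)
        if HAS_ID.search(attrs):
            return match.group(0)  # keep existing ids untouched
        counter += 1
        return f'<p id="{prefix}{counter:05d}"{attrs}>'

    # The counter keeps running from one file to the next, so the ids stay
    # unique even though the split into files is arbitrary.
    return {name: P_TAG.sub(add_id, source) for name, source in files.items()}
```

Since the numbering depends only on paragraph order, re-running it on an unchanged book gives the same anchors, which is what you want for stable references.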
* substitutes for characters missing from ASCII (e.g. L for £, no proper dashes)
* incorrectly delimited chapter heads, e.g.:
*CHAPTER 11: THE GREAT*
BOONDOGGLE George Boondoggle sat on his lawn...
instead of:
*CHAPTER 11: THE GREAT BOONDOGGLE*
George Boondoggle sat on his lawn...
* missing italics, underlines, etc.
* ASCII-fied equations and diagrams
These do not detract from PG's original goal of being an archive of plain text, and suffice to provide scholars of the 22nd century a good view of what was written in the 19th, but they do detract from the experience of somebody who just wants to read Anna Karenina for fun. (Especially if they are a typography nerd like me)
Additionally, using a single repo would allow me to fork and specify my own styles that I want applied to any work I "compile", and these might be hyper-specific.
I'm actually willing to help consolidate these repos if you're willing to go in this direction. I'd also like to hear reasoning for multiple repos if there's something I'm missing.
I was looking to write an alternative search for Gutenberg, based on the RDF dump, so would be happy to collaborate/discuss ideas.
I would love to have a complete python parser for the metadata. I strongly recommend collaborating with the Gutenberg package posted to HN a few weeks ago (and his rdf branch):
GITenberg has a mailing list and would love to have you!
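On the metadata parser: as a starting point for discussion, here is a small sketch of pulling a few fields out of one RDF record with the standard library. The namespace URIs and element layout are my understanding of the PG RDF dump and should be checked against a real file; the sample record, its values, and `parse_ebook` are hypothetical and hand-written for illustration.

```python
import xml.etree.ElementTree as ET

# Namespaces as I understand the PG RDF dump to use them (an assumption).
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Simplified, made-up record; real records carry many more fields.
SAMPLE_RDF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/99999">
    <dcterms:title>An Example Title</dcterms:title>
    <dcterms:creator>
      <pgterms:agent>
        <pgterms:name>Author, Example</pgterms:name>
      </pgterms:agent>
    </dcterms:creator>
    <dcterms:language>
      <rdf:Description>
        <rdf:value>en</rdf:value>
      </rdf:Description>
    </dcterms:language>
  </pgterms:ebook>
</rdf:RDF>"""


def parse_ebook(rdf_xml):
    """Extract id, title, creators and languages from one RDF record."""
    root = ET.fromstring(rdf_xml)
    ebook = root.find("pgterms:ebook", NS)
    if ebook is None:
        return None
    about = ebook.get("{%s}about" % NS["rdf"], "")
    names = ebook.findall("dcterms:creator/pgterms:agent/pgterms:name", NS)
    langs = ebook.findall("dcterms:language//rdf:value", NS)
    return {
        "id": about.rsplit("/", 1)[-1],  # "ebooks/99999" -> "99999"
        "title": ebook.findtext("dcterms:title", default="", namespaces=NS),
        "creators": [n.text for n in names if n.text],
        "languages": [v.text for v in langs if v.text],
    }
```

A search index would then just be a loop over the dump calling something like `parse_ebook` per record; a complete parser would also need subjects, formats and translators.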
Using Git for just about anything other than what it was built for is a terrible idea. I mean the underlying system is incredibly powerful and could be useful in various projects, but the interface is horrific. I swear it's like someone tried to make Git as difficult as possible to use. Programmers have a hard time understanding and using Git; non-programmers will just laugh and walk away. Every time a programmer has an issue with Git, whoever helps them has to sit down and explain the underlying system for 20 minutes and draw a bunch of sticks and bubbles. Non-programmers will never put up with this.
I appreciate this comment with respect to Git right now. I've recently spent a lot of "hammock time" trying to come to grips with my views about this profession generally and what I believe is best going forward. One thing I feel strongly about is that while we are still maturing as a field, the pain points are unacceptable. There is still so much work to offload to the machine, requiring fundamental rethinking at many levels. So although I agree in principle with the initiative to help people "learn to code" (so that we can bring system design closer to the domain experts), I also believe that in the current state of things, it's a wasteful effort, since it requires conveyance of ideas that should be deprecated.
But even short of programming, version control alone would be useful in so many other fields. There's no reason why it shouldn't be a mainstream concept even for personal use (e.g., you're working on a thesis). Just an hour ago, during my annual flirtation with Git (I'm a Mercurial user), I wrote in my notes:
> the barrier to entry for new programmers is important. This would appear to weigh in favor of Mercurial — and yet, realistically, is a “layman,” i.e. someone who knows nothing about software development and has never used a CLI, really going to distinguish between these two systems, or will the very concepts of a VCS not prove to be the biggest hurdle?
I have used Git, and I think that for linear history the differences are not remarkable. But the attitude you refer to is crucial: do we want to hide complexity or expose it?
Incidentally, I have several Project Gutenberg epubs under version control for a personal project, and like the OP I attest that their work is first-rate. There's no comparison to any other digitizer in the public domain (that I know of).
This isn't true at all for a lot of people. I know a lot of people that just read the docs and are able to solve the issues. Others will Google the problem and find the solution on stack overflow. Everyone learns differently...
> Anytime someone shoe-horns it into a product they talk about how Git is so amazing and solves all these problems, but what they are really talking about is just a version control system, not Git specifically.
Git is amazing and does solve a lot of problems, but there are problems that aren't solved by Git. Even Linus himself says this here: (https://www.youtube.com/watch?v=4XpnKHJAok8).
Using the github API, rather than git, for creating epub books and pdfs is a great idea. Using git to control changes as they do is perfect as well.
> Non-programmers will never put up with this.
Ermm don't assume that everyone gives up right away. With the GUI interfaces we have today, Git is really simple once you learn it.
Pretty much everything is simple once you learn it. That's what learning is. But git certainly doesn't go out of its way to make that process easy.
I wouldn't say so. A lot of things are designed-by-committee implemented-by-the-lowest-bidder messes that are painful and complex even once you know how they work.
Git may have some weird design decisions but for the most part it's well-implemented and follows a simple conceptual model.
I think that's why people are now starting to think about applying version control to domains outside of code and choosing Git to do it. For example, I had an idea a few years ago to make a CMS on top of Subversion as the data store (never got around to building it though). Now there are lots of projects like that built on top of Git: CMS, Wikis, you name it. Generally anything that can work off flat files is very easily converted to use Git as a back-end, giving you advanced version control features more or less for free.
From a practical perspective, the difference is not just that Git is a trendy new silver bullet, it's the "dumbness" that makes it actually easier to do that kind of work than it would be on older version control systems like SVN. Interestingly, for the most part, most of these projects do not really benefit from the distributed nature of Git (although for things like wikis and CMSs it can offer yet another feature: content migration between instances). It's more about the ease of use for getting data into a repository and under version control without it exploding when something unexpected happens, like a file getting renamed.
You might not get the best front-end experience for actually doing stuff with that version history (as other comments have noted wrt diff tools, etc., which tend to be geared towards code rather than other types of content) but that's the fault of those tools rather than Git (which is dumb enough not to care about content types), so it's just a question of incrementally building up a better toolset for your particular content domain. That's much easier to do and more approachable than building the whole infrastructure from scratch.
As for Github, it happens to have a nice interface, toolset, documentation and mindshare. Developers are familiar and comfortable with it, so there's no need to research and learn "yet another tool". And because it's cloud based, you can get up and running very quickly without worrying about hosting, etc. That's just more icing on the cake really.
Git's popularity isn't because it's the best tool out there for all scenarios. It's popular because it's a distributed system that helped communities grow around code managed with it while removing barriers to entry. In my opinion, that more than anything will be Git's lasting legacy.
You can use a distributed vcs as a centralized one if you want.
But from my experience, they have to do this exactly once per (non-stupid) programmer. The moment you grok the underlying structure (basically all graphs and pointers), the apparent complexity disappears and most of the things in git become obvious. I see no problems with explaining this to non-programmers as well; you just have to spend a little more time, because they probably aren't used to thinking in terms of graphs.
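The "graphs and pointers" picture fits in a few lines. This is a toy model for explanation only, not how Git is implemented: real commit ids also hash the tree, author and timestamps, and merges are left out. `ToyRepo` and its methods are made-up names.

```python
import hashlib


class ToyRepo:
    """A toy model of the idea: an immutable commit graph plus movable pointers."""

    def __init__(self):
        self.commits = {}   # commit id -> (parent ids, message)
        self.branches = {}  # branch name -> commit id (just a pointer)
        self.head = "main"  # name of the current branch

    def commit(self, message):
        """Add a commit whose parent is the current branch tip, then move the tip."""
        parent = self.branches.get(self.head)
        parents = (parent,) if parent else ()
        digest = hashlib.sha1(repr((parents, message)).encode()).hexdigest()
        commit_id = digest[:7]
        self.commits[commit_id] = (parents, message)
        self.branches[self.head] = commit_id
        return commit_id

    def branch(self, name):
        """Create a branch: a new pointer to the current commit. Nothing is copied.

        Raises KeyError if the current branch has no commits yet.
        """
        self.branches[name] = self.branches[self.head]

    def checkout(self, name):
        """Switch the current branch; only the head pointer changes."""
        if name not in self.branches:
            raise KeyError(name)
        self.head = name

    def log(self, name=None):
        """Walk first-parent links from a branch tip back to the root."""
        commit_id = self.branches.get(name or self.head)
        history = []
        while commit_id:
            parents, message = self.commits[commit_id]
            history.append(message)
            commit_id = parents[0] if parents else None
        return history
```

With this in hand, "branching" is visibly just adding a dictionary entry, and two branches that share history literally point into the same chain of commits, which is the part that tends to make things click.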