
Can I bother you with some accessibility questions? (Sorry HN, but there is no contact info in your profile.) I am currently converting a giant (2800+ pages) government PDF[1] to markdown so that it can be converted to HTML/EPUB/LaTeX/etc. I am casually aware of the DAISY file format but have not found any free software tools for creating DAISY files. Is there anything I can do that would be better for you than DRM-free epub/mobi files and/or plain markdown text files?

[1]: http://www.gpo.gov/fdsys/pkg/GPO-CONAN-2013/pdf/GPO-CONAN-20... -- http://www.gpo.gov/fdsys/pkg/GPO-CONAN-2013/content-detail.h...




Will try to look at that site tonight. I've updated my profile with contact info.


Many thanks. When I have something close to a final product I will get in touch with you.


How are you converting it?


The initial bit is kind of easy: I am just using pdftotext. This is a simplification (I actually use one txt file per page, grouped by section):

  PDFOPT='-x 0 -y 160 -W 556 -H 553 -nopgbrk -layout'
  pdftotext $PDFOPT pdfs/GPO-CONAN-2013.pdf out.txt

Things get tricky after the raw txt is created. From there the conversion is a lot of shell scripts and elbow grease, and I have not yet been able to create a workflow that is completely automated. The parts I am currently struggling with are replicating the footnotes (extracting them is pretty easy with csplit) and recreating the index. There are thousands of footnotes (some sections alone have thousands), and the numbering resets with each section. I think I am going to use Python to recreate the footnotes in markdown format.
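
Because the numbering resets with each section, the footnote labels have to be made unique before the sections are joined into one markdown file. A rough sketch of how that could look, not what I actually run: it assumes the in-text markers have already been normalized to a made-up `[fn:N]` form and that the footnote texts for a section sit in a dict keyed by number. The function name and the section id are hypothetical too.

```python
import re

# Hypothetical marker format; the real pdftotext output would need
# its own pattern for spotting footnote references in the body text.
MARKER = re.compile(r"\[fn:(\d+)\]")


def section_to_markdown(section_id, body, footnotes):
    """Rewrite [fn:N] markers as pandoc footnote references.

    Labels are prefixed with the section id, so footnote 1 of one
    section cannot collide with footnote 1 of another.
    """
    used = []  # footnote numbers in order of first appearance

    def replace(match):
        number = int(match.group(1))
        if number not in footnotes:
            raise KeyError(f"section {section_id}: no text for footnote {number}")
        if number not in used:
            used.append(number)
        return f"[^{section_id}-{number}]"

    text = MARKER.sub(replace, body).rstrip()
    # Collapse line breaks left over from the page layout so each
    # definition fits on a single line.
    definitions = [
        f"[^{section_id}-{n}]: {' '.join(footnotes[n].split())}" for n in used
    ]
    if not definitions:
        return text + "\n"
    return text + "\n\n" + "\n".join(definitions) + "\n"
```

Calling `section_to_markdown("a1", body, notes)` for each section and concatenating the results should give one file in which every footnote label is unique, which is the form pandoc's markdown footnote syntax expects.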

Once I have converted it to markdown I am going to use pandoc for output formatting.
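
The pandoc step could be driven from the same Python script. This is only a sketch under assumptions: the file names are placeholders, and it relies on pandoc picking the output format from the file extension of the `-o` target.

```python
import subprocess


def pandoc_command(source, target):
    """Build the pandoc invocation for one output file.

    pandoc infers the output format from the target's extension;
    --standalone asks for a complete document rather than a fragment.
    """
    return ["pandoc", source, "--standalone", "-o", target]


def build_all(source, stem, extensions=("html", "epub", "tex")):
    """Run pandoc once per output format (requires pandoc on PATH)."""
    for ext in extensions:
        subprocess.run(pandoc_command(source, f"{stem}.{ext}"), check=True)
```

For example, `build_all("conan.md", "conan")` would try to write conan.html, conan.epub and conan.tex next to the source, stopping at the first pandoc failure.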

Are you familiar with CONAN? It is an amazing work of scholarship; I don't know how to express my appreciation without resorting to hyperbole. It is a shame that it is distributed in a format that does not let it shine the way it could. There are some neat legal citation extraction scripts in Node.js that could lead to a really useful web version. I look forward to the day I can add it to my Kindle and read it from start to finish.



