
i have been using catdoc and pdftotext to convert .doc and pdf files, respectively. nice to see that there's an alternative that also includes a library. will be checking this out.

a couple of questions. first, it seems that old-school .doc files are not supported, only docx. unfortunately i still get a lot of documents in .doc format, which seems to be microsoft's proprietary format (docx seems to be more open).

my second question is whether there's a filter library for golang. most of my development is in golang, so i either need to call your cli as a forked process or, better, have a native library. i have never worked with haskell, so i'm not sure if i can import a haskell library from golang directly. i imagine there'd need to be a golang wrapper around the cli.
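for the forked-process route, a minimal sketch of what a golang wrapper around the cli could look like. it assumes a `pandoc` binary is on the PATH; the helper names `pandocArgs` and `convertToPlain` and the 30 second timeout are made up for illustration, and only the `--from` / `--to` flags and the `docx` / `plain` format names come from pandoc itself.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// pandocArgs builds the argument list for converting one file to plain text.
// (Hypothetical helper, kept separate so it can be checked without pandoc installed.)
func pandocArgs(from, path string) []string {
	return []string{"--from", from, "--to", "plain", path}
}

// convertToPlain runs pandoc as a child process and returns what it wrote to stdout.
// Assumption: a pandoc executable is available on the PATH.
func convertToPlain(ctx context.Context, from, path string) (string, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, "pandoc", pandocArgs(from, path)...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		// Include pandoc's own error message to make failures easier to diagnose.
		return "", fmt.Errorf("pandoc failed: %w: %s", err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: docx2txt FILE.docx")
		os.Exit(2)
	}
	// Arbitrary timeout so a stuck conversion does not block an indexing job forever.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	text, err := convertToPlain(ctx, "docx", os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

the cost of this approach is one process spawn per document, which is probably fine for batch indexing but worth measuring for anything latency-sensitive.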

You could use LibreOffice's command line interface to convert from .doc to a more manageable format.

  lowriter --convert-to odt some-document.doc

odt is not the only supported target, but doc --libreoffice--> odt --pandoc--> plain seems to give better results than e.g. doc --libreoffice--> txt or doc --libreoffice--> docx --pandoc--> plain.
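Since the question was about Go, here is a rough sketch of the doc --libreoffice--> odt --pandoc--> plain chain as two child processes. It assumes `lowriter` and `pandoc` are both on the PATH and that LibreOffice writes the converted file into the `--outdir` directory under the input's base name with an `.odt` extension; `odtPath` and `docToPlain` are names invented for this example.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// odtPath predicts where the converted file ends up: same base name as the
// input, .odt extension, inside outDir. (Assumed LibreOffice naming behaviour.)
func odtPath(outDir, docPath string) string {
	base := filepath.Base(docPath)
	stem := strings.TrimSuffix(base, filepath.Ext(base))
	return filepath.Join(outDir, stem+".odt")
}

// docToPlain converts a .doc file to plain text in two steps:
// LibreOffice produces an intermediate .odt, then pandoc renders it as plain text.
func docToPlain(ctx context.Context, docPath string) (string, error) {
	tmp, err := os.MkdirTemp("", "doc2txt")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(tmp) // the intermediate .odt is discarded afterwards

	// Step 1: doc -> odt
	lo := exec.CommandContext(ctx, "lowriter",
		"--convert-to", "odt", "--outdir", tmp, docPath)
	if out, err := lo.CombinedOutput(); err != nil {
		return "", fmt.Errorf("lowriter: %w: %s", err, out)
	}

	// Step 2: odt -> plain text on stdout
	pd := exec.CommandContext(ctx, "pandoc",
		"--from", "odt", "--to", "plain", odtPath(tmp, docPath))
	text, err := pd.Output()
	if err != nil {
		return "", fmt.Errorf("pandoc: %w", err)
	}
	return string(text), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: doc2txt FILE.doc")
		os.Exit(2)
	}
	text, err := docToPlain(context.Background(), os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

Starting LibreOffice for every file is slow, so for a large batch it may be better to convert many files in one `lowriter` call and run pandoc over the results afterwards.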

if that's the case, i'll stick with catdoc. my use case is building a full-text search index of the content, so rather than swap catdoc for the libreoffice cli i'd just keep catdoc, but thanks.

1. yes, only docx is supported. 2. for Go pandoc filters, this seems to work: https://github.com/oltolm/go-pandocfilters

thanks, will check this out

