
In case this helps, here is a mapping from Unicode ligatures to ASCII for all the ligatures I know of (the ones supported by LaTeX fonts): https://github.com/ivanistheone/arXivLDA/blob/master/preproc...

This assumes you are cleaning up the output of `pdftotext`, which in my experience is the best command-line tool for extracting plain text from PDFs.
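
For anyone who wants the idea without following the link, here is a minimal sketch in Python. It is not the table from the repository above: the `LIGATURES` dict and the `replace_ligatures` helper are hypothetical names, and the mapping only covers the seven Latin ligatures in the Unicode Alphabetic Presentation Forms block (U+FB00 to U+FB06), so treat it as a starting point.

```python
# Hypothetical mapping for illustration; not copied from the linked repository.
# Covers only the Latin ligatures U+FB00..U+FB06.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature, flattened to plain "st"
    "\ufb06": "st",
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text: str) -> str:
    """Return text with each known ligature expanded to ASCII letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    # Example input as it might appear in extracted PDF text.
    print(replace_ligatures("an e\ufb03cient work\ufb02ow"))
```

An explicit table only touches the characters listed, which seemed safer to me than a blanket Unicode normalization that would also rewrite other compatibility characters in the text.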



