Thanks for the suggestions, I do appreciate them. I was pretty brief in my post, but I really have spent a lot of time on this and tried it from a number of angles. I've had good luck with non-LLM tools for the initial OCR, but they're not context-aware, especially about column/page breaks (like I mentioned, it's kind of a dirty scan, and if a break happens in a Shipibo part the OCR barfs a bit. Good for a rough search at least).
I would love to create a JSON version of it that would essentially have a bunch of fields for each word (Shipibo/Spanish/English word/definition/example, type of word, etc). It's further complicated by how words can be modified in Shipibo (it's actually a very technical language: words can have any number of prefixes and suffixes tacked on to change their meaning and their precision. In their "icaros", the healing songs they sing in ceremony, the most technical use of the language is considered to be the most beautiful. Essentially poetry from their "medical" jargon).
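To make that concrete, here's a rough sketch of what one entry could look like. This is only a guess at a shape, not a settled schema: the class name `Entry`, every field name, the `needs_review` flag and the `to_json` helper are all made up for illustration, and the sample values are placeholders rather than real dictionary data.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Entry:
    """One dictionary entry. All field names here are hypothetical."""
    shipibo: str                       # headword as printed in the scan
    spanish: str = ""                  # Spanish gloss
    english: str = ""                  # English gloss
    definition: str = ""
    example: str = ""
    word_type: str = ""                # e.g. noun, verb
    prefixes: List[str] = field(default_factory=list)  # affixes, if split out
    suffixes: List[str] = field(default_factory=list)
    needs_review: bool = True          # stays True until a human checks the OCR


def to_json(entries: List[Entry]) -> str:
    """Serialize entries to a JSON array, keeping non-ASCII characters as-is."""
    return json.dumps([asdict(e) for e in entries], ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Placeholder values only; nothing here comes from the actual dictionary.
    sample = Entry(shipibo="<headword>", spanish="<glosa>",
                   english="<gloss>", word_type="noun")
    print(to_json([sample]))
```

Keeping prefixes and suffixes as separate lists is just one way to handle the affix problem; storing a single segmented string per word might turn out to fit the source better.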
I've done some human-in-the-loop attempts but still come up short in one way or another (I end up getting frustrated and throwing my hands up after seeing how much time I've dumped into it). So I figure this will remain a good test as the tools (and my prompting abilities) get better. It's definitely not urgent for me.