I'm building it out to be a fully open source and community driven project -> https://github.com/australia/mobtranslate-server
I've also only started work on my own tribe's language (Kuku Yalanji)
We have a dictionary in PDF form -> http://www.ausil.org.au/sites/ausil/files/WP-B-7%20Kuku-Yal%...
And I'm currently manually transcribing it all into YAML -> https://github.com/australia/mobtranslate-server/blob/master...
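For the curious, each entry ends up looking roughly like this (field names are still in flux, so treat this as illustrative rather than the final schema):

```yaml
# Illustrative entry structure -- field names are not final
example-word:
  type: noun                 # part of speech, kept for later grammar work
  translations:
    - first english gloss
    - second english gloss
  examples:
    - source: an example sentence in the language
      english: its English translation
```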
I'm trying to keep the project simple but with enough consideration to keep it robust.
I'd love it if anyone is interested enough to lend a hand. I'm always looking out for other indigenous developers to network with. (I believe I've only ever met one other.)
Also, for anyone who works in this space: any resource recommendations or tips for writing a translation engine? (Currently it will just do likely word replacements, but eventually I'll make the engine try to understand grammar.)
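The first pass will literally just be a lookup table, something like this sketch (the glosses here are placeholders, not real Kuku Yalanji):

```python
import re

# Toy bilingual dictionary (placeholder glosses, not real Kuku Yalanji).
DICTIONARY = {
    "hello": "HELLO_GLOSS",
    "house": "HOUSE_GLOSS",
}

def translate(sentence: str) -> str:
    """Naive word-for-word replacement; unknown words pass through unchanged."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return " ".join(DICTIONARY.get(w, w) for w in words)

print(translate("Hello, my house!"))  # -> "HELLO_GLOSS my HOUSE_GLOSS"
```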
"mob" is a common way Aboriginal's refer to each others tribes. "Which mob are you from?"
Edit: I've only added 20% of the dictionary so far, so I'll finish over the next couple of days.
The file contains a copyright notice by the Summer Institute of Linguistics. That might cause legal trouble, although if the Australian government can host the PDF without repercussions, you'll probably fly under the radar.
> I'm currently manually transcribing it all into YAML
The scan looks relatively clean, so I think it should be possible to speed this up massively using OCR.
> Also, for anyone who works in this space, any resource recommendations or tips for writing a translation engine.
I don't work in this space, but I know about Apertium https://www.apertium.org/ , a framework for writing rule-based translation engines. As for machine-learning approaches, some are simple enough to be used as tutorials for ML frameworks, e.g. https://pytorch.org/tutorials/intermediate/seq2seq_translati... Note that this tutorial uses data from Tatoeba https://tatoeba.org/eng/ , which currently only has a single sentence in an Australian Aboriginal language, Noongar: https://tatoeba.org/eng/sentences/show_all_in/nys/und For languages that are not closely related, you'll need much more training data. (A friend of mine tried to make it work for English-Korean with a few thousand sentences, but failed.)
I will look into automating with OCR but it is also kind of cool to re-type it all to get a feel for the language myself.
Apertium looks awesome, and I'm going through all the docs now. I think it will make sense for me to structure things like Apertium as it looks like a pretty active and relevant project.
And yeah, the odds of getting some sweet ML system working seem quite low; the languages are almost dead anyway, so it will be nigh impossible to get a data set.
I have a family member on the language advisory group, so I will double check the copyright.
I ran the PDF through OCRmyPDF https://ocrmypdf.readthedocs.io/en/latest/ with the language set to English and it actually did pretty well even at recognizing Kuku Yalanji words. Getting it into a structured format will require some custom processing to parse the layout, though.
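For instance, if entries OCR out one per line in a "headword (pos) gloss" shape, a regex pass could be a starting point; this is a sketch under that assumption, and the real layout will almost certainly need a more forgiving parser:

```python
import re

# Assumes OCR gives lines like "headword (pos) gloss one; gloss two".
ENTRY_RE = re.compile(r"^([\w-]+)\s+\(([a-z.]+)\)\s+(.+)$")

def parse_entry(line: str):
    """Parse one dictionary line into a dict, or return None if it doesn't match."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    headword, pos, gloss = m.groups()
    return {
        "word": headword,
        "type": pos,
        "translations": [g.strip() for g in gloss.split(";")],
    }

print(parse_entry("wordform (n.) first gloss; second gloss"))
```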
> the languages are almost dead anyway so will be nigh impossible to get a data set.
One way you might be able to slightly boost usage is to create a mobile keyboard based on the dictionary, making it more convenient to type in the language. Keyman https://keyman.com/ is an open-source tool for creating keyboard layouts, coincidentally created by the Summer Institute of Linguistics.
I will try the OCR too and make it work somehow.
Don't. Or do, but in any event be really clear of your goals. Is your aim to get something done? Then do it the most efficient way (not by hand). Do you want to get a feel for the language? Then hand-type it all, but at the cost of slowing you down greatly.
Decide up front what your goals are else you will mix your goals and try to ride two horses, as the saying goes, and make achieving either much, much harder.
I've made this mistake many times, so take it as wisdom derived from really painful experience. Focus!
(This may come across as lecturing and presumptuous, but I've screwed up so much through a lack of clear goals and focus.)
As someone who often took the transcribe-by-hand route on my language projects to sneak in some incidental language practice, you also have to budget for all the typos and inexplicable mistakes you will make.
You're going to have to go line by line in the source either way to proofread, so I recommend machine transcription over ape transcription.
Please reconsider this technical choice. YAML is not only quite hard to edit manually, it's also difficult to transform automatically. I work in the dictionary space, and the field mostly uses XML. While there is no standardized set of tags (I know of a PhD working on it), this is the format people are comfortable with in the field, and it is very easy to transform into any other format, including YAML.
Moreover, I agree that the source material is clear enough, and most of it is English, making OCR an obvious go-to in this situation.
I chose YAML not because I think it's the ideal end target for the datasets. That would likely be best represented in a database with a custom CMS and output in a more robust format such as XML.
Though considering that the current archives for the indigenous languages are primitive in their documentation, I think YAML will get the project started.
It is just ever so slightly easier to get contributors started.
As the project matures (and before it gets too complex), I will convert all the YAML datasets to something more appropriate and build the tooling to handle it in tandem.
Again, you are correct, and I will add my reasoning to the current documentation.
I just realized I haven't added a LICENSE to the repos yet either.
Does anybody know what would be the most open license for software plus language data?
All the successful translation websites/apps you're familiar with use machine learning. ML stomped all over rule-based NLP approaches because it gives a rough translation between so many more languages for so much less work.
On the question of where the data comes from, you might be a bit closer than you think. That dictionary you're transcribing has some sentence pairs, and like yorwba said, sentence pairs are food for ML language models. Extracting all the sentence pairs into a dataset might raise some interest from ML people.
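As a sketch of what that extraction could look like, assuming each dictionary entry carries an `examples` list of source/english pairs (field names invented for illustration):

```python
# Flatten per-entry example sentences into (source, english) pairs,
# the parallel-corpus shape most MT tooling expects.
# The entry structure below is hypothetical.
entries = [
    {"word": "w1", "examples": [{"source": "s1", "english": "e1"}]},
    {"word": "w2", "examples": []},
]

def sentence_pairs(entries):
    for entry in entries:
        for ex in entry.get("examples", []):
            yield ex["source"], ex["english"]

print(list(sentence_pairs(entries)))  # -> [('s1', 'e1')]
```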
I continually see consumer-facing ML approaches (FB, Google) give terrible Vietnamese translations because they assume all of the context needed for a translation is available in the text. In general this is not the case. In Vietnamese this is hugely obvious because their pronoun system is largely based on 3rd person relationships ("sister walks down the street", "boyfriend loves girlfriend"), which is impossible to map to/from English 2nd person ("you walk down the street", "I love you") without basically a full conscious intelligence. Even FB, which is in a unique circumstance of actually having a lot of the requisite relationship data between people available to it, does a terrible job at this.
My (tiny) understanding of the incredibly rich kinship systems in indigenous Australian cultures suggests that this would be a huge issue there as well, assuming these complexities are also present in their languages. (...OP? :) )
I did just finish my first ML CV project recently, so I'm kind of in the head space.
I know grant applications can be a real PITA, but on the face of it, it sounds like you could be eligible (just in case you weren't already aware).
I personally prefer to stay away from funding and will do this project in my spare time.
Though, once there is a big enough team, and they vote that the project deserves funding, I'd be happy to investigate it.
Have you looked into affiliations with unis and education departments?
Current plan will be to finish one whole dictionary.
Then find the second most popular dictionary.
Once two are fully integrated, I will do a bit of promotion to the places you mentioned.
And hopefully just network everyone who is interested in thoroughly digitising the tribal dialects.
I was thinking of putting together a group chat somewhere if you are interested in joining.
please msg me
Email me at email@example.com
1) Complete copying the dictionary in
2) Add more variations of words, e.g. hi, hello, g'day
3) Add a probability matcher that scores % likelihood of a match, e.g. house != hosue (0.75)
4) Use the saved grammar types to try to infer whether a word is a verb before a noun, etc.
5) Use a real ML/NLP system to really try understand the language
Rinse and repeat those steps for optimisation. A few developers have already reached out to contribute, so hopefully we'll get this up to scratch ASAP.
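For 3), Python's stdlib can already do most of the heavy lifting via difflib; something like this sketch (with a placeholder vocabulary) is probably where I'll start:

```python
from difflib import SequenceMatcher, get_close_matches

# Placeholder vocabulary; the real list would come from the dictionary.
VOCAB = ["house", "mouse", "hello"]

def best_match(word, vocab=VOCAB, cutoff=0.6):
    """Return (closest_word, similarity) or (None, 0.0) if nothing is close."""
    matches = get_close_matches(word, vocab, n=1, cutoff=cutoff)
    if not matches:
        return None, 0.0
    return matches[0], SequenceMatcher(None, word, matches[0]).ratio()

word, score = best_match("hosue")
print(word, round(score, 2))
```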
Edit: I see the title of this post is changed now