The file contains a copyright notice by the Summer Institute for Linguistics. That might cause legal trouble, although if the Australian government can host the PDF without repercussions, you'll probably fly under the radar.
> I'm currently manually transcribing it all into YAML
The scan looks relatively clean, so I think it should be possible to speed this up massively using OCR.
> Also, for anyone who works in this space, any resource recommendations or tips for writing a translation engine.
I don't work in this space, but I know about Apertium https://www.apertium.org/ , a framework for writing rule-based translation engines. As for machine-learning approaches, some are simple enough to be used as tutorials for ML frameworks, e.g. https://pytorch.org/tutorials/intermediate/seq2seq_translati... Note that this tutorial uses data from Tatoeba https://tatoeba.org/eng/ , which currently only has a single sentence in an Australian Aboriginal language, Noongar: https://tatoeba.org/eng/sentences/show_all_in/nys/und For languages that are not closely related, you'll need much more training data. (A friend of mine tried to make it work for English-Korean with a few thousand sentences, but failed.)
I will look into automating with OCR but it is also kind of cool to re-type it all to get a feel for the language myself.
Apertium looks awesome, and I'm going through all the docs now. I think it will make sense for me to structure things like Apertium as it looks like a pretty active and relevant project.
And yeah, the odds of getting some sweet ML system working seem quite low, as the languages are almost dead anyway so will be nigh impossible to get a data set.
I have a family member on the language advisory group, so I will double-check the copyright.
> I will look into automating with OCR but it is also kind of cool to re-type it all to get a feel for the language myself.
I ran the PDF through OCRmyPDF https://ocrmypdf.readthedocs.io/en/latest/ with the language set to English and it actually did pretty well even at recognizing Kuku Yalanji words. Getting it into a structured format will require some custom processing to parse the layout, though.
> the languages are almost dead anyway so will be nigh impossible to get a data set.
One way you might be able to slightly boost usage is to create a mobile keyboard based on the dictionary, making it more convenient to type in the language. Keyman https://keyman.com/ is an open-source tool for creating keyboard layouts, coincidentally created by the Summer Institute for Linguistics.
> also kind of cool to re-type it all to get a feel for the language myself
Don't. Or do, but in any event be really clear about your goals. Is your aim to get something done? Then do it the most efficient way (not by hand). Do you want to get a feel for the language? Then hand-type it all, but at the cost of slowing yourself down greatly.
Decide up front what your goals are, or else you will mix them and try to ride two horses, as the saying goes, and make achieving either much, much harder.
I've made this mistake many times, so take it as wisdom derived from really painful experience. Focus!
(this may come across as lecturing and presumptuous but I've screwed so much up by lack of clear goals and focus).
As someone who often took the transcribe-by-hand route on my language projects to sneak in some incidental language practice, you also have to budget for all the typos and inexplicable mistakes you will make.
You're going to have to go line by line in the source either way to proofread, so I recommend machine transcription over ape transcription.
Just to mention, the Apertium community is very collaborative and always interested in all sorts of minority, endangered and extinct languages, and in language revival. You should join their IRC channel #apertium and show them what you have already. I'm sure they'd be interested, and they may even have some ideas about things that could help.
Please reconsider this technical choice. YAML is not only quite hard to edit manually, it's also difficult to transform automatically. I'm working in the dictionary space and the field is mostly using XML. While there is no standardized set of tags (I know of a PhD working on it), this is the format people are comfortable with in the field, and it is very easy to transform into any other format, including YAML.
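To make the "easy to transform" point concrete, here is a minimal sketch using only Python's standard library. The entry shape and the tag names (`dictionary`, `entry`, `headword`, `pos`, `definition`) are assumptions for illustration, not an established standard and not the project's actual schema, and the headwords are placeholders rather than real words.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry shape, e.g. what a YAML loader might hand back.
# Headwords and glosses are placeholders, not real dictionary data.
entries = [
    {"headword": "WORD_1", "pos": "noun", "definitions": ["first gloss", "second gloss"]},
    {"headword": "WORD_2", "pos": "verb", "definitions": ["another gloss"]},
]


def entries_to_xml(entries):
    """Build an XML tree from a list of entry dicts (tag names are assumed)."""
    root = ET.Element("dictionary")
    for entry in entries:
        node = ET.SubElement(root, "entry")
        ET.SubElement(node, "headword").text = entry["headword"]
        ET.SubElement(node, "pos").text = entry["pos"]
        for gloss in entry["definitions"]:
            ET.SubElement(node, "definition").text = gloss
    return root


def xml_to_entries(root):
    """Read the same structure back into plain dicts."""
    return [
        {
            "headword": node.findtext("headword"),
            "pos": node.findtext("pos"),
            "definitions": [d.text for d in node.findall("definition")],
        }
        for node in root.findall("entry")
    ]


xml_text = ET.tostring(entries_to_xml(entries), encoding="unicode")
round_tripped = xml_to_entries(ET.fromstring(xml_text))
```

Because the round trip gives back the same dicts, moving between the two formats later should be cheap in either direction, whichever one the project starts with.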
Moreover I agree that the source material is clear enough, and most of it is English, making OCR an obvious go-to in this situation.
I completely agree. I gave it a decent amount of thought and pulled the trigger, and went with YAML.
Not because I think YAML is the ideal end target for the datasets. Likely it would be best represented in a database with a custom CMS and outputted in a more robust format such as XML.
Though considering that the current archives for the indigenous languages are primitive in their documentation, I think YAML will get the project started.
It is just ever so slightly easier to get contributors started.
As the project matures (and before it gets too complex), I will convert all the YAML datasets to something more appropriate and build the tooling to handle it in tandem.
Again, you are correct, and I will add my reasoning to the current documentation.
To help ensure your work is preserved, ensure that your code is also archived on https://softwareheritage.org, and sites/documents on https://archive.org. I just submitted a save request on Software Heritage; not sure if it will work.
If you are trying to promote usage of the data you might want to put the dictionary under a more permissive license as GPL-3+ conflicts with a lot of other licenses. Something like Creative Commons Attribution-ShareAlike Licence (V3.0)[1] might work better.
> any resource recommendations or tips for writing a translation engine
All the successful translation websites/apps you're familiar with use machine learning. ML stomped all over NLP approaches because it gives a rough translation between so many more languages for so much less work.
On the question of where the data comes from, you might be a bit closer than you think. That dictionary you're transcribing has some sentence pairs, and like yorwba said, sentence pairs are food for ML language models. Extracting all the sentence pairs into a dataset might raise some interest from ML people.
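A rough sketch of what that extraction could look like, assuming each transcribed entry carries an optional `examples` list with `source` and `english` fields. Those field names are hypothetical (the real files may be laid out differently), and the sentences below are placeholders, not real data.

```python
import csv
import io

# Hypothetical entries; "examples", "source" and "english" are assumed field names.
entries = [
    {"headword": "WORD_1",
     "examples": [{"source": "SOURCE SENTENCE 1", "english": "English sentence 1."}]},
    {"headword": "WORD_2"},  # entry with no example sentences
    {"headword": "WORD_3",
     "examples": [{"source": "SOURCE SENTENCE 2", "english": "English sentence 2."},
                  {"source": "SOURCE SENTENCE 3", "english": ""}]},  # incomplete pair
]


def extract_sentence_pairs(entries):
    """Collect (source, english) pairs, skipping entries or pairs that are incomplete."""
    pairs = []
    for entry in entries:
        for example in entry.get("examples", []):
            source = example.get("source", "").strip()
            english = example.get("english", "").strip()
            if source and english:
                pairs.append((source, english))
    return pairs


def pairs_to_tsv(pairs):
    """Serialise pairs as tab-separated text, one pair per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(pairs)
    return buffer.getvalue()


pairs = extract_sentence_pairs(entries)
tsv_text = pairs_to_tsv(pairs)
```

Tab-separated sentence pairs are a common shape for parallel corpora, so a file like this would be easy for others to pick up.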
I think ML approaches are not well suited here - or at least there are still huge problems between non-germanic/Latin languages that have yet to be approached.
(My background: white Australian, not a linguist, know a little Vietnamese, know a very tiny bit about indigenous Australian languages based on chatting with friends who studied that area)
I continually see consumer-facing ML approaches (FB, Google) give terrible Vietnamese translations because they assume all of the context needed for a translation is available in the text. In general this is not the case. In Vietnamese this is hugely obvious because the pronoun system is largely based on third-person relationships ("sister walks down the street", "boyfriend loves girlfriend"), which are impossible to map to/from English first and second person ("you walk down the street", "I love you") without basically a full conscious intelligence. Even FB, which is in the unique position of actually having a lot of the requisite relationship data between people available to it, does a terrible job at this.
My (tiny) understanding of the incredibly rich kinship systems in indigenous Australian cultures suggests that this would be a huge issue there as well, assuming these complexities are also present in their languages. (...OP? :) )
If you are from FNQ, I had heard somewhere that there are grants available from the Qld government for language initiatives. Did a quick search and found one here [0].
I know grant applications can be a real pita, but from the face of it sounds like you could be eligible (just in case you were not already aware).
Great initiative that will have significant impact by making translations available to all, including journalists, governments, lawyers, courts and merchants. And of course it will give kids one more reason to contextualise aboriginality as first class citizenship.
Have you looked into affiliations with unis and education departments?
Having dabbled in NNs and in speech and language recognition, I feel rightly ashamed that Aboriginal languages have never crossed my mind (ever!). And that's not the worst part: I'm a native of Australia :(
Commendable effort, but do I understand correctly that you're just doing word-for-word replacements? If so, the end result will not be at all grammatical, since you're missing the case markings and inflections required in most Aboriginal languages.
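To illustrate the concern, here is a minimal sketch of a pure word-for-word replacer. The glossary values are angle-bracket placeholders standing in for dictionary headwords, not real words, and the function name is made up for this example.

```python
import re

# Placeholder glossary: the right-hand sides are stand-ins, not real target-language words.
glossary = {"dog": "<dog>", "sees": "<see>", "man": "<man>"}


def replace_words(sentence, glossary):
    """Swap each known English word for its glossary entry; leave everything else as-is."""
    def swap(match):
        word = match.group(0)
        return glossary.get(word.lower(), word)

    return re.sub(r"[A-Za-z']+", swap, sentence)


first = replace_words("The dog sees the man.", glossary)
second = replace_words("The man sees the dog.", glossary)
print(first)   # The <dog> <see> the <man>.
print(second)  # The <man> <see> the <dog>.
```

Both outputs use the same bare headword forms, and only English word order tells the two apart. If the target language marks who does what with case endings and verb inflections, a lookup table like this has nowhere to put that information, which is why a grammar-aware step would be needed on top.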
I'm building it out to be a fully open source and community driven project -> https://github.com/australia/mobtranslate-server
So far I've only started work on my own tribe's language (Kuku Yalanji).
We have a dictionary in PDF form -> http://www.ausil.org.au/sites/ausil/files/WP-B-7%20Kuku-Yal%...
And I'm currently manually transcribing it all into YAML -> https://github.com/australia/mobtranslate-server/blob/master...
I'm trying to keep the project simple but with enough consideration to keep it robust.
I'd love it if anyone is interested enough to lend a hand. I'm always looking out for other indigenous developers to network with. (I believe I've only ever met one other person.)
Also, for anyone who works in this space, I'd appreciate any resource recommendations or tips for writing a translation engine. (For now I will likely just do word replacements, but eventually I will make the engine try to understand grammar.)
"Mob" is a common way Aboriginal people refer to each other's tribes. "Which mob are you from?"
Edit: I've only added 20% of the dictionary so far, so I will finish over the next couple of days.