Show HN: Mob Translate – Translation for Australian Aboriginal Languages (mobtranslate.com)
85 points by thomasfromcdnjs 13 days ago | 33 comments

So the site is mostly a prototype at this point -> https://mobtranslate.com/

I'm building it out to be a fully open source and community driven project -> https://github.com/australia/mobtranslate-server

I've also only started work on my own tribe's language (Kuku Yalanji).

We have a dictionary in PDF form -> http://www.ausil.org.au/sites/ausil/files/WP-B-7%20Kuku-Yal%...

And I'm currently manually transcribing it all into YAML -> https://github.com/australia/mobtranslate-server/blob/master...
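For anyone curious what the transcription looks like, an entry might be shaped roughly like this (the field names below are hypothetical placeholders for illustration, not the repo's actual schema, and the words are dummies rather than real Kuku Yalanji data):

```yaml
# Hypothetical entry shape for illustration only; the real schema
# is defined in the mobtranslate-server repo.
words:
  - word: example-headword
    type: noun
    translations:
      - english gloss
    example: an example sentence from the dictionary
```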

I'm trying to keep the project simple but with enough consideration to keep it robust.

I'd love it if anyone is interested enough to lend a hand. I'm always on the lookout for other Indigenous developers to network with. (I believe I've only ever met one other person.)

Also, for anyone who works in this space: any resource recommendations or tips for writing a translation engine? (Currently it will just do likely word replacements, but I will eventually make the engine try to understand grammar.)
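As a baseline, that word-replacement approach can be sketched in a few lines of Python (the toy dictionary below is a placeholder, not real Kuku Yalanji data):

```python
def translate(sentence, dictionary):
    """Word-for-word replacement: swap each word for its dictionary
    entry, leaving unknown words untouched."""
    return " ".join(dictionary.get(w, w) for w in sentence.lower().split())

# Placeholder mappings only; real entries would come from the YAML dictionary.
toy_dict = {"hello": "t_hello", "friend": "t_friend"}
print(translate("Hello friend", toy_dict))  # t_hello t_friend
```

This is of course where the grammar problem starts: the output is just a word sequence, with none of the case markings or inflection discussed further down the thread.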

"mob" is a common way Aboriginal's refer to each others tribes. "Which mob are you from?"

Edit: I've only added 20% of the dictionary so far, so I will finish it over the next couple of days.

> We have a dictionary in PDF form

The file contains a copyright notice from the Summer Institute of Linguistics. That might cause legal trouble, although if the Australian government can host the PDF without repercussions, you'll probably fly under the radar.

> I'm currently manually transcribing it all into YAML

The scan looks relatively clean, so I think it should be possible to speed this up massively using OCR.

> Also, for anyone who works in this space, any resource recommendations or tips for writing a translation engine.

I don't work in this space, but I know about Apertium https://www.apertium.org/ , a framework for writing rule-based translation engines. As for machine-learning approaches, some are simple enough to be used as tutorials for ML frameworks, e.g. https://pytorch.org/tutorials/intermediate/seq2seq_translati... Note that this tutorial uses data from Tatoeba https://tatoeba.org/eng/ , which currently only has a single sentence in an Australian Aboriginal language, Noongar: https://tatoeba.org/eng/sentences/show_all_in/nys/und For languages that are not closely related, you'll need much more training data. (A friend of mine tried to make it work for English-Korean with a few thousand sentences, but failed.)

Ah, you are an absolute legend.

I will look into automating with OCR but it is also kind of cool to re-type it all to get a feel for the language myself.

Apertium looks awesome, and I'm going through all the docs now. I think it will make sense for me to structure things like Apertium as it looks like a pretty active and relevant project.

And yeah, the odds of getting some sweet ML system working seem quite low; the languages are almost dead anyway, so it will be nigh impossible to get a data set.

I have a family member on the language advisory group, so I will double-check the copyright.

> I will look into automating with OCR but it is also kind of cool to re-type it all to get a feel for the language myself.

I ran the PDF through OCRmyPDF https://ocrmypdf.readthedocs.io/en/latest/ with the language set to English and it actually did pretty well even at recognizing Kuku Yalanji words. Getting it into a structured format will require some custom processing to parse the layout, though.

> the languages are almost dead anyway so will be nigh impossible to get a data set.

One way you might be able to slightly boost usage is to create a mobile keyboard based on the dictionary, making it more convenient to type in the language. Keyman https://keyman.com/ is an open-source tool for creating keyboard layouts, coincidentally created by the Summer Institute of Linguistics.

It seems like you have done this before, aha.

I will try the OCR too and make it work somehow.

> also kind of cool to re-type it all to get a feel for the language myself

Don't. Or do, but in any event be really clear about your goals. Is your aim to get something done? Then do it the most efficient way (not by hand). Do you want to get a feel for the language? Then hand-type it all, but at the cost of slowing yourself down greatly.

Decide up front what your goals are, or else you will mix them and try to ride two horses, as the saying goes, and make achieving either much, much harder.

I've made this mistake many times, so take it as wisdom derived from really painful experience. Focus!

(This may come across as lecturing and presumptuous, but I've screwed so much up through lack of clear goals and focus.)

Good point.

As someone who often took the transcribe-by-hand route on my language projects to sneak in some incidental language practice, you also have to budget for all the typos and inexplicable mistakes you will make.

You're going to have to go line by line in the source either way to proofread, so I recommend machine transcription over ape transcription.

Just to mention, the Apertium community is very collaborative and always interested in all sorts of minority languages, endangered and extinct languages, and language revival. You should join their IRC channel #apertium and show them what you already have; I'm sure they'd be interested and may even have some ideas about things that could help.

It looks like an awesome org; I will definitely be reaching out after reading your endorsement.

> into YAML

Please reconsider this technical choice. YAML is not only quite hard to edit manually, it's also difficult to transform automatically. I'm working in the dictionary space, and the field mostly uses XML. While there is no standardized set of tags (I know of a PhD student working on it), this is the format people are comfortable with in the field, and it is very easy to transform into any other format, including YAML.

Moreover I agree that the source material is clear enough, and most of it is English, making OCR an obvious go-to in this situation.

I completely agree. I gave it a decent amount of thought, pulled the trigger, and went with YAML.

Not because I think YAML is the ideal end target for the datasets. They would likely be best represented in a database with a custom CMS and output in a more robust format such as XML.

Though considering that the current archives for the indigenous languages are primitive in their documentation, I think YAML will get the project started.

It is just ever so slightly easier to get contributors started.

As the project matures (and before it gets too complex), I will convert all the YAML datasets to something more appropriate and build the tooling to handle it in tandem.

Again, you are correct, and I will add my reasoning to the current documentation.

To help ensure your work is preserved, make sure your code is also archived on https://softwareheritage.org, and your sites/documents on https://archive.org. I just submitted a save request on Software Heritage; not sure if it will work.

Ah, that's great, thanks.

I just realized I haven't added a LICENSE to the repos yet either.

Does anybody know what would be the most open licensing for software plus language data?

Apertium uses GPL-3+. If you'd like to combine with Apertium data, it might be easiest to stick to that.

Thanks, that is a good start.

If you are trying to promote usage of the data, you might want to put the dictionary under a more permissive license, as GPL-3+ conflicts with a lot of other licenses. Something like the Creative Commons Attribution-ShareAlike Licence (v3.0) [1] might work better.

1. https://creativecommons.org/licenses/by-sa/3.0/

You guys are all amazing, and thank you in particular.

> any resource recommendations or tips for writing a translation engine

All the successful translation websites/apps you're familiar with use machine learning. ML stomped all over rule-based NLP approaches because it gives a rough translation between so many more languages for so much less work.

On the question of where the data comes from, you might be a bit closer than you think. That dictionary you're transcribing has some sentence pairs, and like yorwba said, sentence pairs are food for ML language models. Extracting all the sentence pairs into a dataset might raise some interest from ML people.
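A rough sketch of that extraction step, assuming each example in the transcription stores the source sentence and its English gloss on one line with a separator (the " ||| " marker and the sample lines here are made up for illustration):

```python
def extract_pairs(text, sep=" ||| "):
    """Pull (source, english) sentence pairs out of text where each
    example line is formatted as 'source ||| english gloss'."""
    pairs = []
    for line in text.splitlines():
        if sep in line:
            src, eng = line.split(sep, 1)
            pairs.append((src.strip(), eng.strip()))
    return pairs

sample = "headword entry without example\nsource sentence ||| english gloss"
print(extract_pairs(sample))  # [('source sentence', 'english gloss')]
```

Even a few hundred such pairs, published as a clean dataset, would be a useful starting point for anyone experimenting with the ML side.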

I think ML approaches are not well suited here, or at least there are still huge problems between non-Germanic/Latin languages that have yet to be solved. (My background: white Australian, not a linguist; I know a little Vietnamese and a very tiny bit about Indigenous Australian languages from chatting with friends who studied that area.)

I continually see consumer-facing ML approaches (FB, Google) give terrible Vietnamese translations because they assume all of the context needed for a translation is available in the text. In general this is not the case. In Vietnamese this is hugely obvious because the pronoun system is largely based on third-person relationships ("sister walks down the street", "boyfriend loves girlfriend"), which is impossible to map to/from English second person ("you walk down the street", "I love you") without basically a full conscious intelligence. Even FB, which is in the unique circumstance of actually having a lot of the requisite relationship data between people available to it, does a terrible job at this.

My (tiny) understanding of the incredibly rich kinship systems in indigenous Australian cultures suggests that this would be a huge issue there as well, assuming these complexities are also present in their languages. (...OP? :) )

Ahh this is a super useful comment. Thanks for the insight. I will think about how I can get more language pairs too.

I did just finish my first ML computer vision project recently, so I'm kind of in that headspace.

If you are from FNQ, I had heard somewhere that there are grants available from the Qld government for language initiatives. Did a quick search and found one here [0].

I know grant applications can be a real pain, but on the face of it, it sounds like you could be eligible (just in case you were not already aware).

[0] https://www.datsip.qld.gov.au/programs-initiatives/grants/in...

Good thinking. I imagine that it would be a grant-worthy project.

I personally prefer to stay away from funding and will do this project in my spare time.

Though, once there is a big enough team, and they vote that the project deserves funding, I'd be happy to investigate it.

Great initiative that will have significant impact by making translations available to all, including journalists, governments, lawyers, courts and merchants. And of course it will give kids one more reason to contextualise Aboriginality as first-class citizenship.

Have you looked into affiliations with unis and education departments?

Great ideas.

Current plan will be to finish one whole dictionary.

Then find the second most popular dictionary.

Once two are fully integrated, I will do a bit of promotion to the places you mentioned.

And hopefully just network everyone who is interested in thoroughly digitising the tribal dialects.

Having dabbled in NNs, speech, and language recognition, I feel rightly ashamed that Aboriginal languages have never crossed my mind (ever!). That's not the worst part: I'm a native of Australia :(

Hey mate, part of me posting this here was to find other Aboriginal programmers, as I've only ever met one other in my life.

I was thinking of putting together a group chat somewhere if you are interested in joining.

Absolutely, I would really appreciate access to a group chat. I can organize free hosting/servers if it helps.

please msg me

can't send you a message.

Email me at thomasalwyndavis@gmail.com

Commendable effort, but do I understand correctly that you're just doing word-for-word replacements? If so, the end result will not be at all grammatical, since you're missing the case markings and inflections required in most Aboriginal languages.

Yeah, the translation hasn't quite been massaged yet. The next steps are:

1) Complete copying the dictionary in

2) Add more variations of words, e.g. hi, hello, g'day

3) Add a probability matcher that scores the likelihood of matches, e.g. house != hosue (0.75)

4) Use the saved grammar types to try to infer whether a word is a verb before a noun, etc.

5) Use a real ML/NLP system to really try to understand the language

Rinse and repeat those steps for optimization. A few developers have already reached out to contribute, so hopefully we can get this up to scratch ASAP.
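For the probability-matcher step, Python's standard-library difflib is one low-effort way to score likely word matches (the cutoff value and tiny vocabulary here are illustrative, not project settings):

```python
import difflib

def best_match(word, vocabulary, cutoff=0.7):
    """Return the closest headword and its similarity ratio,
    or (None, 0.0) if nothing clears the cutoff."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    if not matches:
        return None, 0.0
    best = matches[0]
    return best, difflib.SequenceMatcher(None, word, best).ratio()

print(best_match("hosue", ["house", "mouse"]))  # ('house', 0.8)
```

A real implementation would want an edit-distance metric tuned to the orthography, but this gives the "house != hosue" behaviour with no dependencies.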

Good effort, but this is not Google Translate, just Translate.

Edit: I see the title of this post is changed now

Sorry I meant to put it in double quotes.

Great project!
