Hacker News
Show HN: Unbabel API – Human Corrected Machine Translation (unbabel.com)
143 points by vasco_ on Jan 30, 2014 | 65 comments



This kind of crowdsourcing-meets-ML approach is a really nice example of how we can leverage humans and machines to the best of their current abilities.

It would be great if the feedback from the human workers could get reintegrated into the translation models in an online fashion so that they get better over time. I realize they're probably outsourcing their machine translation, but that would be a terrific fully integrated pipeline.


Duolingo also has a cool approach, where they offer free language courses for students, and charge companies that want certain documents translated.

The students themselves translate those documents as part of the course, and once a consensus is reached (a given number of students came up with the same translation) the document is returned to the submitter, who pays for its translation.

http://www.duolingo.com/info

Almost like how recaptcha works :-)
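
For what it's worth, the consensus step described above could be sketched roughly like this. It is only an illustration, not Duolingo's actual method: the `normalize` and `consensus` names, the case/whitespace normalization and the `min_votes` threshold are all assumptions.

```python
from collections import Counter
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(text.lower().split())


def consensus(translations: List[str], min_votes: int) -> Optional[str]:
    """Return the most common (normalized) translation once at least
    min_votes students agree on it; otherwise return None."""
    counts = Counter(normalize(t) for t in translations)
    if not counts:
        return None
    best, votes = counts.most_common(1)[0]
    return best if votes >= min_votes else None
```

A real system would presumably need fuzzier matching than exact string equality, since two good translations rarely agree word for word.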


Both reCAPTCHA and Duolingo were started by the brilliant Luis von Ahn. I am a big fan of his work. He was the inspiration for my first startup, while at CMU.


I am Unbabel's CTO. Thanks for your comment. Our goal is exactly to use that feedback to improve our MT systems. We are currently outsourcing our MT, while training our own models (using Moses). Besides generating parallel text, the types of data we will be collecting (e.g. the chain of edits performed by each editor on a task) will allow new and interesting algorithms to update the translation models.


I remember reading your NLP papers back in my academic days. Great work. One typo: in the https://www.unbabel.com/pricing/ page "from Portuguese to Portuguese can take some time".


And Italian to Portuguese should be not available.


Why should it not be available?


There is currently a gray dot, which for the other languages is used to indicate that the source and destination languages are the same. For Italian to Portuguese this is not the case. It should have one of the other colored icons.


Ah, thanks, sorry, bug on our part, being corrected right now, thank you for pointing it out.


Do you offer support for time-coded transcriptions (i.e. video subtitles) as well?

We would love a good provider to enrich video with translations (for when we find a good one that offers machine transcription.)


Hello,

Not yet, but could you send us an example?


We haven't found a good provider yet to do this properly for our use case, but SpeakerText, Koemei and VoiceBase are examples of companies that offer these functionalities.

Unfortunately SpeakerText doesn't offer non-post-processed prices, Koemei integrated it into their own product and VoiceBase didn't offer post-processing on request, which we would need for integration into our product.

Which format will become mainstream probably depends on HTML5 adoption, which is detailed here http://www.3playmedia.com/how-it-works/how-to-guides/html5-v... Currently WebVTT seems to be in the lead.

Those formats don't accommodate timestamps per spoken word though, which would be possible with machine transcription and which I would pay a premium for.


Awesome. I'm learning about Moses at the moment (MT course at Edinburgh), see you worked with Ben Taskar also, RIP. Exciting idea here!


Maybe this is intentional, but from your homepage, I have no idea what languages you handle.


Good point, we are in the process of updating our homepage. Things have been moving so fast it is hard to keep up. In any case we currently offer translations in English, Portuguese, Spanish, Italian, French and Turkish. More languages to come soon.


I didn't know the product itself; I'm not commenting on the API.

Seems a really good idea, however I looked at:

http://news.unbabel.co/fr/fobo-est-presente-a-san-francisco-...

Vs.

http://techcrunch.com/2014/01/10/fobo/

and the translation, by _5_ people, is really poor: Just look at the first line :

>Maintenant, vous sauriez déjà que Craigslist n'est pas bon comme un lieu pour vendre votre produits.

This translation is unpleasant to read and has some mistakes. Google translate gave me a better translation.

I guess the problem comes from the fact that translators are not native in the target language. When you use the product, can you request native language translators ?


Also, you are correct, one of the things we are noticing is that requiring that the translator be native in the target language significantly improves quality. As we grow the community, we will optimize for that. Thanks for commenting.


As mmastrac said, if there's a way to add hints to the translator on how to translate small sentences (for a mobile app for instance) I'll definitely use your product!

For me, having a native translator seems a must have, I hope your community will grow enough for it to be possible!


Thank you for your comment. There is already a way to send comments to the translator: when you send the message you can define the tone and give specific instructions. We also have a beta version of the Android app and are almost ready to launch on iPhone. Also, we already have a lot of native speakers on the platform. For paid tasks, the quality is vastly superior to the news. I would invite you to try: top up and experiment with giving instructions to the translators. And please get in touch with us, I would love to chat.


The content you see on the news.unbabel.com website is translated by the entire user community on Unbabel. However, only about 15% of our users, the best ones, are given paid translation tasks. We select these 15% based on the feedback system in which users give each other feedback and continuously improve. This and other measures are gradually increasing the overall quality of the translations.
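
To make the "top 15% by peer feedback" idea concrete, here is a minimal sketch. It is not Unbabel's actual selection logic: the `top_editors` name, the use of a plain average of peer ratings, and the rounding-up rule are all assumptions for illustration.

```python
from statistics import mean
from typing import Dict, List


def top_editors(ratings: Dict[str, List[float]], percent: int = 15) -> List[str]:
    """Rank users by their average peer rating and keep the top `percent`.

    Users with no ratings yet are skipped. The cutoff is rounded up, so a
    non-empty community always yields at least one editor.
    """
    scored = {user: mean(r) for user, r in ratings.items() if r}
    keep = (len(scored) * percent + 99) // 100  # integer ceiling division
    # Sort by score descending, then by name so ties are broken deterministically.
    ranked = sorted(scored, key=lambda user: (-scored[user], user))
    return ranked[:keep]
```

A plain average is the simplest possible choice; anything real would probably also weight by how many ratings a user has received.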


There is a kernel of genius in this: I can't translate a word of Chinese, but I can usually do a good job of correcting a good machine translation. If the machine does the first part, I can finish the last part, and I never knew Chinese at all! So this would appear to dramatically reduce the requirement for the extremely rare skill - being highly expert in two languages - down to the very common skill of being expert in one language, and just passable in the other. I love it.


Thank you for your comment, that is exactly where we are trying to get to.


I use unbabel a lot, and the really fast translation has a lot of implications that aren't immediately obvious. You can communicate with international customers efficiently in their own language. You don't have to build translation time into the end of your release cycle. It's really sad when you have to punt a last minute feature because you don't have time for translation before the deadline.


Thanks for the comment, we love our customers and we hope we continue to bring value for you and your customers.


> Human Corrected Translations for 1 cent per word

"Per word" of the source language, or the target language? Sum of both? What about languages which have a different concept of "words" in written text (e.g. Chinese, Turkish, ...).

And by the way... "cent" of which currency? :)

Edit: I just saw that the list of supported languages does not contain languages with "exotic" types of word boundaries (yet).


We are still trying to figure that out, there is no clear answer regarding how to base the pricing. Perhaps it should be based on words that were actually corrected? So for the time being we are basing it only on source language words. For Chinese, for example, we will probably do it based on characters, but ideas are welcome.


I think only pricing the source language makes sense. That way the consumer knows the cost going in and there's no incentive for you to provide more verbose translations.
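
A quick sketch of what source-side pricing looks like in practice, under assumptions: the `price_in_cents` name, whitespace-based word splitting, and the per-character mode for languages like Chinese are all hypothetical, and the currency of the "cent" is not specified in this thread.

```python
def price_in_cents(source_text: str, cents_per_unit: int = 1,
                   by_character: bool = False) -> int:
    """Price a job from the source text only, so the cost is known up front.

    By default a unit is a whitespace-separated word. With by_character=True
    a unit is any non-whitespace character, for scripts without word spacing.
    """
    if by_character:
        units = sum(1 for ch in source_text if not ch.isspace())
    else:
        units = len(source_text.split())
    return units * cents_per_unit
```

Because only the source is counted, the price does not change if the translation comes out longer or shorter than the original.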


How do you guys deal with "context" around translations? Lots of shorter strings in web applications require some hand-holding for translators to let them know where the phrase will be used, as the original English text might translate to vastly different things depending on surrounding functionality/text.


That is a great point. We have some ideas on how to tackle that problem (like integrating with pontoon, from Mozilla, for example), but right now there would be some strings where that would be a problem. Surprisingly that has proven less of an issue than we had feared originally. In any case, it is a real problem for localization.


I have read some of your news translated in Italian by your service. That's a very good translation, though in some parts the automatic nature of the original translation is still detectable. E.g. The piece I read is the one about Adolf Hitler Platz, translated from TechCrunch. Commas are the dead giveaway: they're still used like in English. There's also a coordinate clause introduced by an em dash – we don't usually do that in Italian.

That's to say that the service is super interesting, but I guess the final user still has some manual editing to do if he/she wants to use this kind of translations in a professional environment. Given the price, it's still a great deal.

One last thing: the table in your home page shows that the translation from English to Italian is not available (Italy's listed only under the "from" column and not in the "to" row, if I'm reading that correctly).

Good luck with the product :)


Thank you, we are working hard to keep improving. Keep in mind that the news is done by the whole community, while paid tasks are only done by the best editors.


Just curious, how is dynamic text handled in the Unbabel API? Can you pass something like "Your score is %1" where %1 would be replaced by the first parameter? Or would you have to request "Your score is 1", "Your score is 2", "Your score is 3", etc separately?


Right now you can pass variables. We tell editors to ignore those variables. For instance, if your text is "Your score is %1" and you wanted to translate to Portuguese, the translation would be "A sua pontuação é %1".
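
On the client side, one could sanity-check that the variables survived translation with something like the sketch below. This is not part of the Unbabel API: the `placeholders_preserved` helper and the assumption that variables always look like `%1`, `%2`, ... are purely illustrative.

```python
import re
from collections import Counter

# Assumed placeholder syntax: a percent sign followed by one or more digits.
PLACEHOLDER = re.compile(r"%\d+")


def placeholders_preserved(source: str, translation: str) -> bool:
    """True if both strings contain the same placeholders the same number of times.

    Order is deliberately ignored, since word order differs between languages.
    """
    return Counter(PLACEHOLDER.findall(source)) == Counter(PLACEHOLDER.findall(translation))
```

With the example above, `placeholders_preserved("Your score is %1", "A sua pontuação é %1")` holds, while a translation that dropped or rewrote `%1` would fail the check.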


Pretty neat, will have to try it out for Kiva. We translate a giant volume of words every year (we've translated nearly 100 million words total).

As it stands, we have an awesome community of volunteer translators who take care of most of our needs, but sometimes in times of high demand we could use help getting through a large batch of loans to translate. That said, when we tested out some external services for leveraging machine translation and translation memories, what we found were a few problems that keep us from being able to leverage external solutions:

1) Our volunteers don't like "post-editing", meaning what you are doing here of a human manually fixing up a machine translation. Since you are paying your translators, I imagine they don't mind though.

2) It seems the majority of companies are focused on English -> Foreign Language, whereas the vast majority of our translation needs are Foreign Language -> English, and this proved decisive with most of the software being geared in such a way.

3) Our partners are often in remote areas with not always the highest level of education, and often are writing in a language that is their second language (say perhaps French in a Senegal where the person's native language is Wolof), so the grammar of the French is not going to be great to begin with. This throws off the machine translation and makes it nearly impossible to develop a translation memory that is segmented in the right way to actually produce usable translation suggestions.

4) We need to review the text for policy guidelines (say for instance a partner puts in directions to a business by accident in a region where our borrowers are anonymized for safety reasons). But if we send a translation out to a service like yours and then just have the English back, and then need to report an issue in it back to the partner, the reviewer who would just know English would not be able to communicate back to the partner the issue and identify it in the original language version.

Anyways, just some food for thought in what we've had trouble with in the space of trying to help us get our lenders connected to our borrowers by providing them accurate translations of the borrowers' stories.


Thank you for your comment, very insightful concerns and advice. We would love to chat with you about your experience with crowd translation, it would be really helpful. Would it be possible to get in touch?

One thing that we could be useful for is to actually help your reviewers to communicate with the partner, that is to help in translating the communication itself.

In any case, Kiva is an amazing organization, so obviously we would love to find out how we can help in any way.


Interesting idea. I am wondering though about "send message in your native language". How will that work exactly ? I tried the demo but do you actually support the native language script ? I got "* Unfortunately we currently do not support this language. Please try a diferent message." How can I type in Chinese for example ?

On a side note, just something that I find all the time. The unbabel blog does not link anywhere to the main site and I had to type unbabel.com manually to go there. Isn't this something that you guys care about in terms of traffic source ?


The demo is currently restricted to 4 languages: English, Portuguese, Spanish and French. We do have Chinese editors and can support Chinese characters in the live product, but are not offering it right now, as we don't have critical mass.

Thanks for the heads up about the blog. Corrected at the end of the post.


Was going to make a product similar to this exclusively for Japanese language -- interested to see how you guys do - I think the premise might be largely flawed though, you do not want more than one person translating a large document. Context is important in a lot of languages, and incoherent writing style (as translation is almost never an idempotent function, there is always more than one way to say something) often seems unprofessional.

(nvm, I think my idea is differentiated enough that I still might make it some day)


Hi, that is a good point. It is something we are working hard on. I think that starting with machine translation helps with maintaining coherence in style. In a way, it is as if the first translator (the machine) translated the whole document. The rest can be tackled with a lot of preprocessing and post-processing. That being said, our method is not suited for really long forms, such as 25-page documents or novels. Anything that requires creative translation would probably need a professional translator dedicated to the project. I would be interested in talking to you about your idea for Japanese. If there is anything I can do to help, let me know.


So, I'm not likely to pursue my own idea (though I registered a domain name a long time ago and have a MVP that's rough around the edges) -- but I don't want to be an idea-hoarder, I'd like to see where you guys go with this, so I want to make suggestions (if you don't mind, I'm really not trying to sound uppity at all)

- I don't think that starting with machine translation to maintain coherence in style is a good idea, while AI is still in its infancy. Things like sentiment detection, NLP, etc are still too immature in my opinion -- this is baked into the premise of the idea as a whole... We still NEED humans to write good translations - it seems unreasonable to start from the assumption that you will get high-quality output from the imperfect machine process that you are trying to improve (if that makes sense).

I think at best, you will START with bad style, at worst, people will essentially re-translate the chunks to make more sense anyway, and you're left with the hodge-podge.

- If your method WAS suited for medium/long-form, I would suggest adding another tier of worker-bee: the proof-reader. Allow worker bees to apply/become proof readers, and create multiple proofs for large documents. These workers would have qualified for longer-form proofing and possibly editing. A possible increase to the relative pay of the proof-readers (as they are even more closely linked to your revenue and customer satisfaction, and are doing more work to boot), and providing multiple or a combined proof to the customer (up-charge for this) would be a great addition to what you already offer. This will probably do wonders for quality control, and will remove the problem above (I think, to the extent humanly possible). This also gives the people who work with you chance for improvement, chance to build a personal brand, and a chance to take pride in their work (and maybe even build personal/business relationships/trust that benefit the company).

- Why not play in all the vertical space that you guys are in? Part of my version of this service dictates a flat rate for a certain length, and a CLEAR indication that that kind of service is for people with small blurbs to translate. Some companies only need to translate small blurbs (disconnected paragraphs, tag lines, etc), and could benefit immensely and constantly (if you make a brochure for your company, or even an earnings report, etc, you would need this service EVERY month/year, for example). I don't think you would have to make too many structural changes to accommodate such a group of potential customers.

- I have not operated a system like this at scale, so all my suggestions are largely baseless (keep that in mind please)

Oh and my idea was to rid the world of "Engrish", especially at the corporate level.


Thank you for your comments, really insightful ideas. A lot of what you say we are already seeing. For example in Turkish, where the quality of the MT output is not as good as, let's say, Spanish, we are already seeing our translators replacing entire chunks of text. Interestingly, in other language pairs, the output is close enough that there are usually minimal changes in the output.

We have been thinking about having the editor position, we are experimenting with the concept and how it fits with our current workflow.

Anyway, thank you for your ideas, really cool.


glad that some of it made sense, I really liked the site and obviously believe in the idea, interested to see where you guys will go with it!


This is an awesome approach to human crowdsourced translation. The project I was working on back in early 2012 (http://crowdlation.com) was going to use machine translation coupled with human editing and reviews in a very similar fashion. I'm glad that this idea has such positive feedback, and I wish you guys the best of luck! Feel free to reach out to me via the email in my profile if you guys would like to chat.


Thanks, I reached out through LinkedIn. I am looking forward to talking to you about your experience with Crowdlation.


I just saw a machine-translated text that was given to a translator I know for "post-editing". Basically you have to delete everything that the machine translated and do it all over again.

If you want to offer that for 1 cent per word you are going to get exactly what you are paying for. 90% of qualified translators are bad enough that I would never let them translate anything for me. I cannot imagine the remaining 10% will work for 1 ct/word.


Thank you for your comment. There is a lot of variation depending on the language. In Turkish, that tends to happen, a lot of times the translation needs to be redone, but in EN-SP it is surprisingly good. The crowd aspect of it tends to help with the quality problem. Still a lot of work to do though, but we are off to a promising start.


Since the title is 'Human Corrected .. Translation' I assumed the API includes a way for the end-user to provide feedback on the translation. According to the docs there is a way to provide some instructions, define topics, but all prior to the translation.

Also what if you want to translate multiple snippets of text, and keep them somehow consistent? (for example some .po files from a project, translated one entry at a time).


There is an endpoint to report a translation which we will release soon (it's at the end of the documentation), meanwhile you can report a translation directly to us.

We have an endpoint for bulk translations. We are working on a way to submit XLIFF and PO files directly. Part of the consistency is achieved by the first step of MT. We are working on keeping consistency between editors by propagating their changes.


Just a heads up for OP, looks like there's a mistake in your matrix of supported languages:

https://s3.amazonaws.com/unbabel-assets-production/img/chart...

Italian - Portuguese has a "n/a" style dot, but Portuguese - Portuguese translation says "Can take some time".


Thanks! :) I guess that one we could do pretty much instantaneously. Thank you for pointing it out.


Very interesting concept! I registered, and will keep an eye out for when/if you guys add languages that I speak. I wonder if it's an error on my end, but my profile says I'm in Arrifana, Portugal. I can only seem to change my country of birth, which for the record isn't Portugal, so I wonder where that is pulled from.


Very cool!

Let's say I wanted to Unbabel something from German to English. How long is 'Can take some time?'

Also, how long does it take to Unbabel something given 'Regular service' conditions?

Last question, how long before 'Unbabel' catches on as a verb?


Thanks Trey. Our goal is to get to 15 minutes of translation time. That would make it usable for email and customer service messages. Right now it depends on when our users are awake and on the language pair. The fastest time has been a few minutes; the average is around an hour.


That's fantastic! Great turnaround.


re the transformation into a verb, it works well and feels natural. I would say not long. Great name.


Agreed.


Have you guys looked at https://github.com/pksunkara/alpaca to generate API client libraries (SDK) instead of spending time on developing them?


I will look at it. Thanks for the info.


Great platform. I hope you add more languages soon. My English is not good enough, but I could provide you with 2 East European languages and translate them into German.

Also payment in BC or similar would be great.


Thank you. We are planning to add Bitcoin at some point; paying our translators across the world is sometimes hard.


This would be an interesting API to have on Mashape: http://mashape.com


Certainly, we started posting it there, haven't had a chance to finish it, but will certainly do so. Mashape is a great website.


Congrats vasco! Your pitches were very good


Good luck with your project.


I wish you guys the best :)



