Translation quality is not great. A sentence like "maak de verlichting een beetje meer warm hier" is absolutely not idiomatic Dutch, and "wat is de tijdt in de andere tijdzones" contains an easily caught spelling error. "laat me mijn wekkers zien" is quite ridiculous; it means: "show me my (alarm) clocks". "ontvang me updates van bram´s facebook van het weekend" is ungrammatical. Well, not great.
Jack here from Alexa. Thanks for the feedback. Quality control was nontrivial, to put it succinctly, but we certainly always want to be better. I'll have to check whether the issues you've noticed were detected in the judgment scores. For the first utterance you mentioned, all three raters gave a score of 1 for spelling_score, which means "There are 1-2 spelling errors." So that's good.
Though we re-collected some utterances with low scores, we didn't have the budget to get perfect scores for all utterances. As such, we decided to include all utterances along with the scores from the 3 raters, such that users can perform filtering as they'd like. Some may want to keep the noise intact to help with training.
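For anyone who wants to do that filtering, here's a minimal sketch. Note that the field names (`judgments`, `grammar_score`, `spelling_score`) and the assumption that spelling_score 2 means "no errors" are based on how the scores are described in this thread, not necessarily the dataset's exact schema:

```python
# Keep only utterances where all three raters gave a grammar score of at
# least `min_grammar` and a spelling score of at least `min_spelling`.
def keep(utterance, min_grammar=3, min_spelling=2):
    return all(
        j["grammar_score"] >= min_grammar and j["spelling_score"] >= min_spelling
        for j in utterance["judgments"]
    )

# Toy records mimicking the per-rater judgments discussed above.
examples = [
    {"utt": "which alarms do i have",
     "judgments": [{"grammar_score": 4, "spelling_score": 2},
                   {"grammar_score": 4, "spelling_score": 2},
                   {"grammar_score": 4, "spelling_score": 2}]},
    {"utt": "wat is de tijdt in de andere tijdzones",
     "judgments": [{"grammar_score": 4, "spelling_score": 1},
                   {"grammar_score": 3, "spelling_score": 1},
                   {"grammar_score": 4, "spelling_score": 1}]},
]

filtered = [e["utt"] for e in examples if keep(e)]
# The second utterance is dropped because a rater flagged spelling errors.
```

Raising or lowering the thresholds lets you trade off cleanliness against keeping some of the noise for training.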
"Laat me mijn wekkers zien" is indeed a bit odd, but it makes sense if you think of it in the context of a voice assistant that displays information on a screen.
Ah, that's unfortunate to hear. In this case I see grammar_score ranging from 2 ("Some errors: the meaning can be understood but it doesn't sound natural in your language") to 4 ("Perfect"). It looks like one of the raters was too generous with a rating of 4.
I think this feedback is helping me to realize that we should be more explicit about our philosophy of keeping some of the noise in the dataset and allowing people to filter based on judgments.
That said, people do say some strange and ungrammatical things to virtual assistants (not saying this example is representative per se), so it's nice to include some of the odd ones.
I guess what happened is a user of your system tried to be funny. In this utterance there are 3 short fully grammatical sentences. And a 4th one which is not very grammatical but fully understandable, commanding the device to clean the carpet :)
- Cleaning is good.
- Dust is so bad.
- Make your miracle.
- Cleanly my carpet. [1]
[1] original is an ungrammatical imperative
Some (more) human review would have been needed.
The Finnish samples are full of very weird utterances, too. Some read like something people might have written over 50 years ago, but nobody would speak like this.
Maybe they were reviewed on Mechanical Turk with modest requirements and payment? Well, you get what you pay for...
Absolutely. I thought the same thing. If virtual assistants [1] were widely used and they didn't fully understand all the natural language a user might want to use, users would adapt and start to use a restricted language.
[1] Personally I think it's a useless concept, and the examples in the dataset mostly confirmed my view. Come on, a time zone conversion. Do we want to get so stupid that we cannot calculate it ourselves anymore? Adjusting a light? Most of us sit too much, and standing up is good for your body. Such technology is not only useless, but actively bad. Not to mention the surveillance capitalism aspect, which is probably also there, or even prevailing.
I would probably ask the assistant “Toon mijn wekkers”. Never actually used one in Dutch though and any “command” sounds odd to me in Dutch even though it’s my native language
Luckily, this matters a lot less than one might think. A good AI architecture can tolerate a few % of noise in the data. For example, A LOT of people in the Oscar dataset spelled the German "Metzger" as "Metzker" but AIs trained on it still perform great and strongly prefer the correct spelling.
Hi everyone, Jack FitzGerald here from the Alexa team who created this new dataset, the corresponding code, the leaderboard, etc. I'd be happy to answer any questions you might have, and I'll jump in with the other comments already here. Thank you.
Jack, this project and competition remind me of the Netflix Prize so much! Can’t wait till we see what people can do with this dataset. Congrats on the launch!
The sentences in Hebrew are hilarious. Either fixated on Omer Adam[1] or very formal language that I don't think people would actually use. Or formal sentences asking about Omer Adam's greatest hits.
The paper [1] has an interesting discussion of European familiar/formal pronouns:
Many languages encode different levels of politeness through their use of pronouns. Many European languages distinguish between "familiar" and "formal" pronouns, with the "formal" pronouns often morphologically identical to a plural. In French, the second-person singular "tu" is used between friends, while the second-person plural "vous" is used when speaking to a group, or to an individual of higher social rank (such as an employee to a manager). These politeness systems are very heavily influenced by social context, and the MASSIVE dataset gives us a chance to see how people adapt their language when speaking to a virtual assistant instead of another human.
Historically those languages have included English too. "you" and "ye" being the formal/plural form (and indeed "you" can still be used to refer to either an individual or a group) with "thou" and "thee" being the now mostly (although not entirely) disused informal version.
> while the second-person plural “vous” is used when speaking to a group, or to an individual of higher social rank (such as an employee to a manager).
From my experience "vous" is not always about social rank, but often about "distance". If you go to a supermarket, you'll use "vous" with the employees and they'll use it with you. Most of the time when I use "vous", the other person will also use it; asymmetric "vous"/"tu" exchanges are rarer.
Are the sentences in the dataset supposed to be correct? (specifically, the `.utt` key)
I'm seeing some strange results in multiple languages that don't look correct. Even some English ones look incorrect (though I'm not a native English speaker, so maybe I'm wrong). Here are some examples:
- "i want to play that music one again"
- "what's that the album is current music from"
- "which alarms do i have"
Maybe I'm misunderstanding the purpose of the dataset...
Experience (both from work on either Alexa or a similar system in previous years, and as an actual user) tells me that yes, people do often say things that don't "look correct". From personal experience I know that something about the pressure of knowing that the device will detect the end of your utterance before you're ready, if you're silent for too long, can end up rushing you and scrambling your brain a little. Even if you ordinarily have near-perfect instincts for English grammar and sentence construction, you can end up saying some pretty weird things.
But thinking of them as user mistakes, and framing them as "correct" or "incorrect", is counterproductive here. It may be a useful distinction as a learner, when you're trying to improve your own command of the language. But when they're inputs to your system, you have to do your best to do what the user meant, if that's possible to infer. All three of your examples are actually clear, if nonstandard (play music again; tell me what album the currently playing music is from; list my alarms). So a system that takes these as input should handle them as valid requests, not classify them as "incorrect". It might be different for a system designed to teach 'proper' English, rather than a system designed to enact the user's will.
These seem like things people probably said to Alexa. There's probably a combination of speech recognition errors and genuinely ungrammatical speech. Spoken language is often ungrammatical.
It’s kinda wild to think about how incomplete written language is compared to spoken language. Punctuation does a lot of heavy lifting. If I write “I think well the I mean what album is this the song thats playing from” it takes some work to parse, but that’s what an ASR system has to do. A more human transcription might be “I think-- well the-- I mean, what album is this (the song that’s playing) from?”
On top of that, these are translations of text that isn't well suited for translation. Most likely a list of English texts was handed to translators without the intended use cases or extra context being communicated to them. Some of the texts are correct only in a literal sense, or acceptable only in translated literature, yet they still passed review.
Looks like a sentence merged from separate ones. In that context, you could derive multiple intentions. So while not grammatically correct, it could still be useful.
- "i want to play that music one again"
I want to play that music again.
I want to play that one again.
I want that music again.
Play that music again.
Play that one again.
They are supposed to mimic what an intelligent voice assistant might encounter, so not always grammatical, but both the original dataset and the localizations were crowdsourced. Despite efforts at quality control some errors might persist, but as mentioned by another commenter shortened or cutoff phrases or re-phrasings are common.
- unanimous grammar score: 4, spelling score: 2:
3046: "今どうしても中華料理が食べたいのでテイクアウトで注文させてください": "I really want to eat Chinese right now, is it okay if I order some delivered(asking the boss nicely)" , (4,2),(4,2),(4,2)
13367: "単語を定義するとき文で明確にします": "[this will be]clarified in sentence when defining the word", (4,2),(4,2),(4,2)
13258: "ビーエムダブリューのマイレージ": "Mileage [reward program] for BMW", (4,2),(4,2),(4,2)
13986: "トップモデルの車": "Top-style car" or "car of top [beauty] model", (4,2),(4,2),(4,2)
14556: "この会社の株価を更新してください": "Please update stock price for this company", (4,2),(4,2),(4,2)
14592: "スマホのサーキットは何なのか教えて": "Tell me what is the smartphone raceway", (4,2),(4,2),(4,2)
15414: "マネージャーが必要だ": "There will have to be a manager", (4,2),(4,2),(4,2)
- 2/3 grammar score: 4, spelling score: 2:
6235: "素晴らしい映画を見たのでコピーがリリースされたら予約してください": "Because I watched a magnificent movie, please reserve in case a knockoff is let go", (4,2),(4,2),(3,2)
14401: "あなたが私に探した最後のことを教えてください": "Tell me the last thing you have searched of me", (2,2),(4,2),(4,2)
14407: "感じることができるの": "Are you able to sense that", (4,2),(2,2),(4,2)
- lower scores than the above:
14744: "ブロイラーは何ですか、どのように使えばいいですか": "Broiler [meat] is what, how may I use it(for you/for community)", (4,2),(4,1),(3,2)
15456: "ユナイテッド航空にあなたが私の荷物を無くしたのに怒るとツイートをしたいけど": "Though I want to tweet to [Mr.]United Airlines that you become enraged despite you losing my luggage", (4,2),(4,2),(2,2)
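Grouping the per-rater (grammar, spelling) pairs above programmatically makes it easy to separate unanimous verdicts from split ones. A small sketch (the tuple format just mirrors my listing above, not the dataset's actual schema):

```python
# Each utterance id maps to three (grammar_score, spelling_score) pairs,
# one per rater, as in the listing above.
scores = {
    3046:  [(4, 2), (4, 2), (4, 2)],
    6235:  [(4, 2), (4, 2), (3, 2)],
    14401: [(2, 2), (4, 2), (4, 2)],
}

def unanimous(judgments):
    # True when all three raters gave identical (grammar, spelling) scores.
    return len(set(judgments)) == 1

unanimous_ids = sorted(i for i, j in scores.items() if unanimous(j))
```

Only 3046 is unanimous here; the other two have at least one dissenting rater, which is exactly where a second human review pass would be cheapest to target.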
I've since given this a bit more thought. Part B questions 3 and 4, from Fig. 3 in the paper, ask only whether the sentence sounds natural and is (technically) correct, without specifying the context: that the sentence is supposed to be an utterance spoken to a voice assistant. From my own comment above:
3046: "I really want to eat Chinese right now, is it okay if I order some delivered"
14556: "Please update stock price for this company"
15414: "There will have to be a manager"
These are all context errors. What is not acknowledged as widely as I'd like is that translations of short sentences like "I need to see the manager" are situational, beyond depending on tone and politeness. Consider the following:
- I have complaints as a customer, and *I need to see the manager*.
- I need to escalate this issue, and *I need to see the manager*.
- You asked what I am about to do, and *I need to see the manager*.
- "The manager" is a dashboard app, and *I need to see the manager*.
I won't list my translations because I've already dumped way too many CJK characters here, but each of the above four should come out differently from human translators (especially the last one).
Yeah, definitely, but the goal (transfer learning across languages) is super good, and this seems like a good dataset for that purpose (though I suspect it's entirely insufficient to learn an NLP model from scratch).
The paper provides baseline results on how some pre-trained open source models will perform, when fine-tuned on this dataset. https://arxiv.org/abs/2204.08582
There is also Apertium, a rule-based system which is very good for some closely-related pairs that have had a lot of work put into them (especially translation between Romance languages, e.g. Spanish→Catalan, and Norwegian Bokmål→Nynorsk), and the only OK translator for some lesser-resourced languages (e.g. Northern Saami→Norwegian Bokmål), but very underdeveloped for anything to/from English (it feels a bit pointless writing rules for English where there is so much available data; RBMT shines where there's not enough available data, ie. most of the languages of the world). It's packaged for most distros: https://packages.debian.org/search?suite=sid&arch=any&search...
Argos Translate fits with those requirements, but its translation quality is not great. I'd love to see something like DeepL (that has really good translation quality) running locally, but I think that's a pipe-dream for now.
Sorry, we didn't have enough space to fit it on the blog post. You can find the list of languages on page 4 of our paper: https://arxiv.org/pdf/2204.08582.pdf
Just today I was looking at how to use Alexa to help my son with his school Spanish. I'm not convinced it's perfect, but it looks like it could be part of the puzzle.