
Grounding Natural Language Instructions to Mobile UI Actions - theafh
https://ai.googleblog.com/2020/07/grounding-natural-language-instructions.html
======
cch_
> To train the grounding model, we synthetically generate 295K single-step
> commands to UI actions, covering 178K different UI objects across 25K mobile
> UI screens from a public android UI corpus.

Sounds like a decent size training set.

> A Transformer with area attention obtains 85.56% accuracy for predicting
> span sequences that completely match the ground truth. The phrase extractor
> and grounding model together obtain 89.21% partial and 70.59% complete
> accuracy for matching ground-truth action sequences on the more challenging
> task of mapping language instructions to executable actions end-to-end.

85.56%, 89.21%, and 70.59% don't seem impressive to me. I may be
oversimplifying, but why can't you just fine-tune a Transformer model to map
sentences ("Now tap the right-top side of the screen") to a fixed set of
commands ("Tap(MAX_WIDTH, 0)")?

I've used Transformers before for classification and other tasks, and they
are quite powerful when you have "enough" data; 295K / 178K / 25K seems OK to
me, and even if it isn't, why not synthesize more?
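
To be concrete about what I mean, here's a minimal sketch of that kind of
fine-tuning, using T5 via Hugging Face purely as an example; the command
strings and the (instruction, command) pairs below are made up, and a real
setup would train on the synthesized data at scale rather than a toy loop:

    from transformers import T5Tokenizer, T5ForConditionalGeneration
    import torch

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Hypothetical (instruction, command) pairs; in practice these would be
    # synthesized single-step commands like the ones the paper describes.
    pairs = [
        ("Now tap the right-top side of the screen", "Tap(MAX_WIDTH, 0)"),
        ("Scroll down to the bottom of the page", "Scroll(DOWN)"),
    ]

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    model.train()
    for instruction, command in pairs:
        inputs = tokenizer(instruction, return_tensors="pt")
        labels = tokenizer(command, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss  # standard seq2seq LM loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # After training, commands come out of plain generation:
    model.eval()
    out = model.generate(**tokenizer("Tap the search button", return_tensors="pt"))
    print(tokenizer.decode(out[0], skip_special_tokens=True))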

