Sounds like a decent size training set.
> A Transformer with area attention obtains 85.56% accuracy for predicting span sequences that completely match the ground truth. The phrase extractor and grounding model together obtain 89.21% partial and 70.59% complete accuracy for matching ground-truth action sequences on the more challenging task of mapping language instructions to executable actions end-to-end.
85.56%, 89.21%, and 70.59% don't seem impressive to me. I may be oversimplifying, but why can't you just fine-tune a Transformer model to map sentences ("Now tap the right-top side of the screen") to a fixed set of commands ("Tap(MAX_WIDTH, 0)")?
I've used Transformers before for classification and other tasks, and they're quite powerful when you have "enough" data; 295K / 178K / 25K seems OK to me, but even if it isn't, why not synthesize more?
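To be concrete about the synthesis idea: a minimal sketch of template-based generation of (instruction, command) pairs. Everything here is made up for illustration — the region phrases, templates, and the `Tap(...)` command strings are hypothetical placeholders, not the paper's actual action vocabulary:

```python
import random

# Hypothetical mapping from region phrasings to a fixed command set.
# MAX_WIDTH / MAX_HEIGHT are illustrative placeholders, not real constants.
REGIONS = {
    "right-top side": "Tap(MAX_WIDTH, 0)",
    "left-top side": "Tap(0, 0)",
    "bottom-right corner": "Tap(MAX_WIDTH, MAX_HEIGHT)",
}

# A few surface-form templates per command; real synthesis would use many more.
TEMPLATES = [
    "Now tap the {region} of the screen",
    "Please tap the {region}",
    "Tap on the {region} of the display",
]

def synthesize(n, seed=0):
    """Generate n (instruction, command) training pairs by sampling
    a region and a template, then filling the template in."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        region, command = rng.choice(sorted(REGIONS.items()))
        instruction = rng.choice(TEMPLATES).format(region=region)
        pairs.append((instruction, command))
    return pairs

for instruction, command in synthesize(5):
    print(instruction, "->", command)
```

With enough templates and paraphrase variation you could blow the training set up well past 295K examples, which is the point: the mapping itself looks learnable if data volume is the bottleneck.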