The article's claim is that this is a way to use advances in computer vision to succeed where RNNs have difficulties. (Kinda sceptical that it succeeds at that, but that's the claim.)
This is a relatively common trick in ML: mapping temporal data into image form so you can use the relatively efficient convolutional operations on it. A classic example is mapping audio to a spectrogram via the short-time Fourier transform (https://en.wikipedia.org/wiki/Short-time_Fourier_transform) and running 2D convolutions on that for processing.
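To make the trick concrete, here's a minimal numpy sketch (my own illustration, not from the article): a plain framed-FFT STFT that turns a 1D signal into a (time, frequency) magnitude image you could feed to a 2D conv net. For a pure tone, the energy lands in a single frequency bin.

```python
import numpy as np

def stft_magnitude(x, frame_len=256, hop=128):
    """Short-time Fourier transform magnitude: slice the 1D signal into
    overlapping frames, apply a Hann window, and take the FFT magnitude
    of each frame, yielding a (time, frequency) image."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies: frame_len//2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=1))

# one second of a 440 Hz tone at 8 kHz
fs = 8000
t = np.arange(fs) / fs
spec = stft_magnitude(np.sin(2 * np.pi * 440 * t))

# bin width is fs / frame_len = 31.25 Hz, so 440 Hz peaks near bin 14
peak_bin = spec.mean(axis=0).argmax()
```

The resulting `spec` array is just a single-channel image, which is why the 2D-convolution machinery transfers over directly.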
Architectural improvements in 1D convolutions (dilated, a.k.a. atrous, convolutions as in WaveNet) have possibly surpassed these 2D techniques in some areas though, I'm not sure.
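For reference, a dilated 1D convolution just inserts gaps between the kernel taps, so the receptive field grows without adding parameters. A minimal numpy sketch of the operation (my own illustration, not any particular library's API):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1D convolution with `dilation` - 1 skipped samples between kernel
    taps ('atrous' convolution). Each output sees a receptive field of
    (len(kernel) - 1) * dilation + 1 input samples."""
    k = len(kernel)
    span = (k - 1) * dilation + 1
    return np.array([sum(kernel[j] * x[i + j * dilation] for j in range(k))
                     for i in range(len(x) - span + 1)])

# with kernel [1, -1] and dilation 4, each output is x[i] - x[i + 4],
# so a linear ramp input gives a constant -4 everywhere
x = np.arange(16, dtype=float)
y = dilated_conv1d(x, kernel=[1.0, -1.0], dilation=4)
```

Stacking layers with dilations 1, 2, 4, 8, ... makes the receptive field grow exponentially with depth, which is how WaveNet-style models cover long audio contexts without pooling.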