[1] is a good start, though if you want to train from scratch on CPU you'll have to scale down: transformers need quite a bit of data before they learn to use their position embeddings. Try a single-layer RNN instead, on Shakespeare text [2] or a list of movie titles from IMDB [3]. You'll have to fill in some blanks, since things have evolved quite a bit since RNNs were the standard for language models, but you can find tutorials [4] and examples [5] to work from.
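To give a sense of scale, a "downscaled" setup along those lines can be sketched as a character-level single-layer RNN in plain NumPy, trained with truncated BPTT. This is a minimal sketch in the spirit of min-char-rnn, not anything from the linked tutorials; the toy corpus, layer sizes, and learning rate are all placeholder choices:

```python
import numpy as np

# Toy corpus standing in for Shakespeare / IMDB titles (placeholder data).
text = "to be or not to be that is the question " * 20
chars = sorted(set(text))
vocab = len(chars)
char_to_ix = {c: i for i, c in enumerate(chars)}
data = [char_to_ix[c] for c in text]

hidden, seq_len, lr = 32, 16, 0.1
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.01, (hidden, vocab))   # input -> hidden
Whh = rng.normal(0, 0.01, (hidden, hidden))  # hidden -> hidden
Why = rng.normal(0, 0.01, (vocab, hidden))   # hidden -> output
bh, by = np.zeros(hidden), np.zeros(vocab)
params = [Wxh, Whh, Why, bh, by]
mems = [np.zeros_like(p) for p in params]    # Adagrad accumulators

def loss_and_grads(inputs, targets, hprev):
    """One truncated-BPTT step over a short character sequence."""
    xs, hs, ps = {}, {-1: hprev}, {}
    loss = 0.0
    for t, (ci, ti) in enumerate(zip(inputs, targets)):
        xs[t] = np.zeros(vocab)
        xs[t][ci] = 1.0                       # one-hot input character
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)
        y = Why @ hs[t] + by
        e = np.exp(y - y.max())               # numerically stable softmax
        ps[t] = e / e.sum()
        loss -= np.log(ps[t][ti])             # cross-entropy on next char
    grads = [np.zeros_like(p) for p in params]
    dWxh, dWhh, dWhy, dbh, dby = grads
    dhnext = np.zeros(hidden)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy()
        dy[targets[t]] -= 1.0                 # grad of softmax + cross-entropy
        dWhy += np.outer(dy, hs[t]); dby += dy
        dh = Why.T @ dy + dhnext
        dz = (1.0 - hs[t] ** 2) * dh          # backprop through tanh
        dWxh += np.outer(dz, xs[t]); dWhh += np.outer(dz, hs[t - 1]); dbh += dz
        dhnext = Whh.T @ dz
    return loss, grads, hs[len(inputs) - 1]

h, p, losses = np.zeros(hidden), 0, []
for it in range(300):
    if p + seq_len + 1 > len(data):           # wrap around, reset hidden state
        p, h = 0, np.zeros(hidden)
    loss, grads, h = loss_and_grads(data[p:p + seq_len],
                                    data[p + 1:p + seq_len + 1], h)
    for param, grad, mem in zip(params, grads, mems):
        np.clip(grad, -5, 5, out=grad)        # crude exploding-gradient guard
        mem += grad * grad
        param -= lr * grad / np.sqrt(mem + 1e-8)  # Adagrad update
    losses.append(loss / seq_len)
    p += seq_len

print(f"per-char loss: {losses[0]:.2f} -> {losses[-1]:.2f}")
```

On a repetitive corpus like this the per-character loss starts near ln(vocab) and drops within a few hundred updates, all comfortably on CPU; that's roughly the scale at which from-scratch experiments stay pleasant.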