> a "universal" version of this -- if you ran it across books and e-mails and text messages from thousands of authors covering diverse backgrounds and contexts, then what would most reliably help everyone
No, this is simple statistics. List all the words in your corpus ordered by how frequently they're used, then use dedicated software like the one given here to build shorthand from the list and expand it as you type. Machine learning is massive overkill for this, like nuking a fly: it's a waste of resources (time, energy, compute, every metric I can think of), and you're going to have targeting issues at that scale. How exactly do you propose to ensure an LLM stays on target and doesn't duplicate answers, make up words, or skip words? Are you sure your LLM actually knows what the correct distribution is? And if you know it does because you checked, doesn't that mean you already have a word list you can use?
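To make the statistics point concrete, here's a minimal sketch of the frequency-list approach: count word frequencies, give the most common words the shortest unused prefixes as abbreviations, and expand on the way back. The function names and the prefix-assignment rule are my own illustration, not the tool mentioned above.

```python
from collections import Counter
import re

def build_shorthand(corpus: str, top_n: int = 100) -> dict:
    """Map the top_n most frequent words to the shortest unused prefix."""
    words = re.findall(r"[a-z']+", corpus.lower())
    shorthand, used = {}, set()
    for word, _count in Counter(words).most_common(top_n):
        # Try progressively longer prefixes until one is free.
        for k in range(1, len(word) + 1):
            abbr = word[:k]
            if abbr not in used:
                shorthand[word] = abbr
                used.add(abbr)
                break
    return shorthand

def expand(text: str, shorthand: dict) -> str:
    """Expand typed abbreviations back to full words; unknown tokens pass through."""
    reverse = {abbr: word for word, abbr in shorthand.items()}
    return " ".join(reverse.get(tok, tok) for tok in text.split())
```

No model, no training, and the output is deterministic: the expansion is exactly the word you assigned, never a hallucinated one.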
I was thinking the exact same thing. That sounds like a small ByT5 model. On each space you auto-complete the previous word based on a moving context window of the past 32 characters or so.
Isn't this approaching LLM territory?