Why is that an issue? Training the tokenizer seems much more straightforward than training the model, since it's based on the statistics of the input data. I guess it may take a while for massive datasets, but is calculating the frequencies really impossible to do at a larger scale?
I’ve trained tokenizers on medium-sized datasets (5GB+ of text, though that could be considered small or large depending on who you ask) and have always found training quite fast. As in, it takes a couple of minutes.
Maybe if we’re talking terabytes it might not scale as well, but so far, in my experience, training tokenizers has never been an issue. It’s training models that takes ages.
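For a sense of what that kind of run looks like, here's a minimal sketch using the Hugging Face `tokenizers` library, assuming a BPE model trained on a plain-text corpus (the `corpus.txt` path and vocab size are just placeholders, not from the thread):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Start from an empty BPE model and split on whitespace/punctuation
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# The trainer counts pair frequencies over the corpus and builds the merges
trainer = BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

tokenizer.save("tokenizer.json")
```

On a few gigabytes of text, a run like this typically finishes in minutes on a single machine, which matches the experience above.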