
I’ve trained tokenizers on medium-sized datasets (5+ GB of text, though that could be considered small or large depending on who you ask) and have always found training quite fast. As in, it takes a couple of minutes.

Maybe if we’re talking terabytes it might not scale as well, but so far in my experience training tokenizers has never been an issue. It’s training models that takes ages.
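For reference, a minimal sketch of what this kind of run looks like, assuming the Hugging Face `tokenizers` library (Rust-backed, which is a big part of why a few GB finishes in minutes); the file paths and vocab size here are placeholders, not anything from the comment above:

    # Sketch: train a BPE tokenizer on a few text files.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    # Byte-pair-encoding model with an explicit unknown token.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # Hypothetical settings: 32k merges and a couple of special tokens.
    trainer = BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]", "[PAD]"])

    # Placeholder corpus files; swap in your own ~5 GB of text.
    tokenizer.train(files=["corpus_part1.txt", "corpus_part2.txt"], trainer=trainer)

    tokenizer.save("tokenizer.json")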



