> We’re paying a cost for each split_whitespace call, which allocates intermediate slices.
This part seems a bit confused; I don't think `split_whitespace` does any allocations. I wish there were a few intermediary steps here, e.g. going from `&str` and `split_whitespace` to `&[u8]` and `split`.
The tokenizer at that point is a bit clunky, and it is not really comparable to `split_whitespace`. The new tokenizer doesn't actually have any whitespace handling; it just assumes that every token is followed by exactly one whitespace character. That alone might explain some of the speedup (see the sketch below).
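To make that assumption concrete, here is a minimal sketch (not the article's actual code): a bare `split(' ')` does no run handling, so it is cheaper per token but produces empty tokens as soon as two separators are adjacent.

```rust
fn main() {
    // With exactly one space between tokens, a bare split on ' ' is enough:
    let naive: Vec<&str> = "a b c".split(' ').collect();
    assert_eq!(naive, ["a", "b", "c"]);

    // But with a run of spaces the assumption breaks down and
    // empty tokens leak through:
    let broken: Vec<&str> = "a  b".split(' ').collect();
    assert_eq!(broken, ["a", "", "b"]);
}
```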
Likely the reason `split_whitespace` is so slow is
> ‘Whitespace’ is defined according to the terms of the Unicode Derived Core Property White_Space.
If they used `split_ascii_whitespace`, things would likely be faster.
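As a small illustration: both calls yield the same tokens for ASCII input, but `split_ascii_whitespace` only needs byte comparisons, while `split_whitespace` decodes `char`s and consults the Unicode `White_Space` property.

```rust
fn main() {
    let line = "12 345\t6789";

    // Splits on the Unicode White_Space property (decodes chars):
    let unicode: Vec<&str> = line.split_whitespace().collect();

    // Splits on ASCII whitespace only (plain byte comparisons):
    let ascii: Vec<&str> = line.split_ascii_whitespace().collect();

    assert_eq!(unicode, ascii);
    assert_eq!(ascii, ["12", "345", "6789"]);
}
```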
Switching parsing from `&str` to `&[u8]` can offer other benefits. In their case, they do `&str` comparisons and are switching those to `u8` comparisons. A lot of other parsers do `char` comparisons, which require decoding the `&str` into `char`s; that can be expensive and is usually unnecessary, because most grammars can be parsed as `&[u8]` just fine.
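A minimal sketch of that idea, assuming an ASCII grammar where tokens are separated by runs of spaces; `tokens` is a made-up helper here, not the article's code:

```rust
// Tokenize on bytes: single-byte comparisons, no UTF-8 decoding.
// Safe to do whenever the separators are ASCII.
fn tokens(line: &str) -> impl Iterator<Item = &[u8]> {
    line.as_bytes()
        .split(|&b| b == b' ')     // compare u8, never decode a char
        .filter(|t| !t.is_empty()) // drop empties left by runs of spaces
}

fn main() {
    let toks: Vec<&[u8]> = tokens("a   b").collect();
    assert_eq!(toks, [b"a" as &[u8], b"b"]);
}
```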
So, we're just calling `split(IsWhitespace).filter(IsNotEmpty)` and keeping the resulting iterator.
Rust's iterators are lazy: they only do work when asked for the next item, so their internal state is just whatever is needed to produce that item each time.
`IsWhitespace` and `IsNotEmpty` are both predicates that do exactly what you'd expect. They're provided as named types in the library because they might not get inlined, and if they don't, we might as well implement each of them exactly once. The whole thing boils down to roughly the sketch below.
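Here is that sketch, with closures standing in for the library's named predicate types:

```rust
// Roughly equivalent to the standard library's split_whitespace:
// a split on the whitespace predicate, filtered to drop empty slices.
// No allocation: each yielded &str is a pointer + length into `s`.
fn split_whitespace_like(s: &str) -> impl Iterator<Item = &str> {
    s.split(char::is_whitespace).filter(|t| !t.is_empty())
}

fn main() {
    let out: Vec<&str> = split_whitespace_like("  hello   world ").collect();
    assert_eq!(out, ["hello", "world"]);
}
```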
Can you help me understand what's happening between the split and the filter on "a <space> <space> <space> b"? I expect that to be a series of calls to the inner split iterator, the middle ones each yielding an empty slice. So the whole iterator yields a slice pointing at a, then a slice pointing at b, but it has had to handle two intermediate empty slices to get to the b. Right?
It creates a `Split<'_>` iterator using the `IsWhitespace` function as the pattern. As the user calls `.next()` on the outer `SplitWhitespace<'_>`, it calls `.next()` on the inner `Split<'_>`, which yields the slices `"a"`, `""`, `""`, and `"b"`; the filter then reduces them to `"a"` and `"b"`.
(But as mentioned, this doesn't perform any allocations, since each slice is just a pointer + length into the original string.)
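A quick demo of exactly that sequence:

```rust
fn main() {
    let s = "a   b"; // "a", three spaces, "b"

    // The inner split yields the empty slices between adjacent spaces...
    let raw: Vec<&str> = s.split(char::is_whitespace).collect();
    assert_eq!(raw, ["a", "", "", "b"]);

    // ...and split_whitespace's filter drops them.
    let filtered: Vec<&str> = s.split_whitespace().collect();
    assert_eq!(filtered, ["a", "b"]);
}
```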
You're right: `split_whitespace` returns an iterator that yields string slices (`&str`) which are just views into the original string, without allocation. The performance difference likely comes from avoiding the iterator indirection and bounds checks.