> We’re paying a cost for each split_whitespace call, which allocates intermediate slices.
This part seems a bit confused; I don't think `split_whitespace` does any allocations. I wish there were a few intermediary steps here, e.g. going from `&str` and `split_whitespace` to `&[u8]` and `split`.
The tokenizer at that point is a bit clunky, and it is not really comparable to `split_whitespace`. The new tokenizer doesn't actually have any whitespace handling; it just assumes that every token is followed by exactly one whitespace character. That alone might explain some of the speedup (see the sketch below).
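To make that assumption concrete, here is a minimal sketch (not the article's actual code): a bare `split(' ')` does no run handling, so it is cheaper per token but produces empty tokens as soon as two separators are adjacent.

```rust
fn main() {
    // With exactly one space between tokens, a bare split on ' ' is enough:
    let naive: Vec<&str> = "a b c".split(' ').collect();
    assert_eq!(naive, ["a", "b", "c"]);

    // But with a run of spaces the assumption breaks down and
    // empty tokens leak through:
    let broken: Vec<&str> = "a  b".split(' ').collect();
    assert_eq!(broken, ["a", "", "b"]);
}
```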
Likely the reason `split_whitespace` is so slow is
> ‘Whitespace’ is defined according to the terms of the Unicode Derived Core Property White_Space.
If they used `split_ascii_whitespace`, things would likely be faster.
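As a small illustration: both calls yield the same tokens for ASCII input, but `split_ascii_whitespace` only needs byte comparisons, while `split_whitespace` decodes `char`s and consults the Unicode `White_Space` property.

```rust
fn main() {
    let line = "12 345\t6789";

    // Splits on the Unicode White_Space property (decodes chars):
    let unicode: Vec<&str> = line.split_whitespace().collect();

    // Splits on ASCII whitespace only (plain byte comparisons):
    let ascii: Vec<&str> = line.split_ascii_whitespace().collect();

    assert_eq!(unicode, ascii);
    assert_eq!(ascii, ["12", "345", "6789"]);
}
```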
Switching parsing from `&str` to `&[u8]` can offer other benefits. In their case, they do `&str` comparisons and are switching those to `u8` comparisons. A lot of other parsers do `char` comparisons, which require decoding the `&str` into `char`s; that can be expensive and is usually unnecessary, because most grammars can be parsed as `&[u8]` just fine.
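A minimal sketch of that idea, assuming an ASCII grammar where tokens are separated by runs of spaces; `tokens` is a made-up helper here, not the article's code:

```rust
// Tokenize on bytes: single-byte comparisons, no UTF-8 decoding.
// Safe to do whenever the separators are ASCII.
fn tokens(line: &str) -> impl Iterator<Item = &[u8]> {
    line.as_bytes()
        .split(|&b| b == b' ')     // compare u8, never decode a char
        .filter(|t| !t.is_empty()) // drop empties left by runs of spaces
}

fn main() {
    let toks: Vec<&[u8]> = tokens("a   b").collect();
    assert_eq!(toks, [b"a" as &[u8], b"b"]);
}
```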
So, we're just calling `split(IsWhitespace).filter(IsNotEmpty)` and keeping the resulting iterator.
Rust's iterators are lazy: they only do work when asked for the next item, so their internal state is just whatever is needed to produce that item each time.
`IsWhitespace` and `IsNotEmpty` are both predicates that do exactly what you'd expect. They're provided as named types in the library because they might not get inlined, and if they don't, we might as well implement each of them exactly once. The whole thing boils down to roughly the sketch below.
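Here is that sketch, with closures standing in for the library's named predicate types:

```rust
// Roughly equivalent to the standard library's split_whitespace:
// a split on the whitespace predicate, filtered to drop empty slices.
// No allocation: each yielded &str is a pointer + length into `s`.
fn split_whitespace_like(s: &str) -> impl Iterator<Item = &str> {
    s.split(char::is_whitespace).filter(|t| !t.is_empty())
}

fn main() {
    let out: Vec<&str> = split_whitespace_like("  hello   world ").collect();
    assert_eq!(out, ["hello", "world"]);
}
```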
Can you help me understand what's happening between the split and the filter on "a <space> <space> <space> b"? I expect that to be a series of calls to the inner split iterator, the middle ones each yielding an empty slice. So the whole iterator yields a slice pointing at a, then a slice pointing at b, but it has had to handle two intermediate empty slices to get to the b. Right?
It creates a `Split<'_>` iterator using the `IsWhitespace` function as the pattern. As the user calls `.next()` on the outer `SplitWhitespace<'_>`, it calls `.next()` on the inner `Split<'_>`, which yields the slices `"a"`, `""`, `""`, and `"b"`; the filter then reduces them to `"a"` and `"b"`.
(But as mentioned, this doesn't perform any allocations, since each slice is just a pointer + length into the original string.)
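A quick demo of exactly that sequence:

```rust
fn main() {
    let s = "a   b"; // "a", three spaces, "b"

    // The inner split yields the empty slices between adjacent spaces...
    let raw: Vec<&str> = s.split(char::is_whitespace).collect();
    assert_eq!(raw, ["a", "", "", "b"]);

    // ...and split_whitespace's filter drops them.
    let filtered: Vec<&str> = s.split_whitespace().collect();
    assert_eq!(filtered, ["a", "b"]);
}
```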
You're right: `split_whitespace` returns an iterator that yields string slices (`&str`) which are just views into the original string, without allocation. The performance difference likely comes from avoiding the iterator indirection and bounds checks.