
Rapidstring: Maybe the fastest string library ever - etrevino
https://github.com/boyerjohn/rapidstring
======
johnboyer
Hey there, creator of the library here. Strangely enough my initial post about
this on HN received no attention, but better late than never!

If any of you have some questions or feedback, I would be delighted to hear
it.

~~~
sbjs
Hey man don't worry about that, that's just the way of life. People notice
things when they do, that part's not up to us. I've seen articles here posted
_years_ after they were written.

Very nice and thoroughly documented code! Well done! What gave you the initial
idea to write this maybe-fastest-ever string library? Did you just have an
idea one day of how it could be done and you just went for it? Or did you have
a performance issue and come up with this to speed something up at work, or
what?

~~~
johnboyer
I just really needed a string library written in C, and there didn't seem to
be many options. The first thing I found was the Simple Dynamic Strings
library, but it wasn't maintained and depended on GCC extensions, so I decided
to write my own. After getting basic concatenation working, I benchmarked it
against std::string to see how slow it was, and to my great surprise, it was
actually faster. From that point, I realized I could actually write some
pretty fast code, and I decided to center the library around that.

~~~
sbjs
I'm surprised to hear it was faster. I just assumed the C++ standard library
was optimized for speed, since that's what C++ is known for compared to
languages like JavaScript.

------
rurban
I like the stack trick.

But it's still just a membuf library, without any real string support. No
encoding, no Unicode, no upper/lower/fc/norm support, all of which are
important for comparing or finding strings. And coreutils (e.g. grep) still
have no Unicode support. It's 2018, not the seventies anymore. Unicode strings
need to be normalized before they can be found.

~~~
mamcx
I think it would be better to title this "maybe the fastest ASCII string
library".

~~~
johnboyer
Unicode is supported via UTF-8; only different character types aren't
supported, because generics are messy in C. I thought of accepting void* to
support wchar_t and others, but it came with some performance penalties, so I
decided against it.

~~~
lifthrasiir
No, having different character types (I believe you are referring to C11's
`char16_t` and `char32_t`?) is not a requirement for Unicode support. At the
very least you need to have a single function or two that...

* Receives a string expected to be encoded in UTF-8, and an offset to it expected to be a UTF-8 sequence boundary.

* Scans forward or backward for the next or previous UTF-8 sequence boundary.

* Optionally returns the code point for the scanned UTF-8 sequence.

* Has proper error handling for every imaginable case: out of boundary, not a boundary, not a valid UTF-8 sequence. (The OOB case needs to be handled because it will be the end condition of the iteration, and should preferably be distinct from the other error conditions.)

Every other functionality can build upon this little function, in particular
the iteration and UTF-8 validation will be trivial. The full Unicode support
including case mapping, folding, normalization and property lookup will of
course require a not-so-small table but is not strictly necessary anyway.

Björn Höhrmann's Flexible and Economical UTF-8 Decoder [1] will be handy for a
concise implementation.

[1]
[https://bjoern.hoehrmann.de/utf-8/decoder/dfa/](https://bjoern.hoehrmann.de/utf-8/decoder/dfa/)

~~~
flohofwoe
I don't see any functions in the OP's library that would require dedicated
UTF-8 handling. The string length is given in bytes, not characters or
codepoints. There's no functionality to give you the character at the n-th
location, etc. You can easily implement all Unicode-specific functionality in
a separate library and use it together with the OP's library. IMHO that's even
preferable.

~~~
rurban
Yes, but don't call it a string library then. Strings should handle strings,
and strings are Unicode now. Unicode needs normalization and case-insensitive
support.

And it's not easy. I implemented the third of its kind. First there was ICU,
which is overly bloated. You don't need 30MB for a simple string libc. Then
there is libunistring which has overly slow iterators, so not usable for
coreutils. And then there's my safelibc, which is small and fast, but only for
wide-chars, not utf-8.

I fixed and updated the musl case-mapping, making it 2x faster, but this isn't
merged yet. And there's not even a properly spec'ed wcscmp/wcsicmp to find
strings. glibc is an overall mess; I won't touch that. wcsicmp/wcsfc/wcsnorm
aren't even in POSIX.

~~~
zlynx
Why try to redefine the word "string"?

In computer jargon I believe CISC and the PDP-11 have seniority. That's why
all multi-word functions like memcpy are in C's string.h header.

~~~
lifthrasiir
Hey, even C contains a locale-dependent string comparison, namely `strcoll`
(since 1990!).

I admit the two words "string" and "text" are now interchangeable. But that
doesn't give strings fewer requirements; people are just expecting more out of
strings.

------
sid-kap
The author's last name lends some credibility to the idea that they might be
good with strings.

~~~
meowface
True, but sadly no relation in this case, it appears.

~~~
johnboyer
As it turns out, I actually have the Boyer-Moore string searching algorithm
listed as one of the planned features in the Projects section on GitHub.

~~~
MattPalmer1086
Boyer Moore is OK, but there are better algorithms available. The simpler
Horspool variant is usually a fair bit faster and also easier to implement.
For short strings, e.g. less than 12 characters, the ShiftOr algorithm is hard
to beat. I've spent quite a bit of time writing and profiling different search
algorithms as part of the byteseek project on GitHub. Let me know if you'd
like details on other possible options.

~~~
johnboyer
I would most definitely be interested. I always assumed that certain
algorithms are better suited to certain string sizes, but I was never sure
which ones. The ideal implementation is probably a combination of algorithms,
switching on the length of the pattern.

~~~
MattPalmer1086
Absolutely true that there isn't a single algorithm that beats all the others.
Of the general string search algorithms which don't use specialized CPU
instructions to obtain speedups, I would recommend:

1\. ShiftOr for short strings. Easy to implement.

This algorithm is not sub-linear like Boyer Moore: it examines every position,
but it uses bit-parallelism to validate a match, which lets it outperform
jump-based algorithms with separate verification stages on short strings.

2\. Variants of Wu-Manber for longer strings. Hard to find a good description,
but not too hard to implement.

Wu-Manber is a search algorithm designed for multi-pattern searching, based on
Boyer-Moore-Horspool and hashes. However, it also performs really well for
single-pattern searching. I have encountered variants of this in various
places, e.g. Lecroq's in "Fast Exact String Matching Algorithms".

These algorithms use a hash of a q-gram to look up how far it's safe to shift.
Q-grams tend to appear less frequently than single characters, so you get
longer jumps, at the cost of reading more characters and computing a hash.

3\. Horspool (or Boyer-Moore-Horspool).

This algorithm performs quite well - not as well as ShiftOr for shorter
patterns or Wu-Manber variants for longer ones, but still respectable. It's
essentially Boyer-Moore but only using one of the shift tables, which makes it
much easier to implement.

4\. Qgram-Filtering by Branislav Durian, Hannu Peltola, Leena Salmela and
Jorma Tarhio

For longer patterns, this algorithm mostly outperforms the others. However, it
can be a bit complicated to implement well, and it has some (rarely
encountered) nasty worst cases where the performance becomes absolutely
dreadful. For that reason I tend not to use it.

There are hundreds of possible search algorithms available now (see
[https://github.com/smart-tool/smart](https://github.com/smart-tool/smart) for
implementations of many of them with a benchmark tool). However, it's hard to
figure out exactly which algorithm is the best given your data and pattern.
For that reason, I tend to keep things simple.

I would just use ShiftOr for short patterns, and another one for longer
patterns. I would tend to use a Wu-Manber variant there, but Horspool would
probably give acceptable performance.

The only other consideration is the time it takes to build the pattern
indexes. For short searches, or if you have to re-build the pattern index on
each repeated search, it can actually be quicker to just do a naive "check the
string at each position" search, since it doesn't require building anything
before searching.

~~~
burntsushi
In my experience, it is difficult to beat a very simple heuristic in most
settings: when provided a string, pick its most infrequently occurring byte,
and use that as your skip loop in a vector-optimized routine such as memchr.
If you guess the right byte, you'll spend most of your time in a highly
optimized routine that makes the most of your CPU.

Picking the right byte can be tricky, but a static common sense ranking
actually works quite well. At least, the users of ripgrep seem to think so.
:-)

For some reason, I've never seen this algorithm described in the literature.
It doesn't have nice theoretical properties, but it's quite practical.

------
jaimex2
So... whats the magic sauce?

------
olliej
I’d be more interested in “this is faster than X” claims if it does a fair
comparison - pushing the implementation out of the header. Otherwise
(depending on operation) inlining ends up significantly throwing off
performance numbers.

That said, it's much easier to be faster than libN string libraries if you
don't have ABI constraints to deal with.

This uses a by-value struct to hold much of its metadata, which causes all
sorts of ABI issues: basically, if your app uses this, it can't expose it to
plugins or anything, because any update could break existing compiled plugins.

~~~
johnboyer
Most of the STL is header only, which is what the library is compared to in
the benchmarks. Both would be just as likely to get inlined.

------
bnolsen
Looks like the API would really shine if coded with C++ constructors and
destructors. Maybe rvalue refs would help too.

------
JeromeBonnet
Hi, John! Why did you choose to store 'size' and 'capacity' in the 'rs_heap'
struct instead of at the beginning of the heap-allocated buffer pointed to by
'buffer'? Did you find it preferable to have a larger 'rs_heap'?

~~~
masklinn
> Why did you choose to store 'size' and 'capacity' in the 'rs_heap' struct
> instead of at the beginning of the heap-allocated buffer pointed to by
> 'buffer'?

Why would you store the size and capacity as part of the buffer? It makes the
buffer more complex (you need a separate unsized struct or some mess special-
casing the first 16 bytes or so), wastes heap size, requires dereferencing
before you can even check on size & capacity, and makes SSO harder/more
limited.

AFAIK both C++'s std::string and Rust's String store size and capacity on the
stack, separate from the buffer.

~~~
JeromeBonnet
> Why would you store the size and capacity as part of the buffer?

To decrease the memory footprint of your small strings. I can see how some use
cases where a large proportion of strings are very small would benefit from
it.

I was just curious to know if that design decision was based on analysis of a
real use case, or if it was due to a compatibility issue or something.

~~~
masklinn
> To decrease the memory footprint of your small strings. I can see how some
> use cases where a large proportion of strings are very small would benefit
> from it.

If you do that, you have to spill your SSO to the heap way earlier:
rapidstring would have 15 bytes of SSO instead of the current 31, and
std::string would be limited to 7 compared to the current 23~31.

~~~
JeromeBonnet
It is a trade off. I see the length limit is 23 for SSO-23, but it could have
been arbitrarily larger. What is the basis for the determination of those
magic length limit targets?

~~~
masklinn
The size of the pre-existing non-SSO string. SSO-23 is predicated on a stack
layout of (pointer, capacity, size) (as in Clang). That's why SSO-23 applied
to the pre-existing MSVC or GCC std::string yields 31 on-stack bytes: they
have a stack size of 32 bytes.

------
sevensor
Is performance really the first thing we should worry about when it comes to
string libraries for C? Given the dismal history of character arrays in C, I'd
expect safety to be the first thing on the mind of any library implementer,
and the first thing mentioned in the README. Also, the second, third, and
fourth things.

~~~
tonto
In what sense would safety be such a high priority for a high-performance
string library? I found this blog post good at illustrating why undefined
(i.e. unsafe) operations exist:
[https://nullprogram.com/blog/2018/07/20/](https://nullprogram.com/blog/2018/07/20/)

------
cozzyd
I wish more containers had stack-resident small-size optimization.

------
jazoom
>rapidstring is maybe the fastest string library ever written in ANSI C

