
How to implement strings - qznc
http://beza1e1.tuxen.de/strings.html
======
rwmj
Another idea is the OCaml string representation which is a sort of cross
between C and Pascal. I wrote about it in [1]. It allows O(1) length, can
store \0 in the string, while being backwards compatible with existing
C-string-consuming functions.

It's applicable to C because most realistic implementations of malloc(3) store
the size of the malloc'd area to the nearest word[2] and so the Pascal size
part does not need to be stored explicitly. (Note this would require minor
changes to the C runtime to allow malloc to make explicit the currently
implicit size).
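
For concreteness, here is a rough C sketch of the trick (the names and struct are mine, not OCaml's actual runtime): the buffer is padded to a whole number of words with NULs, and the very last byte records the padding count, so the byte length is recoverable in O(1) while the data stays NUL-terminated:

    #include <stdlib.h>
    #include <string.h>
    
    typedef struct {
        size_t words;          /* allocation size in machine words */
        char   data[];         /* chars + NUL padding + padding count */
    } pascal_c_string;
    
    pascal_c_string *make_string(const char *src, size_t len)
    {
        size_t w = sizeof(size_t);
        size_t words = len / w + 1;          /* guarantees >= 1 spare byte */
        size_t pad = words * w - len - 1;
        pascal_c_string *s = malloc(sizeof *s + words * w);
        if (!s) return NULL;
        s->words = words;
        memcpy(s->data, src, len);
        memset(s->data + len, 0, pad);       /* NUL padding: C-compatible */
        s->data[words * w - 1] = (char)pad;  /* encode padding in last byte */
        return s;
    }
    
    size_t string_byte_length(const pascal_c_string *s)
    {
        size_t total = s->words * sizeof(size_t);
        return total - (size_t)s->data[total - 1] - 1;   /* O(1) */
    }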

[1] [https://rwmj.wordpress.com/2009/08/05/ocaml-internals-part-2...](https://rwmj.wordpress.com/2009/08/05/ocaml-internals-part-2-strings-and-other-types/)

[2] [https://rwmj.wordpress.com/2016/01/08/half-baked-ideas-c-str...](https://rwmj.wordpress.com/2016/01/08/half-baked-ideas-c-strings-with-implicit-length-field/)

~~~
Asooka
You can get the size of a malloc'd block with malloc_usable_size. Can't say if
it's actually O(1) though.

~~~
jws
You had my hopes up for a minute there, until I checked portability…

malloc_usable_size() is a GNU extension, so usable on Linux. macOS has
malloc_size().

One web page [1] says the GNU one takes about 10 CPU cycles and the macOS one
about 50, so not the end of the world, but maybe a little long for string
length checking.

I didn’t check any BSDs, and I have to wonder what effect changing allocators
on Linux has. They could certainly have a performance impact, and possibly a
‘does not implement’ problem.
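
For what it’s worth, a shim along these lines covers the two platforms mentioned (a sketch; other platforms and allocators would need their own branches):

    #if defined(__GLIBC__)
    #include <malloc.h>               /* glibc: malloc_usable_size() */
    #define ALLOC_SIZE(p) malloc_usable_size(p)
    #elif defined(__APPLE__)
    #include <malloc/malloc.h>        /* macOS: malloc_size() */
    #define ALLOC_SIZE(p) malloc_size(p)
    #else
    #error "no known allocator size probe on this platform"
    #endif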

[1] [https://lemire.me/blog/2017/09/15/how-fast-are-malloc_size-a...](https://lemire.me/blog/2017/09/15/how-fast-are-malloc_size-and-malloc_usable_size-in-c/)

------
cakoose
> The character \0 marks the end so we call this zero-termination. [...] The
> only advantage of this representation is space efficiency.

Sort of a tangent, but zero termination (aka null termination) can make
parsing from a string slightly simpler.

Without a null terminator, in addition to processing each character, you need
to check if you've hit the end of the string. With a null terminator, the end-
of-input checking is elegantly part of the "processing each character" step.

You can always add a null terminator character to the end of your string in
the length+data representations as well, but if your strings don't come that
way you end up having to make a full copy.
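
A toy example of the point:

    #include <stddef.h>
    
    /* One test per character covers both "end of input?" and "is it a
       match?", with no separate bounds check. */
    size_t count_spaces(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (*s == ' ')
                n++;
        return n;
    }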

~~~
tyingq
sds is somewhat nice in that it works both ways: it keeps an explicit length
but the buffer is still NUL-terminated.
[https://github.com/antirez/sds](https://github.com/antirez/sds)
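
A usage sketch (sdsnew/sdslen/sdsfree are part of the sds API; see the repo for the rest):

    #include <stdio.h>
    #include "sds.h"
    
    int main(void)
    {
        sds s = sdsnew("hello");     /* an sds is a plain char *, so it
                                        works with C-string-consuming APIs */
        printf("%s has %zu bytes\n", s, sdslen(s));  /* O(1) length from
                                                        the hidden header */
        sdsfree(s);
        return 0;
    }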

~~~
cakoose
That's similar to C++11 std::string, which has a length indicator and a null
terminator.

(I believe this was often true for std::string implementations before C++11,
but apparently C++11 made it a requirement.)

~~~
romed
C++11 mandatory null termination has the advantage that a call to data() or
c_str() need not store the null byte. On the other hand, mandatory termination
costs one additional store in some of the constructors, and burns one byte of
the capacity for inline data.

~~~
kazinator
How does it burn one byte of capacity for inline data? Are you saying there is
a zero-byte way to encode the length that could instead be used, avoiding that
one byte overhead?

~~~
romed
What I meant was the standard requires the null byte at the end of the string,
which reduces by one the possible length of the string when stored inline. If
the null byte wasn't mandatory, it would be possible to store slightly longer
strings inline, but then the implementation would have to conditionally
materialize the contents to the heap in case of a call to c_str or data.

~~~
saagarjha
Only since C++11. In C++03 no guarantees were given on the time complexity of
c_str, and std::strings could lay out their contents however they wished (and
cause a heap allocation in the process of materializing a C string, if
necessary).

------
mattnewport
Unicode is indeed insanely complex. There is almost no query or transform of a
Unicode string, beyond asking its length in bytes, that is at all
straightforward. I suspect that very few pieces of software that 'support'
Unicode and do anything nontrivial with text actually do so fully correctly.
It would be nice if there were a well-defined 'simple' subset that handled the
80% case and could be a reasonable target for the average app to support
fully.

~~~
bradleyjg
> It would be nice if there were a well-defined 'simple' subset that handled
> the 80% case and could be a reasonable target for the average app to
> support fully.

Isn't that ASCII?

~~~
mattnewport
Well, it turns out that if you don't want to go down the Unicode rabbit hole
too far, then yeah, your best bet is probably to stick with ASCII. As an
example from my industry, though: it would be nice if I could implement user
name entry and display for a high score table in a simple game and support
names in common European languages without needing to handle all the edge
cases of e.g. mixed left-to-right and right-to-left text, combining
characters, surrogates, etc.

I'm far from a Unicode or languages expert, but I'm familiar with one
language with right-to-left non-Latin characters and aware of just enough
Unicode madness to know I don't know enough to handle many edge cases
properly. It would be nice if a regular developer like me could support
something more than plain ASCII but less than the full insanity of Unicode,
to accommodate at least some non-English users.

~~~
saagarjha
You're basically throwing people with "inconvenient" character sets (i.e.
everything that doesn't use something strongly resembling Latin characters)
under the bus. Sure, you might be able to support Spanish, French, and German,
but you're basically disregarding Japanese, Chinese, and Hindi when doing so
(and possibly even ASCII, since you'd trade some symbols for accents).

~~~
mattnewport
That's not at all what I'm advocating. My point is that the extreme difficulty
of fully supporting Unicode with all of its complex edge cases (like my
examples of mixing left-to-right and right-to-left languages) means the two
most common outcomes are either throwing up your hands, giving up on
supporting anything but English, and just using ASCII, or shipping broken,
partial Unicode support.

I'm suggesting it would be nice to have another option, where you could
provide some level of support for non-English languages with something you
have some hope of implementing correctly. Applications that correctly handle
editing of mixed left-to-right and right-to-left text are rare, for example,
but you could support Farsi speakers reasonably well in many applications
without handling that scenario.

------
Someone
_”However, none of them are suitable for general strings where we might have
many small ones”_

That’s where I expected to see the Short String Optimization mentioned.

For short strings, it stores the string’s characters in the same memory
where, for longer strings, it stores a pointer to the characters (with
‘short’ dependent on the implementation).

It also tweaks how either gets stored to make it possible (and fast) to
determine which of the two is stored in that space. See
[https://shaharmike.com/cpp/std-string/](https://shaharmike.com/cpp/std-string/)
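
A minimal sketch of the idea in C (the real libstdc++/libc++ layouts differ and pack the tag more cleverly; see the linked article):

    #include <stddef.h>
    
    /* Short strings reuse the space otherwise occupied by the heap
       pointer and capacity; a tag says which variant is live. */
    #define SSO_CAP (sizeof(char *) + sizeof(size_t))
    
    struct sso_string {
        size_t len;
        unsigned char is_small;
        union {
            char small[SSO_CAP];    /* short strings live inline */
            struct {
                char  *ptr;         /* long strings live on the heap */
                size_t cap;
            } big;
        } u;
    };
    
    const char *sso_data(const struct sso_string *s)
    {
        return s->is_small ? s->u.small : s->u.big.ptr;
    }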

~~~
steveklabnik
We decided against SSO for Rust, incidentally. I’m on my phone and it’s late
so I won’t elaborate too much but it’s not always clear it’s a win, depending
on language semantics.

~~~
stochastic_monk
I imagine I’d be more concerned with performance and locality than memory
savings if every string required a heap allocation and an additional
dereference on access.

However, Rust is very well-engineered, and I’d be interested in knowing what
factors made its decision a win.

~~~
kibwen
IIRC, there are a variety of factors:

1. Conventions and prevailing idioms. In C++, copying strings is common, both
unintentionally due to the implicit copy constructor and intentionally as a
means of defensive programming. In Rust, copying strings is relatively
uncommon: Rust has no copy constructors (copying must be done explicitly with
the `.clone()` method), and the borrow checker obviates the need for
defensive copies, so passing string slices is overwhelmingly preferred (and
recall that string slices in Rust (`&str`) are 16 bytes, in comparison to 24
bytes for C++'s `std::string`).

2. Alternative optimizations. For example, in Rust, initializing a `Vec` (and
by extension `String`) does not perform a heap allocation for zero-length
values. IOW, you can call `String::new()` (or even `String::from("")`) in a
hot loop without worrying about incurring allocator traffic. Any hypothesis
that short strings are more common than long strings will likely also
hypothesize that the most common short string length is zero (and this appears
to be borne out in practice; this optimization is _really_ important for
`Vec`, so important that, for Servo, it often makes `Vec` faster in practice
than their own custom `SmallVec` type, which lives strictly on the stack), so
this optimization goes a fair way towards satisfying the "I actually do have
many short strings and I'm not just copying around the same string a lot" use
case.

3. Weighing trade-offs. The pros of SSO are better locality and less memory
usage; the cons are larger code size and additional branches on various
operations.

Putting it all together, this gives the Rust devs reasonable incentive to take
the conservative path of not implementing SSO for the default string type.
That's not to say that we'd be incapable of finding Rust code that would
benefit from SSO, or that C++ chose its default incorrectly (different
contexts allow different conclusions), or that the Rust devs will always hold
this stance (if the performance benefits of SSO were demonstrated on Rust code
in the wild definitively enough to justify changing the default behavior, I
don't doubt they could find a way to make it happen).

~~~
dwaite
String implementations are low-level enough that they are influenced quite a
bit by memory-management behavior, sometimes in surprising ways.

G++ used to have CoW strings, but I believe they were dropped because, for
common cases, the atomic reference counting was more expensive than the
potential copying.

Inlining strings saves space and allocator traffic at the expense of a more
complex string implementation and (depending on the platform/implementation)
strings that do not start word-aligned.

There are other flags you might want to put on an underlying text sequence:

- that the text is exclusively ASCII in UTF-8 strings. This means the
character count and indexing can be optimized to constant-time operations
(see the sketch at the end of this comment)

- Latin-1 in UTF-16 strings. This means you can use an alternate 8-bit
encoding for space savings, as well as constant-time count/indexing.

- that the string represents a single grapheme cluster (aka a printable
character). This may allow you to confine Unicode text processing to a single
data structure.

- that the representation is a slice. In this case, you'd typically store an
offset rather than a capacity. This again helps you build higher-level types
and text processing libraries on top of a single data structure.

These features (and especially combinations of them) impact the hot path for
text processing. If your language is already doing custom allocation (such as
using a copy constructor), the allocator savings are nil, so it becomes a
space vs code trade-off.
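
A sketch of that first flag (the types and names here are hypothetical):

    #include <stddef.h>
    
    struct utf8_string {
        const unsigned char *bytes;
        size_t nbytes;
        unsigned ascii_only : 1;   /* set once when the string is validated */
    };
    
    /* Byte offset of the n-th codepoint: O(1) on the ASCII fast path,
       otherwise a linear scan over UTF-8 lead bytes. */
    size_t codepoint_offset(const struct utf8_string *s, size_t n)
    {
        if (s->ascii_only)
            return n;                            /* 1 byte per codepoint */
        size_t i = 0;
        while (n > 0 && i < s->nbytes) {
            i++;                                 /* step past the lead byte */
            while (i < s->nbytes && (s->bytes[i] & 0xC0) == 0x80)
                i++;                             /* skip continuation bytes */
            n--;
        }
        return i;
    }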

~~~
eridius
FWIW the C++ standard actually prohibits CoW strings, and any C++ standard
library that implements CoW strings is doing so in defiance of the standard.
AIUI this prohibition was completely accidental, and I believe it's due to the
rules around which string operations do and do not invalidate iterators.

~~~
stochastic_monk
GCC updated their implementation to become compliant, probably for these
reasons.

------
sharpercoder
I personally think we should have a plethora of data structures representing
strings. Sure, I want a generic fits-all structure to represent text, as
provided in many libraries and frameworks, such as .NET's String class.
However, I have many situations where using that same String class is pure
overkill and prone to security bugs. For instance, I want a DigitsOnlyString,
representing only the digits [0-9]. But I also want a NumbersOnlyString,
representing the digits [0-9] plus all the other number characters there are
in Unicode. The same for e.g. emojis, Latin characters ([a-zA-Z]), extended
Latin ([a-zA-Z] plus a subset of diacritics), Chinese characters, and I can
go on with quite a few more examples (passwords, anyone? Don't allow a lot of
Unicode edge cases). These classes would have the benefit of really efficient
encoding in bits, better security, more clarity for programmers and, most
important of all, clearer reasoning. I bet 80% of String uses could be moved
to a more constrained type.

~~~
honestlyidk
All those are is a set of validation functions tied to strings... we really
don't need native data types representing this. I'm not sure why you can't
find a lib/function that easily validates such things. Programming languages
are the basic foundation, which should be concerned with performance,
organization and making sure freedom is given to the programmer. These are
all specific edge cases for the programmer to define, not for the programming
language to worry about.

Would you rather they speed up the compiler/interpreter or add edge-case
string classes? It's not like there isn't a trade-off. This is a bottom-of-
the-barrel concern.

~~~
SquishyPanda23
> Programming languages are the basic foundation, which should be concerned
> with performance, organization and making sure freedom is given to the
> programmer.

This is one view, but there are others. For example, instead of focusing on
performance and freedom, the programming language could focus on safety or
formal properties.

E.g. there should never be code that compiles but is unsafe, or which can have
certain classes of common bugs.

------
kazinator
> The only advantage of this representation is space efficiency.

Ignorant nonsense. There are other advantages.

Advantage: If _s_ is a non-empty string, then s + 1 is the rest of that
string. s + 1 is an extremely cheap operation: incrementing a pointer.

This allows recursion on strings. For instance, strchr can be coded like this:

    
    
   #include <stddef.h>   /* for NULL */
   
   char *strchr(const char *s, int ch)
   {
      if (*s == (char)ch)         /* including the ch == 0 case */
        return (char *)s;         /* strchr returns char *, hence the cast */
      if (*s)
        return strchr(s + 1, ch); /* tail call */
      return NULL;
   }
    

A variety of useful algorithms can be implemented in this simultaneously
elegant and efficient manner.

Advantage: null-terminated strings can use exactly the same external
representation as internal. They can be sent over the network as-is or
written to files. There is no question of what format the header is, how wide
the length field is, and so on. Strings of wide characters have just one
simple issue: how wide (16 or 32 bits), and which endianness.

~~~
saagarjha
Of course, most _actual_ strchr implementations tend to be heavily
vectorized, hand-tuned assembly with macros to select the correct streaming
extension supported by the hardware. Or, for less powerful systems, something
like

    
    
      char *strchr(const char *s, int ch) {
          const char *p = s;
          while (*p && *p != (char)ch)
              p++;                      /* advance past non-matching chars */
          return *p == (char)ch ? (char *)p : NULL;
      }

~~~
kazinator

      s/less powerful systems/systems with lesser compilers/
    

With TCO, I'd expect the same code from the recursive version.

The recursive version has the benefit of being pure; it doesn't mutate
anything. Speaking of which, HN ate your ++.

------
_kst_
The first sentence:

"The C programming language defines a string as char* ."

No, it doesn't. A string is by definition "a contiguous sequence of characters
terminated by and including the first null character". A char* value may or
may not be a pointer to a string.

(Confusing arrays and pointers is perhaps the most common mistake people make
when talking about C.)

~~~
CodeArtisan
A string in C is an implicit type; it's a specialization[1] of an array of
characters. Another example of an implicit type is the list in Lisp (a
specialization of nested pairs).

[1]
[https://en.wikipedia.org/wiki/Specialization_(logic)](https://en.wikipedia.org/wiki/Specialization_\(logic\))

~~~
kazinator
A string in C is entirely built upon convention; it exists only in the words
of a document.

Common Lisp has a _list_ class that you can specialize a CLOS method on.

Also:

    
    
      [1]> (typep nil 'list)
      T
      [2]> (typep '(a) 'list)
      T
      [3]> (typep 3 'list)
      NIL
    

And:

    
    
      [4]> (subtypep 'cons 'list)
      T ;
      T
      [5]> (subtypep 'null 'list)
      T ;
      T
    

So, kind of apples and oranges.

~~~
CodeArtisan
I was referring to John McCarthy's definition of a list. IIRC, Common Lisp's
list type includes both proper and improper lists: (typep (cons 1 2) 'list)
evaluates to true, but (length (cons 1 2)) throws a type error (not a list).

------
perlgeek
> There are two relevant lengths of Unicode strings: memory size in bytes and
> number of grapheme clusters.

There is also the number of codepoints, which is relevant when you implement
Unicode-related algorithms.

And then there is display length (which counts full-width characters as two
characters, even in monospace fonts, and zero-width characters as zero).

Finally, if you think grapheme clusters are the right way to count characters
in a string (and they often are), then you should also allow string indexing
at the grapheme-cluster level.
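
For UTF-8, for example, bytes come from strlen and codepoints can be counted by skipping continuation bytes; grapheme clusters need real Unicode segmentation (e.g. ICU), which is beyond a sketch like this:

    #include <stddef.h>
    
    /* Codepoints in a NUL-terminated UTF-8 buffer: count every byte
       that is not a continuation byte (continuations look like 10xxxxxx). */
    size_t utf8_codepoints(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }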

------
aasasd
I lowkey wonder how much memory is consumed, on average, by the extra string
info in map-heavy languages like JS and Python, and how much CPU time is spent
on strings there. While C doesn't need too many strings and mostly throws
around numbers, a Node.js app may easily be dealing with thousands of them,
not even counting HTML and templates.

(And now I also wonder if any compilers and VMs precompute hashes for map and
object keys.)

~~~
jmaa
Slightly related: Lua interns every string ever created; that is, two strings
are equal if and only if their pointers (as represented internally) are
identical. This makes string creation slower, but equality checking and
lookups very fast, as it reduces the problem to comparing integers.
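
A toy version of the invariant (a real interpreter would use a hash table; the linear table here is only for exposition):

    #include <string.h>
    
    #define MAX_INTERNED 1024
    static const char *interned[MAX_INTERNED];
    static size_t n_interned;
    
    /* Canonicalize: equal contents always yield the same pointer, so
       string equality afterwards is just pointer equality. */
    const char *intern(const char *s)
    {
        for (size_t i = 0; i < n_interned; i++)
            if (strcmp(interned[i], s) == 0)
                return interned[i];        /* reuse the existing copy */
        if (n_interned == MAX_INTERNED)
            return NULL;                   /* full; a real table would grow */
        return interned[n_interned++] = strdup(s);
    }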

------
denormalfloat
Out of all the string types I have used, I think Go has the best. Since it's
a GC'd language, you don't have to pay the refcount atomic overhead. The
representation is only two words long. Strings are immutable, making them
suitable map keys. Strings don't carry any encoding information; they are
plain byte arrays.
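
In C terms, the header is roughly this (illustrative, not Go's actual runtime definition):

    #include <stddef.h>
    
    /* Roughly Go's string header: two words, no encoding, no capacity
       (strings are immutable, so none is needed). */
    struct go_string {
        const char *data;
        size_t      len;
    };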

------
ademarre
This looks like good information about the memory representation, but we
would be better off without string as a concrete data type at all.

Strings as they commonly work are leaky abstractions which for decades have
bred confusion among programmers over how to handle text properly.

_Text/character encoding bugs are a matter of type safety._ I believe most of
the text encoding bugs in the history of software could have been avoided if
we used our type systems to handle text encodings explicitly.

UTF-8, UTF-16, UTF-32, Windows-1252, ISO-8859-1, US-ASCII, et al.; these
should be our essential concrete data types. The concept of a string should
just be an abstraction over them for polymorphism.

At no point in the lifecycle of a piece of textual data should there ever be
any question over how it is encoded. So why don't we let our type systems do
this bookkeeping for us?

~~~
svat
I have written some programs dealing with Indic scripts, and when working with
strings I prefer to see them only as a sequence of Unicode codepoints; it's a
clean abstraction compared to the encodings, which are messy (and half of them
don't even support the scripts I care about).

While text is in memory inside your programming-language runtime, you don't
care about the encoding; for all I know, it may not even be one of those
encodings. You only have to deal with encodings at the boundary with the
"external world": when you fetch (or send) bytes from (or to) disk / network /
the user. These are in any case outside the scope of the type system, so the
type system wouldn't help you: the encoding needs to be communicated to you
out-of-band.

(See
[https://nedbatchelder.com/text/unipain.html](https://nedbatchelder.com/text/unipain.html)
for the “Unicode sandwich” approach: convert to bytes at the boundaries of
your system.) So using a type system to specify encodings isn't really
helpful; what a type system _can_ help you with is clearly distinguishing
between bytes and text (separate types), and forcing you to specify an
encoding when converting between them.

~~~
ademarre
> _I prefer to see them only as a sequence of Unicode codepoints_

You can still have this with encodings as data types.

Encoding is definitely a concern at the boundary, but to say it is a concern
_only_ at the boundary is too simplistic. What if you want to avoid the cost
of converting away from whatever encoding arrived at the boundary? And at some
point the data will have to leave your application, and that often requires
munging text in a number of encodings.

> _So using a type system to specify encodings isn't really helpful_

Imagine an error at compile time when you accidentally try to append an
ISO-8859-1 string to a UTF-8 string.
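
Even C's weak type system can express a crude version of this, since structurally identical structs are still distinct types (the names here are hypothetical):

    #include <stddef.h>
    
    /* Mixing encodings without an explicit conversion fails to compile. */
    typedef struct { const unsigned char *bytes; size_t len; } utf8_str;
    typedef struct { const unsigned char *bytes; size_t len; } latin1_str;
    
    utf8_str utf8_concat(utf8_str a, utf8_str b);  /* same encoding only */
    utf8_str latin1_to_utf8(latin1_str s);         /* conversions are explicit */
    
    /* utf8_concat(u, l) with a latin1_str l is a compile-time error;
       you have to write utf8_concat(u, latin1_to_utf8(l)). */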

------
emmelaich
> The only advantage of this representation is space efficiency.

Not even close to true. Simplicity and future-proofing are at least two more.

The reference to PDP-11 assembler is a furphy.

You only have two ways to store _anything_ at the lowest level: 1. length +
data, or 2. data with a sentinel value.

If K&R had chosen option 1, we would have versions of strings with 8-bit
lengths, 16-bit lengths, ..., all the while incurring a base load of
inefficiency within any program. (In fact, I'd warrant programs would use
data+sentinel internally anyway.)

There are now many, many "safe string" libraries for C. Use them if you like.
The fact that there are so many, and that they get so little use, tells you
something.

Is length+data safer? It's easy to lie on the wire, so I don't think so. Not
by much, anyway.

BTW, one way to provide a future-proof (length, data) format is to encode the
length with a UTF-style variable-width encoding, so the length field would
have enormous range.
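
One way to realize that, sketched here with a LEB128-style varint rather than literal UTF-8 lead bytes (the idea is the same):

    #include <stddef.h>
    #include <stdint.h>
    
    /* Variable-width length prefix: 7 bits per byte, high bit set while
       more bytes follow. One byte for lengths < 128, unbounded range. */
    size_t encode_len(uint64_t len, unsigned char *out)
    {
        size_t n = 0;
        do {
            unsigned char b = len & 0x7F;
            len >>= 7;
            out[n++] = b | (len ? 0x80 : 0);   /* continuation bit */
        } while (len);
        return n;
    }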

------
ishi
This could be a good question for interviews. A good candidate should be able
to come up with several solutions for implementing strings, and explain the
pros and cons of each solution.

~~~
sifoobar
A good candidate for what?

I've been through enough interviews on both sides. The only thing that will
tell you anything of importance is paying candidates to perform real tasks.

If doing algorithms on a whiteboard is part of the job, then sure, go for it.
But I suspect very few of the really good coders out there would make much of
an impression in that setting.

~~~
macintux
I think if you modify it to simply state that having a candidate think out
loud about mechanisms for storing strings is a useful exercise in evaluating
their problem solving ability, it’s a pretty good idea.

I’m rarely concerned about the right answer in an interview, because those are
generally searchable.

~~~
sifoobar
But it's still pretending, solving canned problems with right/wrong answers is
not the same as doing the real thing. You'll get completely different
behaviors out of most people. I wouldn't be surprised if plenty of whiteboard
surfers and sweet talkers turn out to be less than ideal under pressure when
there are no clear answers and no authority to back them up.

~~~
macintux
Thankfully I’m rarely hiring for positions where someone is developing
software with a gun to their head.

------
bigtimber
Perhaps you could include a comparison with the BSTR representation used
extensively in Windows COM interfaces?
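
For readers unfamiliar with it: a BSTR points at NUL-terminated UTF-16 data with a 4-byte byte-length prefix stored immediately before the pointer, roughly like this (illustrative; on Windows you'd use SysAllocString/SysStringLen):

    #include <stdint.h>
    #include <string.h>
    
    /* BSTR layout: [4-byte byte count][UTF-16 code units][NUL]; the
       BSTR pointer points at the code units, not at the prefix. */
    uint32_t bstr_byte_len(const uint16_t *bstr)
    {
        uint32_t len;
        memcpy(&len, (const unsigned char *)bstr - 4, sizeof len);
        return len;   /* byte count, excluding the terminating NUL */
    }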

~~~
int_19h
HSTRING (from WinRT) is probably more interesting at this point.

------
deepsun
How do I store \0 in a C string?

~~~
colejohnson66
And work with the standard library’s string functions? You can’t.
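
The bytes themselves can hold a \0 if you track the length yourself; it's only the str* functions that stop at the first one:

    #include <stdio.h>
    #include <string.h>
    
    int main(void)
    {
        char buf[] = {'a', '\0', 'b'};
        printf("%zu\n", strlen(buf));        /* 1: strlen stops at the \0 */
        printf("%zu\n", sizeof buf);         /* 3: the bytes are all there */
        fwrite(buf, 1, sizeof buf, stdout);  /* length-based I/O sees all 3 */
        return 0;
    }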

------
jchw
Go slices sound similar, though they are always mutable.

