Hacker News
How to implement strings (tuxen.de)
138 points by qznc 4 months ago | 100 comments

Another idea is the OCaml string representation which is a sort of cross between C and Pascal. I wrote about it in [1]. It allows O(1) length, can store \0 in the string, while being backwards compatible with existing C-string-consuming functions.

It's applicable to C because most realistic implementations of malloc(3) store the size of the malloc'd area to the nearest word[2] and so the Pascal size part does not need to be stored explicitly. (Note this would require minor changes to the C runtime to allow malloc to make explicit the currently implicit size).

[1] https://rwmj.wordpress.com/2009/08/05/ocaml-internals-part-2...

[2] https://rwmj.wordpress.com/2016/01/08/half-baked-ideas-c-str...
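A minimal C sketch of the idea (illustrative only, not OCaml's actual header layout, and the function names are made up): keep a length word just before the character data and hand out a pointer to the data, so existing C-string-consuming functions still work:

```c
#include <stdlib.h>
#include <string.h>

/* Allocate a string with a hidden length word before the data.
 * The returned pointer is also null-terminated, so it can be passed
 * to ordinary C string functions, while still being able to store \0
 * inside the data and answer length queries in O(1). */
char *lstr_new(const char *src, size_t len) {
    size_t *block = malloc(sizeof(size_t) + len + 1);
    if (!block)
        return NULL;
    *block = len;                    /* hidden length word */
    char *s = (char *)(block + 1);
    memcpy(s, src, len);
    s[len] = '\0';                   /* keep C compatibility */
    return s;
}

size_t lstr_len(const char *s) {     /* O(1) length */
    return ((const size_t *)s)[-1];
}

void lstr_free(char *s) {
    free((size_t *)s - 1);
}
```

Note that such a string can contain embedded nulls: C functions will simply stop at the first one, while length-aware code sees the whole contents.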

>It's applicable to C because most realistic implementations of malloc(3) store the size of the malloc'd area to the nearest word[2]

This is not so true anymore, and I'd be very careful with that sort of trick. While the default memory allocator on most distributions (ptmalloc2) probably still works like that, there are other memory allocators in use, like jemalloc/tcmalloc, which store metadata for a group of blocks (e.g. all equal-sized blocks in a page) rather than for every single block.

Like Asooka wrote, you can call malloc_usable_size(), if available. But bear in mind that malloc_usable_size() returns the size of the block that malloc has given you, which can be _larger_ than the size you requested, as many memory allocators round block sizes up.

I actually wrote a memory allocator that works that way: https://github.com/ricleite/lrmalloc

You can get the size of a malloc'd block with malloc_usable_size. Can't say if it's actually O(1) though.

You had my hopes up for a minute there, until I checked portability…

malloc_usable_size() is a GNU extension, so usable on Linux. MacOS has malloc_size().

One web page [1] says the GNU one takes about 10 CPU cycles and the MacOS one about 50 CPU cycles, so not the end of the world, but maybe a little long for string length checking.

I didn’t check any BSDs and I have to wonder what effect changing allocators on Linux has. Certainly they could have a performance impact and possibly a ‘does not implement’ problem.

[1] https://lemire.me/blog/2017/09/15/how-fast-are-malloc_size-a...

> The character \0 marks the end so we call this zero-termination. [...] The only advantage of this representation is space efficency.

Sort of a tangent, but zero termination (aka null termination) can make parsing from a string slightly simpler.

Without a null terminator, in addition to processing each character, you need to check if you've hit the end of the string. With a null terminator, the end-of-input checking is elegantly part of the "processing each character" step.

You can always add a null terminator character to the end of your string in the length+data representations as well, but if your strings don't come that way you end up having to make a full copy.
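The difference shows up in the loop shape. A sketch with two hypothetical upper-casing routines:

```c
#include <ctype.h>
#include <stddef.h>

/* With a null terminator, the end-of-input check *is* the character test. */
void upcase_z(char *s) {
    for (; *s; ++s)
        *s = (char)toupper((unsigned char)*s);
}

/* With length + data, a separate counter tracks the end. */
void upcase_n(char *s, size_t len) {
    for (size_t i = 0; i < len; ++i)
        s[i] = (char)toupper((unsigned char)s[i]);
}
```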

Saying the "only advantage" is space efficiency is definitely weird. Null-termination means that a string can't contain \0, which has both good and bad consequences. Maybe what the author means is that the technique of just running through the string until you hit a null has been an unmitigated disaster from a security perspective. (Buffer overflows often result from a strcpy() which receives a longer string than expected as an argument.)

See this proposal which takes that problem as its starting point: https://cr.yp.to/proto/netstrings.txt
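The linked netstrings proposal frames a string as `<length>:<data>,`, e.g. "hello" becomes "5:hello,". A minimal encoder sketch (netstring_encode is a made-up name; the proposal itself only specifies the wire format):

```c
#include <stdio.h>
#include <string.h>

/* Encode `len` bytes of `data` as a netstring ("<len>:<data>,") into `out`.
 * Returns the number of bytes written. `out` must be large enough
 * (len + about 22 bytes covers any 64-bit length prefix). */
size_t netstring_encode(const char *data, size_t len, char *out) {
    size_t n = (size_t)sprintf(out, "%zu:", len);  /* decimal length prefix */
    memcpy(out + n, data, len);                    /* raw bytes, no escaping */
    out[n + len] = ',';                            /* trailing delimiter */
    return n + len + 1;
}
```

Because the length is declared up front, the payload needs no escaping and may contain any byte, including nulls.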

More generally, the implication that space efficiency is the justification for null-termination is probably not historically true. It was a form of in-band signalling, just like getchar() returning EOF.

When null-termination is used in network protocols, the data to be sent has to be prepared in order to remove/escape any nulls. This process is called byte stuffing, and naturally it could result in serious inefficiency if a long string of nulls occurred in the input data: https://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuff...

Sds is somewhat nice in that you can work both ways. https://github.com/antirez/sds

That's similar to C++11 std::string, which has a length indicator and a null terminator.

(I believe this was often true for std::string implementations before C++11, but apparently C++11 made it a requirement.)

C++11 mandatory null termination has the advantage that a call to data() or c_str() need not store the null byte. On the other hand, mandatory termination costs one additional store in some of the constructors, and burns one byte of the capacity for inline data.

How does it burn one byte of capacity for inline data? Are you saying there is a zero-byte way to encode the length that could instead be used, avoiding that one byte overhead?

What I meant was the standard requires the null byte at the end of the string, which reduces by one the possible length of the string when stored inline. If the null byte wasn't mandatory, it would be possible to store slightly longer strings inline, but then the implementation would have to conditionally materialize the contents to the heap in case of a call to c_str or data.

Only since C++11. In C++03 no guarantees were given on the time complexity of c_str, and std::strings could lay their contents out however they wished (and cause a heap allocation in the process of materializing a C string, if necessary).

std::string has a byte length, an allocation capacity, a nil terminator and the content.

A string implementation will typically store this as a structure with size + capacity + content pointer, with the content having a nil terminator. You could technically put capacity as a prefix of the content as well (I believe one of the Apple GC string representations stores capacity as well as a reference count in the content block, and uses the reference count to determine exclusive ownership for inline mutation.)

These may have two other representations:

- size, capacity and content all initialized to zero is the only representation of the empty string

- an embedded flag (such as a tagged contents pointer) indicates that instead the string holds inlined data.

In this inlined case on a 64-bit platform, you have 24 bytes to work with. You store the length in the LSB byte of the content pointer (making sure not to interfere with the tagging), and capacity is hard coded and omitted. You then have 23 bytes to work with, which would be 23 bytes of content - but you need to remember to add space for a nil byte at the end of the inlined content, bringing your capacity down to 22.
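A rough sketch of such a 24-byte layout in C (field names and placement are made up for illustration; real implementations pack the flag and length differently, e.g. into a tagged pointer byte as described above):

```c
#include <stddef.h>

/* Illustrative 24-byte string object on a 64-bit platform: either a
 * (size, capacity, pointer) triple for heap strings, or inline data.
 * The inline contents must include their own nil terminator, so the
 * inline capacity is 22 characters, not 23. */
union sso_string {
    struct {
        size_t size;
        size_t capacity;
        char  *data;         /* a tag bit here distinguishes the heap case */
    } heap;
    struct {
        char data[23];       /* inline contents, nil-terminated */
        unsigned char len;   /* inline length, sharing space with the tag */
    } inline_;
};
```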

> std::string has a byte length,an allocation capacity, a nil terminator and the content.

This is only true post C++11, I believe. C++03 was much more permissive about how std::strings were laid out internally.

Unicode is indeed insanely complex. There is almost no query or transform of a Unicode string you can do beyond asking its length in bytes that is at all straightforward. I suspect that very few pieces of software that 'support' Unicode and do anything non trivial with text actually do so fully correctly. It would be nice if there was a well defined 'simple' subset that handled the 80% case that could be a reasonable target for the average app to support fully.

Perl 6 has invented its own normalisation called NFG which normalises at the grapheme level by creating synthetic code points for multi-char graphemes where necessary. This vastly simplifies operations on Unicode strings and gives semantics that are intuitive - producing results you would expect from the visual appearance of a string.

It feels to me like Unicode was designed for font renderers and other such software rather than programs that have to deal with Unicode input and output.

If you're a font renderer it makes sense to have separate codepoints for each grapheme, and it'd be more complex to split a single codepoint cluster into the individual components that need to be drawn. Having separate graphemes also allows reuse (though as the article shows, there's plenty of visual, non-semantic duplication).

But as a result, the operation "length of string in terms of what a user would consider as separate characters or grapheme clusters" is a hard problem that basically requires all the core aspects of a font renderer other than the actual display code.

Which is fine, and probably reasonable, but dear lord does it make it difficult to use.

> It would be nice if there was a well defined 'simple' subset that handled the 80% case that could be a reasonable target for the average app to support fully.

Isn't that ASCII?

Well it ends up that if you don't want to go down the Unicode rabbit hole too far then yeah, your best bet is probably to stick with ASCII. As an example from my industry though, it would be nice if I could implement user name entry and display for a high score table in a simple game and support names in common European languages without needing to handle all the edge cases of e.g. mixed left to right and right to left text, combining characters, surrogates, etc.

I'm far from a Unicode or a languages expert but I'm familiar with one language with right to left non Latin characters and aware of just enough Unicode madness to know I don't know enough to handle many edge cases properly. It would be nice if a regular developer like me could support something more than plain ASCII but less than the full insanity of Unicode to accommodate at least some non English users.

You're basically throwing people with "inconvenient" character sets (i.e. everything that doesn't use something strongly resembling Latin characters) under the bus. Sure, you might be able to support Spanish, French, and German, but you're basically disregarding Japanese, Chinese, and Hindi when doing so (and possibly even ASCII, since you'd trade some symbols for accents).

That's not at all what I'm advocating. My point is that the extreme difficulty of fully supporting Unicode with all of its complex edge cases (like my examples of mixing left to right and right to left languages) means that the two most common outcomes are throwing up your hands and giving up supporting anything but English and just using ASCII or having broken, partial Unicode support.

I'm suggesting it would be nice to have another option where you could provide some level of support for non English languages with something you have some hope of implementing correctly. Applications that correctly handle editing of mixed left to right and right to left text are rare for example but you could support Farsi speakers reasonably well in many applications without handling that scenario.

There’s Latin-1 which gets you full coverage on many European languages and almost complete coverage on several others (e.g. French uses œ, which it doesn’t have but French readers will be able to understand if oe is substituted).

The problem with Latin-1 and other 1-byte options is that, unlike ASCII, they aren’t forward compatible with UTF-8, which is the emerging de facto string exchange encoding. For a standalone video game, maybe that doesn’t matter, but for anything network enabled it can be a big issue.

No. Computers are in use by about 3 billion people right now. Only a minority of them use only ASCII characters in their day-to-day writing. Turns out the US and UK comprise only a small fraction of the world’s population.

But for many of the companies that people on HN happen to work for, they are the majority or even overwhelming majority of customers.

And therein lies a terrible misconception: the world does not speak English or German or Mandarin or French, but a horrible mix of all these languages. Eventually, almost any system will have to deal with that.

Simple example and a current pet peeve of mine while staying in the US: my name is spelled with an ü. I will likely try to enter that in your web form, because it is part of my legal name. A lot of systems happen to "sanitize" that input when it is passed across some invisible internal boundaries and it becomes a u. Now that system has actually changed my name. ü and u are completely different letters. The proper conversion actually is the transcription ue - two letters!

If that were - say - the address for a letter to Germany, it might well return to sender because there is no one with the (altered) name living at the given address.

The Unicode encoding formats are relatively simple and quite elegant (UTF-8 in particular manages to get an impressive set of capabilities with a relatively simple format).

> I suspect that very few pieces of software that 'support' Unicode and do anything non trivial with text actually do so fully correctly.

Why do you suspect this? Nearly all software that works with Unicode does so using pre-existing libraries (either a language's standard library, or something like libiconv).

I've read so many stories over the years of Unicode edge cases being broken in major applications and frameworks. I don't know the current state of all of these, but Chrome had all kinds of Unicode bugs for years, standard Windows text boxes didn't correctly support combining characters, many consoles and edit boxes don't support mixed right-to-left and left-to-right text properly, and many widely used languages seem to lack good standard library support for correctly manipulating strings with combining characters, although it's hard to find clear explanations of what the right thing to do even is.

Most of what you describe sounds like applications that don't bother to use unicode-aware string manipulation routines, period, rather than applications that use buggy unicode handling.

A lot of the problems I'm describing revolve around text input / editing / display rather than straight string manipulation.

Let's say I want to type 'My name is Matt.' in Farsi which is right to left. Transliterated that is 'esme (name) man (my) Matt ast (is)' and you might think I'd type that by typing the words in that order.

اسم من Matt است

The above is what I get if I type the words in that order while switching between Persian / English keyboards but it's not really what I'd expect as a user. In terms of right to left word order the above says 'ast Matt esme man' which is not the order I typed and is not correct. Now I'm not an expert in Farsi, Unicode or multi-lingual text entry and I don't know if the problem here is user error, browser implementation, Windows, or what. I know there's something about directional override characters in Unicode and a complex algorithm for dealing with text with mixed directions. I discovered while trying to write this post that it's been exploited in file names as a malware vector. I just know that this stuff is more complicated than I have time to deal with as a programmer on most projects I've worked on, so if Unicode is supported I don't even know if it's working correctly or how to test it in many cases.

Try to select, copy and paste the above text into a text editor and then move the cursor around with the cursor keys in it. Is the behavior what you'd expect? Is it correct? I don't know enough to be able to tell honestly.

“However, none of them are suitable for general strings where we might have many small ones”

That’s where I expected to see the Short String Optimization mentioned.

For short strings, it stores the string’s characters in the same memory where, for longer strings, it stores a pointer to the characters (with ‘short’ dependent on the implementation)

It also tweaks how either get stored to make it possible (and fast) to determine what is stored in that space. See https://shaharmike.com/cpp/std-string/

We decided against SSO for Rust, incidentally. I’m on my phone and it’s late so I won’t elaborate too much but it’s not always clear it’s a win, depending on language semantics.

I'm sure Steve can provide better sources than me, but the discussion here (and at the first link from the reddit page) look relevant: https://www.reddit.com/r/rust/comments/2slcs8/small_string_o...

Edit: this was supposed to be a response to the other response to Steve’s comment. On my phone now, can’t easily edit.

I imagine I’d be more concerned with performance and locality than memory savings if every string required a heap allocation and an additional dereferencing for access.

However, Rust is very well-engineered, and I’d be interested in knowing what factors made its decision a win.

IIRC, there are a variety of factors:

1. Conventions and prevailing idioms. In C++ copying strings is common, both unintentionally due to the implicit copy constructor and intentionally as a means of defensive programming. In Rust, copying strings is relatively uncommon: Rust has no copy constructors (copying must be done explicitly with the `.clone()` method), and the borrow checker obviates the need for defensive copies and therefore passing string slices is overwhelmingly preferred (and recall that string slices in Rust (`&str`) are 16 bytes, in comparison to 24 bytes for C++'s `std::string`).

2. Alternative optimizations. For example, in Rust, initializing a `Vec` (and by extension `String`) does not perform a heap allocation for zero-length values. IOW, you can call `String::new()` (or even `String::from("")`) in a hot loop without worrying about incurring allocator traffic. Any hypothesis that short strings are more common than long strings will likely also hypothesize that the most common short string length is zero (and this appears to be borne out in practice; this optimization is really important for `Vec`, so important that, for Servo, it often makes `Vec` faster in practice than their own custom `SmallVec` type which strictly exists only on the stack), so this optimization goes a fair ways towards satisfying the "I actually do have many short strings and I'm not just copying around the same string a lot" use case.

3. Weighing trade-offs. The pros of SSO are better locality and less memory usage, with the cons of larger code size and additional branches on various operations.

Putting it all together, this gives the Rust devs reasonable incentive to take the conservative path of not implementing SSO for the default string type. That's not to say that we'd be incapable of finding Rust code that would benefit from SSO, or that C++ chose their default incorrectly (different contexts allow different conclusions), or that the Rust devs will always have this stance (if performance benefits of SSO were to be clearly demonstrated on Rust code in the wild, in such a definitive way as to justify changing the default behavior, then I don't doubt they could find a way to make it happen).

String implementations are low enough level that they get influenced quite a bit by memory management behavior, sometimes in surprising ways.

G++ used to have CoW strings, but I believe they dropped them because for common cases the atomic reference counting was more expensive than potential copying.

Inlining strings saves space and allocator traffic at the expense of a more complex string implementation and (depending on the platform/implementation) strings that do not start word aligned.

There are other flags you might want to put on an underlying text sequence:

- that the text is exclusively ASCII in UTF-8 strings. This means the character count and indexing can be optimized to constant-time operations

- latin-1 in UTF-16 strings. This means you can do an alternate 8-bit encoding for space savings, as well as constant-time count/indexing.

- mark that the string represents a single grapheme cluster (aka a printable character). This may allow you to confine Unicode text processing to a single data structure.

- mark that the representation is a slice. In this case, you'd typically store an offset rather than a capacity. This again helps you combine higher level types on top of a single data structure and text processing libraries.

These features (and especially combinations of them) impact the hot path for text processing. If your language is already doing custom allocation (such as using a copy constructor), the allocator savings are nil, so it becomes a space vs code trade-off.
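The flags above could live in a small bitfield on the string header; a sketch with made-up names:

```c
/* Possible flag bits on a string header, matching the list above
 * (all names are invented for illustration): */
enum str_flags {
    STR_ASCII_ONLY   = 1u << 0,  /* UTF-8 string known to be pure ASCII:
                                    O(1) character count and indexing */
    STR_LATIN1       = 1u << 1,  /* UTF-16 string storable as 8-bit Latin-1 */
    STR_ONE_GRAPHEME = 1u << 2,  /* contents are a single grapheme cluster */
    STR_IS_SLICE     = 1u << 3,  /* header holds an offset, not a capacity */
};
```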

FWIW the C++ standard actually prohibits CoW strings, and any C++ standard library that implements CoW strings is doing so in defiance of the standard. AIUI this prohibition was completely accidental, and I believe it's due to the rules around which string operations do and do not invalidate iterators.

GCC updated their implementation to become compliant, probably for these reasons.

Worth noting that Rust is hit by the additional branches worse than C++ due to the lack of move ctors/move assignment operators. No way to safely store an interior pointer to the buffer as a result.

That said, as mentioned the other features do make this hurt less for many cases.

To add to the other great replies, I should have also mentioned that String is a standard library type like any other; just because we didn’t implement it for String doesn’t mean that a package can’t implement a string with SSO if it’s important for the use-case. That’s also a factor.

From what I recall of the thread, (1) there was one thing in the String API that looked like it wouldn't play well with SSO, and (2) nobody provided really solid data showing that SSO was a consistent win.

The discussion seemed to me to support the conclusion that Rust will not introduce SSO right now, but the conclusion the maintainers actually drew was that Rust will not introduce SSO ever. Of course, that could always be revisited!

It can’t be revisited for String because the method is there, in stable, now.

From what I remember, the method wasn't impossible to implement with SSO, it would just be painful. But, exactly as you have, people took away the conclusion that SSO was impossible.

Yeah, we still could, but it would involve breaking compatibilities with a bunch of other guarantees, so it’s not really feasible IMHO.

libstdc++ and libc++ are both suboptimal in their use of available bits for SSO, but SSO23 [0] provides a string type which is optimal in that respect, with an interface compatible with std::string.

[0]: https://github.com/elliotgoodrich/SSO-23

Optimal in using bits, but possibly at a loss of performance (it needs to subtract two numbers to extract the string length)

(possibly because a) my observation that it needs an extra operation may be incorrect (I didn’t count all the binary ops in both algorithms) and b) I have no idea whether that makes any significant difference on modern hardware)

If there is loss of performance, it might be enough to offset the loss of ability to do SSO on strings of length 23 (how common are those? I would guess less than one in every million strings has that length)

Also, severely nitpicking: C++03 doesn’t say anything about the time complexity of c_str (https://softwareengineering.stackexchange.com/a/124784), so in that version of the language, we can do a tiny bit better, and use the wasted bits in the last byte to store a 24th character of the string, if that character wouldn’t make us decode the last byte as a valid string length.

C++11 said c_str must be constant time, so that wouldn’t work there anymore.

> C++11 said c_str must be constant time, so that wouldn’t work there anymore.

How so? Any SSO scheme should allow you to determine whether a string is short or not in constant time. Copying a 24-character string into a 25-byte zero-terminated buffer takes constant time, so c_str could simply do the copy.

You need to find space to copy the data to, too. If you allocate that space, your allocator needs to be constant time. I don’t know of any even somewhat competitive allocator with that property (a bumping one that gives up when it runs out of addresses would work, but isn’t competitive)

You could set aside a buffer to use, but you would potentially need one for every string in your program of which the compiler cannot prove that c_str won’t be called on it.

I think the only way to do that would be to use SSO and allocate a memory buffer, and make it possible to find that buffer in O(1) from the ‘SSO’ version. That’s doable, but I wouldn’t call it an optimization anymore.

Also, doing that and keeping the operation constant time would mean you would have to spend an equivalent amount of time on non-short strings. That can be done, but not in a performant way.

Greetings, fellow language lawyer! Pedantic you are, but this time you were not quite pedantic enough :)

I think you're conflating two different meanings of "constant-time". The one the C++ standard cares about is "the time this takes can be upper-bounded by some constant that doesn't depend on the string length or other stuff."

There is a different idea of constant time that's used in crypto when people worry about timing side channels: "the time this takes should be exactly the same, every time."

So, no you need not "spend an equivalent amount of time" on long strings. As you pointed out, you DO have to make sure your memory allocator is good enough. But actually this is easier than you admit: you don't need constant time guarantees for arbitrary allocations; you only need them for allocations of EXACTLY 25 bytes. In practice, you might get this for free, since many allocators try to optimize small allocations.

But the other problem you alluded to sinks this proposal, even in C++98. Because, like you said, you have only two choices:

1. pre-allocate emergency buffer, which defeats the point, or

2. allocate "just in time" which might FAIL if you run out of memory

Since C++ does not allow c_str to ever fail, you can't rely on dynamic allocation to implement it. Which is a bit of a pity since in practice, if allocations that small are failing you have bigger problems anyway most likely.

> Copying a 24-character string into a 25-byte zero-terminated buffer takes constant time

That's not constant time, that's linear/O(N) time.

It is constant because O(24) = O(1).

Do you know if this is going anywhere? Doesn't look like it has been touched in a few years.

Thanks for the feedback. You are right. I added some text about that and mentioned you at the bottom.

C++ implementations seem to store the capacity in the string object instead of next to the referenced memory as I described in the article. Do you know why?

I would guess compiler writers have data that indicates that the advantage “longer strings can be stored using SSO” beats the disadvantage “every short string is longer”.

It also means accessing the capacity field doesn’t involve accessing another cache line, but I can’t think of situations where that matters, as every time code needs to access capacity, the CPU will soon do something with the pointed to object (if only to free it. Or are there real-world allocators that don’t look at memory ‘near’ to-be-freed objects?)

Equality comparisons might short-circuit if two strings differ in length (they do in Java). I’d never thought about it before, but I’m guessing that only makes sense if you assume normalization already happened.

Good example.

AFAIK, C++ strings are little more than sequences of char, and string equals compares those sequences. It won’t notice when the two strings happen to be equivalent under some Unicode normalization.

The article does mention short string optimization: “One clever optimization is to store small strings directly instead of allocating and referencing a chunk.“

In fact it links to the same post, so perhaps it was added in response to your comment?

This is essentially the same optimization as storing small-range integers as "fixnums" whereas the larger ones are heap-allocated "bignums".

I personally think we should have a plethora of data structures representing strings. Sure, I want a generic fits-all structure to represent text as provided in many libraries and frameworks, such as .NET's String class. However, I have many situations where using that same String class is pure overkill and prone to security bugs. For instance, I want a DigitsOnlyString, representing only the digits [0-9]. But I also want a NumbersOnlyString, representing the digits [0-9] but also all other number characters there are in Unicode. The same for e.g. emojis, Latin characters ([a-zA-Z]), extended Latin ([a-zA-Z] plus a subset of diacritics), Chinese characters, and I can go on for quite some more examples (Password, anyone? Don't allow a lot of Unicode edge cases). These classes would have the benefit of really efficient encoding in bits, being more secure, more clarity for programmers and, most important of all, clearer reasoning. I bet 80% of String uses could be moved to a more constrained type.

All those are a set of verification functions tied to strings ... we really don't need native data types representing this. I'm not sure why you are not able to find a lib/function that easily verifies such things. Programming languages are the basic foundation, which should be concerned with performance, organization and making sure freedom is given to the programmer. These are all specific edge cases for the programmer to define, not for the programming language to worry about.

Would you rather they speed up the compiler/interpreter or add edge case string classes? It's not like there isn't a trade-off. This is a bottom of the barrel concern.

> Programming languages are the basic foundation, which should be concerned with performance, organization and making sure freedom is given to the programmer.

This is one view, but there are others. For example, instead of focusing on performance and freedom, the programming language could focus on safety or formal properties.

E.g. there should never be code that compiles but is unsafe, or which can have certain classes of common bugs.

That’s a valid view, but it would definitely be powerful if the language’s type system could express restricted character sets, right? You can already do something like this with e.g. Ada’s constrained subtypes, like encoding the length of a string or the range of a number into the type, making it statically impossible to accept invalid input. That’s really cool!

For what purpose would you disallow certain unicode sequences in a password? I'd expect the user's input to be treated as a sequence of bytes which gets passed directly to your hashing function.

> The only advantage of this representation is space efficency.

Ignorant nonsense. There are other advantages.

Advantage: If s is a non-empty string, then s + 1 is the rest of that string. s + 1 is an extremely cheap operation: incrementing a pointer.

This allows recursion on strings. For instance, strchr can be coded like this:

   char *strchr(const char *s, int ch)
   {
      if (*s == ch) /* including ch == 0 case */
        return (char *) s;
      if (*s)
        return strchr(s + 1, ch); /* tail call */
      return NULL;
   }
A variety of useful algorithms can be implemented in this simultaneously elegant and efficient manner.
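Another algorithm in the same style: a recursive length count, written with an accumulator so the recursive call is a genuine tail call (a sketch, not the library strlen):

```c
#include <stddef.h>

/* Count characters by recursing on s + 1; the accumulator makes the
 * recursive call a tail call, so with tail-call optimization this
 * compiles down to the usual loop. */
static size_t strlen_rec(const char *s, size_t acc) {
    return *s ? strlen_rec(s + 1, acc + 1) : acc;
}

size_t my_strlen(const char *s) {
    return strlen_rec(s, 0);
}
```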

Advantage: null terminated strings can use exactly the same external representation as internal. They can be sent over the network as-is or written to files. There is no question of what format is this header, how wide is this length field and so on. Strings of wide characters just have a simple issue: how wide (16 or 32), and what endian.

> Ignorant nonsense.

I could do without that - it's hardly constructive.

Of course, most actual strchr implementations tend to be heavily vectorized hand-tuned assembly with macros to select the correct streaming extension supported by the hardware. Or, for less powerful systems, something like

  char *strchr(const char *s, int ch) {
  	const char *p = s;
  	while (*p && *p != ch)
  		++p;
  	return (char *) p;
  }

  s/less powerful systems/systems with lesser compilers/
With TCO, I'd expect the same code from the recursive version.

The recursive version has the benefit of being pure; it doesn't mutate anything. Speaking of which, HN ate your ++.

> There is no question of what format is this header, how wide is this length field and so on.

It's not that simple. You either have forbidden characters then, or escaping - and with the latter you don't know how wide the content is. Null-termination in fields is a common security issue.

Forbidden characters can occur in other string representations, like length + data.

Escape sequences are higher level syntax being represented by the string; they don't belong in the string representation.

A transfer encoding like UTF-8 isn't a string representation either.

Manipulation of syntaxes and encodings stored in character strings can be buggy. The representation does contribute to the risk. E.g. if a UTF-8 decoder doesn't reject overlong forms, then a null character can be encoded in the middle of the data so that to the buggy function or program, it looks like a different string from one that was validated earlier.

The first sentence:

"The C programming language defines a string as char* ."

No, it doesn't. A string is by definition "a contiguous sequence of characters terminated by and including the first null character". A char* value may or may not be a pointer to a string.

(Confusing arrays and pointers is perhaps the most common mistake people make when talking about C.)

And in C++, the type of a string literal expression is not const char* , but in fact const char (&)[], ie a reference to an array of chars. I was surprised to learn this as I always assigned a literal to a const char* , relying on pointer decay without knowing it.

In C, the type of a string literal is "array of n char", or

    char[n]
where n is the length of the literal plus 1.

In C++, it's "array of n const char", or

    const char[n]
It's not of a reference type. See the ISO C++11 standard, 2.14.5 [lex.string].

In both C and C++, array expressions "decay" to pointer expressions in most contexts. But for example

    sizeof "hello"
is 6 (the size of the string), not the size of a char* pointer.

A string in C is an implicit type; it's a specialization[1] of an array of characters. Another example of implicit type are Lists in Lisp (specialization of nested pairs).

[1] https://en.wikipedia.org/wiki/Specialization_(logic)

A string in C is entirely built upon convention; it exists in the words of a document only.

Common Lisp has a list class that you can specialize a CLOS method on.


  [1]> (typep nil 'list)
  T
  [2]> (typep '(a) 'list)
  T
  [3]> (typep 3 'list)
  NIL
  [4]> (subtypep 'cons 'list)
  T ;
  T
  [5]> (subtypep 'null 'list)
  T ;
  T
So, kind of apples and oranges.

I was referring to John McCarthy's definition of a list. IIRC, Common Lisp's list type includes both proper and improper lists. (typep (cons 1 2) 'list) evaluates to true but (length (cons 1 2)) throws a type error (not a list).

I guess you could think of it as an "implicit type". I prefer to think of it as a data format rather than a data type. (It doesn't satisfy the meaning of "type" as C defines the word.)

char[n] (for constant n) is an array type. An object of type char[n] may or may not contain a string.

> There are two relevant lengths of Unicode strings: memory size in bytes and number of grapheme clusters.

There is also the number of codepoints, which is relevant when you implement Unicode-related algorithms.

And then there is display length (which counts full-width characters as two characters, even in monospace fonts, and zero-width characters as zero).

Finally, if you think grapheme clusters are the right way to count characters in a string (and often they are), then you should also allow string indexing at the grapheme cluster level.

I lowkey wonder how much memory is consumed, on average, by the extra string info in map-heavy languages like JS and Python, and how much CPU time is spent on strings there. While C doesn't need too many strings and mostly throws around numbers, a Node.js app may easily be dealing with thousands of them, not even counting HTML and templates.

(And now I also wonder if any compilers and VMs precompute hashes for map and object keys.)

Slightly related: Lua will intern every string ever created, that is two strings are equal if and only if their pointers (as represented internally) are identical. This results in slow string creation, but very fast equality checking and lookups, as it converts the problem to comparing integers.

> (And now I also wonder if any compilers and VMs precompute hashes for map and object keys.)

I don't think I've ever come across one that does. But then, in most situations where they could, they can do something even more efficient anyway. Fast JS compilers replace an object's hashtable with a dense array of fields, accessed by index. Lots of language implementations do method lookup in a vtable, rather than a dictionary.

Out of all the string types I have used, I think Go has the best of them. Since it's a GC'd language, you don't have to pay the refcount atomic overhead. The representation is only two words long. Strings are immutable, making them suitable map keys. Strings don't carry any encoding information; they are plain byte-arrays.

This looks like good information about the memory representation, but we would be better off without string as a data type at all, at least not as a concrete type.

Strings as they commonly work are leaky abstractions which for decades have bred confusion among programmers over how to handle text properly.

Text/character encoding bugs are a matter of type safety. I believe most of the text encoding bugs in the history of software could have been avoided if we used our type systems to handle text encodings explicitly.

UTF-8, UTF-16, UTF-32, Windows-1252, ISO-8859-1, US-ASCII, et al.; these should be our essential concrete data types. The concept of a string should just be an abstraction over them for polymorphism.

At no point in the lifecycle of a piece of textual data should there ever be any question over how it is encoded. So why don't we let our type systems do this bookkeeping for us?

I have written some programs dealing with Indic scripts, and when working with strings I prefer to see them only as a sequence of Unicode codepoints; it's a clean abstraction compared to the encodings which are messy (and half of them don't even support the scripts I care about).

While text is in memory inside your programming-language runtime, you don't care about the encoding; for all I know it may not even be one of those encodings. You only have to deal with encodings when dealing with the "external world": when you fetch (or send) bytes from (or to) disk / network / the user. These are anyway out of the scope of the type system, so the type system wouldn't help you: the encoding needs to be communicated to you out-of-band.

(See https://nedbatchelder.com/text/unipain.html for the “Unicode sandwich” approach: convert to bytes at the boundaries of your system.) So using a type system to specify encodings isn't really helpful; what a type system can help you with is clearly distinguishing between bytes and text (separate types), and forcing you to specify an encoding when converting between them.

> I prefer to see them only as a sequence of Unicode codepoints

You can still have this with encodings as data types.

Encoding is definitely a concern at the boundary, but to say that it is a concern only at the boundary is too simplistic. What if you want to save the cost of converting encodings from that which was input at the boundary? And at some point data will have to leave your application, and this often requires munging text in a number of encodings.

> So using a type system to specify encodings isn't really helpful

Imagine an error at compile time when you accidentally try to append an ISO-8859-1 string to a UTF-8 string.

> The only advantage of this representation is space efficency.

Not even close to true. Simplicity and future-proofing are two more, at least.

Reference to PDP-11 assembler is a furphy.

You only have two ways to store anything at the lowest level: 1. length + data or 2. data with sentinel value.

If K&R had chosen option 1 we would have versions of strings with 8-bit lengths, 16-bit lengths, ..... all the while incurring a base load of inefficiency within any program. (In fact I'd warrant programs would use data+sentinel internally anyway.)

There are now many many "safe string" libraries for C. Use them if you like. The fact there are so many and get so little use tells you something.

Is length+data safer? It's easy to lie on the wire, so I don't think so. Not by much, anyway.

BTW, one way to provide future proof (length, data) format is to encode the length with UTF-style encoding. So the length field would have enormous range.

This could be a good question for interviews. A good candidate should be able to come up with several solutions for implementing strings, and explain the pros and cons of each solution.

The prototypical full-stack software-as-a-service web developer or the standard in-house bespoke application writer knowing how to implement a string wouldn't tell me anything about how they can design or write real-world systems.

If their day to day job is solving hard problems at scale then maybe. But most companies overestimate the complexity of what they are really doing.

What would tell me how they write real world systems - and how I interview - is to give them a simple real world skeleton of a problem with failing unit tests, sit them down at a computer and make them make the tests pass.

If I am hiring for an architect level position, I am not going to have them write algorithms on a board. I am going to do the same set of tests as above and have them design a scalable, fault tolerant system and have them draw it out.

A good candidate for what?

I've been through enough interviews on both sides. The only thing that will tell you anything of importance is paying candidates to perform real tasks.

If doing algorithms on a whiteboard is part of the job; then sure, go for it. But I suspect very few of the really good coders out there would make much of an impression in that setting.

I think if you modify it to simply state that having a candidate think out loud about mechanisms for storing strings is a useful exercise in evaluating their problem solving ability, it’s a pretty good idea.

I’m rarely concerned about the right answer in an interview, because those are generally searchable.

But it's still pretending, solving canned problems with right/wrong answers is not the same as doing the real thing. You'll get completely different behaviors out of most people. I wouldn't be surprised if plenty of whiteboard surfers and sweet talkers turn out to be less than ideal under pressure when there are no clear answers and no authority to back them up.

Thankfully I’m rarely hiring for positions where someone is developing software with a gun to their head.

I want to hear a candidate's train of thought. I want to see their ability to reason about multiple solutions, weighing their benefits and drawbacks. The problem at hand should be something trivial - no whiteboard algorithms - so that the solutions themselves come easily (or may even be provided); the important thing is to test the ability to reason about things like difficulty of implementation, performance, etc.

“Just doing real world stuff” might be another useful but separate part of an interview.

I get what you want, it's the same thing everyone else wants. That doesn't imply it's possible to get it. What we're doing is pretending it works, it's not the same thing.

When I'm working on solving real problems, I'll come up with ideas that make your head spin. Put me in front of a whiteboard and feed me canned problems with right/wrong answers and you won't ever see me again.

I would never ever have a whiteboard or even an editor in an interview. Arguing the difference between zero termination and length prefixing is a ”real problem” of the kind I have to face multiple times on a daily basis.

Also, solving problems without considering all the (or at least multiple) solutions isn’t really a very useful skill.

That said, all this varies by the role. If I was hiring a contractor position or regular web dev then “solving the problem” is an essential skill and solution bikeshedding might not be. But I’m hiring developers for low-ish level work on lib style code with 30 year maintenance windows. It’s radically different from creating a web stack with maybe a 5 or 10 year total lifespan.

A good candidate for ... maintaining the legacy flagship product, in which N software components each contribute their own string implementations, and there is N+1th module implementing conversions from any of those strings to any other.

God bless his/her soul :)

Perhaps you could include a comparison with BSTR implementation used extensively in Windows COM interfaces?

HSTRING (from WinRT) is probably more interesting at this point.

How do I store \0 in a C string?

And work with the standard library’s string functions? You can’t.

Go slices sound similar, though they are always mutable.
