
This is the problem with C++ (and, to some extent, with C)

"Everybody" has their own C++ string (MFC, QT, several libraries, heck, even GObject library has their strings)

Same with C

Now, you won't see anybody reimplementing strings in Java, C#, Python, JS, etc. Lessons learned, their strings work

std::string is certainly a step forward and should be used for most projects today



"Lessons learned, their strings work"

Except that they don't. Either they are 'complete' but massive and thus slow, or they start as 'array of bytes' and then their designers spend 10 years implementing a more 'complete' string type that is still fast enough, and they end up at option #1 anyway.

Of course the C++ way where there is no string type that everyone uses sucks too, it's just that strings are almost impossible to get 'right' because there is no real 'right' and so many special cases that aren't apparent at first sight.


I think this bears trumpeting: strings are hard! It's easy to gloss over their issues via garbage collection and pervasive heap allocation, but once you're in a domain where you care about stack allocation and avoiding copies you start running into difficult tradeoffs (above and beyond even the question of string encoding, which is a different beast altogether).

Speaking as a dynamic language programmer who's trying to break into systems programming, it took a long time for me to fully appreciate the difficulties around strings. And now that I do, I'm perpetually paranoid about how many allocations my Python programs are doing behind the scenes...


If you care about performance use std::performant_but_tricky_string. 99% of code doesn't care about string-related performance, but needs string anyway.


If you care about performance at all, strings are going to slow you down in unexpected and counter-intuitive ways.

Far better if you pass around text data as binary buffers (with metadata describing the encoding, please), and only convert those to strings once they are ready to be consumed by the user (which is typically not where performance bottlenecks show up anyway).
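A minimal sketch of that idea in Java (the TextChunk name and design are made up for illustration): carry the raw bytes plus a Charset, and only decode to a String at the point of consumption.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Hypothetical holder: raw bytes plus metadata describing their encoding.
    // Decoding to a String is deferred until the text is actually consumed.
    final class TextChunk {
        private final byte[] data;      // still-encoded text
        private final Charset encoding; // the metadata asked for above

        TextChunk(byte[] data, Charset encoding) {
            this.data = data;
            this.encoding = encoding;
        }

        byte[] raw() { return data; }                           // cheap to pass around
        String decode() { return new String(data, encoding); }  // pay the cost at the edge
    }

    class TextChunkDemo {
        public static void main(String[] args) {
            byte[] fromWire = "héllo".getBytes(StandardCharsets.UTF_8);
            TextChunk chunk = new TextChunk(fromWire, StandardCharsets.UTF_8);
            // ... hand chunk.raw() through the pipeline without decoding ...
            System.out.println(chunk.decode()); // decode only when shown to the user
        }
    }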


If you care about performance a lot, it may well be that most of your critical paths are in numeric code, and strings are only used to read input and write output. So you should just use strings unless profiling shows problems there.


I wonder how cavalierly people treat strings when they feel like "primitive values" in the language - like Java having the '+' operator for concatenation.

I don't agree with the view that "costly" (it's all relative, but anyway) operations should look "costly" (i.e., be a relative eyesore). But I don't doubt that it can affect one's mindset.


Chrome is a web browser, not a real time operating system.


It boggles the mind how many different ways to represent strings there are in C++, and string handling in general is the major reason I'll never touch it (or C) with a ten-foot pole. I'm interested to hear, though, how the string handling in Java is broken. This is everything I have to know:

- String

- CharSequence

- char[] / Character[]

- StringBuilder

Done. Finito. Strings are immutable, the GC will clean up after me. Equality works, Unicode works. Only downside: more memory needed, but nobody except the embedded guys cares anymore.
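For reference, a tiny sketch of those pieces in everyday use (nothing beyond the standard library assumed):

    public class BasicStrings {
        public static void main(String[] args) {
            String a = "héllo";
            String b = new String(a);           // distinct object, same contents
            System.out.println(a.equals(b));    // true: value equality works

            StringBuilder sb = new StringBuilder();
            for (char c : a.toCharArray()) {    // char[] view of the string
                sb.append(Character.toUpperCase(c));
            }
            CharSequence upper = sb;            // StringBuilder is a CharSequence
            System.out.println(upper.toString()); // HÉLLO
        }
    }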


Except that:

- In CJK languages, some Unicode code points can't be represented in 2 bytes. They take up two 'char' values, forming a surrogate pair.

- In the presence of surrogates, charAt() and length() give wrong answers. Their indexes refer to the number of 'char' objects up through that point, not the number of Unicode codepoints. If there is a surrogate codepoint present anywhere before your index in the string, you will be off.

- To help get around this, the Java APIs added codePointAt and codePointBefore. These are still broken; the indexes are based off of chars, not codepoints.

- To get around this, we have codePointCount and offsetByCodePoints. Finally, these are semantically correct. However, they give up O(1) string indexing (see the sketch below).
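To make those indexing pitfalls concrete, a small sketch (U+1D11E, the musical G clef, is just an arbitrary non-BMP example):

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1D11E is outside the BMP, so Java stores it as a surrogate pair:
            // two char values for a single code point.
            String s = "a\uD834\uDD1Eb"; // 3 code points: 'a', G clef, 'b'

            System.out.println(s.length());                              // 4 char units
            System.out.println(Character.isHighSurrogate(s.charAt(1)));  // true
            System.out.println(s.codePointCount(0, s.length()));         // 3 code points
            System.out.println(s.offsetByCodePoints(0, 2));              // 3: char index of 'b'
        }
    }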

Did you know all of this? There's more, too, where calling CharSequence.subSequence causes a memory leak because you're pulling out a small portion (say, 10 bytes) of a large backing buffer (say, 1MB), and storing it in a persistent object prevents the GC from collecting the buffer, since a portion of its data is still live. This has caused real memory leaks in Google servers, enough that they had to educate the whole developer base about the pitfalls of it. But I figured I had enough to pick on with Java's broken Unicode handling to illustrate the grandparent's point: strings are hard. If you think you understand them, you're probably not aware of some of the trade-offs between performance, correctness, multilingual support, and developer APIs.


I was aware of the substring issue and the fact that CJK languages require 2 chars, but I might have used codePointAt() instead of offsetByCodePoints(), to be honest. So I agree Unicode string indexing is not as easy as it should be, due to a stupid API - but if that's all, I'm not sure that passes for 'hard'. It is also not something inherent to strings in general; the language designers simply messed this up.

Btw., I think they fixed the memory leak in recent VMs by removing that optimization.


The point is that these are trade-offs. The reason for the stupid API is that by getting precise multilingual handling, you give up either O(1) string indexing or representing characters in fewer than 3 bytes/char. If you naively use offsetByCodePoints on megabyte-long strings, you may find your performance slows to a crawl.

The reason for the memory leak is that you can either have a fast substring() or freedom from accidental memory leaks when substrings are stored in persistent data structures. Not both. If they removed the optimization, then suddenly str.substring(1) takes O(n) time and on a megabyte string allocates a full copy of the entire megabyte. In some other use-case, the full copy is far worse.
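A rough sketch of the mechanism (illustrative only, not the real java.lang.String internals) shows why the sharing optimization leaks: the "substring" is just an offset/length view over the original char[], so retaining the view retains the whole buffer.

    // Illustrative only -- not the actual JDK implementation.
    final class SharedSubstring implements CharSequence {
        private final char[] backing; // the full original buffer stays reachable
        private final int offset;
        private final int length;

        SharedSubstring(char[] backing, int offset, int length) {
            this.backing = backing;
            this.offset = offset;
            this.length = length;
        }

        @Override public int length() { return length; }
        @Override public char charAt(int i) { return backing[offset + i]; }
        @Override public CharSequence subSequence(int start, int end) {
            // O(1) and allocation-free, but keeps the entire backing array alive
            return new SharedSubstring(backing, offset + start, end - start);
        }
        @Override public String toString() {
            return new String(backing, offset, length); // copying breaks the sharing
        }
    }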

There are others, too. Idiomatic Java uses '+' for string concatenation, but this is O(n^2) time when done repeatedly. (Except that HotSpot optimizes out the repeated allocations when all concatenations appear in the source code, but it still can't do anything about loops.) To get around this performance pitfall, there is StringBuilder. Now Java programmers need to know 2 APIs. Well, more like 6, considering there's also CharSequence, ByteBuffer, StringBuffer, and char[].
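A small (unbenchmarked) sketch of the two concatenation idioms:

    public class ConcatDemo {
        public static void main(String[] args) {
            int n = 10_000;

            // '+' in a loop: each iteration copies everything built so far -> O(n^2)
            String plus = "";
            for (int i = 0; i < n; i++) {
                plus += i + ",";
            }

            // StringBuilder: appends into a growable buffer -> roughly O(n)
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) {
                sb.append(i).append(',');
            }
            String built = sb.toString();

            System.out.println(plus.length() == built.length()); // true
        }
    }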

C++ has always had the philosophy of "The programmer knows their requirements better than we do". This is why it is complex; because problems are complex, and it is designed to solve many problems. It's very rare for a language designer of a popular language to actually be stupid; it's very common for a language to get used outside of the domains that the language designer optimized for.


Even if you use 3 bytes per char you are not guaranteed O(1) indexing of user-perceived characters, as you will still have combining character sequences :( So you need to normalize these first... or deal with them in searching :(

Strings in more modern languages are standardised in one form and widely used across all libraries. C and C++ are the exceptions to this.

For example, I hardly ever see CharSequence implementations in Java code. It is the most underused interface I know of, because for most people String is good enough. In idiomatic Java code the performance issue discussed here does not occur. Strings being final and passed by reference (or, more correctly, the reference is copied by value), the same string does not get reallocated all over the place.

The C++ problem is that std::string does not suffice for the common case, and we end up with QStrings that are difficult to convert to boost::string.

In Java land, when String does not meet the perf needs, we implement a custom CharSequence, but these are easy to convert to Strings if needed (easier than QString to a Boost string).
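As a hedged sketch of what such a custom CharSequence might look like (the AsciiSequence name and design are hypothetical): a zero-copy view over ASCII bytes, with toString() as the cheap escape hatch back to String.

    import java.nio.charset.StandardCharsets;

    // Hypothetical: a CharSequence over ASCII bytes that avoids copying into a
    // char[] until someone actually needs a String.
    final class AsciiSequence implements CharSequence {
        private final byte[] bytes;
        private final int offset;
        private final int length;

        AsciiSequence(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
        }

        @Override public int length() { return length; }
        @Override public char charAt(int i) { return (char) (bytes[offset + i] & 0x7F); }
        @Override public CharSequence subSequence(int start, int end) {
            return new AsciiSequence(bytes, offset + start, end - start);
        }
        @Override public String toString() {
            // The easy "convert to String if needed" path mentioned above.
            return new String(bytes, offset, length, StandardCharsets.US_ASCII);
        }
    }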

So we end up with most Java programmers knowing just StringBuilder, String and char[]. ByteBuffer is really about bytes and easy to convert to a CharBuffer.

The Java language has some downsides in standardising on UTF-16 for its common String type (moving on from UCS-2). But even if it had chosen extended ASCII like Oberon, it would at least be consistent everywhere.

I believe that javac replaces + with StringBuilder calls. And in loops the multiple StringBuilder allocs are removed and replaced by appending to one StringBuilder.

Java Strings are far from perfect in Unicode terms but a whole lot better than std::string and its 16- and 32-bit friends.


You made some good points, but I'm not sure all of these are necessary trade-offs.

Let's say that the issue of code points not fitting into 16-bit chars requires a separate indexing API, to preserve O(1). They could have solved this one with better method naming, or at least improved documentation, so people are aware of the fact. The String#charAt() Javadoc is not really useful unless you fully understand the implications of: "If the char value specified by the index is a surrogate, the surrogate value is returned." Also, if you are handling CJK strings, is there actually a need to split them by charAt()? The runtime could just tag such strings and use the fastest indexing method, falling back to the slow one for those.
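The "tag such strings" idea could look roughly like this (an entirely hypothetical helper, not anything the JDK actually does): scan once for surrogates, then take the O(1) path whenever the string is surrogate-free.

    // Hypothetical helper sketching the "tag and pick the fast path" idea.
    final class CodePointIndex {
        private final String s;
        private final boolean hasSurrogates; // the "tag", computed once

        CodePointIndex(String s) {
            this.s = s;
            boolean found = false;
            for (int i = 0; i < s.length(); i++) {
                if (Character.isSurrogate(s.charAt(i))) { found = true; break; }
            }
            this.hasSurrogates = found;
        }

        // i-th code point: O(1) for surrogate-free strings, O(i) otherwise.
        int codePointAtIndex(int i) {
            if (!hasSurrogates) {
                return s.charAt(i); // every char is a whole code point
            }
            return s.codePointAt(s.offsetByCodePoints(0, i));
        }
    }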

Concatenation: true, and that the JIT can't optimize most loops by using StringBuilder is a little embarrassing. Abstractions are always leaky, but this one, I think, could be improved, so that more cases would be made faster by the runtime.

Of course language designers aren't usually stupid, but it's not like they all were created in some cozy Languages Workshop, where they could take their time to ripen, being guarded over by benevolent gardeners. See PHP, see JavaScript, and of course Java, too. We don't have to accept accidental complexity as a given; there are useful abstractions, and we should use them. Rust and C# are better than previous attempts, imho, because they got proper funding and are designed by people who know what they're doing. And the whole field of software is better for it!

If you are Google, you are in the unique position of having top talent and huge amounts of servers. In that case, the trade-off to use C++ is probably the right one. People can handle its complexity, and it pays off due to increased performance, because the code runs on a million servers. But this simply isn't true for 99% of the business, and using C or C++ is not the optimal choice for them.


Hehe, ignorance truly is bliss, I guess. Just ask yourself: what size is a char in Java? To unearth 90% of the problems with strings in any language, ask two things: first, what size is a char? And secondly, what is the length of a string? If you cannot talk about these things for an hour, you don't really understand how computers deal with strings.


Actually, no. 90% of the problems with strings in any language is somebody screwing up the character encoding, usually out of ignorance.

The fact that every object in Java has some overhead and probably needs some padding for alignment is utterly irrelevant. But since you asked:

- 8 bytes generic object overhead per String

- 4 bytes for the char[] ref

- 12 bytes for the char[] itself, if non-null, plus probably 4 bytes padding

- 12 bytes for length, offset and hash code int fields (3*4 bytes)

- and 2 bytes per character stored

So I guess 40 bytes for the empty string should be about it. Happy?

My customers' servers usually have 8GB of RAM; 16GB is becoming the norm. Nobody cares anymore. Maybe it's important in your field, I don't know. Not mine, though.

So did I ever need to know this piece of trivia? No.

Did I have to fix someone else's code which was relying on the platform default encoding? Lots of times.

PS: actually, in C you get the usual security nightmares on top of the encoding stuff.


> - and 2 bytes per character stored

No, 2 bytes per code unit. A single UTF-16 character requires one or two code units. So a single "character" (code point in Unicode terminology) is either 2 or 4 bytes in Java. Additionally, a single Unicode character can require multiple code points.


What many C++ programmers often forget is that most of those overheads exist in C++ as well. They are only less visible. String length? Check. Char array header? 8-16 bytes reserved by the allocator and an additional hidden length field created by the C++ compiler. Array pointer: another 4 or 8 bytes. I guess the only thing C++ saves is the fact that the string object itself can be stack allocated. That's just 8 bytes for the object header saved.


Very late reply, but to clarify, what I meant was not 'what is the size of a string' but 'what is sizeof(char)'. Meaning: if your char is a fixed size, it can't represent all characters or is wasteful in 95% of all cases (8, 16 or 32 bits); if it's variable length, the formal complexity goes up for a number of often-used operations.


> My customers' servers usually have 8GB of RAM; 16GB is becoming the norm.

Do cache sizes not matter? Honest question.


That's not so easy to answer. It starts to matter a lot when you get into very high-performance architectures (think the Disruptor, anything requiring lots of mechanical sympathy, highly contended memory access, etc.), but usually you're waiting for the database or the network anyway.

Supposing you meant memory bloat compared to C, increased developer productivity is almost always more important. Things like memory access patterns can be important when you are interested in optimizing hot loops, but not generally.


> Only downside: more memory needed, but nobody except the embedded guys cares anymore.

As an Android user, I care. Android needs 2GB of memory to run the apps that iOS only needs 1GB for (running Android on a tablet with 1GB has been painful for me).


Mobile borders on embedded imho, and I agree that Java wasn't the smartest choice there. I'm on iOS, so I don't really know, but didn't Google push for 512MB minimum RAM with Android 4.0, to better serve the emerging markets?


> Lessons learned, their strings work

It is worse than that, because the other alternative systems programming languages that eventually lost to C and C++ also had their string type.

Only C did not, and C++ followed along due to its compatibility story.


And here is where C shines compared to C++:

At least in C you know that strings are garbage, and you never pretend for a second that everything will be fine. C++ will claim "oh ho ho, we have a string type, don't worry!" and then people will get burned.


If you read the discussion, you would see that people are getting burned by converting all the time between char* and std::string, because Google isn't using the same string types across all APIs.

I'd rather have secure strings, even if every now and then they require some attention.


So in C# (on the .NET VM), when you allocate a string that's more than 65K, when does it get freed?

I wouldn't call that 'working'.


I guess the underlying problem is that C++ doesn't have a standardized ABI. So you get things like QStrings, etc.


It is a standardization problem.

When C++ was introduced, it had only the C standard library, so everyone wrote their own string library.

C++ was already 10 years old when C++98 was a thing and compilers still needed to catch up.

You don't re-write old code just because, so this cruft shows everywhere.


It's not a matter of ABI. Even source-code-level compatibility would solve most of the problems. I am happy to recompile as long as I don't need to convert on every API boundary.



