The Horror in the Standard Library (zerotier.com)
830 points by aw1621107 on May 6, 2017 | 216 comments

OMG, as I was reading this I thought, "man, this reminds me of a bug I ran into with std::string back in 2000." A few sentences later, and this is also about std::string and the STL.

Mine was different though. After tracking down a memory leak that was triggered by merely creating a new empty string, I discovered that the stdlib kept a shared pointer to the empty string, with a reference count of how many locations were using it (ironically, this was intended to save allocations). This was on Intel, and we had what was rare at the time: a multi-processor system. The empty string's reference count was updated with just a vanilla ++: no locking, no atomics, the variable not even marked volatile. Nothing.

A few emails with a guy in Australia, a little inline assembly to call a new atomic increment on the counter, and the bug was fixed. That took two weeks to track down, mostly because it didn't even cross my mind that it wasn't in my code.

From that point on, I realized you can't trust libraries blindly, even one of the most used and broadly adopted ones out there.

> you can't trust libraries blindly, even one of the most used and broadly adopted ones

There is a corollary in development and debugging. When things break in mysterious ways, we tend to go through a familiar song and dance. As experience, skills, and even personal networks grow, we can find ourselves diving ever deeper into the following chain.

1. "It must be in my code." -- hours of debugging

2. "Okay, it must be somewhere in our codebase." -- days of intense debugging and code spelunking

3. "It HAS TO be in the third party libraries" -- days of issue tracker excavations and never-before-enabled profiling runs

4. "It can't possibly be in stdlib..." -- more of the same, but now profiling the core runtime libraries

5. "Please let this not be a compiler bug" -- you become intensely familiar with mailing list archives

6. "I will not debug drivers. I will not debug drivers. I will not debug drivers."

7. "I don't even know anyone who could help me figure out the kernel innards."

8. "NOBODY understands filesystems!"

9. "What do you mean 'firmware edge case'?"

And the final stage, the one I have witnessed only one person ever achieve:

10. "Where is my chip lab grade oscilloscope?"

Apart from bullheadedness, this chain also highlights another trait of a good developer. Humility.

From my experience this is normal for embedded development; particularly for consumer electronics. Part of the reason some developers in this space have to wear so many hats is that the pace in consumer electronics is unforgiving. I don't think my current employer is unusual either.

Hopefully, the number of frameworks at the top, and the size of your individual programs are relatively small (so that 1-3 aren't nightmares by themselves).

In my experience, 4-5 are seldom the problem (thanks Linaro!). I suspect the ratio of C to C++ is significantly larger in embedded systems though.

In general, PowerPC/MIPS/ARM toolchains and drivers are not as mature as x86/AMD64 ones. 6-8 tend to occur because CPU vendors usually have their own "blessed" toolchains and BSPs that have diverged from their upstream projects. Fortunately, this means someone else has often already fixed the problem. Just as often, though, a driver has not been tested for your use case since the last time that particular driver's infrastructure was refactored inside the kernel. Or... you wrote the driver and made the mistake (or it might be something from 9/10).

9-10 happen because we're often using hardware that is new and has not had all of its errata discovered yet.

When products need to ship, we're regularly going through this stack. I've seen every one of these, even in just the last 4 years.

Can confirm. I had a trippy experience where I had on one monitor some RTL+simulation for our chip up for view, on another I had the PCB schematic I had helped design, and on my third I had the GUI and embedded toolchain development environments up, and on my desk I had an oscilloscope measuring that PCB running that firmware. It was basically rolling through the list and really fun!

Indeed. I too had once the dubious pleasure of having an oscilloscope on my desk, between two computers and a prototype.

I've done the oscilloscope thing, though it was only a vanilla couple-of-hundred MHz scope on some pins that were bugging me, and not the full deal: "Gang way, we're cracking the lid of that thing and going in." That sounds exciting; I'd love to see it.

Once, I used a chunk of ice to cool down a chip, and that made it work. The hardware guys were unimpressed. But hey, they've got cans of Chill and they use them a lot, and this software guy took a while to realize the reason the board worked in the morning and was dead by lunch, and worked for a little while again after lunch, was temperature related.

There were some devs who tracked down a nasty bug in a processor's TLB. I only heard about that one, wish I had been there. I only had to deal with the fallout in the hypervisor. Note: If you have to spend 20ms hunting down and killing lies with all interrupts turned off and everything basically stopped in its tracks, you are no longer a real-time operating system.

Heh. It could be telling that I had to look up the expansion for TLB. CPU cache implementation... holy crap.

My ex-coworker has done the vanilla scope thing too and has a 400MHz scope at home. For some reason people like this are not too uncommon in Finnish oldskool[tm] IT scene. I remember how he isolated a latency and concurrency bug to an expensive interrupt handler. Rewriting isolated parts of core kernel code to make a really tricky problem go away was one of his more hardcore skills.

I'm not even near his level. My own experience is limited to slightly nibbling the edges of file system and block cache behaviour. It's a brave person who dares dive into that code. Not me.

But I do know one person who regularly works with decapped chips. He works for a company who do extremely low-level hardware investigations. Now that's hardcore.

Cache bugs are one of the fun ones. You think you're losing your mind and the people around you would probably agree. A couple of weeks go by, your spouse is ready to fire you, your boss wants to divorce you, and every waking moment is full of race conditions. Four-way stops on the drive to work are a source of stress and you punch buttons in the elevator and worry about firmware bugs. Then you get to your desk and there's the setup, a laughably small board for all the trouble it's made, and it's time for single combat, Sherlock Holmes style.

When you find the problem it's usually a blinding flash of realization that illuminates a tiny, eensy bit of code that you tweak and make right in a couple of minutes. Invariably the mistake was pretty stupid. The glory moment is over quickly because you know all the test cases will pass and that you've just nailed another one.

You've got bragging rights during one lunch, but that's it. It's off to more mundane bugs in the mortal world, and you feel a little sad.

I need to do hardware again.

I remember a 3G network signalling simulation I worked on back in about 2002. We ran it on a rack of custom servers. The CPU load was pretty hellish, and the only way we could get it to run reliably without segfaults was to install gaming cooling systems and underclock the CPUs ... ran like a charm then!

In the late '90s and early 2000s it was commonplace to fix OS panics by opening the computer and pointing a fan at it.

I would try it even before going for some harder software problems, because it's so easy.

Ever squeaked the chips on an Amiga?

People couldn't import computers into my country in the Amiga's day.

I had a locally made Spectrum clone; it didn't overheat, but I lost a multimeter to its power supply.

> 10. "Where is my chip lab grade oscilloscope?"

11. "Shit, where do I borrow a spectrum analyzer and a set of near field probes? These things cost an arm and a leg!"

Yes, STM32F1 MCUs generate interference that jams GPS receivers. No, it's not documented anywhere.

And here's the documentation for future generations! :P

12. "Try our new 7nm fab they said. You'll be ahead of 10nm competition with few issues. Now, gotta call engineers at the fab to see if it's materials or production messing my custom stuff up. (sighs)"

You'd be surprised how much you can find out with just a $10 TV tuner dongle and a piece of coax with a short section of the outer braid trimmed off at one end.

You're describing my college tv antenna. Coax is cheap. An actual digital antenna is like... 50 ramen equivalents.

50 ramen = $10 USD

RTL-SDR is the Arduino of EMC work. :)

Any MCU can "jam" a GPS receiver if the board is laid out improperly or without enough shielding.

A bare, free-floating STM32F103 with literally nothing but a LiPo battery connected with two wires, running the blinky.c demo, will completely jam many GPS receivers when placed next to the antenna.

11. "Hm maybe I should check my code again... ah there's the bug"

"oh, this config shouldn't be linked to /dev/null..."

    if (featureFlags[HN_DEBUG_HIER_FLAGS] = null) {
        /* Why won't this trigger!?!?!? */
...oh god, kill me now.

I've done the oscilloscope thing, but it was for IoT stuff - debugging a broken I2C communication with an Arduino (8-bit 16MHz ATmega CPU).

The Arduino software stack is not huge; there is no operating system involved. Our application is the only thing that runs on the bare, slow hardware with very limited memory. But this also makes debugging harder. The IDE is limited, you debug over serial output. You have to reflash the flash-memory after every re-compile, which can take a minute.

Building an IoT system for very specific tasks that has to run reliably for years without interruption, I would still use a tiny 8-bit ATmega CPU (e.g. Arduino), and to control this tiny CPU and do the networking with a control center I would use a 32-bit ARM CPU (e.g. RPi).

> The IDE is limited, you debug over serial output. You have to reflash the flash-memory after every re-compile

uh, you know that AVRs have debugWIRE (smaller parts) or JTAG (bigger parts)?

The furthest I have got down the list was trying to bring up the first prototype of a board that had been designed with too-long traces on the PCB between the SoC and DRAM. If you tried to read a location in memory, you got the value of the page table entry for that address rather than the contents of the address.

I once had to debug a poorly-designed board where the CPU would lock up if you did a DRAM burst write with at least 3 of the 5 highest bits of the word set (yes, I narrowed the test case down that far). A quick look at the layout confirmed that those traces were routed directly under the crystal oscillator without any form of ground shielding...

(We ended up underclocking the CPU by about 20% because there wasn't enough time for a redesign. Sigh. It's a miracle the thing even worked in the first place...)

... then your power supply goes marginal (because it will) and well . . . never been there :-)

I once had the opportunity to spend about a week debugging an incorrect configuration of SDRAM by the BSP team. At first I blamed a third-party library with no source code available. Then it occurred to me that my initial SDRAM tests were doing word-by-word access. The third-party library used memset, which was optimised to use DMA for bulk transfers, and that failed to write subsequent words in the same transaction.

An easy, one-bit change in the configuration registers of the SDRAM fixed it. A week well spent!

Similar: my new driver crashes the machine. A couple days debugging. Triple-check every register value. All good. It doesn't crash when I single-step! A couple more days debugging. Finally get it: the machine crashes when two ports are enabled close enough in time. Go talk to the hardware guys. “Yeah, we know, power traces are fixed in the next rev.”

I feel like "I can reliably make the bug disappear by turning on my debug harness" is a reliable sign that things are about to get weird.

Ooofff, that list made my stomach churn, more stuff of nightmares! All debugging post-mortems of this level should be written in Lovecraftian style.

... it's not widely known, whispers attribute it to a transcription error, unsure when it started, copied through ancient manuscripts, that the Dead Thing that lies dreaming at the bottom of the ocean, is actually named ... C++hulhu

I have also seen developers far too keen to blame the library before exhausting the most likely case that the issue is caused by the local code (step 1 and 2), or at any rate is fixable in it.

It's a well-known syndrome. The classic motto for it is "'SELECT' isn't broken" https://blog.codinghorror.com/the-first-rule-of-programming-...

This too. For every one time it's the parent article, 99 times it's my code.

Regarding #10: Oh Lord, I've been there too many times to count... One of the more memorable times was with an old timing distribution system. The thing would pretty much just send out clock pulses to networked machines, and this cost a lot of money to do properly (very abridged here). This particular one was acting 'funky' and came back in. In testing it, we got really weird behavior. True to your list, I think we went 1, 2, 5, 6 (no drivers, per se), 7 (for about 5 days), 8, 9, 10.

At 10, we finally plugged in the o-scope and started debugging the PCB vias and connections themselves. Things were getting really wacky now. The Faraday cage that was the testing room had to be re-grounded, we thought, as the wires themselves were still carrying current even when the power was disconnected. One of the guys brought in his old hand-held impact hammer to drive a new copper stake into the peeled-up linoleum and through the foundation of the building. Still, we got strange results. Results so strange that, to us, they were worthy of a Nobel Prize, as we had thus far proved to ourselves that physics herself was broken inside the lab. For reference: a lot of people worked in there, so having stuff about in all kinds of disrepair was typical.

I remember, long after the pizza had gone cold and the Mt. Dew was flat, looking up at the ceiling of the room. I saw an old RF horn hanging from the roof, kinda held together by its connecting wires. 'Hey, if that thing was on, would it do anything?' The other techs' eyes all lit up. Turns out, one of the guys had been doing something with the horn for some other test. He had left for an extended backpacking vacation and accidentally left the thing on. The broadcasting from the horn was adding a small amount of current to all our wires, throwing the whole box out of whack just enough to cause all the issues.

At about 4 am, we finally got the box re-configured, the original problem from the customer solved, and all of it packed up and ready to overnight out to the customer for when the UPS store opened at 7 am, about 3 hours from then. The poor guy got back from vacation to that mess of an email inbox and many meetings. It was an honest mistake, and he bought us all 12-packs for the trouble. Still, when you think you have proved that physics is broken, I think that will qualify as step 12.

11. "Do we have the IP core for this?"

12. "Where is my electron microscope?"

13. "We're gonna need some time on the FIB workstation" [1]

[1] http://www.electronicdesign.com/eda/fib-circuit-edit-becomes...

That's the one I should've thought of. I said fab but who trusts them to know what's on it!? Haha.

This is the worst thing I've ever found, still not solved: https://github.com/crystal-lang/crystal/issues/4127

You must be kidding me. We hit a bug with the mysql2 gem where the client randomly crashes in libmariadbclient (but not libmysqlclient) only on debian (using Arch Linux and OS X for dev, but exact same versions of everything) and only for database names of length 25. And 28, but we cannot reproduce it on the repro docker image we made. And only if there are enough aliases in the query (could be as low as 5 but could need as much as 20+). And only if 'active_record' is require'd, but even when it's not used at all. And never ever under GDB or Valgrind, making it the perfect heisenbug.

That's a lot of stars to align there, but when they do, hell breaks loose just often enough to be sure it's not completely random. And obviously this hit one of our most finicky customers, and only in production, because of course "#{customer}_production".size == 28 (and not 25, because nah, that'd have been too easy to be able to reproduce the bug right away).

[0] bug: https://github.com/brianmario/mysql2/issues/822

[1] repro: https://github.com/adhoc-gti/mysql2_pointer_bug

I've only seen level 5 personally, with a C# compiler bug that would omit totally valid `else` branches.

My favorite level 5 was a bug in clang that caused it to occasionally emit incorrect code when calling a vararg function. However, the bug was harmless when combined with clang's vararg function prologue. When calling a vararg function compiled with gcc, the clang bug would cause gcc's prologue to jump to a quasi-random address vaguely nearby and continue execution in the middle of some other function. That was great fun. I wrote it up here:


Care to elaborate? Which C# compiler?

The official one, I think it was in .NET 3, but it was a few years ago at an old job, so I'm a bit hazy on the details.

Basically we had a bug where a whole conditional branch was being skipped, and we traced it down to the branch being omitted entirely from the compiled IR.

And no, it was nothing fancy, just something like:

    if (customer.country == "US") {
        // ...
    } else {
        // ...
    }
The whole `else` branch was simply missing from the compiled program.

If I remember correctly, we got around it by doing something like:

    bool isUsCustomer = customer.country == "US";
    if (isUsCustomer) {
        // ...
    } else {
        // ...
    }
Anyway, the point is that the compiler fucked up its handling of if/else statements, but only at that specific part of the code, leading to a few wasted days of effort tracking down the problem.

It can get even more "fun" with Java. Your code can start running through an interpreter, then after a while suddenly be transformed by a JIT engine. The interpreter and the JIT engines (there's more than one JIT engine) have different bugs. The optimizations made by the JIT engine can depend on the data which went through your method before the JVM decided to optimize it.

I'm not finding it right now, but I recall seeing a few weeks ago a presentation with several of these sorts of bugs in a recent version of Java (all reported and fixed): after a number of iterations, it suddenly starts returning wrong results.

Sounds like an optimization gone haywire, deducing that the condition would always evaluate the same way. It's valid to optimize an else branch out if it can never be reached (dead-code elimination). Was there something akin to this in the statement?

> Was there something akin to this in the statement?

It probably was the optimiser at fault, but there wasn't anything special about this conditional, and certainly nothing that _should_ have caused the optimiser to throw away the else branch.

If memory serves right it was comparing a string field of an object to a static string, like `someObject.foo == "some string"`.

Sorry, I don't buy it. I've seen countless cases where developers conclude that "compiler has a bug" and it never ultimately did. There are also cases where they never bother to figure it out, change the code a bit and continue with their lives thinking they've found a bug in the compiler. But they didn't.

> Anyway, the point is that the compiler fucked up it's handling of if/else statements, but only at that specific part of the code

It would have to be specific. Put it this way: if this was a general bug and the "else" was always omitted, how long would it last before being found and fixed?

Related, if you were to say to me "I found the issue, the compiler isn't correctly handling if/else statements" Then my first thought would be about your medication not about the compiler.

> if you were to say to me "I found the issue, the compiler isn't correctly handling if/else statements" Then my first thought would be about your medication not about the compiler

And yet, it happened :)

And the senior engineers at the company looked at it and confirmed it was a compiler bug. Their best guess was that something about that part of the code was putting the compiler in a funny state, causing it to skip that particular `else` branch.

We reported it to Microsoft, but never heard anything back.

went through these kinds of stages a few times in my career.

once it led me to discover a leak in a major travel website's purchase flow caused by Java's Thread class, related to thread groups.

most recently, I was writing some Linux auth code in C, reached a point where I could rule out my code, and found a bug in sudo. freaking sudo.

(also related to groups, though the Linux user kind, not Java threads.)

For devs using higher level languages it is more like:

1. "It must be my code" -- minutes of debugging

2. "It must be in our codebase" -- hours of debugging

3. "Third party library or framework" -- If library use a different library, if framework accept the bug and work around it whilst cursing framework choice.

This one goes to eleven: 11. "On some setups, clients get a corrupted stack when swapped back in from kernel."


This is the most concise list of debugging layers I have seen. I'm mostly commenting so that I get this into my comment feed and can locate it easier.

In that case, feel free to bookmark this one: http://bostik.iki.fi/aivoituksia/random/developer-debugging-...

This got more attention than I thought possible, so I decided to pull it out as an item all by itself.

Spot on. The experience level of a developer is directly proportional to how fast he/she assumes that the problem is in someone else's code. ;-)

That one depends a lot on who your coworkers are too.

I was thinking more of the code in the OS, compiler, run-time, etc. I should have made that more clear in my comment.

I feel lucky to only have reached item 4; I wouldn't feel confident beyond it, anyway.

I don't think anybody is ever confident beyond level 1. The worst part is those "why the fuck does everything work when I plug a logical analyzer?" moments.

I'm usually well into #3 before I realize I'm in too deep

> It turned out that the std::string empty string reference count was just doing a vanilla ++, no locking, nothing, variable not marked volatile, nothing.

The whole implementation smells of silliness, because there is no need to track how many references there are to a global null string, which need not even be dynamically allocated.

I know; I've often wondered if this was changed, never went back to look.

Probably a few years ago, around C++11, when the standard made it impossible to have a copy-on-write implementation of string.

The problem is forgetting that dynamic memory usage is not "free" (as in "gratis" or "cheap"). In fact, using std::string in long-lived server processes that do intensive string processing (e.g. parsing, text processing, etc.) has been known forever to be suicidal, because of memory fragmentation.

For high-load backends processing data, you need at least a soft real-time approach: avoid dynamic memory usage at runtime (use dynamic memory only at process start-up or reconfig, and rely on stack allocation for small stuff, when possible).

I wrote a C library with exactly that purpose [1], in order to work with complex data (strings -UTF8, with many string functions for string processing-, vectors, maps, sets, bit sets) on heap or stack memory, with minimum memory fragmentation and suitable for soft/hard real-time requirements.

[1] https://github.com/faragon/libsrt

Maybe not enitrely related, but I had worked on a Java project where any new object creation was prohibited !

Had a VERY VERY hard time unlearning and catching up to that "paradigm" but I now have a much better perspective on "automatic memory management" .

>> The problem is forgetting that dynamic memory usage is not "free"

Totally Agree.

Are you still developing this project?


I encountered exactly the same issue a few years ago at UIDAI, in one of our large-scale biometric matchers, and the resolution was exactly the same. After a week of debugging I found that the libstdc++ allocator was the culprit. I found [1], which confirmed the same and helped in fixing the issue.

The thing that was more interesting (or sad) was learning that the GCC developers didn't expect multithreaded applications to be long-running.

"Operating systems will reclaim allocated memory at program termination anyway. "

[1] https://gcc.gnu.org/onlinedocs/libstdc++/manual/mt_allocator...

> Notes about deallocation. This allocator does not explicitly release memory. Because of this, memory debugging programs like valgrind or purify may notice leaks: sorry about this inconvenience. Operating systems will reclaim allocated memory at program termination anyway.

Wow. This is worth a Linus Torvalds-level rant. Whoever accepted this code into the source tree needs to be put on GNU's version of a performance improvement plan.

This is actually a fine thing to do. I don't remember where I read it, but a good analogy for freeing memory at program exit is cleaning the floors and walls of a building right before it is demolished.

This is also why Valgrind separates reachable and unreachable memory and only considers unreachable memory as leaks.

How do you know the program is even supposed to terminate? Maybe it's a server, as in this case. Maybe the code will be reused someday as a subfunction or library within a larger application, instead of being launched and terminated directly by the OS. Maybe it's an embedded application in a 24/7 factory somewhere. Maybe it's on its way to the Kuiper Belt. Or maybe it's just supposed to stay up and running for longer than the average Windows 10 update period.

In any case, hiding this sort of behavior in a way that sucks down days of debugging time on the part of one expert programmer after another, after another, after another, is terrible engineering.

Freeing memory at program exit may be unnecessary, but being unable to free memory while the program is running is terrible.

On iOS and android you are expected to free whatever unused memory you can when you are notified of a low memory situation.

It's OK to do, but it makes memory analysis difficult. If your app exits with a lot of allocated memory, it's hard to tell what's a real leak and what's not.

This behaviour is one of those cache-memory-leaks. Where even though memory is reachable it is still effectively leaked because it's soaked up by some data structure and not released to the rest of the system.

So it's not a traditional leak but because memory usage would continue to grow it causes the long lived process to choke itself and die.

A good example is something like doing a "cp -a". To preserve hardlinks you'll need a mapping, trying to free that at exit can and will take time. This was an actual bug in the coreutils.

I have to wonder why the C standard library didn't include a Pascal-style mark/release allocator. A naive implementation wouldn't be much faster than free()'ing a bunch of allocations manually, but the general idea offers possibilities for optimization that otherwise aren't available to a conventional heap allocator.

Because (m)alloc predates mmap by a decade or so [1], and you expressly don't want stack-like allocation semantics when using malloc (otherwise you'd just put it on a/the stack).

[1] And you cannot have more than one heap without mmap. Without mmap, you only have sbrk = the one heap. On UNIX and those that pretend to be, anyway.

Raymond Chen often uses this analogy on the blog Old New Thing.

Years ago I read in the Perl documentation ( http://perldoc.perl.org/perlfaq3.html#How-can-I-free-an-arra... ):

> On most operating systems, memory allocated to a program can never be returned to the system. ... Some operating systems (notably, systems that use mmap(2) for allocating large chunks of memory) can reclaim memory that is no longer used ...

Because you can't return unneeded memory to most operating systems (or because it used to be that you couldn't return unneeded memory, even if that has changed recently), it isn't a surprise that by default GCC's free() and operator delete -- which are meant to be cross platform -- don't try to return that memory. Instead it's all free list management.

I do think it's silly for operators new/delete to have a separate free list from malloc()/free().

>I do think it's silly for operators new/delete to have a separate free list from malloc()/free().

They don't. This is a custom, simple, non-default pool allocator for standard containers (i.e. nothing to do with new).

Thanks for the correction. But it's just as silly to create an allocator for the standard containers, when the allocator consists of little more than free list management, given that malloc/free and new/delete already do that.

It's especially silly for a library (programs may want custom memory management and libraries really shouldn't go out of their way to make that harder), and the fact that it's the standard library doesn't make it less silly.

GCC's std::allocator also doesn't do that (not for at least a decade, IIRC). It's a non-default allocator, which nobody is forced to use. It's entirely optional. The default std::allocator just uses new/delete.

Isn't this just a Free List allocator?


Make them work on gnome?

Looks like this is just one of possible "non-default" extension allocators, that can be selected during libstdc++ compile time: https://gcc.gnu.org/onlinedocs/libstdc++/manual/memory.html#...

Facts and up-to-date documentation, how novel.

I myself ran across this same scenario many years ago with a similar amount of hair pulling and eventually concluding that the GNU libstdc++ allocator wasn't reusing memory properly. Unfortunately, I was never able to pare down the application to the point that I had a reproducible test case to report upstream.

GLIBCPP_FORCE_NEW was the solution for the near term and since I was deploying on Solaris boxes I eventually switched to the Sun Forte C++ compiler.

It really bugs me that this problem still exists. :-/

It really bugs me that nobody linked a bug on the bug tracker, or filed one. I know I'm asking a lot and being an ass.

No, that's a reasonable expectation. Unfortunately, developers often don't appreciate bug reports and become defensive. File a bug report and you will be expected to provide a reproducible test case, or your bug will get closed. This isn't always possible, and the reporter might not be able to dedicate the time to do this.

I used to file lots of bug reports, because I know that as a developer, I'd want them. Every bug report is valuable, even if it's not reproducible. It happened to someone, so it surely also happened to 10 other people who did not report it, it's important.

Unfortunately, bug reports often get responses like "cannot reproduce", "please provide more details which is oh, about a day or two of work". Well, not everyone has a day or two to spend on a bug report, especially if you (like me) hit software bugs regularly.

These days I very rarely file bug reports. It just isn't worth the effort, as most developers do not appreciate the bug report. The dominating perception is that bug reports are annoyances that need to be closed ASAP, and if it isn't easily reproducible, it doesn't exist.

So I'm not surprised that nobody opens bug reports for bugs like the one described by the OP. Not easily reproducible? Rarely occurs? It's highly probable that nobody would care.

BTW, I ask users of my software to please do report bugs. Every bug report is a valuable data point, even if it isn't reproducible. And I do appreciate the effort that it takes just to file a bug report.

The problem with libstdc++ is that, in my experience, 90% of bug reports are user error. That means non-reproducible bug reports have almost no value.

I know you were probably just throwing out the “90%” statistic, but if you take that as a given, the implication that 10% of the libstdc++ bugs on file are legit is a worrisome notion in its own right. I don’t want to be responsible for triaging those bugs (and nor do you, I am guessing; this being why the reports are valueless) but the fact that this is the bug rate in a gold-standard library in common, ubiquitous use… well, as far as I can see, this is the context in which the OP’s article should be considered.

… I build all of my C++ projects with Clang and link them against libc++, so I don’t know if I am dodging a very high-caliber bullet (so to speak) or if the other shoe will drop at some point, and I will find myself going down the OP’s rabbit-hole of library-bug investigation.

You are right, it's probably a lot less than that.

The bug quality is higher than you might expect, because you have to register with Bugzilla, and most bug reports are against development versions, as bugs are shaken out of new features. There are very few bugs in released versions, and where those bugs exist they are often of the form "stupid type where I have redefined & doesn't compile, while the standard technically says it should", which most users would never hit. Wrong-answer or crash bugs in releases are extremely rare, although they could be rarer -- the test suite has less coverage than I would personally like.

Yeah: people run into issues and somehow feel "obviously everyone else ran into this same issue and it was never fixed so they must not want to fix it" without it ever really occurring to them that maybe almost no one runs into this problem and no one knows it exists, or even that it could be their own code that is broken. We all know "I compiled my code with -O0 and it started working" is almost never due to "there is a bug in the code optimizer", and this one sounds extremely similar. If people think they have found a bug, and they want to feel righteous and haughty over how it never got fixed, they really need to demonstrate that it is actually a known bug (and maybe you were being sarcastic, but I will just state flatly: being bothered when people do not does not make you the ass).

I searched GCC Bugzilla and found one result[0] for GLIBCPP_FORCE_NEW.

[0] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=13823

Edited to add the following text.

There is one result[1] for GLIBCXX_FORCE_NEW.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65018

Default bugzilla search doesn't include resolved, verified, and closed bugs. That said, I only found one slightly relevant, which was resolved invalid, saying the env var should be used: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10183

And that's just due to not freeing the pools, not whatever is the complex problem here.

I think these days it goes by GLIBCXX_FORCE_NEW

All the technicalities aside, the writing style of the author is amazing. I would never have thought that someone could create such an intense narrative with 'malloc' as the main character.

I thought the metaphor was pretty tortured, but I generally appreciate anything that colors up dry technical writing so much that I give it a pass.

> Nothing made any sense until we noticed the controller microservice's memory consumption. A service that should be using perhaps a few hundred megabytes at most was using gigabytes and growing... and growing... and growing... and growing...

Not identifying this until many hours after symptoms were impacting users sounds like a pretty big monitoring blind spot.

Yes, although to be fair, blind spots are always obvious in hindsight!

Did you report the issue upstream with a patch? The solution to "the standard library is broken" is to fix the standard library, no? It's all free software after all.

I'm not the post author, unfortunately, so I really have no knowledge of the specifics of what is going on.

It seems like the libstdc++ maintainers are aware of the issue at least, so that's a start. It'd be nice if some of the mentioned discussions/complaints were linked, though, so we could see what has already been said/done.

The post author has an active handle on HN (api)

Oh my. What a fun read. Sort of like reading about a horrific car accident. You read it and you think "Oh thank God… That can't possibly be real…"

And then you realize that it is.

It's a known issue according to what I've read. It's never been fixed. This may be due to the fact that it's hard to trigger and reproduce.

Can you link to some of the things you've read? What exactly is the "known issue"?

From the documentation (linked by arunc above)[0] "Notes about deallocation. This allocator does not explicitly release memory. Because of this, memory debugging programs like valgrind or purify may notice leaks: sorry about this inconvenience. Operating systems will reclaim allocated memory at program termination anyway. If sidestepping this kind of noise is desired, there are three options: use an allocator, like new_allocator that releases memory while debugging, use GLIBCXX_FORCE_NEW to bypass the allocator's internal pools, or use a custom pool datum that releases resources on destruction."


That doesn't explain the «Some allocations were much larger than anything ZeroTier should need.» part.

Yes it does. The secondary C++ allocator is doing its own pooling on top of the primary allocator.

It's not a known issue, but GLIBCPP_FORCE_NEW has had no effect on libstdc++ code for more than a decade, so I wonder which prehistoric version you're using. Even the modern GLIBCXX_FORCE_NEW doesn't do anything for the default std::allocator implementation, which always uses new/delete unconditionally (since 2005).

Is there an open bug report? I have read the docs which specifically say "we don't care about leaked memory", but the brokenness being documented is different from it being a "known issue".

And, to be blunt, "it's never been fixed" should be written as "it's not been fixed yet" -- because you have an opportunity to fix it. That's how free software works after all.

> The solution to "the standard library is broken" is to fix the standard library, no? It's all free software after all.

Doesn't the author make that case at the end?

They make the case that it's broken with very strong language. They don't say "here is a patch" or even an outline of a patch (or a link to a bug report even), just histrionics.

> GNU libstdc++ is broken. This is pretty unforgiveable. [...] Adding wheels to the wheel is sometimes forgiveable when dealing with closed systems that you can't fix but libstdc++ and glibc are both open source GNU projects.

I think there's a disconnect in the author's mind with regards to how free software projects work. If something is broken, you as a user are empowered to fix it. I mean, the author even went through the trouble of reading the libstdc++ code and figuring out what happened inside new (which is likely enough information to write a preliminary patch). At the very least open a bug report about it (or link to an existing one).

Unlike proprietary software, there are many ways to improve free software and ranting online is rarely one of them. I get that this bug screwed them over big time and that they are angry about it. But converting that emotional energy into something useful would help many people other than themselves.

tl;dr: If you find yourself ranting about "why wasn't X fixed before" in a free software project, it might be helpful to realise that you have the opportunity to be the person who fixes it.

You can certainly send a patch, but there's no guarantee they'll accept it. Some core projects seem to be highly opinionated. Look at GCC's attitude to plugins or good error messages. Or Linux and grsecurity. Or Linux and stable driver ABIs.

There are plenty of things that people would like to change but can't because the maintainers disagree. Not saying that is necessarily bad but you are stupid if you think the answer to everything is "well did you write a patch?".

> Not saying that is necessarily bad but you are stupid if you think the answer to everything is "well did you write a patch?".

I'm not sure why this tone is necessary. If someone just rants about a problem without even _trying_ to submit a bug report with a proposed patch that strikes me as laziness not foresight.

As a maintainer myself, I am well aware that maintainers will reject code if it disagrees with our view of a project. But how on earth do you expect us to know there is a problem without reporting a bug (the only reason I asked whether the author wrote a patch is because they went through the trouble of debugging the problem so probably are in a good place to write a patch anyway)? And if you decide to write an angry and ranting blog post rather than interact with us, we aren't going to be very nice to you either.

> Look at GCC's attitude to plugins or good error messages.

Which is?

A patch isn't necessary for it to be fixed, but a bug report generally is. A blog post linking to unofficial copies of documentation from 2004 doesn't count.

If it's broken, it should be commented out, or opt-in via a parameter. It should not be shipped AS IS, so that when the bug is finally found after weeks and lost weekends, you don't just stand at the rim of the crater, lean down, and yell: "Well, this is just great, you discovered a cavern. With the sweat of your brow, this could be a nice house one day. Almost like the one we promised to deliver in the first place."

Things like this is why I was happy to see the LLVM project write their own C++ standard library. libstdc++ has always seemed a bit hacky and fragile to me. It's great to have an option which is a more modern, clean codebase.

Have you tested to see if this works better with LLVM libc++?

Alternatively, things like this is why it is great when everyone works together to improve one lib instead of splitting their efforts across two essentially-identical ones. You act like libc++ is somehow simply better and probably doesn't have bugs :/. I have been doing C++ work now for over two decades, and let me set that record straight in your head: incredibly basic stuff has been pretty extremely broken in libc++. One horror story that wasted way, way too much of my life is that Mac OS X 10.7 seriously shipped with a build of libc++ where std::streambuf failed to check EOF conditions correctly. Despite most of my projects being compiled simultaneously by numerous versions of both gcc and clang (to target various weird configurations), I seriously don't remember the last time I ran into a bug in gcc and libstdc++... it was pre-2006 for sure... but I continue to run into annoying issues with clang and libc++. The correct way to read "modern" when applied to "codebase" is "untested". And hell: while I am totally willing to believe there is a bug here, this post doesn't have a fix and doesn't even seem to have led to a bug report. This is like saying "I compiled my code using -O0 and it started working, so clearly this is a bug in the optimizer", which we should all know is a dubious statement at best.

Alternatively, having a choice of more than one thing tends to cause competition to kick in, which often results in better quality than a single solution that everyone pretty much has to use.

The "make malloc faster" part was done over a decade ago with the follow-up from ptmalloc2 (the official glibc malloc) to ptmalloc3. But it added one word of overhead per region, so the libc people never updated to v3: a perf regression, in their view. Instead, they broke the workarounds they had added. And now they are breaking emacs with their new malloc.

>> Most operators in C++, including its memory allocation and deletion operators, can be overloaded.

Have I mentioned lately how much I hate C++?

Great read.

"Then I remembered reading something long ago" is when experienced programmers are worth their weight in gold.

Interestingly, for recent versions of GCC (>= 4.0), GLIBCXX_FORCE_NEW is defined for libstdc++, not GLIBCPP_FORCE_NEW.


I'm a bit confused here.

>> Most operators in C++, including its memory allocation and deletion operators, can be overloaded. Indeed this one was.

Okay, well, firstly - the issue here seems to be a problem with the implementation of std::allocator, rather than anything to do with overloading global operator new or delete. Specifically, it sounds like the blog author is talking about one of the GNU libstdc++ extension allocators, like "mt_allocator", which uses thread-local power-of-2 memory pools.[1] These extension allocators are basically drop-in extension implementations of plain std::allocator, and should only really affect the allocation behavior for the STL containers that take Allocator template parameters.

Essentially, libstdc++ tries to provide some flexibility in terms of setting up an allocation strategy for use with STL containers.[2] Basically, in the actual implementation, std::allocator inherits from allocator_base, (a non-standard GNU base class), which can be configured during compilation of libstdc++ to alias one of the extension allocators (like the "mt_allocator" pool allocator, which does not explicitly release memory to the OS, but rather keeps it in a user-space pool until program exit).

However, according to the GNU docs, the default implementation of std::allocator used by libstdc++ is new_allocator [3] - a simple class that the GNU libstdc++ implementation uses to wrap raw calls to global operator new and delete (presumably with no memory pooling.) This allocator is of course often slower than a memory pool, but obviously more predictable in terms of releasing memory back to the OS.

Note also that "mt_allocator" will check if the environment variable GLIBCXX_FORCE_NEW (not GLIBCPP_FORCE_NEW as the author mentions) is set, and if it is, bypass the memory pool and directly use raw ::operator new.

So, it looks like the blog author somehow was getting mt_allocator (or some other multi-threaded pool allocator) as the implementation used by std::allocator, rather than plain old new_allocator. This could have happened if libstdc++ was compiled with the --enable-libstdcxx-allocator=mt flag.

However, apart from explicitly using the mt_allocator as the Allocator parameter with an STL container, or compiling libstdc++ to use it by default, I'm not sure how the blog author is getting a multi-threaded pool allocator implementation of std::allocator by default.

[1] https://gcc.gnu.org/onlinedocs/gcc-4.9.4/libstdc++/manual/ma...

[2] https://gcc.gnu.org/onlinedocs/gcc-4.9.4/libstdc++/manual/ma...

[3] https://gcc.gnu.org/onlinedocs/gcc-4.9.4/libstdc++/manual/ma...

Call me confused, too. When I was once following up the allocator in the STL because of some performance issues, it was (more or less) going directly to malloc.

I've searched now the code for GLIBCXX_FORCE_NEW, and it seems it is used in the pool_allocator and the mt_allocator [1].

String uses std::allocator [2].

I agree, the blog entry seems to be missing some information needed to reproduce the issue. It looks to me like the author jumped to a conclusion which confirmed his initial "insight". Not surprising, if you are under pressure and working through the whole weekend and nights. Who hasn't been there.

What bugs me is that the standard answer quite often seems to be that the whole thing is broken and we have to switch to a completely different implementation, and/or rewrite it from scratch.

[1] https://github.com/gcc-mirror/gcc/search?p=1&q=GLIBCXX_FORCE...

[2] https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-...

I too was curious about what distribution they were using; specifically which GNU libstdc++ packages. It seems like a drill down to that specific distribution's rpm/deb/etc packaging scripts to see their ./configure options would also be important.

Post doesn't actually say what was broken, or indeed prove the location of broken-ness. Just that it went away with a different compile option.

Exciting writing, but lacking a point.

Yeah, the post is just an accusation with very indirect and unverifiable evidence.

It'd be nice if the author had created a repro or actually spotted the bug.

Right now it's just moaning. Quite probably, he's shooting himself in the foot in one of the numerous odd ways.

This conclusion might be wrong. The code in question, while it might not be leaking memory, might be stumbling over memory blocks and corrupting memory-management structures. Turning the flag on might fix the issue by mere luck, because memory allocations, locations, and structures would be different.

Valgrind would've caught those, wouldn't it? Or, maybe the C++ layer prevents it from catching that since it's application-level, which is at least another reason to remove it.

The tools are good but not magic. It's always possible that they miss something (e.g. when multithreading, the bug might not manifest since the timings are all different when running in valgrind). But this is a red flag:

"Nothing worked. It's leaking but it's not. It's leaking but the debugger says no memory was lost. It's leaking in ways that are dependent on irrelevant changes to the ordering of mundane operations. This can't be happening."

This is a red flag for heap corruption - or multithreading bugs. (Stack corruption is usually a crash and a wrong stack trace). As it's not trivially reproducible, it's probably a multithreading bug. It's also easy to imagine that OP wrote a scripty-input generator in C++ to run through valgrind which gives single threaded inputs. So running under valgrind it won't be detected. So it's never fun to solve and ever since I changed languages, my multithreaded debugging skills have become a bit rusty. Hey-o!

But it would be good if he amended his post to reduce the vitriol aimed at GCC. It's demonstrably false that GCC's default allocator holds a cache. It makes OP look stupid to people in the know; and makes GCC look bad to people not in the know. No one wins.

Memory fragmentation due to dynamic, non-fixed-size data structures and multithreading is an old foe. It may not be fixable in C/C++.

Worker A allocates dynamic stuff; the allocator takes a segment [0, ofA). Worker B allocates the same kind of data structure (a fragment of a JSON): [ofA, ofB). Worker A resumes allocating, the boundary of [0, ofA) is exceeded, and there is no free contiguous space up or down, so [ofB, ofC) is allocated. Worker C enters and wants to alloc, but sizeof(string) makes the request bigger than [0, ofA), so [ofD, ofE) is asked for... and the more concurrent workers there are, the more interleaving of memory allocations goes on, with ever more fragmented memory.

Since mallocs are costly and the problem is known, a complex allocator was created, with pools of slabs and the rest, probably having one edge case, very hard to trigger, in its PhD-proven, really complex heuristics.

CPU power increases, more load, more workers; interleaving comes in, and the edge case gets triggered.

And C/C++ makes fun of Fortran and its fixed-size data structures while embracing arbitrary-size, arbitrary-depth data structures, for the convenience of skipping a costly waterfall model before delivering a feature or a change in a data structure, and of avoiding bikeshedding in committees.

Humans want to work in a way that is more agile than what computers are under the hood.


The alternative? Always allocate fixed-size memory ranges for data handling and make sure they will be enough. When doing REST, make sure you have an upper bound and use paging/cursors, which requires an FSM; then have all the FP programmers say mutables are bad, the sysadmins say that FSMs are a pain to handle when HA is required, the CFO say the SLA will not be reached and the business model is trashed, and the REST fans say that REST is dead when stateful.

Well REST is a bad idea.

Multi-threading ain’t particularly popular on Linux, hence issues like this one in STL.

But it’s fixable. On Windows, they implemented the solution in Windows XP (opt-in), and since Vista they use it by default.


The good thing about fixed size memory ranges is it decreases debugging time. It also allows deterministic behaviour for your modules.

Amazing write-up. Informative and gripping in its prose.

This article was an extremely long rant about "something has a bug, but templates are hard, so I have no clue what it is". The author figured out a workaround to the issue, but we still don't know what the bug was, and the strongly worded conclusion that it is a bug in libstdc++ isn't even defended well (as this is similar to concluding "there is a bug in the compiler's code optimizer" from "I compiled my code with -O0 and it started working")... I can't really see calling this "informative" :(.

I'm curious, what is usually the cause of code starting to work when it is compiled with -O0?

* Optimizer bugs. Thankfully these are less common nowadays. In the "good old days" of new compilers this happened quite a bit.

Embedded code:

* Missing "volatiles" which allow the optimizer to optimize out "unused" loads and stores to hardware or multitasking shared variables.

* Race conditions (e.g. unsynchronized access to multitasking shared variables). Making the code run slower changes the access pattern, often times obscuring the bug.

Crap code that hit undefined behaviour every other line (like accessing dead temporaries or freed memory, assuming stack layout, out of bound accesses, etc.).

In my experience, it often means variables that are used before they are initialized, or dangling pointers.

I don't have an answer to that, but I have seen - and worked with - several codebases which work fine when compiled with -g, but will crash (in good cases) or behave irrationally (in bad cases) without.

The crashing ones at least are easy. Somewhere a list or variable-argument array is missing the NULL terminator...

This was on Windows with VC++, but same deal. The code that some cow-orkers had written was copying strings like "LAX" and "ORD" into `char airport[3]` using strcpy(). In debug builds, VC was allocating a whole 32-bit word, but in release builds it was packing everything on the stack. Write to it, and the terminating null ends up overwriting a byte of the next variable on the stack. Urgh. Of course, these were several hundred line functions, so the strcpy and the subsequent use of the trashed variable were a long ways away.

Yes, I'm pretty sure the bug is not in the c++ library, but in his code.

It was a great, gripping write-up. It also corroborated why I told api he was better off using a subset of C or safe language that generated it for software like this. I told him there were tons of ways to analyze or make safe C subsets but almost nothing available that will get similarly great results on C++ code. This was a good example of where its complexity and style of sneaking in abstractions bit him in the ass in a way that might be easier to spot in C, Ada/SPARK, a Wirth language, etc. C++ style is safer on average but highly-robust code is better in restricted, analyzed C if not a safe language.

(Original blog author here.)

That's a nice idea, and we've considered "minus minus"ing the ZT core as part of an embedded port. But code like this that schleps a lot of structures around and works with JSON is eye-gougingly painful to write in C, and the chance of a worse and possibly exploitable memory bug is much higher.

This is the first time we have encountered an actual problem with C++ compilers or runtimes.

You don't write those parts in C alone. You use something that shows the C is safe automatically, use a tool that generates secure C from specs (e.g. Nail), and/or use a safe language that compiles to C. This way, you get the benefits of the C ecosystem without the risks of totally using C.

I'm not sure if the other debug tools mentioned offer this, but AQTime Pro:


has an allocation profiler that can be used to track down this sort of problem. You can take allocation snapshots while the application is running to see where the allocations are coming from (provided that you can run AQTime Pro against a binary with debug symbols/info).

I'm not affiliated with the company - just a happy customer that has used them for years with Delphi development.

Delphi... Now that's a name I haven't heard in a long time

It's still kicking. ;-)

Upvoted for the Lovecraft and pulp horror lit references, and starting with "It was a dark and stormy night ..." :-)

Great writing, great read.

I don't understand the point of this article... if you think there's a bug in the library, fix it. Don't write a melodramatic blog post lamenting how horrible it is in the hope somebody else will do it for you.

This isn't particle physics, it's code: we don't have to guess, we can look at it and see how it works.

One thing not even mentioned in the article is the component-scouting and profiling phase, which completely failed. You do not go all in with a project on a crucial library that you did not profile with the real workload.

One small prototype, never run under full load, with mock-up classes (not even the size of the future classes), mock-up operations (not even close to the real workload), and sometimes not even the final DB type attached. Yeah, it's hard to see the future, but why not drive the test setup to its limits and go from there?

Instead the whole frankenservice is run once for ten minutes and declared by all involved worthy to bear the load of the project.

Here is to lousy component scouting and then blaming it on the architect.

This was a nice write up, however I didn't follow how memory fragmentation was related to a memory leak. Can someone explain? I understand that alternate memory allocators would help with the fragmentation issue but how does the choice of allocators affect memory leakage?

When you have a rapid sequence of large and small allocations, your memory starts looking like swiss cheese. To fit a large allocation you need a contiguous block of free memory, but all the available blocks are too small, so it gets placed at the end where the free ram starts. Then another allocation is put right after it, the big chunk gets freed, and a small allocation is put in its place. Now instead of one big block of free memory you have a slightly smaller block, which is slightly too small to fit a large allocation, which has to be put at the end, and...

I ran into this issue once debugging a performance issue with a CAD file parser. It made a lot of copies of large and small chunks, and some CAD files would cause catastrophic fragmentation. Switching out the allocator for a smarter one fixed the problem. A few versions of Delphi later, they put that allocator in by default, so now it is not prone to fragmentation anymore.

I'm guessing the cpp thing is a holdover from the days when the glibc maintainer was less than entirely helpful. There have been actual improvements in glibc in this area lately, so hopefully these kinds of hacks will slowly go away.

The pooling behaviour of the libstdc++ std::allocator was only the default from 2004 until late 2005, so it went away more than a decade ago.

Is C++ now going the way of PHP, where to have an actual working program you have to disable all the defaults in some mysterious but crucial ritual?

James Mickens, move over. There's a new sheriff in town.

The idea that we should always blame ourselves first has merit. But frankly, some bugs just p e r s i s ttt.

like this one, with fputcsv in PHP. https://bugs.php.net/bug.php?id=43225

It was only yesterday that I was reading another discussion from hacker news about problems with Gnu C library.


That's a different library.

People forget that C++ is just a tool, like a screwdriver or a hammer. A good carpenter knows when it's time to take a metallurgy class and resmelt his hammer, because its composition is not correct for being a hammer.

If occasionally when you try to use a hammer to hit a nail, the hammer swings the other way and hits you, that tool has violated some essential assumptions about how tools work, and the "it's just a tool" reasoning might not apply as clearly any more. :)

I am pretty sure there has never been a carpenter who resmelted his hammer.

I'm not sure "resmelting" is even a thing. I am sure that the GP is joking.

Guys, yes! I was joking. I was directly riffing on the common claim that C++ is "just a tool". This exact phrase, in quotes, has over a hundred thousand hits on Google[1], many of them immediately comparing it to a hammer or a screwdriver. The usual context is that C++ isn't really "unsafe" -- it's just a tool, how you use it is up to you.

In the case of the standard library being broken, it is like a tool that is broken. Resmelting might not be a word but the idea of the hammer not being cast correctly and needing to be re-cast is ridiculous. In essence I was making fun of the "there's nothing wrong with C++, just like there's nothing wrong with a hammer" common claim. I was taking it to an extreme. (I thought it was funny.)

[1] https://www.google.com/search?q=%22C%2B%2B+is+just+a+tool%22

I have no idea why I thought you were serious. Poe's law?

It's not a thing. Smelting means purifying from ore by melting. Unless you turn your hammer back into iron ore, there's no way to "resmelt" it

If you melt metal down again, you get a very brittle, carbon-reduced new version, whose properties depend upon the crystallization temperature curve?

I think that is the joke...

So where's the bug report with repro test code?

Maybe it'll get fixed now that a post saying "libc++ is broken" got hackernewsed

libstdc++ and libc++ are different libraries; this post is talking about libstdc++. They're like elephants and elephant seals.

"They're like elephants and elephant seals."

Yes, yes they are.


The correct response is to file a bug report, not write a clickbait-y article.

Yeah malloc() is pretty terrible in glibc by modern standards. For some workloads it just can't keep up and ends up fragmenting space in such a way that memory can't be returned to the OS (and thus be used for the page cache) and you end up in this performance spiral.

I always deploy C++ servers on jemalloc. I've been doing it for years, and while there have been occasional hiccups when updating, it has provided much more predictable performance.

Actually from my understanding, it's libstdc++'s allocator that is causing the issue, not malloc.

A big reason the small object optimization exists in libstdc++ containers is because system malloc() is not fast enough.

We're not talking about another optimization (small object / locality); his issue was caused by libstdc++'s alloc pools, which would not need to exist in the first place if the system malloc were better. So libstdc++ ends up reinventing the wheel poorly.

As the author mentioned, when he disabled the pooling behavior with GLIBCPP_FORCE_NEW, he ended up burning more CPU via the system (glibc) malloc(). Once he added jemalloc on top of GLIBCPP_FORCE_NEW, this pretty much evened out with the previous runtime performance.

The conclusion towards the end of the article: > The right answer to "malloc is slow" is to make it faster.

By default libstdc++ stopped using the pooling allocator in 2005: https://gcc.gnu.org/r106665

That's one year after the ancient, bitrotted, unofficial copy of the libstdc++ documentation that the blog post links to, but still ancient history.

This is correct. Glibc malloc works fine, though jemalloc is faster in highly multithreaded code and seems to be slightly more memory efficient.

This has nothing to do with glibc malloc()

It has. glibc malloc was never updated to ptmalloc3.

Actually it is C's malloc and free that are "broken". malloc() takes a size parameter, but free() doesn't. This imbalance means it can never be maximally efficient. Whatever GNU libstdc++ is doing is probably, on balance, a net win for most programs.

It's not exactly roses in C++ either, of course. You can do better than the standard library facilities. Andrei Alexandrescu gave a great, entertaining, and technically elegant talk on memory allocation in C and C++ at CppCon 2015 that is well worth watching.


Most allocators will be able to pretty efficiently recover the size of the block you are freeing. And you can count on developers not getting the size right, for that to be a common error, and for everyone to cobble together their own, wildly different and probably slower ways of tracking sizes. So it doesn't really help.

malloc/free aren't a great API, but for other reasons (namely, that you want things like multiple arenas, good control over synchronization, decent debugging and introspection, leak-tracking, tagging for figuring out what a block really is when things get smashed, block enumeration, small block pools, placement for cache alignment, and . . . you get the idea).

> you can count on developers not getting the size right, for that to be a common error, and for everyone to cobble together their own, wildly different and probably slower ways of tracking size

You have to do this anyway. You either know the size of the thing you allocated the memory for or, if it's a block, you need to keep track of the size for bounds checking purposes.

There are no circumstances in which you call malloc() in which you don't need to keep the size in your application.

It's not just about C. In C++, for example, you might be deleting a base-typed pointer to a derived instance, that can be one of several options that you don't know at compile time.

Doesn't matter, since you call the destructor for the derived class through the vtable. The function you end up calling knows the size of your object and its exact layout. The size at this point is compiled statically into your program.

The destructor does not deallocate memory in C++, since it is called just the same for objects allocated in other ways.

It's also not a given that such an object even has a vtable. It's perfectly legal for it to not have one, and the memory is still supposed to be deallocated in full (only the base class dtor is invoked then, but if derived class additions are trivial, it's not necessarily a problem).

Now, yes, you could add a separate vtable slot for the deallocation function. Or just store the object size directly in the type info (that's usually attached to the vtable). But this is really just a way to optimize size storage for objects that already have a word utilized for shared type descriptor like a vtable.


1) There doesn't have to be a vtable.

2) If there is a vtable available, it's the wrong one (look carefully at the treatment of vtable pointers in destructors when inheritance is involved).

3) The object has been completely destroyed before the deallocator is called, and it's unclear whether the vtable pointer is available (the standard doesn't seem to make this guarantee). In any event, the deallocator is statically determined and cannot be a virtual call.

I will note that vtables are just an implementation detail, and that you can successfully implement virtual calls with other mechanisms, which are also not required (by the C++ standard) to remember object sizes. So you can replace the concept 'vtable' with 'abstract mechanism by which the set of members appropriate for this class is determined' -- use token-based dispatch, for example -- and still have a compliant implementation.

[I helped write a few object runtimes in the late 80s and early 90s -- and man, C++ gets gnarly -- and I've shipped 8 to 10 allocators of various types in commercial products over the years]

Knowing a size and being forced to retain a size are different things. Here's a trivial example (imagine a system to process compressed audio data):

(1) receive packet: compute actual size, allocate buffer, expand data into it, start DMA to a speaker or something

(2) DMA-done: free the buffer

There, I didn't need the size on the free. There are many, many similar cases; in fact, these cases probably dominate.

Both your speaker driver and your application code still needed to keep track of the buffer size. Calling a sized free() on your buffer would require no extra overhead.

The C++ delete[] operator doesn't take a size parameter either. This is neither here nor there, and unrelated to the problem the blog post is talking about.

> The C++ delete[] operator doesn't take a size parameter either

Yes it does. See overload 6 here


Note that the size parameter is ignored by the standard library implementation. It is intended for use by user-defined implementations.

> Note that the size parameter is ignored by the standard library implementation

Only because they all call malloc() and free() under the covers. It was however an ABI break, and one the C++ community thought worth doing.

If you replace your malloc with jemalloc you can easily wire this up to sdallocx(). The C++ standard library obviously can't do this out of the box because it can't assume you are using jemalloc

I believe that if you malloc/free so much that those 120 cycles become a major issue, then perhaps doing dynamic allocation in this context isn't a good idea.

Furthermore, with Alexandrescu's solution, I'm not so sure that the 120 cycles that you gain here are not burned over there because your allocator has to construct a two-members structure and your program has to go through an extra indirection to access the actual memory block.

If free() took a size, what would that buy you?

You are aware that malloc implementations tend to stick the size just before the part returned to the caller, right? eg. let's say you store size at p, return p+4 to caller, then have free() subtract 4 again to get at this "header"... So I'm guessing that's not your suggestion because free() wouldn't work at all without that kind of hack. Or more broadly, free() or realloc() and others already need to have some way to determine the size of the allocation based on the pointer, so they track that somehow in a way opaque to the caller.

So then what...? You want programs to be able to give back a prefix of the buffer? or...?

With the status quo, say you call malloc() once. After a long while you call free(). Now the first thing free() needs to do is figure out the size, which is stored "before" the pointer. Until that load is complete, it can't know how large the allocation was, so it can't prefetch other necessary data, e.g. the free list for a given chunk size. This could add 120 cycles of latency to get a value that the caller probably already had in L1 cache or could compute with a single multiply.

Or a call to strlen() which makes your 120 cycles of latency start to look good ...

Edit to add: I've used both malloc()/free() and a custom memory allocation API that required the size to be passed in. I found the second API to be much more of a pain to use over the long term. Besides, it wastes memory because the memory API will have to track the size anyway to detect misuse of the API (or else blindly trust that the right size is passed in and hilarity ensues when it's not ...).

Well, you could use free() whenever it's practical and free_sz() when you need the performance.

I actually added this line:

> Or more broadly, free() or realloc() and others already need to have some way to determine the size of the allocation based on the pointer

I added it when I remembered I'd read a malloc implementation somewhere that didn't store the size in a header, but instead baked in the assumption that you can determine the size of an allocation from its address. So my naive scheme isn't the only option, and I think if you have the kinds of concerns you raise, there could be mitigations to be had.

They only store the size in a header because free() needs to know the size on deallocation. Think about all the C string functions that take buffer length parameters and what an outcry there would be if we eliminated all the size parameters and placed string lengths as headers before the buffer itself.

It's terrible for performance because you're giving up control, even if it is a little safer overall.

That'd be silly. Among many other problems, it would mean you can't do things like take a substring of a const buffer via pointer arithmetic, because you'd need to add a header in the middle.

It's better to pick exactly one of: (1) embrace the true nature of allocations and figure out lengths yourself or (2) if you are less comfortable with that, use some other language which doesn't expose any of this.

> That'd be silly. Among many other problems, it would mean you can't do things like take a substring of a const buffer via pointer arithmetic, because you'd need to add a header in the middle

Right, and now you understand why free should have always accepted a size parameter, just like malloc.

No, I don't. Doing zero-copy substrings of a read only buffer and an allocator chopping up a writable buffer into multiple chunks are very different scenarios.

Not for this purpose. It's all about knowing where the boundaries are.

I take it you didn't read the fine article.

Yes I did. My point is the C++ standard library has more information about your allocations than malloc() or free() does so it makes sense for it to attempt to do something smarter.

Since the 'fine article' didn't really get in to any interesting details, i'll leave it at that.
