Hacker News new | past | comments | ask | show | jobs | submit login
Facebook's std::vector optimization (github.com)
275 points by iamsalman on Aug 30, 2014 | hide | past | web | favorite | 89 comments

If you're interested in these sorts of micro-optimizations, you may find Mozilla's nsTArray (essentially std::vector) interesting.

One of its unusual design decisions is that the array's length and capacity is stored next to the array elements themselves. This means that nsTArray stores just one pointer, which makes for more compact DOM objects and so on.

To make this work requires some cooperation with Firefox's allocator (jemalloc, the same one that FB uses, although afaik FB uses a newer version). In particular, it would be a bummer if nsTArray decided to allocate space for e.g. 4kb worth of elements and then tacked on a header of size 8 bytes, because then we'd end up allocating 8kb from the OS (two pages) and wasting most of that second page. So nsTArray works with the allocator to figure out the right number of elements to allocate without wasting too much space.

We don't want to allocate a new header for zero-length arrays. The natural thing to do would be to set nsTArray's pointer to NULL when it's empty, but then you'd have to incur a branch on every access to the array's size/capacity.

So instead, empty nsTArrays are pointers to a globally-shared "empty header" that describes an array with capacity and length 0.

Mozilla also has a class with some inline storage, like folly's fixed_array. What's interesting about Mozilla's version, called nsAutoTArray, is that it shares a structure with nsTArray, so you can cast it to a const nsTArray*. This lets you write a function which will take an const nsTArray& or const nsAutoTArray& without templates.

Anyway, I won't pretend that the code is pretty, but there's a bunch of good stuff in there if you're willing to dig.


> One of its unusual design decisions is that the array's length and capacity is stored next to the array elements itself.

GNU stdlibc++ does this for std::string so you get prettier output in the debugger. The object itself only contains a char*.

Seems like that would prevent small string optimization (a union of a small char array and the heap char pointer.) That lets me store about 85% of the strings in my assembler without any heap allocation, and is a huge win in my book.

Indeed, they don't do SSO. Reference counted COW, which violates the standard.

When the request for growth comes about, the vector (assuming no in-place resizing, see the appropriate section in this document) will allocate a chunk next to its current chunk

This is assuming a "next-fit" allocator, which is not always the case. I think this is why the expansion factor of 2 was chosen - because it's an integer, and doesn't assume any behaviour of the underlying allocator.

I'm mostly a C/Asm programmer, and dynamic allocation is one of the things that I very much avoid if I don't have to - I prefer constant-space algorithms. If it means a scan of the data first to find out the right size before allocating, then I'll do that - modern CPUs are very fast "going in a straight line", and realloc costs add up quickly.

Another thing that I've done, which I'm not entirely sure would be possible in "pure C++", is to adjust the pointers pointing to the object if reallocation moves it (basically, add the difference between the old and new pointers to each reference to the object); in theory I believe this involves UB - so it might not be "100% standard C" either, but in practice, this works quite well.

I solved an "realloc is really costly" problem by ditching the memory-is-contiguous notion, paying a little more (really just a few cycles) for each access rather than spending tons of time shuffling stuff around in memory. This eliminated nearly all reallocations. The extra bit of computation was invisible in the face of cache misses.

I'm guessing that most customers of std::vector don't really need contiguous memory, they just need something that has fast linear access time. In this sense, std::vector is a poor design.

No, that's quite intentional. If you don't need contiguous memory, you can consider using std::deque. http://www.cplusplus.com/reference/deque/deque/

There are weasel words in the standard that let implementations still be pretty inefficient. The problem is that memory reallocation is a leaky abstraction, and an interface that doesn't make guarantees about memory behavior can't be relied upon at scale.

The implementation of std::deque I just read uses a circular queue, resized to a power of two upon growth. It would exhibit the same bad performance as std::vector.

Are you sure? You're not allowed to invalidate iterators or pointers to elements in a deque, so it shouldn't be reallocating memory (aside from its underlying map of nodes, which will need to grow very rarely).

Libstdc++ describes how it's implemented here: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-...

Libc++ doesn't have as descriptive a comment but it's impemented basically the same way here https://github.com/llvm-mirror/libcxx/blob/master/include/de...

You might not want to use deque because of how much memory it can use while still small, e.g. libc++s implementation uses a 4KiB page:


In the GNU stdlibc++ implementation the object itself is pretty large (and they use a page size of 512 bytes):


I implemented what you describe using the BSR (BitScanReverse) instruction as base of an indirect addressing of an exponentially growing data size.

Essentially, instead of a direct addressing like V[i], you can do VV[log2(i)][i]. Where log2() is implemented with a single BSR instruction.

The addressing table for the first level is very small in size, and it can be assumed always in cache.

Here the .h/.c source: https://github.com/amadvance/tommyds/blob/master/tommyds/tom...


Did you used something different ?

You might be interested in this paper


Contiguity, a semblance of control over what stays in cache what doesn't and potential for vectorization is precisely the reason that I use std:vector when I do. I cannot emphasize this enough. Getting the cache behavior correct pretty much makes or breaks my code, in the sense it often dominates performance characteristics once I have got the algorithm correct. One can have speedups of several tens of multipliers. You may work on a project where performance is irrelevant, but that does not mean performance is universally irrelevant.

In short std:vector has served me well, I dont want it to lose performance.

There is a major use case for std::vector having contiguous memory, which is passing a reference to the first element of a vector into a function expecting a pointer. This is fine assuming the vector is not empty.

std::vector is supposed to work in the common cases, so you don't need to manually realloc for small/medium sized vectors. If you need to handle a lot of data, avoiding realloc is still an issue and you should pre-compute the length and use vector::resize (or the length-aware constructor).

The latter nicely addresses to the recent HN discussion about "Friendly C"; may I ask what kind of software you are building, incidentally? (...don't tell it's medical :O)

I'm glad to see this catch on and the C level primitives get greater use.

This has been a well known problem in the C++ community for years, in particular Howard Hinnant put a lot of work into this problem. I believe the fundamental problem has always been that C++ implementations always use the underlying C implementations for malloc and friends, and the C standards committee could not be pursaded to add the necessary primitives.

A few years ago I tried to get a reallic which did not move (instead returned fail) into glibc and jealloc and failed. Glad to see someone else has succeeded.

I think the Folly small vector library is much more interesting and can yield better performance (if you hit the sweet spot).


From what I understand, using a "rvalue-reference ready" vector implementation with a good memory allocator must work at least as good as FBVector.

Apparently the libstdc++ people aren't entirely convinced by the growth factor claims:


The bit about special 'fast' handling of relocatable types should be obviated by r-value references and move constructors in C++11/14, right?

I.e. if we want fast push_back() behavior, we can use a compiler that knows to construct the element directly inside the vector's backing store rather that creating a temporary object and copying it into the vector.

emplace_back was added in C++11 which does just that: http://en.cppreference.com/w/cpp/container/vector/emplace_ba...

You can get around a lot of these issues by reserving the size needed up front, or using a custom allocator with std::vector. Not as easy, but still doable.

The reallocation issue isn't fixable this way however...

You can use a feedback directed optimization pass to choose the initial size.

TLDR; The author of malloc and std::vector never talked to each other. We fixed that!

... also most types are memmovables

Are there benchmarks, speedup? Seems strange to leave out that information or did I just miss it?

I don't see the results. Like a graph that shows std::vector vs. folly. I mean isn't that the entire point?

Is it normal to notate powers using double caret notation? (i.e. ^^) I've only ever seen it using a single caret (^), in what presumably is meant to correspond to an ascii version of Knuth up arrow notation (https://en.wikipedia.org/wiki/Knuth's_up-arrow_notation). I found it a bit strange, and confusing in the article having powers denoted using ^^, and had to go back to make sure I wasn't missing anything.

It's likely to distinguish it from XOR, which is the carat operator in many programming languages.

It is still weird. * * is the other established convention used for exponentiation in languages like Python, Ada, m4, and others.

It's math, not computer code. ^ means exponentiatuon. The author was just being weird.

But * * has an entirely different meaning in C/C++.

Only as a prefix operator; * has a pointer-related meaning for prefix and an arithmetic meaning for infix, there's no reason * * shouldn't be the same.

That'd be a syntactic ambiguity (in C-type languages) between "a * * b" being a to the power of b (i.e. "(a) * * (b)") or "a * * b" being a times the dereferencing of the pointer b (i.e. "(a) * (*b)"). You may be able to disambiguate this by prescedence, though it would be very ugly in the lexer (you could never have a STARSTAR token, it would have to be handled in the grammar) and would be terribly confusing.

Understandable, but in every other mathematical context (LaTeX for example) a single caret is used as the exponential operator. I guess it's just a convention that's different in different cultures/crowds.

Yep, this is my biggest issue with C++: you now have lambdas functions and an insane template spec, but you just can not "realloc" a new[] array. Guys, seriously ?

If you need to realloc a fixed size array, souldn't you use a std::vector instead?

You probably should, but the problem is still there because std::vector implementations don't use realloc. They call new[] with the new size, copy over the data and delete[] the old chunk. This eliminates the possibility to grow the vector in-place.

Actually, I don't think it is possible to implement vector in the way you describe (so I'd argue that in fact there is not even a single working implementation that does that): using new[] will cause a default construction of all the elements, but std::vector must copy construct (or, in C++11, move construct, but never assign) the elements from the old array to the new array. AFAIK all implementations of std::vector thereby have a separate phase of allocating memory for the underlying storage of the array (which is done using an allocator, and which essentially cannot be implemented in terms of new) and constructing the elements (which is done using the placement new operator in a loop as it copies the data).

The real issue here is that std::allocator doesn't support realloc. (Note that you couldn't really use realloc as spec'd, though, because in the case where it fails to reallocate in place it blows away the old memory area and copies the data itself, which would have incorrect semantics: this is actually discussed in the article from Facebook as the "notorious design of realloc()" which made the usage of jemalloc required for this task. However, std::allocator could still have features for handling this situation, and implements based on malloc would simply have to not support the reallocate functionality. I imagine if more allocators actually supported the idea of reallocation this could be proposed to the committee.)

That said, the statement that std::vector can't grow the chunk in place is still way too heavy-handed: std::vector will grow its size in place within the memory area allocated to it by deconstructing and constructing elements within the over-sized memory area it has allocated. In general, this memory area grows by doubling in size whenever the capacity is breached, which leads to vector having amortized constant time complexity to insert a new element.

> which is done using an allocator, and which essentially cannot be implemented in terms of new

Actually the standard says std::allocator<T> just calls the global operator new function[0]. When you call construct(), placement-new is then used to construct the objects in place. This makes it easy to replace the memory allocator globally as well as on a per-container basis.

[0] http://en.cppreference.com/w/cpp/memory/new/operator_new

It's the same with realloc: there is no guarantee that it will grow the chunk in place.

No. Modern realloc are efficient, when moving large memory blocks, because they rely on the kernel ability to quickly relocate memory regions without involving memcpy() (through mremap() on Linux).

Edit: shamelessly citing my blog entry on this subject: http://blog.httrack.com/blog/2014/04/05/a-story-of-realloc-a...

This isn't true for either of the common high performance mallocs. tcmalloc doesn't support mremap at all, and jemalloc disables it in its recommended build configuration. glibc will remap, but any performance gains that get realized from realloc tricks would probably be small compared to the benefits of switching to a better malloc implementation.

Fantastic blog post! Now I don't have to start digging myself :). I've always thought realloc could do a few optimizations, glad to find out the details.

guarantee != possibility. There's no guarantee with realloc, but there's no possibility with new[], copy and delete[].

You can't grow the size you allocated with new[] in-place, and because you need to retain the existing data it's not safe to delete[] the old buffer, call new[] and hope that it points to the previous memory address and assume that the existing data remains intact.

A realloc implementation can try to grow the buffer if there's contiguous space, and if it succeeds it doesn't need to deallocate or copy anything. I haven't had a look at realloc implementations so I don't know if that, or other optimizations are done in practice, but I assume that realloc's worst performance case is somewhere around new[], copy and delete[]'s best case.

The copy mechanism in std::vector may also have a significant overhead over realloc if it has to call the copy constructor (or ideally move constructors in C++11) of every object in the vector, although I can imagine a C++ equivalent of realloc doing so too.

The main issue is that you can't say to realloc "if you can't expand in place, just leave it"

This is occasionally annoying in C, but not a great issue. In C++, just memcpying classes leads to awful things happening.

In reply to sibling posts.

- if you need dynamic storage you should use a std::vector instead of new [].

- with std::vector-s, if you want to save the cost of the copy you can redefine the vector allocator to use realloc or even mremap.

Edit @xroche: For example, you can create an allocator that manages an initial chunk of memory (allocated with malloc) and grows it when it's required using realloc inside the 'allocate' call.

Edit2: So I've implemented an allocator that behaves (almost) like a described above but I admit that it's a bit more hacky than I though. Drop me an email if you're interested.

I'm skeptical with the ability to have movable objects with just a dedicated allocator. How do you handle that with only allocate() and deallocate() ?

Edit² @bnegreve: but how do you handle moved blocks then ? And this would involve copying all objects anyway, because this is the std::vector's logic to allocate+memcpy+deallocate when the capacity is reached. The only solution is to ... rewrite std::vector, which is precisely what FB did :)

Why should C++persist in C insecure design decisons?

Besides realloc can force a memory move anyway. Its behavior depends from implementation and memory state.

How realloc is insecure ? The only lacking construct is a "moved" operator, which would re-assign pointer members, for example, when moving this to another location.

Typical security exploit with realloc is attacking code that assumes realloc always success and never moves, thus keeping all pointers around.

Assuming that realloc() always succeeds is one thing, I can believe there's code that makes it. But I'd very much like to look at the code that assumes that realloc() never moves the block. This sounds a remarkably elaborate assumption to make, hardly an oversight due to ignorance.

Lots of enterprise code with off-shoring provide entertainment of code quality.

In Qt, you can mark types as Q_MOVABLE_TYPE and get optimizations from a lot of containers

The factor-2 discussion is quite interesting. What if we could make the next allocated element always fit exactly in the space left over by the old elements?

Solving the equations suggest a fibonacci like sequence, seeded by something like "2, 3, 4, 5". Continuing 9, 14, 23 etc.

There was an academic paper in the 70s on Fibonacci-based allocation strategy... though I can't seem to find it now. So the idea is definitely not new :)

he golden ratio and then rounding down. What's the point of putting 4 into the seed sequence?

Without 4, the sequence would be 2, 3, 5. Then the next value would be 9 by fibonacci. But that's bigger than the 5 we get from adding up all the unallocated pieces (2 and 3).

We could use just 2, 3, 4 as a seed, but we can't use the fibonacci formula before the fourth element is added. Try some different seeds for yourself, it's trickier than you'd think.

For big vectors, if there is obvious way, I always hint vector with reserve() - for example knowing in advance how much would be copied, even if a bit less gets copied (or even if a bit more, at the cost of reallocation :().

Then the teleporting chief would have to shoot the original

As an aside, there was a great Star Trek novel where there was a long range transporter invented that accidentally cloned people.

(I think it was "Spock Must Die")

There's also the ST:TNG episode "Second Chances". http://en.wikipedia.org/wiki/Second_Chances_%28Star_Trek:_Th... where Riker is duplicated. (There's also the good/bad Kirk in 'The Enemy Within', and the whole mirror universe concept, but those aren't duplicates.)

i used to be a big fan of this sort of stuff, but the better solution for many of the problems described is to avoid array resizing.

if std::vector is your bottleneck you have bigger problems i suspect.

reminds me a bit of eastl as well... which is much more comprehensive: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n227...

If you're spending a lot of time changing the size of a std::vector array, then maybe std::vector isn't the right type of structure to begin with...

How is it reasonable to expect that previously freed memory would be available later for the vector to move to?

Greetings Facebook, several decades ago welcomes you. Game programmers figured out the same and arguably better ways of doing this since each version of std::vector has been released. This is but a small reason most of us had in-house stl libraries for decades now.

Most of the time if performance and allocation is so critical, you're better off not using a vector anyway. A fixed sized array is much more cache friendly, makes pooling quite easy, and eliminates other performance costs that suffer from std::vector's implementation.

More to the point, who would use a c++ library from Facebook? Hopefully don't need to explain the reasons here.

> most of us had in-house stl libraries for decades now

I've no doubt there has been many talented programmers for decades, but what about open-sourcing these "in house" libraries if they are so great, to avoid countless other programmers the need to reinvent the wheel? I think it would be more constructive than bashing Facebook for being late to the party.

I personally believe that "decades" would be an exaggeration.

As to why people don't open source their libraries - it's a big hassle and doesn't especially benefit them personally. Indeed if other people are going to have to spend time reinventing the wheel I suppose some might even see it as a competitive advantage. And as a general rule, video games companies don't seem to be much into open source anyway.

(Additionally, if the ones I've used are anything to go by, the libraries aren't so great. They attack specific problems with std::vector, and that's fine, but usually not in any general way. No need to do a full extension of the allocator mechanism if you know you can just call your company's allocator's special try_realloc function, for example - but now of course nobody else can use it. And there usually isn't any impetus to "fix" this, because it isn't actually a problem.)

> More to the point, who would use a c++ library from Facebook? Hopefully don't need to explain the reasons here.

Could you explain them for those of us not in the loop? Does Facebook have a bad reputation for C++?

I would also like expanations, because Facebook actually has a good reputation when it comes to their compiled languages engineers. Their C++ and D engineers built HHVM/Hack, their OCaml engineers built some great analysis tools and much of the supporting code for the HHVM/Hack platform, etc., the list goes on, so I'd like to know why someone would want to avoid their C++ library based on the "Facebook" name only.

Because Facebook also has a reputation of not playing nice with people, the rules, intellectual property, and so on. This is hardly a company anyone should support or trust and if you can't figure that out, I can't help you.

As far as their work on HHVM, it was necessary due to failure by bad technology choices from the start. There's very little interesting about this work unless you somehow love PHP, want to make debugging your production applications more difficult, and refuse to address your real problems. I am 100% sure no one outside of the PHP community cares about anything Facebook has done in C++.

Simply having a large company with lots of developers who might have even had good reputations elsewhere or even be smart doesn't mean much. Having worked in many places with lots of smart developers, I can tell you stories about too many geniuses in the room. Calling Facebook developers engineers is also about as apt as calling janitors sanitation engineers. We're programmers, or developers, or perhaps software architects at best depending on the position. I happen to have an EE and CS degree but given I do programming for a living, I'd hardly call myself an engineer. But we're way off topic :)

> As far as their work on HHVM, it was necessary due to failure by bad technology choices from the start.

HHVM arose out of Facebook's desire to save on server purchasing and operating costs. Facebook could run perfectly well without it on Plain Old PHP, but they'd have to buy and power more servers.

I'd hardly call PHP a "bad technology choice" given the outstanding financial success of many companies that use it.

I think its more of a "lol Facebook" than actual Facebook having a bad reputation. They do have a lot of money to hire talented engineers, it maybe that this solution is poor in the edge case the other post was talking about, but it may also be possible their solution is poor for Facebook. Facebook also contributes A LOT to open source projects and you can see their employee email domain all over open source projects.

I could also probably speculate their thinking along the lines of backdoors or something. But Facebook does have a good reputation with open source projects, so I wouldn't be too worried about that. (As in no more worried than some other random project)

(As someone who started doing professional game development in 2000, I will concur that every single point mentioned in this article, including the usage of an IsRelocatable template trait, is common knowledge in the game developer community.)

Yeah, you would need to provide an explanation. Facebook's folly is the best, cleanest set of low-level c++ libs I've ever seen.

The fact the something has been done in industry 'x' does not negate the value of this article/library.

> Game programmers figured out the same and arguably better ways of doing this since each version of std::vector has been released. This is but a small reason most of us had in-house stl libraries for decades now.

OTOH, game programmers don't share this kind of knowledge on public fora as freely as Internet companies do. It may be "old news" to you, but it was unpublished. Knowledge is great, but dissemination is even better.

Actually, game companies as well as various indie developers have widely distributed this knowledge. Some examples of places to look/things to search for implementations or similar/better:

-GDC Slide Decks -Game Programming Data Structure/Performance Books -Game Industry Publications - ex: Game Programming Gems -EA STL -Game Dev Sites -Any available source to various engines/games in the last many years should have more than a few inside

Moreover, this is also a common thing in a lot of embedded and high-performance development in general.

Just because someone didn't make a github project and post it on HN doesn't mean it doesn't exist. This seems to be a trend with web developers - reinventing what exists without properly checking what is there, and doing so usually poorly. I say this as someone who has done game development and web development at various points of my career (currently web by day, game by night) so I am not trying to be condescending, just pointing out an unfortunate trend.

Downvote if you want, but I see the first game dev who replied already confirmed what I said.

I agree that there are other public fora than company-sponsored blog posts, but I don't see your industry contributing its accumulated knowledge back to the library developers, either (assuming you're relying on open-source C++ libraries).

If this truly was a well-known issue long ago, it should also have been fixed long ago by game developers contributing back to the open-source developers who have made their work possible so that the rest of us didn't have to rediscover these performance issues.

If you want to avoid this kind of criticism, be a part of the solution outside your own industry walls; don't just brag post hoc about how you solved it in your castle for yourselves.

I went and actually checked the first 6 links in this search, and none of them really address the problems brought up in the OP. If you're going to be arrogant about this issue, you should at least be correct.

EA open sourced parts of their in-house STL. A Github clone can be found at:


I haven't looked at it myself but it might be an interesting start for those interested.

Isn't Facebook itself an STD vector?

Funny :)

Show me a programmer who is trying to reoptimize the STL, and I'll show you a programmer who is about to be laid off.

The guy who tried this at EA didn't last long there.

Games and other high performance users generally used to stay away from the STL. Sony had their own version of STL which addressed some of these issues.

> The guy who tried this at EA didn't last long there.

Paul Pedriana, the man responsible for the lions share of the EASTL work, started at Maxis before it was acqired by EA in 1997. EASTL has been in continuous development for more than 10 years.

The STL is not optimized at all in this case, this is precisely the point. And like it or not, but Facebook has talented engineers to do that.

The STL isn't a library that can be optimized - it's a interface definition with expected complexity requirements. By it's nature (i.e. not tied to a platform) it doesn't have specific benchmark numbers. Specific implementations (e.g. MSVCRT, the GC++ implementation, the clang implementation) can be, and are.

> ... Rocket surgeon

That's a new one. Usually it's rocket scientist or brain surgeon. What exactly does a rocket surgeon do? :)

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact