If you're interested in these sorts of micro-optimizations, you may find Mozilla's nsTArray (essentially std::vector) interesting.
One of its unusual design decisions is that the array's length and capacity are stored next to the array elements themselves. This means that nsTArray stores just one pointer, which makes for more compact DOM objects and so on.
To make this work requires some cooperation with Firefox's allocator (jemalloc, the same one that FB uses, although afaik FB uses a newer version). In particular, it would be a bummer if nsTArray decided to allocate space for e.g. 4kb worth of elements and then tacked on a header of size 8 bytes, because then we'd end up allocating 8kb from the OS (two pages) and wasting most of that second page. So nsTArray works with the allocator to figure out the right number of elements to allocate without wasting too much space.
We don't want to allocate a new header for zero-length arrays. The natural thing to do would be to set nsTArray's pointer to NULL when it's empty, but then you'd have to incur a branch on every access to the array's size/capacity.
So instead, empty nsTArrays point to a globally shared "empty header" that describes an array with capacity and length 0.
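A minimal sketch of that layout (my illustration with made-up names, not Mozilla's actual code):

    #include <cstdint>

    // The header sits directly in front of the elements; every empty
    // array points at one shared, read-only header instead of at NULL.
    struct Header {
      uint32_t length;
      uint32_t capacity;
    };

    static Header sEmptyHeader = {0, 0};

    template <typename T>
    class Array {
     protected:
      Header* mHdr = &sEmptyHeader;  // never null: no branch on access

     public:
      uint32_t Length() const { return mHdr->length; }
      uint32_t Capacity() const { return mHdr->capacity; }
      T* Elements() { return reinterpret_cast<T*>(mHdr + 1); }
    };

Appending to an empty array then just sees capacity 0 and allocates a real header plus buffer, so the empty state costs no allocation and no special-case branch on reads.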
Mozilla also has a class with some inline storage, like folly's small_vector. What's interesting about Mozilla's version, called nsAutoTArray, is that it shares its layout with nsTArray, so you can cast it to a const nsTArray*. This lets you write a function that takes a const nsTArray& or a const nsAutoTArray& without templates.
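Continuing the sketch above, the inline-storage variant might look like this (again hypothetical; the layout compatibility it relies on is not something standard C++ guarantees in general):

    template <typename T, uint32_t N>
    class AutoArray : public Array<T> {
      // Inline header + element storage, laid out exactly like a
      // heap-allocated buffer so the base class can't tell the difference.
      alignas(Header) alignas(T) unsigned char mInline[sizeof(Header) + N * sizeof(T)];

     public:
      AutoArray() {
        Header* h = reinterpret_cast<Header*>(mInline);
        h->length = 0;
        h->capacity = N;
        this->mHdr = h;  // base-class pointer now targets the inline buffer
      }
    };

    // Any function taking const Array<T>& accepts both kinds, no templates:
    uint32_t TotalLength(const Array<int>& a) { return a.Length(); }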
Anyway, I won't pretend that the code is pretty, but there's a bunch of good stuff in there if you're willing to dig: http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsT...
Seems like that would prevent the small string optimization (a union of a small char array and the heap char pointer). That lets me store about 85% of the strings in my assembler without any heap allocation, and is a huge win in my book.
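For reference, the small-string layout looks roughly like this (a sketch under my own assumptions, not the parent's actual code):

    #include <cstdlib>
    #include <cstring>

    struct SmallString {
      union {
        char* heap;    // long strings: pointer to a malloc'd buffer
        char inl[16];  // short strings: stored inline, no allocation
      };
      unsigned char len;
      bool onHeap;

      void assign(const char* s) {  // (frees/destructors elided)
        std::size_t n = std::strlen(s);
        onHeap = n >= sizeof(inl);
        char* dst = onHeap ? (heap = static_cast<char*>(std::malloc(n + 1)))
                           : inl;
        std::memcpy(dst, s, n + 1);
        len = static_cast<unsigned char>(n);
      }
      const char* c_str() const { return onHeap ? heap : inl; }
    };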
> When the request for growth comes about, the vector (assuming no in-place resizing, see the appropriate section in this document) will allocate a chunk next to its current chunk
This is assuming a "next-fit" allocator, which is not always the case. I think this is why the expansion factor of 2 was chosen - because it's an integer, and doesn't assume any behaviour of the underlying allocator.
I'm mostly a C/Asm programmer, and dynamic allocation is one of the things that I very much avoid if I don't have to - I prefer constant-space algorithms. If it means a scan of the data first to find out the right size before allocating, then I'll do that - modern CPUs are very fast "going in a straight line", and realloc costs add up quickly.
Another thing that I've done, which I'm not entirely sure would be possible in "pure C++", is to adjust the pointers that point to the object if reallocation moves it (basically, add the difference between the old and new addresses to each reference to the object). In theory I believe this involves UB, so it might not be "100% standard C" either, but in practice it works quite well.
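A sketch of that fixup (my illustration; as noted above, the arithmetic involving the freed old base is formally undefined behavior):

    #include <cstdlib>

    struct Buf { char* data; std::size_t size; };

    // After realloc moves the block, shift every outstanding pointer into
    // it by the distance the block moved.
    void grow_and_fixup(Buf* buf, std::size_t new_size,
                        char** refs[], std::size_t nrefs) {
      char* old_base = buf->data;
      char* new_base = static_cast<char*>(std::realloc(old_base, new_size));
      if (!new_base) std::abort();          // error handling elided
      for (std::size_t i = 0; i < nrefs; i++)
        *refs[i] = new_base + (*refs[i] - old_base);  // the UB in question
      buf->data = new_base;
      buf->size = new_size;
    }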
I solved an "realloc is really costly" problem by ditching the memory-is-contiguous notion, paying a little more (really just a few cycles) for each access rather than spending tons of time shuffling stuff around in memory. This eliminated nearly all reallocations. The extra bit of computation was invisible in the face of cache misses.
I'm guessing that most customers of std::vector don't really need contiguous memory, they just need something that has fast linear access time. In this sense, std::vector is a poor design.
There are weasel words in the standard that let implementations still be pretty inefficient. The problem is that memory reallocation is a leaky abstraction, and an interface that doesn't make guarantees about memory behavior can't be relied upon at scale.
The implementation of std::deque I just read uses a circular queue, resized to a power of two upon growth. It would exhibit the same bad performance as std::vector.
Are you sure? You're not allowed to invalidate iterators or pointers to elements in a deque, so it shouldn't be reallocating memory (aside from its underlying map of nodes, which will need to grow very rarely).
Contiguity, a semblance of control over what stays in cache and what doesn't, and the potential for vectorization are precisely the reasons I use std::vector when I do. I cannot emphasize this enough. Getting the cache behavior right pretty much makes or breaks my code, in the sense that it often dominates performance once I have the algorithm correct; speedups of several tens of times are possible. You may work on a project where performance is irrelevant, but that does not mean performance is universally irrelevant.
In short, std::vector has served me well; I don't want it to lose performance.
There is a major use case for std::vector having contiguous memory: passing a reference to the first element of a vector into a function expecting a pointer. This is fine as long as the vector is not empty.
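For example (process() here is a hypothetical C API; in C++11, v.data() expresses the same thing):

    #include <cstddef>
    #include <vector>

    // Hypothetical C API expecting a contiguous (pointer, length) pair.
    extern "C" void process(const double* data, std::size_t n);

    void call_it(const std::vector<double>& v) {
      if (!v.empty())                // &v[0] on an empty vector is UB
        process(&v[0], v.size());    // or v.data() in C++11
    }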
std::vector is supposed to work in the common cases, so you don't need to manually realloc for small/medium sized vectors. If you need to handle a lot of data, avoiding realloc is still an issue and you should pre-compute the length and use vector::resize (or the length-aware constructor).
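For instance (a small sketch; the parent names resize and the sized constructor, while reserve is the variant for when you'd rather push_back):

    #include <cstddef>
    #include <vector>

    std::vector<int> load_all(std::size_t n) {
      std::vector<int> v(n);          // length-aware constructor: one allocation
      for (std::size_t i = 0; i < n; ++i)
        v[i] = static_cast<int>(i);   // fill in place, no regrowth
      return v;
    }

    void append_batch(std::vector<int>& v, std::size_t extra) {
      v.reserve(v.size() + extra);    // at most one reallocation
      for (std::size_t i = 0; i < extra; ++i)
        v.push_back(static_cast<int>(i));
    }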
The latter ties in nicely with the recent HN discussion about "Friendly C"; may I ask what kind of software you are building, incidentally? (...don't tell me it's medical :O)
I'm glad to see this catch on and the C level primitives get greater use.
This has been a well-known problem in the C++ community for years; in particular, Howard Hinnant put a lot of work into it. I believe the fundamental problem has always been that C++ implementations use the underlying C implementation of malloc and friends, and the C standards committee could not be persuaded to add the necessary primitives.
A few years ago I tried to get a realloc which did not move (instead returning failure) into glibc and jemalloc, and failed. Glad to see someone else has succeeded.
The bit about special 'fast' handling of relocatable types should be obviated by r-value references and move constructors in C++11/14, right?
I.e. if we want fast push_back() behavior, we can use a compiler that knows to construct the element directly inside the vector's backing store rather than creating a temporary object and copying it into the vector.
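That's what C++11's emplace_back gives you (Widget is a made-up type). Note this removes the copy of the inserted element only; per the article, regrowth still moves every existing element one by one unless the type is known to be relocatable and can be memcpy'd wholesale:

    #include <string>
    #include <utility>
    #include <vector>

    struct Widget {
      std::string name;
      int id;
      Widget(std::string n, int i) : name(std::move(n)), id(i) {}
    };

    void fill(std::vector<Widget>& v) {
      v.push_back(Widget("temp", 1));  // temporary constructed, then moved
      v.emplace_back("in place", 2);   // constructed directly in the buffer
    }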
You can get around a lot of these issues by reserving the size needed up front, or using a custom allocator with std::vector. Not as easy, but still doable.
The reallocation issue isn't fixable this way however...
Is it normal to notate powers using double caret notation? (i.e. ^^) I've only ever seen it using a single caret (^), in what presumably is meant to correspond to an ascii version of Knuth up arrow notation (https://en.wikipedia.org/wiki/Knuth's_up-arrow_notation).
I found it a bit strange and confusing that the article denotes powers using ^^, and I had to go back to make sure I wasn't missing anything.
Only as a prefix operator; * has a pointer-related meaning as a prefix and an arithmetic meaning as an infix, so there's no reason ** shouldn't be the same.
That'd be a syntactic ambiguity (in C-type languages) between "a ** b" meaning a to the power of b (i.e. "(a) ** (b)") and "a ** b" meaning a times the dereference of the pointer b (i.e. "(a) * (*b)"). You may be able to disambiguate this by precedence, though it would be very ugly in the lexer (you could never have a STARSTAR token; it would have to be handled in the grammar) and would be terribly confusing.
Understandable, but in every other mathematical context (LaTeX for example) a single caret is used as the exponential operator. I guess it's just a convention that's different in different cultures/crowds.
Yep, this is my biggest issue with C++: you now have lambda functions and an insane template spec, but you just cannot "realloc" a new[] array. Guys, seriously?
You probably should, but the problem is still there because std::vector implementations don't use realloc. They call new[] with the new size, copy over the data, and delete[] the old chunk. This eliminates the possibility of growing the vector in place.
Actually, I don't think it is possible to implement vector the way you describe (so I'd argue there isn't even a single working implementation that does that): new[] would default-construct all the elements, but std::vector must copy-construct (or, in C++11, move-construct, but never assign) the elements from the old array into the new one. AFAIK all implementations of std::vector therefore have a separate phase for allocating the underlying storage (done through an allocator, which essentially cannot be implemented in terms of new) and for constructing the elements (done with placement new in a loop as the data is copied).
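A rough sketch of those two phases (mine, with exception safety elided):

    #include <cstddef>
    #include <memory>
    #include <new>
    #include <utility>

    template <typename T, typename Alloc = std::allocator<T>>
    void grow(T*& data, std::size_t size, std::size_t& cap,
              std::size_t new_cap, Alloc alloc = Alloc()) {
      T* fresh = alloc.allocate(new_cap);      // phase 1: raw memory only
      for (std::size_t i = 0; i < size; ++i)   // phase 2: move/copy-construct
        ::new (static_cast<void*>(fresh + i)) T(std::move(data[i]));
      for (std::size_t i = 0; i < size; ++i)
        data[i].~T();                          // destroy the old elements
      alloc.deallocate(data, cap);
      data = fresh;
      cap = new_cap;
    }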
The real issue here is that std::allocator doesn't support realloc. (Note that you couldn't really use realloc as spec'd anyway, because in the case where it fails to reallocate in place it blows away the old memory area and copies the data itself, which would have incorrect semantics: this is actually discussed in the Facebook article as the "notorious design of realloc()", which made the use of jemalloc necessary for this task. However, std::allocator could still have features for handling this situation, and implementations based on malloc would simply not support the reallocate functionality. I imagine that if more allocators actually supported the idea of reallocation, this could be proposed to the committee.)
That said, the statement that std::vector can't grow in place is still too heavy-handed: std::vector will grow its size in place within the memory area already allocated to it, constructing (and destroying) elements inside that over-sized area. In general, this memory area doubles in size whenever the capacity is breached, which gives vector amortized constant-time insertion of a new element.
> which is done using an allocator, and which essentially cannot be implemented in terms of new
Actually the standard says std::allocator<T> just calls the global operator new function[0]. When you call construct(), placement-new is then used to construct the objects in place. This makes it easy to replace the memory allocator globally as well as on a per-container basis.
No. Modern realloc implementations are efficient when moving large memory blocks, because they rely on the kernel's ability to quickly relocate memory regions without a memcpy() (through mremap() on Linux).
This isn't true for either of the common high performance mallocs. tcmalloc doesn't support mremap at all, and jemalloc disables it in its recommended build configuration. glibc will remap, but any performance gains that get realized from realloc tricks would probably be small compared to the benefits of switching to a better malloc implementation.
Fantastic blog post! Now I don't have to start digging myself :). I've always thought realloc could do a few optimizations, glad to find out the details.
guarantee != possibility. There's no guarantee of in-place growth with realloc, but there's no possibility of it with new[], copy, and delete[].
You can't grow an allocation made with new[] in place, and since you need to retain the existing data, it's not safe to delete[] the old buffer, call new[], and hope the new buffer lands at the previous address with the data intact.
A realloc implementation can try to grow the buffer if there's contiguous space, and if it succeeds it doesn't need to deallocate or copy anything. I haven't looked at realloc implementations, so I don't know whether that or other optimizations are done in practice, but I assume realloc's worst case is somewhere around the best case of new[], copy, and delete[].
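A toy experiment that shows the distinction (whether the in-place case actually triggers depends entirely on the allocator and the heap state; null checks elided):

    #include <cstdio>
    #include <cstdlib>

    int main() {
      char* p = static_cast<char*>(std::malloc(4096));
      char* q = static_cast<char*>(std::realloc(p, 8192));
      // Same address back: the block was extended, nothing was copied.
      std::printf(q == p ? "grew in place\n" : "moved (contents copied)\n");
      std::free(q);
      return 0;
    }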
The copy mechanism in std::vector may also have significant overhead compared to realloc if it has to call the copy constructor (or, ideally, the move constructor in C++11) of every object in the vector, although I can imagine a C++ equivalent of realloc having to do so too.
- if you need dynamic storage, you should use a std::vector instead of new[].
- with std::vectors, if you want to save the cost of the copy, you can redefine the vector's allocator to use realloc or even mremap.
Edit @xroche: For example, you can create an allocator that manages an initial chunk of memory (allocated with malloc) and grows it when required using realloc inside the 'allocate' call.
Edit2: So I've implemented an allocator that behaves (almost) as described above, but I admit it's a bit more hacky than I thought. Drop me an email if you're interested.
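For the curious, a very rough sketch of that proposal (my reconstruction, not the poster's actual code; the replies below explain why it only suits a single, grow-only vector):

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    template <typename T>
    struct ReallocAllocator {
      using value_type = T;
      T* chunk = nullptr;  // the one buffer this allocator manages

      T* allocate(std::size_t n) {
        // Grow the existing chunk in place when the heap allows it.
        void* p = std::realloc(chunk, n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return chunk = static_cast<T*>(p);
      }
      void deallocate(T*, std::size_t) {
        // The old pointer may already have been invalidated by realloc,
        // so the buffer is freed by whoever owns the allocator instead.
      }
    };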
I'm skeptical of the ability to get movable objects with just a dedicated allocator. How do you handle that with only allocate() and deallocate()?
Edit² @bnegreve: but how do you handle moved blocks, then? And this would involve copying all the objects anyway, because std::vector's logic is allocate+memcpy+deallocate when the capacity is reached. The only solution is to... rewrite std::vector, which is precisely what FB did :)
How is realloc insecure? The only missing construct is a "moved" operator, which would re-assign pointer members, for example, when this is moved to another location.
Assuming that realloc() always succeeds is one thing; I can believe there's code that makes that assumption. But I'd very much like to see the code that assumes realloc() never moves the block. That sounds like a remarkably elaborate assumption to make, hardly an oversight due to ignorance.
The factor-2 discussion is quite interesting. What if we could make the next allocated chunk always fit exactly in the space left over by the old ones?
Solving the equations suggests a Fibonacci-like sequence, seeded by something like "2, 3, 4, 5" and continuing 9, 14, 23, etc.
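Spelling out the constraint (my derivation, for the curious): when chunk c[n+1] is allocated, chunk c[n] still holds the live data, so only c[1] through c[n-1] have been freed. Requiring the new chunk to fit exactly into that hole gives c[n+1] = c[1] + ... + c[n-1]; writing the same equation for c[n+2] and subtracting yields c[n+2] = c[n+1] + c[n], the Fibonacci recurrence. Checking against the seed 2, 3, 4, 5: the next chunks are 2+3+4 = 9, then 9+5 = 14, then 14+9 = 23, matching the sequence above.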
There was an academic paper in the 70s on Fibonacci-based allocation strategy... though I can't seem to find it now. So the idea is definitely not new :)
Without 4, the sequence would be 2, 3, 5. Then the next value would be 8 by Fibonacci. But that's bigger than the 5 we get from adding up all the unallocated pieces (2 and 3).
We could use just 2, 3, 4 as a seed, but we can't use the Fibonacci formula before the fourth element is added. Try some different seeds for yourself; it's trickier than you'd think.
For big vectors, if there is an obvious way, I always hint the vector with reserve() - for example, when I know in advance how much will be copied, even if a bit less ends up being copied (or even a bit more, at the cost of a reallocation :().
There's also the ST:TNG episode "Second Chances". http://en.wikipedia.org/wiki/Second_Chances_%28Star_Trek:_Th... where Riker is duplicated. (There's also the good/bad Kirk in 'The Enemy Within', and the whole mirror universe concept, but those aren't duplicates.)
Greetings Facebook, several decades ago welcomes you. Game programmers figured out the same and arguably better ways of doing this ever since std::vector was first released. This is but a small reason most of us have had in-house STL libraries for decades now.
Most of the time if performance and allocation is so critical, you're better off not using a vector anyway. A fixed sized array is much more cache friendly, makes pooling quite easy, and eliminates other performance costs that suffer from std::vector's implementation.
More to the point, who would use a C++ library from Facebook? Hopefully I don't need to explain the reasons here.
> most of us have had in-house STL libraries for decades now
I've no doubt there have been many talented programmers for decades, but what about open-sourcing these "in house" libraries if they are so great, to spare countless other programmers the need to reinvent the wheel? I think that would be more constructive than bashing Facebook for being late to the party.
I personally believe that "decades" would be an exaggeration.
As to why people don't open source their libraries - it's a big hassle and doesn't especially benefit them personally. Indeed if other people are going to have to spend time reinventing the wheel I suppose some might even see it as a competitive advantage. And as a general rule, video games companies don't seem to be much into open source anyway.
(Additionally, if the ones I've used are anything to go by, the libraries aren't so great. They attack specific problems with std::vector, and that's fine, but usually not in any general way. No need to do a full extension of the allocator mechanism if you know you can just call your company's allocator's special try_realloc function, for example - but now of course nobody else can use it. And there usually isn't any impetus to "fix" this, because it isn't actually a problem.)
I would also like explanations, because Facebook actually has a good reputation when it comes to their compiled-languages engineers. Their C++ and D engineers built HHVM/Hack, their OCaml engineers built some great analysis tools and much of the supporting code for the HHVM/Hack platform, and the list goes on, so I'd like to know why someone would avoid their C++ library based on the "Facebook" name alone.
Because Facebook also has a reputation of not playing nice with people, the rules, intellectual property, and so on. This is hardly a company anyone should support or trust and if you can't figure that out, I can't help you.
As for their work on HHVM, it was only necessary due to bad technology choices made from the start. There's very little interesting about this work unless you somehow love PHP, want to make debugging your production applications more difficult, and refuse to address your real problems. I am 100% sure no one outside of the PHP community cares about anything Facebook has done in C++.
Simply being a large company with lots of developers, some of whom may have had good reputations elsewhere or even be smart, doesn't mean much. Having worked in many places with lots of smart developers, I can tell you stories about too many geniuses in the room. Calling Facebook developers engineers is also about as apt as calling janitors sanitation engineers. We're programmers, or developers, or perhaps software architects at best, depending on the position. I happen to have an EE and CS degree, but given that I do programming for a living, I'd hardly call myself an engineer. But we're way off topic :)
> As far as their work on HHVM, it was necessary due to failure by bad technology choices from the start.
HHVM arose out of Facebook's desire to save on server purchasing and operating costs. Facebook could run perfectly well without it on Plain Old PHP, but they'd have to buy and power more servers.
I'd hardly call PHP a "bad technology choice" given the outstanding financial success of many companies that use it.
I think it's more of a "lol Facebook" thing than Facebook actually having a bad reputation. They do have a lot of money to hire talented engineers. It may be that this solution is poor in the edge case the other post was talking about, but it is also possible that their solution would be poor for Facebook. Facebook also contributes A LOT to open source projects, and you can see their employee email domain all over open source projects.
I could also speculate that their thinking is along the lines of backdoors or something. But Facebook does have a good reputation with open source projects, so I wouldn't be too worried about that. (As in, no more worried than about some other random project.)
(As someone who started doing professional game development in 2000, I will concur that every single point mentioned in this article, including the usage of an IsRelocatable template trait, is common knowledge in the game developer community.)
> Game programmers figured out the same and arguably better ways of doing this ever since std::vector was first released. This is but a small reason most of us have had in-house STL libraries for decades now.
OTOH, game programmers don't share this kind of knowledge on public fora as freely as Internet companies do. It may be "old news" to you, but it was unpublished. Knowledge is great, but dissemination is even better.
Actually, game companies as well as various indie developers have widely distributed this knowledge. Some examples of places to look for implementations of this (or similar/better ones):
- GDC slide decks
- Game programming data structure/performance books
- Game industry publications, e.g. Game Programming Gems
- EASTL
- Game dev sites
- Any available source to various engines/games from the last many years, which should contain more than a few examples
Moreover, this is also a common thing in a lot of embedded and high-performance development in general.
Just because someone didn't make a github project and post it on HN doesn't mean it doesn't exist. This seems to be a trend with web developers - reinventing what exists without properly checking what is there, and doing so usually poorly. I say this as someone who has done game development and web development at various points of my career (currently web by day, game by night) so I am not trying to be condescending, just pointing out an unfortunate trend.
Downvote if you want, but I see the first game dev who replied already confirmed what I said.
I agree that there are other public fora than company-sponsored blog posts, but I don't see your industry contributing its accumulated knowledge back to the library developers, either (assuming you're relying on open-source C++ libraries).
If this truly was a well-known issue long ago, it should also have been fixed long ago by game developers contributing back to the open-source developers who have made their work possible so that the rest of us didn't have to rediscover these performance issues.
If you want to avoid this kind of criticism, be a part of the solution outside your own industry walls; don't just brag post hoc about how you solved it in your castle for yourselves.
I went and actually checked the first 6 links in this search, and none of them really address the problems brought up in the OP. If you're going to be arrogant about this issue, you should at least be correct.
Games and other high performance users generally used to stay away from the STL. Sony had their own version of STL which addressed some of these issues.
> The guy who tried this at EA didn't last long there.
Paul Pedriana, the man responsible for the lion's share of the EASTL work, started at Maxis before it was acquired by EA in 1997. EASTL has been in continuous development for more than 10 years.
The STL isn't a library that can be optimized - it's an interface definition with expected complexity requirements. By its nature (i.e. not tied to a platform) it doesn't have specific benchmark numbers. Specific implementations (e.g. MSVC's, GCC's libstdc++, clang's libc++) can be, and are.