We have a kind of varint-based delta decoder at work that can read up to 15 bytes past the end of its buffer, because it uses uint64s to read the buffer faster, and the buffer doesn't encode byte-length information. Since it doesn't know how many bits wide the last entry will be, and since it reads ahead a little, it can end up reading 16 bytes starting at the last byte and proceeding 15 bytes onward. It doesn't matter that these bytes are filled with garbage, because a well-formed program will only use the bits from one of the valid bytes. The outer code knows how many entries to look for, so it just calls the advance function the correct number of times, and those bits past the end don't get used.
This can lead to a pretty subtle crash because, except in msan mode, reading those fifteen bytes won't cause any problem unless the read crosses a page boundary. This is because you don't corrupt the sentinel bytes, since they're never written. So to detect the bug you either have to run the code in debug mode (which we rarely do, because the code is deep inside a very expensive database system), or you have to get unlucky.
I tried fixing this, but adding a "past the end" test really hurt performance. Instead, we continued with the preexisting solution: copy the data into a buffer fifteen bytes longer than the data, then decode it. T_T
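For readers who haven't seen the pattern: a minimal sketch of the wide-load idea (not the decoder described above; the names and the LEB128-style encoding are assumptions, and a little-endian target is assumed). The point is that each step loads a full 8-byte word with memcpy, which is why there must be readable slack after the last real byte:

  #include <cstdint>
  #include <cstring>

  // Hypothetical sketch: decode one varint starting at p. The memcpy compiles
  // to a single (possibly unaligned) 8-byte load on mainstream targets, so it
  // may touch up to 7 bytes past the terminating byte of the entry.
  inline uint64_t decode_varint_fast(const uint8_t* p, int& consumed) {
      uint64_t w;
      std::memcpy(&w, p, sizeof w);            // wide load; needs slack at the end
      uint64_t value = 0;
      int shift = 0;
      consumed = 0;
      for (int i = 0; i < 8; ++i) {            // handles up to 8-byte varints, for brevity
          uint8_t b = static_cast<uint8_t>(w >> (8 * i));
          value |= static_cast<uint64_t>(b & 0x7f) << shift;
          shift += 7;
          ++consumed;
          if ((b & 0x80) == 0) break;          // stop at the terminating byte;
      }                                        // the garbage bytes above it are never used
      return value;
  }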
Protobufs are full of varints, and we want to perform only one bounds-check per field. So we ensure that the input buffer always has at least 16 bytes available to parse per field. If there are fewer than 16 bytes remaining, we copy these last few bytes of data into a patch buffer and parse from that instead.
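A rough sketch of that patch-buffer trick (not the actual protobuf parser; names and sizes here are illustrative):

  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  constexpr size_t kSlop = 16;   // bytes the fast parser may touch per field

  // Hypothetical helper: return a pointer that is safe to over-read by kSlop.
  // 'patch' is a caller-provided scratch buffer of at least 2 * kSlop bytes.
  inline const uint8_t* tail_safe_ptr(const uint8_t* p, const uint8_t* end,
                                      uint8_t* patch) {
      if (static_cast<size_t>(end - p) >= kSlop)
          return p;                                  // enough room: parse in place
      size_t remaining = static_cast<size_t>(end - p);
      std::memset(patch, 0, 2 * kSlop);              // padding is never meaningful
      std::memcpy(patch, p, remaining);              // copy the last few real bytes
      return patch;                                  // parse from the padded copy
  }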
Or perhaps the software equivalent of https://en.wikipedia.org/wiki/Carcinisation? If it's a common problem then it would not be surprising that similar solutions emerge among wildly different codebases. Like a cross between a design pattern and duct tape.
This is just how you write code like this. When you have OS/processor guarantees that “going a little over” the length is OK and lets you vectorize, then it’s a good idea to do it.
Most modern languages no longer have preprocessors, which is a pity.
In cases like these, you want complex computed constants. You want those computed at build time, and not at runtime. I can:
* Run the math on a whiteboard, compute +5 or +15, hardcode that, and have the world explode when the code changes
* Run the math in code, at runtime, and have the system be slow.
* Code up the math, have it run at build time, and have clear documentation of what the constant is, why, and how it's computed, and have it change automatically if the code changes (see the sketch below).
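For the third option, a minimal C++ sketch (the numbers are made up, but the shape is the point): the slack constant is derived from the decoder's parameters at compile time, so it is documented in one place and tracks the code.

  #include <cstddef>

  // Hypothetical decoder parameters; the real values would live next to the
  // code that depends on them.
  constexpr size_t kWordBytes = 8;   // width of each wide load
  constexpr size_t kWordsRead = 2;   // words touched per entry in the worst case

  // Worst case: a read starts at the very last byte of real data and still
  // touches kWordsRead * kWordBytes bytes, so this much slack is required.
  constexpr size_t kDecoderSlop = kWordsRead * kWordBytes - 1;
  static_assert(kDecoderSlop == 15, "callers pad their buffers by this much");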
There were awesome things done in C++ templates back in the day, with some pretty magical performance. The preprocessor would unroll loops to the appropriate level for a given processor, align things to cache boundaries, and similar.
I think the reason this might have gone away is that with GHz CPUs, few people care as much about performance anymore. Most code is no longer CPU-bound, but programmer-bound, and lower cognitive complexity is more important than better performance.
This sort of requirement is met by compile time computations. I know C++11, D and Rust have them, and I imagine many more languages do as well. The need for this sort of thing clearly has not gone away, we're just moving away from a free-form, textual pre-processing stage.
It's truly very nice :) So much horrible preprocessor stuff goes away. Not that preprocessing can't be done right, it's just that the standard C/C++ preprocessor language breaks almost every rule of maintainable code.
I love compiler sanity checks of math operations with static asserts, and stuff like compile-time checking for UB and cdb.
constexpr int signed_int_fixed_yet() { // compile error: signed overflow in a constant expression
    return std::numeric_limits<int>::max() + 1; // needs <limits>
}
tuplify<PodType>() is wonderful too.
Just a shame it adds so much boilerplate though; like, why couldn't every function have the same constexpr-if-possible property as lambdas? And it's so damned slow and costly to compile. I also keep writing half of a nice interface, then running into template recursion depth, or gcc failing to compile correctly/(identically to clang).
It's also still a bit too limited. Like, consider the case where everything is known at compile time:
int a=5; // let's say sizeof(a) == 4
a<<64; // should be a compile error, not UB
for(int i=0;i<64;++i) a<<i; // should be a compile error
// note that either line above can be removed by the compiler, leaving a=5;
template<class T, int R, int C> struct Matrix {
    std::array<T, R*C> data;
    // constexpr // should be automatic
    T operator()(int row, int col) {
        if is_constexpr_in_context(row)   // now wouldn't this be nice...
            static_assert(row < R, "");   // only if row is a constexpr argument
        return data[row*C + col];
    }
};
Matrix<double,3,4> m;
auto a=m(1,2); // row is known at compile time; compiler will warn, should be a compile error
m(a+3,0); // also known at compile time (array default-init to 0); should warn, should be a compile error
for(int r=0;r<3;++r)
    for(int c=0;c<6;++c)
        a+=m(r,c); // OOB known at compile time, should warn...
It's not like this is that hard either: just transform the statements to use m.at<r,c>() and it works. So it's just the compiler identifying constexpr context automatically, as for lambdas, replacing functions with their argument-templated equivalents, and autogenerating those functions. It can't fix everything, but the number of errors people make when using such libs would massively decrease, especially since templated matrix classes almost universally have visible definitions, so it often applies.
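For what it's worth, the at<r,c>() half of that can already be written by hand today; a minimal sketch (class layout is illustrative):

  #include <array>

  template <class T, int R, int C>
  struct Matrix {
      std::array<T, R * C> data{};

      // Runtime accessor: indices can only be checked at run time.
      T operator()(int row, int col) const { return data[row * C + col]; }

      // Compile-time-indexed accessor: out-of-bounds indices fail to compile.
      template <int Row, int Col>
      T at() const {
          static_assert(Row >= 0 && Row < R, "row index out of range");
          static_assert(Col >= 0 && Col < C, "column index out of range");
          return data[Row * C + Col];
      }
  };

  // Matrix<double, 3, 4> m;
  // auto ok  = m.at<1, 2>();   // fine
  // auto bad = m.at<1, 5>();   // compile error, as the comment wishes m(1,5) were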
> It's also still a bit too limited. Like, consider the case where everything is known at compile time: int a=5; // let's say sizeof(a) == 4 a<<64; // should be a compile error, not UB for(int i=0;i<64;++i) a<<i; // should be a compile error // note that either line above can be removed by the compiler, leaving a=5;
It is UB for signed integers. There is a lot of stuff from C inside C++ :-).
One of the few things I was truly amazed by when doing some C++ courses at university. Contrived OOP or extremely cumbersome string manipulation wasn't really my thing, but being able to have compile-time evaluation of pretty complicated functions without using macros or anything was so cool. I actually thought about wanting it yesterday in Go.
As the sibling comment noted, you don't need preprocessors for this; what you want is compile-time computation. It's much more principled than text-based preprocessing.
You're arguing semantics. I never said "text-based preprocessing," I said "preprocessing." I specifically pointed to C++ templates, which are a preprocessing layer that is most definitely NOT text-based.
I don't think a preprocessor can help in this situation, since what's ultimately being read is the data, and the offsets to be read can't be computed without looking at it.
Just put it in the source code, and if all the information required to calculate the constant is present at compile time - why, then the compiler should calculate it, and optimize away all that hairy math! No need for a separate "preprocessing" step, or a hairy template language.
Reading or writing past the end of a buffer in C is undefined behaviour - you cannot therefore write a test in C to check this, as you have no idea how such a test would behave.
There are lots of techniques for this. For example, you can instrument malloc() to allocate 10 extra bytes, put a specific known value there, and confirm it didn't change.
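A toy version of that instrumentation, for concreteness (an illustrative wrapper, not how any particular allocator actually does it):

  #include <cassert>
  #include <cstdint>
  #include <cstdlib>
  #include <cstring>

  constexpr size_t kRedzone = 10;     // extra bytes after each allocation
  constexpr uint8_t kCanary = 0xAB;   // known fill value

  // Hypothetical wrapper: allocate size + kRedzone bytes and fill the tail.
  void* guarded_malloc(size_t size) {
      uint8_t* p = static_cast<uint8_t*>(std::malloc(size + kRedzone));
      if (p) std::memset(p + size, kCanary, kRedzone);
      return p;
  }

  // Call before freeing (or at checkpoints) to detect writes past the end.
  // Note this only catches over-writes; the over-reads discussed above leave
  // the canary untouched, which is exactly why they are so hard to spot.
  void check_redzone(const void* ptr, size_t size) {
      const uint8_t* tail = static_cast<const uint8_t*>(ptr) + size;
      for (size_t i = 0; i < kRedzone; ++i)
          assert(tail[i] == kCanary && "buffer overflow detected");
  }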
There are much fancier ones too, which make clever use of hardware features like MPX. It's a whole research field.
Find a compiler vendor who offers stronger guarantees than the bare minimum that's in the standard, and run the test suite with their compiler. Undefined behaviour was supposed to offer compiler vendors space to improve, not to be the limit of what was implemented.
Actually it's more likely than not that your vendor already offers those guarantees as a compiler mode. In clang and gcc it's the sanitizers (ubsan, asan, tsan and msan) while in MSVC it's RTC and SDL.
These build modes catch the vast majority of undefined behavior, and with relatively low performance overhead. If you're not running your tests with sanitizers enabled, you're missing out on a nearly free way to locate and eliminate an entire class of bugs.
If the uint64s are aligned, wouldn't it be better to use alignas (to make the buffer aligned as soon as it's allocated)? As far as I see that way you get a longer buffer only when necessary (you never waste memory), plus you could get autovectorized code if the compiler decides that makes sense.
If you mean malloc() by that: malloc will return aligned memory so that you can store all data types, i.e. it is 64-bit aligned at least (so you can store double) but probably has even higher granularity.
It's the downside of malloc not knowing the type you want and returning void*. It has to assume the worst alignment requirements.
Which actually happens to be 16 bytes (128 bits) for SSE instructions.
Is this guaranteed by the standard? "Worst" seems platform-dependent; does that mean it will vary with the platform? What about x86 vs amd64? Could it ever be larger?
uint64_t is 8 bytes, not 16. Why are you overreading up to 15?
How did you conclude that the test hurt performance? I have researched this for over a decade and my conclusion is that the bounds check is for all intents and purposes free, because it is hidden in the latency of the actual memory access (even if it comes from L1 cache). The only way the bounds check can ever hurt performance is if it adds additional memory reads.
That would indicate that you don't have a memory access for the upper bound at all currently, which sounds like you could overread an arbitrary amount, not just up to 15 bytes.
The same is done for decoding camera raw formats. Variable encoding of the number of bits in the stream means you never know when the decode will end. Reading all the file into a buffer longer than needed is fast, whereas the bit pump is performance critical and having to check for end of buffer every time you want to read would be painful.
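The same shape of trick, sketched as a bit pump (illustrative only; assumes a little-endian target and at least 8 bytes of padding after the real data):

  #include <cstdint>

  // Illustrative bit reader: refill() always loads whole bytes with no bounds
  // check, which is safe only because the buffer is over-allocated.
  struct BitReader {
      const uint8_t* pos;
      uint64_t bits = 0;   // buffered bits, least significant consumed first
      int count = 0;       // number of valid bits in 'bits'

      explicit BitReader(const uint8_t* p) : pos(p) {}

      void refill() {
          while (count <= 56) {
              bits |= static_cast<uint64_t>(*pos++) << count;  // may read padding
              count += 8;
          }
      }

      uint32_t read(int n) {   // n <= 32
          refill();
          uint32_t v = static_cast<uint32_t>(bits & ((uint64_t(1) << n) - 1));
          bits >>= n;
          count -= n;
          return v;
      }
  };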
Saddest thing is we're reading these values from a database, and the database SDK does not have a mechanism for reading values into user-controlled buffers without copying. That is a conceptually neat solution to the problem, but since it's a somewhat strange thing to want to do, we can't do it in our case. Storing the extended values is something I also considered, but there are tens or hundreds of billions of them in production. Couldn't really ask for an extra PB of space for the relatively small performance and elegance benefit you get from not copying these things.
Extra copying is the downside if you don't control what you're reading from. Thankfully memcpy is fast, but if your input is very big it's still painful. In Rust I just used an API that allows reading from both an input stream and memory, so that if you are reading directly from a file the copy is avoided.
Is the extra copying even performed? It either has to be a small thing, or processed in chunks, or OOM might be a problem. So if it's small, in this case, will the compiler really copy, or might it just optimize the operation away?
If you feed it an N byte array it will have to allocate an N+16 array and memcpy. If you're reading from the network/disk/etc then there is no need to copy if you can allocate the larger array directly.
If you want to use this fast decoder, it's sometimes better to have it take a different type than a plain array as input, even if the only difference is that arrays of this type have a 15-byte dead space at the end.
Such behavior is not crazy, but the decoder then no longer takes a normal array as input, and describing it as if it did can result in other programmers providing an array that does not follow the expectations of this decoding function.
-edit
I will say that if you have length information, you should be able to use the fast method on all but the last chunk without checking on every data chunk.
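Something like this (a sketch; decode_fast and decode_checked are assumed to exist, with the fast one allowed to over-read by up to 15 bytes and the checked one never over-reading):

  #include <cstddef>
  #include <cstdint>

  size_t decode_fast(const uint8_t* p, uint64_t* out);                           // assumed
  size_t decode_checked(const uint8_t* p, const uint8_t* end, uint64_t* out);    // assumed

  void decode_all(const uint8_t* p, const uint8_t* end, uint64_t* out, size_t count) {
      size_t i = 0;
      // Fast path: safe while a full 16-byte over-read still lands inside the buffer.
      while (i < count && end - p >= 16)
          p += decode_fast(p, &out[i++]);
      // Tail: bounds-checked, only runs for the last handful of entries.
      while (i < count)
          p += decode_checked(p, end, &out[i++]);
  }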
There’s a few nuances like this with OpenGL and Vulkan that you’ll find in the specs. Normally you don’t encounter them because the compiler by default aligns to certain boundaries. Then someone tries to get clever and uses a custom allocator or a packing directive on a struct, and suddenly your vertex buffer isn’t aligned any more.
What I’m getting at is that it likely wasn’t a driver bug. The other drivers were just more tolerant of the incorrect behavior.
And, of course, you carefully documented the reason for the larger buffer and encoded the value into a constant or other documentable source. One hopes you didn't hardcode '5 bytes' with a comment that says 'more bytes safer' and nothing else :-).
How about reading N-15 fast (assuming varints can be as small as a byte) and the rest with bounds-checking? Is N generally too small for that to be useful?
You might also be able to heuristically compare buffer address modulo the page size to N to know whether you need to be safe or not.
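That heuristic could look something like this (a sketch; assumes 4 KiB pages): the over-read can only fault when the 16-byte window crosses into the next page, so check that instead of the exact remaining length. It won't quiet sanitizers, though, since the bytes are still out of bounds.

  #include <cstdint>

  constexpr uintptr_t kPageSize = 4096;   // assumption: 4 KiB pages

  // True if a 16-byte read starting at p stays within p's own page, in which
  // case the over-read cannot touch an unmapped page (it still reads garbage).
  inline bool overread_stays_on_page(const void* p) {
      uintptr_t addr = reinterpret_cast<uintptr_t>(p);
      return (addr % kPageSize) <= kPageSize - 16;
  }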
That's kind of spooky, it sounds like we work for the same company, because we have a very similar thing. I am pretty sure that no one at our company knows its details as well as you just described yours...
Maybe someone should open source an efficient yet safe varint writer?
Do we know how much performance gain we experience daily in our OS code/applications/browsing thanks to these kinds of optimisations? In other words, what would our daily computer use look like if we stuck to not having these kinds of optimisations, e.g. if all code used Rust instead of C?
Maybe in this case. But how many cases are there where we do the thing in C that is fast "because we know it is safe" even though it might not be, and that would be prevented by a higher-level language without direct memory management? The question stands: do we know how much day-to-day benefit we get from these optimisations through the stack we use daily? What would our daily experience look like if we didn't apply these optimisations? Would it still be workable?
Going past the end of a buffer is illegal in C, just as it is in many other languages. The correct way to write these is in architecture-specific assembly.
Without these kinds of overruns many vectorized algorithms would not be possible, so you'd probably see a noticeable performance decrease.
Not uncommon for things like CABAC or CAVLC, or really anything with a bit buffer. Your memory validation tools should notice if you don’t purposefully over allocate and initialize the final bit of memory to account for this. That should be all that is necessary.
A strategy I've seen in similar scenarios is to apply a kind of inverse-ordered read to the last chunk. Instead of reading bytes x through x+n (your actual data) and x+n+1 through x+15 (outside the edge of the buffer) you instead read bytes x+n-15 through x+n and then apply whatever logic you're doing to mask or exclude the unimportant bytes to the beginning of that chunk rather than the end.
Note: you still have to be careful if your entire input is less than 16 bytes, there can be other alignment issues, etc., but it _might_ be better in your situation than a similar fix that other comments have mentioned, which applies the fast read to all but the last partial chunk and then finishes the tail with some slow cleanup code.
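A sketch of that inverse-ordered read with an 8-byte word (illustrative; assumes a little-endian target and a buffer at least 8 bytes long): load the word that ends at the last real byte, then shift so the wanted bytes land at the bottom where the fast path expects them.

  #include <cstdint>
  #include <cstring>

  // The fast path wants an 8-byte little-endian load at 'p', but only n < 8
  // real bytes remain before 'end'. Read the 8 bytes that END at 'end' instead
  // (still in bounds), then shift the unwanted leading bytes away.
  inline uint64_t load_tail_le64(const uint8_t* p, const uint8_t* end) {
      size_t n = static_cast<size_t>(end - p);   // precondition: 0 < n < 8
      uint64_t w;
      std::memcpy(&w, end - 8, sizeof w);        // backward-aligned read
      return w >> (8 * (8 - n));                 // real bytes now at bits 0..8n-1
  }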