
We have a kind of varint-based delta decoder at work that can read up to 15 bytes past the end of its buffer, because it uses uint64s to read the buffer faster, and the buffer doesn't encode byte-length information. Since it doesn't know how many bits wide the last entry will be, and since it reads ahead a little, it can end up reading 16 bytes starting at the last byte and proceeding 15 bytes onward. It doesn't matter that these bytes are filled with garbage, because a well-formed program will only use the bits from one of the valid bytes. The outer code knows how many entries to look for, so it just calls the advance function the correct number of times, and those bits past the end don't get used.

This can lead to a pretty subtle crash because, except in msan mode, reading those fifteen bytes won't cause any problem unless it crosses a page boundary. Sentinel-byte schemes don't catch it either, because you never corrupt the sentinel bytes: the decoder only reads, it never writes. So to detect the bug you either have to run the code in debug mode (which we rarely do, because the code is deep inside a very expensive database system), or you have to get unlucky.

I tried fixing this, but adding a "past the end" test really hurt performance. Instead, we continued with the preexisting solution: copy the data into a buffer fifteen bytes longer than the data, then decode it. T_T
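
For illustration, a minimal sketch of that workaround (the names and container choice are mine, not the actual code): copy the encoded bytes into a scratch buffer with 15 extra zeroed bytes, so the decoder's wide unchecked reads always stay inside the allocation.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical helper: returns a copy of `data` with 15 bytes of readable,
    // zeroed slack at the end, so unchecked wide reads never leave the buffer.
    std::vector<std::uint8_t> CopyWithSlack(const std::uint8_t* data, std::size_t len) {
      std::vector<std::uint8_t> padded(len + 15, 0);
      std::memcpy(padded.data(), data, len);
      return padded;
    }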




The latest optimized version of the protobuf parser does a remarkably similar thing:

https://github.com/protocolbuffers/protobuf/blob/master/src/...

Protobufs are full of varints, and we want to perform only one bounds-check per field. So we ensure that the input buffer always has at least 16 bytes available to parse per field. If there are fewer than 16 bytes remaining, we copy these last few bytes of data into a patch buffer and parse from that instead.
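
Not the actual protobuf code (that's at the link above), but a rough sketch of the patch-buffer idea, with made-up names: when fewer than 16 bytes remain, copy the tail into a small zero-padded buffer and hand the parser that instead.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    constexpr std::size_t kSlopBytes = 16;

    // If at least kSlopBytes are left, parse in place; otherwise copy the tail
    // into `patch` (zero-padded) so the parser can keep assuming 16 readable bytes.
    const std::uint8_t* EnsureSlop(const std::uint8_t* ptr, const std::uint8_t* end,
                                   std::uint8_t (&patch)[2 * kSlopBytes]) {
      const std::size_t remaining = static_cast<std::size_t>(end - ptr);
      if (remaining >= kSlopBytes) return ptr;
      std::memset(patch, 0, sizeof(patch));
      std::memcpy(patch, ptr, remaining);
      return patch;
    }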


(More) permanent link for those reading this a few years later.

https://github.com/protocolbuffers/protobuf/blob/5a7a4a52a79...


Possible indication of common ancestry.


Or perhaps the software equivalent of https://en.wikipedia.org/wiki/Carcinisation? If it's a common problem then it would not be surprising that similar solutions emerge among wildly different codebases. Like a cross between a design pattern and duct tape.


This is just how you write code like this. When you have OS/processor guarantees that “going a little over” the length is OK and lets you vectorize, then it’s a good idea to do it.


The concept of a preprocessor is helpful here.

Most modern languages no longer have preprocessors, which is a pity.

In cases like these, you want complex computed constants. You want those computed at build time, and not at runtime. I can:

* Run the math on a whiteboard, compute +5 or +15, hardcode that, and have the world explode when the code changes

* Run the math in code, at runtime, and have the system be slow.

* Code up the math, have it run at build time, and have clear documentation of what the constant is, why, and how it's computed, and have it change automatically if the code changes (see the sketch just below).
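
A small sketch of that third option in today's C++, using constexpr rather than a textual preprocessor; the names and numbers just mirror the 15-byte-overread decoder from the top comment and are otherwise made up.

    #include <cstddef>

    constexpr std::size_t kReadChunkBytes = 16;                   // widest unchecked read
    constexpr std::size_t kDecoderSlack   = kReadChunkBytes - 1;  // a read may start at the last byte

    // If someone widens the reads, the constant and this check update with them.
    static_assert(kDecoderSlack == 15,
                  "decoder slack changed; re-audit buffer padding at call sites");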

There were awesome things done in C++ templates back in the day, with some pretty magical performance. The preprocessor would unroll loops to the appropriate level for a given processor, align things to cache boundaries, and similar.

I think the reason this might have gone away is that with GHz CPUs, few people care as much about performance anymore. Most code is no longer CPU-bound, but programmer-bound, and lower cognitive complexity is more important than better performance.


This sort of requirement is met by compile-time computations. I know C++11, D and Rust have them, and I imagine many more languages do as well. The need for this sort of thing clearly has not gone away; we're just moving away from a free-form, textual pre-processing stage.


constexpr in C++ is pretty magic. Loops, try/catch, etc. are possible (https://en.cppreference.com/w/cpp/language/constexpr).


It's truly very nice :) So much horrible preprocessor stuff goes away. Not that preprocessing can't be done right, it's just that the standard C/C++ preprocessor language breaks almost every rule of maintainable code.

I love compiler sanity checks of math operations with static asserts, and stuff like compile checking for ub and cdb.

    constexpr int signed_int_fixed_yet() {  // compiler error
      return std::numeric_limits<int>::max() + int(1);
    }

tuplify<PodType>() is wonderful too.

Just a shame it adds so much boilerplate though, like why couldn't every function have the same constexpr-if-possible property as lambdas, and it's so damned slow and costly to compile. I also keep writing half of a nice interface, then running into template recursion depth, or gcc failing to compile correctly / identically to clang.

It's also still a bit too limited. Like, consider the case where everything is known at compile time:

    int a = 5;                   // let's say sizeof(a) == 4
    a << 64;                     // should be a compile error, not UB
    for (int i = 0; i < 64; ++i)
      a << i;                    // should be a compile error
    // note that either line above can be removed by the compiler, leaving a = 5;

    template<class T, int R, int C>
    struct Matrix {
      std::array<T, R*C> data;
      // constexpr // should be automatic
      T operator()(int row, int col) {
        if is_constexpr_in_context(row)  // now wouldn't this be nice...
          static_assert(row < R, "");    // only if row is a constexpr argument
        return data[row*C + col];
      }
    };

    Matrix<double, 3, 4> m;
    auto a = m(1, 2);  // row is known at compile time; compiler will warn, should be a compile error
    m(a + 3, 0);       // also known at compile time (array default-init to 0); should warn, should be a compile error
    for (int r = 0; r < 3; ++r)
      for (int c = 0; c < 6; ++c)
        a += m(r, c);  // out-of-bounds access known at compile time, should warn...

It's not like this is that hard either: just transform the statements to use m.at<r,c>() and it works. So it's just the compiler identifying constexpr context automatically, as it does for lambdas, replacing functions with their argument-templated equivalents, and autogenerating those functions. It can't fix everything, but the number of errors people make when using such libs would massively decrease, especially since templated matrix classes almost universally have visible definitions, so it often applies.


> It's also still a bit too limited. Like, consider the case where everything is known at compile time: [...] a << 64; // should be a compile error, not UB

It is UB for signed integers. There is a lot of stuff from C inside C++ :-).


One of the few things I was truly amazed by when doing some C++ courses during university. Contrived OOP or extremely cumbersome string manipulation wasn't really my thing. But being able to have compile time evaluation of pretty complicated functions without using macros or anything was so cool. Actually thought about wanting it yesterday in Go


Nim too


As the sibling comment noted, you don't need preprocessors for this; what you want is compile-time computation. It's much more principled than text-based preprocessing.


You're arguing semantics. I never said "text-based preprocessing," I said "preprocessing." I specifically pointed to C++ templates, which are a preprocessing layer that is most definitely NOT text-based.


> Most modern languages no longer have preprocessors, which is a pity.

m4 is a preprocessor for all languages.


I don't think a preprocessor can help in this situation, since what's ultimately being read is the data, and the offsets to be read can't be computed without looking at it.


Zig got rid of the preprocessor but has comptime.


Surely this is what optimizing compilers are for?

Just put it in the source code, and if all the information required to calculate the constant is present at compile time - why, then the compiler should calculate it, and optimize away all that hairy math! No need for a separate "preprocessing" step, or a hairy template language.


For the most part, no. Optimizing compilers can't do this.

There are whole classes of problems one runs into from:

- The ability to change variables by reference or through introspection

- Exceptions in intermediate computations

- Etc.

The set of transformations one can do while maintaining guaranteed equivalence is limited.


Option 4:

* Code up a unit test that replicates end of buffer scenarios and verifies that X bytes is enough and X-1 bytes fails.


Reading or writing past the end of a buffer in C is undefined behaviour - you cannot therefore write a test in C to check this, as you have no idea how such a test would behave.


Nope. There are plenty of C/C++ tools which do this sort of test.

This was a popular one for a while, back when C++ was still used for major new applications: https://en.wikipedia.org/wiki/BoundsChecker

There are lots of techniques for this. For example, you can instrument malloc() to allocate 10 extra bytes, put a specific known value there, and confirm it didn't change.
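
A toy version of that instrumented-malloc idea (not BoundsChecker itself; all names here are made up): over-allocate, stamp a known pattern after the requested size, and verify it on free.

    #include <cassert>
    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    constexpr std::size_t kCanaryBytes = 10;      // extra bytes appended to every allocation
    constexpr unsigned char kPattern   = 0xAB;    // known value the program must never touch

    void* checked_malloc(std::size_t n) {
      unsigned char* p = static_cast<unsigned char*>(std::malloc(n + kCanaryBytes));
      if (!p) return nullptr;
      std::memset(p + n, kPattern, kCanaryBytes); // stamp the canary
      return p;
    }

    void checked_free(void* ptr, std::size_t n) {
      unsigned char* p = static_cast<unsigned char*>(ptr);
      for (std::size_t i = 0; i < kCanaryBytes; ++i)
        assert(p[n + i] == kPattern && "overrun: bytes past the allocation were modified");
      std::free(p);
    }

As the top comment notes, though, a canary like this only catches overwrites; a pure overread leaves the pattern intact, which is exactly why the original bug was so hard to see.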

There are much fancier ones too, which make clever use of hardware features like Intel MPX. It's a whole research field.


Find a compiler vendor who offers stronger guarantees than the bare minimum that's in the standard, and run the test suite with their compiler. Undefined behaviour was supposed to offer compiler vendors space to improve, not to be the limit of what was implemented.


Actually it's more likely than not that your vendor already offers those guarantees as a compiler mode. In clang and gcc it's the sanitizers (ubsan, asan, tsan and msan) while in MSVC it's RTC and SDL.

These build modes catch the vast majority of undefined behavior, and with relatively low performance overhead. If you're not running your tests with sanitizers enabled, you're missing out on a nearly free way to locate and eliminate an entire class of bugs.
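
For example, here's roughly the kind of one-byte overread that a build with -fsanitize=address turns from a silent "works on my machine" into an immediate heap-buffer-overflow report (the snippet itself is just an illustration):

    #include <cstdlib>
    #include <cstring>

    int main() {
      char* buf = static_cast<char*>(std::malloc(8));
      std::memset(buf, 0, 8);
      char past_end = buf[8];  // one byte past the allocation: UB, caught by ASan
      std::free(buf);
      return past_end;
    }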


If the uint64s are aligned, wouldn't it be better to use alignas (to make the buffer aligned as soon as it's allocated)? As far as I see that way you get a longer buffer only when necessary (you never waste memory), plus you could get autovectorized code if the compiler decides that makes sense.


The buffer is allocated in general library code, unfortunately. It's not something I can easily adjust.


If you mean malloc() by that: malloc will return aligned memory so that you can store all data types, i.e. it is 64-bit aligned at least (so you can store double) but probably has even higher granularity.

It's the downside of malloc not knowing the type you want and returning void*. It has to assume the worst alignment requirements.

Which actually happens to be 16 bytes (128 bits) for SSE instructions.
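
The guarantee being discussed, expressed in code (the printed value is typically 16 on mainstream 64-bit platforms; anything stricter, e.g. 32-byte AVX loads, needs something explicit like std::aligned_alloc or aligned operator new):

    #include <cstddef>
    #include <cstdio>

    int main() {
      // malloc must return memory suitably aligned for any fundamental type,
      // i.e. at least alignof(std::max_align_t).
      std::printf("malloc alignment guarantee: %zu bytes\n", alignof(std::max_align_t));
      return 0;
    }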


Is this guaranteed by the standard? "Worst" seems platform-dependent; does that mean it will vary with platform? What about x86 vs amd64? Could it ever be more than 16 bytes?


I don't mean malloc. I mean that the memory is allocated inside the database SDK we are using.


Android's bionic libc strlen function does a similar read-beyond-the-buffer thing: https://cs.android.com/android/platform/superproject/+/maste...

I first learned about it when valgrind complained at every place I was calling strlen.


Same sort of thing with glibc's strcspn: https://stackoverflow.com/q/56341628/51685


Two questions.

uint64_t is 8 bytes, not 16. Why are you overreading up to 15?

How did you conclude that the test hurt performance? I have researched this for over a decade and my conclusion is that the bounds check is for all intents and purposes free, because it is hidden in the latency of the actual memory access (even if it comes from L1 cache). The only way the bounds check can ever hurt performance is if it adds additional memory reads.

That would indicate that you don't have a memory access for the upper bound at all currently, which sounds like you could overread an arbitrary amount, not just up to 15 bytes.


>> The outer code knows how many entries to look for, so it just calls the advance function the correct number of times,

Can't you call it n-1 times and then handle the last 0-16 bytes a little slower?


The same is done for decoding camera raw formats. Variable encoding of the number of bits in the stream means you never know when the decode will end. Reading all the file into a buffer longer than needed is fast, whereas the bit pump is performance critical and having to check for end of buffer every time you want to read would be painful.


Saddest thing is we're reading these values from a database, and the database SDK does not have a mechanism for reading values into user-controlled buffers without copying. That is a conceptually neat solution to the problem, but since it's a somewhat strange thing to want to do, we can't do it in our case. Storing the extended values is something I also considered, but there are tens or hundreds of billions of them in production. Couldn't really ask for an extra PB of space for the relatively small performance and elegance benefit you get from not copying these things.


Extra copying is the downside if you don't control what you're reading from. Thankfully memcpy is fast, but if your input is very big it's still painful. In Rust I just used an API that allows reading from both an input stream and memory, so that if you are reading directly from a file the copy is avoided:

https://github.com/pedrocr/rawloader/blob/c793a132fa03e336f8...


Is the extra copy even performed? It either has to be a small thing, or processed in chunks, or OOM might be a problem. So it's small; in that case, will the compiler really copy, or might it just optimize the operation away?


If you feed it an N byte array it will have to allocate an N+16 array and memcpy. If you're reading from the network/disk/etc then there is no need to copy if you can allocate the larger array directly.


If you want to use this fast decoder, it is sometimes better to have it take a different type than a plain array as input, even if the only difference is that arrays of this type have a 15-byte dead space at the end.

Such behavior is not crazy, but the decoder then isn't taking a normal array at that point, and describing it as such can result in other programmers providing an array that does not meet the expectations of this decoding function.

Edit: I will say that if you have length information, you should be able to use the fast method on all but the last chunk without checking on every data chunk.
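
A sketch of that "different input type" idea (names are illustrative): a thin wrapper that can only be constructed with the dead space in place, so callers can't accidentally hand the fast decoder a plain, unpadded array.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    class PaddedInput {
     public:
      static constexpr std::size_t kSlack = 15;   // dead space the fast decoder may overread

      PaddedInput(const std::uint8_t* data, std::size_t len)
          : bytes_(len + kSlack, 0), size_(len) {
        std::memcpy(bytes_.data(), data, len);
      }

      const std::uint8_t* data() const { return bytes_.data(); }
      std::size_t size() const { return size_; }  // logical size, excluding the slack

     private:
      std::vector<std::uint8_t> bytes_;
      std::size_t size_;
    };

    // The fast decoder then takes a PaddedInput rather than a raw pointer/length
    // pair, e.g. void FastDecode(const PaddedInput& in, ...);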


We ran into the same problem, only in the AMD GPU driver.

It would very occasionally crash during a draw that reads the last few bytes of a vertex buffer on AMD cards only.

Very annoying to find because it only triggered rarely.

Our solution was the same, just allocate vertex buffers a little longer on AMD cards and don’t use the extra space.


There’s a few nuances like this with OpenGL and Vulkan that you’ll find in the specs. Normally you don’t encounter them because the compiler by default aligns to certain boundaries. Then someone tries to get clever and uses a custom allocator or a packing directive on a struct, and suddenly your vertex buffer isn’t aligned any more.

What I’m getting at is that it likely wasn’t a driver bug. The other drivers were just more tolerant of the incorrect behavior.


And, of course, you carefully documented the reason for the larger buffer and encoded the value into a constant or other documentable source. One hopes you didn't hardcode '5 bytes' with a comment that says 'more bytes safer' and nothing else :-).


Looks suspiciously like an alignment issue, with the driver expecting something somewhere to be aligned to 4/8 bytes or be a multiple of a dword/qword?


How about reading N-15 fast (assuming varints can be as small as a byte) and the rest with bounds-checking? Is N generally too small for that to be useful?

You might also be able to heuristically compare buffer address modulo the page size to N to know whether you need to be safe or not.
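
A sketch of that page heuristic (assuming 4 KiB pages; note it only avoids the fault, the overread is still technically UB): the fast path can't fault whenever the 15 bytes after the last valid byte land on the same page as that byte.

    #include <cstddef>
    #include <cstdint>

    constexpr std::uintptr_t kPageSize = 4096;  // assumed page size
    constexpr std::uintptr_t kOverread = 15;    // bytes the fast path may read past the end

    bool fast_path_wont_fault(const std::uint8_t* buf, std::size_t len) {
      if (len == 0) return true;  // nothing to decode, nothing to read
      const std::uintptr_t last = reinterpret_cast<std::uintptr_t>(buf) + len - 1;
      return last / kPageSize == (last + kOverread) / kPageSize;
    }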


That's kind of spooky; it sounds like we work for the same company, because we have a very similar thing. I am pretty sure that no one at our company knows its details as well as you just described yours...

Maybe someone should open source an efficient yet safe varint reader?


As far as I could tell the problem was somewhat inherent in the encoding. One thing that could have been done is something like

    for (int i = 0; i < n - 15; ++i) {
      dec.FastUnsafeRead(result);
    }
    for (int j = n - 15; j < n; ++j) {
      dec.SlowSafeRead(result);
    }


Do we know how much performance gain we experience daily in our OS code, applications, and browsing thanks to these kinds of optimisations? In other words, what would our daily computer use look like if we stuck to not having these kinds of optimisations, e.g. if all code used Rust instead of C?


This optimization has nothing to do with Rust or C and everything to do with processors and memory.


Maybe in this case. But how many cases are there where we do the thing in C that is fast “because we know it is safe” even though it might not be, and that would be prevented by a higher-level language without direct memory management? The question stands: do we know how much day-to-day benefit we get from these optimisations through the stack we use daily? What would our daily experience look like if we didn't apply these optimisations? Would it still be workable?


Going past the end of a buffer is illegal in C, just as it is in many other languages. The correct way to write these is in architecture-specific assembly.

Without these kinds of overruns many vectorized algorithms would not be possible, so you'd probably see a noticeable performance decrease.


Not uncommon for things like CABAC or CAVLC, or really anything with a bit buffer. Your memory validation tools should notice if you don’t purposefully over-allocate and initialize the final bit of memory to account for this. That should be all that is necessary.


A strategy I've seen in similar scenarios is to apply a kind of inverse-ordered read to the last chunk. Instead of reading bytes x through x+n (your actual data) and x+n+1 through x+15 (outside the edge of the buffer) you instead read bytes x+n-15 through x+n and then apply whatever logic you're doing to mask or exclude the unimportant bytes to the beginning of that chunk rather than the end.

Note: you still have to be careful if your entire input is less than 16 bytes, there can be other alignment issues, etc..., but it _might_ be better in your situation than a similar fix that other comments have mentioned which applies the fast read to all but the last partial chunk and then finishes the tail with some slow cleanup code.
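
A minimal sketch of that backwards-shifted final read (it assumes the whole input is at least 16 bytes, as the caveat above warns; consume_bytes is a stand-in for whatever the real decode step is):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Stand-in for the real per-chunk decode step.
    void consume_bytes(const std::uint8_t* bytes, std::size_t n) { /* decode here */ }

    void read_in_chunks(const std::uint8_t* buf, std::size_t len) {
      std::size_t pos = 0;
      std::uint8_t chunk[16];
      while (pos + 16 <= len) {               // full chunks, entirely inside the buffer
        std::memcpy(chunk, buf + pos, 16);
        consume_bytes(chunk, 16);
        pos += 16;
      }
      if (pos < len) {                        // tail: re-read the last 16 bytes of the buffer
        std::memcpy(chunk, buf + len - 16, 16);
        const std::size_t overlap = 16 - (len - pos);  // bytes already consumed above
        consume_bytes(chunk + overlap, len - pos);     // skip past the overlap
      }
    }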


> It doesn't matter that these bytes are filled with garbage,

That's only true if the strategy avoids hitting an invalid page.

Reading a single uint64_t which is aligned on an 8 byte boundary will be fine. If any byte of that is within a valid page, the other bytes have to be.

If two uint64_t's are read, then as a whole, they have to be a 16-byte-aligned block for that same assurance to hold.

If the code is using misaligned reads, that itself will crash on some processor architectures, or possibly trigger expensive handling.
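
The page argument above, spelled out in a few lines (4096 is an assumed page size; the point is just that 8 divides 4096, so an aligned 8-byte read can never straddle two pages):

    #include <cassert>
    #include <cstdint>

    bool same_page(std::uintptr_t a, std::uintptr_t b) {
      const std::uintptr_t kPage = 4096;
      return a / kPage == b / kPage;
    }

    int main() {
      for (std::uintptr_t addr = 0; addr < 4 * 4096; addr += 8)  // every 8-byte-aligned slot
        assert(same_page(addr, addr + 7));                       // the read stays in one page
      return 0;
    }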


Did you actually read the comment you replied to to the end?



