But in what realistic scenario would a user be able to put in ≥2^31 while not being able to put in 2^32-1 (which probably breaks many more things from innocent increments or similar) or 2^100?
And further, this all assumes they used int rather than long. Code can be “wrong” in that it only works for arrays of under 2^63 elements, without that ever being a real possibility.
Production code is often filled with edge-case bugs that simply never come up. Working for 100x the expected use case is generally good enough; if you’re approaching the limits where 2^31 is a potential issue, then you are also approaching the case where 2^32 definitely will be.
Which isn’t a problem, as exploiting most software is meaningless. Wow, someone can hack software on a machine they already have root access to; whatever will we do.
The actual problem is management and developers not treating software that is meaningful for someone to hack as a different category.
So a coworker sends me this CAD file via email, I open it, my computer gets controlled by a botnet, and the file immediately sends itself to all my Outlook contacts.
Nah, that sounds impossible. I'm sure it has never ever happened.
Opening 3rd party files is one of those risks I was just talking about.
There’s a user moving around in a game, and there’s opening a workplace file; these are inherently different kinds of risks. If you’re building a Flappy Bird clone, you can know exactly how it’s going to interact with the world.
2^32-1 is almost always a possibility when 2^32 is, but there are many cases where those are possible but 2^100 is not. Basically anything where the value is a count of something rather than raw input fits the bill. How many characters, lines, or files do you support? 2^32 is a totally feasible number in many contexts. 2^100 is physically impossible, there isn’t enough matter.
If you accept 2^32, then code using 32-bit ints is definitely broken on it and thus the OP question of the issue on half that is irrelevant. Which is my point - widening the acceptable input range from 2^31 to 2^32 (or in the case of signed integers, from 2^30 to 2^31; give or take 1 of course) just "fixes" one small case of the actual core issue of nearly any arithmetic anywhere being wrong if you don't explicitly constrain input sizes.
I agree on there not being much difference between 2^30/31/32. But it’s not “nearly any arithmetic.” If your size is an actual data size, then 2^64 is fine.
Right, with 64-bit ints things are a lot nicer. You can still run into some issues on "generate this much data" tasks as opposed to "operate over existing data of this size" ones, though perhaps less exploitably so.
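If the OP issue is the classic binary-search midpoint overflow (a guess on my part, going by the 2^30/2^31 numbers above), its shape is roughly:

    /* Each index fits in a 32-bit int, but their sum can exceed INT_MAX once
       the array length gets past about 2^30, so the division sees garbage. */
    int mid_bad(int lo, int hi) { return (lo + hi) / 2; }       /* can overflow */
    int mid_ok (int lo, int hi) { return lo + (hi - lo) / 2; }  /* stays in range */

i.e. the bug only matters at sizes that 32-bit counts can barely represent at all, which is the point being made.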
That's not valid standard C; gcc and clang give a warning with '-pedantic'. It's valid C++ though.
And IMO it's quite a nice feature, useful sometimes for reducing boilerplate in early returns. It's the obvious consequence if you don't treat void as some extremely special syntax but rather as just another type, perhaps akin to an empty struct (though that's not valid C either ¯\_(ツ)_/¯), that's just implicitly returned at the end of a void-returning function, with a "return;" statement implicitly "creating" a value of void.
In fact in Rust (and probably a bunch of other languages that I'm too lazy to remember) void-returning functions are done via returning a 0-tuple.
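If I'm reading the thread right, the construct in question is returning a void expression from a void function (my guess, but it matches the "return; creates a value of void" framing above):

    void log_and_bail(const char *msg) { (void)msg; /* hypothetical helper */ }

    void process(int *p, const char *msg) {
        if (!p)
            return log_and_bail(msg);  /* fine in C++; a constraint violation in
                                          standard C, gcc/clang warn with -pedantic */
        *p += 1;
    }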
Basically dead. The main motivation would be to make it easier to use variably modified types in function parameters, where the (length) identifier is declared after the variably modified type, as in
> void foo(int a[m][m], int m)
Currently you can only do:
> void foo(int m, int a[m][m])
The holy grail is being able to update the prototypes of functions like snprintf to something like:
> int snprintf(char buf[bufsiz], size_t bufsiz, const char *, ...);
However, array pointer decay means that foo above is actually:
> void foo(int (*a)[m], int m)
Likewise, the snprintf example above would be little different than the current definition.
There's related syntax, like
> foo (int m, int a[static m])
But a is still just a pointer, and while it can help some static analyzers to detect mismatched buffer size arguments at the call site, the extent of the analysis is very limited as decay semantics effectively prevent tracing the propagation of buffer sizes across call chains, even statically.
There's no active proposal at the moment to make it possible to pass VM arrays (or rather, array references) directly to functions--you can only pass pointers to VM array types. That actually works (sizeof *a == sizeof (int) * m when declaring int (*a)[m] in the prototype), but the code in the function body becomes very stilted with all the syntactical dereferencing--and it's just syntactical, as the same code is generated for a function parameter of `int (*a)[m]` as for `int *a` (underneath it's the same pointer value rather than an extra level of memory indirection).

There are older proposals, but they all lost steam because there aren't any existing implementation examples in any major production C compilers. Without that ability, the value of forward declarations is greatly diminished.

Because passing VM array types to functions already requires significant refactoring, most of the WG14 felt it wasn't worth the risk of adopting GCC's syntax when everybody could (and should?) just start declaring size parameters before their respective buffer parameters in new code.
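To make the "syntactical dereferencing" point concrete, a minimal sketch (my own, not from any proposal) of passing a pointer to a VM array type:

    #include <stdio.h>

    /* the length m still has to come first, since it's used in the parameter's type */
    static void fill(int m, int (*a)[m]) {
        /* sizeof *a == sizeof(int) * m, so the size travels with the type... */
        for (int i = 0; i < m; i++)
            (*a)[i] = i;              /* ...but every access needs the explicit *a */
    }

    int main(void) {
        int buf[8];
        fill(8, &buf);
        printf("%d\n", buf[7]);
        return 0;
    }

And as noted, the generated code is the same as if the parameter were a plain int *; only the type (and thus what sizeof and static analysis can see) changes.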
I hope it is not "basically" dead. I just resubmitted it at the request of several people.
And yes, for new APIs you could just change the order, but it also helps with legacy APIs. It does even when not using pointers to arrays: https://godbolt.org/z/TM5Mn95qK (I agree that new APIs should pass a pointer to a VLA).
(edited because I am agreeing with most of what you said)
Note that GCC does (sometimes) detect misuse of the "int a[static 3]" case, but maybe only when the length is a compile-time constant, and possibly only with char arrays.
$ make texe
cc -g -O2 -std=c11 -Wall -Wextra -Wpedantic -Werror -c -o test.o test.c
test.c: In function ‘do_test_formatSmallElem’:
test.c:108:9: error: ‘matSmallElemFormat’ accessing 8 bytes in a region of size 2 [-Werror=stringop-overflow=]
108 | matSmallElemFormat(elem, buffer);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test.c:108:9: note: referencing argument 2 of type ‘char *’
In file included from test.c:8:
mat/display.h:17:6: note: in a call to function ‘matSmallElemFormat’
17 | void matSmallElemFormat(mElem elem, char buffer[static matSmallElemLen]);
| ^~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make: *** [<builtin>: test.o] Error 1
> everybody could (and should?) just start declaring size parameters before their respective buffer parameters in new code
I know that was a common opinion pre-C23, but it feels like the committee trying to reshape the world to their desires (and their designs). It's a longstanding convention that C APIs accept (address, length) pairs in that order. So changing that will already get you a score of -4 on the Hard to Misuse List[1], for "Follow common convention and you'll get it wrong". (The sole old exception in the standard is the signature of main(), but that's somewhat vindicated by the fact that nobody really needs to call main(); there is a new exception in the standard in the form of Meneide's conversion APIs[2], which I seriously dislike for that reason.)
The reason I was asking is that 'uecker said it was requested at the committee draft stage for C23 by some of the national standards orgs. That's already ancient history of course, but I hoped the idea itself was still alive, specifically because I don't want to end up in the world where half of C APIs are (address, length) and half are (length, address), when the former is one of the few C conventions most everyone agrees on currently.
A latency improvement would be to have digitCountThresholds be indexed by the lzcnt, instead of by the result of the other LUT. That increases the size of that lookup table from 160B to 512B though. Funnily enough, I had this approach in a local copy of Ryu when I was working on cutting it down for my purposes; unfortunately whatever ended up public had this cut out too.
edit: some chat messages of mine at [0]; some previous unrelated discussion at [1].
Good point. The other option is to use (1233*(64-lzcnt(x)))>>12 to bypass the first lookup entirely (not sure if this is faster or not), which works since 1233/2^12 is close enough to log_10(2).
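A sketch of how that composes with a threshold table (the table name and layout here are my own illustration, not Ryu's actual code):

    #include <stdint.h>

    /* pow10_threshold[d] is the smallest value with d+1 decimal digits */
    static const uint64_t pow10_threshold[20] = {
        1ull, 10ull, 100ull, 1000ull, 10000ull, 100000ull, 1000000ull,
        10000000ull, 100000000ull, 1000000000ull, 10000000000ull,
        100000000000ull, 1000000000000ull, 10000000000000ull,
        100000000000000ull, 1000000000000000ull, 10000000000000000ull,
        100000000000000000ull, 1000000000000000000ull, 10000000000000000000ull
    };

    /* decimal digit count of a nonzero x (x|1 just avoids clz(0) being UB) */
    static int digit_count(uint64_t x) {
        int bits   = 64 - __builtin_clzll(x | 1);
        int approx = (1233 * bits) >> 12;          /* ~= bits * log10(2) */
        return approx + (x >= pow10_threshold[approx]);
    }

Presumably the 160B vs 512B numbers above are exactly this 20-entry table versus a 64-entry one indexed directly by the bit count.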
Multiplication typically takes 3 cycles, so a total of 1+3+1=5 cycles of latency. A load from L1 takes 4-5 cycles on non-ancient x86 and Apple M1. So roughly on par, perhaps a bit faster, at the cost of eating some ILP in the odd case that there's not still plenty left (it's also better if the table isn't already in cache, though only significantly so if you were using the LUT result, rather than the lzcnt directly, as the threshold-table index).
Perhaps not - as it happens at the rename stage and thus only with immediate values, it's all fixed behavior for a given instruction stream; with branching you can of course get dynamically different renamed immediates, but, as that's branchy code, it's not time-constant anyway.
You don't quite need a full 64-bit adder to materialize the proper value, as one argument is only a 10-bit int. So a 10-bit adder, a 54-bit increment, and muxing it with the original depending on the add carry.
And, presumably, the OP shift case here is in fact a case of there not being a built-in immediate adder and thus a need for fixup uops being inserted to materialize it?
Right. Actually it turns out it's 11 bits, since [-1024, 1023] are all supported by the immediate add renamer.
In general I think people are overstating the delay of an additional 64-bit add on register file reads (though I'm not a hardware guy, so maybe someone can correct me). There are carry-lookahead adders with log_2(n) gate delays, i.e. 6 for 64 bits. Carry-save adders might also be relevant to how they can do multiple dependent adds with 0c latency.
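If I'm picturing the split right, a C-level model of that materialization (purely my own illustration, using the 11-bit [-1024, 1023] range mentioned above):

    #include <stdint.h>

    /* base + imm built from an 11-bit add plus selecting/incrementing/decrementing
       the upper 53 bits on the carry, instead of a full 64-bit adder */
    static uint64_t add_small_imm(uint64_t base, int32_t imm /* in [-1024, 1023] */) {
        uint64_t lo = (base & 0x7FF) + (((uint64_t)(int64_t)imm) & 0x7FF);
        uint64_t cy = lo >> 11;                  /* carry out of the 11-bit add */
        uint64_t hi = base >> 11;
        hi = imm >= 0 ? hi + cy                  /* keep or increment */
                      : hi - (1 - cy);           /* negative imm: keep or decrement */
        return (hi << 11) | (lo & 0x7FF);
    }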
> And, presumably, the OP shift case here is in fact a case of there not being a built-in immediate adder and thus a need for fixup uops being inserted to materialize it?
No, the perf counters show 1 uop dispatched/retired in both the slow and fast cases.
Strict aliasing in particular is very dependent on coding style; I too just turn it off with essentially no harm, but certain styles of code that don't involve carefully manually CSEing everything might benefit a good bit (though with it being purely type-based imo it's not that useful).
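For anyone unfamiliar, the classic shape of code that strict aliasing breaks (my own example, nothing from the thread):

    #include <stdint.h>
    #include <string.h>

    /* UB under strict aliasing: a float object is read through a uint32_t lvalue,
       so the optimizer may assume the accesses don't alias and reorder/elide them */
    static uint32_t bits_ub(float f) { return *(uint32_t *)&f; }

    /* well-defined, and compiles to the same single move on typical targets */
    static uint32_t bits_ok(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }

-fno-strict-aliasing is the switch that makes the first form behave as naively expected.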
Outside of freestanding/kernel scenarios, null treatment shouldn't affect anything, and again can similarly benefit probably a decent amount of code by removing what is unquestionably just dead code. There's the question of pointer addition resulting in/having an argument of null being UB, which is rather bad for offsetof shenanigans or misusing pointers as arbitrary data, but that's a question of provenance, not anything about null.
Global integer overflow UB is probably the most questionable thing in that list, and I'd quite prefer having separate wrap-on-overflow and UB-on-overflow types (allowing an unsigned UB-on-overflow too). That said, even having had multiple cases of code that could hit integer overflow and thus UB, I don't recall any of them actually resulting in the compiler breaking things (granted, I know not to write a+1<a and similar for signed ints), whereas I have had a case where the compiler "fixed" the code, turning an `a - b < 0` into the more correct (at least in the scenario in question) `a < b`!
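A small illustration (mine) of that last transformation:

    /* UB whenever a - b doesn't fit in int (e.g. a = INT_MIN, b = 1), which is
       exactly what licenses the compiler to rewrite it... */
    static int below_sub(int a, int b) { return a - b < 0; }

    /* ...into this always-well-defined form, which here was also what was meant */
    static int below_lt(int a, int b) { return a < b; }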
I do think that it would make sense to have an easy uniform way to choose some specific behavior for most kinds of UB.
While not allowing stepping in the kernel, a large part of rr is indeed intercepting all the things the kernel may do and re-implementing their actions, writing down all the changes to memory etc. that they make (for Linux, not Windows, of course). With that, the kernel doing an asynchronous write would necessarily end up as part of the recording, stating what the kernel writes at the given point in time, which a debugger could then deterministically reason about. (Of course this relies on the recording system reimplementing things accurately enough, but that's at least possible.)
You are correct. A time travel debugging solution that supports recording the relevant system call side effects would handle this. In fact, this system call is likely just rewriting the program counter register and maybe a few others, so it would likely be very easy to support if you could hook the relevant kernel operations, which may or may not be possible on Windows.
The replay system would also be unlikely to pose a problem. Replay systems usually just encode and replay the side effects, so there is no need to "reimplement" the operations. So, if you did some wacky system call, but all it did was write 0x2 to a memory location M, you effectively just record "at time T we issued a system call that wrote 0x2 to M". Then, when you get to simulated time T in the replay, you do not reissue the wacky system call; you just write 0x2 to M and call it a day.
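In code, the shape of that record is roughly (a toy sketch of the idea as described, not any real recorder's format):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* one recorded kernel side effect: "at time T, the bytes at addr became data" */
    struct side_effect {
        uint64_t time;       /* logical event counter during recording */
        void    *addr;       /* the memory location M */
        uint8_t  data[16];   /* the bytes the kernel wrote (0x2 in the example) */
        size_t   len;
    };

    /* during replay: don't reissue the syscall, just apply the recorded write */
    static void replay_side_effect(const struct side_effect *e) {
        memcpy(e->addr, e->data, e->len);
    }

The asynchronous case discussed below is hard precisely because the recorder has to notice that such an entry needs to be created at all, and when.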
This system call returned and then asynchronously wrote to memory some time later. How does the replay system even know the write happened, without scanning all memory? It can't generally. With knowledge of the specific call it could put just that address on a to-be-scanned list to wait for completion, but it still needs to periodically poll the memory. It is far more complicated to record than a synchronous syscall.
You hook the kernel write. That is why I said hook the relevant kernel operations.
The primary complexity is actually in creating a consistent timeline with respect to parallel asynchronous writes. Record-Replay systems like rr usually just serialize multithreaded execution during recording to avoid such problems. You could also do so by just serializing the executing thread and the parallel asynchronous write by stopping execution of the thread while the write occurs.
Again, not really sure if that would be possible in Windows, but there is nothing particularly mechanically hard about doing this. It is just a question of whether it matches the abstractions and hooks Windows uses and supports.
I don't think rr hooks actual kernel writes, but rather just has hard-coded information on each syscall of how to compute what regions of memory it may modify, reading those on recording and writing on replay.
As such, for an asynchronous kernel write you'd want to set up the kernel to never mutate recordee memory, instead having it modify recorder-local memory, which the recorder can then copy over to the real process whenever, and get to record when that happens while at it (see https://github.com/rr-debugger/rr/issues/2613). But that can introduce large delays, thereby changing execution characteristics (if not making things appear to happen in a different order than the kernel would, if done improperly). And you still need the recording system to have accurately implemented the forwarding of whatever edge case of the asynchronous operation you hit.
And, if done as just that, you'd still hit the problem encountered in the article of it looking like unrelated code changes the memory (whereas with synchronous syscalls you'd at least see the mutation happening on a syscall instruction). So you'd want some extra injected recordee instruction(s) to keep recordee actions visibly separate from asynchronous kernel ones. As a sibling comment notes, rr as-is doesn't handle any asynchronous kernel write cases (though it's certainly not entirely impossible to).