

Wade not in unknown C++ waters. Shift operators - AndreyKarpov
http://www.viva64.com/en/b/0142/print/

======
Jabbles
This article makes little mention of the fact that shifting a positive, signed
integer can result in undefined behaviour if the shift would modify the sign
bit. I find that behaviour far more alarming than shifting negative numbers.
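
For example (a minimal sketch, assuming a 32-bit int and compiling as C, where
the standard leaves this shift undefined because 1 * 2^31 is not representable
in int):

    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        int x = 1;
        /* Shifts a 1 into the sign bit. C99/C11 leave this undefined;
           many compilers happen to produce INT_MIN, but nothing is
           guaranteed. Clang/GCC's -fsanitize=undefined flags it. */
        int y = x << 31;
        printf("%d (INT_MIN is %d)\n", y, INT_MIN);
        return 0;
    }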

An interesting paper on a related subject:

https://wiki.engr.illinois.edu/download/attachments/158203905/2012-ICSE-Numerical.pdf?version=1&modificationDate=1330923113000

To entice you to read it, a quote:

 _To find the time bombs, we altered IOC’s overflow handler to return a random
value from any integer operation whose behavior is undefined by the C or C++
standard. This creates a high probability that the application will break in
an observable way if its execution actually depends on the results of an
undefined operation. Perhaps amusingly, when operating in this mode, IOC is
still a standards-conforming C or C++ compiler—the standard places no
requirements on what happens to a program following the execution of an
operation with undefined behavior._

~~~
pmjordan
C and its descendants have some really, _really_ scary integer
overflow/underflow and type promotion behaviour. As a lot of my code lately
runs in the kernel, I'm now extremely paranoid about overflow/underflow bugs,
and it's amazing how much innocuous-looking code can be a security issue. It
surprises me that even newer languages like Go don't really tackle this
problem: signed integer overflow is at least not left undefined there, but
wrap-around semantics are just as good at hiding bugs.
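
To illustrate the kind of innocuous-looking code I mean, here's a sketch of
the classic allocation-size overflow (make_items is a hypothetical function;
the wrap requires an attacker-controlled count):

    #include <stdlib.h>
    #include <string.h>

    struct item { char name[64]; };

    /* Looks harmless: allocate room for 'count' items, then fill them in.
       But if 'count' is attacker-controlled, 'count * sizeof(struct item)'
       can wrap around, so malloc returns a tiny buffer and the loop writes
       far past its end -- a textbook heap overflow. */
    struct item *make_items(size_t count, const char *name) {
        struct item *items = malloc(count * sizeof(struct item)); /* may wrap */
        if (items == NULL)
            return NULL;
        for (size_t i = 0; i < count; i++) {
            strncpy(items[i].name, name, sizeof(items[i].name) - 1);
            items[i].name[sizeof(items[i].name) - 1] = '\0';
        }
        return items;
    }

The conventional fix is to reject count > SIZE_MAX / sizeof(struct item)
before multiplying, or to use calloc, which a conforming implementation must
perform that multiplication check for.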

I don't know exactly what the best solution is, but I'd probably default to
arbitrary-precision integers which have a highly efficient implementation for
native word-sized values and a safe, non-overflowing fallback that can still
be made pretty fast. Repeated growth checks could be eliminated in many
places by basic static analysis. If you find your code would be faster
without the safety, you can always selectively switch to unsafe semantics.
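
As a sketch of that fast-native-path-plus-fallback idea (using GCC/Clang's
__builtin_add_overflow; grow_to_bignum is a hypothetical stand-in for the
arbitrary-precision fallback):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical slow path: promote the operands to an arbitrary-
       precision representation. Stubbed out for this sketch. */
    static int64_t grow_to_bignum(int64_t a, int64_t b) {
        fprintf(stderr, "overflow: would fall back to bignum here\n");
        return 0;
    }

    /* Fast path: one native add plus a flag check. On x86 the builtin
       compiles to an add followed by a jump-on-overflow, so the common
       case costs almost nothing. */
    static int64_t checked_add(int64_t a, int64_t b) {
        int64_t sum;
        if (!__builtin_add_overflow(a, b, &sum))
            return sum;              /* common case: word-sized add */
        return grow_to_bignum(a, b); /* rare case: safe fallback */
    }

    int main(void) {
        printf("%lld\n", (long long)checked_add(INT64_MAX, 1));
        return 0;
    }

The static analysis would then be about proving the flag check dead so it can
be dropped entirely on the hot paths.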

Basically, I think the current situation of overflow-by-default is a premature
optimisation that's causing far too many bugs.

~~~
Jabbles
I don't have any data on the performance costs of using arbitrary-precision
integers in all code, but I'd like to see some if you know of any studies. I
would have thought the overhead could be quite significant, as you would have
to keep a length field for every single integer and check it on every
function call...

The use of 64-bit integers may significantly reduce the possible places for
wrap-around to occur, but I think we (C, C++, Go, etc. developers) have to
develop a sense of code smell for overflow situations, so that "innocuous-
looking code" stands out. That, or static analysis.

~~~
pmjordan
64-bit integers don't really solve these bugs; they just make them slightly
less probable. They basically don't help against malicious input in the
general case.

The most famous example for this kind of bug is probably the binary search
implementation in Java:

http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html

The interesting thing about this bug, and many like it, is that you don't
need _real_ arbitrary-precision arithmetic to avoid it. The result of the
expression (a + b) / 2 can be proven to fit safely into 32 bits if a and b
are known to fit into 32 bits as well - the compiler just needs to store the
result of the intermediate expression (a + b) in 33 bits, or to transform it
into a mathematically equivalent expression that has the same properties as
the original but with a 33-bit intermediate.
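
Concretely, the same shape as the Java bug, sketched in C (assuming
0 <= low <= high):

    /* Buggy midpoint: when low and high are both large, low + high
       exceeds INT_MAX; for signed int in C that overflow is undefined
       (in Java it wraps to a negative number -- the actual bug). */
    int mid_buggy(int low, int high) {
        return (low + high) / 2;
    }

    /* Equivalent expression with a safe intermediate: high - low always
       fits in an int, so nothing overflows. */
    int mid_safe(int low, int high) {
        return low + (high - low) / 2;
    }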

Of course, this can only be applied to a subset of algorithms - I suspect it
quickly devolves into the halting problem in general. But maybe new languages
could specify that such halting-problem cases must be resolved manually
(explicitly choosing between arbitrary-precision arithmetic and truncation),
while the simpler cases are deduced automatically. I guess that means in
addition to uint32_t and uint64_t we'd probably need more specific types,
such as "positive (nonzero) integer in the range 1 to 1 << 32". That would be
tedious to specify explicitly all the time, so you'd need good inference and
solid support for generic functions.

------
mdkess
Not really profound, but a good, quick way to think about clock cycles: a
1 MHz machine executes 1 clock cycle in 1 microsecond (i.e. 1 MHz is
1,000,000 Hz, and there are 1 million microseconds per second). So a
reasonable machine, say 2 GHz, executes 2,000 cycles per microsecond. So when
the article says 124 clock cycles to execute a shift, that's really saying
that on a 2 GHz machine the instruction will take about 124/2000 microseconds,
which means that 16 shifts or so will eat up an entire microsecond of CPU
time. In low-latency environments, that's bad. (Consider: a top-tier
high-frequency trading firm does about 20 microseconds "round the diamond",
the time from receiving a market snapshot to the time a decision is made and
a trade is executed and put on the wire. Granted, they're not on 2 GHz
machines, but these things add up.)

Also interesting, light travels about 1000 ft in a microsecond.

~~~
jandrewrogers
It is more complicated than this on modern processors, because latency and
throughput are not the same thing. An instruction may require 6 clock cycles
(latency) to complete, but if three such instructions can be in flight at
once, the effective throughput is one operation every 2 clock cycles.

The Intel i7 has 6 execution ports per core, 3 of which can be used for basic
integer operations. A shift executes in a single clock cycle, but since there
are 3 ALUs it is possible to execute 3 shifts per clock cycle.

There is a class of low-latency algorithm optimizations based on making sure
that all execution ports are in use nearly every clock cycle. It is a special
type of fixed, low-level parallelism that most software does not exploit very
well by default. I've seen intentional and careful ALU saturation produce 2x
performance gains for algorithms. The resulting code often does not look very
different, since the change is mostly a slight reorganization of highly
localized dependencies. (This is also how hyperthreading works: execution
ports unused in a clock cycle are given to a second thread.)
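
A hedged sketch of what that reorganization can look like (the two functions
compute different, equally arbitrary checksums; the point is only the
dependency structure, and real gains depend on the microarchitecture):

    #include <stdint.h>
    #include <stddef.h>

    /* Serial version: every iteration depends on the previous h, so the
       loop runs at the latency of that one dependency chain. */
    uint64_t hash_serial(const uint64_t *v, size_t n) {
        uint64_t h = 0;
        for (size_t i = 0; i < n; i++)
            h = (h << 5) ^ v[i];
        return h;
    }

    /* Reorganized version: four independent chains can occupy several
       ALU ports in the same clock cycle and are merged at the end. */
    uint64_t hash_parallel(const uint64_t *v, size_t n) {
        uint64_t h0 = 0, h1 = 0, h2 = 0, h3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            h0 = (h0 << 5) ^ v[i];
            h1 = (h1 << 5) ^ v[i + 1];
            h2 = (h2 << 5) ^ v[i + 2];
            h3 = (h3 << 5) ^ v[i + 3];
        }
        uint64_t h = h0 ^ h1 ^ h2 ^ h3;
        for (; i < n; i++) /* leftover elements */
            h = (h << 5) ^ v[i];
        return h;
    }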

