
How Integers Should Work (In Systems Programming Languages) - rcfox
http://blog.regehr.org/archives/642
======
colanderman
Propagating NaNs?? Gah, at least exceptions provide a backtrace.

What's needed is a dependent type system (like that of ATS or Ada) that allows
the range of allowable values for an integer to be declared. A halfway-smart
compiler can then statically check (using e.g. abstract interpretation) that
the intermediate results of arithmetic performed on such integers do not
exceed the system integer size, and that the results do not exceed the
declared size of the result variable.

(In fact, if the programmer places runtime checks at the right places, such
range declarations aren't even necessary as they can be inferred by the
compiler.)

Such a system would _statically guarantee_ that no overflow will occur. This
means no spurious hard-to-track-down NaN results at runtime.
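
In C terms, the runtime-check variant might look something like this (the
function and its bounds are made up, just to illustrate what a range-aware
compiler could infer):

    #include <stdint.h>
    #include <stdlib.h>
    
    int32_t celsius_to_millikelvin(int32_t celsius)
    {
        if (celsius < 0 || celsius > 1000)
            abort();  /* runtime range check the compiler can learn from */
        /* Worst case past the check: 1000 * 1000 + 273150 = 1273150,
           far below 2^31 - 1, so no intermediate can overflow. */
        return celsius * 1000 + 273150;
    }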

~~~
Locke1689
This is interesting but I've never seen a dependent type system (even Coq)
that can do this completely. For example, let's say I'd like to turn a pointer
into an integer (a reasonably common systems operation) so I define a type
addr bounded between 0 and 2^32-1. If I add 100 to this value I don't see how
I could statically prove that no overflow could occur. For that to work,
wouldn't I need some way of defining the type of my address as 0 to
(2^32-1)-100? If the address is returned from a call to malloc how can I prove
that it not only has type addr, but also type addr_100_restricted at compile
time?

~~~
Peaker
To add 100 to a pointer, you would need a pointer to an allocation of at least
100 bytes.

Thus, your pointer would have a type that provides evidence that it points to
at least 100 bytes. This would make pointer arithmetic adding 100 bytes
possible. Something like:

    addPtr :: Address (size : Nat) -> (x : Fin size) -> Address (size - x)

~~~
Someone
Wouldn't that better be:

    addPtr :: Address (presize : Nat, size : Nat) -> (x : Fin size) -> Address (presize + x, size - x)

presize = number of 'slots' available before the current value.

size = number of 'slots' available after the current value.

That would allow for later subtraction, too.

~~~
Peaker
Yes, indeed :-)

I did not mean to write an actual implementation, just to make a small point.

In an actual implementation, I'd probably want the type to convey information
about the content stored at the address, too.
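
A rough C analogue of that typed pointer would be a "fat pointer" that
carries the evidence at runtime instead of in the type system (the struct
and function names here are made up):

    #include <stddef.h>
    #include <stdlib.h>
    
    struct addr {
        unsigned char *p;
        size_t presize;   /* bytes available before p */
        size_t size;      /* bytes available at and after p */
    };
    
    struct addr add_ptr(struct addr a, size_t x)
    {
        if (x >= a.size)  /* the Fin size obligation, checked at runtime */
            abort();
        return (struct addr){ a.p + x, a.presize + x, a.size - x };
    }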

------
sirclueless
This was posted as a comment on the article, where his comment system mangled
my shifts thinking I was trying to use HTML:

I'm not sure I like allowing invalid intermediate results. It means that there
are many expressions whose correctness is dependent on a compiler, and more
specifically a compilation target. It's not hard to come up with an example of
an expression that is valid, but only calculable on specific architectures.

If none of the examples that Arseniy came up with satisfy you, think of
something absurd like (x << 20000) >> 19999. As it stands, no computer
architecture I know of could calculate the proper result. But with a clever
compiler you can derive an equivalent program that will execute correctly. If
you have the clever compiler and allow invalid intermediate results, your
program will execute correctly. If you have the same constraints and a
slightly dumber compiler, then you will trap.

There are probably examples that are flat out impossible to execute without
trapping on particular architectures, but fine on others. As a result,
architecture and compiler details will start bubbling up and affecting program
correctness. The ways in which this will happen will be far more pernicious
(if less common) than the architecture details program writers must already
consider, such as word length.

~~~
repsilat
In most of the cases where you'd want to use these sorts of integers, you
wouldn't use shift operators - as the author said,

> Of course there are specialized domains like DSP that want
> saturation and crypto codes that want wraparound; there’s
> nothing wrong with having special datatypes to support
> those semantics.

More to the point, I think you'd want to standardise on the optimisations a
compliant compiler can/must implement. Algebraic rearrangements might be the
way to do it, but I'm not 100% sure even they're possible. For the example
given:

    result = result * 10 + cursor - '0';

If `result` is unsigned then I think the compiler has to assume that `cursor`
is at least the character literal for zero. That's obviously true to us,
but I'm not sure that sort of thing could be (reliably) implemented
programmatically, even for simple cases.
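
To make the grouping issue concrete, here is a sketch assuming 32-bit
unsigned arithmetic with trapping (rather than wrapping) semantics; the
values are computed in uint64_t so the offending intermediate can be
inspected:

    #include <stdint.h>
    #include <stdio.h>
    
    int main(void)
    {
        uint64_t result = 429496729;   /* UINT32_MAX / 10 */
        uint64_t cursor = '5';
    
        uint64_t mid = result * 10 + cursor;          /* 4294967343: over 32 bits */
        uint64_t fin = result * 10 + (cursor - '0');  /* 4294967295: fits exactly */
        printf("intermediate %llu, final %llu\n",
               (unsigned long long)mid, (unsigned long long)fin);
        return 0;
    }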

~~~
sirclueless
> More to the point, I think you'd want to standardise on the optimisations a
> compliant compiler can/must implement.

That has two problems.

1. Once you start formalizing optimizations that _must_ occur, you run the
risk that some optimizations will be impossible or have horrible performance
on some compilation targets. For example saying something like "an expression
made up of only 32-bit addition operations must successfully evaluate if the
result fits in a 32-bit integer" might be really painful to implement on an
embedded system with only a few 32-bit registers.

2. Even if the correctness of an expression can be determined statically and
reliably by the language standard, if the correctness of an expression depends
on optimizations performed, then someone reading a program needs to know all
of those optimizations. For example, it is often quite a laborious task to
tell whether a C expression will evaluate correctly, because C has
occasionally bizarre rules for the promotion of the types of operands.
Fortunately there are some decently reliable heuristics that let you assume
that an unsigned char added to an unsigned long will probably promote the
unsigned char to an unsigned long and do the addition. Now imagine that
instead of understanding C's type system, you had to understand the algebraic
reductions available to your compiler. The volume of arcana required to
determine whether an expression will be correctly optimized or not would be
absurd. And heaven forbid you try to modify the expression. You might add a
simple constant somewhere that breaks an optimization and suddenly the
expression fails to evaluate when x is 2^31 - 1 or something.

When you consider these things, hopefully you will see why it is far
preferable to just enforce that all intermediate results must be valid, and
to have the program writer do the algebraic reductions necessary to make (x <<
20000) >> 19999 evaluate with no invalid intermediate values.
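
For instance, a minimal sketch of that manual reduction, assuming the
programmer has already established that the final result fits in the
destination type:

    #include <stdint.h>
    
    /* (x << 20000) >> 19999 reduces algebraically to x << 1; every
       intermediate here is valid on any 64-bit target, whereas the
       original form would need a 20001-bit intermediate. */
    uint64_t shifted(uint64_t x)
    {
        return x << 1;
    }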

------
tobiasu
This should be titled "How Integers Should Work (In CPU architectures)".
Programming languages have little to work with currently, and it is very
inconsistent across architectures. Having an IEEE standard for integer math
like we have for FP numbers could (...) help in the long term.

The only explicit "invalid" value I know of is available for floating point
numbers. NULL in C is usually* not an invalid address, even if user-space
programmers often seem to think so. Rather there are no pages mapped at that
address and the kernel hands this page fault to user-space in some way. There
is plenty of hardware where physical memory starts at 0x0. Lots of fun can be
had debugging such a thing.

* IIRC there was a CPU architecture that interpreted "9" as an invalid address, but I can't remember what it was..?

------
modeless
I like the idea of making INT_MIN == -INT_MAX, but I already think null
pointers are a bad idea, and having an invalid integer value might be even
worse. You'd have similar problems as with floating point NaN, where the
unexpected NaN != NaN case can cause subtle bugs.

In any case, processor vendors aren't about to change the way they do integer
math so this discussion, while quite interesting, is completely academic.

~~~
__david__
It's funny, I think the opposite: I kind of like the idea of an iNaN, but I've
never been bothered by the asymmetry of INT_MIN and INT_MAX. I also don't
think overflow should necessarily cause an iNaN right away. "INT_MAX + 10 -
20" is legit, even though it overflows twice.

> In any case, processor vendors aren't about to change the way they do
> integer math so this discussion, while quite interesting, is completely
> academic.

I thought so too initially, but really every processor already has a status
register that shows you when overflow happened (otherwise big-integer
algorithms would be hard), which is all you really need for this proposal.
But it _does_ require a bunch of tests around any integer arithmetic, which
makes it unlikely to be a good candidate for a C language addition (a language
whose main impetus is speed).
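
A sketch of what those tests look like in software, using the GCC/Clang
checked-arithmetic builtin (which compiles down to an add plus a branch on
the overflow flag); `INT_NAN` here is a made-up sentinel standing in for the
article's iNaN:

    #include <limits.h>
    
    #define INT_NAN INT_MIN   /* hypothetical reserved value */
    
    int nan_add(int a, int b)
    {
        int result;
        if (a == INT_NAN || b == INT_NAN)           /* propagate NaN inputs */
            return INT_NAN;
        if (__builtin_add_overflow(a, b, &result))  /* test the overflow flag */
            return INT_NAN;
        return result;
    }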

~~~
jerf
'"INT_MAX + 10 - 20" is legit, even though it overflows twice.'

In general, maybe so. But in systems programming? What could that possibly
mean that is not a bug? There are answers to that, but since the answers are
going to be <1% corner cases it doesn't justify _defaulting_ to the dangerous
case while requiring one to jump through hoops to get the safe case. If you
desperately need "INT_MAX + 10 - 20", _ask_ for it.

------
ScottBurson
I agree that silent wraparound is evil. Overflow should trap (and throw an
exception if the language has those).

But I don't care for iNaN. I'd rather get the exception at the point the
problem was created.

I'm also not sure what I think about the idea that intermediate results should
be able to be wider than the integer type being computed with. I'd rather be
able to introduce temporaries for subexpressions without changing the meaning
of the program -- this is basic referential transparency.

That said, I'm not completely convinced by the argument that it's too
dangerous for a language to support arbitrary precision by default. Special
types or operators can be used in contexts, like interrupt handlers, where
allocation is undesirable. We have an existence proof that this is workable:
the Lisp Machine operating system.

------
acqq
This is wrong on so many levels it's hard to even start criticizing. TL;DR:
the guy wants more or less the semantics of decimal floating point numbers
(with NaNs and all) in integers, and believes that such "integers" would be
"better", in his words: "to make it difficult for those specialized integers
to flow into allocation functions, array indices, etc. That sort of thing is
the root of a lot of serious bugs in C/C++ applications."

In fact, the integers that we have in current CPUs are the product of, let's
say, at least 50 years of engineering, and they are the optimum, even if
somebody declares them "the root of a lot of serious bugs in C/C++
applications." I've got news for him: a lot of other and older languages have
the same behavior because that's the local optimum for a hardware
implementation, and it will remain so even once we have decimal FP numbers of
the same speed as integers (even if we ignore the fact that we actually
won't). Before two's complement there were other integer variants implemented
in hardware, and they turned out to be less functional than the two's
complement representation.

Then he gives the example that if this code

    result = result * 10 + cursor - '0';

were evaluated as

    result * 10 + (cursor - '0')

there would be no overflow, totally ignoring that languages often have an
exactly defined order of evaluation, and that if the language defines
left-to-right evaluation for + and -, then the only proper evaluation is

    (result * 10 + cursor) - '0'

It's as simple as that.

Then he ends with a loop that wouldn't work with floating point NaNs (and
therefore wouldn't work with NaNs for integers either):

    for (x = INT_MIN; x != iNaN; x++)

because nothing compares equal to NaN, not even NaN itself, so the condition
x != iNaN would stay true and the loop would never end.
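
The same trap already exists with IEEE floats, where NaN compares unequal
even to itself; a minimal demonstration:

    #include <math.h>
    #include <stdio.h>
    
    int main(void)
    {
        double nan = NAN;
        printf("%d\n", nan != nan);   /* prints 1: NaN != NaN is true */
        return 0;
    }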

I'd say everything he wrote is not even wrong.

~~~
angersock
The biggest problem is that the author seems to be completely ignorant
(willfully or otherwise) of the issues involved at the hardware level.

They go so far as to point out that high-level languages shouldn't have these
limitations (sure, agreed), but are then baffled that something like C/C++
does (and should).

C maps more-or-less directly onto assembly, which in turn maps more-or-less
directly onto command bit vectors for an ALU. There was no magic industry
decision to use two's-complement, wrapping arithmetic, or fixed-size integers
(which map onto machine words).

This person needs to spend some quality time with low-level architectural
stuff and gain a better understanding of why things are the way they are.

(not that the points are bad, necessarily, for a DSL or something--but stay
out of our systems languages!)

EDIT: Okay, author is clearly somewhat knowledgeable about the field, but
still... these are interesting ideas that don't really make sense for low-
level stuff. It's not a matter of allocating enough memory (for bignums or
whatever)--it's how the mapping occurs to the actual hardware instructions and
datatypes.

~~~
caf
Modifying a 2s complement ALU so that if the overflow bit is set, then every
bit of the result is set to 1 would be trivial. It's just another n OR gates,
and those are pretty cheap (OK, you'd probably want an AND against a control
bit, too).

The author doesn't ignore that this would require a change at the hardware
level - he says _"Processor architectures will have to add a family of integer
math instructions that properly propagate NaN values..."_ - and he is right
that the change would not be a costly one.
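
A minimal C model of that proposed ALU behavior - the "n OR gates" forcing
every result bit to 1 when the overflow flag is set (the function name is
invented here):

    #include <stdint.h>
    
    uint32_t alu_nan_add(uint32_t a, uint32_t b)
    {
        uint32_t sum = a + b;   /* wraps mod 2^32, like the raw adder */
        /* Signed overflow occurred iff a and b share a sign bit that
           differs from the sign bit of the sum. */
        uint32_t overflow = (~(a ^ b) & (a ^ sum)) >> 31;
        /* OR every result bit against the overflow flag. */
        return sum | (0u - overflow);
    }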

------
ldar15
In Systems Programming Languages? For "writing operating systems"? There are
people who argue we should be writing our operating systems in high level
languages anyway. Assuming the premise is that we should be using a low-level
language, how does the author reconcile that with "but I want the language to
hold my hand when it comes to math"?

Choosing "overflow" or "underflow" to mean "I fucked up" is totally arbitrary.
Variables usually indicate values that have a domain - a range of valid
numbers. Saying "I don't want to think about what that it is, but oh if X hits
2 billion and change then warn me when some math fails" is no better than
having it not fail at all. In most cases there's already a problem.

So, simply, writing "OS quality" code means explicitly checking inputs to
ensure they are in the permissible range. Once you know what the range is, you
know if your code needs to go up to 64-bit math to handle them.
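
A sketch of that check-the-range-then-widen discipline in C (the bounds here
are invented for illustration):

    #include <stdint.h>
    
    /* Validate inputs against their domain, then compute in 64-bit,
       which cannot overflow for any inputs that pass the checks. */
    int compute_area(int32_t w, int32_t h, int64_t *out)
    {
        if (w < 0 || w > 1000000 || h < 0 || h > 1000000)
            return -1;                     /* reject out-of-domain inputs */
        *out = (int64_t)w * (int64_t)h;    /* max 10^12, fits in int64_t */
        return 0;
    }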

UPDATE: Some explanation for the downvote would be appreciated.

~~~
nitrogen
There have been many security holes and crashes caused by undetected integer
overflow. The rationale is that detecting this condition would be a useful
step toward preventing that category of bug.

~~~
ldar15
The rationale, then, is that the compiler should catch mistakes that lead to
security holes. On that basis, we'll be adding GC memory management so we
never access freed memory, plus strongly defined types - e.g. bounded
integers - and bounded arrays too. Would writing the OS in Ada satisfy this
chap?

"catching security holes" is the compiler version of "think of the children".

------
mmphosis
eighteen quintillion four hundred and forty-six quadrillion seven hundred and
forty-four trillion seventy-three billion seven hundred and nine million five
hundred and fifty-one thousand six hundred and sixteen

    18446744073709551616

------
kabdib
Remind me not to hire anyone this guy has taught.

Oh. My. God.

[I'm kidding, of course. I learned a lot from teachers who had obvious axes to
grind . . . just not the lessons they wanted me to come away with. Finals were
an interesting exercise in remembering to cough up the party line, and even
the TAs were snickering in some sessions.]

