
What is gained and lost with 63-bit integers? - lelf
https://blogs.janestreet.com/what-is-gained-and-lost-with-63-bit-integers/
======
DonHopkins
The SPARC has tagged arithmetic (TADDCC), to support Lucid Common Lisp and
Smalltalk.

[http://compilers.iecc.com/comparch/article/91-04-082](http://compilers.iecc.com/comparch/article/91-04-082)

[http://en.wikibooks.org/wiki/SPARC_Assembly/Arithmetic_Instr...](http://en.wikibooks.org/wiki/SPARC_Assembly/Arithmetic_Instructions#Tagged_Instructions)

Back in the late 1980s, a Sun employee jokingly pointed out that because the
DECstation used the MIPS processor in little endian mode (it supported both
big and little endian) to match the VAX, its processor should be called the
SPIM. I asked if the SPARC supported little endian mode, would it be called
the CRAPS? He was flustered, and told me to shut up and never tell anyone that
SPARC spelled backwards was CRAPS.

~~~
ScottBurson
Hi Don! I'm afraid TADDCC has gone the way of the dodo -- the 64-bit SPARCs
don't have it.

~~~
DonHopkins
Which way did the D0D0 go???!

There are much worse ways to go -- like the way of the Oracle, for example. ;(

------
pbsd
If LEA is slow, don't use it. A simple ADD followed by DEC outperforms LEA in
throughput, latency, _and_ size in recent Intel chips.

If you're willing to reduce integers to 62 bits (and assuming no overflows are
allowed), multiplication can also be simplified: (2x+1)(2y+1) = 4xy + 2x + 2y
+ 1. So:

    
    
        t = x + y - 3; // 2*x + 2*y - 1
        p = x * y; // 4*x*y + 2*x + 2*y + 1
        return (p - t) >> 1; // (4*x*y + 2) / 2 = 2*x*y + 1
    

In amd64, this can be done as

    
    
        ; input in rdi, rsi
        lea rax, [rdi + rsi - 3] ; slow, but runs concurrently with imul
        imul rdi, rsi
        sub  rdi, rax
        sar  rdi, 1 ; arithmetic shift for negative products; output

~~~
jzwinck
GCC generates LEA even for a regular addition, even with -O3 and -march=native
on a modern Intel x86-64 system. A simple addition of int64_t gives you:

    
    
            leaq    (%rdi,%rsi), %rax
    

Whereas subtracting the tag gives you:

    
    
            leaq    -1(%rdi,%rsi), %rax
    

So pretty much the same either way, though you do get five instruction bytes
for the tag version vs. four bytes without.

Here are the comparisons for multiply and divide, from "gcc -c -O3 -g
-march=native" and "objdump -d":

    
    
        x * y:
    
          10:   48 89 f8                mov    %rdi,%rax
          13:   48 0f af c6             imul   %rsi,%rax
    
        (x >> 1) * (y - 1) + 1:
    
          40:   48 d1 ff                sar    %rdi
          43:   48 83 ee 01             sub    $0x1,%rsi
          47:   48 89 f8                mov    %rdi,%rax
          4a:   48 0f af c6             imul   %rsi,%rax
          4e:   48 83 c0 01             add    $0x1,%rax
    
        x / y:
    
          20:   48 89 f8                mov    %rdi,%rax
          23:   48 99                   cqto
          25:   48 f7 fe                idiv   %rsi
    
        (((x >> 1) / (y >> 1)) << 1) + 1:
    
          60:   48 89 f8                mov    %rdi,%rax
          63:   48 d1 fe                sar    %rsi
          66:   48 d1 f8                sar    %rax
          69:   48 99                   cqto
          6b:   48 f7 fe                idiv   %rsi
          6e:   48 8d 44 00 01          lea    0x1(%rax,%rax,1),%rax

~~~
pbsd
3-operand LEA has a different performance profile than 2-operand LEA on Sandy
Bridge and later chips. The former has 3 cycle latency and only one execution
port, whereas the latter has single-cycle latency and can be dispatched to two
execution ports. So while gcc (and every other compiler) may rightly generate
an LEA instruction to replace regular addition, it's not optimal when a third
argument is added.

Curiously, if you need to move the result to another register (say, to respect
calling conventions) the optimal solution is a 2-operand LEA plus a DEC:

    
    
      lea rax, [rdi + rsi]
      dec rax
    

I was wrong about ADD + DEC being smaller, though. I forgot that amd64 took
away the 1-byte INC/DEC instructions to serve as REX prefix, so the DEC takes
3 bytes, not 1.

------
boardwaalk
In my toy scheme interpreter I used two bits, and so could store integers,
pointers to C functions, symbols, bools and characters in an "immediate" form.
With the caveat that cells (which contained things that couldn't fit in
immediate form) had to be aligned on a four byte boundary.

Just showing the different things you can do with tagged pointers I suppose.
It's pretty common in Lisp/Scheme runtimes. I love bit twiddling hackery.

[https://github.com/boardwalk/quuz/blob/master/quuz.h#L13](https://github.com/boardwalk/quuz/blob/master/quuz.h#L13)

~~~
infogulch
Wait, with 2 bits you can represent 4 possible values, but you listed 5.
What's up with that?

~~~
gnud
Not the GP, but bools and characters don't need 62 bits for the value. They
can share an identifier, and you can tell them apart by the 3rd bit, for
example.

~~~
infogulch
Ah, yes that makes sense. I didn't think about characters and bools.

------
StefanKarpinski
Can someone explain why a statically typed Hindley-Milner language like OCaml
needs to box everything in the first place?

~~~
emillon
Apart from the GC, parametric polymorphism requires a uniform representation
of values. For example, List.map does not make assumptions about the values in
the list, such as their size.

~~~
StefanKarpinski
That doesn't really explain this since `map` could look at the tag on the
list/array instead of on each individual element.

~~~
emillon
Arrays of doubles are actually special-cased and do exactly this.

------
SeanLuke
The Newton had 31-bit integers for the same exact reason. This resulted in the
biggest Newton bug in its history: the Year 2010 bug. Like early Macs, the
Newton's epoch was January 1, 1904. Times were represented with ints. 1904 +
2^31 seconds = 2010, at which point all sorts of things started failing.

[https://www.google.com/search?q=Newton+year+2010+bug](https://www.google.com/search?q=Newton+year+2010+bug)

~~~
glandium
How can 1904 + 2^31 seconds = 2010 when 1970 + 2^31 seconds = 2038?

~~~
judk
Parent poster mis-recalled details.

The bug involves 30-bit signed ints (29 bits of magnitude) with an epoch of 1993.

~~~
SeanLuke
Gah, indeed I did. The Newton actually has two different epochs, Jan 1 1904
for the OS epoch (same as early Macs) with 32 bits and Jan 1 1993 for the
NewtonScript epoch with 29 bits. Someone's tired.

------
vidarh
Ruby (at least MRI) also uses a similar scheme. This is a fairly common
approach with dynamically typed languages.

~~~
phkahler
>> This is a fairly common approach with dynamically typed languages.

That makes sense. The performance penalty might be incremental compared to the
overhead of dynamic typing.

------
lispm
> Almost every programming language uses 64-bit integers on typical modern
> Intel machines.

That's not true. Most implementations which use a GC and/or are dynamically
typed will have integers smaller than 64 bits.

------
arh68
> _Could we have a solution which would keep ints unboxed but have fast
> arithmetic operations?_

Wait, is the answer not floating point? In 64-bit IEEE 754, the lower 52 bits
(51-0) hold the significand, with an implicit leading 1, so setting bit 0 to
keep track of GC stuff only introduces a relative error of ~2^-52.

Floating-point multiplication isn't as fast as integer ALUs, perhaps, and
you're now limited to 52-bit integers, but that's what Lua does and it gets by
fine.

------
jlebar
This is similar to "fatvals" in SpiderMonkey (Firefox's JS engine).

IIRC SpiderMonkey stores doubles -- the fundamental numeric data type in JS --
as "unboxed" values, using this article's terminology. I'm not sure if there's
good documentation about this out there, but it's an interesting hack, and
much more complicated than just setting the bottom bit of the value.

~~~
ufo
I think you are thinking about a different trick called NaN tagging.

[http://wingolog.org/archives/2011/05/18/value-representation...](http://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations)

The basic idea is that there are nearly 2^53 different bit patterns that are
treated as NaN by the hardware, but only one is actually produced in practice.
This means that you can "steal" the rest of the NaNs and use them to encode
numbers or pointers.

~~~
munificent
I'm hacking on a fast little dynamically-typed language that uses that
technique. It works fantastically well. I describe it in detail here:

[https://github.com/munificent/wren/blob/master/src/wren_valu...](https://github.com/munificent/wren/blob/master/src/wren_value.h#L400)

~~~
ufo
Thanks for the link! I was actually surprised how hard it was to find a good
description of NaN tagging yesterday.

------
wcummings
Jane Street is a big user of OCaml and has published a sizable amount of
code: [https://github.com/janestreet](https://github.com/janestreet)

------
aidenn0
SBCL allows unboxed machine-word-sized integers in certain cases. This allows
for hand-tuned inner loops and also efficient coding of algorithms with an
implicit mod 2^64.

------
jbert
Would you get simpler (faster) ops if you stole the high bit instead of the
low bit?

e.g. x+y becomes ((x + y) | (1<<63))

Or does the loss of overflow detection hurt you more?

~~~
breadbox
Setting the high bit would mean that the integer values are no longer
automatically distinguishable from pointer values, which is kind of the whole
point.

~~~
TheLoneWolfling
Why? You could just check if the high bit is set, instead of if the low bit is
set.

IIRC, you can do this just by checking if it is less than zero.

~~~
infogulch
First, pointers are typically _unsigned_ integers. Second, memory addresses
are virtual; they don't have to exist physically. That is, you don't have to
physically have 2^63 (roughly 9 quintillion) addressable bytes for a pointer
with the high bit set to be valid. This is especially true with the Address
Space Layout Randomization (ASLR) techniques employed by operating systems.

tl;dr: all valid pointer values are possible, regardless of how much memory
you actually have.

~~~
TheLoneWolfling
So then store the pointer implicitly right-shifted by 1. Means that boxed
access is slower, but unboxed access is faster.

And, when you get down to the assembly level, it doesn't matter - you can
treat an unsigned integer as signed for a comparison if it makes things
easier.

------
MichaelGG
So this is done to avoid the hassle of having to implement a precise GC?

~~~
orbifold
No. OCaml's GC is precise; the tag bit is how the collector tells unboxed
integers from heap pointers, which makes it a lot faster than a conservative
GC could be.

------
auvi
I think the Symbolics 3600 had 4 or 8 bits for tags; OCaml is borrowing some
more Lisp ideas.

~~~
FullyFunctional
The ideas behind pointer tagging have been applied and reinvented an
astonishing number of times; see Knuth and any number of books on the
implementation of functional programming languages. To claim Symbolics or
Lisp invented everything is a bit silly and inaccurate.

OT: Thanks to Don Stewart for the two interesting links:
[https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage...](https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects)
[https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Haskell...](https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/HaskellExecution/PointerTagging)

~~~
_delirium
I believe the basic idea dates to the Rice architecture of the late 1950s and
1960s [1]. Although there it wasn't just a programming language implementation
technique, but a proposal conceived as a full-on alternative to the Von
Neumann architecture [2].

Tagged values as a programming-language implementation technique do seem to
take off more towards the early 1970s, though, with the ex-MIT team that would
produce the Lisp Machine, and the Xerox team that would produce Smalltalk,
both using tagged values prominently.

[1]
[https://en.wikipedia.org/wiki/Rice_Institute_Computer](https://en.wikipedia.org/wiki/Rice_Institute_Computer)

[2] A somewhat later (1973) paper referenced above:
[http://www.feustel.us/Feustel%20&%20Associates/Advantages.pd...](http://www.feustel.us/Feustel%20&%20Associates/Advantages.pdf)

~~~
ScottBurson
Minor correction: the Lisp Machine was created at MIT by MIT staff. Only after
it existed did the commercial entities get spun off.

