
Tagged Pointer Strings - ingve
https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html
======
weinzierl

        I speculate that Apple originally intended to use a 
        fancier variable-length encoding, probably based on a 
        Huffman code. This turned out to be too difficult, or 
        not worth the effort, or they just ran out of time, and 
        so they dialed it back to the less ambitious version 
        seen above, where strings choose between constant-
        length encodings using eight, six, or five bits per 
        character. The weird table lives on as a leftover, and 
        a starting point if they decide to go for a variable-
        length encoding in the future. This is pure guesswork, 
        but that's what it looks like to me.
    
    

A much simpler explanation is that someone just copied and pasted from the
Cocoa string character frequency table.
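The constant-length scheme the quote describes can be sketched as a simple selection function. The two alphabets and the exact payload layout below are illustrative stand-ins, not Apple's actual tables; only the shape of the decision (raw bytes for very short strings, then 6 bits, then 5 bits per character) follows the article:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical alphabets -- Apple's real 6- and 5-bit tables differ. */
static const char SIXBIT_ALPHABET[] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789._";
static const char FIVEBIT_ALPHABET[] = "abcdefghijklmnopqrstuvwxyz '.,-_";

/* True if every character of s appears in the given alphabet. */
static bool covered_by(const char *s, const char *alphabet) {
    for (; *s; s++)
        if (!strchr(alphabet, *s))
            return false;
    return true;
}

/* Returns the bits-per-character of the encoding a tagged-pointer
 * string would use, or 0 if the string must be heap-allocated.
 * Length cutoffs follow the article: up to 7 chars as raw bytes,
 * up to 9 chars at 6 bits, up to 11 chars at 5 bits. */
int choose_encoding(const char *s) {
    size_t len = strlen(s);
    if (len <= 7)  return 8;   /* raw bytes fit in the payload */
    if (len <= 9  && covered_by(s, SIXBIT_ALPHABET))  return 6;
    if (len <= 11 && covered_by(s, FIVEBIT_ALPHABET)) return 5;
    return 0;                  /* too long, or characters outside the tables */
}
```

A variable-length Huffman code would replace the three fixed tables with per-character bit lengths, which is where the "weird" frequency-ordered table would actually pay off.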

------
api
I recall a long time ago reading something about potential uses of a 128-bit
CPU architecture with 128-bit addressing.

Obviously 128-bit addressing isn't and probably never will be needed to
address memory, at least not in this geological epoch. But that's not the
point.

Instead, 128-bit addressing would allow every pointer to carry 64 bits of tag
information, which would enable all kinds of VM optimizations. Not sure the
additional memory bloat would be worth it, though. Like some RISC architectures,
it might die from memory bandwidth constraints.

~~~
rogerbinns
> I recall a long time ago reading something about potential uses of a 128-bit
> CPU architecture with 128-bit addressing.

Go back to the IBM System/38 and the AS/400 and that is exactly what you have.
Programs are actually 128-bit-oriented bytecode, so the actual CPU
architecture isn't visible, doesn't matter, and has been upgraded many times
over the years.

They also did virtual memory differently. Everything is in the same address
space, which means everything can (after access control) communicate and
cooperate. Task switches are also very fast. They also use tagged pointers,
but it is a single bit outside of what the apps can see, and it marks a pointer
as modified. Consequently pointers can be given out, and the kernel knows if
the app has tampered with them.

I highly recommend reading the architects book about how all this stuff
worked, mixed in with how business machines should behave. Many things are
very different than conventional Unix/Linux/Windows NT kernels, and a good way
of broadening your mind. The second edition book is good - later editions may
have removed earlier interesting detail.
[http://www.amazon.com/Inside-AS-400-Second-Edition/dp/188241...](http://www.amazon.com/Inside-AS-400-Second-Edition/dp/1882419669)

------
amelius
How about "interning" strings [1]? This is also useful for storing larger
strings.

[1]
[https://en.wikipedia.org/wiki/String_interning](https://en.wikipedia.org/wiki/String_interning)

~~~
chc
Constant strings have always been interned. I don't think any others are,
though I haven't looked into it in a while.

~~~
mikeash
It's not NSString, but it's interesting to note that Objective-C selectors are
just interned strings. By interning them, they can be compared with a simple
pointer comparison. Because they're just C strings, you can print a selector
by casting it to char *. (But don't do this in production code, debugging
only!)
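The payoff of interning is that equality becomes pointer equality. A minimal sketch in plain C (a toy pool, nothing like the runtime's real selector table):

```c
#include <stdlib.h>
#include <string.h>

#define POOL_MAX 256
static const char *pool[POOL_MAX];
static int pool_len = 0;

static char *dup_str(const char *s) {
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    memcpy(copy, s, n);
    return copy;
}

/* Return the canonical pointer for s: equal contents always yield
 * the same pointer, so comparison is just `a == b`. */
const char *intern(const char *s) {
    for (int i = 0; i < pool_len; i++)
        if (strcmp(pool[i], s) == 0)
            return pool[i];            /* already interned: reuse it */
    const char *copy = dup_str(s);     /* first sighting: store a copy */
    pool[pool_len++] = copy;
    return copy;
}
```

A real implementation would use a hash table rather than a linear scan, but the invariant is the same one selectors rely on.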

------
jheriko
this is an interesting trick, but it only really has benefit if used this way
when things are dynamically late-binding. the cost of checking whether a
pointer is tagged can even exceed the total cost of a simple operation on a
static, compile-time-known object (e.g. a comparison with no branch, small
copies, simple arithmetic, etc.). for something like NSNumber this is probably
very measurable (compared with, say, NSString)

if objective-c had more mechanisms to enable baking out the run-time lookups
and operations at compile-time then i think that this would no longer be a
valuable optimisation. maybe they are there but are just not easy to find or
use... having the ability to use C or C++ does make it unnecessary.

sure, you can use C types in a lot of cases along with the C library, but
being able to do more with the object-orientation functionality would be nice,
rather than having to do OO by convention in C or using C++ to be seriously
fast or efficient. swapping NSDictionary for a std::map or std::unordered_map
is an optimisation i've ended up making multiple times, in situations that
aren't even really tight inner loops, where the difference has been as great
as turning a severe, perceptible hitch (approaching a second of run-time) into
something you can't even notice (a millisecond or less).

as a rule i use Objective-C only as much as the platform forces it, and stick
with C++. it's enormously harder to make slow and inefficient C++ code by
accident. e.g. you can loop over thousands of objects and do things with them
in << 16ms without special effort. on some devices this is impossible with
objective-c even if what you are doing with the objects is a nop... this is
why gamedevs, for instance, avoid it like the plague.

(however i'm sure there could be other, interesting, uses of tagged
pointers...)

~~~
mikeash
Tagged pointers really have nothing to do with late binding. They're useful
anywhere you have data that is sometimes small enough to fit within a single
pointer-sized chunk, and sometimes isn't. Strings are a good example of this:
short strings can be tagged pointers, and long strings can be dynamically
allocated. As another commenter mentioned, some C++ implementations use tagged
pointers for std::string.

Another example might be a big number library, where numbers smaller than the
pointer size could be implemented as tagged pointers, and larger numbers would
be dynamically allocated. A C++ big number library could easily do this.
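The bignum idea can be sketched with a one-bit tag in the low bit of the word: small values ride inline, large ones go to the heap. The layout is illustrative (it assumes arithmetic right shift for negative values, which mainstream compilers provide):

```c
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t bignum;   /* tagged word: small int or heap pointer */

#define TAG_BIT ((uintptr_t)1)

/* Small path: value lives in the upper 63 bits, low bit set as tag. */
bignum  make_small(int64_t v) { return ((uintptr_t)v << 1) | TAG_BIT; }
int     is_small(bignum b)    { return (int)(b & TAG_BIT); }
int64_t small_value(bignum b) { return (int64_t)b >> 1; } /* assumes arithmetic shift */

/* Heap path: malloc results are at least 2-byte aligned, so the low
 * bit of a real pointer is always 0 and never collides with the tag. */
bignum make_big(int64_t v) {
    int64_t *p = malloc(sizeof *p);
    *p = v;
    return (uintptr_t)p;
}

int64_t value_of(bignum b) {
    return is_small(b) ? small_value(b) : *(int64_t *)b;
}
```

A full library would of course store multi-word magnitudes behind the pointer; the point is only that the tag check is a single AND.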

Whether it's worth the cost depends on how expensive memory allocation is, how
expensive pointer indirection is, how many objects are able to fit within a
tagged pointer, and how expensive the tag check is. Typically the answers are,
a lot, a lot, a lot, and a lot. And dynamism doesn't change the answer much.

~~~
jheriko
erm. why are you even using a pointer if it's not late-binding?

this is what static, stack-allocated, and global variables are for, with
references as needed...

------
cpr
Interesting--the kind of trick you'd normally see in an interpretive language.

Marshalling a string object ref to an actual value (before it can be used for
string manipulations internally) must incur a fair bit of CPU overhead.

I guess they decided the unboxed memory savings were worth it.

~~~
paulhodge
It's not really CPU intensive. To turn a 5-bit char into an 8-bit one, just do
a lookup into a tiny constant array. It doesn't even introduce any new
branches, and it's roughly the same performance as iterating over every
character (which you're probably doing anyway if you need the conversion).

The savings aren't just memory, it's probably a performance improvement too.
Fewer allocations, fewer pointer indirections, and some operations (like
equality checking) become O(1) instead of O(n) for these tiny strings.
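The table-lookup decode being described might look like this. The 32-entry alphabet here is a placeholder, not Apple's actual 5-bit table:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-character alphabet (index = 5-bit code). */
static const char FIVE_TO_ASCII[] = "abcdefghijklmnopqrstuvwxyz '.,-_";

/* Unpack n 5-bit characters from the low bits of a payload:
 * one shift, one mask, one table lookup per character -- no branches. */
void decode5(uint64_t payload, int n, char *out) {
    for (int i = 0; i < n; i++)
        out[i] = FIVE_TO_ASCII[(payload >> (5 * i)) & 0x1F];
    out[n] = '\0';
}

/* Inverse, for testing; assumes every character is in the table. */
uint64_t encode5(const char *s, int n) {
    uint64_t payload = 0;
    for (int i = 0; i < n; i++)
        payload |= (uint64_t)(strchr(FIVE_TO_ASCII, s[i]) - FIVE_TO_ASCII)
                   << (5 * i);
    return payload;
}

/* Round-trip sanity check (strings of up to 12 characters). */
int roundtrip5(const char *s) {
    char buf[13];
    int n = (int)strlen(s);
    decode5(encode5(s, n), n, buf);
    return strcmp(buf, s) == 0;
}
```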

~~~
Someone
The other way round (deciding whether a newly minted string of length 10 can
be put in a tagged pointer) is slightly more complex. Also, for tiny strings
(the only strings that this supports) on modern hardware, I'm not sure that
O(n) takes much longer than that O(1).

I would think the avoidance of allocations and the pointer indirections are
the big wins.

------
nine_k
Looks very similar to the way in which numbers / strings / objects are
represented in Javascript (and earlier in some Lisps).

The nice thing is that for pointers, tagging is "free", since the lowest 4
bits of a 16-byte-aligned pointer are not used anyway.
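The "free" low bits work like this: with 16-byte alignment the bottom 4 bits of a valid pointer are always zero, so a tag can be ORed in and masked out. A small sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TAG_MASK ((uintptr_t)0xF)   /* low 4 bits free at 16-byte alignment */

void *tag_ptr(void *p, unsigned tag) {
    assert(((uintptr_t)p & TAG_MASK) == 0 && tag <= 0xF);
    return (void *)((uintptr_t)p | tag);
}

unsigned get_tag(void *p)   { return (unsigned)((uintptr_t)p & TAG_MASK); }
void    *strip_tag(void *p) { return (void *)((uintptr_t)p & ~TAG_MASK); }

/* Demo: round-trip a tag through a 16-byte-aligned allocation. */
int tag_roundtrip(unsigned tag) {
    void *p = aligned_alloc(16, 16);   /* C11 */
    void *t = tag_ptr(p, tag);
    int ok = get_tag(t) == tag && strip_tag(t) == p;
    free(p);
    return ok;
}
```

The cost is that every dereference must strip the tag first, which is the check the grandparent comment was weighing.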

~~~
apaprocki
It's more complicated than that -- x86-64 systems only use 48 bits of address
space. JS engines use all the remaining bits for tagging, not just the bits
freed up by 8- or 16-byte alignment. This works great until you want to run on
another CPU architecture that actually has 64 bits of addressable space and
uses it regularly...

~~~
jhallenworld
I'm surprised they need so many bits... you really just need to know whether
the item is an object (so the type can be embedded in the object) or a
primitive type, and there are not that many primitive types.

Some dynamically typed systems put the tag at the beginning of a page, so all
items of a particular type are allocated out of a page tagged with that type.
You don't waste any bits from the pointer or data, but you do have to do some
masking and, worse, a remote memory access to get the type.

Probably safest and fastest to do something like: struct { int type; union {
char *s; int n; double d; }};
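Spelled out, that struct is an ordinary tagged union: an explicit type field plus a union. It spends an extra word per value but needs no pointer bit-twiddling, and the tag check is a plain field read:

```c
enum value_type { T_STRING, T_INT, T_DOUBLE };

struct value {
    enum value_type type;       /* explicit tag, no bits stolen */
    union {
        const char *s;
        int n;
        double d;
    } u;
};

struct value make_int(int n) {
    struct value v = { .type = T_INT, .u.n = n };
    return v;
}

struct value make_double(double d) {
    struct value v = { .type = T_DOUBLE, .u.d = d };
    return v;
}

/* Dispatch on the tag: an ordinary comparison, not a mask. */
double as_number(struct value v) {
    return v.type == T_INT ? (double)v.u.n : v.u.d;
}
```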

~~~
bdash
Many JavaScript engines use IEEE-754 double-precision floats as their
immediate value type, and fit pointers and integer within the NaN payload of
the double representation. See
[https://wingolog.org/archives/2011/05/18/value-representatio...](https://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations)
for a discussion of a few similar approaches and the tradeoffs of each.
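The NaN-boxing idea can be sketched in a few lines: a binary64 NaN has 51 payload bits to spare, so non-double values are stored as NaNs with a distinctive bit pattern. This is a deliberately naive version (a real engine also has to distinguish boxed values from NaNs produced by arithmetic, which this sketch does not):

```c
#include <stdint.h>
#include <string.h>

#define QNAN_MASK UINT64_C(0x7FF8000000000000)  /* quiet-NaN bits */
#define PAYLOAD   UINT64_C(0x0007FFFFFFFFFFFF)  /* 51 payload bits */

/* Type-pun via memcpy, the portable way to read double bits. */
static uint64_t double_bits(double d) {
    uint64_t b; memcpy(&b, &d, sizeof b); return b;
}
static double bits_double(uint64_t b) {
    double d; memcpy(&d, &b, sizeof d); return d;
}

/* Box an integer inside a NaN; any real double stays a plain double. */
double   box_int(uint32_t v)  { return bits_double(QNAN_MASK | v); }
int      is_boxed(double d)   { return (double_bits(d) & QNAN_MASK) == QNAN_MASK; }
uint32_t unbox_int(double d)  { return (uint32_t)(double_bits(d) & PAYLOAD); }
```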

------
TazeTSchnitzel
Note that 8 bits per character isn't ASCII, but ISO 8859-1, which Unicode is a
superset of. I expect the string "née" would become a tagged pointer string,
for example.
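Since ISO 8859-1 maps code points U+0000 through U+00FF directly to byte values, the 8-bit eligibility test reduces to checking that every code point fits in a byte. "née" passes because 'é' is U+00E9:

```c
#include <stdint.h>

/* True if every Unicode code point is <= 0xFF, i.e. the string is
 * representable in ISO 8859-1 and eligible for the 8-bit tagged form. */
int fits_latin1(const uint32_t *codepoints, int n) {
    for (int i = 0; i < n; i++)
        if (codepoints[i] > 0xFF)
            return 0;
    return 1;
}
```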

