Static Integer Types (tratt.net)
55 points by matt_d on July 1, 2021 | 42 comments



> If you dive into the C specification, you'll discover that the values I tried to cast lead to undefined behaviour [10] which means that my program could have done anything. In practice, the behaviour I observed can be explained very simply.

This is not entirely correct. Note that this is integer conversion as opposed to signed integer arithmetic overflow. The latter is indeed undefined behavior, while the former is implementation-defined.

https://cigix.me/c17#6.3.1.3.p3

> Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.

In C++, since C++20 the result is fully defined: the value is reduced modulo 2^N, i.e. two's-complement wraparound. Before that the result was implementation-defined (without the option to raise a signal).

AFAIK all major compilers do the obvious thing in both C and C++. The undefined behavior of signed integer overflow is indeed used by compiler optimizations.
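
To make the distinction concrete, a minimal C sketch (the wrapping noted above is what gcc/clang choose to do, not something the C standard guarantees):

    #include <stdint.h>

    int32_t narrow(int64_t x) {
        /* Integer *conversion*: if x doesn't fit in int32_t, the result
           is implementation-defined, or an implementation-defined signal
           is raised (C17 6.3.1.3p3). gcc/clang wrap modulo 2^32. */
        return (int32_t)x;
    }

    int inc(int x) {
        /* Signed arithmetic *overflow*: undefined behaviour. A compiler
           may assume it never happens, e.g. fold `x + 1 > x` to true. */
        return x + 1;
    }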


One feature I wish was more common is the ability to easily range-restrict primitive types, without all the boxing and boilerplate and operator overloading.

I can write a type that only handles integers 0-10, or floats in [0...1], but it's more of a lift than I'd like in most cases, and inefficient, so I don't, even though it would be a boon for correctness and avoiding idiot bugs.
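
For illustration, a minimal C sketch of the boilerplate involved (score_t and its helpers are made-up names; every operation has to be written and re-checked by hand):

    #include <assert.h>

    typedef struct { int v; } score_t;   /* invariant: 0 <= v <= 10 */

    static score_t score_new(int v) {
        assert(v >= 0 && v <= 10);       /* range check at construction */
        return (score_t){ v };
    }

    static score_t score_add(score_t a, score_t b) {
        return score_new(a.v + b.v);     /* every operator needs a wrapper */
    }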




TIL Pascal had those. I'll need to check out Turbo Pascal some day; I could have had the joys of static bondage typing sooner.



Or Idris's dependent types, which allow you to prove properties of any value, not just numbers.


AFAIK refinement types (e.g. LiquidHaskell) are a bit more appropriate than dependent types for this sort of thing.


Plus, Nim supports formal proof natively

https://nim-lang.org/docs/drnim.html


Hey, great for Nim. Though I hope they're not /too/ tied to z3. Why3 seems to be the SOTA tool for proof: it can harness lots of different SMT solvers (each with their strengths and weaknesses), including z3, cvc4, and alt-ergo, but also specialised solvers (Gappa?). It is at the heart of Frama-C and SPARK 2014.

I also think that not having executable contracts by default (and letting you decide whether you want them in your final program or not; see ghost code) is going to be a pain point in the future. I don't know Nim well enough to tell whether the syntax for provable properties differs from Nim's ordinary syntax, but it can be a trap to have your users learn yet another language for proof.

I'm hoping for the best here, and I must say, I'm quite happy to see other languages/environments jump in!

Some years ago there was only industrial support around Frama-C and SPARK, and some academic success around JML? I don't know Eiffel's EVE well enough to say whether I would consider it 'mature'. Nowadays, with the work on Rust by Alastair Reid, we're getting some new energy in the domain. Great!


> I hope they're not /too/ tied to z3

If you mean having a hard dependency on z3 at compile time or something similar - no, it's not. :)


I was thinking more of the whole verifier having been written to fit/please z3. It can be hard/daunting to move to more 'generic' SMT drivers if it's done late in a project.


Rust's const generics eval is quite powerful. It may be able to represent this.


It probably can't yet on stable, but it will be able to within the next year or so.


Oh yeah, I guess I should have mentioned I'm talking explicitly about the nightly const feature.


> As far as I can see, the fast types are a failure, and the least types are unused, possibly because no-one can understand exactly what the latter mean.

I personally quite like that we have the fast/least types, because they let you write portable code that would otherwise require the uintN_t types, which are only defined if the platform supports that exact width.

Let's say you want to implement a cryptographic algorithm (these often require 32/64-bit modular arithmetic) that works on any conforming C compiler. uint32_t isn't guaranteed to be defined, but uint_least32_t is. You can use uint_least32_t instead of uint32_t and mask out the upper bits that may exist when necessary. You'd do this with e.g. UINT32_C(0xFFFFFFFF), which ~~coincidentally~~ by design also has the type uint_least32_t. This results in "perfect" code gen on any platform that has a 32-bit unsigned integer type (given a reasonable optimizing compiler), since in that case uint_least32_t is the same type as uint32_t, and otherwise it will still work.
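
For instance, a portable 32-bit rotate-left might look like this (a sketch; rotl32 is a made-up name, and the 1u* factors guard against the signed-promotion issue noted in the edit below):

    #include <stdint.h>

    /* Rotate a 32-bit value left by n (0 < n < 32), stored in a
       uint_least32_t that may be wider than 32 bits. */
    static uint_least32_t rotl32(uint_least32_t x, unsigned n) {
        x &= UINT32_C(0xFFFFFFFF);                 /* clear any upper bits */
        return ((1u * x << n) | (1u * x >> (32u - n)))
               & UINT32_C(0xFFFFFFFF);             /* mask again after shifting */
    }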

The fast types could be used for something similar if they were properly defined. But C has a history of compilers implementing features not as intended; I'm looking at you, rand(): "The C89 Committee decided that an implementation should be allowed to provide a rand function which generates the best random sequence possible in that implementation, and therefore mandated no standard algorithm" (C99 rationale).

Edit: Note that you also need to make sure that the variables aren't promoted to signed integers, e.g. x << 3 should be 1u*x << 3.


I found types like uint_least32_t to be much less useful after learning that the standard C/C++ types already have these guarantees. Namely: char is at least 8 bits wide, short is at least 16 bits, int is at least 16 bits, long is at least 32 bits, and long long is at least 64 bits.

Implementing cryptographic algorithms using uint_least32_t can be dangerous. Suppose that short is 32 bits wide and int is 48 bits wide. Your uint_least32_t might then be mapped to unsigned short. When you do something like (x << 31), x gets promoted to signed int, and then you might be shifting a 1 into the sign bit, which is undefined behavior. So I would say that a better idea is to use unsigned long, which is guaranteed not to promote to a higher-rank type.
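
A concrete sketch of the hazard, under that hypothetical 32-bit short / 48-bit int implementation:

    #include <stdint.h>

    /* Here uint_least32_t == unsigned short (32 bits). */
    uint_least32_t f(uint_least32_t x) {
        /* x promotes to 48-bit *signed* int; for x >= 1 << 16 the shift
           overflows past the sign bit: undefined behaviour. Writing
           (1u * x << 31) keeps the arithmetic unsigned and is safe. */
        return x << 31;
    }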


Oh, you are right, this can be problematic, although I don't like your solution either, since unsigned long is going to be larger than 32 bits on 50% of platforms (not really, but you know the drill: Linux: 64-bit, Windows: 32-bit).


Indeed. The trick here appears to be to write (1U*x << 31) so that 1U*x has whichever type is the larger of uint_least32_t and unsigned int.


Yup. Alternatively ((0U + x) << 31). I first thought about this problem years ago: https://stackoverflow.com/questions/39964651/is-masking-befo...


Well, rand has problems other than just being a poor random generator.


On Rust’s pointer-sized integer types, the design was subject to much agonising in 2013–2014. If you’re interested in the history, the end point is https://rust-lang.github.io/rfcs/0544-rename-int-uint.html, and the GitHub comments on that RFC and earlier proposals, and earlier discussions on discuss.rust-lang.org as it was at the time (now renamed without redirect (grr!), change any links to internals.rust-lang.org).

Beyond this, I speak as one who hasn’t done any embedded work or touched a platform where what this article calls “data width” and “address width” differ.

I suspect that the key here for Rust is that it simply doesn't have a type for what this article calls "data width", because one wasn't found to be useful. I'm not certain why you'd actually need such a type; I'm guessing the only place it matters is the size of registers, which surface-level Rust doesn't need to worry about, though I'm not sure whether asm! has anything to do with it. I can imagine that things like GPIO pins (which might need data width?) will be handled in a library for a specific board, which can thus define the type itself.

I also have a vague feeling that there’s also a third concept here, not just the two: that what I’ve called “data width” in the previous paragraph is actually “word size”, which could potentially be smaller again. Something about an 8-bit architecture with 16-bit address space and 32-bit pointers. Can’t remember if register sizes are consistent, either.

I’m writing fairly incoherently and inaccurately here because I really don’t know what I’m talking about. :-)


I am the author of a library that requires knowledge and awareness of integers that are strictly no wider than the general-purpose register of the processor; the fact that `usize` is "the largest GPR" on every target Rust knows about is great, but I'd still appreciate a distinction between "this fits in exactly one GPR" and "this fits in the address bus",

_especially_ since the library in question creates pointers that are wider than `usize`


The definition clearly says that `usize` is pointer-sized, and makes no reference to register size, so it seems there's already a pretty clear distinction?

If you are specifically looking for a register-sized type, then you should define it on a per-architecture basis via `cfg()` options.


One can also argue that data width is the size of a cache line, which nowadays is often larger than the registers (and might even differ between the various cache levels, though I don't know whether such architectures exist).


The author of this post is oblivious to fractional bit encodings. When encoding an array of items, you can always encode the array optimally, only wasting less than 1 bit.

For example, if we had an array of n test scores from 1-100, we only need ceil(n*log2(100)) bits.

The key is that when reading the values, instead of shifting as we normally would to iterate an array, we divide by 100; instead of masking bits, we get the current value by taking it mod 100. It's equivalent to shifting/masking for a power of 2, but extends packing to fractional bits. It's quite beautiful.
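
A minimal C sketch of the idea, assuming scores 0-99 (the same 100 possibilities, shifted to start at zero):

    #include <stdio.h>

    #define BASE 100ULL

    int main(void) {
        int scores[] = {17, 99, 42, 0, 73};
        int n = 5;
        unsigned long long packed = 0;

        /* Encode: treat the array as the digits of a base-100 number
           (100^5 < 2^64, so five scores fit in one 64-bit word). */
        for (int i = n - 1; i >= 0; i--)
            packed = packed * BASE + (unsigned long long)scores[i];

        /* Decode: div/mod by 100 instead of shift/mask. */
        for (unsigned long long v = packed; n > 0; n--, v /= BASE)
            printf("%llu ", v % BASE);
        printf("\n");   /* prints: 17 99 42 0 73 */
        return 0;
    }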

Take a look at this blog post to understand more about fractional bit representations:

https://hbfs.wordpress.com/2011/11/01/fractional-bits-part-i...


Speaking from experience on this... you get some crazy looks from code reviewers. I got perhaps overclever with this at one point, using it to generate passwords by extracting log2(44) bits progressively out of a random number to choose from my blessed set of password characters, and it took some patience with the security team to verify that it was safe.

I did it just because it's one of the more elegant ways out of the problem of truncation of bits when you have a non-even-power-of-two set of possibilities to choose from, especially when you need to do it repeatedly from a random source. Instead of discarding or fussing about, if you need fractional bits from a given random source of bits, you just... take fractional bits. No fuss. No muss.


> by extracting log2(44) bits progressive out of a random number to choose from my blessed set of characters for passwords and it took some patience with the security team to verify that it was safe.

It's safe, but if you somehow depended on all the passwords having exactly equal probability, you'd be wrong (unless you're already using a random number from a limited range). Of course this is easily fixed by a range check on the random number, generating it again if it fails the check.
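
A sketch of that fix (rejection sampling) in C; rand_u32 is a stand-in for whatever uniform 32-bit source is in use:

    #include <stdint.h>

    extern uint32_t rand_u32(void);   /* assumed uniform on [0, 2^32) */

    /* Unbiased choice from 44 characters: reject raw values that fall
       in the final partial block of 44, so every residue mod 44 is
       equally likely. */
    uint32_t pick44(void) {
        const uint32_t limit = UINT32_MAX - (UINT32_MAX % 44u);
        uint32_t r;
        do {
            r = rand_u32();
        } while (r >= limit);         /* retry on the biased tail */
        return r % 44u;
    }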


I also have written about using pairing functions to extract multiple random numbers from a single rand() call

Your method is more concrete than mine: mine is math, yours is binary-based CS.

https://churchofthought.org/blog/2019/07/24/an-ordered-varia...


I feel like the core idea here is really intuitive, but the presentation as fractional bit encoding obscures it—at least to me. It's just encoding an array of values with K distinct possibilities as a base-K number. I had to read the code in the link above to understand that; the exposition didn't help me at all.


> It's encoding an array of values with K distinct possibilities as a base-K number

Yes, and the encoding and decoding step needs to do arithmetic so there's a practical limit to how big these numbers can be. But sometimes you do get some real gains.


Interesting concept, but seems like it only works for fairly small arrays, right?

You could fit 40 base-3 values into a 64-bit integer, but in order to read/write index 39, you have to perform 39 multiplications/divisions. So no efficient random access. And is there any way to encode more than 40 base-3 values if your largest integer is 64 bits? If you need 45 elements, I guess you could use an array of five 8-bit integers where each 8-bit element represents 5 base-3 elements. Of course, now you're limited to growing your array in increments of 5 (or, more generally, smallest efficient integer size / log base 2 of the number of representable values).


> Interesting concept, but seems like it only works for fairly small arrays, right?

Any size array will work.

> And is there any way to encode more than 40 base-3 values if your largest integer is 64 bits?

Yes, absolutely. And with ease. Bignums make it seamless. Some languages like Perl and Python and Mathematica support them out of the box.

You can still do random access, it just becomes: divide bignum by 3^n, take result mod 3. It will be costly though, like you mentioned. There is a way to bound the cost.

You can chunk your elements to a number of bits that is a multiple of 8 bits, a byte. This allows you to do random access with minimal overhead, while giving you the savings, bounded by your choice of chunk size.

For example, 100 of your elements can fit neatly into 20 bytes.

101 elements would need more than 20 bytes (i.e. more than 160 bits), but 100 elements need only 158.5 bits, so we lose only 1.5 bits per chunk.

https://www.wolframalpha.com/input/?i=log2%283%29*100.+vs+lo...

Consider that we do this chunking. Let us say we have 10000 elements and want to look at the value of element 6000. We can immediately skip to byte (6000/100)*20 = 1200 and simply take the value there mod 3 (for an element in the middle of a chunk, we would first divide by the appropriate power of 3). This is possible because the chunks are no longer related to each other. So we get random access with a small procedure that gives us a huge space saving.

If we choose a byte-sized chunk size, we can graph (log2(3)*n) % 8 to look for which chunk sizes would lose the least bits of storage.

https://www.wolframalpha.com/input/?i=discrete+plot+%28log2%...

We want to choose chunk sizes where the mod value is close to 8, since we have to round up.
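
A sketch of the chunked scheme in C, shrunk to uint64_t chunks of 40 base-3 values each (3^40 < 2^64) so the arithmetic stays native; the names are made up:

    #include <stdint.h>

    #define VALS_PER_CHUNK 40   /* 40 * log2(3) ~= 63.4 bits per chunk */

    /* Pack 40 base-3 values (little-endian digits) into one chunk. */
    uint64_t pack_chunk(const unsigned vals[VALS_PER_CHUNK]) {
        uint64_t c = 0;
        for (int k = VALS_PER_CHUNK - 1; k >= 0; k--)
            c = c * 3 + vals[k];
        return c;
    }

    /* Random access: jump straight to the right chunk, then divide
       down to the wanted digit and take it mod 3. */
    unsigned get_trit(const uint64_t *chunks, unsigned i) {
        uint64_t c = chunks[i / VALS_PER_CHUNK];
        for (unsigned k = i % VALS_PER_CHUNK; k > 0; k--)
            c /= 3;
        return (unsigned)(c % 3);
    }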

Hope this helps.


> Coding with an non-integer number of bits is counter-intuitive

I've reinvented this very thing several times in the past. I don't see how it is in any way unnatural or unintuitive. You're simply numbering the members of a Cartesian product of some finite sets. The maximum number possible, if you start from number 1, is the product of the cardinalities of all of your sets. It's basically elementary school math, which is probably why I had no problem coming up with this idea even in my dilettante past.


> Let’s start with Rust, which offers the usize type, defining it as “The pointer-sized unsigned integer type. The size of this primitive is how many bytes it takes to reference any location in memory.” That definition is based on the assumption that a machine’s address and data widths are the same

Is this true? I don't know the history here, but just from reading that definition, it seems like `usize` is tied only to the address width. Where does the data width come in?


Rust’s usize does the jobs of both uintptr_t (a pointer-width type) and size_t (a data-width type)


size_t is just a type that's large enough to hold the largest possible object. It doesn't necessarily relate to the register size...

This means that `uintptr_t` must be at least as large as `size_t`, and so Rust's usage of `usize` for both is not problematic.


I might be wrong, but what came to mind was segmented memory, i.e. the 8086, where registers are 16-bit but (far) pointers are 16:16 (ignoring the quirk with overlapping segments).


That would make usize a 32-bit integer, regardless of the data width.

It does make other hidden assumptions though, such as that there is a single "memory" that is uniquely identified by a usize value, regardless of the way it is accessed. For x86, for example, the same address could either refer to a location in memory or to a location in an I/O device, depending on whether in/out instructions or "normal" instructions were used.


> That would make usize a 32-bit integer, regardless of the data width.

Doh, long day at work, misinterpreted "data width".


Yeah I didn't understand that part either. The quote doesn't mention data width at all...


I thought this blog post would discuss type-level integers but I didn’t see anything about that: https://users.rust-lang.org/t/notes-on-type-level-integral-n...



