
How to Think About Variables in C - denniskubes
http://denniskubes.com/2013/04/23/how-to-think-about-variables-in-c/
======
voidlogic
Extremely uninteresting. It is like a page of "C-S 1XX: Intro to C" fell out
of its bindings and landed on Hacker News.

This might have been mildly interesting if there had been the assembly for a
few different architectures (x86, MIPS, ARM, PowerPC, etc) showing how the C
code was translated to assembler for each. And could have been very
interesting with an additional discussion of memory barriers and atomic
operations in C and their relation to assignments and pointers.

~~~
minimax
HN has a pretty broad audience and a pretty big chunk of it doesn't know
$language. These types of beginner posts for $language pop up from time to
time. It's nothing to worry about.

~~~
voidlogic
$language in this case is C, the lingua franca of computing.

It is almost always the first language ported to any system, almost every
computer science program at least covers the basics, it has been in 1st/2nd
place on the TIOBE index for over a decade, it's the 5th most popular language
on GitHub by commits, and it is over 40 years old.

But I'm willing to accept there might be people on Hacker News who don't
know C; that's why I gave suggestions to the author to expand on the content
and make it interesting to a wider audience. That was the point of my post.

------
haberman
There are some subtle problems with the model as explained in this article. If
you use this as your mental model, you will probably run afoul of undefined
behavior without realizing it.

If you read the C standard, you'll notice it doesn't talk much about "memory"
(the word only appears 13 times in C99); it mostly talks about "objects"
(mentioned 735 times in C99). These objects aren't OO-objects -- obviously C
doesn't have OOP built in -- but rather all the basic types like int, float,
struct, etc are objects. When you declare a variable like "int x", you are
creating an object.

C's aliasing rules dictate that you can only access an object via a pointer of
that object's actual type (with a few exceptions, such as character types).
This is why it is dangerous to think of the assignment operator as a simple
memory-copying operation. If assignment were a simple memcpy, you could do
something like this:

    
    
      int x = 5;
      // BAD: undefined behavior, violates aliasing.
      short y = *(short*)&x;
    

If a variable were just a memory address and assignment were just a memory
copy, this would be a valid operation. But the right way to think of it is
that a variable is a _storage object_ whose address can be taken, and a
dereference is an operation that reads a storage object.

A pointer isn't a generic memory-reading facility, it must actually point to a
valid storage object of the pointer's type (or to NULL).

If you do want to read and write arbitrary objects in memory, you can always
use memcpy():

    
    
      int x = 5;
      short y;
      // This is fine, and smart C compilers optimize away the
      // function call.
      memcpy(&y, &x, sizeof(y));

~~~
sillysaurus
_If a variable were just a memory address and assignment were just a memory
copy, this would be a valid operation._

It's a valid operation regardless of whether a standards body says it's not.

    
    
      uint32 x = 5;
      uint16 y = *(uint16*)&x;
    

The effect is to set y to the first two bytes of memory from x. Values
assigned to x are serialized into memory in either big endian or little endian
order. Those are the only two cases you have to account for. Quake 3 engine
has a macro for the above operation which produces the same value of y on all
platforms. This is useful for serializing x to disk, then loading it later
(and possibly on a different architecture).
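
A minimal sketch of that kind of endian-independent serialization, built from
shifts rather than a pointer cast. This is not the Quake 3 macro itself; the
names `store_u32_le` and `load_u32_le` are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helpers, not taken from any engine. */

/* Write v into out[0..3], least significant byte first. */
void store_u32_le(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v & 0xFFu);
    out[1] = (uint8_t)((v >> 8) & 0xFFu);
    out[2] = (uint8_t)((v >> 16) & 0xFFu);
    out[3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Rebuild the value from the same byte layout on any host. */
uint32_t load_u32_le(const uint8_t in[4])
{
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}
```

Because the shifts operate on values rather than on memory, the bytes written
to disk come out the same on big-endian and little-endian hosts.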

One source of confusion is that int and short are essentially, for all intents
and purposes, undefined -- they are of course defined by the standards, but
their implementation is allowed to vary so much that no programmer can make
any assumptions about their size (in bytes) at runtime.

int8, int16, int32, int64 are all explicit and force the compiler (and the
hardware) to obey the wishes of the programmer. This is, I think, the right
approach. People make much ado about the fact that "a byte isn't necessarily 8
bits" and "the only assumption you can make about a short is that it's smaller
than an int, and larger than a char", etc, which is probably unnecessary
mental effort.

"Bytes are 8 bits. Here are four bytes. Here's the value that the four bytes
store. Copy two of the four bytes to this other spot (adjusting for endianness
appropriately via a macro)."

You typically don't want a memcpy in situations like this due to endianness.

The reason it's useful to explicitly "break the rules" like this is because
it's important to know what assumptions you in fact can rely on, regardless of
what standards bodies have to say about it. Because at that point you can do
incredible things such as <http://www.codercorner.com/RadixSortRevisited.htm>

    
    
       inline float fabs(float x){
            // Mask off the sign bit of the float's bit pattern.
            unsigned int bits = ((unsigned int&)x) & 0x7fffffff;
            return (float&)bits;
       }
    

The reason this is incredible and awesome (rather than horrible and dangerous)
is because it enabled game developers to achieve a more impressive product for
end users, because they were able to do more with the CPU resources that were
available at the time.

It's of course not so relevant nowadays, since it's reasonable to assume that
most gamers have at least a core 2 duo. But it's one of those things that
isn't relevant until suddenly it is -- you're in some situation that requires
sorting millions of floats, and your dataset simply demands more performance
than your compiler typically gives you. Then suddenly you find you can do
amazing things like this, and surprise people with how effectively you can use
a modern CPU.

(Although, the modern antidote to "I need to sort millions of floats quickly"
is to use SSE, not to sort floats as integers. Yet that's even more evidence
that it's better to understand the capabilities of the hardware.)
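
For comparison, the same sign-bit trick can probably be written with memcpy so
that it stays within the aliasing rules. A sketch, assuming float is a 32-bit
IEEE-754 value; `fabs_bits` is a made-up name.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name; assumes float is a 32-bit IEEE-754 value. */
float fabs_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* copy the representation out */
    bits &= 0x7FFFFFFFu;              /* clear the sign bit */
    memcpy(&x, &bits, sizeof x);      /* copy it back */
    return x;
}
```

As noted upthread, smart compilers optimize away fixed-size memcpy calls like
these, so this form need not cost anything over the cast.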

~~~
brigade
_The reason it's useful to explicitly "break the rules" like this is because
it's important to know what assumptions you can in fact rely on, regardless of
what standards bodies have to say about it._

Given that compilers _do_ break when programmers violate aliasing rules, you
should recheck what assumptions you think you can rely on. Non-strict aliasing
is not one of them. Unless you want to slow everything down with compiler-
specific flags like -fno-strict-aliasing.

    
    
        uint8_t foo[4]; *(uint32_t*)foo = 0;
    

Besides even without strict aliasing, the above is not at all guaranteed to
work since not all architectures support unaligned loads. (and if you think
"well but no one uses them, just like no one uses 1's complement architectures
anymore", keep in mind that this includes ARM)

(also use stdint types already)
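
A sketch of one way to get the same effect without the cast: go through
memcpy, which has no alignment requirement and does not violate aliasing. The
helper names `store_u32` and `load_u32` are made up.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: access a uint32_t at any byte address. */
void store_u32(uint8_t *dst, uint32_t v)
{
    memcpy(dst, &v, sizeof v);   /* no alignment requirement on dst */
}

uint32_t load_u32(const uint8_t *src)
{
    uint32_t v;
    memcpy(&v, src, sizeof v);
    return v;
}
```

With these, `store_u32(foo, 0)` replaces the cast above, and it keeps working
if foo ends up at an odd address.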

~~~
sillysaurus

      uint8_t foo[4]; *(uint32_t*)foo = 0;
    

_Besides even without strict aliasing, the above is not at all guaranteed to
work since not all architectures support unaligned loads._

So, the interesting thing about this example is that it does work. It's in
fact very, very difficult to find a platform where that example won't work
(i.e. crashes the program). For example, any C library involving image
manipulation is likely going to have code similar to what you've described,
and those libraries work on almost every platform.

Standards are a good and useful thing. All I'm saying is that it's important
to know which rules you can safely violate.

~~~
__david__
> It's in fact very, very difficult to find a platform where that example
> won't work

No, it isn't. Many ARM processors will bus error on that code if (foo & 3) !=
0. I believe PowerPC doesn't do unaligned word reads either...

It quite often has to do with the memory controller and not with the
particular processor, though I believe x86 has to support unaligned reads.
I've certainly worked first hand with ARMs that did not support it.
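
The (foo & 3) test can be spelled out roughly like this. A sketch only:
`is_aligned` is a made-up helper, and it assumes a platform where converting a
pointer to uintptr_t yields the numeric address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. align must be a power of two.
   Converting a pointer to uintptr_t is implementation-defined, but on
   common flat-address platforms it yields the numeric address. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

`is_aligned(foo, 4)` is the condition under which the word store is safe on
such hardware.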

~~~
sillysaurus
That's interesting. What causes the bus error?

Would

    
    
      uint8_t foo[4];  *(uint32_t*)(&foo[0]) = 0;
    

also result in a bus error? Why?

~~~
__david__
That's the same thing, so yes, if foo is unaligned then it will cause a bus
error. It causes it because the code generates a store-word assembly
instruction (as opposed to store-byte), and if the address is not aligned to 4
bytes then the memory controller hardware will raise a bus error.

Notice I keep saying "if the address is unaligned". The insidious part is that
it _probably_ will work for a while since it's likely that your "foo" array
_will_ happen to be aligned. But add one uint8_t variable to your structure or
stack frame or wherever "foo" is defined and things could shift and suddenly
it starts causing bus errors. It can be a very annoying type of heisenbug.

And bus errors are actually a _good_ thing. I believe I've used hardware (an
ARM or an SH2, can't remember) where the memory controller just ignored the
last 2 bits during whole word reads and writes (which works fine as long as
you only read aligned words). So if you run your code on that hardware it
doesn't give you an error, it just subtly "corrupts" your data. Yay!

------
_kst_
"A data type is a number of bytes to the compiler."

The size of a type is just one of its many attributes. Even if, for example,
"long", "float", and "void*" happen to have the same size, they're still very
distinct types.

"Integer data types are defined in the limits.h file. Float data types are
defined via macros in the floats.h file."

Integer and floating-point types are defined by the compiler, guided by the
hardware and the ABI for the platform. <limits.h> and <float.h> _document_ the
characteristics of the predefined numeric types.
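
A small sketch of reading those documented characteristics. The function name
is made up, and the values printed depend entirely on the compiler and
platform that build it.

```c
#include <float.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: print a few values the headers report
   for whatever compiler and platform build this file. */
void print_numeric_limits(void)
{
    printf("CHAR_BIT    = %d\n", CHAR_BIT);
    printf("INT_MAX     = %d\n", INT_MAX);
    printf("LONG_MAX    = %ld\n", LONG_MAX);
    printf("FLT_DIG     = %d\n", FLT_DIG);
    printf("FLT_EPSILON = %g\n", FLT_EPSILON);
}
```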

"A pointer doesn’t hold a memory address, it holds a number that represents a
memory address."

Sure, and a floating-point object is ultimately just a collection of bits --
but that's hardly the best way to think about either of them. Integers and
pointers (addresses) are logically very distinct things, even if they happen
to have similar representations. For example, the addresses of two distinct
variables have no defined relationship to each other (other than being
unequal); just evaluating (&x < &y) has undefined behavior.

C lets you get away with a lot of type-unsafe stuff, particularly if you
resort to pointer casts, but it's fundamentally much more strongly typed than
the author seems to think it is.

~~~
revelation
See also: strict aliasing

------
dllthomas
1 int x = 10;

2 &x = 20; // this doesn't work

3 * (&x) = 20; // this does work

Why does line 2 &x not work but line 3 does? Because &x returns a pointer, a
number representing a memory address. This is an important distinction. A
pointer doesn’t hold a memory address, it holds a number that represents a
memory address.

=======

No, that is _not_ why. Note that the following _does_ work:

int * x = 0;

and the following works, though typically yields a warning:

int * x = 20;

Line 2 fails because & doesn't give back an l-value.
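
A sketch of the distinction, with the failing line left commented out. The
function name is made up for illustration.

```c
/* Hypothetical function name, for illustration only. */
int assign_through_address(void)
{
    int x = 10;
    int *p = &x;     /* &x is a value (an rvalue) of type int * */

    /* &x = 20; */   /* does not compile: &x is not an lvalue */
    *(&x) = 20;      /* fine: *(&x) designates x, which is an lvalue */
    *p = *p + 1;     /* p is a variable, so *p works the same way */

    return x;        /* 21 */
}
```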

------
asveikau
> Every variable is a starting memory address to the compiler.

Definitely not true. More like, "it will have an address, if you take the
address with the & operator". Otherwise, the compiler is quite free to store
locals in registers.
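
The `register` storage class makes this visible in the language itself: C has
variables whose address you are not allowed to take at all. A sketch with a
made-up function name:

```c
/* Hypothetical function name. */
int sum_to(int n)
{
    register int total = 0;   /* an object with no usable address */
    int i;

    for (i = 1; i <= n; i++)
        total += i;

    /* int *p = &total; */    /* rejected: cannot take the address
                                 of a register variable */
    return total;
}
```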

~~~
denniskubes
> Yes I am being simplistic and yes certain data types have certain syntactic
> sugar but I have found this to be a good mental model

As stated in the post.

~~~
mturmon
I think you're going to keep getting comments on these ill-considered asides,
but here is another problem:

"In most assembly languages, data types don’t exist. You operate on bytes and
offsets."

This is just not true.

Most assembly languages (I learned on PDP-11 assembler, which I remember best,
but what I say is true of 68000 and x86 too) have a notion of a byte, but also
integers of various word lengths, and floating point numbers.

In fact, some registers are in effect designated as "pointers" for various
kinds of conventional indirect addressing (the instruction pointer, the
register holding the stack pointer, and others).

In this sense, C is even closer to assembly than you indicate, because the
data types are so analogous.

------
snorkel
Integers are the simple case, but you really haven't grasped the C memory
model until you're comfortable handling text strings at any length, calling
functions by pointers, working with structure pointers, and knowing when you
need a pointer to a pointer. Part of it is understanding variable scope, local
vs global vs stack frame memory. It's not rocket science, just takes practice,
and the courage to segfault your way through it.
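
One common case where a pointer to a pointer is needed, as a sketch: a
function that has to change which node the caller's list starts at. The
struct and function names here are made up.

```c
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Hypothetical helper: push onto the front of a list.
   head is a pointer to the caller's head pointer, so the
   function can change which node the caller's list starts at.
   Returns 0 on success, -1 if allocation fails. */
int push_front(struct node **head, int value)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->value = value;
    n->next = *head;
    *head = n;
    return 0;
}

/* Free every node and leave the caller's head pointer NULL. */
void free_list(struct node **head)
{
    while (*head != NULL) {
        struct node *next = (*head)->next;
        free(*head);
        *head = next;
    }
}
```

Passing a plain `struct node *` would only update the function's local copy of
the pointer, and the caller's list would never see the new node.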

------
denniskubes
What other mental models do people use to think about variables and memory? I
would like to hear about them.

~~~
bcoates
My mental model for C is symbol-referent diagrams like the first picture on
<http://www.exforsys.com/tutorials/c-language/c-pointers.html>

If you keep track of which boxes are and are not runtime memory cells, that
should be enough to work out any particular C pointer problem except the
pointer-array almost-equivalence mess.

~~~
denniskubes
That is nice. I have seen different pointer diagrams but none that linked it
to a memory list as that does. I like.

------
16s
It sounds simple, but you'd be surprised how many programmers don't grok the
fact that types/data have sizes (especially numeric types). For many tasks,
this doesn't matter, but when it does matter, you need people who understand.

As an example, an IPv4 address is 32 bits. Don't convert it to a string and
put it in a varchar(64) in your database when you are optimizing for space (I
actually saw this once). And yes, the DB had an inet type, but no one knew how
to use it, what it was or why it mattered.
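
To make the size difference concrete, a sketch of packing a dotted-quad
address into one 32-bit integer and formatting it back. The helper names are
made up.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers. Pack four octets into one 32-bit value,
   first octet in the most significant byte. */
uint32_t ipv4_pack(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16)
         | ((uint32_t)c << 8)  |  (uint32_t)d;
}

/* Format a packed address as dotted-quad text.
   buf must hold at least 16 bytes ("255.255.255.255" plus NUL). */
void ipv4_format(uint32_t addr, char buf[16])
{
    snprintf(buf, 16, "%u.%u.%u.%u",
             (unsigned)((addr >> 24) & 0xFFu),
             (unsigned)((addr >> 16) & 0xFFu),
             (unsigned)((addr >> 8) & 0xFFu),
             (unsigned)(addr & 0xFFu));
}
```

The packed form is always 4 bytes; the text form needs up to 15 characters.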

------
__david__
My favorite bit of pointer code is one I had to write in the bootstrap code of
an embedded processor:

    
    
        int r = ((int (*)())startAddress)(); // Wheeee!
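
The cast reads more easily with a typedef for the function-pointer type. A
sketch with made-up names; `sample_entry` stands in for the code that would
live at startAddress, since turning a raw address into a function pointer is
implementation-defined and only meaningful on the target.

```c
/* Hypothetical names throughout. */
typedef int (*entry_fn)(void);

/* Stand-in for the code that would live at startAddress. */
int sample_entry(void)
{
    return 42;
}

/* Call through the pointer; the typedef replaces the inline cast. */
int call_entry(entry_fn entry)
{
    return entry();
}
```

On the real target the call would look like
`call_entry((entry_fn)startAddress)`.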

------
derleth
> C is memory with syntactic sugar and as such it is helpful to think of
> things in C as starting from memory.

<http://en.wikipedia.org/wiki/Lie-to-children>

> A lie-to-children, sometimes referred to as a Wittgenstein's ladder (see
> below), is an expression that describes the simplification of technical or
> difficult-to-understand material for consumption by children. The word
> "children" should not be taken literally, but as encompassing anyone in the
> process of learning about a given topic, regardless of age. [snip] Because
> life and its aspects can be extremely difficult to understand without
> experience, to present a full level of complexity to a student or child all
> at once can be overwhelming. Hence elementary explanations tend to be
> simple, concise, or simply "wrong" — but in a way that attempts to make the
> lesson more understandable.

OK, the very first sentence of this piece falls flat on its face when you
begin to think about how a computer actually handles getting data into and out
of the parts of the CPU that actually do the work of modifying data according
to the opcodes in flight.

Specifically, C is meant to be a pleasant syntax to sling data around a large,
flat address space, where the assumption is that every part of the address
space can be treated like any other, with no special consideration given to
some locations being faster than others. (The 'register' keyword mucked with
this a bit, but approximately nobody uses it anymore in new code. Just as
well, because good compilers ignore it anyway; more below.)

This is horribly, hilariously wrong when you learn about cache hierarchy, and
becomes even more wrong when you throw an OS implementing virtual memory and a
disk cache into the picture. C doesn't have any way to refer to cache; you
can't tell the compiler 'store this in cache' because that would break the
abstraction C enforces.

So we loop back around: C enforces the abstraction for a good reason; namely,
compilers are better than humans at scheduling memory use in practically every
case, and in the few cases they aren't, you're doing something hardware-
specific enough you'll need to drop into assembly anyway. This is also the
reason the 'register' keyword is a no-op and has been for decades. Compilers
can schedule registers better than humans because compilers know more about
all of the optimizations in play, and when they can't, you'll have to drop
into assembly anyway.

_TL;DR_: This is a basic introductory post. Nitpicking it for things that
compilers take care of for you anyway is pointless.

~~~
denniskubes
Thank you.

