Hacker News
Walter Bright on C's Biggest Mistake (dobbscodetalk.com)
64 points by kssreeram on Dec 24, 2009 | 47 comments



Lack of array bounds checking is not a problem with C.

It is a problem with our hardware.

Introducing bounds checking without introducing a penalty on array access time is impossible on our "C machines".

C/C++ are often thought of as "close to the metal" - but they are close to particular varieties of metal - those designed to run C/C++. We arrived at them through historical accident. There are many other ways to build a computer - and it is not entirely obvious that a "C architecture" is necessarily the simplest or most efficient:

http://www.loper-os.org/?p=46

That a language which is "close to the metal" is braindead is solely a consequence of braindead metal.

The "C architecture" is a universal standard, to the extent that it has become the definition of a computer to nearly everyone. This is why you will never find the phrase "C architecture" in a computer architecture textbook. And yet it is a set of specific design choices and obsolete compromises, to which there are alternatives.


See also the C FAQ, which patiently devotes 24 questions to the topic. (You can almost tell just how frequently the question came up on the list.)

http://c-faq.com/aryptr/index.html


Is there anywhere other than an unchecked char* where buffer overflows occur? If not, this single fact is the one that leads to so many of the software vulnerabilities in the wild.


Yes, memcpy overflows are just as common as strcpy overflows, and structure overflows are more common today than strings, if only because most of the trivial string stuff has been flushed out by now.


There are others, but that case covers most of the ones seen in practice.


Thank god for this mistake; it is what makes C good at low-level programming. It just passes addresses between functions. Light and fast, no abstractions.

I love it: a way of doing assembler-like coding, but multiplatform.

If I want high-level programming I will use another language, but when you want machine control you have C without all the bloat.


There would be little lost from having to specify &arr[0], rather than having an array-typed arr degrade into a pointer directly, but a huge amount to be gained - some very much needed help with tracking array sizes.
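
Purely as a sketch (the struct and names here are mine, for illustration, not the actual proposal), passing the length alongside the pointer looks something like this:

  #include <stddef.h>
  #include <stdio.h>

  /* A hand-rolled "fat pointer": hypothetical names, illustration only. */
  struct int_slice {
      int    *ptr;
      size_t  len;
  };

  /* The callee can bounds-check, because the length travels with the
     pointer instead of being lost to decay. */
  static long sum(struct int_slice s) {
      long total = 0;
      for (size_t i = 0; i < s.len; i++)
          total += s.ptr[i];
      return total;
  }

  int main(void) {
      int arr[8] = {1, 2, 3, 4, 5, 6, 7, 8};
      struct int_slice s = { &arr[0], sizeof arr / sizeof arr[0] };
      printf("%ld\n", sum(s));
      return 0;
  }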


C combines the power and performance of assembly language with the flexibility and ease-of-use of assembly language.

However, I was amazed to find that modern assembly language (since I was last in the game 25 years ago) has many high-level concepts in it (structures, loops, conditions etc), and looks... suspiciously... C-like.

But you're quite right about portability. Although C is famously not perfectly portable (int sizes, all those #defines - just some of the issues Java tackled), it is a hell of a lot more portable than an actual assembly language. :-)


> (structures, loops, conditions etc), and looks... suspiciously... C-like.

Which assembly? All the ones I know of - especially the modern ones - have become even lower level over time, as compilers came to favor regularity over more powerful instructions.


From what I gather _machine code_ has become more low level, while _assemblers_ have adopted features from "high level" languages.

Not that I would know, the last time I wrote assembler it was for the Z80, and the last time I wrote machine code it was for the 6502.


The assembly language used by Sharc DSPs has an algebraic syntax rather than using all mnemonics, and loops are pretty easy to create as well (at least compared to x86 or, as my current project uses, PIC).


Can you give an example of a loop in an algebraic syntax please?

(All I can think of is CFG style, like "A -> (aA)?" oh, of course, there's also "a*" - still, it would be interesting to see an eg.)


I don't recall. Someone pointed out an example in a thread a while back.


I'd say using null-terminated strings rather than Pascal-style length-prefixed strings is C's biggest mistake. It is responsible for so many inefficiencies (strlen is O(n) instead of O(1) as it should be) and, worse yet, so many incredibly serious security vulnerabilities.

All to avoid having to incur a 1-3 byte per string overhead or figuring out how to efficiently work around a 255 character limit.


Other than strlen, string operations can be faster with null-terminated strings. Plus, there are other benefits:

If you want to turn some arbitrary data into a string, just put a 0 byte where you want it to end.

Instead of having to increment and compare both a counter and a pointer (or store a final pointer for comparison) in string-manipulation operations, you just increment the pointer. It makes for very concise loops in strchr, etc.

Tokenizing a string is just a matter of throwing down 0 bytes where the tokens are (as strtok does).

Passing a trailing subset of a string to a function is as easy as adding an integer to the string pointer, rather than requiring a memcpy and a length calculation.
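
A throwaway sketch of those last two points, just to make them concrete:

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      char path[] = "/usr/local/bin";
      /* Trailing substring: pointer arithmetic, no copy, no length bookkeeping. */
      printf("tail: %s\n", path + 5);        /* prints "local/bin" */
      /* Tokenizing: strtok just drops 0 bytes where the separators were. */
      for (char *tok = strtok(path, "/"); tok; tok = strtok(NULL, "/"))
          printf("token: %s\n", tok);
      return 0;
  }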


Null-terminating arbitrary data to turn it into a string is great--if your arbitrary data doesn't have any 0 bytes in it naturally.

Strtok is just as easily done by treating strings as a 2-item struct: a size_t for the length and a pointer (separating the size data from the character array). In fact, that would kick ass: you could non-destructively tokenize, pass around substring arguments, etc.
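
Something like this (purely a sketch; the names are made up):

  #include <stddef.h>
  #include <stdio.h>

  /* A length+pointer string: the size lives next to the pointer,
     not inside the character data. */
  struct str {
      size_t      len;
      const char *ptr;
  };

  /* A "substring" is just another struct pointing into the same buffer:
     no copying, and the original stays untouched. */
  static struct str substr(struct str s, size_t start, size_t len) {
      struct str out = { len, s.ptr + start };
      return out;
  }

  int main(void) {
      struct str s = { 11, "hello world" };
      struct str w = substr(s, 6, 5);            /* "world" */
      printf("%.*s\n", (int)w.len, w.ptr);
      return 0;
  }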


Storing the size separately from the character data, rather than as a prefix to the character data, would indeed alleviate some of the potential problems of a size+data string. I could see myself using such an implementation in situations where the minimal overhead of allocating a separate area on the stack or heap for the size+pointer struct is negligible. Expanding strings would certainly be a lot easier, and it would open the possibility (as you mentioned) for complex data sharing among strings, similarly to Qt4.


On modern processors simple arithmetic like incrementing is effectively free. Control dependence is expensive. Data dependence is somewhere between. Fat pointers do not add any dependence. Storing the end has little cost given shadow registers.

If this had been adopted 20 years ago, the likely outcome would be that CPUs might have fractionally more integer resources. It wouldn't affect the wall-clock speed of any code.

All of the operations you describe are done just as easily with fat pointers, with the additional advantage of being applicable to truly arbitrary data, not just null-clean strings.


It is really an instance of the same problem. Walter Bright's "fat pointer" proposal would give you length-embedded strings for free.


Length-embedded strings are little better than the 0-terminated ones. You cannot take a substring without copying.

The "fat pointer" approach has been used in D for nearly 10 years now, and has proven itself to be very effective.


> You cannot take a substring without copying.

That's not a bug, it's a feature. Indeed, immutable strings should almost always be the default choice due to security concerns alone. C is optimized for the rare case; modern languages with immutable strings (like C#) are optimized for the common case, which is how it should be.


I think you missed the point. Even with immutable strings, some representations will allow you to take a substring without copying data and some will not. Length-prefixed strings do not.


People are permanently trying to "fix" C, but C has nothing to fix.

It is a limited language, both by the constraints at the time of its creation, but also by the problem space where it has been used over the years. And that's how it should be.

C is part of an ecosystem of languages; it doesn't have to be changed to accommodate the latest fads or to fix problems that nevertheless never stopped it from being widely used for decades.

If C doesn't fit a purpose, don't use it. You don't even have to stray too far, since there are a few languages that basically are just C with extras.


For years we have been awash in buffer overflows and other, similar errors (printf format strings come to mind) that are simply impossible in a safer language. SQL injection can happen in a safer language, but you can't take over the web server by doing it. There is nothing fundamental about systems languages that requires unsafe array operations. This is a flaw, and it is a flaw of C specifically and a flaw inherited by many C-descended languages. This is not some ivory tower thing that was discovered after C was designed; it was apparent even at the time (although Pascal's fix was pretty bad; variable-length arrays fix it neatly). There are compiler articles from the late 70s and early 80s pointing out how even a naïve compiler could easily optimize out bounds checking in most operations!


If you design the software correctly then array bounds checking is often a waste of resources. For a stupid example, let's assume you have 3 arrays of the same size and you are doing this:

  for (i = 1; i < 10000; i++)
  {
    a[i] = i * i;
    b[i] = a[i] * i;
    c[i] = b[i] * i; 
  }
Now that's not a lot of code, but with array bounds checking you add roughly 50,000 bounds checks that do nothing useful if the arrays are of the correct size. Clearly there are uses where those bounds checks are useful, but when you care about speed they can become fairly costly.

You might even want to rewrite it like this, because it really is faster:

  for (i = 1; i < 10000; i++)
  {
    c[i] = i * (b[i] = i * (a[i] = i)); 
  }
PS: Ugly C code often has a very good reason for looking the way it does.


Those are precisely the examples where even the most naive late-70s compiler can optimize out the bounds checking, as is well documented in the literature (presuming the sizes are >= 10000, of course). All it takes is a validity range, noting that i goes from 1 to 9999 and that at each access i is within the range of the array. To defeat the compiler's optimizations you need to at least start doing nontrivial mathematics on the array indices, and those are also the cases that are the least obvious and thus most in need of bounds checks.
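
Roughly, and only as a hand-written illustration of what that range analysis buys (assuming the arrays' common length is available as n), a single up-front check makes every per-access check inside the loop redundant:

  #include <stddef.h>

  /* One guard stands in for the ~50,000 per-access checks in the loop. */
  void fill(int *a, int *b, int *c, size_t n) {
      if (n < 10000)
          return;                    /* below this point, i stays in [1, 9999] */
      for (int i = 1; i < 10000; i++) {
          a[i] = i * i;
          b[i] = a[i] * i;
          c[i] = b[i] * i;
      }
  }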


I would be impressed if that worked correctly on multithreaded code or if it could survive some of the more esoteric pointer manipulations you can do in C. It's mathematically impossible to make the perfect language for all problems, so while many languages have been built that are a "safe" version of C they lose something for everything they gain.

PS: Even compiler bugs can be useful under the correct circumstances.


People are trying to create a language that fits the niche that C is supposed to fill in today's world but just barely misses the mark (fast, low-level, but still sane and more or less modern). To a lot of people it looks like the easiest way to create a language that fills that niche is to fix the bugs in C rather than create something new from scratch.


When I was reading this, I thought "Nooo! Don't make the size of an array part of its type!" That has been rightfully shown to be a very bad idea by Brian Kernighan; see http://www.lysator.liu.se/c/bwk-on-pascal.html Luckily the proposal is about passing a 'fat pointer', really a pointer and a length. I did that often in my C programs too: int process(char *buf, int buflen);

Maybe this fix to C's Biggest Mistake, a.k.a. the 'fat pointer', is just syntactic sugar.


I don't think Walter suggested having the array size part of the type (static array types).

He suggested using "fat pointers" -- pointers along with their extent. This is similar to how many Pascal compilers treat the type "String".

Kernighan mentions Pascal strings in his article but claims the solution does not scale to other types. Walter's solution does work for all array types (but admittedly has other problems).


Making the size part of the type is a useful feature in some cases. The compiler can benefit from this information to optimize the data structure and its manipulation.

The exponential growth of CPU processing power has led us to discard such optimizations in favor of code simplicity. But handheld devices with limited energy, computing, and storage capacity may put them back into perspective.


I feel the lack of a module system is the biggest mistake in C. It is tiresome to prefix every single public function: list_append, list_delete, hashmap_insert etc.


I think the preprocessor is the biggest mistake. It was introduced to address the issues of separate compilation in a simple way, but it has generated more trouble than benefit.


I understand the issues with the pre-processor, but I still think C is better off with it than without it. I know it can be abused horrifically, but it can also be abused in really nice and convenient ways. If you construct your macros right they can be useful or even wonderful. In that sense the pre-processor is very C-like.

The things that make macros actually nice are some of the gcc extensions, like the ({ }) block-expression syntax and typeof().
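
The classic example (GCC extensions, not standard C): typeof plus a statement expression lets a macro evaluate each argument exactly once:

  /* Unlike the naive #define MAX(a, b) ((a) > (b) ? (a) : (b)),
     this never evaluates a or b twice. */
  #define MAX(a, b) ({            \
      typeof(a) _a = (a);         \
      typeof(b) _b = (b);         \
      _a > _b ? _a : _b;          \
  })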


Macros can easily be abused, but the point of the preprocessor today is platform-dependent code generation. In higher-level languages you can select with an if which branch gets executed, but in the end the platform-dependent code of that language's runtime is written in C with ugly ifdef's.
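
A trimmed-down example of the usual shape of it (the helper name is made up):

  /* The preprocessor picks the platform branch before the compiler
     ever sees the other one. */
  #ifdef _WIN32
  #  include <windows.h>
     static void sleep_ms(unsigned ms) { Sleep(ms); }
  #else
  #  include <unistd.h>
     static void sleep_ms(unsigned ms) { usleep(ms * 1000u); }
  #endif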


Which issues does the preprocessor have?


I disagree. I've only recently started programming in C properly, having previously programmed only in various dynamic scripting languages, Haskell, and Java. Macros are by far my favourite feature of C.


I don't think arrays are "converted" to pointers. Arrays are simply a cleaner way of doing pointer arithmetic and allocating large(r) blocks of the stack. Nothing is lost in this "conversion." The array never knows its own dimensions beyond the time you declare it. It's up to the developer to keep track of that.


Here is an example that caused me a few bugs.

  // define a new type called md5_t
  typedef char md5_t[33];
  md5_t g_md5;
  // here sizeof(g_md5) == sizeof(md5_t)
  
  void f(md5_t md5)
  {
    // here sizeof(md5) == sizeof(char*)
  }
The type information is definitely lost inside functions.


Simple rule: Don't use sizeof on pointers. And don't typedef arrays to make them look like simple types.


A better simple rule: always use sizeof on the type name. That always works, therefore you can use typedef'd arrays too.


Totally agree. If you start relying on the subtle behaviour of sizeof, it's just asking for a world of trouble. It's so easy to get mixed up over whether it's going to return the length of the array, the size of the elements in the array, the total number of bytes the array is using, or the size of the pointer. Come to think of it, I think that's completely scope dependent and depends on whether the array has lost its array-ness. I have no idea. :)


This works, sure, but if you do:

        char test[1024][32];
        printf("%i\n", sizeof(test));
You'll get 1024 * 32 (32,768). It didn't know that it's a 2D array. When you pass a stack-allocated array, it passes by pointer. It works this way because C is portable assembler.


There is no logical connection between the use of the `sizeof` operator and differentiating between 1D and 2D arrays, so "it didn't know that it's a 2D array" makes no sense.

Secondly, it's not really meaningful to talk about "pass by pointer". Parameters are generally either passed by value, by reference, or by name. C only does pass by value; to emulate pass by reference, you need to pass the address by value, as a pointer - or as an "array". But the only array type you can declare is missing the crucial dimension length aspect.

Finally, saying that C works the way it does because it's portable assembler does not have much explanatory power. Arrays are not usually machine-level concepts; pointers with appropriate declaration syntaxes for static and automatic block allocations could take their place in C. Since C does have a higher-level construct called an array, it's not a big leap to imagine it being better designed.


sizeof() is used to determine the actual in-memory size of something for use by memcpy/malloc/etc., not the number of elements. There are commonly-used array size macros that can help with this, such as

  #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
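
For example, on a real array (not a pointer the array has decayed into) it gives the element count:

  int primes[] = {2, 3, 5, 7, 11, 13};
  size_t n = ARRAY_SIZE(primes);   /* 6, whereas sizeof(primes) is 6 * sizeof(int) bytes */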


Arrays are converted to pointers on the first use. You can't ever use an array in C.


C's biggest mistake would have to be C++.



