Well, this is fun, folks, but frankly, you all have proven just how conservative programmers are. My chapter advocating something as "revolutionary" as including the lengths of strings when you process them has received more hate mail than anything I've written. And I'm the guy who gets death threats because I don't like Ruby.
If you are reading this and saying, "This guy's wrong about including the lengths of strings!" then I ask: why are you using any modern language? Nearly every language in use today includes the lengths of strings and backing buffers in its string implementation, and you all use it and appreciate it, but when I advocate it, suddenly it's heresy and deserving of vitriol.
I personally don't care what you all think, which I know insults your pride at being the smartest people in the room, but until you are willing to admit that your hatred of that chapter is based entirely on nostalgia and not on the merits of my fairly simple claim that K&R (or any) C code is error-prone, you are not going to advance the state of the art any further than the 1970s.
This is what makes me sad about computing today. You are all desperately clinging to the notion that you are radical free thinking futurists while you desperately cling to reverence for the past and are incredibly resistant to any change. If something as simple as a thought experiment to look objectively at how things are done and try something better makes you angry, then you are not free thinking futurists.
Anyway, enjoy your day folks. I'm going to go do some painting.
That'd be up to my publisher (A/W) but I do believe they roll out French versions of my book depending on market demand. How that demand happens I have no idea.
`safercopy` also does not always terminate, say if one of the passed pointers points to invalid memory, which then segfaults when accessed. Similarly, you can have problems if the destination buffer overlaps the saved return address on another thread's stack.
C's type system is unsound, and even if it were sound, it is too weak to express interesting invariants; certainly it is far too weak to express the invariants required for termination. In fact, no popular language can express the invariant "this doubly-linked list doesn't contain a cycle", which is essentially required for termination when working with these.
Even ignoring termination, in essentially all languages, including C, there are properties like "this object isn't going to be accessed by another thread" which can't be expressed in the type system, or even checked at runtime, but are required for any interesting kind of correctness.
I don't think you know what it means when someone asks whether a loop will terminate. It's an analysis of the logic of the loop to determine if it will exit. This kind of analysis has been the actual foundation of computer science since there was computer science. I believe this guy Turing was doing some stuff with it.
However, you have a logical flaw in your statement. You say the function will not terminate if it is passed an invalid pointer, but processes terminate when they access invalid pointers. If that's true, then it will terminate. However, if processes do not terminate when passed invalid pointers, then this function will still terminate because it will hit the end of its length variable and terminate.
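For concreteness, the loop being argued about has roughly this shape (a sketch of the idea only; the book's actual safercopy() differs in its details and exact signature):

    void safercopy(int from_length, char *from, int to_length, char *to)
    {
        int i = 0;
        /* Copy at most the smaller of the two claimed lengths; the loop is
         * bounded by the length arguments, not by scanning for '\0'. */
        int max = from_length < to_length ? from_length : to_length;

        for (i = 0; i < max; i++) {
            to[i] = from[i];
        }
    }

The argument here is that i only ever counts up toward max, so the loop can't run forever as long as i and max stay what the code says they are.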
I take it that you're completely unfamiliar with the concept of "undefined behavior" in the context of C. You're not qualified to discuss the language at this level until you understand that. It's possible to invoke undefined behavior in your function. Undefined behavior can mean something never terminates. Thus, it's possible for your function to never terminate.
If you don't believe me, I pose the following challenge: give me an implementation of this function. I will then provide a chunk of code that calls it and causes it to enter an infinite loop.
1. You said you can write code that calls it and makes that for-loop run forever, entering an infinite loop.
2. To prove that, you can only alter the main function, then hand me back the code and I'll compile it and run it on my machines.
3. It has to run without stopping for 24 hours. If you can do that then I'll consider that an "infinite loop".
4. You can't call any more functions than what's in there already. So no fancy hacks to keep the OS from allowing segfaults by putting in signal handlers, linking against other libraries, or anything.
Very curious how you do this. This is fun!
Edit: And, I may not be checking comments so email me to gloat if you figure it out. help@learncodethehardway.org any time. You can also post it here. Just link me the reply so I can go look.
This is running on a Mac with 10.10.1 and Xcode 6.1.1, compiled without optimizations. The offset value may need to be different on other architectures. With optimizations on, the approach may need to change. Don't give me any guff about the conditions needed, since that's the whole point of undefined behavior: it depends on context that should be irrelevant.
There's no need to run it for 24 hours. Just run it, then pause in the debugger and step through a few loops. It'll be evident that nothing changes.
If you need help getting it to work properly on your own setup, let me know.
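The exact code isn't reproduced in this excerpt, but the general shape of the trick is something like the sketch below: compute a pointer far outside the caller's buffer so that, on a particular stack layout, the copy writes over the callee's loop counter. The safercopy() here and the offset are both illustrative, and this may or may not reproduce the infinite loop on your machine; the offset is the part that has to be tuned.

    #include <stdio.h>

    /* Illustrative stand-in for the length-checked copy under discussion;
     * the real code differs. */
    void safercopy(int from_length, char *from, int to_length, char *to)
    {
        int i = 0;
        int max = from_length < to_length ? from_length : to_length;

        for (i = 0; i < max; i++) {
            to[i] = from[i];
        }
    }

    int main(void)
    {
        char buf[] = "0123456789";

        /* Undefined behavior: 'output' is computed well outside 'buf'.
         * On a typical downward-growing stack with no optimization, an
         * offset like this can land inside safercopy's own frame, so each
         * to[i] = from[i] store may clobber the loop counter i. The value
         * 256 is purely illustrative and machine-specific. */
        char *output = buf - 256;

        safercopy(10, buf, 10, output);

        printf("done\n");
        return 0;
    }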
I officially concede that, because you can work two pointers on a computer to overwrite another location of memory and alter a for-loop (incidentally, there's no UB listed in ANSI for 'alter the variable of a for-loop'), everyone should go back to writing their C code just as K&R intended.
Please, you all should rely only on the '\0' terminator of all strings, don't do any bounds checking, don't check the return codes of functions, and you will be totally safe.
Your insistence that undefined behavior is not at play here is bizarre. It is not possible to construct a pointer into another stack frame like my code does without invoking undefined behavior. To wit:
"When an expression that has integer type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression. In other words, if the expression P points to the i-th element of an array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N (where N has the value n) point to, respectively, the i+n-th and i−n-th elements of the array object, provided they exist. Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object. If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined."
This is a long-winded standards-language way of saying that if you compute x+y, where x is an array and y is not a valid index in the array or one past the end of the array, the behavior is undefined. The moment I computed 'output', I hit UB. Everything that happens after that is up to the whims of the platform.
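A minimal illustration of that clause (not code from the thread):

    void example(void)
    {
        int a[10];
        int *p = a + 10;  /* fine: points one past the last element */
        int *q = a + 11;  /* undefined behavior: merely computing this
                           * pointer is UB, even if never dereferenced */
        (void)p;
        (void)q;
    }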
C itself is also scary. Most other languages provide at least run-time safety; some provide great compile-time safety. C providing neither and being the most popular language for system software is what is really scary.
I guess part of what is scary about C is that it gives you the illusion of a high-level language, but unless you know all UB by heart, you might accidentally start working in assembly.
Isn't there at least a flag that activates warnings for stuff like this? I tried -Wall in both clang and gcc and they didn't say jack shit.
What do modern C developers do these days? Arm themselves to the teeth with expensive, advanced static analysis tools?
I agree, C itself is kind of scary. And what's worse is that you've understated it a bit. You don't "start working in assembly," because some of the scariness is that your operations don't map nicely to assembly, as the compiler does its thing. For example, taking a pointer to something on the stack and adding an offset to get a pointer to something else on the stack would be reasonable to do in assembly, and fine if done correctly, but if you try to do the same thing in C it's a crapshoot as to whether the compiler will do what you expect, or whether it will decide to eliminate the whole chunk of your code because it can't possibly run, or something else.
Static analysis helps a lot, as does being careful about what you write. Most constructs aren't dangerous, so you can mostly avoid the scary ones, and take extra care when you need to use them. Not that this saves you all the time, but it helps.
With clang 3.5 on x86-64 Linux I also get an infinite loop.
Compiled with GCC, it doesn't go into an infinite loop. Zed seems to think this makes your example a "fail". This is curious since the whole quibble is about functions that work in some context but fail when brought into a different context.
Edit: FWIW, safercopy also has undefined behavior when from_length and to_length are larger than INT_MAX.
It's only curious if you think Zed is honestly trying to figure out the truth, rather than "prove" he's right. That assumption is clearly shown to be wrong by this comment thread.
Fail. I've run it repeatedly and it doesn't run forever. It segfaults or exits. But... you did cover a corner case I hadn't considered. Thanks!
My point is that the offset value needed to produce the described behavior depends on various implementation-specific things, so that constant may need to be altered when trying the code on other compilers, OSes, or CPU architectures.
Ahhh the "undefined behavior" trope, whereby a C "expert" who's memorized a standard trots out the abstract machine to justify his point. An abstract machine that doesn't actually exist and that no computer actually functions as.
I'll take you up on that challenge. I'm curious to see how you'd write a program to make a for-loop never exit. Give me a few minutes...
'there are properties like "this object isn't going to be accessed by another thread" which can't be expressed in the type system,'
In a recent project, I effectively enforced exactly that - in C. It was in a limited context but it saved my ass many times and eased refactoring substantially.
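One way such a discipline can be approximated in C (a sketch only; not necessarily what was done in that project) is to record the owning thread in the object and assert on every access. It's a runtime convention rather than a type-system guarantee, but it catches violations early:

    #include <assert.h>
    #include <pthread.h>

    struct owned {
        pthread_t owner;   /* the only thread allowed to touch 'value' */
        int value;
    };

    void owned_init(struct owned *o, int value)
    {
        o->owner = pthread_self();
        o->value = value;
    }

    int owned_get(const struct owned *o)
    {
        /* Abort (in debug builds) if any other thread reads the object. */
        assert(pthread_equal(o->owner, pthread_self()));
        return o->value;
    }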
Functional languages, along with Rust's linear type-system, can ensure this for single-ownership structures (trees, singly-linked lists), but not for structures with more complex ownership, like doubly-linked lists. The unpopular languages I was talking about were Agda and friends.
To be clear, though, Rust's type system can ensure thread safety of doubly-linked lists if you use the doubly-linked list structure in the standard library (or write one yourself using RC/etc.)
I hadn't seen the previous discussions, but the last one was two years ago, so maybe it's worth having this here so people can add any new thoughts or comments.
> Many people have looked at this copy function and thought that it is not defective. They claim that, as long as it's used correctly, it is correct. One person even went so far as to say, "It's not defective, it's just unsafe." Odd, since I'm sure this person wouldn't get into a car if the manufacturer said, "Our car is not defective, it's just unsafe."
Terrible analogy.
A rapidly spinning piece of steel is unsafe, flammable liquids are unsafe, sparking hot metal is unsafe; cars contain all of them.
A copy function is obviously a component, and not a whole car.
For anyone who has picked up a copy of K&R wanting to learn C and hit a wall I highly recommend C Programming: A Modern Approach, 2nd Edition by K N King.
Having read many books on C over the years I think King's book is without doubt the best intro book to C. K&R is great but it is a reference more than a tutorial in my opinion and is better to read once you have the basics of C (and programming) under your belt.
I think everyone is failing to see the wood for the trees. The great thing about "The C Programming Language" is that it is short and to the point. It's a great way to learn about many aspects of programming. I don't think anyone recommending it would ever expect it to be perfect; I recommend it to people in the hope that they will think about the business of coding, and I suspect that Kernighan and Ritchie hoped the same. I agree that it is a shame that the code examples have errors; perhaps those who have the time and expertise should contact the publishers to point them out so that the 43rd (or whatever it is now) printing stands a chance of being correct.
It's a great pity that this discussion has descended into an exchange of insults.
Nit: I would suggest not using assert() for checking error return values. Compiling the program with -DNDEBUG disables the check. assert() is for finding bugs in your program. A malloc() call returning NULL is not a bug.
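For example (a sketch; names and sizes are illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char *buf = malloc(100);

        /* assert(buf != NULL) would vanish under -DNDEBUG and the NULL
         * pointer would get used anyway. An explicit check always runs. */
        if (buf == NULL) {
            fprintf(stderr, "out of memory\n");
            return 1;
        }

        snprintf(buf, 100, "hello");
        puts(buf);
        free(buf);
        return 0;
    }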
But, when talking about this I didn't want to muddy the waters with my own assert alternatives. Programmers have a hard enough time focusing on the issue of a for-loop vs. a while loop.
Edit: malloc returning NULL means you're out of heap. That's usually catastrophic in almost all cases, but the important part is that people don't detect that, then use the NULL pointer. That's the bug.
In a chapter about deconstructing someone else's book, your own book does things that are dangerous by your own standards:
> The problem is, as with every book with code ever in the universe, beginners will copy that code out and use it somewhere else and then the function is wrong.
Yet, if a beginner copied some of the code in this chapter they'd have the exact bug you are talking about here (using the NULL pointer returned from malloc()).
I think you should probably expand that into the safer checks (don't forget to free(line) if longest is NULL too!).
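Something along these lines, i.e. the pattern rather than the chapter's actual code (names and sizes here are made up for illustration):

    #include <stdio.h>
    #include <stdlib.h>

    #define MAXLINE 1000   /* illustrative */

    int main(void)
    {
        char *line = malloc(MAXLINE);
        if (line == NULL) {
            fprintf(stderr, "out of memory\n");
            return 1;
        }

        char *longest = malloc(MAXLINE);
        if (longest == NULL) {
            /* The second allocation failed: release the first buffer
             * instead of leaking it on the error path. */
            free(line);
            fprintf(stderr, "out of memory\n");
            return 1;
        }

        /* ... read lines, track the longest one ... */

        free(longest);
        free(line);
        return 0;
    }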
Are we looking at the same code? I use an assert to quickly check for NULL, and this is a simple example of how to work with the function. Other parts of the book use more extensive error checking and I use my debug.h macros quite a lot, but if you find bugs feel free to email me them. I'd really appreciate it.
I'm referring to this code on the originally linked page [1] (you'll have to scroll back a bit because your header blocks the content).
In the context of this thread, brghts states that this is dangerous because if you compile with -DNDEBUG the assert is optimized away.
So if I copy that code with the assert statement, it will be optimized away and your code no longer performs the NULL check. This is bad.
As you mention, beginners tend to copy code off the Internet and cause bugs. If you recognize this and claim to be teaching people you should not use bad practices in your example code. Period.
If you don't want to muddy the waters with your custom debug macros, then you should still play it safe when checking return values that a beginner may simply copy and think is correct.
> That means the for-loop will never loop forever, and as long as it handles all the possible differing lengths of A and B, never overflow either side. The only way to break safercopy() is to lie about the lengths of the strings, but even then it will still always terminate.
So what? The function is still crazy dangerous. Buffer overflows are still possible, and therefore buffer overflow attacks are still possible. It's still possible (and very easy) to trigger undefined behavior, and it's meaningless to make claims about whether a function terminates if it triggers UB, and there are as a result still security concerns.
This isn't a slam on Zed Shaw's book, as much as it is a slam on C. You can adopt whatever heuristics you like, but at the end of the day you're always working in C, where it is possible for your program to self-destruct (or worse!) unless your function inputs are exactly right.
> The worst possible scenario for the safercopy() function is that you are given an erroneous length for one of the strings and that string does not have a '\0' properly, so the function buffer overflows.
Your argument is that the safercopy() function is "safer" in that it is guaranteed to terminate regardless of whether the underlying buffer has a NUL byte in it. While that's true, it's sort of missing the point a bit, I think. The unsafe copy() function wasn't primarily unsafe because there was no guarantee it would terminate -- it was unsafe because it corrupts a bunch of memory, exposing you to a wide range of insecurities (the least of which is that your program might crash). safercopy() is still prone to that behavior if the lengths you pass in aren't accurate. While it is guaranteed to only corrupt n bytes of memory rather than an arbitrarily large number of bytes, the damage might as well already be done by the time you corrupt those n bytes. So to answer your question:
> How do you think an alternative copy function that uses lengths would have buffer overflows?
It's unsafe in exactly the same way that the unsafe copy() function is: if the arguments you pass into the function are incorrect/don't point to the data you think they point to, you'll corrupt memory. Now, you could make the argument that it's much easier to just remember the lengths of all the buffers you allocate than it is to remember to NUL-terminate all your C strings -- I would agree -- but I don't know if the article does a great job of explaining that.
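A hedged sketch of that "remember the lengths" style (illustrative only; it's not the article's code, and it still trusts the caller's bookkeeping to be right):

    #include <string.h>

    /* A string that carries its sizes with it, so copy routines are
     * bounded by stored lengths instead of a scan for '\0'. */
    struct lstring {
        size_t capacity;  /* bytes available in 'data' */
        size_t length;    /* bytes currently in use */
        char *data;
    };

    void lstring_copy(struct lstring *dst, const struct lstring *src)
    {
        /* Copy no more than the destination can hold. */
        size_t n = src->length < dst->capacity ? src->length : dst->capacity;
        memcpy(dst->data, src->data, n);
        dst->length = n;
    }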
> It's unsafe in exactly the same way that the unsafe copy() function is
No, that's the logic error every programmer makes. The copy() function is always wrong, because it can't confirm that the string has the right length without scanning the string, which is exactly what causes the error.
With my function I can go to as great a length as I want to confirm that the string is actually as long as I say it is. I can't mitigate every possible error of misuse, but the errors safercopy() can have are much smaller than copy().
Your argument is effectively stating that because you can exploit one with a general "UB" error, that it's the same size and classification of errors as with the other. That's invalid, and proven in my writing.
If the lengths are wrong. You said that the function shouldn't assume that strings are null-terminated correctly -- why should it assume that the lengths are correct?
BNF can specify the syntax, but not the semantics, i.e. the meaning. There are several methods of specifying the semantics[1]; it is notoriously difficult to completely specify the semantics of a language, and some languages are more suited to a formal definition than others. I believe Haskell is the most commonly used language today with the best formal definition.
The code in K&R looks correct to me as written. The copy function is only called on char arrays obtained from getline which properly null terminates the array. As pointed out, C style "strings" do not work if the null terminator is not present. If underlying assumptions are not valid in any context, nothing can be assumed to work. The text about malloc vs stack is both wrong and not relevant.
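For reference, the copy() in question looks roughly like this (quoted from memory, so check the book for the exact text):

    /* copy: copy 'from' into 'to'; assume 'to' is big enough */
    void copy(char to[], char from[])
    {
        int i;

        i = 0;
        while ((to[i] = from[i]) != '\0')
            i++;
    }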
Thank you for proving my point. It is right as long as the tower of babel it's used in never collapses. The problem is, as with every book with code ever in the universe, beginners will copy that code out and use it somewhere else and then the function is wrong.
And, you literally just repeated every single argument I demolish in the chapter, which means you didn't read it.
His "Stylistic Issues" section also leaves a lot to be desired.
The second if statement is clearly not part of the loop structure. It might be possible to throw somebody off (if they don't know how this stuff works) if you were to indent the second if statement, but we don't, and your linter (you're using one, right?) will also complain.
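The fragment in question is roughly this (from memory; the point is that the trailing if sits outside the while, and the indentation says so):

    /* fragment, from memory; see K&R 1.9 for the exact code */
    while ((len = getline(line, MAXLINE)) > 0)
        if (len > max) {
            max = len;
            copy(longest, line);
        }
    if (max > 0)   /* there was a line */
        printf("%s", longest);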
What a terrible article. I feel dumber having read it. It's one of those articles that sounds good and takes more effort to correct than it took to write, and thus causes a net loss.
Just one example: "That gives every path to this function a 50% to 75% chance it will fail with just the inputs above." I mean, what kind of nonsense is this? This is pretty much saying that the odds of winning the lottery are 50%, because either you win or you don't.
Most of the article is spent on discussing the perils of a strcpy reimplementation. Yet there's no discussion of a fix. The author simply assumes the existence of a magical "safercopy" function that behaves correctly on all inputs. Of course the author doesn't provide the implementation of this function, because if he tried to write it, he'd discover that it's impossible.
You actually were an asshole first my friend. You came in with a comment that was rude, not insightful, and seriously uninformed and then announce how you're going to flag it as awful. Everything about what you wrote makes you an asshole, and you started it. There were about a trillion different ways you could have written your comment in a polite yet still direct way to further the discourse along but you chose to be obnoxious.
Don't be butthurt that I'm just better at being an asshole than you are.
People are downvoting you, but as the author of this I can say you're right. Not necessarily lisp, but using any language that doesn't have the fatal flaw of totally broken strings is better.
C is still a large historical influence on many languages, and knowing it makes people more capable as programmers. That's why I teach it, and I use it because I'm just old like that. Once someone knows C they can fix all kinds of problems, and then can learn a huge number of languages fairly easily.
But these days, without a solid reason for using it, I'd avoid C and use a better alternative.