Writing correct code at scale is essentially impossible. We have multiple operating systems, runtimes, libraries, and hardware platforms to worry about without even considering things like random bit flips in RAM or our block devices lying to us with unknown probability.
The best we can do is write simple, understandable code with as few indirections and as little undocumented magic as possible.
I highly disagree that it's impossible to write high-quality C code for a big project (I don't like his statement: "The first rule of C is don't write C if you can avoid it").
I was mostly a C developer (80% C, 20% C++) for 4 years on a big project (I was a backend developer of ICQ Instant Messenger). It has more than 2M lines of C code. Almost all libraries were written from scratch in C (since ICQ is a very old project, much of the essential stuff was written within AOL).
I felt comfortable writing high-level code in C.
When I read the Nginx source code (it's written in C), I see better code quality than in 90% of the proprietary commercial software written in nice and comfortable high-level languages.
A good developer is able to write high-quality C code with minimal bugs in a large-scale project.
No matter how nice, high-level, and convenient the language, a bad developer will write c....y code.
You can also check my personal project which is written entirely in C.
(I haven't contributed to my GitHub for 1.5 years since I changed countries, jobs, etc.
I hope I will return to it when my life becomes stable again.)
P.S. To be clear, I think that C++11 is a great language. But unlike many C++ developers, I love pure C.
This is why we can't have these discussions...everything gets interpreted literally, and it becomes some sort of ego contest. I'll take your claim at face value, that your team wrote 2M lines of C for a commercial project, with no memory corruptions, crashes, integer overflows, etc. That's a shockingly impressive achievement, one that I think has probably never been done before.
But how is that a useful or replicable result, that will work in the real world for other projects? You're basically saying that everyone just needs to be a top 0.001% C programmer. This is the "abstinence-only sex education" of programming advice. It's super simple to not get STDs or have out-of-wedlock pregnancies - just don't have sex with anyone but your spouse, and ensure they do the same. It's 100% guaranteed to work, and if you don't follow that advice you're stupid and deserve whatever bad things happen to you.
> You're basically saying that everyone just needs to be a top 0.001% C programmer.
A more constructive/charitable phrasing of the parent's point might be: you should only be allowed to code in C if you're already a top 0.001% programmer. Then, obviously, all C code will be good—not that there'll be much of it.
That's actually a more interesting point than it seems. Programmers generally tend to reject the idea of guilds and "professional" licensing—mostly because, AFAIK, most programs aren't critical and most crashes don't matter. But what if, instead of licensing all programming, we just licensed only low-level programming? What if, as a rule, everyone who was going to be hired to write code in a language with pointers was union-mandated to be certified as knowing how to safely fiddle with pointers? That could probably split on interesting dimensions, like a bunch of regular coders working in e.g. a Rust codebase, with a certified Low Level Programmer needing to "sign off on" any commit that contained unsafe{} code, where the actual failure of the unsafe{} code would get the Low Level Programmer disbarred.
Not all C developers need to be top 0.001% C developers to write high-performance, rock-solid code. Why?
I'm a very experienced C programmer and one day my boss came to me and said that the sales guys had already sold a non-existing client-side module to a household-name appliance manufacturer. The deal was inked and it had to be ready in only 3 months. Even worse, it had to run in the Unix kernel of the appliance and therefore be rock solid so as to not take the whole appliance down. It also had to be ultra high performance because the appliance was ultra high performance and very expensive. Now the really bad news: I had a team made up of 3 experienced C developers (including myself) and 3 very inexperienced C developers. We also estimated that in order to code all the functionality it would take at least 4 months. So we added another 4 less experienced C developers (the office didn't have a lot of C developers). The project was completed on time, was a success, and almost no bugs were found, and yet many developers without much C experience worked on the project. How?
(a) No dynamic memory allocation was used at run-time and therefore we never had to worry about memory leaks.
(b) Very, very few pointers were used. Instead, mainly arrays. And not just C developers understand array syntax, e.g. myarray[i].member = 1 :-) Therefore we never had to worry about invalid pointers (a small sketch of this style follows after this list).
(c) Source changes could only be committed together with automated tests resulting in 100% code coverage and after peer review. This meant that most bugs were discovered immediately after being created but before being checked in to the source repository. We achieved 100% code coverage with an approx. 1:1 ratio of production C source code to test C source code.
(d) All code was written using pair programming.
(e) Automated performance tests were run on each code commit to immediately spot any new code causing a performance problem.
(f) All code was written from scratch to C89 standards for embedding in the kernel. About a dozen interface functions were identified which allowed us to develop the code in isolation from the appliance, and not have to learn the appliance etc.
(g) There was a debug version of the code littered with assert()s and very verbose logging. Therefore, we never needed to use a traditional debugger. The verbose logging allowed us to debug the multi-core code. Regular developers were not allowed to use mutexes etc. themselves in source code. Instead, generic higher-level constructs were used to achieve multi-core parallelism. My impression is that debugging via sophisticated log files is faster than using a debugger.
(h) We automated the Makefile so that developers could create new source files and/or libraries on the fly without having to understand make voodoo. C header files were also auto-generated to increase developer productivity.
(i) Naming conventions for folders, files, and C source code were enforced programmatically and by reviews. In this way it was easier for developers to name things and comprehend the code of others.
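To make points (a), (b) and (g) concrete, here is a minimal sketch of that style — hypothetical names only, not the actual project code, and it uses a C99 variadic macro for brevity where the original work was strict C89:

    /* Illustrative sketch of the "dumbed down" C described above. */
    #include <assert.h>
    #include <stdio.h>

    #ifdef DEBUG
    #define DEBUG_LOG(...) fprintf(stderr, __VA_ARGS__)   /* (g) verbose logging */
    #else
    #define DEBUG_LOG(...) ((void)0)                      /* compiled out in release */
    #endif

    #define MAX_SESSIONS 1024      /* (a) fixed capacity, no run-time allocation */

    typedef struct {
        int user_id;
        int state;
    } session_t;

    static session_t sessions[MAX_SESSIONS];   /* statically allocated pool */
    static int       session_count = 0;

    /* Add a session; returns its index, or -1 if the pool is full. */
    static int session_add(int user_id)
    {
        if (session_count >= MAX_SESSIONS) {
            DEBUG_LOG("session_add: pool full (user_id=%d)\n", user_id);
            return -1;
        }
        sessions[session_count].user_id = user_id;   /* (b) arrays, not pointers */
        sessions[session_count].state   = 0;
        DEBUG_LOG("session_add: i=%d user_id=%d\n", session_count, user_id);
        return session_count++;
    }

    /* Indices are easy to range-check and easy to read in a log. */
    static void session_set_state(int i, int state)
    {
        assert(i >= 0 && i < session_count);   /* (g) cheap, human-readable check */
        sessions[i].state = state;
        DEBUG_LOG("session_set_state: i=%d state=%d\n", i, state);
    }

    int main(void)
    {
        int i = session_add(42);
        if (i >= 0)
            session_set_state(i, 1);
        printf("sessions in use: %d\n", session_count);
        return 0;
    }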
In essence, we created a kind of 'dumbed down' version of C which was approaching being as easy to code in as a high level scripting language. Developers found themselves empowered to write a lot of code very quickly because they could rely on the automated testing to ensure that they hadn't inadvertently broken something, even other parts of the code that they had little idea about. This only worked well because there was a clear architecture and code skeleton. The rest was like 'painting by numbers' for the majority of developers who had little experience with C.
The same team went on to develop more C software using the same technique and with great success.
I'm pretty sure you would pass whatever sort of bar exam for a "Low Level Programming" certification they might deign to throw at you. You're the 0.001%. (Although 0.001% might be the wrong number—I'd hazard 0.1%, or maybe even as high as 0.5%.)
Which is to say, your story matches my hypothetical pretty well. You effectively created the same structure as Rust has, where there are safe and unsafe sublanguages, and the "master" developers thoroughly audited any code implemented in the unsafe sublanguage.
Which makes my point: there exist a small set of people qualified to write in "full C", and a much-larger set of people who aren't; and the only way—if you're a person who isn't qualified—to write C that doesn't fall down, is with the guidance and leadership of a person who is.
Though I think maybe your thesis statement agrees with that, so maybe I'm not arguing with you. Not all the developers on a given project need to be from the qualified set, no. But at least some of the developers need to be from the qualified set, and the other developers need to consider the qualified ones' guidance—especially on what C features to use or avoid—to be law for the project. As long as they stay within those guidelines, they're really working within a DSL the "master" developers constructed, not in C. And nobody ever said you can't make an idiot-proof DSL on top of C; just that, if you need everything C offers, your only good option is to remove the idiots. :)
We do agree that the experienced developers are necessary alongside the less experienced developers. At least when the project is getting off the ground and the less experienced developers are still learning the ropes.
Thanks! And I have not seen any teams doing this kind of thing either.
Another tidbit: The developer pairs were responsible for writing both the production code and the associated automated tests for the production code. We had no 'QA' / test developers. All code would be reviewed by a third developer prior to check-in. However, at one stage we tried developing all the test code in a high-level scripting language with the idea that it would be faster to write the tests and need fewer lines of test source code. But because we did several projects like this, we noticed that there was no advantage to writing tests in a scripting language. The ratio of production C source lines to test source lines was about the same whether the test source code was written in C or a scripting language. Further, there was an advantage to writing the tests in C because they ran much faster. We had some tens of thousands of tests and all of them could compile and run in under two minutes total, and that includes compiling three versions of the sources and running the tests on each one: production, debug, and code coverage builds. Because the entire test cycle was so fast, developers could do 'merciless refactoring'.
As another software developer working in the "embedded systems" industry, I can tell you that using C like that is not unique. The companies I have worked for and several of my friends have worked for all use C similarly. Sometimes we do use the dangerous features of C. When we do, additional reviews are done to ensure the use is both sufficiently justified and properly done.
I also forgot to mention performance. Whether you code *foo or the easier-to-comprehend and safer(?) foo[i], the compiler still does an awesome job of optimizing. However, it's much easier to assert() that variable i is in range than that *foo is valid. And it's also easier to read variable i in a verbose log (usually a 'human-readable' integer) than a pointer (a 'non-human-readable' long hex number). At the time we wrote a lot of network daemons for the cloud and performance tested against other freely available code to ensure that we weren't just re-inventing the wheel. NGINX seemed to be the next fastest, but our 'dumbed down' version of C ran about twice as fast as NGINX according to various benchmarks at the time. Looking back, I think that's because we performance tested our code from the first lines of production code, so there was no chance for even a bit of puppy fat to creep in. Think: 'Look after the cents and the dollars look after themselves' :-) Plus NGINX also has to handle the generic case, whereas we only needed to handle a specific subset of HTTP / HTTPS.
It was at this point that I realized that the performance of the user-land code in general (including NGINX) could be much better if it wasn't for the legacy network stack in the kernel, which does a bad job at some things such as slow performance accept()ing new sockets, and slower performance because every packet read has to be another system call unnecessarily copying memory; why not use shared memory? That's when I started to get interested in C10M... :-)
Well, there's a trick to writing C++ with no crashes and no memory corruption: never allocate memory using pointers. If you store all your data in stack variables and STL containers, C++ becomes a _much_ more forgiving language (especially if you also avoid templates and operator overloading).
I imagine that this poster and his team took a similar approach with C: avoiding the dangerous features, or just finding ways to factor them out.
Sorry to side-track, but the percentage of people who abstain from sex until marriage or death is far higher than 0.001%, so the comparison isn't really fair.
This sums up internet commenting in 2016. Congratulations on entirely missing the point because you chose to focus on an irrelevant and pointless detail, which I'm sure will spawn a stupid subthread that changes no one's mind.
Way to hijack a useless subthread and do something useful with it. Great meta-point. Honestly. This feature alone makes the comments on Reddit bearable for consumption.
The great thing about threaded textual commenting is that you can have a sub-thread to discuss a side point, without removing the ability to debate the main point in parallel.
The comment struck me as incredibly anachronistic, bordering on offensively so.
The parent commenter is saying that there is a type of person who believes that everyone should just abstain from sex until marriage, and there should be no out-of-wedlock pregnancies. Those people are as ridiculous as someone who believes that everyone should just write perfectly secure, bug-free C code.
For the record, I don't think you're stupid if you have a child accidentally.
However, once you do accidentally have a child, the results are often not pretty, and I will feel bad watching as you deal with the consequences of your poorly thought out decision.
Regarding programming, I can't see the relationship you make at all.
If you write a good program, well good job. If you write a bad program, you are probably going to write a better one next time. (Ever look back at your old code?)
It takes practice to get good, and the more we work at it, the better we'll get. Just keep trying.
I think defen's point was that if a team produces truly high-quality C code, it's similar to a couple that both abstain until marriage. Yes, it's technically possible, but it's rare, and it's not something that should be the expected outcome (that is, most C code will have some problems, and most couples will have at least a little bit of regrettable sexual history before they're married).
It's as if I say: "Step 1 of writing good code: Don't make mistakes. If you make mistakes, you deserve what's coming to you!" Some mistakes are expected, and essentially inevitable; they're the default, not the exception. Blaming the programmer might seem like the right thing to do, but there's more practical benefit to improving the tools and teaching programmers how to handle errors, so that we can decrease the negative impact that bugs will have.
The point of the analogy is that it is very easy to prescribe behavior that is guaranteed to work if followed 100% to the letter, but fails catastrophically if there is even 1 mistake. There are two responses to this - you can either design systems that are easier to follow & fail less catastrophically in the real world (don't use C for new code if you don't need it; teach people about safe sex practices), OR you can double down on the sermonizing about how if people had just been smarter / had more willpower, they wouldn't be in this bad situation.
> However, once you do accidentally have a child, the results are often not pretty... Regarding programming, I can't see the relationship you make at all.
It's either that or the life of the grasshopper ;)
Now that I think of it, there's another parallel with unwanted pregnancies. There are men that stick around, and there are boys who run and let others clean up after their mess.
I remember seeing someone's C++ code at GDC and thinking to myself, "That looks like good Smalltalk!" Everything was intention revealing. Everything was about objects "sending messages" to each other. I've also seen production Smalltalk that might as well have been FORTRAN from the 70's. (Not one instance method or variable in the whole system. Everything was indexed loops.)
The people and the culture/organization of the shop counts for much more than your language. Language counts for something, but it's not where you get the most bang for your buck, and it shouldn't be your first priority!
The only possible ways you can believe you can write issue-free large scale projects in C, which don't suffer from all the usual pitfalls of C, are:
* You've never deployed one.
* You've never had to maintain one.
* You wrote the last line of code and handed it over to the maintenance team, who then never gave you feedback.
* You have no customers.
* You have done a never-before-seen formal analysis of a large scale project down to the individual source line level, and made no mistakes with the formal specification.
The author is correct: use C only if you must. There are still an enormous number of reasons why you must use C, but it should be a constraint imposed on you, and not a language you actively want to start a project in.
Parent seems to believe in nginx. But as a guy who actually tried to do a few of those things with nginx I can confirm that nginx's code base is pretty awful and working around all the usual pitfalls of C in nginx is a huge PITA and enormous amount of work.
> There are still an enormous number of reasons why you must use C, but it should be a constraint imposed on you, and not a language you actively want to start a project in.
It's the same as with, say, goto or #ifdef's - one has to consider the alternatives and make an informed choice. Most of the time, the alternatives are preferable, but sometimes there either aren't any alternatives, or they are even worse.
Which is not to say that C can't be a whole lot of fun. I love it dearly. But I, too, tend to avoid it when I can. It's just too easy to shoot yourself in the foot.
It's what they paid me to do in the 90s. That was a compelling reason to use C.
(of course, one employer appropriately used C as a portable assembler, and respected it as such, for the purposes of developing a cross compiler and tool emulations for an older minicomputer environment; the next was doing (batch) "business applications" and foolishly spent too little on hardware and too much wasted effort on application development)
Pascal (Modula) did much the same, SAFELY, but the money wasn't put into compiler optimization :-(
Interesting that the GNU compiler collection now includes a very efficient Ada compiler.
> Interesting that the GNU compiler collection now includes a very efficient Ada compiler.
GNAT has existed for ... at least ten years, I think. The DOD apparently paid for the development so there would be at least one open source/free software implementation. (According to Wikipedia, development started about twenty years ago, and it was merged into GCC in 2001.)
As a C++ programmer who delved deeply into nginx, I think that your chosen example of "high quality C code" is telling of the state of C programs. It may be one of the best specimens of C code in terms of organization, but in absolute terms it's one of the worst large code-bases I've had to deal with:
1. Almost devoid of any documentation.
2. No ADTs - all struct fields are directly accessed, even though there are sometimes very complicated invariants that must be preserved (and these invariants are not documented anywhere, of course).
3. Ultra-short variable names that really describe nothing but their types (it's like Hungarian notation with only the notation part).
4. These mystery variables are all defined at the top of the function, of course, making them hard to trace. Does anyone even compile Nginx with C89?
5. Configuration parsing creates a lengthy boilerplate that's really hard to track.
6. Asynchronous code is very hard to track, with a myriad of phase-specific meanings for return codes, callbacks and other stuff that makes reading Boost.ASIO code look like a walk in the park.
I think it really shows the sad state of C programming. Die-hard C programmers always complain C++ is hard to read because of its template and OOP abstractions, and it sometimes really is a mess to read, but C is even harder since large programs always re-implement their own half-assed version of OOP and template-macros inside.
Sorry if I think you're overstating your achievements.
- Was your code unit-tested?
- Did you use the 'strn' functions? (strncpy, instead of strcpy for example)
- Did your code run under Valgrind? (was there valgrind at that time? I'm not sure)
- Was it tested for memory leaks?
- Were there input fuzzing tests?
I agree that there are ways of writing great C code (like nginx, the linux and BSD kernels, etc)
But from a certain point on you're just wasting developer time when you would have a better solution in Java/Python/Ruby, etc with much less developer time and much less chance of bugs and security issues.
Unit tests? In C? I worked on a system processing millions of dollars a day in real time transactions, written in C++ with lower level libraries written in C. Some of it was written in K&R pre-ANSI C.
There was not one unit test. And there probably still isn't.
C is one of the easier languages to do unit testing in:
1. write a new one-file script which parses function declarations from a source code file
2. have the script find all function declarations which start with "utest_", accept no parameters, and return "int" or "bool".
3. have the script write out a new C file which calls all of the matching unit test functions and asserts that their return values are equal to 1
4. have your Makefile build and run a unit test binary using code generated by your script for each source code file containing greater than 0 unit tests.
Writing a script to parse function declarations from your C source code is useful because you can extend the script later on to generate documentation, code statistics, or conduct project-specific static analysis.
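A minimal sketch of how that convention might look in practice (hypothetical names; the generator script itself is omitted):

    #include <assert.h>

    /* --- foo.c: production code and its tests side by side ---------------- */
    int foo_add(int a, int b) { return a + b; }        /* production function */

    int utest_foo_add_basic(void)                      /* picked up by the script */
    {
        return foo_add(2, 2) == 4;
    }

    int utest_foo_add_negative(void)
    {
        return foo_add(2, -3) == -1;
    }

    /* --- what the generated runner (e.g. foo_utest_main.c) might contain -- */
    int utest_foo_add_basic(void);
    int utest_foo_add_negative(void);

    int main(void)
    {
        assert(utest_foo_add_basic()    == 1);
        assert(utest_foo_add_negative() == 1);
        return 0;
    }

The Makefile would then build one such runner per source file containing at least one utest_ function and run it on every build.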
It is, of course, possible. It is just rarely practiced, I think. It is as necessary as it is in any other language, which means either not at all, absolutely necessary, or somewhere in between depending on your point of view...
I have seen a few approaches: google test (actually a C++ framework) used to test C code, "custom" test frameworks (makefiles, test-runner shell scripts, and small main() functions that actually run tests), etc.
> It has more than 2M lines of C code. Almost all libraries were written from scratch.
Every professional security researcher reading this just raised an eyebrow and thought to themselves, "That's a vulnerable application." In fact, your team (or a similar one at AOL) did introduce vulnerabilities[1][2]. If we assume that, as you imply, AOL really was exclusively composed of above average C developers writing high quality code, this should really just demonstrate the point for us.
I'm sure your team wrote high quality C code (or at the very least, tried very hard to write high quality C code), and as someone who likes C I also dislike the meme that C should simply not be written, no exceptions. I'd probably change it to, "Don't write C unless you need extreme performance, have great engineering resources and you'll generate unmitigated financial returns such to make any reasonable business risk assessment moot." That means virtually the same thing for anyone reading it, but it sounds a lot less sexy and memorable as far as programming aphorisms go.
I have never professionally audited a C project and not found a vulnerability. Most of my friends in this industry would likely say the same. I've never seen a serious audit of a "large" (100kloc or more) C project by a serious, reputable firm with no vulnerabilities reported. If you wrote 2 million lines of C code, if you wrote libraries from scratch, you have serious vulnerabilities. Otherwise, your engineering practices and resources exceed NASA's in formal verification and correctness.
I believe it is theoretically possible to write 100% safe C code, for some approximation of that ideal that leaves aside the ivory tower definition ("the only safe machine is the one which is off"). But I also believe this is so expensive and resource intensive (secure coding standards, audits, attempts at formal verification and provable correctness) that it's just generally not realistic in practice.
The reason I am saying all of this is not to attack you or your sense of worth as a C developer. Rather, I'd like you to consider that you simply have no idea how many vulnerabilities exist in ICQ. In fact, I found eight different security reports for ICQ, linked at the bottom. One of the more serious ones allowed arbitrary email retrieval, which I'm willing to bet was introduced because your team liked to develop in-house libraries and decided to do that for POP3.
C is a language which is both very powerful and which requires constant vigilance while coding. It is a language which makes pushing a buffer overflow to production on a Friday at 3 pm very easy. I bet your team was comparatively well-versed in C vulnerabilities for the time, but the very fact that you rolled libraries from scratch tells me you already have a high probability of errors showing up. Rolling a library/framework in-house is basically the first thing security researchers look for in an audit, because the third party ones are (usually) more secure due to their exposure. When you further consider the probability of third party software you did use becoming vulnerable in the future, changing compiler optimizations introducing new vulnerabilities and the sheer size of the SEI CERT Secure Coding Standard itself, you are left with a very dangerous risk profile.
We cannot expect C to be a safe language for general use if you need to have the rigor of one of the best research organizations in the world coupled with the top 0.01% of all C programmers. It just isn't feasible.
> Every professional security researcher reading this just raised an eyebrow and thought to themselves, "That's a vulnerable application."
I think the reason why every seasoned security researcher I've met also happens to be a heavy drinker or is damaged in some way is because they're in a war of attrition. You want perfect security but it will never happen. These are ultimately mutating, register-based machines. We will probably never be certain that, with a sufficient level of abstraction, a program can be written which will never execute an invalid instruction or be manipulated to reveal hidden information.
Where the theory hits the road is where the action happens.
Which is how we end up with this wide spectrum of acceptable tolerances to security. Holistic verification of systems is extremely costly but necessary where human lives matter. However if someone finds a weird side-channel attack in a bitmap parsing library I think we can be more forgiving.
The whole idea that C programs are insecure by default and can never be secure is where theory wants to ignore the harsh realities. We can write languages with tighter constraints on the verification of the programs they create which will lower the risk of most security exploits by huge margins... but we have to trade something away for the benefit. The immediate costs being run-time performance or qualitative things like maintainability.
What I ultimately think will make these poor security researchers feel better is liability. Having a real system and standard in place for professional practices will at least let us soak up the damage, and will force us to consider security and scrutinize our code.
> The immediate costs being run-time performance or qualitative things like maintainability.
I don't think that's true. There's no reason why memory safety has to cost either performance or maintainability.
It feels like there's some sort of fundamental dichotomy because we didn't know how to do it in 1980, and we're still using languages from 1980, but our knowledge has advanced since then.
I probably shouldn't get in on this heated discussion ... but C++ aims to be a safer (higher level) C, at no runtime cost. I think it mostly succeeds at this.
I think C++ can only get so far, without deprecating some of C. Or at least C style code, should come with a big warning from the compiler:
"You are using an unsafe language feature. Are you absolutely sure there is no bug here, and there are no way to use safe language features instead?"
In libraries there might be cases where C-style code like manual pointer arithmetic, C arrays, manual memory management, C-style casts, void* pointers, uninitialized variables, etc. is necessary. But in user code there are most often safer replacements.
There are no such provably safe programs. I assume that you are thinking about programs that have been verified by formal proof systems.
They are not proven to be safe, they are proven to adhere to whatever the proof system managed to prove. This shifts the possible vulnerabilities away from the actual code and onto the proof system and the assumptions that the proof system makes.
For example, take a C program that was proven to be absolutely free of buffer overflows. And therefore labelled by the proof system as 'secure'. But, unbeknownst to the proof system, the C program also interprets user-supplied input as a format string! So it's still vulnerable to format string exploits.
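The difference can be as small as this (illustrative only):

    #include <stdio.h>

    void log_message_bad(const char *user_input)
    {
        /* If user_input contains "%x %x %n" etc., printf reads (and, with
         * %n, writes) memory it shouldn't -- no buffer overflow required. */
        printf(user_input);
    }

    void log_message_ok(const char *user_input)
    {
        printf("%s", user_input);   /* input is data, never a format string */
    }

    int main(int argc, char **argv)
    {
        const char *input = (argc > 1) ? argv[1] : "hello %x %x";
        log_message_ok(input);
        putchar('\n');
        /* log_message_bad(input); would interpret the %x specifiers */
        return 0;
    }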
Correct proof systems probably add a huge margin of security compared with the current security standards, but it's not absolute.
There are all kinds of people doing security research. In my experience, with some of the people I've met, the challenges they have to contend with are not unrelated to the stress caused by the work that they do.
Sort of like how someone who works with giant metal stamping presses is likely to be missing a couple of fingers if they've worked long enough with them (and in an environment where safety regulations are too relaxed).
These are also some of the wonderful people in my life and I enjoy them very much.
"Don't write C unless you need extreme performance,
This is the wrong mindset and has been the guiding principle behind over a generation of poor programming discipline.
You might be able to rationalize that you don't always need extreme performance, but it's not about performance, it's about efficiency. Efficiency always matters. We live on a small planet with finite resources. Today, the majority of computing is happening with handheld portable devices with tiny batteries. Every wasted byte and CPU cycle costs power, runtime, and usability.
Even if you're coding for a desktop PC or a server application - every bit of waste costs electricity, generates heat, increases cooling load. Every bit of waste costs the user real money, even if they don't see the time that you wasted for them.
Performance is a side-effect. The only thing that matters is efficiency, and it matters all the time.
Most of today's programming problems come down to laziness and fear of "premature optimization" (which is obviously bunk). It is possible to write perfect C code that runs correctly the first time and every time. You do that by planning ahead. Formal specs are part of this.
The same consideration applies to writing, in general. I witnessed this first hand, when I worked at the computer labs at the University of Michigan back in the 1980s. When all anyone had was an electric typewriter, they planned their papers out well in advance - detailed outlines, with logically designed introductions and conclusions. This was a requirement, because once you got into actually typing, correcting a mistake could be expensive or impossible without starting over from scratch.
When we only had Wordstar available, and users came in with the inevitable questions and requests for help, you would still see well-organized papers written with a clear logical flow.
In 1985 with the advent of the Macintosh and MacWrite, all of that changed: People discovered that they could edit at will, copy and paste and move text around on the fly. And so they did - throwing planning out the window. And the quality of papers we saw at the help desk plummeted. Surveys from those years confirmed it too - paradoxically, the writing scores in the better-funded departments dropped rapidly, commensurate with their rate of introduction of Macs.
Planning ahead, making sure you know what you need to write, before you start writing any actual lines of code, prevents most problems from ever happening. It's a pretty simple discipline, but one that's lost on most people today.
> Most of today's programming problems come down to laziness and fear of "premature optimization" (which is obviously bunk). It is possible to write perfect C code that runs correctly the first time and every time. You do that by planning ahead. Formal specs are part of this.
If "planning ahead" were enough to prevent security vulnerabilities in practice, someone would have done it by now. Instead, we've had C for upwards of 30 years and everyone, "bad programmers", "good programmers" and "10xers" alike, keeps introducing the same vulnerabilities.
It's a nice theory, but we've been trying for over 30 years to make it work, and it keeps failing again and again. I think it's time to admit that C as a secure language (modulo the extremely expensive formal coding practices used in avionics and whatnot) has failed.
Sorry, but your statement makes no sense. That's like saying "we've been trying for over 5000 years to make hammers work, and people keep smashing their thumbs and fingers again and again. I think it's time to admit that hammers as a safe tool have failed."
Security is not a property of the tool, it is a property of the tool's wielder.
I'm not talking about NPE (though I would be happy to elaborate on how not having null pointers also reduces crashes in the wild). I'm talking about exploitable and/or non-type-safe memory safety problems.
Writing in Java is an effective way to reduce remote code execution vulnerabilities over writing in C. Statistics have shown this over and over again.
The very fact that NPE exists, in a language which doesn't have pointers, running on a JVM that doesn't have pointers, is evidence that the entire model of "doesn't have pointers" is a sham. The reality of computer architecture is that data resides in memory, memory is an array of bytes, and bytes have addresses (i.e., pointers) and denying this fact is impossible. Even in a pure pointerless system like Java. And if there are pointers in the system, somewhere, anywhere, they are exploitable.
Remote code execution vulnerabilities in C are less a feature of the language syntax and mostly due to the abysmally poor C library, which is regrettably part of "C The Language" - but no one forces you to use it - there are better libraries. Also, it's an artifact of implementation - just as Java's supposed immunity is an artifact of implementation, not an inherent feature of the language syntax.
As for RCEs in C code - these would be easily avoided if the implementation used two separate stacks - one for function parameters, and one for return addresses. If user code can only overwrite passive data, stack smashing attacks would be totally ineffective. Again - there's nothing in the C language specification that dictates or defines this particular vulnerability. It is solely an internal implementation decision.
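To make the class of bug being discussed concrete, here is a minimal sketch of the classic stack-smashing pattern, with a bounded variant for contrast (illustrative only; the exploitability depends on the single-stack ABI described above):

    #include <stdio.h>
    #include <string.h>

    static void greet(const char *name)
    {
        char buf[16];
        strcpy(buf, name);    /* no bounds check: a long name overruns buf and,
                                 on a single-stack ABI, the saved return address */
        printf("hello %s\n", buf);
    }

    static void greet_safer(const char *name)
    {
        char buf[16];
        snprintf(buf, sizeof buf, "%s", name);   /* truncates instead of overflowing */
        printf("hello %s\n", buf);
    }

    int main(int argc, char **argv)
    {
        greet_safer(argc > 1 ? argv[1] : "world");
        (void)greet;          /* deliberately never called */
        return 0;
    }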
> As for RCEs in C code - these would be easily avoided if the implementation used two separate stacks - one for function parameters, and one for return addresses.
No. Increasingly, RCEs are due to vtable confusion resulting from use after free.
The trouble here is that the efficiency you hold up so highly is marginal in most cases. Making the computer do work to make human lives easier isn't a waste of resources, it's the only reason computers exist in the first place.
Totally agree with your latter statement. But the overall assertion still leads people down misguided paths. Modern OSs like MacOS, Windows, and desktop Linux are presumably intended to make using computers easier. Each new release adds reams of new code implementing new features intended to improve ease of use. Yet the net effect is often negative, because the sheer amount of code added makes the overall system significantly slower.
Meanwhile, new programming languages abound, presumably to make programmers' lives easier. But these developments often completely lose sight of the users. Python is easier to write, but a program written in python uses 1000x the CPU and memory resources of a program written in C. It does the job, but slowly, and consuming so much that the computer is unable to do anything else productive. Meanwhile, it's questionable whether programmers' lives have been improved either, because they spend too much time chasing the next new tool instead of growing expertise in just one.
As for marginal efficiency gains - that is only a symptom of the latter developer problem. E.g., LMDB is pure C, less than 8KLOCs, and is orders of magnitude faster than every other embedded database engine out there. It is also 100% crash-proof and offers a plethora of features that other DB engines lack. All while being able to execute entirely within a CPU's L1 cache. The difference is more than "marginal" - and that difference comes from years of practice with C. You will never gain that kind of experience by being a dilettante and messing with the flavor of the week.
> > It has more than 2M lines of C code. Almost all libraries were written from scratch.
> Every professional security researcher reading this just raised an eyebrow and thought to themselves, "That's a vulnerable application."
That may be true, but substitute "C" with "Python" and wouldn't you have the same reaction? Maybe you would expect it to be slightly less vulnerable, but the key risk that I read in that statement (I am not a security professional) is the "2M" and the "written from scratch". The risk from "C" is secondary.
I would think a 2M line Python application would suffer logic bugs, like the C version. However, I wouldn't worry about indexing off the end of arrays, NULL pointer accesses, etc. which occur in C (sometimes silently) on top of the logic errors.
"I have never professionally audited a C project and not found a vulnerability."
Just out of interest: How many C projects have you audited? And have you ever looked for a relationship between the number of vulnerabilities and the percentage code coverage from automated tests?
Granted I would say nginx, and perhaps Redis are quite exceptional that way. I use them as examples of nice, clear, clean C code.
A good programmer will make COBOL look clean and understandable. But on average, I would say C code requires a lot more discipline, knowledge, and experience to produce quality code.
> The first rule of C is don't write C if you can avoid it.
I interpreted that differently than you. I didn't take it to mean use another language, but to keep C code as concise as possible and to not reinvent the wheel when a good library is available.
There may be individual exceptions, but the key point "write simple, understandable code" for most people and cases is the right thing to do.
I like tweaking younger engineers with C eccentricities, but any commercial code tends towards being as vanilla as possible with lots of error checking and docs to be as clear as possible.
The problem is the enterprises that don't care about the quality of the C developers they hire, with big teams, high attrition and rotation of outsourcing companies.
I have seen quite a few of those, and I imagine you wouldn't enjoy auditing their code.
No, and to claim as much is simple laziness. It is possible to write provably correct code. Even if you don't want to go full formally verified, you can eliminate large categories of possible errors with a relatively small amount of proof.
None of which removes the need to write simple, understandable code without undocumented magic. But we should not be throwing up our hands and treating invalid accesses, memory safety failures, buffer overflows, SQL injection, XSS, or a whole host of common failures as inevitable. It really is possible to do a lot better than most programs and a lot better than C.
I interpreted the comment a bit more generously than you did. You're absolutely correct that it's possible, and I'll go as far as to say even desirable, to write provably correct code. However, I've never seen a demonstration of provably correct hardware. I interpreted the comment as a reminder that even binary is, ultimately, a leaky abstraction.
I used to work in a facility where they'd test systems for radiation hardness. Get the system running, then shoot a proton beam travelling a significant fraction of the speed of light directly at the processor. It doesn't matter if the code was proven correct by Dijkstra himself; it will eventually manifest bugs under these conditions.
I don't think anyone proves their hardware from basic physics[1], but hardware comes with specs and tolerances. If your software needs to operate in a high-radiation environment, you buy appropriate hardware and potentially take other countermeasures (e.g. multiple processors executing in sync).
[1] practically we resort to empirical testing to confirm that hardware conforms to its spec. But as your sibling says, that's no excuse not to do better in the places where we can.
Perhaps don't put your system in the path of a proton beam? I've had a large test suite that consistently used 8GB running for three years without ECC RAM. Any bit flip would have been caught by the test suite. My conclusion is that bit flips are entirely exaggerated (probably by the hardware industry for obvious reasons).
Do you have the suite published? The frequency of errors depends on how much memory and in which conditions it is being used. Datacenters, supercomputers, even personal workstations used for scientific calculations benefit from error correction. Check this out: http://techcrunch.com/2009/11/02/new-study-proves-that-ecc-m...
Usually comments that complain about downvotes are downvoted. Have another go, perhaps you'll reach -20. If you persist, you will be eminently qualified as a Wikipedia admin.
John Regehr observed a proven-correct C compiler generating incorrect code. The reason? An error in a system header file. The truth of the matter is that for most of the software systems formally proving every single piece of the system to be correct is too expensive. Perhaps as the rate of change in the software industry inevitably slows, this will change, as slower change will increase re-usability, but at the moment it is the sad state of things.
... and then someone finds a bug in the specification which is provably correctly implemented.
Using gets() to read the input can be provably correct. All you have to do is remove any requirements for security from the program's specification, and specify that the program will always be used in circumstances when the expected input datum will be less than a certain length.
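A small sketch of the same point, without actually calling gets() (which C11 removed): the code below is "provably" within bounds only relative to a stated precondition, and nothing enforces that precondition in release builds.

    #include <assert.h>
    #include <string.h>

    static char dest[64];

    /* Specification: src is NUL-terminated and strlen(src) < 64. */
    static void copy_per_spec(const char *src)
    {
        assert(strlen(src) < sizeof dest);   /* checks the spec in debug builds only */
        strcpy(dest, src);                   /* safe if and only if the spec holds   */
    }

    int main(void)
    {
        copy_per_spec("short and within spec");
        return 0;
    }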
... and then the hardware is buggy or failing. That RAM bit flips and you're screwed. The machine as a whole isn't provably correct.
> ... and then someone finds a bug in the specification which is provably correctly implemented.
As a programmer, if the specification is wrong, it's Not My Fault (tm). If you want me to write specifications, well, pay me to do it! I guarantee dramatically better results than the usual crap.
> The machine as a whole isn't provably correct.
As a programmer, defects in the machine are Not My Problem (tm). In any case, hardware vendors have a much better track record than software developers at delivering products that meet strict quality standards.
Assignment of blame is important, because it determines what component needs to be fixed to solve the problem. It's all about correctness in the end.
As an aside, I want to clarify that I'm not advocating being a bad team player. The point to assigning blame isn't demonizing the developer of the faulty component, or reducing cooperation between developers of different components to the bare minimum. The point is just reducing the cost of fixing the problem.
Assignment of blame is important, but it is often an opinion.
If one program misuses another, usually the fix can be an alteration in either one or both. The used program can be expanded to accept the "misuse" which is actually legitimate but not in the requirements. Or the using program can be altered not to generate the misuse.
Sometimes what is fixed is chosen for non-technical reasons, like: the best place to fix it it is in such and such code, but ... it's too widely used to touch / not controlled by our group and we need a fix now / closed source and not getting fixed by the vendor {until next year|EVER} / ...
"If it isn't in the specification, it isn't legitimate, period."
You have some goal you're trying to achieve, or why was a specification written at all? If you find that the existing specification is not the optimal path to that goal, you may very well want to change the specification. In which case, it's entirely legitimate.
Anyway, requirements, specifications and programs are just multiple representations of the same thing. Ideally, we would just state some requirements and they would execute.
So the requirements/specification/program complex exists Just Because. Humans decided they want that: for their amusement, for the sake of supporting some enterprise or solving a problem, or to try to sell to other humans.
The complex contains several different representations because that's what it takes to bridge the gap between stating the requirements and making the machine carry them out.
A proof is something internal to that complex: that the specification corresponds to the requirements, and that the program implements the specification.
If we had just one artifact: a specification that executes, then there would be no concept of proof any more.
There is only the question whether the specification that was expressed is the one that was intended in the mind. That equivalence is no more susceptible to proof than, say, the correspondence between the Caesar salad that the waiter just put on your table (and which you clearly specified) and the idea of whether you actually wanted one, or did you mistakenly say "Caesar salad" in spite of having wanted a soup.
Plus, there is the external question of whether a specification has unintended consequences. That basically amounts to "you say you want that, but maybe you should revise what you want because of these bad things".
"You say you want plaintext passwords to be stored for easier recovery, and that can certainly be implemented (provably correctly, too) but consider the following ramifications ..."
> If we had just one artifact: a specification that executes, then there would be no concept of proof any more.
It is possible for an executable functional specification to have unbearably bad performance for production use. In that case, the programmer has to reimplement the program in a more efficient (but perhaps less obvious) way and supply a proof that the reimplementation is functionally equivalent to the original executable specification.
Proofs aren't going away anytime soon.
> There is only the question whether the specification that was expressed is the one that was intended in the mind.
If other people can't bother communicating what they really want, that is absolutely Not My Problem (tm).
> Plus, there is the external question of whether a specification has unintended consequences.
If other people can't analyze the logical consequences of what they wish for, that is absolutely Not My Problem (tm). The most I can do is point at contradictions in the requirements.
Maybe cosmic rays give you a lower bound on the defect rate (though you can certainly measure and account for them). But I have never seen a complete system in which cosmic ray errors were anything more than the tiniest of undetectable noise, drowned out by the vast weight of software errors.
Is it really the case that some programming languages are more resilient to hardware errors than others?
Java does array bounds checking in principle, but it is usually optimized out of the machine code. Isn't it then virtually just as susceptible to hardware error as C? And the JVM is written in C anyway.
The JVM is written in C, but that C has been subject to some formal analysis. And even if it hadn't, I'd bet the defect rate of a Java program running on a C JVM would be much much lower than that of a C program.
But, in keeping with this topic, that C has not been subject to formal analysis from the angle of defending against rays doing random things to the CPU and RAM or other hardware errors.
So let's stop unit testing all of our code, since it's provably incorrect anyways, due to cosmic rays flipping random bits of RAM once in every gazillion calculations.
What is the largest program that has been proven correct?
We're writing programs that are tens of millions of lines of code (OSes, databases, air traffic control systems). My impression of provable correctness is that the largest program that has been proven is in the area of 100,000 lines (feel free to correct if I am in error). We need to be able to prove programs two orders of magnitude bigger than that; provable correctness has not been tried for such programs because nobody wants to wait a generation or two before they get the results.
Sort of. We haven't proven a single thing beyond a certain size and complexity. That was true when it started back in Orange Book days. Even when Dijkstra came up with the stuff and the method of dealing with it: composition w/ preconditions, postconditions, and invariants.
You code in a functional style and/or with composed state machines. They composed stuff hierarchically and loop-free where possible to facilitate analysis. Each component is small enough to analyze for all success or failure states. The results of each become "verification conditions" for things that build on it. With such methods, one can build things that are as large as you want so long as each component (primitive or composite) can be analyzed.
Not clear at what point this breaks down. I used it for drivers, protocols, algorithms, systems, networks, and so on. I brute forced it with more manual analysis because I'm not mathematical or trained enough for proofs. However, what holds it back on large things is usually just lack of effort as far as I can tell.
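A toy sketch of the compositional idea in plain C, with asserts standing in for the real verification conditions (hypothetical example, not from any of the projects mentioned):

    #include <assert.h>
    #include <stddef.h>

    /* Precondition:  buf != NULL, len >= 0, every element >= 0.
     * Postcondition: returns the sum, which is >= 0 (overflow ignored here). */
    static long sum_nonneg(const int *buf, int len)
    {
        assert(buf != NULL && len >= 0);      /* precondition */
        long total = 0;
        for (int i = 0; i < len; i++) {
            assert(buf[i] >= 0);              /* rest of the precondition */
            total += buf[i];
        }
        assert(total >= 0);                   /* postcondition */
        return total;
    }

    /* A component built on top reasons only about sum_nonneg's postcondition,
     * never its internals -- that's the "verification condition" it relies on. */
    static long average_nonneg(const int *buf, int len)
    {
        assert(len > 0);                      /* this component's precondition */
        return sum_nonneg(buf, len) / len;    /* >= 0 by the callee's postcondition */
    }

    int main(void)
    {
        int data[] = { 3, 1, 4, 1, 5 };
        return average_nonneg(data, 5) == 2 ? 0 : 1;
    }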
So maybe 1) we shouldn't write systems that are tens of millions of anarchically intercoupled lines of code or 2) we should write 100 100,000-line subsystems and link them together in a rigorous way.
The key is then to simply avoid emergent interactions between modules/subsystems. That's pretty doable.
What doesn't happen is people reasoning about the cost of defects. Even if they do some reasoning about this, they don't apply it to other components of systems.
Exactly what I said above. Verifying individual components and their straightforward composition is how it was done in the past. People just ignore that.
Y'know a lot of the best, most robust and mature software out there is a generation old... Maybe that's just how long it takes to get large scale software right.
The discussion is about C, so I would assume the post you're responding to was in that scope as well. Leading with an insult isn't likely to win any hearts or minds either.
How many modern programmers hold this mistaken belief? It's false -- the Turing Halting problem (https://en.wikipedia.org/wiki/Halting_problem) demonstrates that one cannot establish that a non-trivial program is "provably correct."
It seems the OP got carried away with his rhetoric, and the way he put it is simply wrong.
That is a very common but incorrect interpretation of the halting problem.
The halting problem says that there exist programs that cannot be proven correct; it says that for any property you want to prove computationally, no program can give you an answer for every last program. It doesn't say anything about "non-trivial" programs. In particular, if you write programs with awareness of the halting problem in mind, you can write extremely complicated programs that are provably correct.
One straightforward way to do this is with a language that always halts. The proof assistant Coq is based on such a language, and it's powerful enough to write a C compiler: http://compcert.inria.fr/
It is certainly correct that you can't stuff an arbitrarily vexatious program through a proof assistant and get anything useful on the other side. But nobody is interested in proving things about arbitrarily vexatious programs; they're interested in proving things about programs that cooperate with the proof process. If your program doesn't want to cooperate, you don't need an exact answer; you might as well just say "No, I don't care, I'm not trusting you."
Right. This is a bogus complaint about proof of correctness.
The way you prove halting is to find some integer measure for each loop (like "iterations remaining" or "error being reduced") which decreases on each iteration but does not go negative. If you can't find such a measure, your program is probably broken. (See page 12 of [1], the MEASURE statement, for a system I built 30 years ago which did this.)
The measure has to be an integer, not a real number, to avoid Zeno's paradox type problems. (1, 1/2, 1/4, 1/8, 1/16 ...)
On some floating point problems, with algorithms that are supposed to converge, proving termination can be tough. You can always add an iteration counter and limit the number of iterations. Now you can show termination.
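A toy illustration of the measure idea in C — not the Pascal-F MEASURE statement itself, just the same discipline expressed with an assert and an iteration cap:

    #include <assert.h>
    #include <math.h>
    #include <stdio.h>

    /* Integer square root by interval halving.
     * Measure: (hi - lo), an integer that strictly decreases each iteration. */
    static unsigned isqrt(unsigned n)
    {
        unsigned lo = 0, hi = n / 2 + 2;          /* invariant: lo <= sqrt(n) < hi */
        while (hi - lo > 1) {
            unsigned old_measure = hi - lo;
            unsigned mid = lo + (hi - lo) / 2;
            if (mid <= n / mid) lo = mid; else hi = mid;
            assert(hi - lo < old_measure);        /* the measure went down */
        }
        return lo;
    }

    /* Floating-point convergence loop: the explicit counter is the measure. */
    static double newton_sqrt(double x)
    {
        double guess = x > 1.0 ? x : 1.0;
        for (int remaining = 100; remaining > 0; remaining--) {
            double next = 0.5 * (guess + x / guess);
            if (fabs(next - guess) < 1e-12)
                return next;
            guess = next;
        }
        return guess;                             /* cap hit: still terminates */
    }

    int main(void)
    {
        printf("%u %.6f\n", isqrt(10u), newton_sqrt(10.0));
        return 0;
    }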
Microsoft's Static Driver Verifier is able to prove correctness, in the sense of "not doing anything that will crash the kernel", about 96% of the time. About 4% of the time, it gives up after running for a while. If your kernel driver is anywhere close to undecidability, there's something seriously wrong with it.
In practice, this is not one of the more difficult areas in program verification.
Pascal-F is interesting and not in my collection. What were most useful or interesting things you did with it? And does it have any use today in modern Pascal-like languages (maybe Modula-2 revision)?
> It doesn't say anything about "non-trivial" programs.
No, but its source does. The source for the Turing Halting problem is Gödel's incompleteness theorems, which do specify that the entities to which it applies are non-trivial. The Turing Halting problem, and Gödel's incompleteness theorems, are deeply connected.
> One straightforward way to do this is with a language that always halts.
That doesn't avoid the halting problem, it only changes the reason for a halt. And it cannot guarantee that that program will (or won't) halt, if you follow -- only its interpreter.
> In particular, if you write programs with awareness of the halting problem in mind, you can write extremely complicated programs that are provably correct.
You need to review the Turing Halting problem. No non-trivial computer program can be proven correct -- that's the meaning of the problem. It cannot be waved away, any more than Gödel's incompleteness theorems can be waved away, and for the same reason.
> The source for the Turing Halting problem is Gödel's incompleteness theorems, which do specify that the entities to which it applies are non-trivial. The Turing Halting problem, and Gödel's incompleteness theorems, are deeply connected.
Gödel doesn't say anything about "non-trivial" any more than Turing does. His first theorem is a "there exists at least one statement" thing, just like the halting problem. Like the halting problem, the proof is a proof by construction, but that statement is not a statement that anyone would a priori care to prove, just like the program "If I halt, run forever" is not a program that anyone would particularly want to run. So while yes, they're deeply connected, from the point of view of proving actual systems correct, it's not clear they matter. And neither theorem says anything about "non-trivial" statements/programs (just about non-trivial formalizations and languages).
Gödel's second theorem is super relevant, though: it says that any consistent system cannot prove its own consistency. But that's fine. Coq itself is implemented in a Turing-complete language (ML), not in Coq. It would be nice if we could verify Coq in Coq itself, but since Gödel said we can't, we move on with life, and we just verify other things besides proof assistants. (And it seems to be worthwhile to prove things correct in a proof assistant, even without a computer-checked proof of that proof assistant's own correctness.)
This is not waving away the problem, it's acknowledging it and working within the limits of the problem to do as much as possible. And while that's not everything, it turns out a lot of stuff of practical importance -- perhaps everything of practical importance besides proving the system complete in its own language -- is possible.
Quote: "Rice's theorem generalizes the theorem that the halting problem is unsolvable. It states that for any non-trivial property, there is no general decision procedure that, for all programs, decides whether the partial function implemented by the input program has that property. (A partial function is a function which may not always produce a result, and so is used to model programs, which can either produce results or fail to halt.) For example, the property "halt for the input 0" is undecidable. Here, "non-trivial" means that the set of partial functions that satisfy the property is neither the empty set nor the set of all partial functions."
Quote: "Gödel's incompleteness theorems are two theorems of mathematical logic that establish inherent limitations of all but the most trivial axiomatic systems capable of doing arithmetic."
> It would be nice if we could verify Coq in Coq itself ...
I was planning to bring this issue up, but you've done it for me. The simplest explanation of these two related theorems involves the issue of self-reference. Imagine there is a library, and a very conscientious librarian wants to have a provably correct account of the library's contents. The librarian therefore compiles an index of all the works in the library and declares victory. Then Gödel shows up and says, "On which shelf shall we put the index?"
In the same way, and for the same reason, non-trivial logical systems cannot check themselves, computer languages cannot validate their products (or themselves), and compilers cannot prove their own results correct.
> And it seems to be worthwhile to prove things correct in a proof assistant, even without a computer-checked proof of that proof assistant's own correctness.
Yes, unless someone is misled into thinking this means the checked code has been proven correct.
> This is not waving away the problem ...
When someone declares that a piece of code has been proven correct, in point of fact, yes, it is.
> for any non-trivial property, there is no general decision procedure that, for all programs
> all but the most trivial axiomatic systems
This is about non-trivial properties on arbitrary programs, and non-trivial axiomatic systems proving arbitrary statements. It is not about non-trivial programs or non-trivial statements.
Rice's theorem states that, for all non-trivial properties P, there exists a program S where you cannot compute P(S). (The Wikipedia phrasing puts the negative in a different spot, but this is de Morgan's law for quantifiers: ∀P ¬∀S (P(S) is computable) = ∀P ∃S ¬(P(S) is computable).)
Gödel's first theorem is that, for all non-trivial formalizations P, there exists a statement S that you cannot prove in P.
You are claiming: for all non-trivial properties P, for all non-trivial programs S, you cannot compute P(S); for all non-trivial formalizations P, for all non-trivial statements S, you cannot prove S in P.
This is not what Rice's or Gödel's theorems are, and not what the texts you quoted say.
> Then Gödel shows up and says, "On which shelf shall we put the index?"
You keep it outside the library. The index tells you where every single book is, other than the index.
In particular, Gödel claims that, if the index has any chance of being correct, it must be outside the library. The fact that Coq cannot be proven in Coq does not decrease our confidence that it is correct; in fact, Gödel's second theorem means that if it could be proven in itself, we'd be confident it was wrong!
>> Then Gödel shows up and says, "On which shelf shall we put the index?"
> You keep it outside the library. The index tells you where every single book is, other than the index.
This approach, had it worked, would have saved Russell and Whitehead's "Principia Mathematica" project and placed mathematics on a firm foundation, but it suffers from the same defect of self-reference: it places the index beyond validation. Russell realized this, and recognized that Gödel's results invalidated his project.
> The fact that Coq cannot be proven in Coq does not decrease our confidence that it is correct;
Yes, true, but for the reason that we have no such confidence (and can have no such confidence). We cannot make that assumption, because the Turing Halting problem and its theoretical basis prevents it.
Quote: "Gödel hammered the final nail home by being able to demonstrate that any consistent system of axioms would contain theorems which could not be proven. Even if new axioms were created to handle these situations this would, of necessity, create new theorems which could not be proven. There's just no way to beat the system. Thus he not only demolished the basis for Russell and Whitehead's work--that all mathematical theorems can be proven with a consistent set of axioms--but he also showed that no such system could be created." (emphasis added)
> Russell realized this, and recognized that Gödel's results invalidated his project.
One particular project. The specific project of formalizing mathematics within itself is unachievable. That doesn't mean we've stopped doing math.
And this is in fact the approach modern mathematics has taken. The fact that one particular book (the proof of mathematics' own consistency) can't be found on the shelves doesn't mean there isn't value in keeping the rest of the books well-organized.
> Yes, true, but for the reason that we have no such confidence (and can have no such confidence). We cannot make that assumption, because the Turing Halting problem and its theoretical basis prevents it.
No! Humans aren't developed in either Coq or ML. A human can look at Coq and say "Yes, this is correct" or "No, this is wrong" just fine, without running afoul of the halting problem, Gödel's incompleteness theorem, or anything else. The fact that you can have no Coq-proven confidence in Coq's correctness does not mean that you can have no confidence of Coq's correctness.
(Of course, this brings up whether a human should be confident in their own judgments, but that's a completely unavoidable problem and rapidly moving to the realm of philosophy. If you are arguing that a human should not believe anything they can convince themselves of, well, okay, but I'm not sure how you plan to live life.)
This is like claiming that, because ZF isn't provably correct in ZF (which was what Russell, Whitehead, Gödel, etc. were concerned about -- not software engineering), mathematicians have no business claiming they proved anything. Of course they've proven things just fine if they're able to convince other human mathematicians that they've proven things.
>> Russell realized this, and recognized that Gödel's results invalidated his project.
> One particular project.
A mathematical proof applies to all cases, not just one. Gödel's result addressed Russell's project, but it applies universally, as do all valid mathematical results. Many mathematicians regard Gödel's result as the most important mathematical finding of the 20th century.
> That doesn't mean we've stopped doing math.
Not the topic.
> A human can look at Coq and say "Yes, this is correct" or "No, this is wrong" just fine, without running afoul of the halting problem, Gödel's incompleteness theorem, or anything else.
This is false. If the Halting problem is insoluble, it is also insoluble to a human, and perhaps more so, given our tendency to ignore strictly logical reasoning. It places firm limits on what we can claim to have proven. Surely you aren't asserting that human thought transcends the limits of logic and mathematics?
> Of course they've proven things just fine if they're able to convince other human mathematicians that they've proven things.
Yes, with the caution that the system they're using (and any such system) has well-established limitations as to what can be proven. You may not be aware that some problems have been located that validate Gödel's result by being ... how shall I put this ... provably unprovable. The Turing Halting problem is only one of them, there are others.
> A mathematical proof applies to all cases, not just one. Gödel's result addressed Russell's project, but it applies universally, as do all valid mathematical results.
What? I have no idea what you mean by "applies universally" and "all valid mathematical results."
A straightforward reading of that sentence implies that Pythagoras' Theorem is not a "valid mathematical result" because it applies only to right triangles. That can't be what you're saying, can it?
I am acknowledging that Gödel's second theorem states that any mathematical system that attempts to prove itself is doomed to failure. I am also stating that Russell's project involved a mathematical system that attempts to prove itself. Therefore, Gödel's second theorem applies to Russell's project, but it does not necessarily say anything else about other projects.
> Many mathematicians regard Gödel's result as the most important mathematical finding of the 20th century.
So what? This is argumentum ad populum. How important the mathematical finding is is irrelevant to whether it matters to the discussion at hand.
> This is false. If the Halting problem is insoluble, it is also insoluble to a human
OK, so do you believe that modern mathematics is invalid because Russell's project failed?
No, I don't believe that the future is languages that are not Turing-complete, mostly because halting is not the property most people care about. I believe that a large part of the future will be languages that are not Turing-complete, for applications where that's appropriate (e.g., C compilers) but not where it's not (e.g., interpreters). I also don't think that "well, that's how we've always done it" is a particularly compelling argument given how young the field of programming languages is.
For instance, another wonderful example of requiring code to cooperate with the proof process is Google's Native Client. NaCl doesn't care about halting; what it cares about is that the code in question only accesses its own memory, or makes specific pre-defined calls into the supervising code. You can do this with hardware support (and they did, with segmentation, where available), but you can also do this with software support.
The NaCl technique is straightforward. You're required to lay out your assembly so that every 32-byte alignment boundary separates instructions; you must insert NOPs as necessary. Any fixed read, write, or jump must be within an address within a mask (or to a predefined address in privileged code), and any computed read, write, or jump must be immediately preceded, within the same 32-byte block, by a mask operation on the relevant register. For data, the mask enforces that only data is accessed; for jumps, the mask enforces that the jump target is within the code area and also 32-byte aligned.
Now you've proven a fairly complicated property of the code, that control flow never escapes to privileged code except at well-defined entry points. Doing so on arbitrary x86 would be impossible (unless you resort to binary translation), but this is a very minor modification to arbitrary x86, and a regular C compiler can be easily modified to emit it. The resulting restricted variant of x86 remains Turing-complete.
And it is entirely reasonable for the NaCl verifier to say "Uh, what are you doing, please try again" if presented with code that is technically non-malicious but does not follow the rules. The NaCl verifier doesn't claim to detect malicious code (that's the antivirus problem, and unsolvable). It claims to take code that makes a specific promise about not being malicious, and verifies that property.
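To make the shape of such a check concrete, here is a toy verifier pass in C. The instruction representation, the constants, and the reduced rule set are invented simplifications; the real NaCl verifier works on raw x86 bytes and enforces considerably more than this.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    enum kind { K_OTHER, K_FIXED_ACCESS, K_COMPUTED_ACCESS, K_MASK };

    struct insn {                /* one already-decoded instruction            */
        uint32_t  offset;        /* address within the module                  */
        uint32_t  length;        /* encoded length in bytes (assumed >= 1)     */
        enum kind kind;
        uint32_t  target;        /* for K_FIXED_ACCESS: the fixed address used */
        int       reg;           /* for K_COMPUTED_ACCESS / K_MASK: register   */
    };

    #define BUNDLE       32u          /* alignment bundle size                 */
    #define SANDBOX_MASK 0x00FFFFFFu  /* toy 16 MiB sandbox                    */

    static bool verify(const struct insn *code, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            const struct insn *in = &code[i];

            /* Rule 1: no instruction may straddle a 32-byte boundary. */
            if (in->offset / BUNDLE != (in->offset + in->length - 1) / BUNDLE)
                return false;

            /* Rule 2: fixed reads/writes/jumps must stay inside the sandbox
               (trampolines into privileged code are omitted here). */
            if (in->kind == K_FIXED_ACCESS && (in->target & ~SANDBOX_MASK))
                return false;

            /* Rule 3: a computed access must be immediately preceded, within
               the same bundle, by a mask of the register it uses. */
            if (in->kind == K_COMPUTED_ACCESS) {
                if (i == 0)
                    return false;
                const struct insn *prev = &code[i - 1];
                if (prev->kind != K_MASK || prev->reg != in->reg ||
                    prev->offset / BUNDLE != in->offset / BUNDLE)
                    return false;
            }
        }
        return true;
    }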
"halting is not the property most people care about"
I agree that non-Turing complete languages are hugely interesting and likely to be more important as we go forward.
But not for this reason. Most of the non-trivial properties people do care about can be used to tell you if a program is going to halt in a finite amount of time. Want an upper bound on memory? This program doesn't have one? Well, then it never halts. It has one? Well, then we know it halts or loops within 2^bits steps - run it for that many steps and you know which.
I think non-Turing-complete will prove interesting because our programs are currently described in Turing complete languages, but run in more constrained contexts anyway. A program that always halts after running for 2^100 steps is not Turing complete. A program that halts if it tries to use more than 2^100 bytes of memory is not Turing complete. Applying these constraints in this way doesn't help us with analysis much, but it hints that we can probably find other ways to constrain our programs which won't get in the way of what we're trying to accomplish, and it's quite possible that some of those will let us build languages more amenable to analysis (and further possible that the gains in analysis will make up for a limited amount of getting-in-the-way).
I don't know where this thinking gets us. My computer, once I disconnect it from the internet, is actually a finite state machine and not a Turing machine, which I think you acknowledge. How does that help with anything?
If my program uses one kibibyte of memory (no I/O, no hard drive), then it loops or halts within 2^(2^10) steps. For me to detect that, in general I need to store a history of length 2^(2^10), which is more than 10^300. That's 10^301 bytes that I need to detect whether a 1 KiB RAM program loops.
It is a proof that we can get what we need done in a language that is not Turing complete. The hope would be that there is some sweet spot we can find wherein we can say things that are dramatically more useful than "it will finish inside 2^(2^100) steps" without impacting usability tremendously more than "it won't use more than a kibibyte of memory" - which is already enough to defang the theoretical objection.
http://research.google.com/pubs/archive/34913.pdf section 4 discusses performance: "The worst case performance overhead is [the SPEC2000 benchmark named] crafty at about 12%, with other benchmarks averaging about 5% overall." For some applications, there is no slowdown or even a minor speedup because of weird caching behavior and alignment.
I haven't seen this used for a general-purpose OS distro (I suppose you'd use it for running a VM-like thing as an unprivileged user, without access to any hardware features) but I agree it would be cool.
Oh, perhaps I misunderstood, then. Is it not likely that the vast majority of C/C++ userspace programs could survive this change-in-compiler-output-to-meet-restriction without functionality impact?
Yes, you could apply this change in compiler output without functionality impact to existing userspace applications. But you'd have to recompile everything, and it's not clear how it'd be useful; part of the question is what the privileged interfaces should be.
For Chrome this is useful because they want Chrome itself to be an unprivileged application, and they want to be essentially a recursive kernel to untrusted web content: there are multiple "users" (websites) that shouldn't be able to access each other's content, and certainly shouldn't be able to mess with the browser itself, and they get a restricted filesystem-like thing, screen-like thing, etc.
For a general-purpose OS, the OS is assumed to have control of the CPU and can just use normal hardware protection. So it's primarily useful if you want to build an OS that can be installed inside an unprivileged user account on another OS. But in practice, most people grant their unprivileged user account enough privilege to use hardware-assisted virtualization.
Wrong. What Russell/Gödel/Turing/Rice/etc. tell you you can't do is write a decision procedure that checks arbitrary nontrivial properties of programs. In other words, if you first write a program and only then ask whether it is correct, you've already lost.
Obviously, the solution is to stop writing arbitrary programs. As Dijkstra said long ago, we need to find a class of intellectually manageable programs - programs that lend themselves to being analyzed and understood. Intuitively, this is exactly what programmers demand when they ask for “simplicity”. The problem is that most programmers' view of “simplicity” is distorted by their preference for operational reasoning, eschewing more effective and efficient methods for reasoning about programs at scale.
> Obviously, the solution is to stop writing arbitrary programs. As Dijkstra said long ago, we need to find a class of intellectually manageable programs - programs that lend themselves to being analyzed and understood.
You're overlooking the issue of self-reference. A given computer language can validate (i.e. prove correct) a subset of itself, but it cannot be relied on to validate itself. A larger, more complex computer language, created to address the above problem, can check the validity of the above smaller language, but not itself, ad infinitum. It's really not that complicated, and the Halting Problem is more fundamental than this entire discussion seems able to acknowledge.
Correctness is always relative to a specification. More precisely, it's the program's specification, which must exist prior to the program itself, which defines what it constitutes for the program to be correct. So a program that “validates itself” simply doesn't make sense.
Also, it seems you misunderstand Gödel/Turing/etc.'s result. What a consistent and sufficiently powerful formal system can't do is prove its own consistency as a theorem. Obviously, it isn't sustainable to spend all our time inventing ever more powerful formal systems just to prove the preceding ones consistent. At some point you need to assume something. This is precisely the rôle of the foundations of mathematics (or perhaps I should say a foundation, because there exist many competing ones): to provide a system whose consistency need not be questioned, and on top of which the rest of mathematics can be built.
In any case, we have gone too far away from original point, which is that, if arbitrary programs are too unwieldy to be analyzed, the only way we can hope to ever understand our programs is to limit the class of programs we can write. And this is precisely what (sound) type systems, as well as other program analyses, do. Undoubtedly, some dynamically safe programs will be excluded, e.g., most type systems will reject `if 2 + 2 == 4 then "hello" else 5` as an expression of type `String`. But, in exchange, accepted programs are free by construction of any misbehaviors the type system's type safety theorem rules out. The real question is: how do we design type systems that have useful type safety theorems...?
> So a program that “validates itself” simply doesn't make sense.
The language comes from Gödel's incompleteness theorems, in which there exist true statements that cannot be proven true, i.e. validated.
A given non-trivial computer program can be relied on to produce results consistent with its definition (its specification), but this cannot be proven in all cases.
It's not very complicated.
> This is precisely the rôle of the foundations of mathematics (or perhaps I should say a foundation, because there exist many competing ones): to provide a system whose consistency need not be questioned, and on top of which the rest of mathematics can be built.
This is what Russell and Whitehead had in mind when they wrote their magnum opus "Principia Mathematica" in the early 20th century. Gödel proved that their program was flawed. Surely you knew this?
> Also, it seems you misunderstand Gödel/Turing/etc.'s result.
> A given non-trivial computer program can be relied on to produce results consistent with its definition (its specification), but this cannot be proven in all cases.
Provability is always relative to a formal system. The formal system you (should) use to prove a program correct (with respect to its specification) isn't the program itself.
> Gödel proved that their program was flawed. Surely you knew this?
Gödel didn't prove the futility of foundations for mathematics. Far from it. What Gödel proved futile is the search for a positive answer to Hilbert's Entscheidungsproblem: There is no decision procedure that, given an arbitrary proposition (in some sensible formal system capable of expressing all of mathematics), returns `true` if it is a theorem (i.e., it has a proof) or `false` if it is not.
If you want to lecture someone else on mathematical logic, I'd advise you to actually learn it yourself. I'm far from an expert in the topic, but at least this much I know.
> Gödel didn't prove the futility of foundations for mathematics.
Your words, not mine, but Gödel did prove that a non-trivial logical system cannot be both complete and consistent. As Gödel phrased it, "Any consistent axiomatic system of mathematics will contain theorems which cannot be proven. If all the theorems of an axiomatic system can be proven then the system is inconsistent, and thus has theorems which can be proven both true and false."
Gödel's results don't address futility, but completeness. That's why his theorems have the name they do.
> The formal system you (should) use to prove a program correct (with respect to its specification) isn't the program itself.
This refers to the classic case of using a larger system to prove the consistency of a smaller one, but it suffers from the problem that the larger system inherits all the problems it resolves in the smaller one.
I'm sorry, but 'catnaroek is completely justified here. Multiple people have told you that you are simply factually wrong in your interpretation of the halting problem, and you are acting arrogant about it. If you tone back the arrogance a bit ("Surely you knew this?", "The misunderstanding is not mine") and concede that, potentially, it's not the case that everyone else is wrong and you're right, you might understand what your error is.
It's a very common mistake and is frequently mis-taught, especially by teachers who are unfamiliar with fields of research that actually care about the implications of the halting problem and of Gödel's theorems (both in computer science and in mathematics) and just know it as a curiosity. So I've been very patient about this. But as I pointed out to you in another comment, you are misreading what these things say in a way that should be very obvious once you state in formal terms what it is that you're claiming, and it would be worth you trying to acknowledge this.
My original objection was to the claim that a non-trivial program could be demonstrated to be "provably correct." This statement is false. For non-trivial programs, it contradicts the simplest and most easily accessible interpretation of the Turing Halting problem and Gödel's incompleteness theorems, both of which demonstrate that this is not possible.
Quote: "Alan Turing proved in 1936 that a general algorithm to solve the halting problem for all possible program-input pairs cannot exist." (emphasis added)
> Multiple people have told you that you are simply factually wrong ...
So as the number of people who object increases, the chances that I am wrong also increases? This is an argumentum ad populum, a logical error, and it is not how science works. Imagine being a scientist, someone for whom authority means nothing, and evidence means everything.
In your next post, rise to the occasion and post evidence (as I have done) instead of opinion, or don't post.
> Quote: "Alan Turing proved in 1936 that a general algorithm to solve the halting problem for all possible program-input pairs cannot exist."
Do you agree or disagree that the quantifiers in these two statements are different? Do you agree or disagree that "the Turing Halting problem demonstrates that one cannot establish that every single non-trivial program is either 'provably correct' or 'provably incorrect'" is a different claim than the one originally made?
> My original objection was to the claim that a non-trivial program could be demonstrated to be "provably correct." This statement is false. For non-trivial programs, it contradicts the simplest and most easily accessible interpretation of the Turing Halting problem and Gödel's incompleteness theorems, both of which demonstrate that this is not possible.
You don't need to solve the Halting problem (or find a way around Rice's theorem, etc.) to write a single correct program and prove it so. A whole other story is if you want a decision procedure for whether an arbitrary program in a Turing-complete language satisfies an arbitrary specification. Such a decision procedure would be nice to have but, alas, provably cannot exist.
To give an analogy, consider these two apparently contradictory facts:
(0) Real number [assuming a representation that supports the usual arithmetic operations] equality is only semidecidable: There is a procedure that, given two arbitrary reals, loops forever if they are equal, but halts if they aren't. You can't do better than this.
(1) You can easily tell that the real numbers 1+3 and 2+2 are equal. [You can, right?]
The “contradiction” is resolved by noting that the given reals 1+3 and 2+2 aren't arbitrary, but were chosen to be equal right from the beginning, obviating the need to use the aforementioned procedure.
The way around the Halting problem is very similar: Don't write arbitrary programs. Rather, design your programs right from the beginning so that you can prove whatever you need to prove about them. Typically, this is only possible if the program and the proof are developed together, rather than the latter after the former.
This much math you need to know if you want to lecture others on the Internet. :-p
> The way around the Halting problem is very similar: Don't write arbitrary programs. Rather, design your programs right from the beginning so that you can prove whatever you need to prove about them.
The system you outline works perfectly as long as you limit yourself to provable assertions. But Gödel's theorems have led to examples of unprovable assertions, thus moving beyond the hypothetical into a realm that might collide with future algorithms, likely to be much more complex than typical modern algorithms, including some that may mistakenly be believed to be proven correct.
If we're balancing a bank's accounts, I think we're safe. In the future, when we start modeling the human brain's activities in code, this topic will likely seem less trivial, and the task of avoiding "arbitrary programs," and undecidable assertions (or even recognizing them in all cases), won't seem so easy.
> This much math you need to know if you want to lecture others on the Internet.
My original objection was completely appropriate to its context.
> The system you outline works perfectly as long as you limit yourself to provable assertions.
Of course. And if a formal system doesn't let you prove the theorems you want, you would just switch formal systems.
> But Gödel's theorems have led to examples of unprovable assertions, thus moving beyond the hypothetical into a realm that might collide with future algorithms likely to be much more complex than typical modern algorithms, including some that may mistakenly be believed to be proven correct.
Programming is applied logic, not religion. And a proof doesn't need to be “believed”, it just has to abide by the rules of a formal system. In some formal systems, like intensional type theory, verifying this is a matter of performing a syntactic check.
> If we're balancing a bank's accounts, I think we're safe.
I wouldn't be so sure.
> In the future, when we start modeling the human brain's activities in code, this topic will likely seem less trivial,
I see this as more of a problem for specification writers than programmers. Since I don't find myself writing specifications so often, this is basically Not My Problem (tm). Even if I wrote specifications for a living, I'm not dishonest or delusional enough to charge clients for doing something I actually can't do, like specifying a program that claims to be an accurate model of the human brain.
> and the task of avoiding "arbitrary programs," and undecidable assertions (or even recognizing them in all cases), won't seem so easy.
I don't see what's so hard about sticking to program structures (e.g., structural recursion) for which convenient proof principles exist (e.g., structural induction).
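As a small illustration (the types and names are mine): a function defined by structural recursion over a list, whose termination follows by structural induction, assuming the list is finite and acyclic, and whose correctness proof can follow the same shape.

    #include <stddef.h>

    struct node {
        int          value;
        struct node *next;
    };

    /* Structural recursion: the only recursive call is on the tail, a strictly
       smaller piece of the input, so induction on the list structure gives both
       termination and a template for proving properties of the result. */
    int list_sum(const struct node *xs)
    {
        if (xs == NULL)                         /* base case: empty list */
            return 0;
        return xs->value + list_sum(xs->next);  /* inductive step        */
    }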
Please, get yourself a little bit of mathematical education before posting your next reply. You're so stubbornly wrong it's painful to see.
> If we're balancing a bank's accounts, I think we're safe.
You seem to be contradicting your original claim, that it is impossible to prove any non-trivial program correct. Is this because you believe the program to balance a bank's accounts is trivial? Do you also believe that a C compiler is trivial?
I guess we never defined what you meant by "non-trivial." If your position is that the vast majority of software in the world today is "trivial," sure, I'd agree with you. But that's not really a meaning of "trivial" that anyone expected.
> "Alan Turing proved in 1936 that a general algorithm to solve the halting problem for all possible program-input pairs cannot exist."
You seem to be mixing up where the "for all" lies. The result is:
There does not exist a procedure that, for all (program, input) pairs, the procedure will say whether the program halts for that input.
{ p | ∀(f,i): p(f,i) says whether f(i) will halt } = ∅
You are using it as if it said:
For all (program, input) pairs, there does not exist a procedure that will say whether the program halts for that input.
∀(f,i): { p | p(f,i) says whether f(i) will halt } = ∅
"post evidence (as I have done)"
What is at issue is your incorrect assertions about what your evidence says. We read it, it contradicts you, everyone points this out, and you just ignore it and get louder.
> You seem to be mixing up where the "for all" lies.
Not at all. My original objection was to the claim that a program could be proven correct, without any qualifiers. Any program, not all programs. As stated, it's a false claim.
Your detailed comparison doesn't apply to my objection, which is to an unqualified claim.
> everyone points this out ...
Again, this is a logical error. Science is not a popularity contest, and in this specific context, straw polls resolve nothing.
> you just ignore it and get louder.
I wonder if you know how to have a discussion like this, in which there are only ideas, no personalities, and no place for ad hominem arguments?
(∃p ∈ P s.t. ¬(p ⊢ p ∈ C(halts)) ∧ ¬(p ⊢ p ∉ C(halts))) → ((p,s ⊢ p ∈ C(s)) → p is trivial)
Please supply a proof.
Note that it's quite true that
(∃p ∈ P s.t. ¬(p ⊢ p ∈ C(halts)) ∧ ¬(p ⊢ p ∉ C(halts))) → ((p,s ⊢ p ∈ C(s)) → *s* is trivial)
proved usually by reducing s to "halts" for any non-trivial s.
> Again, this is a logical error. Science is not a popularity contest, and in this specific context, straw polls resolve nothing.
I am not deciding it by poll. I am pointing out that there are objections to your reasoning that you have not addressed and persist in not addressing - despite them being made abundantly clear, repeatedly.
> I wonder if you know how to have a discussion like this, in which there are only ideas, no personalities, and no place for ad hominem arguments?
You're accusing me, as a person, of not understanding that a discussion like this has no place for accusations against a person? My complaint was not about you as a person, but about the form of your arguments in this discussion.
Can you clarify your last sentence? What do you mean by "preference for operational reasoning"? What's an example of the more effective methods for reasoning about programs at scale that you're contrasting this to?
“Operational reasoning” means using execution traces (the sequence of states undergone by a process under the control of your program) to derive conclusions about the program. You can do it in your head, using a piece of paper, using a debugger, using strategic assertions, etc.
By “more effective methods for reasoning about programs at scale”, what I mean is analyzing the program in its own right, without reference to specific execution traces. There are several kinds of program analyses that can be performed without running the program, but, to the best of my knowledge, all of them are ultimately some form of applied logic.
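A small, hypothetical C illustration of the difference: rather than tracing particular executions, you state an invariant and check that the loop establishes and preserves it.

    #include <stddef.h>

    long array_sum(const int *a, size_t n)
    {
        long total = 0;
        size_t i = 0;
        /* Invariant: total == a[0] + ... + a[i-1]  and  0 <= i <= n.
           It holds on entry (empty sum, i == 0), each iteration preserves it,
           and on exit i == n, so total is the sum of the whole array; no
           execution trace is needed. */
        while (i < n) {
            total += a[i];
            i++;
        }
        return total;
    }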
It's a guess, but I think GP uses "operational" as a synonym for "imperative". I think I encountered the term "operational semantics" in the context of Prolog programming. Denotational semantics is a term for a kind of semantics that is mathematically pure. IIRC, Prolog has both operational and denotational semantics defined, and they conflict. There's something about the cut operator and tree pruning, but I read this long ago :-)
> It's a guess, but I think GP uses "operational" as a synonym for "imperative".
Not really. It's also possible to reason non-operationally about imperative programs, e.g., using predicate transformer semantics. It's also possible to reason operationally about non-imperative programs.
The Halting Problem helps us prove programs because it informs us that some functions cannot be computed. This means that the requirement spec for some computations cannot be implemented. When we spot this problem in a requirement spec, we instantly know that its proposed implementation is an incorrect program, no matter what that program is. We don't have to look at a single line of code.
But the halting problem for a subset of programs isn't The Halting Problem, which is the question whether a UTM can exist to calculate any function, over its entire domain.
The parent's wording was a little sloppy, but the point was that the grandparent was incorrectly using the Halting Problem to argue that you can't even do it for a meaningful subset of programs, which of course does not follow.
Here is something else: provably correct code can demonstrate unacceptable complexity for some input cases. We know from the proof that it will halt and the result will be correct: in 200 years on the given hardware. You have to add a time bound as a requirement specification to avoid this, or a constraint on the complexity (e.g. "the implementation must use no algorithm slower than quadratic time in such and such inputs").
I've noticed that "low-level" languages like C, or for a more extreme example, assembly languages, tend to encourage simpler code, if for no other reason than that everything takes many more lines than in higher-level languages. E.g. if you want to do OOP you'll have to allocate memory and call constructors manually, pass explicit 'this' parameters into all method calls, declare virtual function tables, etc. Certainly there are cases where this tempts people into too-easy solutions like fixed-size overflow-prone buffers instead of dynamically allocated strings, but on the other hand it forces you to think more about whether you actually need to do something ("should this string ever be more than X bytes long? Maybe not, might as well use a fixed-size array - but I should not forget to check that length!"), which I think is overall a good thing.
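For anyone who hasn't done it by hand, here is roughly what that manual OOP looks like in C; every name below is invented for illustration.

    #include <stdio.h>
    #include <stdlib.h>

    struct shape;                              /* forward declaration          */

    struct shape_vtbl {
        double (*area)(const struct shape *self);   /* a "virtual" method      */
    };

    struct shape {
        const struct shape_vtbl *vtbl;         /* hand-rolled vtable pointer   */
    };

    struct circle {
        struct shape base;                     /* "inheritance" by embedding   */
        double radius;
    };

    static double circle_area(const struct shape *self)
    {
        const struct circle *c = (const struct circle *)self;
        return 3.141592653589793 * c->radius * c->radius;
    }

    static const struct shape_vtbl circle_vtbl = { circle_area };

    static struct circle *circle_new(double radius)   /* manual "constructor"  */
    {
        struct circle *c = malloc(sizeof *c);
        if (c) {
            c->base.vtbl = &circle_vtbl;
            c->radius    = radius;
        }
        return c;
    }

    int main(void)
    {
        struct circle *c = circle_new(2.0);
        if (!c) return 1;
        printf("area = %f\n", c->base.vtbl->area(&c->base));  /* explicit this */
        free(c);
        return 0;
    }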
It also makes you think upfront about the design much more. With all the intricacies of writing correct low-level code you really want to write it once only and not try out different designs and refactor between them, where mistakes quickly happen.
Of course this is a disadvantage if you don't have sufficient experience in the problem domain. Sometimes I resort to writing prototypes in a script language in that case, translating the best design to the low-level language.
A language that is "write it right the first time, refactoring is painful" is a poor fit for most problems. Requirements change, if your software can't be adapted easily your project will fail. The general failure of waterfall-style software development is a strong indicator that finding the "right" design up front usually doesn't work out.
> on the other hand it forces you to think more about whether you actually need to do something ("should this string ever be more than X bytes long? Maybe not, might as well use fixed-size array - but I should not forget to check that length!"), which I think is overall a good thing.
While that certainly can't be a bad thing, you just have to keep in mind that it forces you to think about a specific set of details – the ones that matter for the machine ("how many bytes will the binary representation of this username require") – and it may fool you into forgetting to think about higher-level concerns ("Does å compare equally to å?").
Au contraire my friend, I've seen C code with "functions" that are >1000 lines long, probably ~10 parameters with lovely names like `lfp`, and where probably 40% of those lines are #ifdef and #endif.
The same code would be much simpler with C++ templates, but "C being simple" really translates into "C being limited". Then you have people who never learned how to use macros, plus an unfounded and misunderstood fear of goto, which leaves you with long runs of error-checking code, the same if statements duplicated character for character, where a goto and a single error handler would have worked.
Perhaps this is an exception, but low-level in my case did not correspond to simple code.
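The "goto and a single error handler" pattern mentioned above, sketched with placeholder work (the function, buffer size, and error codes are invented):

    #include <stdio.h>
    #include <stdlib.h>

    int process_file(const char *path)
    {
        int rc = -1;                  /* pessimistic default          */
        char *buf = NULL;
        FILE *f = fopen(path, "rb");
        if (!f)
            goto out;

        buf = malloc(4096);
        if (!buf)
            goto out;

        if (fread(buf, 1, 4096, f) == 0)
            goto out;

        /* ... do the real work here ... */
        rc = 0;                       /* success                      */

    out:                              /* one shared cleanup path      */
        free(buf);                    /* free(NULL) is a no-op        */
        if (f)
            fclose(f);
        return rc;
    }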
In an extreme example, I replaced a 10,000 line C method with one template of less than a page of C++ code. It was marshaling different structures for transmission over a mailbox port. All that code copied and pasted and one or two numbers changed depending on the structure size.
Yeah, let's use copy & paste and C preprocessor macros as our abstraction tools ...
Avoiding unnecessary abstractions is important but at the same time those abstractions were invented for a reason. Basically the Go vs generics debate, except even more rudimentary. It's fine if your code doesn't need those abstractions, it sucks badly if it does (eg. gobject)
Depends on your definition of rudimentary. The C preprocessor sucks in general, but you can use it to implement all sorts of abstractions Go basically isn't capable of.
If replacing standard high-level programming concepts like generics/templates with C macros is your idea of a productive programming environment, then I don't know what to say.
Just look at any C collection library for things like maps compared to C++.
I think C does not make you write simpler code. It makes people take short-cuts. Static allocations, hand-rolled linked lists, skipped return value checks or boundary checks, etc. are all typical of C programs.
> Although developers skipping return value checks is true in most languages.
Yes, which is why I disagree with Go's approach for error codes as return values. I'm not saying that exceptions are the right solution everywhere, but if you return a error code, you should make it pretty hard to ignore it. In languages with Algebraic Data Types it's very easy to use Maybe or a Error(code) | Success(result) type. You've then got to explicitly deconstruct that type to get the result, so it's a bit harder to just forget handling the error.
#1 The submitted article is very good.
#2 I'd add to it: avoid pointer arithmetic when possible. Use array notation instead. Typically the difference in execution speed is zero to negligible.
#3 I'd really like a pragma that forces an exception for unchecked return values. Something like
int DoSomething(/* whatever */) #pragma exit "unchecked non_zero"
Meaning if DoSomething() doesn't return zero the program bails.
I think it's mostly that C's lack of language features removes distractions. You have only a couple abstractions to work with if you're writing idiomatic C code, and so you can focus on the task at hand more easily.
The tradeoff is that the task at hand tends to have more implementation details in it than in higher level languages, but as you said, this isn't always bad.
I think it only looks that way because people don't compare like with like. x lines of code C may well be simpler than x lines of a given alternative. But http://qdb.us/301922
Or some obscure library found on github that's effectively become abandonware. Now you've just inherited a legacy codebase to maintain because some of your coworkers wanted to use a small subset of its functionality!
>The first rule of C is don't write C if you can avoid it.
Man, this is just lame. C is a really nice, useful language, especially if you learn the internals. It requires discipline and care to use effectively, but it can be done right. The problem is that a lot of C programmers only know C and can't apply higher level concepts or abstractions to it because they've never learned anything else. This makes their code harder to maintain and easier to break.
He's not saying that C isn't useful, but that for most problems it's overkill. Why inflict the extra burden on yourself -for no benefit-? Most code being written can be done just as well in a language with garbage collection, that has more powerful abstractions and paradigms as first class citizens...why not take advantage of them? Of course you can write those higher level concepts/abstractions in C, but the point is, -they've already been written-, tested, and documented, by other people. Why write untested, undocumented code that reimplements those things when you have no compelling reason to?
That is some really good perspective. The incorrect C program is a bit of an externality that's felt by QA/testing/customers/etc.
I liken programming w/Rust to programming with C or C++ with warnings and a static analyzer barring object files from being created. If I were new to C/C++, I would probably find it baffling how anyone ever gets it to work.
Actually, it's the very process of “fighting the compiler” (as you call it, though I prefer to view it as collaboration between a human who knows what he wants and a program whose logic codifies what is possible) that makes it easier to write correct programs in Rust/Haskell/ML/$FAVORITE_TYPEFUL_LANGUAGE.
I think there's a time before you learn how to make that collaboration productive, for a given tool, which is more aptly termed "fighting" and which you should certainly be able to get past. After that, you can have disagreements but they are far more productive.
This seems an unwarranted dig against a poorly specified group of people.
The notion that a type system is "just" logic and so should be trivially followed by anyone "smart enough" or "careful enough" is laughable - there are many different type systems (sometimes with different details depending on the options or pragmas you feed to your tooling), and it can take time to get a sense of where the boundaries are.
The notion that you're not likely to find a type system helpful if you have trouble with it initially is harmful.
I would've never said "just". Formal logic is a big f...reaking deal.
> so should be trivially followed by anyone "smart enough" or "careful enough" is laughable
I never said anything about being "smart" or "careful". I only said that hacker types, for whom the ability to subvert anything anytime is a fundamental tenet, tend to have a hard time sticking to the rules of a formal game, whether it is a type system or not. Just look at the reasons Lispers give for liking Lisp.
> I would've never said "just". Formal logic is big f...reaking deal.
"Just" was in no way meant to diminish significance; it was meant to draw attention to your improper limiting of scope. Someone "fighting the compiler" is not fighting "executable embodiments of formal logic", but "executable embodiments of formal logic and a pile of engineering decisions". Often, enough of those decisions are made well that the tool can be phenomenally useful; some are made less well, and some simply involve tradeoffs. It doesn't have the... inevitability that your wording carried. Programmers coming from a context where enough of those decisions have been made differently are likely to get tripped up for a while in the transition, without harboring any objections to logic.
> I never said anything about being "smart" or "careful". I only said that hacker types, for whom the ability to subvert anything anytime is a fundamental tenet, tend to have a hard time sticking to the rules of a formal game, whether it is a type system or not.
True, but you left it up to us to figure out what attributes of "hacker types" were relevant. As there are many different uses of the word "hacker", it was quite unclear that you meant to refer specifically to inability to stick to rules.
While people of that description may well persist in "fighting the compiler" for longer, that is not most of what I've observed when I've observed people struggling with a type system, and I reiterate my assertion that those new to a particular tool and used to doing things another way also frequently struggle for a period.
While I agree with your sentiment, I think this is not true in general. The Rust compiler sometimes rejects correct programs, in which case it can be harder to write it in Rust than in C.
In the absolute worst case, you can pretty much transcribe your C program into unsafe Rust, though you will be fighting the syntax, which was deliberately made cumbersome to discourage programmers from writing unsafe code.
In practice, I have never seen any programs where this is the best approach. Coming up with designs that don't need unsafe code is very useful, because the safe subset of Rust is vastly easier to analyze and understand, both for humans and computers.
Sometimes C compilers can't even agree on what counts as "correct"! With Rust, you can at least be reasonably sure that if your program compiles with your version of the compiler, it will continue to do so with all later versions, because they're coming from the same codebase and made by people who care about backward compatibility.
> Sometimes C compilers can't even agree on what counts as "correct"!
This comes primarily from the fact that the C standard itself is hard to interpret right. Standard ML has fewer issues in this department: since the Definition is clearer and less ambiguous, implementations disagree less on what counts as correct code and what its meaning is. So having only one major language implementation isn't the only way to prevent compatibility issues.
C is a much simpler language, that's all. I understand it's easier to write a safe program in Rust. But to just pick up a language and go, C is easier in my mind.
It's probably easier to write CS101 prime-number or fibonacci programs in C.
Any nontrivial bit of software? You're going to have to learn about undefined behavior and deal with it. Rust forces you to learn this before getting started.
Plus, Rust has a lot of higher-level (still zero-cost) abstractions, much like Python, which make a lot of patterns really easy to program. (And Python is _the_ teaching language these days.)
Coming from JavaScript, Java and Python, Rust has a steep learning curve, but with thought and patience you can learn to write Rust. I am getting much better at thinking about ownership before even compiling now. If some silly webdev like me can hack it, other programmers can do it too.
There are even languages with deterministic destruction which solves nearly all resource leak problems, and at least one of them is directly compatible with C and has nice metaprogramming facilities. C++ it is.
If I make a mistake in Python, C#, Java, Swift, Go, Ruby, Javascript, Haskell, D, Rust, et al I don't leak arbitrary contents of memory, smash the stack, overrun a buffer, or otherwise provide huge gaping security holes that allow arbitrary code injection.
C is riddled with opportunities to do _just that_ with even the slightest mistake.
Yes, I can make a logic error in any of the above languages that performs some operation a user shouldn't actually be allowed to perform. But I can't accidentally let the user read or write arbitrary memory locations.
With the exception of unsafe blocks or similar mechanisms, that is; and because those are limited to explicit scopes, most programmers' natural laziness leads to avoiding unsafe behavior whenever possible. It also greatly limits the scope of potential bugs, as well as the testing and code review required.
> But I can't accidentally let the user read or write arbitrary memory locations.
Many of the languages you mention contain an 'eval' method, and most use environment variables in less than transparent ways. I wouldn't be so sure of this claim.
I'd argue that it can't be done right for non trivial programs without enormous effort. Even software written by expert C programmers with years and years of experience contains bugs that are caused by C's unsafe features.
I'd agree, there was a great book called Deep C Secrets by Peter van der Linden. The book itself is not without errors but still provided some useful insights to things I didn't realize I was doing wrong back when I coded in C.
Modula-2 was released in 1980 - a Pascal successor which offers faster compilation speeds (precompiled modules), executables as fast as C-built ones, and a very safe type system avoiding most common errors in C programs. It is sad it was mostly ignored for so long. Fortunately, Go picks up its heritage these days.
- Using fixed width integers in for loops seems like a fabulous way to reduce the portability of code
- the statements in "C allows static initialization of stack-allocated arrays" are _not_ equivalent: one is a bitwise zeroing while the other is arithmetic initialization (see the sketch after this list). On some machines, changing these statements blindly will cause a different bit pattern to end up in memory (because there are no requirements on the machine's representations for e.g. floating-point values or NULL or ..). There are sound reasons why the bitwise approach could be preferred, for example, because a project has debug wrappers for memset that clearly demarcate uninitialized data
- the statements in "C99 allows variable length array initializers" aren't even slightly equivalent. His suggestion uses automatic storage (and, consequently, a stack overflow triggered by user input, a.k.a. a security bug)
- "There is no performance penalty for getting zero'd memory" this is bullshit. calloc() might optimize for the case where it is allocating from a page the OS has just supplied but I doubt any implementation ever bothered to do this, since it relies on the host OS to always zero new pages
- "If a function accepts arbitrary input data and a length to process, don't restrict the type of the parameter." the former version is in every way more self-documenting and consistent with the apparent function of the procedure than his use of void. It also runs counter to a rule from slightly more conservative times: avoid void at all costs, since it automatically silences all casting warnings.
Modern C provides a bunch of new things that make typing safer but none of those techniques are mentioned here. For example word-sized structs combined with struct literals can eliminate whole classes of historical bugs.
On the fixed-width integers thing, the sizes of 'int', 'long' and 'long long' are designed to vary according to the machine in use. On some fancy Intel box perhaps there is no cost to using a 64-bit type all the time, but on a microcontroller you've just caused the compiler to inject a software arithmetic implementation into your binary (and your code is running 100x slower too). He doesn't even mention types like int_fastN_t, which are designed for exactly this case, despite explicitly advising against writing C unless you can't avoid it, which in 2016 pretty commonly means you're targeting such a device
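A hedged sketch of the alternative (the function name is invented): let the C99 "fast" types pick a width suited to the target instead of hard-coding one.

    /* On a 16-bit microcontroller, uint_fast16_t can stay a single machine
       word, while a hard-coded uint32_t counter would drag multi-word
       arithmetic into the loop. */
    #include <stdint.h>

    uint32_t checksum(const uint8_t *buf, uint_fast16_t len)
    {
        uint32_t sum = 0;
        for (uint_fast16_t i = 0; i < len; i++)   /* at least 16 bits, as wide as is fast */
            sum += buf[i];
        return sum;
    }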
Advice presented along the lines of "there are no reasons to do anything but that which I recommend" is usually somewhat flawed... just because there are so many reasons to do different things in different situations.
Regarding calloc (which indeed might be slower than malloc contrary to TFA) - a variable that should have been initialized to something but instead keeps zero bits written by calloc is not necessarily a great situation. If you use malloc, at least Valgrind will show you where you use the uninitialized variable. With calloc it won't, though it might still be a bug - a consistently behaving bug, but a bug nonetheless. Someone preferring consistently behaving bugs to inconsistently behaving bugs in production code might use a function my_malloc (not my_calloc...) which in a release build calls calloc, but in debug builds it calls malloc (and perhaps detects whether it runs under Valgrind and if not, initializes the buffer not with zeros, but with deterministic pseudo-random garbage.)
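A sketch of that my_malloc idea under the assumptions above (the MY_MALLOC_SCRIBBLE switch and the constants are mine); the run-time Valgrind check is left out, and the comment explains why it would matter.

    #include <stdlib.h>
    #include <stddef.h>

    void *my_malloc(size_t size)
    {
    #ifdef NDEBUG
        return calloc(1, size);            /* release: consistent zeroed memory  */
    #else
        unsigned char *p = malloc(size);   /* debug: Valgrind can still flag
                                              reads of uninitialized bytes       */
    #ifdef MY_MALLOC_SCRIBBLE              /* hypothetical opt-in switch          */
        /* Scribbling makes bugs deterministic, but it also marks the bytes as
           "defined" for Valgrind; that is why the comment above suggests doing
           this only when not running under Valgrind. */
        if (p) {
            unsigned seed = 0x5eedu;
            for (size_t i = 0; i < size; i++) {
                seed = seed * 1103515245u + 12345u;   /* tiny LCG, not crypto */
                p[i] = (unsigned char)(seed >> 16);
            }
        }
    #endif
        return p;
    #endif
    }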
Another thing is that both calloc and malloc may return a non-null pointer for zero-size arguments (and that can be very bad). my_alloc should check for zero-size arguments and return NULL in that case. (The problem is that if your size computation overflowed, you will get a very small non-null memory block, which you are then very likely to overrun.)
Calloc has another benefit: if you let it compute the size as calloc(n, sizeof(kuku)), it will detect integer overflow in the multiplication (this can be very important for security-sensitive code that deals with untrusted input). If you did the computation yourself beforehand and passed calloc(size, 1), then that multiplication and check is just another few cycles wasted.
I wonder if they could have done a standard library allocation function like calloc that does the multiplication with overflow detection, does not zero out memory and that returns NULL on zero arguments - but that would be too much to ask.
Actually, OpenBSD has something like this: mallocarray [1] and reallocarray [2] (at last, an overflow-checked realloc). And here [3] they mention the use of these functions in LibreSSL, the OpenBSD port of the httpd server, etc.
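In the same spirit, a sketch of the wished-for allocator from a couple of comments up (the function name is mine, and this is not OpenBSD's actual mallocarray, which may treat zero-sized requests differently): overflow-checked multiplication, no zeroing, NULL on zero arguments.

    #include <stdlib.h>
    #include <stdint.h>
    #include <errno.h>

    void *malloc_array(size_t nmemb, size_t size)
    {
        if (nmemb == 0 || size == 0)
            return NULL;                   /* refuse zero-sized requests       */
        if (nmemb > SIZE_MAX / size) {     /* nmemb * size would overflow      */
            errno = ENOMEM;
            return NULL;
        }
        return malloc(nmemb * size);       /* no zeroing, unlike calloc        */
    }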
Yeah in fact using non-fixed width integers reduces portability!
I ran into a bug where I was running some Arduino code (16-bit) on an mBed (32-bit). The use of `int` rather than `uint16_t` caused a timeout to become an infinite loop.
I'm pretty sure it would be impossible to have portability issues due to using fixed-width types - I mean their entire point is to ensure portability.
For one, if you write uint_least16_t, you're implicitly warranting that the code does the right thing if it isn't 16 bits, which means you have to think about it. And C's integer conversion rules and overflow restrictions are typically quite a lot to think about already... Not the strongest argument, but I think there is a case for applying YAGNI.
Well, the assumption you have typically been making, when your unsigned loop counter is causing problems because it is larger than expected, is that the counter will exhibit a particular behavior when it over- or underflows. IIRC, this is unspecified in the standard and you can't write portable code making such assumptions.
I can see how the type changing to a smaller one might cause problems, but I don't see how IshKebab's example could happen without exploiting implementation specific overflow behavior.
> I can see how the type changing to a smaller one might cause problems, but I don't see how IshKebab's example could happen without exploiting implementation specific overflow behavior.
INT_MAX (or INT_LEASTN_MAX for annathebannana's suggestion) doesn't require exploiting overflow behaviour, but going from 2^15 iterations to 2^31 or 2^61 iterations may be problematic.
Problematic, yes, but not the difference between a terminating loop and an endless loop, which is what he said was the result of the size change. If that actually is the case, I'd guess the problem was that the compiler for the latter platform was removing a compare-against-zero loop condition, based on the assumption that a value that only ever increments can only be zero once (using the unspecified nature of overflows to its advantage), while the former compiler did not.
And using INT_MAX or similar for what sounds like a timing loop is a whole other can of bad practice. Then the problem isn't that you used the wrong type, it's that you used the wrong value.
> Problematic, yes, but not the difference between a terminating loop and an endless loop
If the loop does significant work and was calibrated for an expectation of 65k iterations, stepping to 2 billion (let alone a few quintillion) is for all intents and purposes endless.
> And using INT_MAX or similar for what sounds like a timing loop is a whole other can of bad practice. Then the problem isn't that you used the wrong type, it's that you used the wrong value.
No objection there, doing that is making invalid assumptions; my point is that moving to an exact-size integral type does fix it.
> If the loop does significant work and was calibrated for an expectation of 65k iterations, stepping to 2 billion (let alone a few quintillion) is for all intents and purposes endless.
And if it's not, it's not; the point being that "endless" in this case would be meaningless outside its literal meaning unless we know more about the specific case.
> No objection there, doing that is making invalid assumptions; my point is that moving to an exact-size integral type does fix it.
No, using an exact value fixes it. Any unsigned integer type is just fine for any integer value from 0 to 65535. If you change the type to a larger integer type without changing the supposed iteration count, the code would not have this problem, and if you changed the value to something higher than 65535 without adjusting the size of the type, you would have a different problem. Thus, the problem described here does not pertain to the type of the variable used.
The only issue I can think of is that arithmetic on a uint16_t will be signed on a 32-bit target and unsigned on a 16-bit target. I think the only possible way to get undefined behavior would be to left shift by enough to overflow the signed integer on the 32-bit target, which should be rare (and might not be well defined on the 16-bit target; I don't have the spec in front of me right now).
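A small example of the promotion trap being described here (the widening cast is the standard way around it):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t a = 0xFFFFu;
        /* 'uint32_t r = a * a;' would promote both operands to (signed) int on a
           target where int is 32 bits, and the product overflows INT_MAX, which is
           undefined behaviour.  On a 16-bit-int target the operands promote to
           unsigned int instead and simply wrap.  Widening one operand first keeps
           the multiplication unsigned everywhere. */
        uint32_t r = (uint32_t)a * a;
        printf("%lu\n", (unsigned long)r);
        return 0;
    }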
You "know the range", but not necessarily in terms of concrete numbers. Often, you iterate over variable size data structures, for example. All you know is that it won't be larger than the C implementation it's being compiled with allows--which usually roughly means that it won't be larger than the amount of memory the processor it's running on can address. But how much that actually is? Well, depends on the processor, and there also is no upper limit, as you always can invent a processor that can address more memory than any processor before. That's why you should usually use appropriate abstract types (such as size_t in this case) that the compiler will make sure are large enough to be able to iterate over any data structure you will encounter on that target.
Oh sure, for array index stuff like that. For general iteration against another number, say, I would use a fixed (or fast) size appropriate to my data.
It will, that's guaranteed by the standard. If it's not a standards compliant compiler/runtime/platform, you've got bigger problems.
>> If you are working with arrays, always use size_t if not, use [u]int_leastN_t
Sure, on array indices, size_t is appropriate.
I can't see an advantage in using 'least' as standards dictate that 8/16/32/64 bit types are available.
>> in 99% of the cases, except if you are working with network protocols, [u]intN_t is a bad choice
No, it's the best choice in most cases because it makes the code a little more explicit and easier to understand, and it makes developers think about the range of the data you're using.
The sized types are optional, since they are more restrictive than the ordinary types - see section 7.20.1.1 in the standard (http://port70.net/~nsz/c/c11/n1570.html#7.20.1.1). To summarize: intN_t is exactly N bits, with no padding bits and a two's complement representation (one sign bit); uintN_t is exactly N bits with no padding bits; the implementation must provide such typedefs for all such types it supports.
So, if a system is 1's complement, you won't get the intN_t types at all. If the system doesn't support a particular bit width, you won't get the types for that width.
I am playing the language lawyer game - there is nothing to call out. Is what I say justified by the standard? Yes, or no? There are no other issues involved ;)
I have no idea if C99 runs on any 1's complement systems. I don't know what end-around carry is. I don't care what it is. I don't care about any of this 1's complement nonsense. But the standard says it exists, and must be taken into account, and, therefore, by the rules of the game, I am obliged to assume these things.
What if your array is never going to be larger than, say, UINT8_MAX, and you need to pack thousands of these "sizes" into an array of structs? Using an 8-byte size_t in that case will inflate the cache usage significantly.
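A hedged illustration of that footprint difference (the struct and its fields are invented for the example):

    #include <stdint.h>
    #include <stdio.h>

    struct counts_small { uint8_t used, freed, dirty, flags; };  /* 4 bytes          */
    struct counts_big   { size_t  used, freed, dirty, flags; };  /* 32 bytes on LP64 */

    int main(void)
    {
        printf("array of 4096 small: %zu bytes, big: %zu bytes\n",
               sizeof(struct counts_small) * 4096,
               sizeof(struct counts_big)   * 4096);
        return 0;
    }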
You've written a list of coding practices which assure job security for folks like me, who have to undo the gross portability problems, bugs, and security vulnerabilities these result in.
for (int i = 0; i < 100000; ++i) // most optimal int size used
You're wrong about the arrays, the standard guarantees that zeroing the bytes of an object of integer type will give a 0 value to the integer. Pointers are a different matter though.
The initialisers bit seems downright dangerously wrong. This code:
uint32_t array[10] = {0};
Does not initialise every element to 0 in the way it would seem to. To see the difference, compare what you get when you a) remove the initialiser and b) replace it with {1}.
When you replace the initialiser with 1, it should only initialise the first element to 1. Objects initialized in this way have unmentioned elements set to 0 or NULL (recursing into aggregates, initialising the first named union member). (See C11 standard, 6.7.9.21; 6.7.9.10 gives the rules for selecting the default value. Aside from the extra syntax, I don't think the rules differ substantially from C89...)
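A quick, compilable illustration of that rule:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t a[10] = {0};  /* first element explicitly 0, the rest implicitly 0 */
        uint32_t b[10] = {1};  /* first element 1, the rest implicitly 0 (not all 1s) */
        printf("a[9]=%u b[0]=%u b[1]=%u\n",
               (unsigned)a[9], (unsigned)b[0], (unsigned)b[1]);  /* prints 0 1 0 */
        return 0;
    }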
C++ lets you do "uint32_t array[10]={}", which is something C should allow as well, really. But it doesn't.
I was going to remark on this. I've personally run into this writing some C that was built on multiple platforms, one of which had c99 and one of which did not (c89 instead).
I think the real issue here is the inconsistency between C standards on details like this. If you can always assume that your code is built with c99 or later, then use all of its features, but in many cases that's not a realistic assumption.
“If there are fewer initializers in a list than there are members of an aggregate, the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration.”
It does if your target system properly zeros bss because, by the rules, array[10] will be placed in bss and bss is zeroed. If you replace the initializer with {1}, you force the compiler to fall back to a different rule and emit explicit initialization (the first element to 1, the rest to 0), so the array can no longer simply live in zeroed bss.
The problem I've had (and you probably are referring to) is embedded systems that may not properly zero bss. I've also had problems in embedded systems trying to place "array[10] = {0};" into a specific non-bss section (that was really annoying).
In my experience, TFA is good practices generally, but will have problems in corner cases that you run into in deeply embedded systems.
These are wonderful. Concise nuggets of knowledge that you can keep in your head, little warnings of what not to do.
Side note, it's really fun reading this article about modern C given that I'm working on an experimental programming language meant to replace C [1]. Just today I got the Guess Number Game example working, without a dependency on libc [2] (caveat: Linux x86_64 only so far). So, I'm having fun trying to think about how C could be better, reading implementations of various libc functions, and writing my own "standard library".
For example, in my language, there are no `int`, `unsigned char`, `short` types, there are only `i32`, `u8`, `i16`, etc (I shamelessly stole this idea from Rust). This mirrors the article's suggestion to only use int32_t, uint8_t, int16_t, etc. But, since it's a new programming language, we don't have the other types sitting there as a red herring.
While these are not new ideas - why not use s32 rather than i32, and so on? It makes more sense if you also use something like u32.
I think it's important when making another language to really question everything you do in it. Why use this operator, or why use this mnemonic for built in types?
Most of this advice is pretty standard, but this stood out:
> You should always use calloc. There is no performance penalty for getting zero'd memory.
But isn't there? This may be true the first time the program requests a new page from the OS, but when your program starts reusing memory there's going to be a performance hit.
Agreed. I think the OP's claim was a little too strong. The conclusion stands though, which is that calloc is a better default. If you discover that zeroing memory for a particular piece of hot code is slowing you down, then by all means use malloc instead. However, in that situation, you'd probably be better off avoiding allocating memory in the hot path. Allocating memory has unpredictable performance and may involve a syscall and so really, the solution to improving the performance of calloc is probably to preallocate before the hot path and make sure the hot path has no memory allocations or frees.
The OP's claim is definitely too strong. Even if you're not in a hot path, code that immediately entirely fills a chunk of memory with data doesn't benefit from changing the allocation from malloc() to calloc().
It still does, because it future proofs you against a change in the future that separates the allocation from completely filling the memory, and saves you from future programmers who don't realize that only half-filling the memory could have catastrophic security or correctness consequences. It also harmonizes the programmer's mental model (which has generally always assumed that newly-allocated memory is "clean" even when on some level we theoretically know better [1]) with the execution model, which is also a good thing.
I'd very much recommend that advice; use it all the time, remove it only when it proves to be a problem, and then, as always when optimizing, bear in mind that you've chosen to juggle lit sticks of dynamite and prepare accordingly.
[1]: I hypothesize that this is because what "really" comes back from a malloc is basically too impossibly complicated to model by a human. Even calling it "random" isn't correct; it's not random, but what it is is very hard to characterize.
My mental model of memory in C is that it is uninitialized unless otherwise specified. Certainly in higher level languages (e.g., Java, Python, Ruby), we expect memory to be initialized (otherwise garbage collection doesn't work).
Coming to C with a mental model of memory from another language is a recipe for security problems.
Then you use valgrind instead of sweeping bugs under the rug.
The next person may just as likely (or maybe more likely?) have simply forgotten to initialize some particular field, and now you've got a problem if the field is supposed to read 0 most of the time and maybe sometimes 1.
When you're using Valgrind and ASan, the "calloc for safety" advice becomes backwards. calloc defeats such instrumentation, because heap memory isn't uninitialized anymore.
> It still does, because it future proofs you against a change in the future that separates the allocation from completely filling the memory, and saves you from future programmers who don't realize that only half-filling the memory could have catastrophic security or correctness consequences.
But how does replacing that radioactive waste (neat analogy) with zeros improve correctness here?
Which is all to say: It doesn't improve correctness at all. You improve correctness by making your interface difficult to use incorrectly, not by limiting the bad effects of incorrect usage. And any time your code accesses memory that is for all intents and purposes undefined (which it still is, even if you overwrite it with zeros that have no semantic intention behind them), your code is behaving incorrectly by definition.
> I hypothesize that this is because what "really" comes back from a malloc is basically too impossibly complicated to model by a human. Even calling it "random" isn't correct; it's not random, but what it is is very hard to characterize.
I think "uninitialized" is the word you are looking for :)
No, it's not. That deceives programmers into thinking "oh, it's just some random garbage", but it's not. It's not random, and it's not merely "garbage". It's an incredible interplay between the exact allocation patterns of your program and user input, and results in that glossy word "uninitialized" including things like "user passwords", "private keys", "shell code" (that is, attacker-supplied arbitrary assembly code), and arbitrary other non-garbage information that you may not appreciate having around at this point in a program that is completely oblivious to what is in that "uninitialized" memory.
Uninitialized isn't "a mess that you really ought to clean up before using", it's radioactive waste.
If you want to tell me that you really do think of "uninitialized" that way every time you use it, I have no grounds to disagree with you personally... but based on the code in the wild, it does not seem that most C programmers do have a correct understanding of the issue.
(People who develop the proper level of concern about handling radioactive waste tend to find ways to stop programming in languages that don't default to handing back nuclear waste on every memory allocation. I suspect some form of evaporative cooling may be a problem here.)
Yes, but it's more subtle. The argument that there's no performance difference between malloc/calloc is only valid if you have an MMU that will zero pages. Lots of embedded systems don't have such MMUs (or even an MMU at all); therefore, there is a difference between malloc and calloc.
But desktop processors also don't zero pages, do they?
And by zeroing pages, you prevent lazy memory allocation policies which do not allocate the pages until they're read or written for the first time. This can have a significant impact on memory usage, data locality etc.
So I'm quite skeptical that there is no difference between calloc and malloc. Is there more evidences of this somewhere?
How does it prevent such allocation? Can you not just mark the first read from an area to always return 0, and then allocate as soon as it happens? That also has the benefit of not having to zero anything if you're only writing to it.
Internally, an OS like Linux maps every page in a requested piece of memory to a read-only page of zeros. When you attempt to read from this page, it works fine. When you attempt to write to it, the MMU causes a page-fault and the OS inserts a page where the zero-page is, and then continues your program execution. Thus, it doesn't actually have to allocate any memory for you if you never write to the new pages.
But the OS/MMU doesn't distinguish between a regular write, and a write of zero. Thus, if you manually zero every page you get (And thus write zeros to the read-only zero-page), it'll page-fault back to the OS and the OS will have to allocate a new page so that the write succeeds - Even though if you didn't do the zeroing of memory you would have gotten the same effect of having a bunch of zeros in memory, but without having to allocate any new pages for your process.
Kinda. Reading my comment a second time, I'm not exactly happy with my description, since while it's 'right', it is a very simplistic description that ignores some of the finer points.
Since malloc/calloc are generally used for smaller allocations, the chances you can actually avoid allocating some pages you ask for are pretty slim, since a bunch of objects get placed into the same page (and thus writing to any of them will trigger a new page being allocated). There's also no guarantee there aren't headers for malloc's own use, or similar, surrounding your piece of memory, which makes the point moot - just using malloc triggers writes to at least the first page. So while calloc/malloc are kinda compatible with lazy allocation, you really shouldn't rely on it being a thing, and it probably won't matter.
It's worth understanding, but the chances it actually comes into play aren't huge. If your program does lots of small malloc's and free's, then it basically won't matter because you won't be asking the kernel for more memory, just reusing what you already have.
If you care about taking advantage of lazy-allocation for one reason or another, the bottom line is probably that you shouldn't be using malloc and calloc for that then. Just use mmap directly and you'll have a better time - more control, you have a nice page-aligned address to start with, and you can be sure the memory is untouched. malloc and calloc are good for general allocations, but using mmap takes out the guesswork when you have something big and specific you need to allocate.
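A hedged, Linux/POSIX-flavoured sketch of that approach (assumes MAP_ANONYMOUS is available):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64u * 1024 * 1024;              /* 64 MiB, page-aligned region */
        unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("first byte: %u\n", buf[0]);  /* reads can be served by the shared zero page */
        memset(buf, 0xAB, 4096);             /* only pages you actually write get materialised */
        munmap(buf, len);
        return 0;
    }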
Desktop OSs (not processors afaik) zero pages. To not do so would leak information between privilege contexts.
E.g. imagine /etc/passwd is read and later on the page it occupied is put back in the free pool. Another process comes along and asks for memory. It gets that page and can now read /etc/passwd .
Bit lost here, not really a C programmer, by "zeroing pages" are we talking about zeroing out memory? Because I wrote a little C program that just reads memory until it segfaults, usually I can get about 3gb of stuff out, random little bits of files. I've noticed it works much better on OS X than Linux.
- Last time I ran this, in the memory dump, every second character was the letter "f". Like, in the dump, instead of saying "Words" it would say "Wfofrfdfsf". It changes each time, probably to do with my char onechar variable. (Actually now that I think about it, it should probably only be onechar[1] instead of onechar[2]. It was called onechar because I was trying to printf 1 char at a time)
- Before I changed "p" to &memcpy, it was "p = &p", and it would only read my environ before segfaulting.
- This program, on linux, dies almost immediately. It only spits out a few bytes.
First of all, even a simple program like that will have standard libraries mapped into its address space. Your program is reading from a region of the process's address space that holds the binary code for the memcpy() function.
On Linux you can dump a process's address space mapping from the file "/proc/<pid>/maps".
Or to see an example just run: cat /proc/self/maps
That cat command will dump the address space mapping for that instance of the cat command. If you run it multiple times it will show different address ranges because of a security feature (ASLR, address space layout randomization).
Also you shouldn't use fprintf or printf to print 1 char. printf will try to look for format characters in the "string" you are passing it. It would be better to use fputc(*p, stdout), then you don't even need to call memcpy() or the "onechar" array.
If you set "p" to a block of allocated memory, then your program should segfault shortly after reading past the end of the memory. How many bytes you can read past the end of the allocated block of memory depends on how the memory allocator manages pools of memory and it's own bookkeeping information is stored.
Your program doesn't allocate memory, so this is unrelated to memory allocation. You are instead taking the address of a function in memory (memcpy) and then printing out that function's bytes. This isn't a leak in any way, since your program has to be able to reach memcpy in order to call it.
So, no, you aren't printing out bits of files. You're printing out functions that your program has the ability to call.
He's starting somewhere in /usr/lib/system/libsystem_platform.dylib (where memcpy lives) and printing the rest of the executable code his program has mapped in.
It doesn't surprise me that some of this contains public keys and certificates. They are probably used by some OSX networking library.
I can grep for -----BEGIN and find a few public keys and certificates, I doubt they're part of the program (but they might be, as I said I don't really know) http://i.imgur.com/IXgkWNu.png
Even on a modern non-embedded system, you aren't going to benefit from the MMU when calling calloc(). Your pages are only going to be zero'd when you first get them from mmap() or brk()/sbrk().
If you're sitting in a loop calling malloc(), filling it with data, processing it, and calling free(), changing malloc() to calloc() is just wasting cycles.
In fact, brk and mmap aren't even really guaranteed to return zero filled memory. It just happens that kernels often zero everything to avoid the headache of tracking which pages contain data you aren't supposed to see.
I guess one can rely on such behavior when writing a platform-specific libc. But in portable code? Yuck.
1. There is nothing inherently wrong in "leaking" a page which used to cache some world-readable file or belonged to a process you'd be able to ptrace anyway.
2. One could imagine this being configurable for embedded or HPC systems which don't care about security, though I'm not aware of people actually doing that.
3. The memory could equally well be overwritten with 0xdeadbeef and you have no control over that.
4. I once heard that Windows may sometimes fault-in nonzero pages if it runs out of pre-zeroed ones, but I couldn't locate this in MS docs so I'm not quite sure.
MMUs don't zero pages. MMUs don't handle data at all. MMUs handle addresses only, translating from the virtual (software) view to the hardware (RAM) view.
I used to use C a lot, but haven't in a while, so question: isn't there a sub-allocator involved also that could affect the condition of the memory allocated without an explicit zeroing ? Is the memory always allocated directly from the OS ? That seems like it could be slow.
Of course there is, the parent is clueless and spreads misinformation.
A heap allocation can easily contain data that was previously freed in the same process, and hence calloc has to perform some extra work to wipe it.
Ehh, that doesn't apply to what he's talking about though. That only works when the pages are actually freed. Since actually asking the kernel for more memory is expensive, most malloc/calloc implementations will hold-on to some pages since it assumes you're going to allocate more. Since it reuses this memory without ever giving it back to the kernel, the kernel won't zero it for us.
> Ehh, that doesn't apply to what he's talking about though.
Of course it does.
> That only works when the pages are actually freed. Since actually asking the kernel for more memory is expensive, most malloc/calloc implementations will hold-on to some pages since it assumes you're going to allocate more. Since it reuses this memory without ever giving it back to the kernel, the kernel won't zero it for us.
It's what the standard system malloc does, possibly in cooperation with the kernel but not necessarily so (why would it need the kernel involved after all?)
Just as OpenBSD's malloc framework can do way more than trivially request memory from the kernel: http://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man5/.... In fact, since OpenBSD 5.6 it'll junk small allocations on freeing (fill them with 0xdf) by default. And as usual that uncovered broken code…
> When we started junking memory by default after free, the postgres ruby gem stopped working.
Would you mind providing some more information on this? I googled it but I couldn't find any documentation on what you're describing.
To be clear though, the parent commenter (and I) are talking about the pages that malloc keeps around in the process's memory to be reused (And thus are never released back to the OS). Are you saying FreeBSD/OpenBSD/others have a system to tell the kernel when a user process has pages it plans to zero, and then a system for the kernel to notify the process when/if it does? That would be pretty interesting to see, but I've never heard of that being a thing.
> Would you mind providing some more information on this? I googled it but I couldn't find any documentation on what you're describing.
I was talking about https://www.freebsd.org/doc/en/articles/vm-design/prefault-o... but after actually re-checking the docs rather than going from bitrotted memory pre-zeroing only happens for the initial zero-fill fault, so the original calloc is "free" (if there are zeroed pages in the queue) but freeing then re-calloc'ing will need to zero the chunk (unless the allocator decides to go see if the kernel has pre-zeroed space instead I guess). My bad.
> To be clear though, the parent commenter (and I) are talking about the pages that malloc keeps around in the process's memory to be reused (And thus are never released back to the OS).
Yeah so was I, but I was misremembering stuff.
> Are you saying FreeBSD/OpenBSD/others have a system to tell the kernel when a user process has pages it plans to zero, and then a system for the kernel to notify the process when/if it does?
Is all-bits-zero really "safe"? Sure, sometimes it is, but the safest approach is ensure that you've assigned a valid value to any object before you access that value.
If you fail to do that, pre-zeroing might just give you consistent incorrect behavior. It can also inhibit a compiler's ability to detect that you're accessing an uninitialized object. The compiler has no way of knowing whether that zero-initialization was setting it to a meaningful zero value, or just "clearing" it for safety.
It is fairly easy to develop software in such a way that the correctly initialized version of most structures is mostly zero. If you get in the habit of designing your code that way, calloc is clearly superior: you get much faster initialization of your structures (because of SSE or equivalent optimizations for memset) and your code is more maintainable (because extending a structure with a field where a zero value implies unchanged behaviour will just work without changing all the places where the structure is initialized).
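A hedged sketch of that "all-zero is a valid default" style (the struct and its fields are invented):

    #include <stdlib.h>
    #include <stdbool.h>

    struct server_opts {
        bool     verbose;       /* zero => quiet                    */
        unsigned max_clients;   /* zero => use the built-in default */
        char    *log_path;      /* NULL => log to stderr            */
    };

    int main(void)
    {
        struct server_opts *opts = calloc(1, sizeof *opts);
        if (opts == NULL)
            return 1;
        /* Every field already holds its documented default; adding a new field
           whose zero value means "unchanged behaviour" needs no edits here. */
        free(opts);
        return 0;
    }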
If you believe a millisecond is a short amount of time you're probably not writing video games. At 60 fps, you only have 16.67 milliseconds to render each frame, so wasting a couple of milliseconds would have a huge impact on what you'd be able to do for each frame.
Luckily, calloc doesn't actually take milliseconds unless you're making a very large allocation.
I think you're answering to the wrong comment. I'm saying calloc doesn't actually take milliseconds (which would have been absurd), so feel free to use it always.
Calling calloc/free on a 1 MB buffer seems to take about 50 microseconds on my computer (with no optimizations), and I consider that an unusually large allocation.
For an allocation of that size, it's likely that your system's malloc is requesting the allocation directly from the OS's virtual memory system with mmap(2), and freeing it with munmap(2), without ever actually writing a megabyte of zeros. Smaller allocations, however, are likely to require explicit clearing since they will hardly ever require a system call. (It depends on the details of your OS and malloc, of course.)
> Calling calloc/free on a 1 MB buffer seems to take about 50 microseconds on my computer (with no optimizations), and I consider that an unusually large allocation.
Do you actually touch it and write to it? If you don't the allocation may be provisional and not actually happen until you try to write memory, in which case depending on the system you'll either get a COW fault (IIRC linux points the allocation to a zeroed COW page, first time you need to write it'll copy the page and let you write in) or a zero-fill fault (e.g. freebsd doesn't allocate the BSS by default, write access will cause a zero-fill fault and in response the VM will actually allocate the page — which it can do from a queue of pre-zeroed pages)
calloc itself calls memset in user space. That's what separates it from malloc. Is it so hard to believe a modern computer can memset 20 GB/s?
If I run "dd if=/dev/zero of=/dev/null bs=1M count=10000", I get 21.0 GB/s. Reading from /dev/zero is basically performing a memset to dd's buffer in kernel space. Writing to /dev/null is a no-op, so it's a very similar test.
Have you considered reading my comment past the first phrase? If you don't actually touch the page, the system doesn't need to allocate it, let alone zero it. If you don't touch the page, assuming you're benching on linux you're measuring the cost of mapping a preexisting zeroed COW page, not the cost of actually allocating that memory.
calloc is a library function, as opposed to a system call, so its calling memset is no different from the caller doing so. calloc's zeroing is the write in "copy on write". Calling memset on every byte in a buffer should qualify as "touching it".
> Swift (it will eventually cover all use cases with 3.0)
There's nothing in 3.0 that will make Swift any better at safe systems programming than it is today (last I checked, anyway), and Chris Lattner is on the record as saying that anything akin to Rust-style static analysis is explicitly out-of-scope for 3.0.
So, effectively Ada and Rust. Which effectively means Rust.
I was really hoping that the list would be longer than that. :(
> Ada
> Spark
> Rust
Ada (Spark is effectively Ada) and Rust
> D
Nope. Garbage collected doesn't work for systems programming.
> Swift
Not sure I buy this. When I see someone chewing on an OS in Swift I might believe it.
> FreePascal
> Delphi
Dunno about FreePascal, but my anecdotal experiences with Delphi programs have left me wishing that their authors had written them in C. It is, of course, possible that the programming teams were bad, but it sure seems like Delphi isn't any better than C in that respect.
Ah, nice to see such an august set of popular operating systems with GC ... oh, wait ...
All of those operating systems got their ass kicked by a 1970's reject of an operating system rewritten by a college student who didn't know what he was doing.
And, in fact, Symbolics got its ass kicked so badly in terms of performance that they had to port everything to a non-garbage collected OS to even begin to compete.
If you want to try to refute me, choose Erlang. It actually still exists and is in use. It also garbage collects per process and runs close enough to bare metal that its ecosystem could be called an OS.
However, Erlang succeeds in spite of its GC, not because of it. Per process GC combined with "shoot and restart" mentality means that when the GC goes pathological the whole process gets shot and the memory gets released all at once and reallocated.
Of course, if I'm being truly cynical, I would ask why we should do GC at all in Erlang and not just shoot the process regularly.
Fair enough - but in that case would you agree that writing C and not doing regular profiling is stupid? I see far too many projects with that combination.
Embedded systems are resource- and cost-constrained devices. If you can use a slightly less powerful device and save 2c per device... multiply that by a 10 million unit production run and it makes a really significant difference.
efficiency really matters and costs real money.
Of course if you're using dynamic allocation in an embedded system then you have other issues anyway...
And the probability of you shipping 10 million units of anything is vanishingly small. 10,000 is more typical--at which point the savings is $200. Not having to think saves you more money.
which is quite unreadable to say the least. To address this issue I even wrote a small library (https://github.com/cppformat/cppformat) that allows you to write something like this instead:
print("Local number: {}\n\n", someIntPtr);
It is written in C++ rather than C, because the latter doesn't have the facilities (at least overloading is necessary) to implement such a library.
Thanks for the suggestions. There has been some work to make formatting more customizable (https://github.com/cppformat/cppformat/issues/235). This will enable extensions like the ones you suggest while keeping the library itself lean. Generally useful extensions like thousand separators can be added to the core library of course. They are actually on my TODO list.
I recently wrote a similar article ("C Programming Substance Guide")[1] which made the rounds here[2]. I've updated and improved it quite a bit since then too. It seems like we agree on the major points but it might be a good second opinion.
One point about char is that a pointer to a character type (character types are char, unsigned char, signed char - see 6.2.5.15) may alias non-char objects in a standard-compliant fashion (see 6.3.2.3.7). But uint8_t and int8_t may not necessarily be suitable replacements, because they are defined to be integer types (see B.19 - character types are a subset of integer types), and in any event may not even exist (7.20.1.1.3).
B.19 - actually I think 7.20.1 is a better demonstration that these are integer types (and not, since it's never mentioned, character types) - http://port70.net/~nsz/c/c11/n1570.html#7.20.1
The GCC developers talked about making uint8_t something other than unsigned char and figured they would probably get lynched if they did that (because uint8_t aliasing rules would be different). While the standard agrees with you, practice is somewhat different.
I think it isn't clear what the standard means here. If uint8_t exists it must be a typedef to unsigned char (edit: unless to3m is right that there could be a distinct 8-bit implementation-defined integer type, which I didn't think was allowed but maybe it is), so the question is if a typedef to a character type is a character type. My reading is that it isn't, but I've heard enough good C programmers assume that it is that either I am wrong or hopefully compilers will always allow uint8_t to work in this case. It certainly makes more sense to use uint8_t than unsigned char for that usage, but ideally maybe there would be a standard byte_alias_t or such.
http://port70.net/~nsz/c/c11/n1570.html#6.2.5p4 leaves open the option for extended signed integer types, and the following paragraph provides for an unsigned integer type for each signed type, extended signed types included.
So it would be perfectly possible, if not actually very reasonable - not that this is stopping anybody at the moment - for an implementation to provide __int8, a signed 8-bit integer type that fulfils all the requirements of uint8_t, but is not a character type, and use that as uint8_t. Then, conceivably, it could fail to support the use of uint8_t pointers to alias other objects, something supported by the more usual situation of using unsigned char for uint8_t.
I'm not sure that anybody would do this, but they could. I'm rather surprised gcc doesn't do it, come to think of it, just to teach its users a lesson. Maybe I should file a bug report?
Thanks, good to know the full situation. I'll try to remember not to use uint8_t that way in the future, just in case.
I looked briefly previously and somehow got the sense that extended types could only be larger than standard types, but this time I found that 6.3.1.1 explicitly mentions same width extended types.
Personally, I've been happy with GCC's decisions on such issues.
But it doesn't have to be - the 8-bit integer types could map to 8-bit implementation-defined integer types instead. Such types would not necessarily have the same aliasing behaviour as their equivalently sized and equivalently signed character types.
Typedef doesn't create a new type, it merely creates a new name for a type. I don't know what you mean by implementation defined integer type. If the implementation is conforming, then any integer type that isn't a basic C integer type, is really just a typedef for one of the basic C integer types ( char, short int, int, long, ... ).
In C, with CHAR_BIT defined as 8, unsigned char is the only type that satisfies the requirements of uint8_t.
(It is also possible that it is defined as char, if the implementation defines char to have the same range, representation, and behavior as unsigned char. In this case, char effectively behaves like unsigned char, but is still a distinct type. Both are character types, so either can alias.)
Now, why is unsigned char the only possibility? Type uint8_t is defined to have exactly 8 bits, no padding bits, and to be unsigned. So we need an unsigned integer type. The unsigned type that follows unsigned char in rank is unsigned short int, but that type is required to have a range of at least 0 to 65535. Since CHAR_BIT is 8, unsigned short int cannot be used, because it must have more than 8 bits. As there is no type in rank before unsigned char, it remains the only possibility.
No. C allows the implementation (a C compiler) to create an internal integer type that is none of the standard integer types and use it as (u)intX_t, via an implementation-defined mechanism other than typedef.
Or it could use a typedef, e.g. typedef __special_u8 uint8_t; where the particular compiler internally knows about __special_u8 and treats it differently from unsigned char.
As someone who has a been trying to get into modern C from higher level languages and looking for more modern sources than K&R, THANK YOU. (Even if you just posted the link here and have no relation to original author). (Of course, this post is not the only place to learn about modern C, but still, a very useful one).
There's 21st Century C by O'Reilly, which I enjoyed. I'm not qualified to say whether you should read it, but there seems to be a lot less backlash towards it than, say, Learn C The Hard Way. My main criticism is that the author wastes half the book on good development practice which is all well and good, but not if you're just there for the C. For developers not familiar with make, testing and so on, it's great.
As for the actual C content, frankly this was where I went to when I was wondering how to write better (modern) C, so I don't really know any better. The writing is almost certainly opinionated and as above, could be a lot more comprehensive.
(oops edit window left open for a while, see sibling!)
You might also find 21st Century C published by O'Reilly useful. It has a few chapters introducing a "modern" development environment. Then it covers the main points of the newer C99 language and syntax. I found it quite interesting as a beginner/intermediate programmer. It has a few things I dislike, particularly its chapter on autotools (not exactly a modern thing).
> LTO fixes the "source analysis and optimization across compilation units problem" by annotating object files with intermediate representation so source-aware optimizations can be carried out across compilation units at link time (this slows down the linking process noticeably, but make -j helps).
make -j doesn't help linking speed, unless multiple binaries are being built.
gcc has `-flto=n` to use n cores for the link-time optimization. clang doesn't like that syntax though, but I think it does parallel LTO automatically.
AFAIK the bulk of link time optimization happens in the linker. The parallelizable compiler invocations just store extra information in the object files.
The linker invocation is a single make job, which is the unit of make -j parallelism.
I wrote an assignment in C for Operating Systems in my second year of CS. I did it mostly for the challenge and ended up with a distinction grade.
It was very challenging! I was learning how to do threading in C for what was one of the most complex assignments in my degree. However, I don't regret a moment of it. Even with a mark that was lower than my average programming assignment marks, I learned more in that assignment than I did in anything else while at uni.
Being forced to RTFM because it wasn't abstracted away for you was exhilarating once I understood what was going on. And man, it was fast! Much faster than others who had programmed in Java.
I wish I had more reasons to write C, because I love it.
>If you find yourself typing char [..] into new code, you're doing it wrong.
>The only acceptable use of char in 2016 is if a pre-existing API requires char (e.g. strncat, printf'ing "%s", ...) or if you're initializing a read-only string (e.g. const char *hello = "hello";) because the C type of string literals ("hello") is char *.
Heh, basically you need to use 'char *' for strings. I would far prefer if strings were unsigned so that I could use 'uint8_t *', but if you try it you'll get tons of conversion warnings and your code will look weird.
The other widespread use of char pointers is when you need to perform pointer arithmetic over a buffer (without any dereferencing) in steps of 1 (as sizeof(char) == 1).
Void pointers do not allow pointer arithmetic (GCC allows it as an extension).
If you want to access a buffer as raw bytes (which may or may not be octets), use `unsigned char`. Plain `char` may be either signed or unsigned, depending on the implementation.
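A small sketch of that byte-wise access pattern:

    #include <stdio.h>

    /* Walk an object's bytes through an unsigned char pointer, which is
       allowed to alias any object type. */
    static void dump_bytes(const void *obj, size_t n)
    {
        const unsigned char *p = obj;
        for (size_t i = 0; i < n; i++)
            printf("%02x ", p[i]);
        putchar('\n');
    }

    int main(void)
    {
        double d = 1.5;
        dump_bytes(&d, sizeof d);   /* prints the object representation byte by byte */
        return 0;
    }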
He covers that in the next section "Signedness" and more thoroughly in "System-Dependent Types", where he recommends (properly) using uintptr_t and intptr_t:
The correct type for pointer math is uintptr_t defined in <stddef.h>.
Correct me if I'm wrong, but in my understanding, the primary purpose of uintptr_t is to type-pun a pointer into a compatible integer, inspect/modify the bits of the integer and then cast it back to a pointer of the same type. I don't feel like it's the right type for simple pointer arithmetic.
You're right, using uintptr_t for pointer arithmetic is a bad idea.
I've worked on systems (Cray vector machines) where converting a pointer to an integer type, performing arithmetic on the integer, and converting back to a pointer could yield meaningless results. (Hardware addresses were 64-bit addresses of 64-bit words; byte pointers were implemented by storing a 3-bit offset in the high-order bits of the pointer.)
Pointer arithmetic works. It causes undefined behavior if it goes outside the address range of the enclosing object, but in those cases converting to uintptr_t and performing arithmetic will give you garbage results anyway.
The code is still wrong for another reason: if (uintptr_t)ptrOld < (uintptr_t)ptrNew, then the subtraction is calculated modulo some power of 2 yielding a large number. The result of converting this to a signed integral type like ptrdiff_t is implementation defined.
In practice it works only if sizeof(ptrdiff_t) == sizeof(uintptr_t) AND the system uses two's complement for signed integers. Neither property is guaranteed by the standard.
I think the correct way to find the byte offset is this:
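(The snippet itself didn't survive here; one standard-conforming way to express it, assuming both pointers point into the same object, is to do the subtraction on char pointers, which yields a ptrdiff_t byte offset directly:)

    #include <stddef.h>

    static ptrdiff_t byte_offset(const void *p_new, const void *p_old)
    {
        /* Valid only when both pointers point into (or one past) the same object. */
        return (const char *)p_new - (const char *)p_old;
    }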
My head's been mostly in Java and Python the last three years, and my old work's C++ practices were woefully out-of-date. I really appreciate the work put into the CppCoreGuideline documents ... it's real complex, though.
I was dreading getting back to C++ with the Rule-of-3 and now the Rule-of-5 -- what a pain! Fortunately I ran into the Rule-of-Zero as described here: http://accu.org/index.php/journals/1896. The CppCoreGuidelines have a more restrictive Rule-of-Zero but allude to the link's form . . . the only way I'll get back to C++ is if I can avoid move-constructor and move-assignment for the majority of my work ... and the link provides the answer!
> The rules are designed to be supported by an analysis tool. Violations of rules will be flagged with references (or links) to the relevant rule. We do not expect you to memorize all the rules before trying to write code.
But I agree, a short guide for C++ with the most important stuff in it would be nice.
Usually I find something very objectionable in these C guides, but I like this one, and it's hit some of my favourite/most hated bugbears. In recent years it's nearly driven me mad seeing people use the old types either directly or re-implementing their own equivalent of stdint.h, complete with all sorts of platform switching and other nonsense.
We've got these facilities, they're not exactly bleeding edge (the clue's in the name C99), let's use them!
The document recommends "intptr_t" over "long" for system-dependent types, and I've considered doing this but I've been put off by the printing type of %"PRIdPTR" instead of %ld. I wonder if on modern platforms it'd be ok to use intptr_t and the printing type %zd (meant for ssize_t).
One trick I've been using a lot recently that saves me from forgetting to check error codes is the judicious use of __attribute__((__warn_unused_result__)) on functions that return an error code. It prevents you from being able to ignore the return value without doing something with it; even just assigning it to a variable and never checking the value in the variable isn't enough to silence the warning.
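A minimal sketch of that trick (the function and its body are invented for illustration; exact diagnostics depend on the compiler and warning flags):

    #include <stdio.h>

    __attribute__((__warn_unused_result__))
    static int save_config(const char *path)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fputs("ok\n", f);
        return fclose(f);
    }

    int main(void)
    {
        save_config("/tmp/example.conf");        /* typically warns: result ignored */
        if (save_config("/tmp/example.conf") != 0)
            return 1;                            /* checked: no warning */
        return 0;
    }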
For me, this is the most important bookmark of the year. Dropping back into C after last using it pre-C11, this is tailor-made for me. It covers all the areas I needed: correct non-K&R datatypes, new formatting options, best practices on arrays. Awesome.
I refuse to believe that the wicked Buffer Overflow has completely stymied humanity. Use array/string/buffer libraries. They're very small and very easy to test. Pass them around instead of a "char *" and a "size_t".
Now, if you want to talk about signed overflow, bitwise operations on signed types, wacky undefined behavior, or memory management, that stuff is pretty hard to deal with in C. But 99% of the time when people are shaking their fists at "shitty ol' C ruining the Internet for everyone", they mean buffer overflows, and it just shouldn't be an issue these days.
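A hedged sketch of the kind of tiny buffer type suggested above (names invented):

    #include <stddef.h>
    #include <string.h>

    /* Pointer and length travel together, and appends are bounds-checked. */
    struct buf {
        char  *data;
        size_t len;
        size_t cap;
    };

    static int buf_append(struct buf *b, const void *src, size_t n)
    {
        if (n > b->cap - b->len)
            return -1;                   /* refuse rather than overflow */
        memcpy(b->data + b->len, src, n);
        b->len += n;
        return 0;
    }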
The biggest failure of this article is the multiple admonishments of the form "never do <blah> (except for this, this, and that other case over there)" - all advice of this form is worthless when your supposed thesis is making code readable and consistent.
Consistency means avoiding special cases. Variable length arrays are too prone to special cases - IMO, just don't use them. malloc vs calloc is too prone to special cases, that rule is worthless. The rule "never use memset" is full of special cases, also worthless. (And in fact, for your larger automatic initializations, gcc just generates code to call memset anyway.) I would say "always use memset when you need zeroing - then you'll never be surprised." The fewer special cases you have to keep track of, the better.
"Never cast arguments to printf - oh but for %p, always cast your pointers to (void *)" again, worthless advice with an unacknowledged special case.
The way to write code that always works right is to always use consistent methodology and avoid any mechanism that is rife with special cases.
The advice "always return 'true' or 'false' instead of numeric result codes" is worthless. Functions should always return explicit numeric result codes with explicitly spelled out meanings. enum is nice for this purpose because the code name is preserved in debug builds, saves you time when working inside a debugger. But most of all, it's a waste of time for a function to return a generic false/fail value which then requires you to take some other action to discover the specific failure reason. POSIX errno and Windows GetLastError() are both examples of stupid API design. (Of course, even this rule has an exception - if the function cannot fail, then just make it return void.)
When you do that, reading your code is like reading Dune for the first time. I have to keep jumping back and forth to the reference to figure out what's going on.
Because it wastes memory and decreases performance, that's why.
If there's a section of code in your function that may or may not be executed, depending on a conditional, then allocating the memory for variables used within that code block will be a waste if you end up not executing it. Now, you could argue that that block should be turned into a separate function, but now you're doing another function call (unless it gets optimized out) which wastes performance, and you're making the code more cluttered if the code block is relatively small and not really worth turning into a separate function.
Aren't all the local variables allocated on the stack in the function prologue anyway, no matter where in the function they are defined? A typical function starts by subtracting from the stack pointer to make space for the local variables, if I'm not mistaken.
C90 allows declarations (almost) anywhere; you just have to use an explicit binding construct, which in C consists of a statement block. That is, instead of this:
{
int x = 3;
do_something();
int y = 4;
}
you write this:
{
int x = 3;
do_something();
{
int y = 4;
}
}
As a Lisp programmer, I prefer clear binding constructs which put the variables in one place. The (define ...) in Scheme that you can put anywhere makes me cringe; you have to walk the code to expand that into proper lambdas before you can analyze it. Other Lisp programmers don't agree with me. The CLISP implementation of Common Lisp, whose core internals are written in C and which historically predates C90, let alone C99, uses a "declarations anywhere" style on top of ANSI C. This is achieved with a text-to-text preprocessor called "varbrace" which basically adds the above braces in all the right places.
You get better locality with explicit binding blocks, because their scope can end sooner:
{
foo *ptr = whatever();
/* this is a little capsule of code */
/* scope of ptr ends */
}
{
foo *ptr = whatever_else(); /* different ptr */
/* this is another little capsule of code */
}
These tighter scopes are much more tidy than the somewhat scatter-brained "from here to the end of the function" approach.
I can look at that capsule and know that that ptr is meaningful just within those seven lines of code. After the closing brace and before the opening one, any occurrence of the name "ptr" is irrelevant; any such occurrence refers to something else.
This article focuses on clang and gcc, but note that MSVC's printf functions do not support %zu (or %zd) for size_t. MSVC uses %Iu. There is no standard "PRIuSIZE" macro for size_t, but Mozilla defines a polyfill for Firefox:
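A hedged sketch of such a polyfill (not Mozilla's actual macro): pick a length modifier the target's printf understands.

    #include <stdio.h>

    #if defined(_MSC_VER)
    #  define PRIuSIZE "Iu"
    #else
    #  define PRIuSIZE "zu"
    #endif

    int main(void)
    {
        size_t len = 42;
        printf("len = %" PRIuSIZE "\n", len);
        return 0;
    }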
It's not exactly 100% complete in the literal sense, but everything covering the language spec and usage is there. I highly suggest working through it, as the book itself is incredible for going from beginner to expert in the short span it does.
I learned and used C at school. I appreciate what I learned, but it would have been better to learn to use it the right way: using the latest functionality available and considering machine portability along with safe coding guidelines.
alloca isn't so bad if you have stack guards, which Microsoft has used extensively for a long time [1]. Historically Unix has lagged behind here, however, and so alloca is a security risk on Unix, which leads to the "alloca is evil" advice.
There is also the fact that you permanently lose one register in your function (ebp) if you use alloca(). This is important on 32-bit x86 but is rarely discussed.
There is essentially no implementation difference between VLAs and alloca().
No, I think alloca was considered evil before Elias wrote "Smashing The Stack". I bet if you go track down the 'alloca' man page from FreeBSD 2.0, you'll find that its use was "discouraged" even back then.
There is no way to recover from a failed VLA allocation.
Also, VLAs are optional in C11, and may not be supported in otherwise-near-C99 compilers for small processors. (Of course, alloca() is not standard at all.)
> There is no way to recover from a failed VLA allocation.
There's no way to recover from any failed stack allocation.
#define TOO_BIG 1000000
This:
{
double big_array[TOO_BIG]; // not a VLA
}
is no safer than this:
{
size_t size = TOO_BIG; // size is not a constant
double big_vla[size];
}
If you use VLAs (variable length arrays) in a manner that permits arbitrary unchecked sizes, then yes, they're dangerous. If you carefully check the size so that a VLA is no bigger than a corresponding fixed-size array would have been, they don't introduce any new risk.
(I'm not sure why VLAs were made optional in C11, at least for hosted implementations. Support for them was mandatory in C99, and I'm skeptical that they impose any significant burden on compilers.)
It's a bit less evil because it's a declaration instead of pretending to be a function, but a lot of the criticism works for either.
Note that the overarching theme of C99 was making C a better language for numerical code (ie competing with Fortran), thus the introduction of VLAs (as well as _Complex, tgmath.h, restrict, and possibly other things I forgot ).
Having worked with a large code base, maintained over a long period (25+ years) by multiple people, has taught me respect, and that even old code has value by virtue of having run without problems over time. A piece of software that runs without problems has a value in itself that has to be taken into account when considering the cost of maintenance.
Respect earlier developers and try to minimise _unnecessary_ changes that break version control history. Did earlier developers use names you don't like? Formatting you don't like? InconsistentCamelCasing or other_naming_paradigms you don't like? Let it be -- at least until you do a refactoring where _you_ claim responsibility.
It may not be a good idea to uncritically and retroactively apply the good points from the original article to an old code base.
Would love to see a discussion of build systems for modern code. Trying to decide between autoconf/cmake/scons/ninja/gyp for new projects is like flipping a coin... They all have their advantages and disadvantages and there is no clear winner.
After too many problems with "unfriendly C", I finally gave in and now compile everything with -fno-strict-overflow (the Linux kernel does the same thing). I don't care about architectures that don't use twos-complement.
If you think the third party library is bug free I have news for you. The only difference is that you get to debug someone else's code instead, and you might not have the source.
My biggest tip for writing maintainable C for modern compilers is to avoid passing bare primitives between functions. Don't just typedef - wrap things in (single-element) structs that describe the contents. The compilers these days understand that they can boil off the wrapper when generating code.
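A hedged sketch of that single-element-struct wrapping (the type and function names are invented):

    #include <stdio.h>

    typedef struct { double meters;  } distance_t;
    typedef struct { double seconds; } duration_t;

    static double speed_mps(distance_t d, duration_t t)
    {
        return d.meters / t.seconds;
    }

    int main(void)
    {
        /* Swapping the arguments is now a compile-time error, whereas with two
           bare doubles it would compile silently. */
        printf("%f m/s\n", speed_mps((distance_t){100.0}, (duration_t){9.58}));
        return 0;
    }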
Is it better to fully qualify structs and unions or to hide their names by typedefs? I frequently see people hide all struct types because it's visual clutter and extra typing.
In other words, given some struct a, should I do:
typedef struct a a_t;
I realize this is a bit of a style question, but I'm curious.
clang has a -Weverything flag that, unlike "-Wall", will enable all compiler warnings. This includes possibly contradictory warnings. :) For new C code, -Weverything is a good place to start.
- Add security measures provided by your compiler to your builds: -fstack-protector-strong on GCC. -fsanitize=safestack on Clang if you have a version that's new enough.
- Build the program with AddressSanitizer and run it with typical data. Run it through a fuzzer afterwards.
Even if you use int in that case, the compiler is free to pick the optimal underlying data type (byte, probably), provided that it has proven that the value will never go out of that data type's limits and the program doesn't do anything fancy like casting the int to void * and reading it byte by byte.
int_fast8_t meets this description (because the range for this particular loop only spans 10, but you should use the int_fast16_t or int_fast32_t versions as appropriate). int_fast8_t will likely use native-register-sized integers (probably 32 or 64 bit on many architectures).
This is IMO very close to the inference you describe, while still providing some stipulations about what you need.
int_fast8_t and its siblings are defined in <stdint.h>, described in this article (though it would be nice if the article mentioned them).
"How to C in 2016" Don't. C has outlived its usefulness. It's time to move on to more expressive languages. The fact that buffer overflows are still a thing in 2016 should be an embarrassment to the industry.
Personal opinion, people are generally way too concerned about buffer overflows in C. Meaning, if you have issues with buffer overflows, the root cause is that you are seriously writing C the wrong way circa 2016.
When I hear buffer overflows I think:
- Programmer trying to be smarter than he is.
- Deprecated unsafe string functions.
- Unnecessary pointer arithmetic.
- No unit tests.
- Failure to use static code analysis tools.
> Personal opinion, people are generally way too concerned about buffer overflows in C
My only concern about buffer overflows in C is that memory errors continue to happen in software that I depend upon in my daily life.
Chrome, Firefox, Linux, OpenSSL, all these things suffer memory errors that compromise my security. Anyone doing security work in C in 2016 is in my opinion committing malpractice and putting users at risk because their egos can't take not fiddling with bits by hand.
Perhaps, but Rust and Chicken Scheme are probably applicable in many of these cases (e.g. embedded microcontrollers). Few people in the field give either of these any appreciation, despite both being viable. I will agree that "Don't use C ever" is maybe a poor attitude to take, as there are definitely projects that need C as a hard requirement; however, I think your snark is somewhat unjustified, as we do have languages that compile to C, or compile with no runtime, that can beat C in some of the areas that have traditionally been dominated by C.
My experience is that Python handles this wonderfully on the reading side, because code will always be consistently formatted.
But on the writing side, it's harder (impossible?) to have tools that format an entire file, because without braces, various things could be at different levels of indentation. You have to indent as you go, and changing the level of indentation can be a little painful.
In curly brace languages, you can sometimes work very fast by crapping things onto the page and then formatting the entire file.
I don't know which I absolutely prefer, but there is a tradeoff to whitespace based blocks. That said, I think C & Java suffer for the combination of optional braces and lack of a standard format tool.
Python won't run without specified indentation, and leaves no freedom for brace positioning. On the C side, we've accumulated a variety of formatting styles [1].
Both of you make the mistake of confusing formatting with indentation. Indentation is just a tiny part of formatting. There are tools to automatically format Python (but they do obviously require that the indentation is somewhat correct before you run them) and they work fairly well. :)
That's a bit unfair. The standard is from '99, but the implementations arrived very differently than they did for c11 (which already had many bits implemented in clang before it was released).
In practice a lot of c99 got into gcc after 3.0 (2001), and it still took a bit of time for any distribution to ship it as stable. If I remember correctly, lots of distributions held on to 2.9x for as long as possible. I'd say realistic c99 support has existed for ~10 years now, rather than 17.
Why? I write a whole lot of pure C, on microcontrollers (usually with an RTOS) and really cut-down embedded Linux systems that don't have a C++ library.
Why wouldn't you use classes if you have access to C++ though? A lot of the time I write C, I spend a lot of time and write a lot of code that would be far easier with classes and/or access to the STL.
I see this a lot. Shouldn't -O3 produce better optimizations?
Is it that people have large codebases? Or is it that -O3 makes some CPU/memory tradeoffs? Or something else?
Without RTFA: -O3 is where a lot of the weird "compiler deciding it's smarter than you" optimizations come in. Most of the time it is smarter than you, and sometimes it introduces weird behavior.
Also, it generally produces a larger binary, which means less of it fits in the cache. A while back it was often suggested to try both -O3 and -Os, because -Os often ended up being faster.
The article seems to be implying that you should use both -O2 and -Os, but these are mutually exclusive options - if you supply both, only the last one specified has effect.
Never use int? Use uint8_t instead of char for strings? These are not best practices, this is a lot of terrible advice that will make many c people laugh at you.
C programmers bragging about writing good C in this thread is just funny. A large majority of new code does NOT need to be written in C. There's nothing extraordinary about knowing how to write C. Learn to write HDL if you want to brag.
Since I like coding in C, I am mildly offended by the notion that you should not write code in it, but I really enjoyed that all of these things were put together in one place.
Learning C is intimidating to me because every time I see a post written by a long-time C programmer, there are 10 more long-time C programmers explaining how the post is incorrect. Most C books on Amazon have reviews wherein credible-sounding programmers criticize the mistakes of the author.
If publishers and programmers with popular blogs can't get a consensus on what constitutes correct C, how could I ever hope to produce it?
This is a valid observation. Setting aside all the technical flaws with the C language, the fact that most supposed experts can't agree on best practices for even the most trivial usage details does not inspire (for me, at least) much confidence that it would be a good language choice.
>If you find yourself typing char or int or short or long or unsigned into new code, you're doing it wrong.
int is a type that's at least 16 bits large. I can't see the problem with using it. If I specify that I want a type that's 32 bits large, my code won't run on 16 bit systems.
Why should I assume? Because the author doesn't believe in writing code that works on "20 year old systems"?
Except not really. You should use an int_fast16_t, which is the fastest integer type that has at least 16 bits. There's no reason to force a 16-bit integer which may carry performance implications when it's not necessary.
Here is an IRC log of a discussion[1] I had about this topic in the #musl IRC channel. The folks in this channel are the developers behind musl-libc[2]. The conclusion drawn is that some poor design decisions were made that make the fast types not worth using.
Thanks for that, that's pretty interesting to read about. Admittedly, I'm not sure I've ever actually used a int_fastN_t type in my code - I just kinda assumed they served their purpose. Reading that log I can see the obvious problems with those types which I never realized since I never really looked into them that much. int_least16_t seems like a perfectly fine substitute though, I don't really know why int_leastN_t and int_fastN_t would be different in the first place.
That said, there's really no way to correctly pick the "perfect" integer size for every platform - there are just too many variables. IMO I much prefer to just use 'int' when I know the values are going to be small - smaller than a 16-bit value - and I don't care about the actual width. But obviously it's a hot topic for some - as long as you follow the standard in regards to its size, it doesn't really matter what integer type you choose to use. I prefer saving the fixed-width integer types for situations when I know I need such a size.
I'm not convinced. Maybe it really is that the wider type is faster, and int_fast16_t should be 64 bits. I just tried it out on a 64 bit system. The program is just a for loop.
If the loop variable is an int_fast16_t, the loop increment step compiles to
The 64-bit-width version is shorter, and so conceivably faster. Performance seems the same though, but that's probably because this program isn't a proper benchmark.
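The assembly from the comment isn't reproduced above; for reference, a toy loop of the kind being described might look like this (my sketch, not the commenter's program):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Swap int_fast16_t for int64_t (or int16_t) and compare the
           generated increment code and the runtime of the two builds. */
        for (int_fast16_t i = 0; i < 10000; i++)
            printf("%d\n", (int)i);
        return 0;
    }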
Modern processors are pretty crazy beasts. He's right in saying it's impossible to say for sure without some hard numbers or profiling.
Comparatively, while one is three instructions and the other is one, that single instruction is still doing everything the other three are doing. The only reason they aren't both one instruction seems to be that `addw` must not be a thing (why, I don't know). Since the I/O done by both will be nearly identical, and the additions should be very comparable in speed, it's not crazy to say they should perform basically the same. If you compiled with -O2 I bet you'd see almost identical code, since it would probably remove the memory access.
I agree with you - as far as I'm concerned 'int' and 'int_least16_t'/'int_fast16_t' are basically the same thing, or at least close enough that there's no reason not to use an int for standard simple cases. If I'm declaring a simple loop variable that I know is going to stay in a small range, I just use an int. The chances are pretty good that 'int' maps to a good size for the platform, regardless of what platform it is and what it actually provides. 'int_least16_t' probably does too, but it's much more likely I have 'int' around than 'int_least16_t', and between the two 'int' is much more self-explanatory as to why I chose to use it.
I've read that if one uses int64_t with a compiler that targets a 32 bit system, doing various operations on the int64_t variable will be much slower because more machine code instructions need to be generated.
The point stands: Why not use int if one is looking to target any architecture that C will run on?
If 16 bits is enough, use an explicitly 16-bit integer. If you want something architecture-sized (e.g. because it's dealing with pointers/addresses), use things like `size_t` which makes the intent much more clear, and which are better defined.
In most places, if you're using an `int` for non-address-y stuff you're just wasting space on wider archs, or you're overflowing on less wide ones. The only exception off the top of my head is "I have an array of bytes, but want to bulk process more efficiently than u8 would allow". Things like `size_t` (or explicit SIMD types!) are still probably a better choice there, if only to simplify the rule to "just don't use int".
> I've read that if one uses int64_t with a compiler that targets a 32 bit system, doing various operations on the int64_t variable will be much slower because more machine code instructions need to be generated.
Yep, but presumably you picked int64_t for a reason. If you want a 32-bit variable, use int32_t.
It would be an extremely rare platform on which int16_t is noticeably slower than int. The very minor platform-specific performance improvement you might get from using int rather than int16_t is very rarely worth the risk of introducing an overflow on the few platforms where int is 16 bits[1], and not noticing it because int is longer than that on the platform you develop on.
[1] remember that a signed integer overflow is undefined behaviour which should be regarded as equivalent to a security flaw - the compiler is permitted to send all data your program has access to to the mafia if your program contains a signed integer overflow anywhere, and some compilers will
"int" doesn't reveal the intention of the programmer - they might just have used it because it's "int", not because it changes size based on what system you are targeting.
If they used 'int_fast16_t' from <stdint.h>, you have at least an idea that they considered that it might only be 16 bits, and that it can be a larger type on systems where operations on 64-bit types are faster.
`int` is at least 16 bits, `long` is at least 32 bits, `long long` is at least 64 bits, and `size_t` is big enough to hold the size of the largest object you can allocate - and in most cases that is all you need to know.
The exact width types can be useful if you're marshalling data to and from an exact-width format, like a network protocol or file format.
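For instance, here is a sketch of pulling a 32-bit big-endian field out of a packet buffer, where the exact-width type documents the wire format (the function name is mine):

    #include <stdint.h>

    static uint32_t read_be32(const unsigned char *p)
    {
        return ((uint32_t)p[0] << 24) |
               ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] <<  8) |
                (uint32_t)p[3];
    }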
That does not contradict what I said. When choosing the type, all you need is a type that has at least the range you require, and the basic types give guaranteed minimum ranges.
The only advantage to choosing an exactly-N-bit type over an at-least-N-bit type is possibly saving some space in very large allocations. In most cases this won't be important - and when it is, the minimum-width types from stdint.h (e.g. int_least32_t) are more apropos than the exact-width types.
The disadvantage is the impedance mis-match with all the existing libraries of code using the basic types.
IMHO, "int" is perfectly acceptable if you just want to use the platform's preferred word size (and are fine with the possibility of it being only 16bit). Also, "char" is perfectly acceptable too if you're dealing with ASCII. This is valid for both the signed and unsigned variants.
The general rule should be using the language types as long as you're not assuming something about their size, only then use "stdint.h" because it makes your assumption more explicit.
Now, nobody uses "short" or "long" to mean "something smaller or shorter than the platform's int", so these are usually something to avoid.
Is "the platform's preferred word size" meaningful? On x86_64, pointers are 8 bytes, but it's still faster to use smaller registers if you don't need the extra bits. For this system, at least, the correct type is the smallest one that is big enough to hold the maximum value your integer could be.
Bullshit, on x64 all 32 or 64 bit base arithmetic operations have the exact same cost, except the 64 bit version is one byte longer because of a REX prefix. This is why 32 bits is the natural int size for x64.
Well, at least division isn't. On a reasonably recent Intel Broadwell, signed 64-bit division (IDIV) takes 59 cycles, while signed 32-bit division takes just 9 cycles.
Granted, divisions are rare, but that's still nearly an order of magnitude difference in that corner case.
32-bit and 64-bit addition, multiplication, comparison, logical operations etc. are in almost all cases just as fast.
You'll still need more storage on the stack for 64-bit variables. So more cache misses (and perhaps TLB misses and page faults) when corresponding boundary is crossed.
Advice forgotten:
- always avoid libraries whose API documentation is inaccessible or poorly written;
- the Unix man page is a great format, but Windows has a good help format too (synopsis, signature, use cases, error handling, meaning of constants, caveats);
- re-check the documentation of every function you use from time to time ... human memory is not perfect, even for open/calloc;
- don't copy-paste example code blindly from man pages or Stack Overflow ... (I remember the Windows C++ docs having horrible examples with bugs).
Keep a reference to the standard easily accessible, and don't use compiler options whose impact you don't fully understand.
Most developers understand multithreading poorly, and it can be tricky to make a portable library that handles it correctly. Don't hesitate to state in your documentation that you did not care about it; people can then use languages with a GIL (Ruby, Python ...) to overcome the issue safely.
Modern languages are social; take advantage of that sociability (nanomsg, swift, extensions, #include "whatever_interpreter.h" ...).
Programming in C with dynamic data structures is guaranteed to be like a blind man walking in a minefield: you have every right either to write a C extension for a modern language OR to include stuff like Python.h and use Python data structures, managed by the garbage collector, from C.
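A minimal sketch of the second option, assuming CPython's headers are installed and the program is linked against libpython:

    #include <Python.h>

    int main(void)
    {
        Py_Initialize();
        /* Let the Python runtime own a garbage-collected data structure. */
        PyRun_SimpleString("cache = {'answer': 42}\nprint(cache)");
        Py_Finalize();
        return 0;
    }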
Old-style preallocated arrays may not be elegant, but that's how critical systems avoid a lot of problems.
DO NOT USE MAGIC NUMBERS...
USE GOTO for resource cleanup on error, and make the label of the goto indented at EXACTLY the same level the goto was written at, especially when you have resources that are coupled (e.g. a file and a socket). Coupling should be avoided, but sometimes it cannot be. Remember Dijkstra is not that smart; he was unable to become a fully fledged physicist who could deal with the real world.
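A sketch of the goto-cleanup idiom in question (the function name and buffer size are made up for the example; label indentation is left to taste):

    #include <stdio.h>
    #include <stdlib.h>

    int read_header(const char *path, char **out)
    {
        int   rc  = -1;
        FILE *f   = NULL;
        char *buf = NULL;

        f = fopen(path, "rb");
        if (!f)
            goto out;

        buf = malloc(512);
        if (!buf)
            goto out;

        if (fread(buf, 1, 512, f) != 512)
            goto out;

        *out = buf;
        buf  = NULL;   /* ownership transferred to the caller */
        rc   = 0;
    out:               /* one cleanup point for the coupled resources */
        free(buf);
        if (f)
            fclose(f);
        return rc;
    }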
Avoiding the use of global variables is nice, but some are required; don't fall for the academic bias of hiding them to make your code look like it has none.
Copy-pasting a function is not always stupid (especially if you use only 1 clearly understandable function out of a huge library).
Be critical: some POSIX abstractions, like threads, seem to be broken by complexity. KISS.
Modern C compilers are less and less deterministic. Learn about C undefined behaviour, avoid Clang- or GNU-specific optimisations, and CHECK that your code gives the same results with at least 2 compilers.
Don't follow advices blindly: there are some C capos out there that dream of an ideal world. Code with your past errors in mind and rely on your nightmarish past experiences to guide you around traps. Feeling miserable and incompetent is sometimes good.
Resources are limited: ALWAYS check for resource exhaustion in C yourself instead of relying on the system, always have error messages preallocated, and preferably don't i18n them. Don't trust the OS to do stuff for you (closing files, SIGSEGV/segfault ...).
KISS, KISS, KISS: complexity is your enemy in C much more than in other languages. Modern C/C++ wants you to build complex architectures on libraries that are often bloated (glibc).
Egyptian brackets are nice looking.
Once you have finished coding, you have done only the easy part in C:
- dependency management is hard in C;
- have a simple deterministic build chain;
- you should write your man pages too;
- you should aim at writing portable code;
- static code analysis, maybe fuzzing, getting rid of all warnings is the last important part.
Coding in C is just about 15% of the time required to make great software. You will probably end up debugging it much more than you write it: write code for maintainability as your first priority and do not under any circumstances think that you can handle multiple priorities (memory use, speed, ...).
Above all : just write maintainable code, please. This rule is the golden rule of C and it has not changed.
I like this better than 95% of what's written in this thread (and better than the original post). My favorite part is "don't follow advices [sic] blindly". Fresh college graduates have a peculiar way of damaging code when they follow best practices without the experience to back them.
I would add one thing.
Know your target. Batch programs, daemons, microcontroller programs, cross-platform programs are all different and you're going to end up with different styles.
Well that is the difference between writing code and bringing a solution that fits a problem.
I fully agree.
But IT has become shovel sellers selling to other shovel sellers, hoping for a new gold rush to happen.
Computers have still not been proven to be an actual improvement to the economy; the positive impact cannot be measured. I guess I know why.
It looks like Diesel's lost bet to empower the poorest to get access to the economy and compete on equal terms with the capitalists who could buy steam engines thanks to the banks: IP laws make it impossible for a new economy to rise.
Those who do not learn from the errors of the past are doomed to reproduce them.
Wow, I wish I knew all of this when I started a big project last year... It's kinda hard to change it NOW. And with that I mean it would be endless rewriting of the entire thing.
To be pedantic about it, string literals in C are of type `char[]`. In C++ they are of type `const char []`. In either case, altering the contents of the string literal is undefined behavior.
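A small illustration of the distinction:

    char *s = "hello";     /* allowed in C, though the literal itself is not writable */
    /* s[0] = 'H'; */      /* undefined behavior in both C and C++ */

    char buf[] = "hello";  /* a modifiable copy of the literal */
    buf[0] = 'H';          /* fine */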
The "why" for the second one is: that's the model of the C language. Programs are translated in units. The semantic analysis ends when a translation unit is analyzed. All that is left is to resolve the external references when multiple units are linked; nothing is re-translated. This is very clearly stated in a few sentences in that section of the standard which describes the translation phases one by one.
Link-time optimization is a language extension which alters the concept of a translation unit. In some ways, multiple translation units become one unit (such as: they inline each other's code, or code in one is altered based on the behavior of code in the other). But in other ways they aren't one translation unit (e.g. their file scope static names are still mutually invisible.)
One reason I don't use C99 is that many of its features aren't C++ compatible. Most of C90 is. With just a little effort, a C90 program compiles as C++ and behaves the same way, and no significant features of C90 have to be wholly excluded from use.
Portability to C++ increases the exposure of the program to more compilers, with better safety and better diagnostics.
In some cases, I can use constructs that are specifically targeted to C or C++ via macros. Instead of a C cast, I can have a convert(type, val) macro which uses the C++ static_cast when compiled as C++, a coerce(type, val) which uses reinterpret_cast, or strip_qual(type, val) which uses const_cast. I get the diagnostics under C++, but the code still targets C. I'm also prevented from converting from void * without a cast, and there are other benefits like detecting duplicate external object names.
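These are not the commenter's actual macros, but one plausible shape for them:

    #ifdef __cplusplus
    #define convert(type, val)    (static_cast<type>(val))
    #define coerce(type, val)     (reinterpret_cast<type>(val))
    #define strip_qual(type, val) (const_cast<type>(val))
    #else
    #define convert(type, val)    ((type)(val))
    #define coerce(type, val)     ((type)(val))
    #define strip_qual(type, val) ((type)(val))
    #endif

Compiled as C they all degrade to plain casts; compiled as C++ each cast gets the stricter, more diagnosable variant.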
This way of programming in C and C++ was dubbed "Clean C" by the Harbison and Steele reference manual.
C90 is a cleaner, smaller language than C99 without extraneous "garbage features" like variable-length arrays that allocate arbitrary amounts of storage on the stack without any way to indicate failure to do so.
The useful library extensions in C99 like snprintf or newer math functions can be used from C90; you can detect them in a configure script like any library.
The overall main reason is that I simply don't have faith in the ISO C process. The last good standard that was put out was C90. After that, I don't agree with the direction it has been taking. Not all of it is bad, but we should go back to C90 and start a better fork, with better people.
git checkout -b c90-fork c90
git cherry-pick ... what is good from c99-branch
It absolutely does. For instance, it can enforce strict aliasing across translation unit boundaries, breaking code which depends on that separation to protect it.
Here is something in C99 that seems like a great idea and improvement and is often lauded, yet turns out to be ... meh: designated initializers. Sure, these address problems like the wrong function pointer being assigned when the positional order is wrong: the programmer version of buttoning up your shirt wrong. These things are even Torvalds-approved and used in the Linux kernel (whose compilation will complain, if you mix declarations and statements, that they are not a C90 feature!!!)
But here is the thing: all the initialization comes from the initializer, not from the struct type, and the only default value you get is the good old 0 (0 integer, 0.0 floating point, null pointer). So, if you want any other default, designated initializers are basically useless.
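For example (struct and member names are hypothetical):

    struct widget {
        int    frobosity;
        double scale;
        char  *label;
    };

    /* Unnamed members default to zero/NULL -- and only to zero; there is
       no way to attach a different default to the type itself. */
    struct widget w = { .label = "knob" };   /* .frobosity == 0, .scale == 0.0 */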
The designers of a data type want to provide initialization mechanisms with defaults which they control from the site of the definition of the type. We can do that in C90 with macros. Macros ensure that everything is initialized. If we add a member that must be initialized with a non-default value everywhere, we add a parameter to the macro, and the compiler will catch all the places where the macro is called with the old number of parameters.
You provide all the variants in one place which do all the forms of defaulting which you need, and which stick the parameters into the correct place, and then just use them everywhere else.
Even if you have designated initializers, it's still a good idea to wrap initialization that way; they don't replace its benefits.
Suppose I add a frobosity member, and there is no obvious way to default it. All the places where the structure is instantiated have to be analyzed case by case to choose the appropriate value. Designated initializers (alone) won't help; we add the member to the struct and nothing happens. It defaults to zero everywhere it is not explicitly mentioned with a designated or positional initializer. With the macros we just add it as a parameter:
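A sketch of the macro style being described (the names are hypothetical, not the author's code); adding the extra parameter turns every old call site into a compile error, so each one has to be revisited:

    struct gadget {
        double scale;
        char  *label;
        int    frobosity;
    };

    /* Previously: #define GADGET_INIT(scale, label)  { (scale), (label), 0 } */
    #define GADGET_INIT(scale, label, frobosity) \
        { (scale), (label), (frobosity) }

    struct gadget spinner = GADGET_INIT(2.0, "spinner", 7);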
But that brings very little to the table when we've gathered all this stuff in one place (which alone reduces mistakes) and we are initializing every member to a nonzero value.
Languages that have keyword initializers are great when the class supplies defaults for all the ones that the constructor doesn't specify.
(make-instance 'foo :frobosity 42) ;; 50 other slots defaulted for us
If you want to change a default, you don't want to go to all the places where the type is instantiated and update the initializer!!!
"The first rule of C is don't write C if you can avoid it."
Eek! Not what I was expecting. In the video game world there is a not so small contingent of "write everything in C unless you have a really damn good reason not to".
Great post though. Now go forth into the world and write C, my friends!
> The best we can do is write simple, understandable code with as few indirections and as little undocumented magic as possible.
Applies very well to all programming.