The article is good, but I disagree with this part:
"To a large extent, the answer is: C is that way because reality is that way. C is a low-level language, which means that the way things are done in C is very similar to the way they're done by the computer itself. If you were writing machine code, you'd find that most of the discussion above was just as true as it is in C: strings really are very difficult to handle efficiently (and high-level languages only hide that difficulty, they don't remove it), pointer dereferences are always prone to that kind of problem if you don't either code defensively or avoid making any mistakes, and so on."
Not really, and not quite. A lot of the complexity of C when it comes to handling strings and pointers is the result of not having garbage collection. But it does have malloc()/free(), and that's not really any more fundamental or closer to the machine than a garbage collector. A simple garbage collector isn't really any more complicated than a simple manual heap implementation.
And C's computational model is a vast simplification of "reality." "Reality" is a machine that can do 3-4 instructions and 1-2 loads per clock cycle, with a hierarchical memory structure that has several levels with different sizes and performance characteristics, that can handle requests out of order and uses elaborate protocols for cache coherence on multiprocessor machines. C presents a simple "big array of bytes" memory model that totally abstracts all that complexity. And machines go to great lengths to maintain that fiction.
When C was invented, it was very close to reality. It isn't anymore, as you point out. (But as another commenter said, assembly language isn't that close to reality either)
Unfortunately hardware guys and software guys didn't really coordinate, and we just hacked shit up on either side of the instruction set interface.
It is kind of ironic that we write stuff mostly in "serial" languages. But the compiler turns it into a parallelizable data flow graph. Then that's compiled back to a serial ISA. And then the CPU goes and tries to execute it in parallel.
It would be a lot nicer if we wrote stuff in parallel/dataflow languages, and the CPU could understand that! Some dilettantism with FPGAs made me realize how mismatched CPUs are for a lot of modern problems.
It's kind of like the idea that Java throws away its type information when compiling to byte code, and then the JIT reconstructs the types at runtime. We have these encrusted representations that cause so much complexity in the stack. C is (relatively) great, but it's also unfortunately one of these things.
People have serious problems learning how to write dataflow languages. Itanium exposed its complexity, and was a commercial failure. Similar complaints have been made about the Playstations that made heavy use of explicit parallelism.
It's interesting that you mention Java; some ARM systems have 'Jazelle', a system for directly executing Java bytecode. I don't know how widely used it is.
Good point about Itanium, but that doesn't mean people have problems writing dataflow languages. It's more about the compilers, because very little of the code we run is written in assembly language. Any new architecture will have to gain adoption by having a C compiler.
And there were academic dataflow computers as well.
I think D 2.0 makes everything immutable by default (single assignment), so it is close to a dataflow language. And that's no surprise, because Walter Bright designed it specifically to make writing a compiler as natural as possible, rather than having to make complicated inferences about serial code.
My point was that we can never make the jump because the stack of hacks we currently have works "good enough" and it is backward compatible. Generally technologies evolve by accretion, and I'm hard pressed to find an example of radical simplification.
I guess power concerns would be the only hope for something simpler.
It's kind of funny that Backus wondered, "Can programming be liberated from the von Neumann style?" That question applies to hardware too.
That's precisely how the JVM behaves, which aggravates a lot of people when writing code using generics. In C# I can do the following:
public T getInstance<T>() {
return this.Instances[typeof(T)];
}
Or, even better:
public T createInstance<T>() where T : new() {
return new T();
}
Meanwhile Java requires this:
public <T> T getInstance(Class<T> cls) {
return cls.cast(this.instances.get(cls));
}
The type information above (the generic parameter T) is stripped after compilation, so the Class parameter is needed on the method to get a solid reference to the type we need to work with at run-time. This gets worse when we want to do the second example.
public <T> T createInstance(Class<T> cls) throws InstantiationException, IllegalAccessException {
return cls.newInstance();
}
It gets even more fun when you throw in non-default constructors. Non-reified generics are a giant pain in the butt.
There are cases when reified generics are a giant pain in the butt and non-reified ones are the right solution. See Scala or F# and .NET interoperability.
Once you're in byte code, you don't have your static types anymore -- that is, types that are independent of control flow. You have types on individual instructions.
So basically you lost some type information before the JIT even sees it. Now, it doesn't actually matter for speed, as Mike Pall (LuaJIT author) would say. His point is that dynamically typed languages can be just as fast as statically typed languages, because you have more information at runtime.
I guess he meant JavaScript. JavaScript is dynamic and JS JITs try to reconstruct types at runtime to improve performance. Java is statically typed and, with the minor exception of generic type arguments, types are accessible both at compile time and runtime.
C doesn't really have malloc or free: those are part of the standard library. You can happily code in C without malloc/free, or you can add a library that provides a garbage collected malloc. What C's type system provides really is bare metal (although, as you say, bare to the abstraction provided by the machine, not to physical reality), to a much more fundamental extent than even a heap allocator, and certainly claiming you could just swap in garbage collection and then have a string type is totally missing the point.
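For what it's worth, a minimal sketch of the "garbage collected malloc" option, assuming the Boehm-Demers-Weiser collector (libgc) is installed and the program is linked with -lgc:

    /* Allocation through the Boehm collector: GC_MALLOC'd memory is
     * reclaimed automatically, so there is no corresponding free(). */
    #include <gc.h>
    #include <stdio.h>

    int main(void)
    {
        GC_INIT();
        for (int i = 0; i < 1000000; i++) {
            char *p = GC_MALLOC(64);   /* never freed explicitly */
            if (p)
                p[0] = 'x';            /* touch it so it isn't optimized away */
        }
        puts("done");
        return 0;
    }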
> C presents a simple "big array of bytes" memory model that totally abstracts all that complexity.
I don't understand what you mean by this. Machine code itself abstracts away the underlying hierarchical memory structure. Sure, some machine language might have instructions to manipulate the cache, but those are easily invoked from C, using either inline assembly or __builtin functions.
I stumbled across this paper describing a language that aims to do that. For some reason I was not that excited by the matrix multiplication use case, mainly because that is not the kind of application I'm interested in. I'd like to see examples of stuff a lot of cycles are burnt on in "web" data centers.
But I would like to see more work along these lines -- pointers appreciated.
Agreed; a language like Forth is much closer to the machine, tho' if you really want "the machine" then assembly is the way to go, and writing with a good macro assembler is surprisingly high level. I still pine for the days of Devpac.
> "To a large extent, the answer is: C is that way because reality is that way.
I thought C was so widespread that it eventually started to affect how some computer architectures were designed? If so it seems a bit disingenuous to say that it is only dealing with the reality that it was given.
The original machine that influenced C's model of computers was the PDP-11 (http://en.wikipedia.org/wiki/PDP-11). It had a mov instruction instead of load/store. It had no dedicated IO instructions. It could be treated as a sort of generic random access machine (http://en.wikipedia.org/wiki/Random-access_machine) and that is what C did and still does. So there was a reality that C simply modeled, and it was copied (with all sorts of modifications) many times.
Shhhhhh! If you let them know how fun it is then everyone will want to be C programmers :-)
I got to use my crufty C knowledge to useful effect when I discovered that there is no standard system reset on Cortex M chips. That led me to trying to call "reset_handler" (basically the function that kicks off things at startup), which I couldn't do inside an ISR because, lo and behold, there is "magic" in ISRs: they run in "Handler" mode versus "Thread" mode, and jumping to thread-mode code is just wrong, apparently. C hackery to the rescue: hey, the stack frame is standard, so make a pointer to the first variable in the function, walk backwards up the stack to the return address, change it to be the function that should run next, and return. Voila, system reset.
The whole time I am going "Really? I have to look under your covers just to make you do something anyone might want to do?" As a respondent to one of my questions put it "ARM is a mixture of clever ideas intermixed with a healthy dose of WTF?"
It's technically dependent on external hardware in your processor subsystem, but it should work if your implementer has half a brain (or, at least cares enough to read the integration manual). If it doesn't work, please flog your implementer publicly so that I can know to avoid them in the future...
Incidentally, even that tiny code snippet uses a C extension (the __dsb intrinsic), which is either a great example of how C can be wielded to great power (I can generate raw instructions!) or how C is terribly handicapped (I need a special compiler extension or all my system code is horribly broken!). All depends on point of view, I guess...
I too had already been through that particular part of the ARMv7-M manual, after coming up through the data sheet, to the technical reference guide, to the Cortex M architecture manual, and yes, to the base ARMv7-M architecture specification.
There is "should" and "does" :-) I had followed the exact same sequence and discovered on my system it didn't cause a system reset. I escalated it a bit (in this case ST Micro) and the caveats came out, "Well if you system is correctly designed, if the actual reset pin isn't connected to something that is interfering, if the core is in a state where it can actually take a reset, ..." This being unlike pretty much every processor I've worked on, from PDP-8's, 11's, 10's, VAXen, IBM 360/370, Sun 1, 2, 3, 4, Motorola 6800, 68K, 68K, Z80, 8080, 806, Pentium *, the list goes on. Most actually have a reset instruction, usually privileged, to force a system reset. But the difference is that the reset sequence was guaranteed to force a system reset if it executed, all of those previous processors had a company that made the processor as opposed to simply licensed the processor to a third party. Since it is completely reproducible on my system I've got an action item to create a small test case for errata analysis.
Yes, that is the code they suggest, with assurances that it will work in pretty much any case. Except when it might not. It was a bit stunning for me, I still marvel at the notion. Like an add statement that will 99.999% of the time add its operands together[1] unless it doesn't.
[1] No there isn't such a thing in ARM it is just the way I hear "It will almost always reset the system."
Fair enough. The Cortex-M's that I've dealt with have generally been homed inside some larger chip, and so the "reset pin" notion was a bit more vague, and in such an environment ARM's stance makes a bit more sense as you really want the reset to reset the "subsystem" (including whatever other random hardware you glued to yourself today).
It's also true that ARM has heard some of these complaints, which is why there were steps toward standardizing some things--like, the SYSTICK interface--in v7-M. It's really been a step up since the ARM7TDMI, and I hope that it will continue with v8-M or whatever the next revision ends up being.
Personally, when I'm dealing with a discrete chip and I need to reset it I've found that the most reliable methods that don't set off the hack-o-meter too badly are to wait for or directly invoke watchdog hardware... but yes, that's essentially always device specific.
That is a very good point! If you are building some SoC and the CPU is just some small part of it, "system reset" can in fact be a very vague notion. I hadn't really been thinking that way and in that context the ARM position does make sense. As another person pointed out offline there is the watchdog timer, if you wanted you could set it and halt. Then it would kick you with a reset.
For anyone else curious on the __dsb() call, from ARM's docs:
The Data Synchronization Barrier (DSB) acts as a special kind of memory barrier. The DSB operation will complete when all explicit memory accesses before this instruction have completed. No instructions after the DSB will be executed until the DSB instruction has completed, that is, when all of the pending accesses have completed.
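For reference, here is roughly what the conventional reset-request sequence looks like in C (a sketch along the lines of CMSIS's NVIC_SystemReset, with register addresses from the ARMv7-M architecture manual; as discussed above, whether the reset actually happens still depends on the surrounding system):

    #include <stdint.h>

    #define SCB_AIRCR (*(volatile uint32_t *)0xE000ED0CUL)

    void request_system_reset(void)
    {
        __asm volatile ("dsb" ::: "memory");   /* finish pending memory accesses */
        SCB_AIRCR = (0x05FAUL << 16)           /* VECTKEY: writes without it are ignored */
                  | (SCB_AIRCR & (0x7UL << 8)) /* preserve the PRIGROUP field */
                  | (1UL << 2);                /* SYSRESETREQ */
        __asm volatile ("dsb" ::: "memory");
        for (;;) { }                           /* wait for the reset to take effect */
    }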
Exactly! Fortunately, the article didn't point out that one of C's great advantages is that it's so lightweight it can fit into and run on all kinds of interesting systems where other languages are infeasible economically. Cortices rule!
> Except there are other safer languages to choose from, like Pascal and Basic compilers.
I'm curious, why do you say "safer"? These are languages for microcontroller programming. The things you do there are bound to be "unsafe", like peeking and poking memory for memory mapped i/o and disabling/enabling interrupts.
Unless, of course, all the possible things (i/o, timers, interrupts, etc) are wrapped in some kind of "safe" api so you essentially don't have access to low level facilities any more. The Arduino programming environment is kinda like this but you can still cause bad things to happen and if all else fails, hang the device with an infinite loop.
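To make "peeking and poking" concrete, this is the sort of inherently unsafe thing microcontroller code does all the time (the register name and address below are made up for illustration):

    #include <stdint.h>

    /* Hypothetical memory-mapped UART data register. */
    #define UART0_DR (*(volatile uint32_t *)0x4000C000UL)

    void uart_send_byte(uint8_t b)
    {
        UART0_DR = b;   /* no type system or bounds check can help here */
    }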
> I'm curious, why do you say "safer"? These are languages for microcontroller programming. The things you do there are bound to be "unsafe", like peeking and poking memory for memory mapped i/o and disabling/enabling interrupts.
There are things which are inherently safe by design in languages like Pascal -- e.g. in a naively-written program you can't write past the boundary of the UART RX buffer and thrash some other array in your program -- but your observation is fair.
> There are things which are inherently safe by design in languages like Pascal -- e.g. in a naively-written program you can't write past the boundary of the UART RX buffer and thrash some other array in your program -- but your observation is fair.
Which, of course, you can accomplish in C by wrapping your UART buffer handling code in functions that do bounds checking. And I assume that a micro controller Pascal or Basic dialect will have some kind of peek/poke from/to arbitrary memory addresses that can be misused just as a pointer access in C.
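Something like this trivial sketch, say (the buffer name and size are invented):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define UART_RX_BUF_SIZE 64

    static uint8_t uart_rx_buf[UART_RX_BUF_SIZE];
    static size_t  uart_rx_len;

    /* Refuses to write past the end of the buffer instead of overrunning it. */
    bool uart_rx_push(uint8_t byte)
    {
        if (uart_rx_len >= UART_RX_BUF_SIZE)
            return false;                  /* full: drop the byte */
        uart_rx_buf[uart_rx_len++] = byte;
        return true;
    }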
Safety is hard to quantify and measure and calling one language safer than another sounds more like an opinion than a factual claim, especially in this context.
The "out of the box" defaults of Pascals are something you have to explicitly develop, test and maintain in your C. Which means that even during the maintenance cycle, long after it's originally implemented, it can still break in your C.
> The "out of the box" defaults of Pascals are something you have to explicitly develop, test and maintain in your C. Which means that even during the maintenance cycle, long after it's originally implemented, it can still break in your C.
Let's not confuse languages and libraries/apis here (that may or may not be shipped with the compiler). There are libraries for C and related languages (e.g. Arduino) that actually do give you a "safe" way to deal with the hardware on microcontrollers.
It's still easy to shoot yourself in the foot in C with a bad pointer access (esp. because there are no helpers to work with strings) but I don't really see how a Pascal dialect with peek/poke would be inherently better.
I do agree that buying one of these Pascal/Basic software products that come with a fancy standard library that does safe access to the hw may help writing safer software but I don't see how that is an inherent quality of the language.
Don't you think that Pascal's standard safeties will prevent mistakes?
1) array bounds checking (which can be used for safe hardware access instead of peek/poke)
2) pascal-style strings (with actual support for them, both in the language and in the standard library) (meaning a missing \0 doesn't erase the entire memory)
3) type-safe pointers
4) no pointer arithmetic
> 1) array bounds checking (which can be used for safe hardware access instead of peek/poke)
How? Peek/poke at the wrong address will generate an error in any case, and any MMU-less platform worth using will have the memory-mapped peripherals in a separate, lower region, where they don't get thrashed by writing past the end of a buffer. I have seen bugs occurring because of data located past a buffer getting thrashed, but I don't remember seeing one in the context of hardware access.
> 2) pascal-style strings (with actual support for them, both in the language and in the standard library) (meaning a missing \0 doesn't erase the entire memory)
That shouldn't happen in C, i.e. there are library routines you should use so that it doesn't happen. Not that string processing isn't a pain :-).
> 3) type-safe pointers
No complaints here :-)
> 4) no pointer arithmetic
That's not always good, but it does decrease the likelihood of certain types of bugs, so yep!
It happens every day and it will as long as there's C. Zero-terminated strings are part of the standard library and an almost infinite number of other libraries. You can't pretend it doesn't exist as the most common convention.
Additionally all those safety mechanisms can be turned off for performance, but only on the exact spot where it really matters, instead of being scattered all around the code.
var
EGAVGAScreen : Array[0..41360] of Byte absolute $A000:0000;
Et voila : bounds-checked memory mapped hardware access.
(note: the very well known "Crt" unit uses a memory mapped video buffer like this. So if you programmed a turbo pascal program, chances were good it was using this trick for screen output. Advantage : the speed is unbeatable)
> Which, of course, you can accomplish in C by wrapping your UART buffer handling code in functions that do bounds checking.
Of course, and you pay other hefty prices in Pascal or Basic for getting this sort of stuff "out of the box". Pascal isn't my favourite systems programming language, either :-).
There are zero features in C that Pascal and Basic dialects for system programming don't support. The only difference is that you need to turn safety off explicitly.
I was doing low level coding in Turbo Pascal before I even cared about C.
In terms of performance? None, assuming a well-written compiler. Depending on dialect you run into other issues though, such as the array size being part of the type signature, which is definitely not nice. The lack of adequate tooling and portability is another issue. Not strictly a problem of the language itself, but a problem you end up facing if you write low-level code in Pascal.
Don't get me wrong, I wrote low-level code in Pascal, too. It's nice and I probably wouldn't grumble too much if I had to do it again, but there's a bunch of stuff that comes in the same package with using something other than a language widely considered adequate for systems programming.
I do concede that even though bashing C is a pastime of mine, I would use it if it were the best option for a given project, depending on the set of factors to be considered for said project.
In real-life projects, there should be no place for tooling religion anyway.
In some cases, we want freedom... and that's something that has been neglected a bit too much with new programming languages IMHO. Safety/security feels like it's the latest of the dumbing-down "let's treat programmers like idiots" fad.
"In fact, C may be part of the problem: in C it's easy to make byte order look like an issue. If instead you try to write byte-order-dependent code in a type-safe language, you'll find it's very hard. In a sense, byte order only bites you when you cheat."
All of those examples are possible in safe systems programming languages like Modula-2 and Ada, with the difference that only the tiny spot where it matters is marked explicitly unsafe.
Whereas in C there is no way to distinguish between unsafe and safe code.
2. Direct memory mapped hardware access. For example: device drivers, kernel, embedded systems, microcontrollers.
Of course doing this usually means your code isn't portable, sometimes not even to other versions of the same compiler. I often wish C had a defined order for bitfields, for example.
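For example, a quick sketch of the bit-field portability problem (field names invented):

    /* Whether 'enable' lands in the least- or most-significant bits of the
     * unit is implementation-defined, so this struct can't portably describe
     * a hardware register layout. */
    struct ctrl_reg {
        unsigned int enable : 1;
        unsigned int mode   : 3;
        unsigned int level  : 4;
    };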
One of the things I do to help debug & test embedded code is to factor out code that isn't dependent on the hardware platform (e.g. communication protocols) into a library, compile it for the host, and write test programs that exercise it there. So even embedded code is often best written to be as portable as possible.
Who, Simon Tatham? I've been a fan ever since I modified the PuTTY code to work in an embedded system. I did not realize he was at ARM these days; gives me hope.
I too hated C once to the point I actually kicked a PC over on its side out of anger. I never understood pointers, structures and string manipulation. I genuinely hated it and wanted it to go away. I deleted all the code I'd written then went for a walk.
Then suddenly... click.
Suddenly, in my mind appeared an abstract model for it. It was complete and possible for one person to understand.
Since then I understand what I'm doing rather than know how to drive the compiler.
That has never happened for any other language I've written and that includes Z80+6502+X86 ASM, C++, C#, VB3-6, PDS7, ksh, Python, Perl and PHP to name a few. Assembly is close but not abstract enough.
Enlightenment is probably the best word to describe it.
I never had the click about pointers, I guess the way they were introduced to me just made them feel intuitive.
I'll never forget the time when programming changed from mechanically entering statements, to controlling the flow of data through a program though (also in C). Definitely a brain-expanding experience :)
The points are great and this is generally a good primer for someone who wants to understand the C mindset.
The bit at the end is a bit off, though. It feels like the author is saying "yeah, C is weird and crufty for historical reasons and some people just use it because they're backward like that". Yeah, I write kernel drivers, but I also just plain like using C, for the same reason that I like driving a manual transmission and usually disable the safety features on stuff: C tries really hard to not get in your way.
I enjoy programming in Ruby and mostly enjoy programming in Javascript. But there are times when I think "this is an unnecessary copy...this is inefficient...I wouldn't have to do this if I were writing in C".
(There are also times where I think "this one line of code would be over 100 lines of C", but we won't get into that right now...).
Perhaps I can state a point more simply than another poster did.
"this is an unnecessary copy...this is inefficient...I wouldn't have to do this if I were writing in C"
You should then ask yourself: does the inefficiency matter? Will it make the program noticeably slower? If not, then you can safely ignore the lack of machine efficiency and embrace the gain in programmer efficiency.
A valid question...sometimes it does. Sometimes it doesn't. Sometimes you think it won't, and then it does and you have to do some herculean things later to make it scale.
Also, I think it's a fallacy to say that higher level languages necessarily mean more programmer efficiency. When I'm doing network programming in C, here's what it looks like:
* Define a struct whose field layout matches the wire format
* Cast the incoming buffer to that struct
* Done
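Roughly like this sketch (the wire format and names are invented; I've used memcpy into the struct rather than a raw pointer cast to sidestep alignment issues, plus fixed-width fields and ntohs/ntohl for byte order):

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>   /* ntohs/ntohl */

    /* Hypothetical wire header: fixed-width fields, no padding on typical ABIs. */
    struct msg_header {
        uint16_t type;
        uint16_t length;
        uint32_t session_id;
    };

    void handle_packet(const uint8_t *buf, size_t len)
    {
        struct msg_header hdr;

        if (len < sizeof hdr)
            return;                        /* too short to hold a header */
        memcpy(&hdr, buf, sizeof hdr);     /* copy instead of casting the buffer */
        uint16_t type = ntohs(hdr.type);
        uint32_t sess = ntohl(hdr.session_id);
        (void)type; (void)sess;            /* ... dispatch on type, use sess ... */
    }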
By contrast, Ruby requires me to marshal data and painstakingly extract each field...all because it tries to abstract away the fact that memory is a flat array of bytes.
Yeah, much of the time I'll be more productive in a higher-level language. But there are problem domains where that is not the case.
I fell in love with C 25 years ago, but then I moved into enterprise applications using higher level languages. How is the job market for C programmers? I would imagine that younger programmers don't go that route.
If you work on (real time) embedded systems, it is pretty much a C and C++ world, at least for the lower layer, and middleware part. You can pretty easily work in Automotive, Aeronautics, Robotics, Defense, etc...
Most of this impression is usually caused by the combination of the following things:
* fast startup
* programs in C usually do much less with more code than high level languages
Once the project gets really big and complex, C starts to get slower and harder to optimize than some higher level languages (e.g. dynamic dispatch tends to be slower in C than in C++ or Java).
>>programs in C usually do much less with more code than high level languages
Ok. So?
>>Once the project gets really big and complex, C starts to get slower and harder to optimize than some higher level languages
True. The speed boost isn't automatic. It's up to the programmer to write fast code.
I have nothing against high-level languages and garbage collection. They certainly have a place. But if you use high-level languages exclusively, you'll never (or rarely) have the special joy of seeing your program run at the full speed of the hardware.
The bigger a project is, the more stuff it does. If a language does this stuff slowly (e.g. Ruby), then it will not run faster no matter what buzzwords you invoke.
It can't be as fast as in VM-based languages, because the code (typically) can't self-optimize / modify itself according to the usage patterns to inline dynamic calls. This is the stuff that VM can do, because it has much more information. This is one of the reasons a general sorting method like qsort is so slow in C compared to general sorting method in Java (Collections.sort). Sure, you can specialize manually or do some macros, but such manual approach gets hairy pretty quickly for something more complex than a simple sorting method.
std::sort in C++ is known to be faster than qsort() in standard C, because C++ templates allow the comparison function to be inlined. I find it doubtful, though, that Java has a generic sort method that regularly outperforms qsort.
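Concretely, the indirection in question is the function-pointer call inside qsort (a minimal sketch):

    #include <stdlib.h>

    /* qsort reaches the comparator through a function pointer, so a typical C
     * compiler can't inline the per-element call the way a templated std::sort
     * (or a JIT-specialized Collections.sort) can. */
    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a;
        int y = *(const int *)b;
        return (x > y) - (x < y);
    }

    void sort_ints(int *v, size_t n)
    {
        qsort(v, n, sizeof v[0], cmp_int);
    }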
I measured Collections.sort and it was very close to C++ std::sort in performance, while C qsort was about 10x slower. It outperforms qsort for the same reason C++ does: the call to the comparison function is inlined.
There is a theoretical benefit in being able to optimize at runtime. But, in practice, these advantages are virtually always too small to outperform code compiled statically.
This is not a theoretical benefit - it is a very practical benefit, especially for object-oriented code with lots of indirection, virtual calls and dynamically loaded code. The reason it is not visible in microbenchmarks is because microbenchmarks are small and usually avoid indirection as much as possible, and even if there exist some, the code is all in one file so a static compiler can figure out all the call targets properly.
There is nothing about pointers that makes them unusually hard to keep track of. If you are prone to goofing on pointer arithmetic, then you're almost certainly going to run into problems even if you never use pointers, because arithmetic is hard to avoid in programming.
I don't use C for everything. When I do write C, I try to keep memory as simple as possible. I avoid allocating dynamic memory wherever possible, and when I do use malloc, I try to keep the logic around the pointers as simple as possible. This is beneficial all around, because malloc and free are not especially fast, and if you use them for everything, you won't see all that much benefit over a high-level language.
It's like building your own house or making your own clothes. There's no guarantee that you'll do a better job and get a better result than if you went the easy way.
First of all, in other similar languages like Modula-2 and Ada, there is no need to use pointer arithmetic as much as C developers do.
Even in C, most developers who use it are doing micro-optimizations without ever testing their assumptions about performance.
Finally, as a single developer it is easy to keep track of most C traps, the problem is when a project has more than a few developers, with different skill levels. Then the party starts.
There are probably lots of people writing bad code in C, I'll grant you that point. But that isn't a property that's built into the C language itself. I think you are attacking a straw man. My original point was just that it's possible to write very fast programs in C.
>It is, as it makes very easy to blow your leg off.
I have never seen a bug-proof programming language. If a language lets you do anything at all, then it will let you write bugs. So I don't see it as a weakness that C allows you to write bugs. If you have easy access to memory, then you can easily corrupt memory.
I experienced the inner monologue you describe for several years. It haunted my dreams when working with Ruby. But one day a rhetorical thought suddenly dawned on me that has since changed my perspective quite dramatically...
"...if I'm being incessantly bothered by what I perceive as the nagging inefficiencies of some programming language's implementation, maybe I'm not thinking about or relating to programming languages (in the large) in the way I should be..."
If all programming languages are merely tools to communicate instructions to a computer, then why is human language not merely viewed by everyone as a means to an end as well? Surely, most would agree that language is more than simply a means to an end, and that language does far more than simply transmit information between parties. If efficiency, lack of ambiguity, etc., were the paramount goals of human language, surely formal logic, or perhaps even a programming language for interpersonal communication, would be more fitting than natural language!
So why do we insist on communicating with each other with what is often such an abstract and ambiguity filled medium?
tl;dr: it is trivial, even natural, for a literate individual with the proper context to understand concepts in language that seemingly transcend the words themselves. These notions would be (and are) exceedingly difficult to formalize, and any formal expression of these ideas would cause exponential growth of the output.
Ever try explaining a joke to someone who didn't "get it"? It takes a lot more "space" to convey the same sentiment than to someone who "got it".
So what has this crazy rant have to do with anything? Well, aside from revealing I am a complete nerd, it speaks to my approach to software engineering today.
We have to let go of the machine if we ever want to really move the state of the art forward.
There are an infinitude of expressible ideas, but lacking the proper medium to abstract the expression of these ideas formally (like natural language and our brains do, well, naturally) we will never get a chance to find out what we don't know!
"We're doing it wrong" is not exactly the sentiment I'm trying to express, but it's sorta that. Maybe.
Hope this comment made any sense. :) It's 4 AM after all.
I'm afraid I couldn't hang on for the ride...perhaps I'm like the one who doesn't get the joke and needs the much lengthier explanation!
It sounds like you're saying that programming languages are constrained on two ends: on one end by being too tied to the underlying microarchitecture, and on the other end by being interpreted by our minds which think about programming in terms of language features rather than Platonic ideals.
Assuming I've come at least close to understanding your point, I guess what you're saying is that by thinking too closely about what I'm trying to do at a low level, I'm negatively affecting my ability to write idiomatic Ruby code to do useful things?
This is probably true; it's one of the curses of being a kernel developer. I think you want people who care deeply about how bits are laid out in memory being the ones who are writing your operating system.
I wasn't trying to insult you or anything, my comment was just my (extremely) sleepy attempt to express an idea that I've had shuffling around in my mind for a while now.
At times there's nothing I want to do more than solder components onto a circuit board and make a radio or something. It's really, really gratifying to make something work that's so "magical" (from a certain point of view, radio is pretty magical to me) and completely understand how everything works from start (bare materials) to finish (a working radio!).
I guess what I was trying to say was that if I would ever want to make a CPU comparable to, say, what Intel produces today, I'd have to give up my soldering gun and any notion of manufacturing the CPU with any discrete process (like soldering individual transistors) and instead adopt an entirely new approach - like maybe electroplating - in any event, it's one that allows me to make incredibly powerful things at the expense of being able to "use my hands".
Experts are always going to need to know (and I mean really KNOW) the underlying fundamentals of their field regardless of how "high level" their work becomes - see theoretical physics, et al. With that in mind, I think people who care deeply about how bits are laid out in memory are exactly the same people who will always be at the forefront of computer science and software engineering - even if 99% of their practical output in life is at a level much higher than bits. :)
> It is trivial for a literate individual with the proper context to understand concepts in language that seemingly transcend the words themselves. These notions are exceedingly difficult to formalize, and any formal expression of these ideas would cause exponential growth of the output.
My interpretation: "I find formal logic's ineptitude at resolving ambiguity disappointing. But humans resolve ambiguity without breaking a sweat. Is it possible to generalize logic to encompass ambiguity?" I think the answer you seek lies in Probability Theory.
> a word to the wise is sufficient. [1]
How does a brain quickly derive intended meaning from an ambiguous lexicon like the English Language? Realize that an infinitude of nuanced interpretations are equally possible, but not equally probable. Suppose Alice says to Bob "The sea/c". If the topic was Marine Biology, Bob will expect (assigns a high probability to the hypothesis) that Alice meant the ocean. If the topic was Typography, then Bob will expect that Alice meant the glyph. Similarly, computers which deal with ambiguity (e.g. speech interpreters, facial recognition) assign higher probabilities to some interpretations than others.
> If efficiency, lack of ambiguity, etc., were the paramount goals of human language, surely formal logic, or perhaps even a programming language for interpersonal communication would be more fitting than natural language!
Computers can communicate practically instantly, but humans are bottle-necked by how quickly we can move our lips. Therefore, I would expect spoken languages to be optimized toward articulating as little as possible. One technique is overloaded vocabulary. I think humanity prizes the ability to compress information down to a single word. This unfortunately comes at the cost of computer-level clarity. But I mean, "one-liners" do make for great movies, don't they.
> Ever try explaining a joke to someone who didn't "get it"? It takes a lot more "space" to convey the same sentiment than to someone who "got it".
As far as I know, humor is one of those things that scientists don't fully understand yet. But some have a rough idea. I'm convinced music and humor are related in the sense that they set up an ambiguous "expectation/motif/theme/ context", and then playing on that expectation.
Music is defined by tension and resolution: tension being ambiguity and resolution being validation. Google an analysis of Beethoven's 5th, and it will say that the opening intervals create tension because the key is uncertain to the listener. Google Music Theory, and you'll learn that the Chromatic Scale is built around the tension between the dominant and the tonic. Occasionally, rather than deliver the punchline, a composer will leave his or her listeners hanging on a suspended-chord or a leading-tone. To experience this cliffhanger, listen to a track with a bass drop, but turn it off right before the actual drop.
Similarly, humor revolves around setting up an ambiguous expectation and resolving it. The proposed neural mechanisms vary. But jokes always seem to involve a set-up, and a punchline which is unexpected, yet satisfying. And I think this is because the context is resolved. pg shared a related idea in one of his essays about ideas: "That's what a metaphor is: a function applied to an argument of the wrong type." [2]
With the above in mind, I believe it's possible for today's computers to predict whether a human will find something funny or not. But unless they'll be taught which types of topics humans considered relevant (i.e. deixis), computers would find themselves at a significant disadvantage.
> We have to let go of the machine if we ever want to really move the state of the art forward.
Probability theory is already used in AI. That's really cool, but I don't think the art as a whole needs to move forward. Though both are Turing complete, speech and programming languages are optimized very differently. I've already pointed out the different constraints. But also notice that while programming primarily aims toward conveying instructions, human speech encompasses a wider spectrum of goals. Meticulous clarity will have a higher impact on instructions like "automate this task" than declarations like "broccoli tastes weird".
> There are an infinitude of expressible ideas, but lacking the proper medium to abstract the expression of these ideas formally (like natural language and our brains do, well, naturally) we will never get a chance to find out what we don't know!
I'm not sure exactly what this is getting at. Incidentally you may enjoy learning about Solomonoff Induction. [3]
Recently I had to write a program for an embedded Linux router which ran on a MIPS architecture and had a 2MB flash. I only had about 40kb of space to fit the application on. I was able to get a binary that was compiling to more than 1.5mb down to 20kb through using a combination of gcc tricks like separating data and code sections, eliminating unused sections, statically linking some libraries and dynamically linking against others. It once again gave me immense appreciation for having a language and toolchain that can give you this power for those 1% of problems your career might depend on.
For amusement, the relevant section of the Makefile I ended up with:
I'm unsure how many other languages/toolchains give you that sort of flexibility down to the linking level. Also it's self contained and doesn't require some kind of "virtual machine" or interpreter to run it.
> I'm unsure how many other languages/toolchains give you that sort of flexibility down to the linking level. Also it's self contained and doesn't require some kind of "virtual machine" or interpreter to run it.
Almost every language that has an ahead of time compiler to native code.
That makes me wonder why "eliminating unused sections" isn't a default, as it feels like the compiler is doing a lot of unnecessary work if it's generating 1.5M of output that actually has only 20k of useful stuff in it.
It's only that big because of static linking. Typically you're never statically linking your applications, but in this case it's necessary if you want to use libraries but only want the space-overhead from the code and data you use from the libraries.
It reminds me of a quote I read in the book Expert C Programming: Deep C Secrets (an excellent book on C btw), that read: "Static linking is now functionally obsolete, and should be allowed to rest in peace." I kinda chuckled a bit when I saw it.
> But those aren't the reasons why most C code is in C. Mostly, C is important simply because lots of code was written in it before safer languages gained momentum...
I disagree. Certainly in the FLOSS community, I don't think this is true.
C is a lowest common denominator. No higher level language has "won". So if you want the functionality in a library you write to be available to the majority, you will need to make it available (ie. provide bindings for) a number of high level languages. The easiest way to do this is to provide a C-level API. This works well because the higher level languages are all implemented in C. This isn't because C is more popular, but because it is a low level language. The easiest way to provide a C-level API is to write the code in C. So: library writers often write implementations in C.
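Concretely, the kind of C-level API that binding generators cope with easily looks something like this (all names invented):

    /* Opaque handle plus plain functions: no templates, no exceptions,
     * nothing that assumes a particular language runtime. */
    typedef struct widget widget;

    widget *widget_create(const char *name);     /* returns NULL on failure */
    int     widget_frob(widget *w, int amount);  /* returns 0 on success */
    void    widget_destroy(widget *w);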
There are three alternatives:
1) Independently implement each individual useful piece of functionality in every high level language. This does happen, but more general implementations tend to move quicker, since they have more users (because they support multiple high level languages) and thus more contributors. The number of contributors might dwindle because of the requirement to code in C, but I don't think this has happened to a significant enough extent yet.
2) Implement libraries in a higher level language and then provide bindings to every other popular higher level language. This can be done, but I haven't seen much of it. Higher level languages seem to make it easier to provide bindings to a C-level API rather than to APIs written in a different higher level language. This may be something to do with impedance mismatches in higher level language concepts.
3) A higher level language "wins", and everyone moves to such an ecosystem. This can only happen if other higher level languages lose. I don't think there is any sign of this happening.
> This works well because the higher level languages are all implemented in C
Not actually true; there are many high level self-hosted languages, OCaml, Haskell, Forth, Lisp, etc etc. But all these languages generally prioritize having a good C FFI. It is interesting that e.g. Thrift, Protocol Buffers et al don't seem to have made much of an inroad here.
> It is interesting that e.g. Thrift, Protocol Buffers et al don't seem to have made much of an inroad here.
I'm actually working on a tool called Haris to deal with this very problem. I'm looking to do structured binary data serialization in a way that's efficient, lightweight, and portable (in that it conforms to the C standard). Keep an eye out, I'll probably be posting it on Hacker News within the next few weeks.
Thrift, protobufs don't really work for in-process communication because the cost of serialization can get pretty high - you can easily blow most of your CPU time just serializing protobufs across language boundaries (been there, done that). I think Cap'n Proto may be an interesting approach - the memory layout is the same in all languages, and Kenton is reportedly working on 0-copy IPC - but it remains to be seen how it will turn out.
> This isn't because C is more popular, but because it is a low level language.
I consider C to be high level enough to be reasonably portable and expressive, but at the same time low level enough that it can do a lot of things higher level languages can't.
> And there's no simple excuse for the preprocessor; I don't know exactly why that exists, but my guess is that back in the 1970s it was an easy way to get at least an approximation to several desirable language features without having to complicate the actual compiler.
Clearly this guy has never had to deal with a large, complicated code base in C. Dismissing the preprocessor as a crutch for a weak compiler shows a significant ignorance about the useful capabilities that it brings.
I assume when he says "no simple excuse", it's more pointing to the massive problems that the mere existence of the macro pre-processor introduces for reasoning about the text of any C or C++ program, for programmers, tools, and compilers.
I've worked in a code base where, tucked away in a shared header file somewhere up the include chain, a programmer had added the line
#define private public
(because he wanted to do a bunch of reflection techniques on some C++ code, IIRC, and the private keyword was getting in his way)
Now regardless of whether that's a good idea, if you are reading C or C++ code, you always have to be aware, for any line of code you read, of the possibility that someone has done such a thing. Hopefully not, but unless you have scanned every line of every include file included in your current context recently, as well as every line of code preceding the current one in the file you're reading, you just can't know. Clearly this makes giant headaches for compilation and tools, as well.
So yeah, of course every mid to large C / C++ program uses the macro pre-processor extensively. You can do useful things with it, and there's no way to turn it off and not use it, anyway, given the way includes work in C / C++, so you might as well take advantage of it.
But it's not an accident that more recent languages have dropped that particular feature.
The only time I have seen actual conditional compilation used in C# was hilarious - I was given a pile of code that had preprocessor directives used to make private methods public so that they could be unit tested. However, if you switched the compile time flag to the "production" state nothing compiled....
I doubt he meant that it was not useful, rather that the usefulness might have been better served as a function of the compiler rather than some disconnected transformation tool.
Of course this thought process would eventually bring you down the road of macro systems such as those in Lisps, but that's going to be more difficult with a language lacking homoiconicity of code and data.
There are macros in many non-homoiconic languages (eg. Rust, Dylan), and there are add-ons for several of the languages lacking them (eg. SweetJS for Javascript, MacroPy for Python).
This is the reason that stops me from going back to C. After coding in Java (mostly) for the past 10 years, I wanted to switch back to C or C++, mainly to save a ton of memory being used which I think is unwarranted.
So I experimented with a new service and coded it in all three: C, C++ and Java. When I did this I had not coded in C++ for 10 years, but it did not hurt at all; I could switch back with no great difficulty. There were some minor inconveniences in forgoing the Eclipse editor. I think I might have missed autocomplete the most.
But within hours of starting, I was getting my old feel for coding C++ in Vi(m) back. And with the benefit of having the STL (vectors, strings, etc.) I did not feel much discomfort.
But coding the same service in C was painful. And it was mainly because of not being able to do basic things with strings easily, like copy and concatenate.
But thankfully I still managed to do it. And on comparing the three services for latencies and memory usage, I found little difference between C and C++.
So eventually that service was deployed in C++ and still runs the same way.
The above episode happened about a year back, and recently I have been using Go for a lot of services (new ones as well as moving some old ones). Mainly I have been motivated by the promise of an easier C, which it seems to offer.
Some services coded in Go I have deployed and they are already running very well. But even now, I need some more experience on the results side to have a definitive opinion on whether Go is indeed C with a strings lib (and other niceties) for me.
Most languages get string processing (and its closely related cousin, localization) wrong, even the ones with string classes, so I don't really get my jimmies rustled on C's anemic native string support.
On large enough projects, you end up with all kinds of custom logic around user-entered and user-facing strings, so the lack of native string processing is really only a drawback for tiny and proof-of-concept projects, which aren't really what you use C for anyway.
That being said, the right way to do string processing usually ends up looking a lot uglier than the way we are used to.
Sorry for the delay in replying. No, I did not, actually. See, I was coming back after a while; I had quickly shifted to C++ (after coding briefly in C) in my career, so I did not remember using any libraries.
I am sure, my task would have been easier if I had used some lib. But my main concern (and goal) was performance and memory usage comparison.
Knowing a bit of C but often programming in just about any other language, I was recently inspired to work with lower-level languages like C++ thanks to a bunch of talks from Microsoft's Going Native 2013. Specifically Bjarne Stroustrup's The Essence of C++: With Examples in C++84, C++98, C++11, and C++14 -- video and slides at http://channel9.msdn.com/Events/GoingNative/2013/Opening-Key...
C++ really has changed and is changing from what I learned back in university. It's quite exciting. They seem to be standardising and implementing in C++ compilers the way HTML5 is now a living standard with test implementations in browsers. See also: http://channel9.msdn.com/Events/GoingNative/2013/Keynote-Her...
He's a university prof. When's the last time they ever heard applause? ;-) Really, in part, I think it was because he was trying to finish a thought and was going to summarize the feature again in two slides. But yeah, I'd have cheered. It felt like C++ the Steve Jobs keynote, in a way. All the good parts, right in front of you, shipping "soon".
For the Google index, I'd also like to note that LLVM 3.4 released three weeks ago, has support for C++14 in Clang. Incidentally it also mimics Visual C++'s compiler from Microsoft in Visual Studio. I'd expect it to ship with an Xcode 5.1 alongside iOS 7.1. Now that I think about it, I should run to Apple's developer site and see if that's true. :)
Edit: Xcode 5.1 beta 4 ships with Apple LLVM version 5.1 (clang-503.0.9) (based on LLVM 3.4svn) according to Google searches. So I guess that's a yes. I'm off to try it now. :)
C is a great language - it lets you get down and dirty with the computer.
However, the one huge downside to programming in C is having to deal with strings. Let's face it, C strings are absolutely terrible. For such an important feature, the string implementation of null terminated char* is just miserable to work with. See:
http://queue.acm.org/detail.cfm?id=2010365
That is a problem with the stdlib and not so much the language. There is nothing stopping you from creating a more robust string implementation that stores size and does bounds checking for some operations. (You can't easily/safely enforce anything like string immutability at the language level... unless anyone out there can think of a way.) I've seen it before, but not often. Usually the C code I'm working on does very little string manipulation and speed/size matters.
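A sketch of what such a string might look like (names invented; similar in spirit to existing libraries like sds or bstring):

    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    /* Length-carrying string; data is kept NUL-terminated for interop
     * with ordinary C APIs. */
    struct str {
        size_t len;
        size_t cap;
        char  *data;
    };

    /* Bounds-checked append: grows the buffer instead of overflowing it.
     * (Real code would also guard the size arithmetic against overflow.) */
    int str_append(struct str *s, const char *src, size_t n)
    {
        if (s->len + n + 1 > s->cap) {
            size_t newcap = (s->len + n + 1) * 2;
            char *p = realloc(s->data, newcap);
            if (!p)
                return -1;              /* report failure, don't corrupt */
            s->data = p;
            s->cap = newcap;
        }
        memcpy(s->data + s->len, src, n);
        s->len += n;
        s->data[s->len] = '\0';
        return 0;
    }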
Well, to be fair, strings are much nicer to work with when you have a suitable overload for '+', and that very much is a problem with C the language. Same goes for 3d vectors in C as well.
That's your opinion. I (and many other C programmers) find it refreshing that when I see a "+", a couple numbers are going to be added together. Nothing else could possibly happen, and I don't have to cross-reference the types of the operands to figure out whether a method in some far-off source file is going to be called.
C is willing to hide a lot of numeric type casting details about doubles vs floats vs unsigned ints vs chars behind the magic of the "+" character, so maybe you're willing to endure a lot more type-dependent compiler magic than you're letting on here, despite so graciously speaking for many other C programmers. In fact, "p+1" might very well mean "p+4" if p is an int *. Or maybe it means "p+32" if p is a FILE * (on my compiler). Or maybe it means... Wow! Wait a second! That seems pretty type dependent to me, come to think of it! But that's just my opinion.
Look back; I never said the + operator wasn't type dependent. It clearly is. It's certainly true in C that no matter what, the + operator adds two numbers together. The exact nature of that addition depends on the types of the operands, but the rules for that are dead-simple, and most importantly, aren't extensible; I can't include a header file that will change those simple rules.
i.e. Once you understand how pointer arithmetic works, and you know how C's type promotion scheme works, the meaning of any addition expression is basically evident, and you have a few guarantees about the behavior of the program (for example, I can reasonably expect that my addition won't take more than a couple clock cycles, depending on what kind of casting needs to be done and the like). In languages that support operator overloading, + can literally mean anything.
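For anyone following along, the scaling in question (a tiny sketch):

    #include <stdio.h>

    int main(void)
    {
        int a[2];
        int *p = a;
        /* Pointer arithmetic scales by the pointee size: the two addresses
         * below differ by sizeof(int) bytes, typically 4. */
        printf("%p\n%p\n", (void *)p, (void *)(p + 1));
        return 0;
    }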
BUT clearly there is a world of difference between saying "it would be nice to have char* + char* as shorthand for string concatenation somehow" - which is roughly what I was saying - versus saying "all operators in a language should be arbitrarily overloadable", which is roughly how you responded.
Now, I'm not saying there is an obvious way to handle "char* + char*" meaningfully or safely in C. But on the other hand,
char stackString = "some literal";
is generally handled in an entirely different fashion from
int myVar = 7;
if both are declared as local stack variables - and C programmers generally have no problem learning that "=" is going to mean something quite special when declaring string literals this way, compared to other data types. Because as you say, it's just one more language rule.
The only real places where I miss operator overloading in C are when dealing with strings in performance non-critical places, any time I'm using 3D vectors, and any time I'm writing matrix math. If those were handled as primitives in the language (and as the string literal example shows, C already does go partway down that road), I'd very happily part with arbitrary operator overloading.
Uh, in what way is that "entirely different" from the int declaration? In both cases, you're just copying a couple words from one place to another, which is what the = operator does. There's nothing "special" about that declaration; we're just writing a pointer into a variable on the stack.
My feeling is that you're uncomfortable with C's treatment of strings because you don't entirely understand the memory model.
However, this is an argument that Bjarne often uses in the context of C++, and it drives me a bit batty. For example, he'll wax eloquent on how it is fine that there is no multidimensional array type built into the language, because look how easy it is to write one yourself, or use one somebody else wrote.
And, at the first approximation, that is surely true. Heck, I'm taking time out to post this from working on a Kalman Filter class I wrote that uses a hand coded multidimensional array class.
But the problem is that there are thousands of string libraries out there, and thousands of multidimensional arrays, and so on. And they don't play nicely with each other.
Heck, we use char *, std::string, CString, and QString all in the same project. You can guess the evolution - a bunch of old library code written 10 years ago by people who didn't trust/like the new-fangled std::string stuff. External library code written in C. Then code written in modern C++ with an aim to be portable (std::string). Then some MFC UI code, and more code written by MFC people that didn't care about strewing that dependency in places where it didn't belong. And now we are in Qt, and I have to really clamp down on the code reviews to keep QString from straying beyond the UI components. Ugh.
Hey, I love C, and this is not a rant against the language. C strings have their place. I can't tell you how many times I've written code along the lines of:
char* c = some_data();    /* grab a pointer to the raw data */
c += header_size;         /* skip past the fixed-size header */
name = extract(c, ',');   /* scrape out the field up to the delimiter */
You get the idea. At one time that was the state of the art way to do string processing without the cost of a lot of extra creation/deletion. You just use pointer arithmetic, move along a data source, scraping it as you go.
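In real code that pattern looks something like this (a self-contained sketch using strchr; extract above was just a stand-in name, and error checking is omitted):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *data = "HDR1;alice,30,engineer";
        const char *c = data + 5;             /* skip the fixed-size header "HDR1;" */
        const char *comma = strchr(c, ',');   /* find the next delimiter */
        char name[32];
        size_t len = (size_t)(comma - c);
        memcpy(name, c, len);                 /* scrape the field out, no allocation */
        name[len] = '\0';
        printf("%s\n", name);                 /* prints "alice" */
        return 0;
    }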
But these days we aren't so interested in that kind of scraping, and are far more interested in higher level problems. And we have no standard to fall back on, in C. Sure, I can elect to use a third-party string library, but that only works under the almost-never-satisfied condition that every line in our project is written in house, or that every third-party library we pull in either uses the string library I chose or inter-operates with it seamlessly. Frankly, I've never been in a situation where either of those held. So we end up either reverting to the mean (C strings) or making endless conversion calls to switch from one form to another.
> If you've used Java or Python, you'll probably be familiar with the idea that some types of data behave differently from others when you assign them from one variable to another. If you write an assignment such as ‘a = b’ where a and b are integers, then you get two independent copies of the same integer: after the assignment, modifying a does not also cause b to change its value.
This is incorrect when it comes to Python. a and b will be two different names for the same integer object, which is stored in a single memory location. The difference is that Python guarantees that integers are immutable.
Arguably, to a user of the language, these are indistinguishable from independent copies. Changing one cannot change the other (except perhaps through some exotic double-underscore-prefixed function with which my vague knowledge of Python is unfamiliar).
> But if a and b are both variables of the same Java class type, or Python lists, then after the assignment they refer to the same underlying object, so that if you make a change to a (e.g. by calling a class method on it, or appending an item to the list) then you see the same difference when you look at b.
His point is that there isn't a difference in python: in both cases you're just changing labels, but in the case of integers they're pointing to immutable objects.
I was 13 and having written assembly for years I finally got a machine that was actually equipped for running a full-blown C compiler. Compiling was slow and the produced code was slow but all I could think of was how easily I could generate [assembly] code with just a few lines of C. Loops, pointers, function calls, conditionals... just like that. Wow. So productive.
C felt like writing assembly but with much better vocabulary. C was to assembly language what English was to the caricatured "ughs" of the stone age.
I often compared the output of the compiler to what I would've written myself: the output was bloaty, the compiler was obviously not very smart, but it did do what I wanted and the computers had just got fast enough to be able to actually run useful programs written in C without slowing down the user experience. So you couldn't necessarily distinguish a program written in C from a program written in assembly, and you could "cheat" by choosing C instead. That was so exciting!
The thing is, however, that since these trivial insights of my youth it turns out that C actually never ran out of juice.
I still write C and I'm enjoying it more than ever.
In C, I've learned to raise the level of abstraction when necessary, and writing C in a good codebase is surprisingly close to writing something like Python - except several dozen times faster, and you can lay out your memory and the little details in whatever way best suits each context.
I love doing all the muck that comes with C. String handling, memory management, figuring out the best set of functions on top of which to compose your program, doing the mundane tasks the best way in each case, and never hitting a leaky abstraction like in higher level languages.
The thing is, the time I "waste" doing all that pays me back tenfold, as I tend to think about the best way to lay out my program while writing the low-level stuff. Because such effort is required, there's a slight cost to writing code, which makes you think about what you want to write in the first place.
In Python you shove stuff into a few lists and dicts, it just works, and you figure out later what it was that you really wanted and clean it up. But often you're wrong, because it was so easy in the beginning. In C, I have to think about my data structures first because I don't want to write all that handling again for a different approach. And that makes all the difference in code quality.
However, I don't think you could impose a similar dynamic on a high-level language. There's something in low-level C that makes your brain tick a slightly different way and how you build your creations in C rather than in other languages reflects that. The OP said it very well: C reflects the reality of what your computer does. And I somehow love it just the way it is.
I've worked most of my career in higher level languages but I've never set C aside. It has always been there, even with Python, C++, or some other language. Now I'm writing C again on a regular basis and with my accumulated experience summed into the work it's truly rewarding.
I was much the same - started out with Z80 asm, moved onto x86 shortly after that, and never really liked HLLs (including C) until I was almost 18 - I always felt I could do better than the compilers at the time (and I did), so there wasn't any reason to move up. I still use C and x86 asm frequently, more the former now, but I'll sometimes go back to something I wrote in C before and start rewriting bits of it in asm just to see how much smaller I could make it.
> Because such effort is required, there's a slight cost to writing code, which makes you think about what you want to write in the first place.
It also tends to make you think of the simplest, minimal design that works, and that translates into more efficient and straightforward code. Higher level languages make some things really easy, but then I always feel a little disappointed by just how much resources I'm wasting afterwards.
Please do people a favour and either format this as code or add more linebreaks, at the moment it makes stuff rather uncomfortable to read (until one clicks ‘Fit to Width’).
Yeah sorry, I wrote it like that to make it more confusing...
But it's actually not too confusing. A brief explanation:
The first 75%ish, the bit with all the +s, is filling up the elements of the BF array with the ASCII value of all of the letters contained in the sentence. And the second 25%ish is just scrolling to the right location and printing out the letters, hence lots of <>s and .s. During the fill-up stage each ASCII value is made using two of its factors, in an attempt to reduce the number of characters. So each ASCII letter looks like this: N+[-<M+>], where N and M are the two factors chosen. For example 32, an ASCII space, is ++++++++[-<++++>], N=8, M=4.
I'm sure this isn't the most character efficient for short sentences, but it might not be too bad for paragraphs.
PS, in analysing that I have noticed there is a pointless extra > at the start.
if you're not familiar with simon tatham, do poke around his site [http://www.chiark.greenend.org.uk/~sgtatham/] - he has an eclectic and delightful assortment of code and writing. probably best known for putty, but the rest of it is a lot of fun to browse through.
The article does exactly what it sets out to do: introduce C to programmers used to more modern languages.
I started programming in C again a few months ago after a 15 year hiatus and the language I remembered loving seemed strange and tedious. This would have been a great reminder of the many differences that after a while you just take for granted. Something similar would be useful for most languages but just more so for C (or say, FORTRAN).
My only quibble would be that while malloc/free are covered, many variables are simply automatically allocated and deallocated on the stack. C's dual approach to memory management is yet another frequent source of confusion.
I love C. It wasn't my first language to jump in to, but it was eye opening to see the power of pointers and low level operations. Java just couldn't get me close enough to the system.
The article only discusses the 'extremities' (C vs. Python/Java, etc...) when there is an obvious and popular 'compromise': C++, which has most of the discussed advantages of both sides. (Although it has some drawbacks; it is a bit more difficult to master than either C, Java or Python.)
As someone who swore off c after a college class and an experience with perl (three cheers for memory management), this was a great intro article to the idioms of c.
I first learned "true" programming with Perl. My next language was C to learn how to program microcontrollers. Incidentally, they remain my two favorite languages even after Python, C++, Java, Tcl, shell, LISP, and multitudes of assemblies. I feel these two languages cover most of my uses.
I just wish the built-in XS Perl<>C integration was simpler. I need to look into some CPAN modules for a more ctypes-like interface. Better yet, it should use libclang to autogenerate bindings!
interesting read. one of the later comments is a bit off the mark though:
" As a direct result of leaving out all the safety checks that other languages include, C code can run faster"
C is fast not just because of missing safety checks but, more generally, because you don't pay for features you don't use. Things like function calls and reading data are not complicated by run-time type logic, for instance - this is very important; it's why you can write an Objective-C class with the same content as a bunch of C functions and the C functions will be (sometimes very significantly) faster.
This is one example, but many language features in high level languages suffer from similar performance problems - by being super generic and ultra late binding they can never perform as fast as a clean implementation which knows everything at compile time.
If you want dynamic late binding type functionality in C you have to do it yourself...
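i.e. the classic hand-rolled version, function pointers in a struct (a minimal sketch, not any particular library's scheme):

    #include <stdio.h>

    /* an explicit "method table": the late binding is one pointer load plus an indirect call,
       and you only pay for it where you actually ask for it */
    struct shape_ops {
        double (*area)(const void *self);
    };

    struct circle {
        const struct shape_ops *ops;
        double r;
    };

    static double circle_area(const void *self) {
        const struct circle *c = self;
        return 3.141592653589793 * c->r * c->r;
    }

    static const struct shape_ops circle_ops = { circle_area };

    int main(void) {
        struct circle c = { &circle_ops, 2.0 };
        printf("%f\n", c.ops->area(&c));   /* dispatched through the table at run time */
        return 0;
    }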
The really nice thing about Objective-C, of course, is that you can just dip in and out of plain C as the mood takes you (e.g. where speed matters, or perhaps where you're just doing numerical stuff and C is less verbose). I've come to C via Objective-C, and (like many other commenters here) have found it incredibly satisfying.
What a great explanation. I have been doing some low-level Go programming recently (including implementing the writev syscall), and I think this document would also be useful for Go programmers.
This is really great. I have found myself saying some of these same things when explaining things. Going to keep this in my pocket to use in the future. Thanks!
I was a bit bothered that, with all the talk about malloc, it was never highlighted that not all memory needs to be manually freed: local (stack) variables are quite safe, which is why it's a common pattern to pass pointers to local variables into functions so they can store their results there. There are common cases where you really do need malloc, but those should be treated as the exception.
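The everyday shape of that pattern, for anyone new to it (a minimal sketch - nothing here is malloc'd or freed):

    #include <stdio.h>

    /* the caller owns the storage; the callee just fills it in through the pointers */
    static void get_point(int *x, int *y) {
        *x = 3;
        *y = 4;
    }

    int main(void) {
        int x, y;        /* automatic (stack) variables: gone when main returns, no free() needed */
        char buf[64];    /* same for local arrays */
        get_point(&x, &y);
        snprintf(buf, sizeof buf, "(%d, %d)", x, y);
        printf("%s\n", buf);
        return 0;
    }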
I really enjoyed this article. I started out programming in C, then quickly on to Java.
I didn't appreciate the language at the time, but with hindsight, the fact that you need to worry about memory allocation and performance means you've a better understanding of what's happening on the underlying system.
I completely agree. I took one semester of Java then one semester of C when I was first learning to code. I think that extra semester of C gave me a significant advantage over some of the other programmers in my next school because I had a much better understanding of what was going on under the Java straight-jacket.
"the length of the array isn't stored in memory anywhere"
This is probably not true. For arrays on the heap, the size (or an approximation e.g. number of pages the array spans) would have to be stored somewhere in order for the array to be deallocated. For arrays on the stack, the size is either known at compile time, or else it was at least available when the array was allocated and could be kept in the stack frame (and in many cases would be kept in the frame anyway).
Not only that, but the common pattern of passing a pointer to an array and its length as arguments to a function implies that most of the time C programmers keep the length of the array stored somewhere. You are really talking about niche cases where the length of the array is truly and inherently unavailable.
Really this has more to do with the fact that C is meant to do as little as possible for programmers -- it is supposed to be "close to the machine."
I'm sorry, but I think the article's version is actually closer to the truth.
Regarding stack arrays, while the compiler certainly knows how large an array is (of course it has to - you're using the length in the code it compiles), it will almost certainly not emit code or immediates that record that size anywhere in the resulting binary. If you allocate three arrays, each of 12 doubles, a compiler can simply emit "sub esp, 288" and be done with it. It also won't stop you from referencing memory at an unreasonable offset from any of those arrays. (Compilers can warn you, though.)
Passing the array along with its size further goes to prove the original statement, not refute it. If you have to do something manually, it strongly implies that the language/runtime isn't doing it for you.
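Concretely, the manual idiom looks like this (a sketch): inside the callee, sizeof only tells you the size of a pointer, so the length has to travel as its own argument.

    #include <stddef.h>
    #include <stdio.h>

    static double sum(const double *a, size_t n) {   /* 'a' is just a pointer in here */
        double s = 0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }

    int main(void) {
        double xs[12] = {0};
        printf("%zu\n", sizeof xs);                        /* 96 with 8-byte doubles: the compiler knows */
        printf("%f\n", sum(xs, sizeof xs / sizeof *xs));   /* the callee only knows because we told it */
        return 0;
    }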
Regarding arrays on the heap, the runtime even then does not require you to place a size anywhere. What likely happens is the right cell from a bucket of the closest size to the allocation you need is returned to you and marked in use. It will be as large as the size you're allocating, or larger. The only artifact of its original size is which span of memory it's located at. When you free() that allocation, that block simply gets marked as free, no size needed.
> What likely happens is the right cell from a bucket of the closest size to the allocation you need is returned to you and marked in use. It will be as large as the size you're allocating, or larger. The only artifact of its original size is which span of memory it's located at.
Almost every segregated-fit allocator I've ever seen still stores the size of each cell in the bucket header, although of course you only need to store the size once for the whole bucket since each cell is the same size. Still, it's stored somewhere. Also, only some allocators are segregated fit. dlmalloc, for example, which is very widely used, uses boundary tags, storing the allocation size with the allocation itself.
"What likely happens is the right cell from a bucket of the closest size to the allocation you need is returned to you and marked in use"
...implying that the size of the block is, in fact, available to the allocator. This might not be stored explicitly (it could be stored as a pair of pointers) but it is not unavailable. It is also necessarily available to the deallocator, which must in some way be aware of what it is marking as free. As I said, this might only be an approximation of the size of the array, though for bounds checking purposes it would usually be good enough (since nothing else should be allocated in the "excess" space that is marked as in-use).
"Passing the array along with its size further goes to prove the original statement, not refute it. If you have to do something manually, it strongly implies that the language/runtime isn't doing it for you."
Take a closer look at the original statement. He did not merely say that the size is unavailable to programmers, he said it is not available at all. That is not really true -- it is more that the compiler does not do anything useful with the information.
To put it another way, a compiler could conceivably emit bounds-checking code but still not provide programmers with an explicit way to get the size of an array. The result would be the same: programmers would still be forced to pass array sizes around, to avoid a call to abort (or whatever behavior occurs when a bounds check fails).
No it couldn't, because the compiler cannot make assumptions about the heap allocator beyond what the C spec says, which is basically just the function signatures. An allocator that never frees memory is entirely within the C spec and would not need to know individual allocation sizes after allocation.
Or, if a compiler decided to do it anyway, it would have to link a compiler-specific standard library, because it's entirely implementation-dependent what bookkeeping information is stored and where. In turn, this means you couldn't safely link code compiled by different compilers. Bjarne might think that's fine and dandy, but C never had that level of damage.
The fact that C programmers pass around explicit lengths kind of illustrates the point the author is trying to make, and that you're arguing with. The allocator stores enough state to make "free" work, but that state is not idiomatically available to C programmers. It would be a code smell if a C programmer pawed around malloc to get the size of a block. If you need to know the size of something in C, you arrange to always know it.
It would be way, way, worse than a smell. It'd be a clear-the-room-this-sh*t-is-baad type of situation. At least in my book, but I can be somewhat delicate in my sensitivities sometimes. :)
And yes, of course it's true that C APIs use explicit lengths because they're nowhere to be found at run-time. We didn't all just miss that and add an extra argument for fun.
Consider the case where you have a struct containing something in addition to an array. If you malloc sizeof the struct, then the array size is likely not to be found anywhere.
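E.g. (a sketch):

    #include <stdlib.h>

    struct packet {
        int  kind;
        char payload[256];   /* the 256 exists only in the type, at compile time */
    };

    int main(void) {
        /* the allocator records, at best, that roughly sizeof(struct packet) bytes are in use;
           nothing at run time says "payload holds 256 chars" */
        struct packet *p = malloc(sizeof *p);
        free(p);
        return 0;
    }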
In an attempt to be pedantic for no reason you have opened yourself to criticism by other pedants ;) The words heap and stack are not even mentioned in the C standard specification. You're conflating the implementation with the language itself. The post was about the language.
I have often wished for this. If you can pass a pointer to free() and it frees the previous allocation, it must "know" how big it was, so why is there no way to ask it? You can obviously stash it yourself at malloc() time in a lookup table, but still.
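The simplest stash-it-yourself version just puts a size header in front of the block rather than keeping a separate lookup table (a sketch, glossing over max-alignment pedantry; the names are made up):

    #include <stdlib.h>

    /* remember the requested size immediately before the bytes handed back to the caller */
    static void *sized_malloc(size_t n) {
        size_t *p = malloc(sizeof(size_t) + n);
        if (!p) return NULL;
        *p = n;
        return p + 1;
    }

    static size_t sized_size(const void *p) { return ((const size_t *)p)[-1]; }

    static void sized_free(void *p) { if (p) free((size_t *)p - 1); }

    int main(void) {
        void *p = sized_malloc(100);
        size_t n = sized_size(p);   /* 100 */
        sized_free(p);
        return n == 100 ? 0 : 1;
    }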
That's a lot of words just to be pedantic. What a heap allocator does in its bookkeeping is not necessarily "storing the length of the array" - and whatever it does isn't sensibly available to the running program. Yes, the size of a static array is known at compile time, but its length is still not available to the running program unless the programmer explicitly arranges for it to be.
I looked into this. It seems that OS X provides malloc_size. [0] I wonder if other *BSDs have it or if it is a XNUism. I can't find an equivalent for glibc, which I find disappointing given how immense glibc is. Even if it is potentially a "shoot yourself in the foot" feature, I can see many judicious uses of it.
And even with such a size query, it is still obviously not something nice and built into the language. But C folks are used to that! =P
malloc_size won't give you the size of the array though, it'll give you the number of bytes malloced. malloc is free to return any number of bytes, so long as it's at least the size requested.
Perfectly cromulent C, and I submit the result of the call is not what you might hope it will be.
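For a concrete flavour of that, a sketch (assuming macOS's malloc_size from <malloc/malloc.h>; the exact number is up to the allocator):

    #include <malloc/malloc.h>   /* macOS-specific */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        void *p = malloc(100);             /* ask for 100 bytes... */
        printf("%zu\n", malloc_size(p));   /* ...and likely get told something bigger, e.g. 112 */
        free(p);
        return 0;
    }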
The article is correct. There is no way in the C language to get the size of the array. Any implementation is, of course, free to build anything extra into the memory allocator, but it is not part of the language. Specifically, you cannot add anything to the array itself, because the memory layout is prescribed: ptr[0] must be the first byte of memory that was allocated, and if you prepended some kind of size table, ptr would be pointing at that instead.
>C is quite different, at a fundamental level, from languages like Java and Python.
I suppose, but functional languages are even more different. From Python, the ascent to Lisp or Haskell is much more difficult conceptually than the descent to C.
The Java language is designed to be compiled to a Java Virtual Machine (JVM) which is a software implementation of a computer that doesn't exist (easy there existential freaks, I'm trying to keep this brief). The VM abstracts away the hardware differences between platforms, in theory allowing the Java language greater freedoms. Simple things like endian-ness are no longer a language issue as the VM says all machines are big endian. Compare this to C where the code is compiled to a native format that is very much tied to the hardware/OS platform.
Python (as most commonly implemented) is also compiled to bytecode that gets executed by a virtual machine. So it's at the same "level" as Java, really.
Java and Python are on the same level in relation to the hardware. They both get compiled to bytecode instructions that get executed by a virtual machine (basically a software program that is pretending to be a processor). The source code for that virtual machine is written in C or C++. In order to execute Java code or Python code, the machine must have the VM installed. But it doesn't matter what type of processor you are running the code on, just as long as it has the VM installed.
A C/C++ compiler takes your source code and then compiles & assembles it to binary machine instructions that are executed directly by the processor. It is not necessary to have a C/C++ compiler installed on a machine in order to execute that binary. But the code must have been compiled to the machine language that type of chip understands. That's why you can't take a program that was compiled for an Intel chip and then go run it on an ARM chip. The different chip won't understand the instructions. So you have to compile the C source separately for each type of target chip.
That's why C is considered more low-level than Java and Python. The output of the compiler is executed directly by your processor hardware, instead of by a VM.
That's because I was responding at a level of detail appropriate for the parent comment:
> I thought Java was on a much lower level than Python. Why is C more low-level?
At that level of understanding, I think it is just fine to assume that Java means Oracle JVM, Python means CPython, and C/C++ means gcc (or your choice of C-to-native compiler). Yes there are many alternative implementations of these languages, but for the most part those are pretty obscure and a beginner does not need to know or care about them just yet.
While I agree with your comment in spirit, I would like to point out that this is how beginners get messed up with implementation and language concepts.
Well, did you read the article? It describes manual memory management, which is probably the most obvious reason C is lower level. It also talks about how strings are stored directly as arrays of characters as opposed to objects wrapping character arrays. Finally, C pointers are closer to how the computer operates (they're basically memory addresses) than anything in Python or Java.
What constitutes a "low level language" is a matter of perspective, however. From the perspective of writing binary instructions by hand, all three languages are "high level." ;)
Manual memory management, mostly. Also, since C is a simpler language, its building blocks map more closely with what actually happens on hardware level (or they used to... back in the 1970's).
Then, there's the fact that you do not have direct access to native OS primitives from Java (i.e. system call vs stdlib function), but it has to do more with the run-time environment than the language itself.
This looks like a pretty good summary. The stylistic way it's written as if about a 'foreign' language sure makes me feel old though. Twenty years ago, this was completely normal. When I went to work, the code was this. This is what there was. It wasn't "low level" or esoteric - just a nice language to feed through the compiler to get executable code.
I went straight from Apple BASIC, FORTH, and 6502 assembler to Turbo Pascal, and C seemed esoteric to me at first, and even unnecessary -- compared to Turbo Pascal that is. Then when DOOM was released compiled with Watcom C, I realized this 32-bit C compiler thing was a big deal.
Some time ago, I did some Free Pascal programming and was surprised that it was much nicer than C. But C won the war because of Unix OSs, not because it's such a good language.
What baffles me is that this is the type of coding young developers associate with C and C++, whereas we had lots of options to choose from, as you well list.
Customers have been demanding books about the WWW, so book stores have been flooded with titles on whichever garbage-collected language is hot for web development at the moment.
Since the dot-com boom, your average novice trying to put food on the table has been unlikely to pick the C title off the shelf when the Java or Ruby book is far more likely to fetch a job in six months. Until recently, the perceived cost of C/C++ tools for mainstream platforms also made the simple Java downloads far more attractive.
Just getting started learning to program myself in 2000, I remember Perl and Java were the things, and PHP a year later. I had to go out of my way later on to learn C.
Yes. In fact in my local Waterstones, where you supposedly pay a little more so that the knowledgeable staff can help you, I recently saw a sign for books: "C, C+, C++".
Yes! It makes me feel like my entire CS education (early 90's) was one long hazing ritual with grad students and professors laughing at us in the backroom...while we're busy trying to fix one-off indirect pointer errors.
valgrind / gdb should have made this painless, especially for small self-contained uni projects :) I remember being a teaching assistant for some C++ courses while still a student, and the "Are you a wizard?" expression on the undergrads' faces when I'd recompile their segfaulting app with debug symbols, run gdb and tell them exactly which line was causing the problem.
Want to have fun learning C? Add Lua to the mix. ;)
All the joy and performance of C - wrapped up in a nice little language that lets you Just Get On With It. Plus, anyone who can sort out putting the LuaVM into a new set of libraries, thus creating a Framework, is one step closer to Developer God, in my opinion .. ;)
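For the curious, the C side of the embedding really is tiny (a minimal sketch against Lua 5.x's C API; link with -llua or your platform's equivalent):

    #include <lua.h>
    #include <lualib.h>
    #include <lauxlib.h>

    int main(void) {
        lua_State *L = luaL_newstate();   /* one VM per lua_State */
        luaL_openlibs(L);                 /* load Lua's standard libraries */
        luaL_dostring(L, "print('hello from Lua, hosted by C')");
        lua_close(L);
        return 0;
    }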
"To a large extent, the answer is: C is that way because reality is that way. C is a low-level language, which means that the way things are done in C is very similar to the way they're done by the computer itself. If you were writing machine code, you'd find that most of the discussion above was just as true as it is in C: strings really are very difficult to handle efficiently (and high-level languages only hide that difficulty, they don't remove it), pointer dereferences are always prone to that kind of problem if you don't either code defensively or avoid making any mistakes, and so on."
Not really, and not quite. A lot of the complexity of C when it comes to handling strings and pointers is the result of not having garbage collection. But it does have malloc()/free(), and that's not really any more fundamental or closer to the machine than a garbage collector. A simple garbage collector isn't really any more complicated than a simple manual heap implementation.
And C's computational model is a vast simplification of "reality." "Reality" is a machine that can do 3-4 instructions and 1-2 loads per clock cycle, with a hierarchical memory structure that has several levels with different sizes and performance characteristics, that can handle requests out of order and uses elaborate protocols for cache coherence on multiprocessor machines. C presents a simple "big array of bytes" memory model that totally abstracts all that complexity. And machines go to great lengths to maintain that fiction.