It's really backwards compatibility of instruction sets / architectures that imposes most of these limitations. Processors that get around them to some degree like GPUs do so by abandoning some amount of backwards compatibility and/or general purpose functionality and that is in part why they haven't displaced general purpose CPUs for general purpose use.
Regarding cutting off backwards compatibility to improve the design, Intel's Itanium (affectionately called "Itanic") was a very progressive approach to shift the optimization work from the CPU (and the compiler) to just the compiler. I'm not sure what the reasons for its failing were, though.
IMO C is close to low-level because it's relatively easy to imagine the resulting unoptimized assembly given some piece of code (which is why some people jokes about C being a macro assembly).
Maybe this old debate should get an slight update... and this could be the starting point: Is modern x86 assembly still "low-level"? :)
Some of the simpler optimizations, sure. But modern backends do many incredibly sophisticated optimizations that are way beyond any kind of simple pattern-matching-and-substitution model.
Even fundamental "optimizations" like register allocation use quite sophisticated algorithms. Optimal register allocation is NP-complete, so compilers use heuristics on top of graph coloring algorithms to do their best.
Most other optimizations rely on type analysis, data flow analysis, liveness analysis, etc.
One of the things that really excites me about Rust is that it's single mutable reference enforcement means you can run run `restrict` 100% of the time if you wanted which is a non-trivial performance boost. I think it's not enabled today but from previous discussions it sounds like that's just a matter of plumbing through the right things to LLVM.
Every time I've seen that rolled out in a C/C++ codebase someone invariably forgets about pointer aliasing and you spend a week tracking down some non-deterministic behavior.
Transmeta's x86 CPU's were even doing translations and optimizations dynamically between x86 and their internal representations.
I think every modern x86 CPU is doing exactly that.
The compiler should offer a macro for that. Then the question is whether to take the specification or the implementation at which point it's an absurd question to begin with. You could compare -O0 binaries, bypassing the optimization question, too.
High/low is not fine grained enough. IIRC, prolog for example would be dubbed a fifth generation language, after assembler, goto hell macro compilers, structured functional programming, and DSLs. Now coq and the like seem to be of yet a higher order (pun intended, sorry).
If you use GCC then __builtin_add_overflow() is what you are looking for:
"The compiler will attempt to use hardware instructions to implement these built-in functions where possible, like conditional jump on overflow after addition, conditional jump on carry etc."
Was there ever a point in the Itanium's history where there were Itaniums that ran mainstream software with better performance than equivalently priced x64 processors?
But I guess you said mainstream. So unless you count database engines, I suppose the answer is "No."
Today you can get the same vector performance using SSE4 and AVX. Almost all of Itanium's good stuff has been rolled into Xeon.
I am not an expert on computer history, but my feelings on the matter are as follows:
It's hard for certain domains, like handling millions of web requests. For most computational stuff where you're just blowing through regularly-shaped numerical computation (like for example ML, or signal processing), it's not that hard, but arguably the compilers of the time were still not quite up to it (there's a lot of neat stuff that's getting worked into the LLVM pluggable architecture these days). Of course ML wasn't really a thing back then, and intel didn't seem interested in putting itaniums into cell towers.
One way to think of the OOO and branch predict processing that current x86 (and arm) do is that they are doing on-the-fly re-JITing of the code. There is a lot of silicon dedicated to doing the right thing and avoiding highly costly branch mispredicts, etc. During itanium's heyday, there was a premium of performance over efficiency. Now everyone wants power efficiency (since that is now often a cost bottleneck). Besides which, for other reasons Itanium wasn't as power efficient as (ideally) the chosen architecture could have achieved.
So don't do it at compile time? That's really a very weak argument against the Itanium ISA, and honestly more of an argument against the AOT complication model. Take a runtime with a great JIT, like the JVM or V8, and teach it to emit instructions for the Itanium ISA. (As an added advantage these runtimes are extremely portable and can be run, with less optimizations, on other ISAs without issue.)
The problem, as always, is that nobody with money to spend ever wants to part with their existing software. (Likely written in C.) In 2001 Clang/LLVM didn't even exist, and I'm not familiar with any C compilers of the era that had so much as a rudimentary JIT.
The flaws of OOO and SpecEx are evident with the overhead required to secure a system (spectre, meltdown) in a nondeterministic computational environment, and there is certainly a power cost to effectively JITting your code on every clock cycle.
As the definition of performance is changing due to the topping out of moore's law and shifting paralellism from amdahl to gustafson, I think there is a real opportunity for non ooo, non specex in th future.
Most of what OoO and speculative execution are doing for performance on modern CPUs is hiding L2 and L3 cache latency. On a modern system running common workloads it's pretty unpredictable when you're going to miss L1 as it's dependent on complex dynamic factors. Cell tried replacing automatically managed caches with explicitly managed on chip memory and that proved very difficult to work with for many problems. There's been little investment in technologies to better use software managed caches since then because no other significant CPU design has tried it. It's not a problem LLVM attempts to address to my knowledge.
Other perf problems are fundamental to the way we structure code. C++ performance advantages come in part from very aggressive inlining but OoO is important when inlining is not practical which is still a lot of the time.
Sure some things like nginx gateways and basic REST routers will have to handle highly dynamic demands with shared tenancy, but the trends seem to me to be away from that. As you say, this is all dependent on the structure of code; and I think our code is moving towards one where the perf advantages won't depend on OoO and specex for many more cases than now.
Well JITs do actually outperform AoT compiled code today. Java is faster than C in many workloads. Especially large scale server workloads with huge heaps.
Java can allocate/deallocate memory faster than C, and it can compact the heap in the process which improves locality.
In practice it's quite hard to do really meaningful real world performance comparisons because real world code tends to be quite complex and expensive to port to another language in a way that is idiomatic. My general observation is that where performance really matters to the bottom line or where there is a real culture of high performance code C and C++ still dominate however. This is certainly true in the fields I have most experience in and where there are many very performance oriented programmers: games, graphics and VR.
Seems to me like if, in practice, JIT provided better performance then by now people would be rewriting their C/C++ code in Java and C# for speed.
For that matter, does Java code execute faster or slower with an AOT compiler than with HotSpot? I did a quick Google search but couldn't find an answer, except for JEP 295 saying that AOT is sometimes slower and sometimes faster :(
i am not that familiar with the C# runtime and i know C# has user definable value types, but i'm not sure what their limitations are.
It's a little bit faster, not faster by enough to matter. If you're going to rewrite C/C++ code for speed you'd go to Fortran or assembler, and even then you're unlikely to get enough of a speedup to be worth a rewrite.
New projects do use Java or C# rather than C/C++ though.
But not for speed reasons.
Java is in no way faster than well written C/C++
But yes, the point you make, is valid, it is much harder to write C/C++ well, because of the burden of memory management.
So if you lack the time or skilled people, it might make sense to choose Java out of perfomance reasons.
Specially after the Midori and Singularity projects, and how it affected the design of C# 7.x low level features and UWP AOT compiler (shared with Visual C++).
Also Unity is porting engine code from C++ to C# thanks to their new native code compiler for their C# subset, HPC#.
In fact C# always supported AOT compilation, just that Microsoft never bothered to actually optimize the generated code, as NGEN usage scenario is fast startup with dynamic linking for desktop applications.
While on Midori, Singularity, Windows 8.x Store, and now .NET Native, C# is always AOT compiled to native code, using static linking in some cases.
As for GC, C# always offered a few ways to avoid allocations, it is a matter for developers to actually learn to use the tools at their disposal.
With C# 7.x language features and the new Span related classes, it is even easier to avoid triggering the GC in high performance paths.
> I agree with Ousterhout's critics who say that the split into scripting languages and systems languages is arbitrary, Objective-C for example combines that approach into a single language, though one that is very much a hybrid itself. The "Objective" part is very similar to a scripting language, despite the fact that it is compiled ahead of time, in both performance and ease/speed of development, the C part does the heavy lifting of a systems language. Alas, Apple has worked continuously and fairly successfully at destroying both of these aspects and turning the language into a bad caricature of Java. However, although the split is arbitrary, the competing and diverging requirements are real, see Erlang's split into a functional language in the small and an object-oriented language in the large.
I still strongly think Apple is taking the wrong approach with Swift by not building on the ObjC hybrid model more.
Nobody is picking Java/C# over C/C++ for performance reasons.
The few cases where raw performance down to the the byte level and ms matter are pretty niche.
Not that it can't be done so much as getting programmers to accept it is can't be done.
I was also under the impression that there hasn't been much improvement in compiling C/C++ in a long time. It would be interesting to compare the performance of gcc from 15 years ago versus gcc today, on a real world piece of code. I suspect you wouldn't see much difference (aside from the changes in C dialect over time), and some added features in the new version. Has anyone run this experiment?
Itanic was also in-order (at least as far as dispatch), meaning anytime an instruction was stalled, so were all instructions in the same bundle or after it.
One "non-low-level" idea on Itanic, which never really panned out in practice, was for the assembler to automatically insert stop bits ;; marking assembly code "sequence points", instead of the programmer having to do it manually. But in practice, everyone did it manually, because they'd rather know how well their bundles were being used, and whether they could move instructions around in order get the full 3 instructions / bundle (6 instructions / clock).
And explicit stop bits did not provide any advantage to future hardware by marking explicit parallelism, because at every generation everyone was concerned about obtaining maximum performance on the current machine, which involved shuffling instructions into 6-instruction double-bundles, often at the expense of parallelism on future implementations (which never went beyond two bundles / clock).
You mean C compiler X has the feature of Y. There are lots of compilers and that's not part of the language.
We're talking about concrete things in the real world here, not philosophizing about the language spec.
That's exactly what's going on.
That's different from C.
In the history of x86, most new optimizations have preserved the semantics of code. For instance, register renaming isn't blind; it identifies and resolves hazards.
In C, increasing optimization has broken existing programs.
C is like a really shitty machine architecture that doesn't detect errors. For instance, overflow doesn't wrap around and set a nice flag, or throw an exception; it's just "undefined". It's easy to make a program which appears to work, but is relying on behavior outside of the documentation, which will change.
Computer architectures were crappy like that in the beginning. The mainframe vendors smartened up because they couldn't sell a more expensive, faster machine to a customer if the customer's code relied on undocumented behaviors that no longer work on the new machine.
Then, early microprocessors in the 1970's and 80's repeated the pattern: poor exception handling and undocumented opcodes that did curious things (no "illegal instruction" trap).
I think that's a fair conclusion though, I don't think the article is misleading.
x86 assembly is a high level language. It's analogous to JVM bytecode. Modern x86 processors are more like a virtual machine for x86 bytecode.
If you take this position, then having the distinction between "low level" and "high level" languages becomes pointless, and we have no way to distinguish between languages like x86 assembly and C and languages like Python and Haskell. This is why we use the terms "low level" and "high level": some of these languages have a lower level of abstraction than others. The fact that it's not giving you a great idea of exactly what's happening in the transistors is irrelevant: "low" and "high" are relative terms, not absolute.
The author's point is that what's happening in the transistors is relevant — not controlling it is what led to Spectre and heavy performance losses if you aren't smart about cache usage. Thinking of C as a "low-level language" makes it easier for people to overlook that fact.
I mean that should be blindingly obvious anyway looking at the actual history of programming languages and CPUs but here we are in 2018 insisting that we must have exactly three categories with exactly the definitions of: assembly;C/forth; everything else.
And, is nothing but actual binary machine code a "low level language"? I guess it's the lowest, I don't _think_ you can go lower than that... but someone's probably gonna tell me I'm wrong.
In all seriousness, the "assembly is high level" argument is ridiculous and robs the "low level" vs "high level" categorization of all meaning.
Or maybe we should relax the definition of "low-level language" a bit?
That's only true in the way everything is analogous to everything.
In early CPUs, nothing was optimized. They just executed your instructions. Now there are nontrivial optimizations and rewriting, just like the JVM.
Advanced hardware will re-order and pipeline instructions based upon data dependencies.
Sure it's not doing the exact same things the JVM does with bytecode but the point is that x86 assembly is not the language of the hardware. It's a language that the hardware+firmware knows how to interpret and optimize at runtime similar to what your JVM does with java bytecode.
An x86 cpu, as the article points out, spends inordinate resources looking for ILP. It's not a compiler in any reasonable sense of the word, while a JVM is.
It is not "low level" because it is an abstraction or virtual platform that the processor exposes and then interprets using its own internal resources and programming interface (microcode). The x86 interface does not map closely to the actual hardware, just as the article states. It exposes a flat memory model with sequential execution and only a handful of registers.
Much the same way that the JVM exposes a virtual machine that doesn't directly map to any of the platforms that it runs on. It's an abstraction that is interpreted or compiled at runtime.
I don't understand why you think the two are so different just because the JVM is higher level.
'An abstraction that is interpreted or compiled at runtime' is so broad it's exactly the what I said up top - it's analogous in the way everything is analogous to everything else. It's the sort of thing that might be true if you squint but offers somewhere between zero and negative insight.
CPU's are adopting JIT like tendencies in order to increase performance. Instruction reorder, register renaming, branch prediction, etc.
> if you squint but offers somewhere between zero and negative insight.
The insight I bring from this is that we should look moving those features out of the hardware and into the software level. Let us take advantage of them in our compilers and virtual machines.
The JVM can beat C in many scenarios because it can make optimizations based upon runtime information that a static compiler will never have available.
Imagine what we could do if we weren't chained to the x86 abstractions.
Even ignoring the fact that the JVM is typed, memory safe and with builtin GC (all things that were tried architecturally in the past and abbandoned), there is still a large difference between the scope and variety of non-local optimizations perfomed by any non-toy VM and the local, strictly realtime, constrained to a small window, set of reordering done by an OoO engine. Even tracing, which is used by some JITs, has been largely abbandoned in the CPU world.
Transmeta and Denver-like dynamic translation is closer to the behaviour of a software JIT and it is certainly considered drastically different from mainstream OoO.
The long and the short of it is, an x86 cpu is not really VM-like and a JVM is decidedly unCPU like. The analogy only works if you generalize it so much it becomes a uselessly mushy tautology or you ignore basic aspects of how each of these things work.
JITting is also called "dynamic translation", which is what a CPU does with microcode.
Whether that's a full compiler or not is beyond pedantic -- and irrelevant to the parent's point.
which is what a CPU does with microcode.
Either you know something about current x86 CPUs that I don't or words and technical terms are, indeed, not important and have no meaning.
But, to be fair, C is not that low level. In fact, when I first learned it, it was considered a high-level language because CPUs we used it with didn't have functions with parameters, only subroutine jumps.
C reaches into the realm of low-level languages because it allows you to arbitrarily read from and write to the "state" of the context you live in, but it also allows you to express constructs that have no counterpart even on the most complex CPU architectures (even if they have things that disagree fundamentally with C's point of view).
Except for RISC, that has mostly always been the case, when we look back at all those mainframes and their research papers.
You can write to null in C, your operating system rejects it. You can write past your allocated memory, your OS rejects it. It's not like just because it's written in C it gets to read the kernel memory - it's just that you can try.
All memory access in userspace is mediated by the MMU, so nothing in C gets "direct" memory access - but it does allow you to screw up your own memory space pretty well... I'm not sure that's C's fault though
To me that is about as low level as you can get without bypassing the OS.
correct, but that lack is not an argument for C being low level.
Calling out a specific language as not being something leads one to ask, "Well what is?". In this case, there is no qualifying alternative, so the title might as well be, "There is no low-level language for CPUs".
The article does basically answer this. The last section is about what it would mean to design a chip such that a low level language were possible to design for it.
It's not a category that can't exist, it just doesn't have any members right now.
I’d imagine it’s relativly primitive compared to whatever shaders are compiled to on modern GPUs, but it was humbling to have to manage things like separate, per core, disjoint register files which can only be read 4 cycles after write. The cores are heterogeneous, so there is special hardware for exchanging register reads between cores if necessary.
Cache hierarchies are directly accessible with CLFLUSH, INV, WBINVD x86 instructions; we may count also PREFETCHx, but they call it "a hint". FENCE instructions touch even the multicore part of system.
Many low-level CPU concepts leak to higher layer. A bright example is false sharing, which may manifest even in Java or C# programs.
Modern Intel CPUs basically emulate x86; there are many layers of abstraction between individual opcodes and transistor switching.
I think it's a strong insight that insight that chip designers and compiler vendors have spent person-millenia maintaining the illusion that we are targeting a PDP-11-like platform even while the platform has grown less and less like that. And, it turns out, with things like Spectre and the performance cost of cache misses, that abstraction layer is quite leaky in potentially disastrous ways.
But, at the same time, they have done such a good job of maintaining that illusion that we forget it isn't actually reality.
I like the title of the article because many programmers today do still think C is a close mapping to how chips work. If you happen to be one of the enlightening minority who know that hasn't been true for a while, that's great, but I don't think it's good to criticize the title based on that.
I wonder how much I could save (and how many more sims I could run) if my codes were rewritten in a language that has an abstract system that is much more cleanly and simply translated to what the computer actually does in 2018.
Wiki definition, also what I was taught in my first CS class:
"A low-level programming language is a programming language that provides little or no abstraction from a computer's instruction set architecture—commands or functions in the language map closely to processor instructions. Generally this refers to either machine code or assembly language."
The term is evolving to match the time, as shown by the author's interpretation already being higher level than the original intention despite the goal of preventing exactly that.
Sure, agreed, but I don't think it's super interesting that words evolve in meaning over time.
What I find strange about the comments here is that some people think the article's title is bad even though my experience is that many people today do think "C is a low level language" is a reasonable thing to say.
Now, sure, people say C is low level, and compared to Java it sure is. But it isn't low-level.
Obviously, it would be very hard to shift the incumbent model in reality. We just have to look at the lack of prosperity for the Itanium and Cell processors to see how hard it is to achieve success. But imagine if new computer languages had been created just for these processors. Commercially this would make little sense but it might be possible to create languages that fully used these processors yet retained simplicity for developers. Or maybe it isn't possible to beat the clarity of sequential instructions for human developers or maybe Out Of Order processing is the optimal algorithm. There are other changes coming too such as various replacements for DRAM that either integrate more closely with the CPU (such as 3d chips) [1,2] that by reducing the latency of main memory, could actually bring us back closer to the C model of the computer? or just change computing entirely...
That could be deployed as a new language, or adding features to existing ones, like value types in Java, or even compiler switches that relax some C rules for faster speed. Imagine -fpointers_cant_be_cast_to_ints or -freorder_struct_fields.
I'm wondering what will happen as GPU's become more general-purpose. What's next after machine learning?
Would it be possible to make a machine where all code runs on a GPU? How would GPU's have to change to make that possible, and would it result in losing what makes them so useful? What would the OS and programming language look like?
As a group of professionals, it is highly beneficial for us to be interested in these things. People who design languages and compilers do it largely on what is perceived as being demanded, and us as programmers are the ones that create the demand for new languages.
To put in other words, if programmers aren't aware of what's going wrong with our current languages, they cannot express their need for new languages. So, there's less incentive for researchers to produce new ways of programming computers. It is much more tempting to "please the masses" in a way that causes this local-maximum problem. It's much more interesting to research problems that translate into mainstream use than academic things that nobody actually uses.
- Hardware designers are conservative. They aren't likely to implement a different hardware architecture because "programmers demand it" (really?), unless there's existing research showing how it can be done and a compelling reason why customers will buy the chips.
Maybe apps will be allowed to run longer in the background, if there are always extra CPU cores available that don't consume much power.
However, in the opposite direction where a gpu becomes more like a cpu, if streams could do some level of limited branching without slowing the whole thing down, it opens the door to threading frameworks and design patterns where you write a loop in code, every thread gets it's own copy of memory, and on the threading front, it just kind of works for a lot of generic code.
Then if gpus added some sort of piped-like summation-like instruction, in the cases in a loop where variables need to be shared, they can still be added, subbed, mul, div, or mod, easily and quickly, allowing for what looks and acts like normal code today, but is actually threaded. That would kind of bring code back to where it is today.
Who knows? It's kind of fun to speculate about though.
Actually the wheel of reincarnation seems to have stopped at least for the time being. It seems that there is a fundamental, hard to reconcile, disconnect between a latency optimised engine like a CPU and a throughput engine like a GPU. Hybrids CPUs like larrabee or extensions like AVX512 do not seem to be enough.
Short term, probably the best we are going to see is separate CPU and GPU cores in the same die (or more likely jus the same package), but even that is likely suboptimal.
Yeah, this is a good insight. The height of the overall stack has grown. The lowest low level is lower than it was in the 60s and the highest high level is higher. So we need more terms to cover that wider continuum.
I also feel like the author is trying to say something about how imperative scalar (meaning 'operates on one datum at a time') languages are causing more trouble than they're worth. Sophie Wilson said something similar in her talk about the future of microprocessors . This implies that declarative and functional semantics would be more amenable to parallelization, as the author mentions in the article, as well as allowing the compiler more freedom to deduce a suitable 'reordering' of operations that would better fit the memory access heuristics the machine is using.
How is the C memory model a leaky abstraction here? What better way do you suggest? Are we not fine coding sequential (in memory) datastructures in C?
C does give you the ability to control those costs, but controlling how you lay out your data in memory and controlling imperatively in which order you access it. But the language doesn't show you those costs in any way.
I think it's pretty clear. Access memory sequentially, and you can expect to hit the cache. Access more memory than the cache size in a random order, and you can expect to pay memory access latencies (100s of CPU cycles).
I doubt you would be willing to manage the cache yourself in every line of code. That would be a lot of code. Some programmers might want to tune cache eviction behaviour by changing it in a few controlled points in time. But not in a way that couldn't be exposed to assembler/C. (I don't know if that's even realistic from a hardware architect's point of view).
Except memory is virtual. Memory location 0x1000 might be forward, or backwards compared to 0xFFF, depending on the state of the Translation-lookaside buffer (TLB).
Ever notice how (when ASLR is disabled), programs all start at the same location?? (https://stackoverflow.com/questions/14795164/why-do-linux-pr...)
Hint: Virtual address 0x0804800 doesn't belong at physical address 0x0804800. The OS can put that memory anywhere, and the CPU will "translate" the memory during runtime.
This means that in an obscure case, going forward a linked list (ie: node = node->next) may involve FIVE memory lookups on x86-64
* Page Map Lookup
* Page Directory Pointer lookup
* Page Directory lookup
* Page Table lookup
* Finally, the physical location of "node->next".
An even more obscure case (looking at maybe address 0xFFFC, unaligned) may require two lookups, for a total of 10-memory lookups (the page-directory walk for page 0xFFFC, and then the page-directory walk for 0x1000).
There is a LOT of hardware involved in just a simple "node = node->next" in a linked list. Its not even CPU-dependent. Its OS configurable too. x86 supports 4kb pages (typical in Linux / Windows), 2MB Large Pages and 1GB Huge Pages.
Does that matter though? I would assume that the "prefetcher" (or whatever it's called) can make its predictions in terms of virtual memory.
Regarding the linked list, it has been common wisdom for a long time that one should prefer sequential memory instead of linked lists. A nice benefit is that this simplifies the code as well :-). Growing and reallocating sequential buffers might not be possible in very dynamic/real-time and decentralized architectures - like a kernel - though.
Yeah. Meltdown means that the OS (as a security measure) wipes away the TLB whenever you make a system call. So all of those cached TLB entries disappear every syscall due to Kernel Page-Table Isolation.
And then the CPU core has to start from scratch, rebuilding the TLB-cache again.
So last year (when we didn't know about Meltdown), a system-call was fast and efficient. This year, on Intel and ARM systems (vulnerable to Meltdown), system-calls are now forced to wipe TLB. (But not on AMD-systems, which happened to be immune to the problem)
Both AMD and Intel implement x86 instruction set, and now the performance characteristics is different between the two boxes for something as simple as "blah = blah->next".
The important bit is that C is still quite "high level", and indeed, even Assembly language "lies" to the programmer through virtual memory. The OS (specifically page-tables) can interject and have some magic going on even as assembly-code looks up specific addresses.
The simple pointer-dereference *blah is actually incredibly complicated. There's no real way to know its performance characteristics from a C-level alone. It depends on the machine, the OS, the configuration of the OS (ie: MMap, Swap, Huge Page support, Meltdown...) and more.
You might argue that a modern computer is more like programming a tightly-bound, nonuniform multi-processor system. And I'd agree. But C doesn't much to help program such a thing.
When different things are mapped into the address space, that's an abstraction the programmer (or the user) consciously made. It should be possible to figure out the performance characteristics there.
Of course, many programs work on various machines with their own performance characteristics. You should still be able optimize for any one of them by querying the hardware and selecting an appropriate implementation. If you want to put in the work.
I don't think assembler/C is such a big problem here. But then again, I'm not a low level guy (in this sense) for now.
I.e. if a feature exists in C, it probably exists in every language most programmers are familiar with. (I worded this statement carefully to exclude exotic languages like Haskell or Erlang).
Thus C, while not low-level relative to actual hardware, is low-level relative to programmers' mental model of programming. If this is what we mean, it's still true and useful to think of C as a low-level language.
That said, it's important to keep the distinction in mind -- statements like "C maps to machine operations in a straightforward way" have been categorically wrong for decades.
I don't think that's true.
Off the top of my head, C has: array point decay, padding, bit fields, static types, stack allocated arrays, integers of various sizes, untagged enums, goto, labels, pointer arithmetic, setjmp/longjmp, static variables, void pointers, the C preprocessor.
Those features are all absent in many other languages and are totally foreign to users that only know those languages. A large part of C is exposing a model that memory is a freely-interpretable giant array of bytes. Most other languages today are memory safe and go out of their way to not expose that model.
I suspect that your definition of "exotic" is exactly "not like C".
Of course, some of those tricks are only allowed in SYSTEM/UNSAFE blocks on these languages.
"Most programmers" are not familiar with these. (A bit sad in the case of Modula-2).
Programmers' mental model of programming is not a homogeneous set. I'm pretty comfortable in LabView, for example; a language that is extremely parallel (the entire program is composed of a graph of producer / consumer nodes and sequential operation, if desired, must be explicitly requested).
Well crazy isnt the correct word... mainstream use has changed the future of an old language...
The reason is that in these domains (e.g. game consoles, supercomputing), you know ahead of time the precise hardware characteristics of your target, you can assume it won't change, and can thus optimize specifically for that ahead of time.
This isn't true for "mass-market" software that needs to run across multiple devices, with many variants of a given architecture.
Cell was a failure in large part because this proved to be less true / less relevant than its designers thought.
Source: many late nights / weekends trying to get PS3 launch titles performing well enough to ship.
At the time I was doing most of my SPE work (helping to optimize launch titles at EA prior to the launch of the PS3) most titles weren't taking much advantage of them at all. We were a central team helping move some code that seemed like it would most benefit over, I was particularly involved in moving animation code to the SPEs. There weren't really any options for libraries to help at that point, other than things we were building internally, so it was almost all manual work.
Later on in the PS3 lifecycle people moved more and more code to the SPEs. To my knowledge most of that work was largely manual still. For a while I was project lead on EA's internal job/task management library which had had a big focus on supporting use of the SPEs but my involvement in it was mostly during the early part of the Xbox One / PS4 generation. The Frostbite graphics team in particular did a lot of interesting work shifting GPU work over to the SPEs (I think some of it they've talked about publicly) but I wasn't directly involved in that.
There are some classes of very regular algorithm where you could probably predict everything (and handle the memory hierarchy) statically, such as GEMM, but it's not very common.
We apologize for this inconvenience.
Please contact us with any questions or concerns regarding this matter: firstname.lastname@example.org
The ACM Digital Library is published by the Association for Computing Machinery. Copyright � 2010 ACM, Inc.
In that regard, a case can be made that when you're writing in C, you're writing exactly as close to the bare metal as if you're writing in, say, Go or Haskell.
No, you really can't. This is childish black and white thinking. The computational model of C is built on an interface exposed by the hardware. Go and Haskell build many additional abstractions on top of that same model.
This article could have had a fruitful discussion about what the author is trying to say, but by choosing such a clickbait title, he managed to turn it into a discussion on semantics that wants to deny useful distinctions, because in some context (not the context in which it's actually used), it doesn't fit.
This kind of linguistic wankery really pisses me off, because it's useless and rests on a misunderstanding of how people actually use language (which is to say, in context and often in relative terms).
Haskell, Go, et. al. are understood to have complex runtime machinery atop the x86 instruction set. It's an erroneous belief that C does not (and one that I've seen developers get bitten by repeatedly as they try to manage threaded C code).
I think it just leads to quibbling over the boundary of low level, as is happening here.
I think it's just important to know that the definition changes over time relative to the state of the art. C was once considered high level. In the future, if programming languages evolve to a more natural language state, then sending serial instructions to the computer in a strange code will seem very low level to such programmers.
Anecdotally I have encountered loads of programmers who actually believe that there is a straightforward correspondence between C code and what the machine is actually doing, which is wrong.
So regardless of how you want to define "low-level", understanding this point is useful.
That really depends on your definition of "the machine". If "the machine" is "hardware", then sure. But if software is a considered piece of logic onto itself, that, when pitted against a sound model of an architecture will result in a series of logical steps, it's different: there is a very straightforward correspondence between the model-machine and assembly/C. Whether there is a 1-1 correspondence between the model-machine and any accidental hardware it is implemented is not that relevant.
So if low-level is defined as the lowest level that any hardware abstraction functions exactly as its logical function and not its silicon, it tells you exactly what the machine should be doing, even if it's doing it through different means.
I believe one of the points the article is making is that, no, there is not. Pin down a C developer's actual understanding of how C is converted into x86 or x86-64 assembly, and you find that nobody actually has that abstraction riding around in their heads, because the abstraction they do have riding around in their heads would make for some unacceptably non-performant code. Even if we disregard the fact that x86 is emulated on modern hardware, the C -> (Clang + LLVM / gcc) -> assembly path is deeply complicated.
That said, I wouldn't look to those as examples for how to design a good "low-level" CPU language, as CPUs and GPUs solve very different problems.
Partly because I really like the PDP-11 architecture, and it's 'separated at birth' twin the 68K, it greatly influenced me in how I think about computation. I also believe that one of the reasons that the ATMega series of 8 bit micros were so popular was that they were more amenable to a C code generator than either the 8051 or PIC architectures were.
That said, computer languages are similar to spoken languages in that a concept you want to convey can be made more easily or less easily understood by the target by the nature of the vocabulary and structure available to you.
Many useful systems abstractions, queues, processes, memory maps, and schedulers are pretty easy to express in C, complex string manipulation, not so much.
What has endeared C to its early users was that it was a 'low constraint' language, much like perl, it historically has had a fairly loose policy about rules in order to allow for a wider variety of expression. I don't know if that makes it 'low' but it certainly helped it be versatile.
Sounds like a GPU?
> Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.
It seems like ATI & NVIDIA are doing okay, even with C & C++ kernels. GLSL and HLSL are both C-like. What is problematic?
Memory layout, thread scheduling, and barriers are not features of the C language and have nothing to do with whether your C is “normal”. Those are part of the programming model of the device you’re using, and apply to all languages on that device. Normal C on an Arduino looks different than normal C on an Intel CPU which looks different than normal C on an NVIDIA GeForce.
Which reminds me I'd love to see a computer running exclusively from a GPU-like CPU.
And no, Xeon Phi's don't count. They are cool, but look too much like normal PCs.
They didn’t call it a GPU then, but the SIMD architecture is quite similar at a high level.
Larrabee was going to be a GPU-like CPU. https://en.m.wikipedia.org/wiki/Larrabee_(microarchitecture)
Here’s a more modern GPU based computer: https://www.nvidia.com/en-us/self-driving-cars/drive-platfor...
If you meant something that sits on your desktop and runs Linux, then yeah it’s uncommon but not unheard of to run it on a SIMD system. The trend is absolutely definitely going toward SIMD being used in general purpose computing. Even if you don’t want to count any of my examples, you will see the “normal” PC become more GPU-like in the future than it is today.
I keep telling people to get used to develop on Xeon Phi's and nobody seems to listen ;-)
Today's Xeon Phi is tomorrow's Core i9.
NVIdia always allowed multiple language on CUDA via PTX, with the offerings for C, C++ and Fortran coming from them, while some third parties had Haskell, .NET and Java support as well.
Yet another reasons why many weren't so keen in being stuck with OpenCL and C99.
When the spectrum of the context is unambiguous, that's not an argument for finding a way to make it ambiguous.
LISP machines in the 60s, Java machines in the 90s, many others.
For whatever reason, successful general purpose silicon has almost always followed a C-ish model.
It's also worth noting that Fortran runs quite well on C-ish style processors.
To drive this juxtaposition home, I'd point to PALcode on Alpha processors in which C (and others) can very much be a low level language. Very few commercial processors let you code at the microcode level.
The overarching premise is then brought home by GPU programming, which shows that you don't necessarily need to be writing at the ucode level if the ecosystem was built around how the modern hardware functioned.
Possibly relevant is this (short?) discussion from 2011 about a CPU more closely designed for functional programming.
Thus the fundamental limitation is that the processor has only a C ABI. If there were a vectorisation and parallel friendly ABI, then it would be possible to write high level language compilers for that. It should be possible for such an ABI to coexist with the traditional ASM/C ABI, with a mode switch for different processes.
It uses UltraSPARC T1 and above processors as an example for a "better" processor "not made for C", but this argument makes no sense at all. The "unique" approach in the UltraSPARC T1 was to aim for many simple cores rather than few large cores.
This is simply about prioritizing silicon. Huge cores, many cores, small/cheap/simple/efficient die. Pick two. I'm sure Sun would have loved to cram huge caches in there, as it would benefit everything, but budgets, deadlines and target prices must be met.
Furthermore, the UltraSPARC T1 was designed to support existing C and Java applications (this was Sun, remember?), despite the claim that this was a processor "not designed for traditional C".
There are very few hardware features that one can add to a conventional CPU (which even includes things like the Mill architecture) that would not benefit C as well, and I cannot possibly imagine a feature that would benefit other languages that would be harmful to C. The example of loop count inference for use of ARM SVE being hard in C is particularly bad It is certainly no harder in the common use of a for loop than it is to deduce the length of an array on which a map function is applied.
I cannot imagine a single compromise done on a CPU as a result of conventional programming/C. That is, short of replacing the CPU with an entirely different device type, such as a GPU or FPGA.
I met a guy back in college, a PhD who went to work at Intel, who told me the same thing. In theory, the future of general purpose computing was tons of small cores. In practice, Intel's customers just wanted existing C code to keep running exponentially faster.
Neither of these statements are true, unless "Legacy" refers to the early days of UNIX.
Tasks that parallelize poorly do not benefit of many small cores. This is usually a result of either dealing with a problem that does not parallelize, or just an implementation that does not parallelize (because of a poor design). Neither of these attributes are related to language choice.
An example of something that does not parallelize at all would be an AES256-CBC implementation. It doesn't matter what your tool is: Erlang, Haskell, Go, Rust, even VHDL. It cannot be parallelized or pipelined. INFLATE has a similar issue.
For such algorithms, the only way to increase throughput is to increase single-threaded performance. Increasing cores increase total capacity, but cannot increase throughput. For other tasks, synchronization costs of parallelization is too high. I work for a high performance network equipment manufacturer (100Gb/s+), and we are certainly limited by sequential performance. We have custom hardware in order to load balance data to different CPU sockets, as software based load distribution would be several orders of magnitude too slow. The CPU's just can't access memory fast enough, and many slower cores wouldn't help as they'd both be slower, and incur overheads.
Go and Erlang of course provide built-in language support for easy parallelism, while in C you need to pull in pthreads or a CSP library yourself, but the C model doesn't make parallel programming "very difficult", nor is C any more sequential by nature than Rust. It is also incorrect to assume that you can parallelize your way to performance. In reality, the "tons of small cores" is mostly just good at increasing total capacity, not throughput.
I disagree that tasks performed by a computer either don't parallelize or the cost of synchronization is too high. At a fine-grained level, our compilers vectorize (i.e. parallelize) our code -- with limits imposed by C's "fake low-levelness" as described in the article -- and then our processors exploit all the parallelism they can find in the instructions. At a coarser level, even if calculating a SHA (say) isn't parallelizable, running a git command computes many SHAs. The reasons why independent computations are not done on separate processors -- even automatically -- come down to programming language features (how easy is it to express or discover the independence, one way or another) and real or perceived performance overhead. Hardware can be designed so that synchronization overhead doesn't kill the benefits of parallelization. GPUs are a case in point.
The world is going in the direction of N cores. We'll probably get something like a mash-up of a GPU and a modern CPU, eventually. If C had been overtaken by a less imperative, more data-flow-oriented language, such that everyone could recompile their code and take advantage of more cores, maybe these processors would have come sooner.
> "Legacy" code in this context is code that was written in the past and is not going to be updated or rewritten.
In that case, I would not say Legacy code is sequential. For the past few decades, SMP has been the target where sensible/possible.
> At a fine-grained level, our compilers vectorize (i.e. parallelize) our code.
Vectorization is a hardware optimization designed for a very specific use-case: Performing instruction f N times on a buffer of f_input x N, by replacing N instantiations of f by a single fN instance.
If this is parallelization, then an Intel Skylake processor is already a massively parallel unit, which each core already executing massively in parallel by having the micro-op scheduler distribute across available execution ports and units.
In reality, vectorization has very little to do with parallelization. Vectorization is much faster than parallelization (in many cases, parallelization would be slower than purely sequential execution), and in a world where all the silicon budgets goes to parallelization, vector instructions would likely be killed in the process. You can't both have absurd core counts and fat cores. If you did, it would just be adding cores to a Skylake processor.
(GPU's have reduced feature sets compared to Skylake processors not because they don't want the features, but because they don't have room—they just specialize to save space.)
> At a coarser level, even if calculating a SHA (say) isn't parallelizable, running a git command computes many SHAs.
And this is exactly why Git starts worker processes on all cores whenever it needs to do heavy lifting.
This has been the approach for the past few decades, which is why I twitch a bit at your use of "legacy" as "sequential": If a task can be parallelized to use multiple cores (which is not a language issue), and your task is even remotely computation expensive, then the developer parallelize the problem to use all available resources.
However, if the task is simple and fast already, parallelization is unnecessary. Unused cores are not wasted cores on a multi-tasking machine. Quite the contrary. Parallelization has an overhead, and that overhead is taking cycles from other tasks. If your target execution time is already met on slow processors in sequential operation, then remaining sequential is probably the best choice, even on massively parallel processors.
Git has many commands in both those buckets. Clone/fetch/push/gc are examples of "heavy tasks" which utilize all available resources. show-ref is obviously sequential. If a Git command that is currently sequential ends up taking noticable time, and is a parallelizable problem (as in, computing thousand independent SHA's), then the task would be parallelized very fast.
Unless something revolutionizing happens in program language development, then it will always be an active decision to parallelize. Even Haskell require explicit parallelization markers, despite being about as magical as programming can get (magical referring to "not even remotely describing CPU execution flow").
> Hardware can be designed so that synchronization overhead doesn't kill the benefits of parallelization. GPUs are a case in point.
I do not believe that this is true at all. That is, GPU's do not combat synchronization overhead in the slightest, lacks features that a CPU use for efficient synchronization (they cannot yield to other tasks or sleep, but only spin), and run at much lower clocks, emphasizing inefficiencies.
After reading some papers on GPU synchronization primitives (this one in particular: https://arxiv.org/pdf/1110.4623.pdf), it would appear that GPU synchronization is not only no better than CPU synchronization, but a total mess. At the time the paper was written, it would appear that the normal approach to synchronization were hacks like terminating the kernel entirely to force global synchronization (extremely slow!) or just using spinlocks, which are way less efficient than what we do on CPU's. Even the methods proposed by that paper are in reality just spinlocks (the XF barrier is just a spinning volatile access, as GPU's cannot sleep or yield).
All this effectively make a GPU much worse at synchronizing than a CPU. So why are GPU's fast? Because the kind of tasks GPU's were designed for do not involve synchronization. This is the best case parallel programming scenario, and the scenario where GPU's shine.
I'd also argue that if GPU's had a trick up their sleeve in the way of synchronizing cores, Intel would have added it to x86 CPU's in a heartbeat, at which point synchronization libraries and language constructs would be updated to use this if available. They don't hesitate with new instruction sets, and the GPU paradigm is not actually all that different from a CPU.
> The world is going in the direction of N cores. We'll probably get something like a mash-up of a GPU and a modern CPU.
It's the only option, due to physics. If physics didn't matter, I don't think anyone would mind having a single 100GHz core.
However, it won't be a "mash-up of GPU and a modern CPU", simple due to a GPU not being fundamentally different from a CPU. A GPU is mostly just have different budgeting of silicon and more graphics-oriented choice of execution-units than a CPU, but the overall concept is the same.
> If C had been overtaken by a less imperative, more data-flow-oriented language, such that everyone could recompile their code and take advantage of more cores, maybe these processors would have come sooner.
A language that could automatically parallelize a task based on data-flow analysis (without incurring a massive overhead) would be cool. I don't know of any, though. I seems optimal for something like Haskell or Prolog, but neither can do it.
However, tasks that would benefit from parallelization would already be easy to tune to a different amount of parallelism, and parallelizing what is poorly parallelized is not useful on any architecture.
Parallelization hasn't really been a problem for at least the last two decades, and I certainly can't see it as the limiting factor for making massively parallel CPU's. However, massively parallel CPU's are not magical, and many problems cannot benefit from them at all. It will almost always be trading individual task throughput for total task capacity.
The meaning of a high level language is to do with abstraction away from the hardware. C programmers often wince at languages that are highly abstracted away from the hardware. But those are what are "high level" languages. Especially languages that remove more and more of the mechanical bookkeeping of computation. Such as garbage collection (aka automatic memory management). Strong typing or automatic typing. Dynamic arrays and other collection structures. Unlimited length integers and possibly even big-decimal numbers of unlimited precision in principle. Symbols. Pattern matching. Lambda functions. Closures. Immutable data. Object programming. Functional programming. And more.
By comparison C looks pretty low level.
Now I'm not knocking C. If there were a perfect language, everyone would already be using it. Consider the Functional vs Object debate. (Or vi vs emacs, tabs vs spaces, etc) But all these languages have a place, or they would not have a widespread following. They all must be doing something right for some type of problem.
C is a low level language. And there is NOTHING wrong with that! It can be something to be proud of!
Basically it says that the C abstract machine has very little in common with most existing processor.
moreover it makes the point that in the last decades of research for CPUs the focus was "make C go fast" wich ultimately cause meltdown.
The reason C won wasn't that it forced CPUs to adhere to its particular execution metaphor, but that it happened upon a metaphor that could be easily expressed and supported by CPUs as they evolved over decades of progress.
 Basically: byte-addressable memory in a single linear space, a high performance grows-down stack in that same memory space, two's complement arithmetic, and "unsurprising" cache coherence behavior. No, the last three aren't technically part of the language spec, but they're part of the model nonetheless and had successful architectures really diverged there I doubt C-like runtimes would have "won".
CPU's, on the other hand, are designed to be much more generic with decent performance for any task.
And there's nothing special about emulating a GPU on a GPU; you could emulate a CPU architecture just as easily, at a much higher level than you get from an FPGA, and so perhaps faster than you'd be able to get from today's FPGAs. And, if you're mapping GPU shader units 1:1 to VM schedulers, you'd also get a far higher degree of core parallelism than even a Xeon Phi-like architecture would give you. (The big limitation is that you'd be very limited in I/O bandwidth out to main memory; but each shader unit would be able to reserve its own small amount of VRAM texture space—i.e. NUMA memory—to work with.)
I'm still waiting for someone to port Erlang's BEAM VM to run on a GPU; it'd be a perfect fit. :)
However, "good performance" on a CPU is much much worse than "good performance" on a GPU. CPU's just achieve their mediocre performance on a larger set of usecases. GPU's are specialized devices that are very good at specialized activities.
The processor is doing an ok job filling these instructions, but a language and compiler can do a much better job. C doesn't collect any information about data dependencies, and instead just pretends all instructions are sequential. Even code which contains tight loops of sequential commands can be optimized, because you have an entire program and operating system running around that sequential code.
I don't think its fair or correct to say that C is the real issue. Recently there have been languages like erlang and support for more functional models that make concurrent code a lot easier to write. The first real consumer multicore processors were only released a bit over 10 years with Intel's Core 2 duo's. Of course SMP systems existed before that, Sun had them for years, but they were relatively niche. Still, Java, C++, C#, are all languages that produce much easier to maintain code if they are single threaded. Recent darlings like JS and Python are single threaded out of the box.
The large majority of languages in use today are not designed to be concurrent as a first principle. True multicore systems have been around for decades, software and mindshare is now starting to catch up and use tools that make concurrency easy.
I have operational computers of a variety of architectures at home, including the oldest generations (6502, 680x0), Sparc, Symbolics, DEC Alpha, MIPS 32- and 64-bit, etc., and even an extremely rare (and unfortunately not-running) Multiflow, the granddaddy of VLIW.
My favorite part of the original article was the final section. I wish we had a modern CPU renassiance akin to what was going on in the 80s and 90s, but the market dominance of x64 and ARM seems to be squelching things, with optimizations to those architectures rather than novel new ones (with possibly novel new compiler technologies). 64-bit ARM was a nice little improvement, though.
Erlang is decades old. It's 32, only 16 years younger than C.
I think a better way of stating this is "99% of the code that matters is written in C, or in a language designed with a similar target architecture as C in mind". Certainly a lot of code that matters is written in C++, Objective-C, and Java, but the same points hold true for all of those.
The SPUs on the PlayStation 3 were an experiment in user managed caches and that proved to be a difficult thing to make effective use of even in games where you know more context than a lot of code can assume.
To some extent, didn't Intel go down this road with VLIW: trying to shift the burden of making code fast onto the compiler, instead of the CPU?
But if that's the argument, then not even assembly is sufficient, as control over speculative branching and prefetch is only accessible via microcode in the CPU.
I think the argument is improperly framed. This is a discussion over public and private interface. The CPU is treated as a black box with a public interface (the x86+ instruction set). Precisely how those instructions are implemented (on chip microcode) is a private matter for the chip design team, which if correctly implemented, does not matter to the user, as the results should be correct and consistent. Obviously, a poor implementation can lead to Spectre or Meltdown. But for the most part the specific transistors & diodes used to sum a set of integers, or transfer a word from L2 to L3 cache, etc. shouldn't matter to us. If the compilers are relying on side effects to alter behavior of the internal implementation based on performance evidence, then that is a boundary violation.
C is low level. It remains "universal assembly language".
While precisely how those instructions are implemented (on chip microcode) is a private matter for the chip design team, we do care how much resources it takes to implement these instructions, since if we can enable a more efficient implementation then we can get better price/performance.
For example, have research CPUs been built that optimise for Erlang rather than C that provide better efficiency for the same amount of programmer effort than an X86 CPU running C?
For example, when writing high-performance CPU-bound code it's usually important to keep in mind how wide cache lines are, but C doesn't expose this to the programmer in a natural way.
A modern Intel processor has up to 180 instructions in flight at a time (in stark contrast to a sequential C abstract machine, which expects each operation to complete before the next one begins). A typical heuristic for C code is that there is a branch, on average, every seven instructions. If you wish to keep such a pipeline full from a single thread, then you must guess the targets of the next 25 branches.
The Clang compiler, including the relevant parts of LLVM, is around 2 million lines of code. Even just counting the analysis and transform passes required to make C run quickly adds up to almost 200,000 lines (excluding comments and blank lines).
Sadly, too many programming languages try to be the end all be all. C is language that is great for working at the system domain.
Ideally, we would have small minimalist languages for various problem domains. In reality maintaining and building high quality compilers is a lot work. Moreover, a lot of development will just pile together whatever works.
That aside, you could build a computer transistor by transistor, but it's probably more helpful to think at the logic gate level or even larger units. Heck even a transistor is just a of piece of silicon/germanium that behaves in a certain way.
So there are levels abstraction, but is an abstraction low-level? I think term probably came about to refer lower layers of abstraction that build what ever system your using. So unless your using something that nothing can be added upon. Everything, even what people would call high level can be low-level.
Heck, people call JS a high level language, but there are compilers that compile to JS. This makes a JS a lower level system that something else is built upon. This just again shows why I would say that low-level is often thrown around with connotation that is not exactly true.
* "A programming language is low level when its programs require attention to the irrelevant."
* Low-level languages are "close to the metal," whereas high-level languages are closer to how humans think.
* One of the common attributes ascribed to low-level languages is that they're fast.
* One of the key attributes of a low-level language is that programmers can easily understand how the language's abstract machine maps to the underlying physical machine.
So basically the entire article's premise (the title) hinges on the last bullet- which can be contested. All the other mentioned attributes can be applied to Java, C, C#, C++. So failing the last bullet point doesn't apply to just C.
In other words, a programmer who sits down and uses C and not Java might think, "I am being forced to pay attention to irrelevant things and think in unnatural ways, but that's because I am writing fast code using operations that map to operations done by the physical machine. In a higher-level language like Java, more of these details are out of my control because they are abstracted away by the language and handled by the compiler."
I think the article does a great job dismantling this point of view, and telling the story that C is not so different from Java, aside from being unsafe and ill-specified.
Compared to something different like Erlang, Haskell, Lisp