
Linus Torvalds on Garbage Collection (2002) - AndrewDucker
http://gcc.gnu.org/ml/gcc/2002-08/msg00552.html
======
ekidd
Shortly before Linus wrote this article in 2002, I wrote an XML-RPC library in
C that used reference counting. By the time I was done, I'd written 7,000+
lines of extremely paranoid C code, and _probably_ eliminated all the memory
leaks. The project cost my client ~$5K.

The standard Python xmlrpc library was less than 800 lines of code, and it was
probably written in a day or two.

Was my library about 50 times faster? Sure, I could parse 1,500+ XML-RPC
requests/second. Did anybody actually benefit from this speed? Probably not.

But the real problem is even bigger: Virtually every reference-counting
codebase I've ever seen was full of bugs and memory leaks, especially in the
error-handling code. I don't think more than 5% of programmers are disciplined
enough to get it right.

If I'm paying for the code, I'll prefer GC almost every time. I value
correctness and low costs, and only worry about performance when there's a
clear business need.

~~~
ilcavero
~$5k for 7 KLOC of bug-free C code is a steal; that's impossible to do in less
than a couple of months.

~~~
kragen
It's nearly impossible to do in any timeframe. I can think of perhaps two
examples in human history where it's been done: qmail and seL4. And
there may still be bugs in qmail. There may be a few other non-public projects
that have achieved less than one bug per 7000 lines of C, but probably not
more than one or two.

~~~
fanf2
<http://www.dt.e-technik.uni-dortmund.de/~ma/qmail-bugs>

There are also some DNS-related bugs that are not on this list.

~~~
kragen
Thanks! I hadn't seen those, although I knew of 1.4. I think 1.1, 1.3, and 1.4
are actual bugs, if the reports are accurate; I'm pretty sure Dan disagreed
with Wietse about 1.1. Three bugs in about 15000 lines of code doesn't quite
rise to the level of less than one bug per 7000 lines of code, so maybe that's
only been done once, in seL4.

What are the DNS-related bugs?

------
barrkel
Reference counting is GC; a poor form if it's the only thing you rely on, but
it is automatic memory management all the same.

Generational GC will frequently use the (L2/L3) cache size itself as its
smallest generation, meaning it shouldn't suffer from the pathologies talked
about by Linus here.

What GC really gives you, though, is the freedom to write code in a functional
and referentially transparent way. Writing functions that return potentially
shared, or potentially newly allocated, blobs of memory is painful in a manual
memory management environment, because every function call becomes a resource
management problem. You can't even freely chain multiple invocations (y =
f(g(h(x)))) because, what if there's a problem with g? How do you then free
the return value of h? How do you cheaply and easily memoize a function
without GC, where the function returns a value that must be allocated on the
heap, but might be shared?
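
To make that concrete, here's a minimal C sketch of what the "simple" chain
y = f(g(h(x))) turns into without GC. The f/g/h here are made-up stand-ins
that each return freshly malloc'd memory; the point is the cleanup every call
site owes on every path:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical stand-ins: each returns a freshly allocated string
     * (or NULL on failure), so the caller owns the result. */
    static char *h(const char *x) { return strdup(x); }
    static char *g(const char *x) { return strdup(x); }
    static char *f(const char *x) { return strdup(x); }

    /* y = f(g(h(x))) with manual memory management: every intermediate
     * result must be freed on every path, including the error paths. */
    static char *compose(const char *x)
    {
        char *a = h(x);
        if (!a) return NULL;

        char *b = g(a);
        free(a);              /* release h's result before checking g's */
        if (!b) return NULL;

        char *c = f(b);
        free(b);
        return c;             /* the caller now owns c */
    }

    int main(void)
    {
        char *y = compose("x");
        if (y) { puts(y); free(y); }
        return 0;
    }

With GC (or pervasive refcounting in the language), compose collapses back to
the one-liner return f(g(h(x))); and the error paths take care of themselves.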

Writing code that leans towards expressions rather than statements, functions
rather than procedures, immutability rather than mutability, referentially
transparent rather than side-effecting and stateful, gives you big advantages.
You can compose your code more easily and freely. You can express the intent
of the code more directly, letting you optimize at the algorithm level, while
the ease of memoization lets you trade space for speed without significantly
impacting the rest of your program. Doing this without GC is very awkward.

GC, used wisely, is the key to maintainable programs that run quickly. You can
write maintainable yet less efficient programs, or highly efficient yet less
maintainable programs, easily enough in its absence; but its presence frees up
a third way.

~~~
illumen
GC code runs differently depending on the code running around it. It causes
the code to be non-deterministic and introduces a side effect.

I've written code in a reference-counted language (Python) which processes
about a gigabyte of data per second from the network, with hard real-time
requirements, all on one machine with multiple CPUs/cores. The code is fully
unit tested, doc tested, and functionally tested. It's also short, runs on
multiple platforms, and has been maintained by people other than myself. My
personal experience is that you can write highly efficient, maintainable code
with reference counting.

Reference counting manages memory automatically for you, but it also lets you
manage memory manually when needed. For many situations, it's the best of
both worlds.

~~~
Peaker
Did you compare the performance of Python with that of other languages that do
use GC (e.g. Haskell)?

------
jfr
> _A GC system with explicitly visible reference counts (and immediate
> freeing) with language support to make it easier to get the refcounts right
> [...]_

To be a little pedantic on the subject, such a system (reference counting and
immediate freeing) is a form of automatic memory management, but it is not GC
in any way. Garbage collection implies that the system leaves _garbage_
around, which needs to be _collected_ in some way or another. The usual
approach to refcounting releases resources as soon as they are no longer
required (either by free()ing them immediately or by sending them to a pool of
unused resources); thus it doesn't leave garbage around, and doesn't need a
collector thread or mechanism.

There are partial-GC implementations of refcounting, either because items are
not free()d when they reach zero references, or to automatically detect
reference loops which are not handled directly.

I agree with Torvalds on this matter. GC as it is promoted today is a giant
step that gives programmers one benefit, solving one problem, while
introducing an immeasurable pile of complexity to the system, creating another
pile of problems that are still not fixed today. And to fix some of these
problems (like speed) you have to introduce more complexity.

This is my problem with GC. I like simplicity. Simplicity tends to perform
well, and being simple also means there is little room for problems.
Refcounting is simple and elegant; you just have to take care of reference
loops, and those have another simple solution: weak references. I can
teach a class of CS students everything they need to know to design a
refcounting resource management system in one lesson.
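
As a rough illustration of that simplicity, the whole retain/release pattern
fits in a few lines of C. This is only a toy sketch (not any particular
library's API), with a weak back-pointer thrown in to show how reference
loops are avoided:

    #include <stdlib.h>

    typedef struct node {
        long refcount;
        struct node *child;   /* strong reference: retained */
        struct node *parent;  /* weak reference: not retained, breaks the loop */
    } node;

    static node *node_new(void)
    {
        node *n = calloc(1, sizeof *n);
        if (n) n->refcount = 1;       /* the creator holds the first reference */
        return n;
    }

    static node *node_retain(node *n)
    {
        if (n) n->refcount++;
        return n;
    }

    static void node_release(node *n)
    {
        if (n && --n->refcount == 0) {
            node_release(n->child);   /* drop our strong references */
            free(n);                  /* freed immediately: no garbage lying around */
        }
    }

Link a parent and child with one strong and one weak pointer and both get
freed as soon as the last outside reference goes away; make both pointers
strong and you have the classic leak that weak references exist to prevent.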

GC is the opposite: it is big, complex, and the more you try to fix it, the
more complex it becomes. The original idea is simple, but nobody uses the
original idea because it performs so badly. To teach the same class how to
design a GC system that performs as well as we expect today, an entire
semester may not be enough.

~~~
stcredzero
_I agree with Torvalds on this matter._

In a way, I do as well.

 _GC as it is promoted today is a giant step that gives programmers one
benefit, solving one problem, while introducing an immeasurable pile of
complexity to the system, creating another pile of problems that are still not
fixed today. And to fix some of these problems (like speed) you have to
introduce more complexity._

There are plenty of contexts where speed is a non-issue. In those cases, GC
has been a huge win. The conceptual simplicity is the important part. The cost
of the resources that would be saved with explicit and optimized memory
management would be far outweighed by the resources required to implement such
things.

 _The original idea is simple, but nobody uses the original idea because it
performs so badly._

This is simply not true.

In the context of IO-bound enterprise systems, I've seen generational GC
perform admirably, almost magically. As a lark, I've put infinite loops into
such apps that do nothing but allocate new objects, and unless you were doing
an exceptionally intense operation, you couldn't tell the difference. Properly
tuned generational GC can be a truly fantastic-seeming thing!

However, I will agree that the concerns Linus highlights are real, and that
refcounting systems, like the one in iOS, are by far the better choice in many
contexts.

EDIT: The system I victimized above, I only victimized in the TEST
environment, but it was populated with something like 2-week-old production
data. The application in question is a traditional client/server desktop app
used by a major energy company; it had 800 active users at the time, handling
millions in transactions every minute.

IDEA: If someone had an augmented ref-counting system with a runtime
containing an optional cycle-detector and something like LINT but for the
runtime reference graph, one would get most of the benefits of GC with the
efficiency of the ref-counting system. I half expect someone to tell me that
this already exists for Python.

~~~
kragen
> I've seen generational GC perform admirably, almost magically. As a lark,
> I've put infinite loops into such apps that do nothing but allocate new
> objects

While I agree that generational GC can perform spectacularly well, what you're
describing is close to the case it's optimized for, not close to its worst
case. The worst case is that you allocate lots and lots of small objects and
then write a pointer to all of them into a tenured garbage object.

> I half expect someone to tell me that this already exists for Python.

Yes, that's how Python works, except that I don't know what you mean by
"something like LINT but for the runtime reference graph."

~~~
stcredzero
_While I agree that generational GC can perform spectacularly well, what
you're describing is close to the case it's optimized for, not close to its
worst case_

Here's the thing: Most of the rest of the app was rather close to the case
it's optimized for.

 _I don't know what you mean by "something like LINT but for the runtime
reference graph."_

Something that tells you that you've created a reference graph with a cycle,
that you have a memory leak, or that you're using references in some other
stupid or suboptimal way. I'm not even sure if there's a way to automatically
detect anything like the last category, though the first two are certainly
detectable. Basically, you take most of the infrastructure of GC, and you just
turn it into a runtime advisor to warn devs and testers of mistakes.

~~~
kragen
Great Circle commercialized the Boehm collector back in the 1990s, and I seem
to recall that most of their customers were using it to tell them when they
had a memory leak or reused freed memory, not to remove the need for reference
counting altogether. But it didn't tell you about cyclic references, unless
they resulted in a memory leak.

------
famousactress
We should _really_ encourage each other to put the date in the title when
submitting old articles to HN. It's a total brainf*k to read through the
entire article and not realize the context it was in... or to just glance at
the title and assume the topic is a current one. Just saying.

[Edit] Not that I have a problem with older posts, btw.. I actually really
like them most of the time. But the date would give everyone a better
opportunity to evaluate whether they want to read the article, and would be
reading it with reasonable context.

~~~
lambda
Is it just me, or does the title say (2002) to give you context?

~~~
famousactress
That was added later (presumably in response to this comment).

------
sklivvz1971
It's 2011, FFS. This kind of mindset is really self-defeating in the long
term. Sure, hand optimizing is better. Having a gazillion lines of shit legacy
code and technical debt to fix because you hand-optimized for the 90's is
not so great. I'll keep my GC and sip a Mojito on the beach, while Linus keeps
on fixing Linux's "optimizations" ten years from now.

~~~
shin_lao
The alternative to a GC is not "hand optimizing". There are several patterns,
such as reference-counted memory and RAII, that are far from complex to use.

If you've written any data-intensive application, you know that a GC doesn't
solve memory issues. It's just a different strategy.

~~~
sklivvz1971
It's a strategy _I_don't_need_to_concern_myself_with_ :-)

------
loup-vaillant
So. Programs that use Garbage Collection tend to be slow.

Cause: Hardware doesn't like it.

Solution: fix the hardware?

Seriously, I'm afraid we're stuck in a local optimum here. It is as if
machines are optimized for the two dominant C/C++ compilers out there, and we
then have to optimize our programs against that, closing the loop. Shouldn't
compilers and hardware be designed hand in hand?

~~~
vog
I think you're referring to the following paragraph:

 _One fundamental fact on modern hardware is that data cache locality is good,
and not being in the cache sucks. This is not likely to change._

However, this did not happen to please the "two dominant C/C++ compilers". The
reason is much more fundamental:

One of the most expensive parts of the hardware is memory, and fast memory is
a lot more expensive to produce than slower memory. So we have the choice
between using the same (and thus slow) memory throughout the system, or
combining different kinds of memory so that software at least has the
_chance_ to run faster. This is a fundamental issue, and the only thing you
can do is try to find the optimal share for each kind of memory.

But no matter how well you choose: software will only be able to exploit this
if it is designed for locality.

If you can fix that (i.e. if you can find a cheap way to produce gigabytes of
fast memory that makes caches obsolete), the current compilers won't stop you
from exploiting it: the code that is optimized for locality will still run as
fast, and the code that can't be optimized for locality will run orders of
magnitude faster.

So we aren't in a local optimum at all. You can still optimize further "just"
by producing faster and cheaper hardware.

~~~
kragen
_One of the most expensive parts of the hardware is memory, and fast memory is
a lot more expensive to produce than slower memory. So we have the choice
between using the same (and thus slow) memory throughout the system, or
combining different kinds of memory so that software at least has the_
chance _to run faster. This is a fundamental issue, and the only thing you can
do is try to find the optimal share for each kind of memory._

Although all modern high-performance (edit: I mean non-embedded-
microcontroller) computers work this way, it's not the only possible way. The
Tera MTA takes a different, cacheless approach.

First, the problem with modern RAM in desktop machines is not that it sucks at
_bandwidth_. You can get your bandwidth arbitrarily high by multibanking.
Multibanking requires more buses or point-to-point links, but that's a
tolerable cost.

The problem with modern RAM is that it sucks at _latency_, compared to what
the CPU would like. Well, what do you do about latency? You make your requests
earlier, and make sure you have other things to do in the meantime until they
get back. The Tera did this by having 128 sets of registers (128 hardware
threads) and switching to the next thread on every cycle. That means that, if
all the thread slots were full, every thread only executed an instruction
every 128 cycles, which is plenty of time to hide the latency of a slow memory
fetch, as long as the memory bandwidth was adequate.

So basically every thread gets to pretend that it's running on a machine with
zero-latency RAM — memory that's as fast as the registers. And pointer-chasing
becomes as fast as looping over an array.

There are some other advantages to this design. Pipelining logic is very
simple, because unless your pipeline gets insanely deep, you never have two
instructions in the pipeline from the same thread, so you don't have register
hazards.

(Cache is still beneficial in such a design, since it reduces the bandwidth
that the links to main memory need to support. But the Tera didn't use it.)

I don't really understand why the Tera MTA failed in the market, and I suspect
the problems were commercial rather than technical — customers had to take a
big risk by porting their software to an unproven HPC platform, a platform
whose performance characteristics were completely unlike anything else in the
market (and unlike anything you can buy today). So customer uptake was
insufficient to provide the cash flow needed to keep updating the design to
keep up with Intel and AMD.

The _technical_ reason such a design might fail would be if the silicon
resources needed to support an entirely independent core were comparable to
the silicon resources needed to support a hardware thread. Consider the
GreenArrays GA144 chip: 144 independent cores, each with a tiny amount of
independent RAM, on the same chip. Such a chip will be at least as fast as a
chip with 144 independent register sets, but a single execution pipeline — in
the worst case, it's bottlenecked on getting data out of RAM, and only one of
its cores is usable, making it just as fast, while in the best case, it runs
144 times as fast. So a chip with 144 register sets needs to be _cheaper_ —
i.e. smaller — than the GA144. (Well, or easier to program, but presumably you
can use more mainstream multicore chips to prove the example instead.)

~~~
ramchip
Isn't there a much simpler reason - that programs are generally single-
threaded? I imagine it would take a lot of work to port basic tools like a web
browser to such an architecture without it running much slower. A lot of
applications do little parallelizable number-crunching, but a lot of branching
and sequential operations. How could you make parsing XML or HTML fast on
this? What about a text processor or a compiler?

~~~
kragen
It's true, if Intel could make a single processor core that went eight times
as fast on single-threaded code, they would do that instead of making eight-
core chips, unless the cost difference was horrific. But they can't, so if you
want your code to go faster, you have to find a way to parallelize it.

This started happening in earnest about 20 years ago in the supercomputer
market, which is where the Tera was sold. About 10 years ago, it started
happening in desktop CPUs (check out Herb Sutter's article about "the end of
the free lunch") and now it's starting to happen in embedded microcontrollers,
with the Parallax Propeller and the GreenArrays chips.

As it happens, the Tera didn't lose to faster single-threaded supercomputer
CPUs. By the time the MTA came out, even the most stalwart defenders of the
fast-single-threaded-performance approach, like the Cray SV1 and the NEC SX-5,
had succumbed to the necessity of CPU parallelism. But the approach that was
taking over the supercomputer market at the time was actually far _more_
parallel, and far more difficult to program efficiently --- NUMA machines and
then Beowulfs.

So that's why I don't think it was single-threaded programs that made the Tera
fail in the supercomputer market.

The questions of how to meaningfully parallelize XML and HTML parsing,
compilation, text processing in general, and web browsers are very interesting
indeed. It's not obvious how to do it, but it might turn out to be tractable.
A group at Berkeley was doing some research on it in 2007 and 2008:
<http://www.eecs.berkeley.edu/~lmeyerov/projects/pbrowser/>

~~~
ramchip
My point was not aimed at the Tera specifically, but rather at this
architecture as a solution to the memory problem in general. As you say,
supercomputers have been using multi-threaded code for a long time now, so
it's not a technical problem for this case, but it can be for the kind of
applications 'regular users' may need.

Thanks for the references. I'm in embedded systems but I hadn't heard of the
Parallax Propeller before, it's an interesting architecture.

~~~
kragen
I suspect that the killer feature of Tera-like architectures for embedded
systems could be their determinism. Embedded microcontrollers like the LPC2100
series are starting to get caches, branch prediction, and the like, and that
makes me really nervous, because it's going to make worst-case timing very
hard to reproduce in testing. (In a way, this has been the case since the
introduction of interrupts in the 1960s, but I think it makes the problem much
worse.)

If multithreaded or massively multicore processors were a viable alternative,
you could get deterministic timings without sacrificing throughput. You could
even do away with interrupts. The GreenArrays chip doesn't have interrupts at
all; instead, its cores go into a low-power shutdown state whenever they're
waiting on I/O.

But that's all pretty speculative.

------
jasongullickson
What he's advocating sounds a lot like how things work in the iOS world, in my
experience.

~~~
BenoitEssiambre
And thus the insanely smooth user experience on iOS. As an Android developer,
this is _the one thing_ I feel makes it difficult to have a polished user
experience on Android compared to iOS.

If you think long and hard about each place you call 'new' in Java Android
apps, it is possible to get a smooth interface. However, the language doesn't
encourage it by default the way iOS does, and you have to put in time you
usually don't have.

For users, milliseconds of jitter and blockiness everywhere are a huge
turn-off. It feels like the device is struggling to handle simple tasks, and it
destroys the belief in the UI metaphor of physical objects that have inertia
and flow, stretch and bounce when you touch and flick them.

IMO, because it makes such a huge difference to users, UI programming should
always happen in a high-performance language and environment. One way to
attain this performance is to eschew garbage collection. This is also a
problem in web browsers, where the UI is often written in JavaScript.

At the very least, those who make programming environments, languages or GCs
used for building user interfaces should optimize how their environments
promote good use of local hardware caches and acceleration. There is a reason
why UI rendering often relies on video hardware that is not totally unlike a
desktop high-performance computer.

~~~
kenjackson
_And thus the insanely smooth user experience on iOS. As an Android developer,
this is _the one thing_ I feel makes it difficult to have a polished user
experience on Android compared to iOS._

Yet on WP7 it's also silky smooth. The main issue on WP7 for user apps isn't
GC at all, but rather the network (creating long lists of images that you're
getting from a web service). Once people learned some techniques for dealing
with that on the device, the experience for 3rd-party apps was just as silky as
iOS apps -- yet with a full generational GC.

The GC is an excuse (unless it's not well written), not the reason.

------
joeyespo
I think this is another case of "everybody thinks about garbage collection the
wrong way":
[http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047...](http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047586.aspx)

From the article: "Garbage collection is simulating a computer with an
infinite amount of memory. The rest is mechanism."

Whether or not it's reference counting or generational, the goal is still to
simulate infinite memory. That way, you can focus on the high-level problems
instead of the technical memory-related details. So it's not necessarily a bad
mindset to have.

------
manveru
Might be worth mentioning Tcl in this context, as it uses reference counting
for the GC [1].

It also doesn't allow circular data structures, which are quite hard to
implement if all you have are strings anyway.

[1]: <http://wiki.tcl.tk/3096>

~~~
zerohp
Perl also uses reference counting, and cyclical data structures cause memory
leaks unless you explicitly decrease the reference count with the weaken
function from Scalar::Util.

~~~
IgorPartola
And Python. AFAIK, there are no plans to fix it at this point. Also Python is
notorious for allocating lots of small objects.

~~~
stonemetal
Python has a mark-and-sweep garbage collector (since 2.0) to catch what
reference counting misses. You can disable it if you know you don't create any
cyclical objects.

------
wladimir
[2002]

Though his argument about cache does still hold.

~~~
MatthewPhillips
I think his opinion might not have changed. In his latest "C++ sucks" rant in
that "why is git written in C" thread he points out not having GC as being one
of the detriments of C++.

~~~
awj
For the complexity it adds, not having GC _is_ a detriment of C++. The language
makes reasoning about allocation, object ownership, etc., much harder. It does
this in the course of providing features intended to help you structure and
manage larger pieces of code, and thus accidentally works to defeat that
purpose.

I don't think Linus ever intended to declare GC as evil or wrong. He was
calling it out as a very poor choice that people persist in making for certain
domains.

------
Vlasta
I like him mentioning the programmer's mindset associated with GC being a big
danger. Some people consider GC a magic bullet and refuse to think about
what's happening under the hood. I do not consider that a good habit.

~~~
marshray
It seems to be far easier (i.e., possible) to go from a manual-memory-
management style of development to an automatic one than the other way around.
I've known plenty of Java-CS-degree programmers who just never could get the
hang of writing C/C++ code without leaking stuff (and not just memory).

~~~
alextingle
This is a key point that I think lots of GC proponents gloss over. Memory is
not the only resource that needs to be managed. Open file handles, sockets,
user sessions, whatever... they all need to be managed in much the same way as
memory.

------
albertzeyer
When I read this, I immediately thought about std/boost::shared_ptr. This is a
bit ironic since Linus hates C++ so much.

shared_ptr is a really nice thing in C++. (For those who don't know: It is a
ref-counting pointer with automatic freeing.) And its behavior is very
deterministic. In many cases in complex C++ applications, you want to use
that.

~~~
alextingle
Or as we like to call it, std::shared_ptr

------
__david__
I like the way the D language approached this. It's garbage collected but it
also has a "delete" function/operator. That way you can use garbage collection
if you'd like, or you can manually free memory when you think it's worth it.

That seems like a reasonable compromise and I'm surprised that more languages
don't do it.

------
iskander
I'm very suspicious of anyone (even Linus) claiming that gcc is slow because
of its memory management. The codebase is crufty and convoluted; it's
probably slow for a thousand different reasons. If you refactored it into a
clean design and rewrote the beast in OCaml (or any other language with a
snappy generational collector), you'd probably get a large performance boost.

~~~
astrange
I think your argument consists of "gcc is slow because I don't understand the
code layout".

One of the problems affecting the C frontend and backend is poor cache
locality due to pointer chasing in their data structures, and they currently
do switch between GC memory and manually allocated zones (obstacks) to improve
this.

------
ww520
This is like arguing that assembly is better than high-level languages because
it's faster with explicit control. The thing is, 99% of the time it doesn't
matter.

In most cases, GC-based programs have good enough performance to get the job
done. For the 1% case, sure, use C/C++/assembly to have the explicit
control and performance. Doing things in non-GC systems because of a potential
caching problem sounds like a case of premature optimization.

~~~
astrange
I really like it when people reply "usually this doesn't matter" to a
discussion in a project where it _does_ matter.

~~~
ww520
I got the impression that Linus was discussing GC in general usage in the long
post, not specifically the GCC compiler. He even brought up the copy_on_write
example, which was a kernel or file system usage.

If the goal is to speed up GCC, there is a long list of things to do before
you have to worry about L1/L2 cache misses. Header file processing (or re-
processing) is one of the biggest time sinks during compilation.

~~~
astrange
cpp avoids reprocessing:

<http://gcc.gnu.org/onlinedocs/cpp/Once_002dOnly-Headers.html>

Processing header files in the first place is certainly a problem affecting C,
but I believe optimizations take more time. Parsing is a much larger problem
for C++.

~~~
ww520
I meant reprocessing the same set of header files for every single cpp file
that includes them. Precompiled headers are supposed to speed it up, but they
still need to be read in and re-created in memory for each cpp file. Why not
just compile all the cpp files in one process rather than spawning off a
compiler process for every file? That would ensure reuse of all the header
files that have already been processed and are still in memory.

My point is: run the profiler, see what the bottlenecks are, and pick those
areas for optimization, rather than speculating that L1/L2 cache misses are
causing the major delay. If they ran the profiler and L1/L2 cache misses really
are the problem, then I have nothing else to say.

~~~
astrange
> But they still need to be read in and re-created in memory for each cpp
> file.

This is just mmap for Clang. For GCC it... isn't. The other one you mentioned
is called a compile server and nobody seems to have cared enough to implement
it.

> If they ran the profiler and L1/L2 cache misses are really the problem, then
> I have nothing else to say.

<http://gcc.gnu.org/ml/gcc/2011-04/msg00315.html>

~~~
ww520
Compile server, interesting concept. If no one cares to implement it, that
means performance is not a high value item for people to improve upon.

That's an interesting problem: doing a lot of lookups from main memory because
the data structures are too big to fit in L1/L2. But that usage is not
allocating a lot of short-lived objects, freeing them, and reusing the
freed memory right away for the next allocations. That's what Linus is arguing
for: reusing the L1/L2 for the short-lived objects, which is why he considers
GC inappropriate. I would imagine compilers typically allocate objects that
have long lifetimes, such as declarations that stay in scope through the whole
compile cycle. Also, if memory allocation/deallocation really is a performance
problem, then don't deallocate, just reuse the buffers. Reusing the same set of
buffers would make sure they are hot in L1/L2.
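
To be clear about what I mean by reusing buffers: even something as dumb as a
free list, where released blocks are handed straight back out, keeps the same
few blocks hot in cache. A minimal sketch (sizes and names made up):

    #include <stdlib.h>

    #define BLOCK_SIZE 256        /* must be at least sizeof(struct block) */

    /* Recycled blocks are handed out again immediately, so a handful of hot
     * blocks keep getting reused instead of churning through cold memory. */
    struct block { struct block *next; };

    static struct block *free_list = NULL;

    static void *block_alloc(void)
    {
        if (free_list) {
            struct block *b = free_list;
            free_list = b->next;      /* reuse a recently released block */
            return b;
        }
        return malloc(BLOCK_SIZE);    /* fall back to the system allocator */
    }

    static void block_release(void *p)
    {
        struct block *b = p;
        b->next = free_list;          /* don't free(): save it for the next caller */
        free_list = b;
    }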

Anyway it has been an interesting discussion.

------
mckoss
Didn't Linus forget

    
    
        newnode->count = 1;

~~~
ciupicri
Maybe it's done by the _copy_alloc_ function.

------
joshhart
Here are a couple of reasons why I think it's not so clear cut:

1\. If garbage collection were that damaging to the cache, Haskell wouldn't be
nearly as fast as C.

2\. Copy-on-write data structures are nice because the immutability allows for
concurrent access without locking.

Granted, this was from 2002 and Linus may no longer feel so strongly about the
topic.

~~~
jedbrown
Show me a memory-intensive kernel in which Haskell runs close to the
performance model. Sparse and dense matrix kernels would be a good place to
start. Our C code for sparse matrix-vector products and sparse triangular
solves gets better than 90% of STREAM bandwidth (based on an assumption of
optimal cache reuse; STREAM itself is about 85% of hardware peak).
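
(For concreteness, the sparse matrix-vector product being discussed is
essentially the textbook CSR loop below; this is a generic sketch, not our
actual code. Its time is dominated by streaming through val/col_idx and by
the irregular loads from x, which is why it's measured against STREAM
bandwidth rather than FPU peak.)

    /* y = A*x for a sparse matrix A stored in compressed sparse row form. */
    void spmv_csr(int nrows,
                  const int *row_ptr,    /* nrows+1 row offsets */
                  const int *col_idx,    /* column index of each nonzero */
                  const double *val,     /* value of each nonzero */
                  const double *x,
                  double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }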

Dense matrix kernels should get better than 90% of FPU peak. Unlike sparse
kernels, dense kernels are no longer bandwidth-limited, but cache reuse in
both L1 and L2, as well as friendly TLB behavior, is important to good
performance.

It would be interesting to see any Haskell implementations that are
competitive. I suspect that the very first thing you will do when trying to
get performance is to ditch the functional paradigm and start writing code in
an assembly-level monad.

~~~
dons
> It would be interesting to see any Haskell implementations that are
> competitive. I suspect that the very first thing you will do when trying to
> get performance is to ditch the functional paradigm and start writing code
> in an assembly-level monad.

Or teach the compiler about the algebra of arrays and matrices, so it can do
the things to the code that we'd write by hand.

E.g.

* [http://www.cse.unsw.edu.au/~benl/papers/stencil/stencil-icfp...](http://www.cse.unsw.edu.au/~benl/papers/stencil/stencil-icfp2011-sub.pdf)

* [http://www.cse.unsw.edu.au/~benl/papers/repa/repa-icfp2010.p...](http://www.cse.unsw.edu.au/~benl/papers/repa/repa-icfp2010.pdf)

* <http://www.cse.unsw.edu.au/~dons/papers/stream-fusion.pdf>

* <http://www.cse.unsw.edu.au/~chak/papers/acc-cuda.pdf>

In all these cases array codes are written in a functional style, accompanied
by special-purpose optimizations and/or code generators (in the case of GPU
code), layered over an imperative array-primitives layer, using a memory
effects monad.

~~~
jedbrown
This is a worthwhile research topic, but it doesn't really answer my question.
From the first paper you cite:

 _The single threaded Handwritten C version is about 45% faster than our best
Haskell result, which is achieved with 3 threads._

Meanwhile, there is no performance model so we don't know how good the C
version is. The paper doesn't even report a simple fraction of FPU or
bandwidth peak. It is not using SSE instructions so it cannot possibly be
better than 50% of FPU peak (the limit is actually lower because this kernel
is/should be bandwidth limited). As for parallelism, I'll quote Bill Gropp [1]

 _The easiest way to make software scalable is to make it sequentially
inefficient._

[1] http://books.google.com/books?id=2Da5OcnjPSgC&lpg=PA21&#...

~~~
dons
Oh, I'm certainly not arguing that you're going to beat hand-tuned
straight-line code. I'm just pointing out that dropping into assembly isn't the
only possible path.

~~~
jedbrown
If you're not within 10% for these kernels, chances are that memory is being
used differently. This gets to a further matter which I think is perhaps the
greatest failure of current multi/many-core programming paradigms: assuming a
flat memory model. Efficient parallel computation has much less to do with
computation than with data movement. Recent and future architectures have
deeply hierarchical memory systems so any paradigm that does not expose the
location of physical pages (within some appropriate abstraction) will have a
hard time delivering consistent, understandable performance. Performance
should not vary an order of magnitude based on whether memory was faulted
(allocation is irrelevant) using a batch of threads with different affinity
than those that access it later. But this is the current state of affairs.

I would very much like to see a paradigm where a memory distribution (roughly
a high-level representation of the mapping to physical memory) was a first-
class concept. Suppose that new memory could be allocated or remapped to have
certain compatibility relative to the mapping of another block. Then you could
associate tasks with certain coupling between two distributions.

~~~
rayiner
> If you're not within 10% for these kernels, chances are that memory is being
> used differently

More likely imperfect strictness analysis, etc. Haskell is a pure functional
lazy language, after all. Getting within 65% of C's performance on a tight
numeric kernel is heroic.

~~~
jedbrown
Different strictness analysis results in using memory differently. Note that
some of Don's references are embedding a DSL that gives a high-level interface
to very low-level code (e.g. CUDA) specific to this problem. They should have
control over strictness.

------
teh
Slightly related: he mentions that when the containing structure of a
sub-structure goes away, you can free all the resources. The guys behind Samba
4 developed talloc [1], which is built around that idea.

[1] <http://talloc.samba.org/talloc/doc/html/index.html>

~~~
neilc
Arena or pool memory allocation is a very old idea, and has been implemented
by many different systems (e.g., Apache/APR, PostgreSQL, lcc).
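
The core idea is tiny. Here is a deliberately minimal arena sketch in C (one
fixed-size region, no chaining or hierarchy, unlike APR pools or talloc):
allocation is a pointer bump, and one call frees everything the arena ever
handed out.

    #include <stdlib.h>

    typedef struct {
        char  *base;
        size_t used;
        size_t cap;
    } arena;

    static int arena_init(arena *a, size_t cap)
    {
        a->base = malloc(cap);
        a->used = 0;
        a->cap  = cap;
        return a->base != NULL;
    }

    static void *arena_alloc(arena *a, size_t n)
    {
        n = (n + 7) & ~(size_t)7;      /* keep allocations 8-byte aligned */
        if (a->used + n > a->cap)
            return NULL;               /* toy version: no growing, no chaining */
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    static void arena_free_all(arena *a)
    {
        free(a->base);                 /* one free() releases every allocation */
        a->base = NULL;
        a->used = a->cap = 0;
    }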

------
kerkeslager
> _In contrast, in a GC system where you do _not_ have access to the explicit
> refcounting, you tend to always copy the node, just because you don't know
> if the original node might be shared through another tree or not. Even if
> sharing ends up not being the most common case. So you do a lot of extra
> work, and you end up with even more cache pressure._

It's possible that things were different in 2002, but I don't really think
this is the case now. In general, I make the node immutable and never copy it
(copying an immutable object makes no sense). In a well-designed code base,
mutations happen within the function where the data is created (read: on the
stack, where cache locality is a given). Immutability also addresses Linus'
concerns with thread-safety. And that's not accounting for concerns which
Linus DOESN'T mention, such as increased development speed and correct program
behavior.

I'm not the only one saying this. Josh Bloch, for example, recommends
immutability and cites cache reasons
([http://www.ibm.com/developerworks/java/library/j-jtp02183/in...](http://www.ibm.com/developerworks/java/library/j-jtp02183/index.html)).
And many languages (Haskell, Clojure) are designed heavily around avoiding
mutation and sharing nodes within data structures.

This talk of copying nodes to avoid your objects changing out from under you
sounds a lot like what I call "writing C in Java". Linus is looking at this
from the perspective of, "If they took away explicit memory management from C,
this is how I would do it." But OF COURSE if you just bolt a feature like GC
onto a language that didn't have it before, it won't work well. Effective
cache usage in a GCed system requires other language constructs (like
immutability).

Now, after all that, I won't make the claim that immutability in a GCed
language like Java or C# is faster or even as fast as C with explicit memory
management: it would take a lot of profiling code and comparing its
functionality to make that claim with any kind of certainty. But it doesn't
seem like Linus has done that profiling and comparison either.

------
KirinDave
That was 2002. Here is the state of the art in 2008:
<http://cs.anu.edu.au/techreports/2007/TR-CS-07-04.pdf>

Unsurprisingly, things have changed. Many of Linus's complaints were valid, and
we've learned how to address them.

------
mv1
I find it sad that, to this day, one has to spend so much time worrying about
memory management to get decent performance. I've yet to work on a
performance-oriented project where I didn't need to write at least a couple of
custom allocators to reduce memory management overhead.

GC systems are no better in this regard. I was told of an interesting hack in
a Java program that implemented a large cache of objects by serializing them
into a large memory block so that the GC saw it as one big object and didn't
traverse it. This resulted in dramatically reduced GC pause times (10x+).
When needed, objects were deserialized from the array. Disgusting, but
effective.

------
LarrySDonald
So... essentially, man the F up and live without GC in the parts that are going
too slow, instead of saying "Oh it's cool, just wait ten years and hardware
will be fast enough to run this anyway". Use GC for stuff that needs to be
simple and is fast enough anyway; don't bog down code that's too slow with it.

------
earino
Guy who writes kernel code cares about performance, film at 11.

------
mmcconnell1618
I'm quite sure the machine code generated by my compiler isn't nearly as good
as it could be if I hand-coded it, but the efficiency of not writing in machine
code far outweighs any potential performance gains.

~~~
Rusky
You have a good point that applies in a lot of places, but here you're aiming
it at a straw man. Linus advocates reference counting and even suggests that
building it into the language would be a good idea. That's hardly hand-coding
anything; it's just a different strategy for GC (which is actually done).

It is getting harder and harder to beat compilers with hand-coded assembly
without an enormous amount of effort, though.

------
VladRussian
"All the papers I've seen on it are total jokes."

Couldn't agree more. We were actually laughing in the office when an office
mate brought up such a paper many years ago.

"I really think it's the mindset that is the biggest problem."

Linus is a superhero, 20+ years into the supertask of changing people's
mindsets.

------
mfukar
Hacker News, another place where 10-year-old emails are submitted as _news_.

~~~
wbhart
Maybe the OP considered it news because it is tangentially related to this
patent issue to do with automatically expiring data in hash tables.

~~~
mfukar
I assumed that's what the comment section is for.

