
The cost of Linux's page fault handling - LaSombra
https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6
======
evmar
Linus wrote: 'Even a fully built kernel ("allmodconfig", so a pretty full
build) takes about half a minute on my normal desktop to say "I'm done, that
pull changed nothing I could compile".'

Since it's just after a git operation, all the file state should be warm in the
disk cache. It really shouldn't take that long. The Linux kernel (at least
going by [1]) is about the same size in terms of file count as Chromium, and
we got this operation down to about a second by using better tools (i.e. non-
recursive make and then eventually a replacement).

I appreciate that the kernel has its own requirements (it sounds like his no-
op builds are still running shell scripts, something you ought to avoid in
your critical path) and also it's great that he's running it this way in part
to help profile a "normal" workload... but I'm also a bit sad to see so much
time spent waiting for something slower than necessary, as well as time spent
optimizing what feels like the wrong thing.

[1]: [http://larjona.wordpress.com/2011/06/15/numbers-about-the-linux-kernel-2-6-38-2/](http://larjona.wordpress.com/2011/06/15/numbers-about-the-linux-kernel-2-6-38-2/)

~~~
josephg
From further down in the comment thread, Linus says that speeding this up
should help other workloads too, and he's sick of using make replacements:

---

+Peter oh, it's absolutely true that 'make' is a pig, and does too much, and
we don't exactly help the situation by using tons of GNU make features and
complex variables and various random shell escapes etc etc.

So there's no question that some "makefile compiler" could optimize this all.
But quite frankly, I've had my fill of random make replacements. imake, cmake,
qmake, they all solve some problem, and they all have their own quirks and
idiocies.

So while I'd love for 'make' to be super-efficient, at the same time I'd much
rather optimize the kernel to do what make needs really well, and have CPU's
that don't take too long either.

Because let's face it, even if the kernel build process was some super-
efficient thing, real life isn't that anyway. I guarantee you that the "tons
of small scripts etc" that the kernel build does is a real load somewhere
totally unrelated. Optimizing page faults will help other loads.

~~~
gwern
> Because let's face it, even if the kernel build process was some super-
> efficient thing, real life isn't that anyway. I guarantee you that the "tons
> of small scripts etc" that the kernel build does is a real load somewhere
> totally unrelated. Optimizing page faults will help other loads.

This is true as far as it goes, but it's a silly argument. Let's apply a
reversal test
([http://www.nickbostrom.com/ethics/statusquo.pdf](http://www.nickbostrom.com/ethics/statusquo.pdf)).

Suppose the kernel build were as efficient as Chrome's is claimed to be on
this page (<5s) and wasn't stressing his system. Would Linus then approve of
anyone submitting patches to deliberately slow down the Linux kernel build,
just to expose slowness in page fault handling and encourage kernel devs to
spend time optimizing it?

No, of course not! That would be idiocy, and the person submitting the patches
would probably be banned by Linus in his titanic rage. However, since the slow
build & page faults are the status quo, Linus is making lemonade of it...

~~~
drewcrawford
So, that is actually an interesting thought experiment, thanks for that.

However I'm not sure it is directly applicable here. There are two courses of
action that could solve this problem:

A, an action that improves kernel builds

B, an action that improves several workloads

For A and B of similar cost, it makes sense to do action B in preference to
action A.

Your argument speaks to A being of positive utility, but given a finite
knapsack of effort too small to hold every candidate action, a greedy
algorithm that packs any positive-utility item into the knapsack is not
optimal.

~~~
gwern
I don't follow your reasoning here, but let me expand my observation further:
Linus can A. improve his build system (low-hanging fruit which the Chrome
numbers suggest could yield an order of magnitude better performance), or B.
he can search among a variety of difficult, unlikely-to-yield-major
improvements (like yelling at Intel engineers 'make it go faster!') which
would improve his build system and also other hypothetical loads (where large
gains, if any at all, are unlikely; what, is Intel too ignorant to try to make
the TLB fast?). He is claiming B is better in part because the hypothetical
loads make it better in total.

Most people would consider A a more reasonable reaction, especially after
hearing that Linus's best idea for doing B is apparently going all the way
down to the hardware level in search of some improvement. We can see this by
intuitively asking what people's reactions would be to a proposal to induce B
if the equilibrium were already at A.

~~~
nkurz
On the other hand, Linus is one of a handful of people in the world who may be
in a position to get results by yelling at Intel engineers to 'make it go
faster!'. This isn't because Intel is too ignorant to do things on their own,
but because practically everything can be optimized further, and Linus may
have enough sway to focus the engineers' attention on the problem that he
wants solved.

Personally, I don't care much about the speed of the Linux kernel build
system, but I do care about the speed with which page faults are handled by
the CPU. Even if the chances of success are lower, if he is able to succeed in
speeding up every page fault on future Intel processors, I would consider that
a much greater good.

The real problem (as I see it) is that I think he's trying to optimize the
wrong thing. His worst-case test is based on repeatedly faulting in an
uncacheable page: every TLB lookup fails at every level of the cache
hierarchy. Likely, Intel has chosen to optimize the realistic situation where
page translations are cached when they are repeatedly accessed.

------
saurik
Is anyone else horribly sad that this kind of content and discussion is ending
up on Google+? I often find myself combing the mailing lists of key projects
like the Linux kernel to figure out why things happened the way they did, and
Google+ comments are not searchable in the same way; and even if they were, I
have a difficult time believing that the content will not just be gone
entirely after another ten or twenty years. I know some people point out that
using "modern" communication channels decreases friction, latency, etc., but
for some use cases the value of records and centralization is high enough to
warrant moving slower or more painfully.

~~~
Igglyboo
Is anyone else horribly sad that this kind of content and discussion is ending
up on the internet instead of written down in books?

But seriously, mailing lists are pretty archaic. I'm not saying that G+ is
great, but it's possible to have solid data redundancy and centralization
without living in the past.

~~~
mwcampbell
Why do you equate discussions on mailing lists with living in the past? Sure,
Google+ is newer, but that doesn't make it better. On the contrary, mailing
lists are better in at least two ways: they're decentralized, and they don't
require proprietary software.

------
weinzierl
The title is a bit misleading. If I understood Linus correctly, he is saying
that a page fault on modern processors is costly in hardware. There is nothing
the OS can do to make the page fault itself (or the iret) faster.

> It's interesting, because the kernel software overhead for looking up the
> page and putting it into the page tables is actually much lower. In my
> worst-case situation (admittedly a pretty made up case where we just end up
> mapping the fixed zero-page), those 1050 cycles is actually 80.7% of all the
> CPU time.

~~~
sounds
I think it's interesting how he compares Haswell to his 32-bit Core Duo.

Haswell: 1050 cycles / 80.7% CPU time on his microbenchmark

Core Duo: 940 cycles / 58% CPU time on his microbenchmark

~~~
Zenst
Agreed, given one is 32 bit memory addressing and the other 64 bit. You would
expect the 64 bit to be upto twice as long and adding optimisations in the
middle, but not equal. After all a 32 bit memory address is twice as much as a
64 bit one to handle behind the scene.

Also he only used one compiler and would test with another compiler to confirm
the results. Just to eliminate a possible compiler quirk in a quick way
compared to checking the machine code.

------
dchichkov
Interestingly, a soft page fault can be caused not only by a real page fault
event, but also by NUMA autobalancing, which can actually be a major source of
kernel interference with purely user-space processes. I sometimes see these
NUMA-originated soft page faults very high up in a (CPU/cache) profiler for
sensitive real-time processes that should avoid context switches and L1/L2
cache pollution at all costs.

As a side note NUMA autobalancing can be disabled by running:

    
    
        echo 0 > /proc/sys/kernel/numa_balancing_scan_period_min_ms
        echo 0 > /proc/sys/kernel/numa_balancing_scan_period_max_ms
        echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb
        echo 1000000 > /proc/sys/kernel/numa_balancing_scan_period_min_ms
    

Or booting the box with the kernel command line that includes:

    
    
        numa_balancing=disable

------
Peaker
If "git status" is empty, and takes milliseconds to know that -- "make" should
take roughly the same amount of time to know the same.

There are some better build systems which will fulfill this requirement, but
fail others.

This is why I am implementing buildsome [1], which gives far better guarantees
[2] about the build's correctness while making it easier to specify.

Empty builds check only the "mtimes" of all files, and no more than that (see
the sketch below).

[1]
[https://github.com/ElastiLotem/buildsome](https://github.com/ElastiLotem/buildsome)

[2]
[https://github.com/ElastiLotem/buildsome/raw/master/doc/Pres...](https://github.com/ElastiLotem/buildsome/raw/master/doc/Presentation.pdf)
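
A minimal sketch of that mtime check (the file names foo.o/foo.c are
hypothetical; an empty build ends up issuing one stat() pair like this per
target/dependency edge):

    
    
        /* Is the target older than its dependency? */
        #include <stdio.h>
        #include <sys/stat.h>
    
        static int out_of_date(const char *target, const char *dep)
        {
            struct stat t, d;
            if (stat(target, &t) != 0)
                return 1;              /* target missing: must build */
            if (stat(dep, &d) != 0)
                return 1;              /* dependency missing: build, fail loudly */
            return d.st_mtime > t.st_mtime;
        }
    
        int main(void)
        {
            printf("rebuild needed: %d\n", out_of_date("foo.o", "foo.c"));
            return 0;
        }
    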

~~~
reubenmorais
> If "git status" is empty, and takes milliseconds to know that -- "make"
> should take roughly the same amount of time to know the same.

If absolutely nothing changed, you wouldn't need to build in the first place.

He's talking about the case of "nothing changed _that needs to be built_":
configuration files, sources that aren't built on that platform, etc. That
means the build system still needs to traverse directories, figure out
dependencies, and so on.

~~~
Peaker
If nothing needs to be built, my statement still stands. The build system
needs to cache its dependency computations.

An empty build would then take the same time as git status.

~~~
dangerlibrary

      #ifdef _WIN32
      /* ... some changes we don't care about on Linux ... */
      #else
      /* ... no changes ... */
      #endif
    

So you'd still need to traverse all the headers, at a minimum.

~~~
reubenmorais
Though in this case, things do need to be rebuilt. No C build system is _this_
clever, as far as I know :)

~~~
taeric
Actually, they sort of are. If the branch not taken under the current
configuration includes another header, and that header is where the change
was, then there will be no dependency between the current file and that header
on this platform/configuration.

~~~
Peaker
That case is the easy case (though caching of included files is problematic in
most build systems due to inability to express dependency on the inexistence
of the included name in a previous include directory).

The hard case is changing defined macros in a way that doesn't matter but does
pass extra -D flags to the compilation units. You can detect this by ad hoc
preprocessor aware logic, or you can have a separate build step for
preprocessing and do content aware rebuilds that avoid rebuilding if the
preprocessed text is identical.
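
A minimal sketch of that content-aware check (the names foo.i, for the cached
preprocessor output, and foo.i.new, for the fresh one, are hypothetical):

    
    
        /* Skip the compile when the preprocessed text is unchanged. */
        #include <stdio.h>
        #include <string.h>
    
        /* Returns 1 if the two files differ (or either is unreadable). */
        static int files_differ(const char *a, const char *b)
        {
            FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
            int differ = 1;
            if (fa && fb) {
                char ba[4096], bb[4096];
                size_t na, nb;
                differ = 0;
                do {
                    na = fread(ba, 1, sizeof ba, fa);
                    nb = fread(bb, 1, sizeof bb, fb);
                    if (na != nb || memcmp(ba, bb, na) != 0) {
                        differ = 1;
                        break;
                    }
                } while (na > 0);
            }
            if (fa) fclose(fa);
            if (fb) fclose(fb);
            return differ;
        }
    
        int main(void)
        {
            if (files_differ("foo.i", "foo.i.new"))
                printf("preprocessed text changed: recompile\n");
            else
                printf("identical preprocessed text: skip the compile\n");
            return 0;
        }
    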

~~~
taeric
Right, I was just saying that some of the build environments are "smarter"
than you might give credit. I think you'd be surprised how many people don't
know about gcc -M and friends.

~~~
Peaker
One thing almost nobody knows about gcc -M is that it is wrong :( And for 2
different reasons:

A) If the headers are supposed to be auto-generated, it will fail
uninformatively.

B) If everything is successful, it will not tell you about all the paths it
depends on _not existing_. For example:

    
    
        gcc -M -Ia -Ib foo.c
    

will tell you about b/x.h being a dependency, but will not tell you that the
non-existence of a/x.h is also a dependency.

~~~
taeric
A) doesn't make sense. If the header is autogenerated, then it has a
dependency specified elsewhere. So I'm not sure how that is a failure of -M.
(That is, it will tell you that this file depends on a header. It is up to
other rules to say that the header is autogenerated. Right?)

B) This makes sense, though I don't necessarily see how that could happen. It
won't save you from rerunning -M, but the flow is: rerun -M on the files to
get the list of dependencies, then restart with those dependency lists loaded
and see what needs to be rebuilt. Right?

~~~
Peaker
A) Usually, there's a rule like "auto_%.h: %.xml". Then you want #include
"auto_foo.h" to be found, and to cause auto_foo.h to be auto-generated from
foo.xml, without explicitly mentioning "foo.xml" anywhere in the build system
itself.

B) If you rerun -M every time you try to build, your empty builds are going to
be quite expensive. It makes sense to cache that, and only rescan files when
they or their #included files have changed. But then you need to be able to do
the file-inexistence dependency thing, or it's wrong.

~~~
taeric
Ah, I'm really just getting used to the autotools conventions, where I don't
think any wildcard rules are used. (And even then, I'm still just a learner.)

I see what you are saying with B, but I don't have quite enough experience to
know exactly how expensive that is. I also don't know enough about the kernel
build to know what it is doing.

Also, I'm curious why the -M flag can't output the inexistence stuff. I guess
it would be purely heuristic?

~~~
Peaker
I'm not talking about autotools :) I've always avoided autoconf et al. I'm
talking about make-based builds not being able to use gcc -M properly, because
it doesn't work when header files need to be generated to satisfy an #include
directive.

-M could in theory output the inexistence stuff, but then most build systems
couldn't even express that dependency.

~~~
taeric
Oh, I think I see what you mean. -M will catch when a C file needs to be
recompiled due to any header in the #include path, but not necessarily the
header files themselves.

For that, I would assume you would still have to handle the header by hand.
There may be some autotools thing that covers it. Though... even then, I'm not
sure what the point is. If the header file itself is generated, then you
already have the dependency on what it is generated from.

The only scenario I see as not covered is when an include inside an #if flips
from not taken to taken. Though that does seem a fairly edge case.

I think I really just need to see an example of the inexistence stuff. In
particular, one that is expected to change between non-clean builds.

~~~
Peaker
There are two independent problems here.

Problem A:

foo.c: #include "bar_auto.h"

when you have a rule "%_auto.h: %_auto.xml".

gcc -M on foo.c will not tell you about the dependence on "bar_auto.h", but
will rather fail. With buildsome or a better #include scanner, you will know
that foo.o depends on bar_auto.h, even though bar_auto.h does not yet exist.

Problem B:

x.c: #include "bla.h"

a/bla.h does not exist

b/bla.h does exist

gcc -Ia -Ib -o x.o -c x.c

gcc -M tells us that x.o depends on b/bla.h. We cache this information to
avoid rerunning gcc -M every time. Then someone adds a/bla.h. A rebuild should
now produce a different x.o, but "make" or whatever build system cached the
result of "gcc -M" will not rebuild anything.

You might say "gcc -M" should be rerun each time, but this is extremely
wasteful, as there's no reason to rescan all the .c/.h files every time, when
they did not change.

------
epistasis
This sounds like it's just the cost to traverse the page table, right? ~300
cycles per raw memory lookup, and three of them because you'll typically need
to go three levels deep?

The TLB is tiny these days, and 4kB pages are tiny.

I'm super hopeful that Linus is going to force through some big improvements
to HugePages, because current Linux HugePages support is super painful. 2MB
pages alone could be a massive gain.

~~~
justincormack
Switch to PowerPC and get 64k pages by default in most distros!

~~~
rwmj
.. and exposing lots of buggy userspace code into the bargain!

~~~
dmm
What kind of bugs appear with bigger default page sizes?

~~~
rwmj
Lots of userspace makes assumptions about page size being 4k and breaks when
it changes. Try looking for:

[https://www.google.co.uk/search?q="pagesize"+"64k"+"bug"](https://www.google.co.uk/search?q="pagesize"+"64k"+"bug")

[https://www.google.co.uk/search?q="pagesize"+"64k"+"issue"](https://www.google.co.uk/search?q="pagesize"+"64k"+"issue")

Another common one is actually in the kernel where filesystem block sizes are
limited to page sizes, so from this point of view large page sizes are better:

[https://lwn.net/Articles/591690/](https://lwn.net/Articles/591690/)
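
The usual fix for that class of userspace bug, by the way, is to query the
page size at runtime instead of hard-coding 4096. A minimal sketch:

    
    
        /* Portable page-size query: 4096 on x86, 65536 on some POWER kernels. */
        #include <stdio.h>
        #include <unistd.h>
    
        int main(void)
        {
            long page = sysconf(_SC_PAGESIZE);
            size_t want = 100000;                         /* hypothetical buffer size */
            size_t len = (want + page - 1) / page * page; /* round up to a page */
            printf("page size %ld, rounded length %zu\n", page, len);
            return 0;
        }
    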

------
jw2013
I once wrote a kernel loadable module for getting page fault handling info
(time cost, etc.):

[https://github.com/jw2013/getvminfo](https://github.com/jw2013/getvminfo)

I use it to see the page fault patterns of both sequential and random memory
access. In case anyone is interested, just check it out.

------
Rusky
One reason the Mill architecture looks so interesting is that all of the state
saving is done asynchronously, in parallel with program execution. In fact,
the same mechanism is used for function calls, system calls, and interrupts,
which all just look like atomic operations. That should make interrupts and
irets _much_ faster.

~~~
willvarfar
(Mill team)

Another interesting aspect of the Mill architecture is that protection and
translation are separate. The cache uses virtual addresses, and the TLB sits
between cache and main RAM. The TLB is much bigger and slightly slower, so
simply doesn't fault nearly so often.

~~~
mwcampbell
Can you guys say anything yet about how the Unix fork syscall is implemented
on that architecture? The requirement for copy-on-write pages seems to
conflict with a single virtual address space.

~~~
willvarfar
Afraid it's still Not Filed Yet (NFY). You knew I was going to say that. But
the slide deck is all ready for when we've filed.

The next talk will be about configuration, which is another cool topic, and
that'll be in a couple of weeks. Get on the mailing list to get details as
they become available:
[http://millcomputing.com/mailing-list/](http://millcomputing.com/mailing-list/)

------
dmethvin
This is one of the reasons why continuous performance testing is important.
Even when your own code doesn't regress, the platform beneath you may. In this
case it sounds like the platform made most things faster but not page faults,
so they now look proportionally worse.

------
acqq
So Linus noticed that the page fault on newer Intel CPUs became something like
10% slower. Before and after, it took roughly 1000 ticks to perform the fault.
That just means that at the same clock, say 3 GHz, instead of 3 million page
faults per second on a Core Duo you can now "just" do 2.7 million page faults
per second. It doesn't sound like something to be much worried about: code
which spends most of its time in page faults is not something I can imagine
anybody even attempting to make fast. Linus admits that too: changing the
build process would certainly eliminate a significant number of page faults in
his case.

So, my conclusion this time is... meh. The pig didn't fly before, and now it
won't by 10% more. Intel guys, if you sacrificed this in order to make the
instructions that matter faster, I welcome it: Linus also observes that the
rest of his code effectively got faster by... more than 3 times!? Before, it
was around 900 ticks of page fault code, with the rest of the code around 50
percent of the total, that is, also about 900 ticks. Now it's 1000 ticks for
the fault, with the rest around 20% of the total (see: Linus claims 80% spent
in the fault), that is, around 250 ticks for his code. I say, Intel, you did a
perfect job speeding up code that matters, from about 900 ticks down to 250
between the Core Duo and the new CPUs. This sounds amazing.
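
Using the exact figures quoted upthread (940 cycles at 58% of CPU time on the
Core Duo; 1050 cycles at 80.7% on Haswell), the back-of-the-envelope
arithmetic is:

    
    
        Core Duo:  940 / 0.58  ≈ 1620 cycles per iteration, ~680 outside the fault
        Haswell:  1050 / 0.807 ≈ 1300 cycles per iteration, ~250 outside the fault
    

which is in the same ballpark as the threefold speedup claimed above.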

(I of course welcome any arguments that demonstrate that I overlooked
something when claiming all this.)

~~~
ultimape
Yes, in Linus's case it is a faster overall runtime, but such a stark
difference in cycle counts is a bit of an eyebrow-raiser.

Caching is supposed to speed things up, so it is kind of silly that the
caching system has somehow gone backwards in overall performance.

Page faults are on the critical path in almost any software system that needs
high performance. If they are 10% slower, then it's like the worst-case
performance of the computer is 10% slower.

Anyone who needs to write high-performance code (say, simulations, or game
design:
[http://gameprogrammingpatterns.com/data-locality.html](http://gameprogrammingpatterns.com/data-locality.html))
is concerned with avoiding page faults and cache misses, but average code (the
general case?) doesn't concern itself with this as much, so the average
program may end up experiencing this more than a Linux compile.

~~~
acqq
Try to count the total number of page faults per second in any task you're
doing. I don't know of any "average code" that produces an order of 3 millions
PF per second, which is needed for you to observe 10% slowdown. The code which
does that doesn't do anything than doing the page faults. Linus claims he
managed to get 5% slowdown when doing nothing but PF.

It's probably something like hundreds of thousands per second at most, giving
you something like a less than 0.5% slowdown. Versus the speedup of 300% for
some other code (it's an extreme value actually, but still). Your newer Intel
CPU certainly didn't get slower on average code compared to the Core Duo, for
the same clock speed.

Nothing to raise the eyebrows here. I believe it's not "in Linus case" but "in
every case" that the overall time is shorter.

------
halayli
"Computer Organization and Design, Fourth Edition: The Hardware/Software
Interface" is an excellent read that complements this article.

~~~
prlin
A quick Google search brings up that it's available here:
[https://www.u-cursos.cl/usuario/9553d43f5ccbf1cca06cc02562b4...](https://www.u-cursos.cl/usuario/9553d43f5ccbf1cca06cc02562b4005e/mi_blog/r/MK.Computer.Organization.and.Design.4th.Edition.Oct.2011.pdf)

~~~
rbanffy
I'd really prefer not to see things like this on HN. Even an old edition is
still under copyright.

~~~
biehl
I wonder how many HN readers would benefit from reading that book (a lot, I
guess), and also how many have no immediate access or consider the asking
price unreasonable (probably fewer, but still many).

~~~
zokier
I would benefit from many things for which I consider the asked price to be
unreasonable. I still do not expect to get them for free.

~~~
biehl
And yet, how many of those things have a fixed monopoly prices, disconnected
from marginal costs, that has been set to fit a market with consumers with a
completely different median income than yours? Are you in a third world
country eg.?

------
protopete
To count the number of page faults during execution, use:

    
    
      perf stat <executable>
    

Page faults occur after anonymous pages are mmap'ed for the heap: a first read
maps a common zero page, and the page faults again on the first write.
Prefaulting the pages by passing the MAP_POPULATE flag to mmap can help reduce
the number of page faults.

Shared libraries are also mmap'ed and faulted in, and doing it this way saves
memory for parts that aren't used. But if the penalty of faulting the pages in
on use outweighs the memory savings, it might be better to use MAP_POPULATE
here too. It might be worth trying to add an LD_LIBRARY_XXX option to tell the
loader to use MAP_POPULATE. Statically linking the executable will also reduce
the number of faults (sections are combined, etc.)
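
A minimal sketch of the anonymous-mapping case; run it under perf stat with
and without MAP_POPULATE and compare the page-faults count:

    
    
        /* Pre-fault 64MB of anonymous memory instead of faulting lazily. */
        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>
    
        int main(void)
        {
            size_t len = 64 * 1024 * 1024;
            /* Without MAP_POPULATE, each 4kB page would fault on first
               touch; with it, the kernel fills the page tables up front. */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
            if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
            }
            munmap(p, len);
            return 0;
        }
    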

------
Torgo
I consume Mr. Torvalds's G+ posts like I suppose a sports fan consumes sports
news. I can't play at that level so I live vicariously through others.

------
dalek2point3
I'm a complete noob when it comes to kernel development -- but this seems
super interesting. Anyone care to explain what's going on?

~~~
Rusky
Each process has a set of page tables, which describe how physical memory is
mapped to their view of memory. The entries in those tables can be marked "not
present", so that when they're accessed by the process the kernel is signaled
via what's called a page fault.

The interesting part is that this is used for both invalid memory areas (which
cause segfaults) and for virtual memory. The kernel can take memory your
process hasn't used in a while, write its contents to disk, and then mark that
area "not present." Then it can give that memory to someone else, and load
your data back in when you try to access it.

This trick is also used to load in binaries and other files. Instead of
reading a program in all at once, the kernel just updates some internal
bookkeeping to say "these pages should be from this file" and then lets the
page fault handler load them in on demand.

The problem here is that the actual kernel page fault handler is plenty fast,
but the hardware mechanisms that enter it (the fault) and leave it (the iret)
are slow, because of how the CPU is built.
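
You can watch this happen from userspace. A minimal sketch (assuming 4kB
pages): touch a fresh anonymous mapping one page at a time and read the
kernel's minor-fault counter, which should climb by one per page:

    
    
        /* Count the soft page faults caused by first-touching each page. */
        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/resource.h>
    
        static long minor_faults(void)
        {
            struct rusage ru;
            getrusage(RUSAGE_SELF, &ru);
            return ru.ru_minflt;
        }
    
        int main(void)
        {
            size_t len = 16 * 1024 * 1024;
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                return 1;
            long before = minor_faults();
            for (size_t i = 0; i < len; i += 4096)
                p[i] = 1;   /* first touch traps into the fault handler */
            printf("%zu pages touched, %ld minor faults\n",
                   len / 4096, minor_faults() - before);
            return 0;
        }
    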

~~~
dalek2point3
thanks! this is a great explanation.

------
userbinator
I'm not intimately familiar with the Linux kernel, but doesn't a page fault
usually involve reading something from disk? In that case even the fastest SSD
is going to take a few orders of magnitude more time than those 1K cycles to
get the desired data back into memory.

~~~
sliverstorm
A page fault is when the TLB misses and the CPU doesn't know _where_ in
physical memory to find the requested virtual address.

Edit: sorry, I'm totally wrong. Now I am wondering what the case I described
is called. It is the event when the page table walker is invoked.

~~~
userbinator
What you described, a TLB miss, is just called a TLB miss. The CPU will
automatically find and read the appropriate PTE and load it into the TLB, just
like it does on a cache miss.

~~~
sliverstorm
I learned slightly differently, that the CPU does not always find the mapping
by itself, but this explains it:

 _Some systems, mainly older RISC designs, trap into the OS when a page
translation is not found in the TLB._

-- Wikipedia

~~~
userbinator
Right, I was assuming x86 (since that's what Linus was describing). Indeed, a
lot of older RISCs had MMUs that needed a lot of "hand-holding" in software. I
think it's fortunate that x86 didn't go this route, especially given the
increasing cost of context switches: a software-managed miss basically
requires flushing the pipeline and switching to a completely different
instruction stream, while an automatic TLB, like a cache, doesn't interfere
when it misses -- an OoO/superscalar design can continue to execute around the
miss if there are other instructions that don't depend on it.

A software-managed TLB involves switching contexts and executing instructions
in a TLB miss handler (the fetching of which could cause cache misses too),
then switching back to the instruction that was interrupted. Compare that to
just internally dispatching a memory read or two more, and you'll probably see
why soft TLBs seem to have fallen out of favour; even if context switches
could be done with no overhead, that extra cost of fetching, decoding, and
executing instructions can't be recovered. (As that old saying goes, "The
fastest way to do something is to not do it at all.")

Looking a bit more into it, MIPS is the most widely-used CPU that still has a
"soft TLB". The other popular RISC, ARM, is automatic like x86.

------
nraynaud
I coined the term "applications that you use at the coffee machine" a while
ago for long-running stuff (simulations, compilations, NP-complete stuff like
routing or model checking). It's basically paying engineers to do nothing.

~~~
rbanffy
I always assumed we were paid to think, not to type.

If you measure workforce engagement by the amount of time their knees spend
under the desk, you are measuring input, not output.

~~~
klodolph
Well, yes, that's a bad way to measure performance. But when I compile
something, it's because I want feedback from the compiler and static analyzer
about the program I just wrote. I have a half-dozen parts of the program in
short-term memory which I want to consider when seeing the output or the
program behavior. If it takes five minutes to compile code, then my short-term
memory will be filled with something else by then, like the weather, chores
I'm putting off at home, or whatever.

So if you measure how long it takes for engineers/developers to get feedback,
you're really measuring how good their tools are, which is a half-decent proxy
for engineer performance.

------
fooyc
Turns out the kernel is very well optimised for kernel compiling workloads.

~~~
thrownaway2424
Linux is highly tuned for exactly two things: building itself, and
distributing itself.

------
michaelf
Can anyone hazard a guess as to how Linus was able to measure the cost of a
page fault and an iret so precisely? What tools and techniques might he have
used?

~~~
apw
It's quite likely that he was using `perf`:

    
    
        $ perf stat make
        ...
        116,222 page-faults               #    0.046 M/sec
        ...
    

[https://perf.wiki.kernel.org/index.php/Tutorial](https://perf.wiki.kernel.org/index.php/Tutorial)

------
julie1
So out of 82 comments, only one notices that the title is awfully wrong, and
the 81 others spread their culture like jam on bread; the less you have, the
more you spread.

Just for the sake of showing YC is not about posing: does anybody understand
that each of your 4 GHz cores is actually spending 80% of its time at the
hardware level taking page faults, because the architecture is built this way?
It means that only 800 mega-cycles per second are actually executing (or
idling). Today's computers perform about as well as an 800 MHz computer that
never page faulted, while drawing more than 300 W.

Don't these figures seem enormous?

EDIT: it should at least raise some incredulity, and if confirmed, some
serious questions about how we measure computer performance vs power
efficiency.

------
abus
What in the world could have convinced Linus to use Google+?

~~~
lern_too_spel
He needs asymmetric follows for his rants to get reach, and his rants don't
fit in 140 characters. That leaves blogs and Google+, which is a tradeoff
between better formatting controls and monetization (blogs) and realtime
engagement (Google+). His choice makes perfect sense to me.

~~~
shawnz
Why not Facebook?

~~~
lern_too_spel
Facebook requires log in to see public posts. It is not a suitable platform
for public ranting or any other public communication.

