Linus wrote: 'Even a fully built kernel ("allmodconfig", so a pretty full build) takes about half a minute on my normal desktop to say "I'm done, that pull changed nothing I could compile".'
Since it's just after a git operation all the file state should be warm in the disk cache. It really shouldn't take that long. The Linux kernel (at least by looking at [1]) is about the same size in terms of file count as Chromium, and we got this operation down to about a second by using better tools (i.e. non-recursive make and then eventually a replacement).
I appreciate that the kernel has its own requirements (it sounds like his no-op builds are still running shell scripts, something you ought to avoid in your critical path) and also it's great that he's running it this way in part to help profile a "normal" workload... but I'm also a bit sad to see so much time spent waiting for something slower than necessary, as well as time spent optimizing what feels like the wrong thing.
From further down in the comment thread, Linus says that speeding this up should help other workloads too, and he's sick of using make replacements:
---
+Peter oh, it's absolutely true that 'make' is a pig, and does too much, and we don't exactly help the situation by using tons of GNU make features and complex variables and various random shell escapes etc etc.
So there's no question that some "makefile compiler" could optimize this all. But quite frankly, I've had my fill of random make replacements. imake, cmake, qmake, they all solve some problem, and they all have their own quirks and idiocies.
So while I'd love for 'make' to be super-efficient, at the same time I'd much rather optimize the kernel to do what make needs really well, and have CPU's that don't take too long either.
Because let's face it, even if the kernel build process was some super-efficient thing, real life isn't that anyway. I guarantee you that the "tons of small scripts etc" that the kernel build does is a real load somewhere totally unrelated. Optimizing page faults will help other loads.
> Because let's face it, even if the kernel build process was some super-efficient thing, real life isn't that anyway. I guarantee you that the "tons of small scripts etc" that the kernel build does is a real load somewhere totally unrelated. Optimizing page faults will help other loads.
Suppose the kernel build were as efficient as Chrome's is claimed to be on this page (<5s) and wasn't stressing his system. Would Linus then approve of anyone submitting patches to deliberately slow down the Linux kernel build just to expose slowness in page fault handling and encourage kernel devs to spend time optimizing that?
No, of course not! That would be idiocy and the person submitting the patches would probably be banned by Linus in his titanic rage. However, since the slow build & page faults are the status quo, Linus is making lemonade out of it...
I think you missed the point. He's not denying that make is slow and could be faster, or that most of that slowness has little to do with page faults. All he's saying is that page faults in general have gotten slower, and speeding them up would improve lots of different kinds of loads besides just the kernel build system (which would probably not improve very much from that alone).
I'm sure he'd be happy to take patches that make 'make' faster, but that's simply not what he's trying to address here.
So, that is actually an interesting thought experiment, thanks for that.
However I'm not sure it is directly applicable here. There are two courses of action that could solve this problem:
A, an action that improves kernel builds
B, an action that improves several workloads
For A and B of similar cost, it makes sense to do action B in preference to action A.
Your argument speaks to A being of positive utility, but given a finite knapsack of effort that is too small to hold every positive-utility action, a greedy algorithm that places any positive item into the knapsack is not optimal.
I don't follow your reasoning here, but let me expand my observation further: Linus can A. improve his build system (low-hanging fruit which the Chrome numbers suggest could yield an order of magnitude better performance), or B. he can search among a variety of difficult, unlikely-to-yield-major-improvement options (like yelling at Intel engineers 'make it go faster!') which would improve his build system and also other hypothetical loads (where the gains are unlikely to be large, if there are any at all; what, is Intel too ignorant to try to make the TLB fast?). He is claiming B is better in part because the hypothetical loads make it better in total.
Most people would consider A a more reasonable reaction, especially after hearing that Linus's best idea for doing B is apparently going all the way down to the hardware level in search of some improvement. We can see this by intuitively asking what people's reactions would be to a proposal to induce B if the equilibrium were already at A.
On the other hand, Linus is one of a handful of people in the world who may be in a position to get results by yelling at Intel engineers to 'make it go faster!'. This isn't because Intel is too ignorant to do things on their own, but because practically everything can be optimized further, and Linus may have enough sway to focus the engineers' attention on the problem that he wants solved.
Personally, I don't care much about the speed of the Linux kernel build system, but I do care about the speed with which page faults are handled by the CPU. Even if the chances of success are lower, if he is able to succeed in speeding up every page fault on future Intel processors, I would consider that a much greater good.
The real problem (as I see it) is that I think he's trying to optimize the wrong thing. His worst-case test is based on repeatedly faulting in an uncacheable page: every TLB lookup fails at every level of the cache. Likely, Intel has chosen to optimize the real situation where page translations are cached when they are repeatedly accessed.
Intuition is probably not a good guide here. An improvement to the kernel build process is relevant to the thousands of machines that are used for kernel development; an improvement to the page fault speed is relevant to the over 1 billion devices that run the kernel. That's a pretty big multiplier.
(I believe the GP's reasoning is that "improve the build system" and "retard the build system" are not symmetrical, because both directions require positive effort to be expended).
On the other hand, if build system improvements truly are low-hanging fruit, then there are a lot of people out there who are capable of helping. A much smaller number of people are capable of productively working on page fault optimizations, so having someone like Linus address low hanging build system fruit instead would be a waste of talent.
We have a different prior expectation of the utility and feasibility of action B. In my view, Linus is uniquely suited to do something useful here, and his chance of success is high. In your view, Intel is already doing the best it can, and his likelihood of success is low. Our positions follow.
A isn't really very low-hanging, though. Rewriting the build system for something as complex as the kernel would be quite the undertaking even just from a technical standpoint, let alone that you'd now have to change thousands of people's workflows by having them install and use a new build system (some of those people are probably dedicated to just maintaining the current build system, so their jobs would change radically), as well as figure out a suitable build system in the first place (which one would work well for the kernel? which one works on all the architectures people want to compile Linux on? etc.)
B, from Linus's perspective, just means "wait a year for things to automatically get better (after throwing some money at it)", which seems like the low-effort solution.
Not sure why, but your URL has a bunch of junk on the end of it.
I don't get what you're arguing against. Even if the kernel were superfast, there's probably another "real load" out there somewhere that legitimately runs into lots of page faults.
It's a question of what should be addressed first.
Linus is actively arguing against a faster kernel build. It sounds like it makes sense, because it gives him leverage with CPU vendors.
But it doesn't, really. If kernel build were already fast, he would never in a million years slow it down just to get that leverage.
Now maybe he's perfectly aware of this status quo bias, and he's taking advantage of it to meliorate something he would otherwise have no power over. Sneaky.
Still, what's the priority? He's made his point now, hasn't he? He could work on making a faster build process, now.
Actually, his main point is "here is a workload that is kernel-bound that ordinary users deal with, and here is where the time goes." He has explained in the past that he works on whatever interests him, period. So profiling page faults must be more interesting to him than speeding up the build per se. That's all.
"When a proposal to change a certain parameter is thought to have bad overall consequences, consider a change to the same parameter in the opposite direction. If this is also thought to have bad overall consequences..."
How is speeding up kernel build times a "bad overall consequence"? That's exactly what Linus is trying to do.
Except Linus is not addressing that problem directly. If he was, he would probably be writing a make he'd be satisfied with. Instead, he's working on page faults.
Speeding up page faults does _directly_ improve build times. I don't understand why you think writing yet another make is the obvious direct fix, but fixing page faults is not. Or why you think Linus shouldn't fix things he is clearly interested in.
Oh no. Not on your life. It would be better for the entire world if Linus had a heart attack rather than rewriting Make. Make is already unusable; git-style make would be an abomination.
Can you even imagine reading the man pages of all the different git-make commands? The syntax?
I hope Linus keeps contributing to humanity and doing what he's doing. He's obviously a great engineer, and if he works hard, he might undo the wrong of releasing git into the world.
I can't tell if you're joking or not. The popularity of GitHub and its position as the de facto modern source control management tool demonstrate that very few people hold your point of view. Sure, Git has a lot of options, but what do you need to actually be productive with Git? Probably only a handful of commands: git clone, git pull, git push, git commit, git add, git checkout, git branch, git log. And not too much more than that. At my university Git is used in every single lab-based CS course and I don't know of a single person who has had trouble learning to use it.
One of the main criticisms of git has been its user interface (which to be fair, has improved considerably over time). People often compare it to Mercurial's, unfavourably.
No, it demonstrates that things that the Linux kernel adopts tend to get adopted. That isn't very surprising. Git being faster than the dVCS's of the time probably helped. Popularity doesn't equate to quality. QWERTY is still the default keyboard layout.
I am willing to bet money (let's say 0.1 bitcoins) that there are plenty of people at your university who have trouble with Git. We can haggle on terms, but I'd bet that you'd see 10 people at least struggling with git in an intro course.
Once you learn to give it the right incantations, and learn never to deviate from the path you know, any tool can become usable. But it will never be a part of you and will never make you stronger.
This has crippled a generation of developers and I'm afraid it will be a massive barrier to entry, stunting humanity's growth probably by 10-20 years.
QWERTY exists because alternatives were not objectively better. Every time anyone proposed an alternative, it had to be not just a little better; it had to be so much better that undoing a lifetime of QWERTY muscle memory was justified.
It's the same reason so many new display technologies have failed: it's not good enough to be better than an LCD eventually. You need to be better now, and better than the LCD which will come about through evolutionary improvements in manufacturing too. Being a little cheaper in the initial plant cost is meaningless if the plant has been built and we understand its processes.
Which of course is why the commentary on Git is absurd: people struggle with Git? People struggled with CVS. People struggle with the notion of "files" and programming in general. The alternative has to both exist, and be easier to use from the get-go. Not just a different set of traps.
At some level, command-line conventions are arbitrary. "ls" could just as easily be "dir" or "list." "revert" could just as easily be "checkout" in some contexts.
UNIX has never been a zero-learning-curve OS. That's OK. There are other operating systems that fill that role. git is a UNIX tool which is intuitive once you learn it.
The world doesn't owe you anything, and if you don't want to learn git, or any aspect of programming, nobody is forcing you to. There are lots of other things to do out there. Some of them even pay more.
The parent comment doesn't deserve to be downvoted, but it seems to have been massively downvoted for the same reason a massive number of people use Git who shouldn't - the bandwagon effect.
Git may well be a perfect tool for the Linux kernel but it's still an opaque nightmare for the many average non-kernel-developers who use it as "the new good version control system" (which it isn't, it's only a tool for a specific purpose made by a famous person).
The poster's point is valid - if Linus created a make replacement for his purposes, the effects of everyone else inappropriately adopting it would be horrific (even if it was indeed, a great make for the kernel and just for the kernel).
It's simply that doing that many system calls takes that amount of time. Even though everything is hot in the cache, make still needs to call stat on every single file in the kernel source tree, to detect those files that have changed.
The suggestion in the comments to allow a Windows-style batched stat is a good idea. This means that more work is done per system call, reducing the number of calls and therefore the amount of time spent saving the CPU state to switch to kernel mode, and then restoring it again to switch back.
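One way to see that cost directly (a rough illustration of my own, not something from the post) is to let strace tally the syscalls a no-op build makes across make and all its children; the stat family typically dominates the call counts:

    strace -c -f make 2>&1 | tail -n 30

(strace itself adds a lot of overhead, so trust the call counts rather than the timings.)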
Linus is talking about page fault overhead, not system call overhead. I understand these are quite different things. On x86 a page fault raises an interrupt-style exception, while for a system call there is an instruction designed for fast control transfer precisely to avoid a costly interrupt.
No they are not. Cutting context switches in system calls is one of the easiest ways of boosting throughput in your average unoptimized Linux app, in my experience. Exactly because so many developers go around ignoring the cost of system calls.
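As a rough sketch of that cost (my own illustration, not from the thread), timing a syscall that does almost no kernel-side work isolates the fixed user/kernel transition overhead:

    /* syscall_cost.c -- time a trivial syscall in a tight loop */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        const long N = 10 * 1000 * 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            (void)getppid();              /* a real syscall with trivial kernel-side work */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per getppid()\n", ns / N);
        return 0;
    }

Whatever the exact number on your machine, it's almost entirely entry/exit overhead, which is why batching work into fewer syscalls pays off.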
Is anyone else horribly sad that this kind of content and discussion is ending up on Google+? I often find myself combing the mailing lists of key projects like the Linux kernel to figure out why things happened the way they did, and Google+ comments are not searchable in the same way; and even if they were, I am having a difficult time believing that the content will not just end up gone entirely after another ten or twenty years. I know some people point out that using "modern" communication channels decreased friction, latency, etc. but for some use cases the value of records and centralization is high enough to warrant moving slower or more painfully.
It is not only that. I have to look for the other messages that Torvalds has posted amid the mists of "I have no idea what you said, but cool." spam in the comment section.
The exact same happens with John Carmack's tweets. I don't know what's going through peoples' minds that they think these high-end folk are so in need of encouragement that half-hearted messages are worth sending.
I imagine the end result is a bit like a performer having their own catchphrases sent to them, in that they're scroll-scroll-scrolled past without exception.
Is anyone else horribly sad that this kind of content and discussion is ending up on the internet instead of written down in books?
But seriously mailing lists are pretty archaic, not saying that g+ is great but it's possible to have solid data redundancy and centralization without living in the past.
Why do you equate discussions on mailing lists with living in the past? Sure, Google+ is newer, but that doesn't make it better. On the contrary, mailing lists are better in at least two ways: they're decentralized, and they don't require proprietary software.
That said, the level of "discussion" in that thread is kind of sad. No guarantee it would be better elsewhere, but I at least have tools that help me wade through mailing list traffic.
Basically, unthreaded messages for something like this is just obnoxious. And is pretty much the reason you move discussions to tables when you are at a public place and don't just sit having everyone try and shout at each other.
The title is a bit misleading. If I understood Linus correctly he is saying that a page fault on modern processors is costly. There is nothing the OS can do to make a page fault itself (or iret) faster.
> It's interesting, because the kernel software overhead for looking up the page and putting it into the page tables is actually much lower. In my worst-case situation (admittedly a pretty made up case where we just end up mapping the fixed zero-page), those 1050 cycles is actually 80.7% of all the CPU time.
Page faults that don't happen are free. If the OS can find ways of reducing the number of them, that's a way of speeding up the average "time there would have been a page fault"... you're right that the title is misleading, though.
Agreed, given one is 32-bit memory addressing and the other 64-bit. You would expect the 64-bit case to be up to twice as long, with optimisations landing somewhere in the middle, but not equal. After all, a 64-bit memory address is twice as much as a 32-bit one to handle behind the scenes.
Also, he only used one compiler; it would be worth testing with another compiler to confirm the results, just to rule out a possible compiler quirk quickly, compared to checking the machine code.
Interestingly, a soft page fault can be caused not only by a real page fault event, but also by NUMA autobalancing. Which actually could be a major source of kernel interference to purely user space processes. I sometimes see these NUMA-originated soft page faults very high up in a (CPU/cache) profiler for sensitive real-time processes that should avoid contexts switches and L1/L2 cache pollution at all costs.
As a side note NUMA autobalancing can be disabled by running:
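    echo 0 > /proc/sys/kernel/numa_balancing

(i.e. the kernel.numa_balancing sysctl, which is presumably the knob the comment meant; there is also a numa_balancing=disable boot parameter)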
> If "git status" is empty, and takes milliseconds to know that -- "make" should take roughly the same amount of time to know the same.
If absolutely nothing changed, you wouldn't need to build in the first place.
He's talking about the case of "nothing changed that needs to be built". Configuration files, sources that aren't built on that platform, etc. Which means the build system still needs to traverse directories and figure dependencies out, etc.
This is a rebuild, though. You can split the preprocessing phase into its own build step and then you'd just preprocess everything. You could even try to cache which macros are relevant to which headers/files.
Actually, they sort of are. If a conditional that isn't taken under the current configuration includes another header, and that header was where the change was, then there will be no dependency between the current file and that header on this platform/configuration.
That case is the easy case (though caching of included files is problematic in most build systems due to inability to express dependency on the inexistence of the included name in a previous include directory).
The hard case is changing defined macros in a way that doesn't matter but does pass extra -D flags to the compilation units. You can detect this by ad hoc preprocessor aware logic, or you can have a separate build step for preprocessing and do content aware rebuilds that avoid rebuilding if the preprocessed text is identical.
Right, I was just saying that some of the build environments are "smarter" than you might give credit. I think you'd be surprised how many people don't know about gcc -M and friends.
A) doesn't make sense. If the header is autogenerated, then it has a dependency specified elsewhere. So, I'm not sure how that is a failure of -M. (That is, it will tell you that this file depends on a header. It is up to other rules to say that header is autogenerated. Right?)
B) This makes sense, though I don't necessarily see how that could happen. It won't help you not rerun the -M hit, but the flow is: "rerun -M" on files to get list of dependencies, restart with those dependency list loaded and see what needs to be rebuilt. Right?
A) Usually, there's a rule like: auto_%.h from %.xml. Then you want #include "auto_foo.h" to be found and to cause auto_foo.h to be auto-generated from foo.xml, without explicitly mentioning "foo.xml" anywhere in the build system itself.
B) If you rerun -M every time you try to build, your empty builds are going to be quite expensive. It makes sense to cache that, and only rescan files when they or their #included files changed. But then you need to be able to do the file-inexistence dependency thing or it's wrong.
Ah, I'm really just getting used to the autotools conventions, where I don't think any wildcard rules are used. (And, even then, I'm still just a learner.)
I see what you are saying with B, but I don't have quite enough experience to know exactly how expensive that is. I also don't know enough about the kernel build to know what it is doing.
Also, I'm curious why the -M flag can't output the inexistence stuff. I guess it would be purely heuristic?
I'm not talking about autotools :) I've always avoided autoconf/et-al. I'm talking about make-based builds not being able to use gcc -M properly, because it doesn't work when header files need to be generated due to their #include directive.
-M could in theory output the inexistence stuff, but then most build systems couldn't even express that dependency.
Oh, I think I see what you mean. -M will catch when a C file needs to be rebuilt due to any header in the #include path, but not necessarily the header files.
For that, I would assume you would still have to do that by hand for the header. There may be some autotools thing that covers it. Though.... even then, I'm not sure what the point is. If the header file itself is generated, then you already have the dependency on what it generates from.
The only scenario I think I see as not covered is when there is an include in an #if that flips from not taken to taken. Though, that does seem fairly edge case.
I think I really just need to see an example of the inexistence stuff. In particular, one that is expected to change between non-clean builds.
gcc -M on foo.c will not tell you about the dependence on "bar_auto.h", but will rather fail. With buildsome or a better #include scanner, you will know that foo.o depends on bar_auto.h, even though bar_auto.h does not yet exist.
Problem B:
x.c: #include "bla.h"
a/bla.h does not exist
b/bla.h does exist
gcc -Ia -Ib -o x.o -c x.c
gcc -M tells us that x.o depends on b/bla.h. We cache this information to avoid rerunning gcc -M every time. Then someone adds a/bla.h. A rebuild will change x.o, but "make" or whatever build system cached the result of "gcc -M" will not rebuild anything.
You might say "gcc -M" should be rerun each time, but this is extremely wasteful, as there's no reason to rescan all the .c/.h files every time, when they did not change.
This sounds like it's just the cost to traverse the page table, right? ~300 cycles per raw memory lookup, and 3 of them because you'll typically need to go three levels deep?
The TLB is tiny these days, and 4kb pages are tiny.
I'm super hopeful that Linus is going to force through some big improvements to HugePages, because Linux HugePages support is super painful at the moment. 2MB pages alone could be a massive gain.
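For reference, here is roughly what the two existing mechanisms look like (a sketch of my own, not from the comment above): explicit huge pages need a pre-reserved pool, while transparent huge pages are only a hint.

    /* hugepage_sketch.c -- assumes 2 MiB huge pages (the x86-64 default) */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 2u << 20;   /* one 2 MiB page */

        /* Explicit huge page: fails unless an admin reserved pages via
         * /proc/sys/vm/nr_hugepages first -- the "painful" part. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        printf("MAP_HUGETLB: %s\n",
               p == MAP_FAILED ? "failed (no reserved pool?)" : "ok");

        /* Transparent huge page: only a hint; the kernel decides. */
        void *q = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (q != MAP_FAILED)
            madvise(q, len, MADV_HUGEPAGE);
        return 0;
    }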
512K pagesize? Wouldn't that add a ton of IO in a lot of scenarios? Like every 1 byte file would now require a 512K event? Large pages (2MB/1GB) are for specialised use where you know you're not going to be paging things in/out too often, right?
IIRC Linus was quite dismissive about having larger-than-4k page sizes as the default.
Another common one is actually in the kernel where filesystem block sizes are limited to page sizes, so from this point of view large page sizes are better:
The thing is, all the relevant page table data should already be in L1 caches when the page fault handler returns. The TLB miss on iret should not require raw memory lookups and should be much faster than 300 cycles.
Probably it's more that iret simply has always been slow (hence why the various syscall/sysenter extensions were created).
One reason the Mill architecture looks so interesting is that all of the state that needs to be saved is done asynchronously, in parallel with program execution. In fact, the same mechanism is used for function calls, system calls, and interrupts, which all just look like atomic ops. That should make interrupts and irets much faster.
Another interesting aspect of the Mill architecture is that protection and translation are separate. The cache uses virtual addresses, and the TLB sits between cache and main RAM. The TLB is much bigger and slightly slower, so simply doesn't fault nearly so often.
Can you guys say anything yet about how the Unix fork syscall is implemented on that architecture? The requirement for copy-on-write pages seems to conflict with a single virtual address space.
Afraid it's still Not Filed Yet (NFY). You knew I was going to say that. But the slide deck is all ready for when we've filed.
The next talk will be about configuration, which is another cool topic, and that'll be in a couple of weeks. Get on the mailing list to get details as they become available: http://millcomputing.com/mailing-list/
IIRC SPARC register windows do something like this - a call and a return do not need to save or restore anything but a pointer to what is the first register available to your function.
I think it's at least partially a problem of not being better enough than "good enough". Not to mention Itanium (and maybe Sparc, I don't know enough about it to say) had plenty of its own problems, including with interrupt handling.
This is one of the reasons why continuous performance testing is important. Even when your own code doesn't regress, the platform beneath you may do so. In this case it sounds like a case of the platform making most stuff faster but not page faults, so it looks proportionally worse.
So Linus noticed that the page fault on newer Intel CPUs became something like 10% slower. Before and after, it took roughly 1000 ticks to perform the fault. Which just means that, at the same clock, say 3 GHz, instead of 3 million page faults per second on a Core Duo you can now do "just" 2.7 million page faults per second. Doesn't sound like something to be much worried about. Code which spends most of its time in page faults is just not something I can imagine anybody even attempting to make faster. Linus admits that too: changing the build process would certainly eliminate a significant number of page faults in his case.

So, my conclusion this time is... Meh. The pig didn't fly before and now it won't, 10% more so. Intel guys, if you sacrificed this in order to make the instructions that matter faster, I welcome it: Linus also observes that the rest of his code practically got faster by... more than 3 times!? Before, the page fault code was around 900 ticks and the rest of the code was around 50 percent of the total, that is, also around 900 ticks. Now it's 1000 ticks for the PF and the rest is around 20% of the total (see: Linus claims 80% spent in PF), that is, around 250 ticks for his code. I say, Intel, you did a perfect job speeding up the code that matters to take just 250 instead of 900 ticks between the Core Duo and the new CPUs. This sounds amazing.
(I of course welcome any arguments that demonstrate that I overlooked something when claiming all this.)
Yes, in Linus's case, it is a faster overall runtime, but such a stark difference in cycle times is a bit of an eyebrow raiser.
Caching is supposed to speed things up, so it is kinda silly that the caching system has somehow gone backwards in overall performance.
Pagefaults are a critical path for performance concerns in almost any software system that needs high performance. If they are 10% slower, then it's like the worst-case performance of the computer is 10% slower.
Anyone who needs to write high performance code (say, simulations, or game design: http://gameprogrammingpatterns.com/data-locality.html ) is concerned with avoiding pagefaults and cache misses, but the average code (general case?) doesn't concern itself with this as much, so the average program may end up experiencing this more than a Linux compile.
Try to count the total number of page faults per second in any task you're doing. I don't know of any "average code" that produces on the order of 3 million page faults per second, which is what you'd need to observe a 10% slowdown. Code that does that isn't doing anything other than page faults. Linus claims he managed to get a 5% slowdown when doing nothing but PF.
It's probably something like hundreds of thousands per second at most, giving you something like a less-than-0.5% slowdown. Versus the speedup of 300% for some other code (it's an extreme value actually, but still). Your newer Intel CPU certainly didn't get slower on average code compared to the Core Duo, at the same clock speed.
Nothing to raise the eyebrows here. I believe it's not "in Linus's case" but "in every case" that the overall time is shorter.
I wonder how many HN readers would benefit from reading that book (a lot, I guess) and also how many have no immediate access to it or consider the asking price unreasonable (probably fewer, but still many).
And yet, how many of those things have fixed monopoly prices, disconnected from marginal cost, that have been set to fit a market whose consumers have a completely different median income than yours? Are you in a third-world country, for example?
What happened to the Hacker Ethic? "Access to computers--and anything which might teach you something about the way the world works--should be unlimited and total." and "All information should be free."
Nothing "happened". Using "hacker ethic" as an excuse for piratism has always been stupid. That is why Stallman began writing GNU instead distributing illegitimate copies of UNIX.
To count the number of page faults during execution, use:
perf stat <executable>
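perf can also split that into soft and hard faults via its standard software events:

    perf stat -e page-faults,minor-faults,major-faults <executable>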
Page faults occur after anonymous pages are mmap'ed for the heap: the first access maps a common zero page, and the page is faulted again on a memory write. Prefaulting the page with the MAP_POPULATE flag to mmap can help reduce the number of page faults.
Shared libraries are also mmap'ed and faulted in, and doing it this way saves memory for things that aren't used. But if the penalty for faulting the pages in when they're used outweighs the memory savings, it might be better to use MAP_POPULATE here too. It might be worth trying to add an LD_LIBRARY_XXX option to tell the loader to use MAP_POPULATE. Statically linking the executable will also reduce the number of faults (sections are combined, etc.)
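A tiny way to see the MAP_POPULATE effect (a sketch of my own, not from the comment above) is to compare the minor-fault counts getrusage reports for a demand-faulted mapping versus a prefaulted one:

    /* populate_demo.c */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    static long minor_faults(void) {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_minflt;
    }

    static long touch_and_count(char *buf, size_t len) {
        long before = minor_faults();
        memset(buf, 1, len);                 /* first write to every page */
        return minor_faults() - before;
    }

    int main(void) {
        size_t len = 64u << 20;              /* 64 MiB of anonymous memory */
        char *lazy  = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (lazy == MAP_FAILED || eager == MAP_FAILED)
            return 1;

        /* Demand paging: every first touch takes a soft page fault. */
        printf("demand-faulted: %ld minor faults\n", touch_and_count(lazy, len));
        /* MAP_POPULATE already faulted these pages in at mmap() time. */
        printf("prefaulted:     %ld minor faults\n", touch_and_count(eager, len));
        return 0;
    }

The first count should be roughly one fault per 4 KB page touched (fewer if transparent huge pages kick in); the second should be near zero, because MAP_POPULATE moved the faults to mmap() time.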
Each process has a set of page tables, which describe how physical memory is mapped to their view of memory. The entries in those tables can be marked "not present", so that when they're accessed by the process the kernel is signaled via what's called a page fault.
The interesting part is that this is used for both invalid memory areas (which cause segfaults) and for virtual memory. The kernel can take memory your process hasn't used in a while, write its contents to disk, and then mark that area "not present." Then it can give that memory to someone else, and load your data back in when you try to access it.
This trick is also used to load in binaries and other files. Instead of reading a program in all at once, the kernel just updates some internal bookkeeping to say "these pages should be from this file" and then lets the page fault handler load them in on demand.
The problem here is that the actual kernel page fault handler is plenty fast, but the hardware mechanisms that set it off and finish it are slow, because of how the CPU is built.
I'm not intimately familiar with the Linux kernel, but doesn't a page fault usually involve reading something from disk? In that case even the fastest SSD is going to take a few orders of magnitude more time than those 1K cycles to get the desired data back into memory.
Requiring a read from disk is a "hard page fault". A "soft page fault" is when the kernel can satisfy it just by fiddling some flags (eg. marking a pagecache page as dirty) or creating a copy of an already-in-memory page.
I'd say your original statement is correct, and that the term "page fault" is overloaded. It can be used for both the TLB miss handler and also for loading swapped out data from disk. It's up to context to make it clear.
What you described, a TLB miss, is just called a TLB miss. The CPU will automatically find and read the appropriate PTE and load it into the TLB, just like it does on a cache miss.
Right, I was assuming x86 (since that's what Linus was describing). Indeed it is true a lot of older RISCs had MMUs that need a lot of "hand-holding" in software. I think it's fortunate that x86 didn't go this route, as evidenced by the increasing cost of context switches, since that basically requires flushing the pipeline and switching to a completely different instruction stream, while an automatic TLB, like a cache, doesn't interfere when it misses -- an OoO/superscalar design can continue to execute around it, if there are other instructions that don't depend on the miss.
A software-managed TLB involves switching contexts and executing instructions in a TLB miss handler (the fetching of which could cause cache misses too), then switching back to the instruction that was interrupted. Compare that to just internally dispatching a memory read or two more, and you'll probably see why soft TLBs seem to have fallen out of favour; even if context switches could be done with no overhead, that extra cost of fetching, decoding, and executing instructions can't be recovered. (As that old saying goes, "The fastest way to do something is to not do it at all.")
Looking a bit more into it, MIPS is the most widely-used CPU that still has a "soft TLB". The other popular RISC, ARM, is automatic like x86.
I coined the term "applications that you use at the coffee machine" a while ago for long-running stuff (simulations, compilations, NP-complete stuff like routing or model checking). It's basically paying engineers to do nothing.
Well, yes, that's a bad way to measure performance. But when I compile something, it's because I want feedback from the compiler and static analyzer about the program I just wrote. I have a half-dozen parts of the program in short-term memory which I want to consider when seeing the output or the program behavior. If it takes five minutes to compile code, then my short-term memory will be filled with something else by then, like the weather, chores I'm putting off at home, or whatever.
So if you measure how long it takes for engineers/developers to get feedback, you're really measuring how good their tools are, which is a half-decent proxy for engineer performance.
It's more like reading reddit: when you're waiting for a heavy computation, you're not exploring results, tuning the algorithm, or anything. You're just stuck in your tracks by a stupid machine. If anything, you'll be conservative in your use of the slow algorithm, because you don't want to lose another half day. You can't fiddle. Even if you try to be productive by reading something smart in the meantime, you'll be interrupted by the end of the computation, and you'll be completely out of context when analyzing the results.
(Note that it's the same when testing takes a long time; coincidentally, the same company had a long-running product and an 80-minute test cycle, so everybody was paid to take coffee.)
From what I have seen, not that many people have stuff running at the same time. But still, someone with stuff running might propose a coffee break to someone who isn't running anything, and everybody loses (it's a bit like smoking breaks, where smokers tend to synchronize with the heaviest smoker).
The worst part is that some applications are really hard to make faster, for example, going from "slow now" to "faster tomorrow" might necessitate quite a few changes that will be validated at "slow now" speed, for months. So the whole road to a faster tomorrow will be at pilgrimage pace.
Can anyone hazard a guess as to how Linus was able to measure the cost of a page fault and an iret so precisely? What tools and techniques might he have used?
I don't know what he used, but it's probably based on the model-specific registers for performance counters. See Chapter 18 of the Intel 64 and IA-32 Architectures Software Developer's Manual.
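A crude userspace approximation (my guess at a method, not necessarily what Linus did) is to run a microbenchmark that does little besides fault and divide cycles by faults:

    perf stat -e cycles,page-faults ./fault-loop   # ./fault-loop: a hypothetical mmap/touch/munmap loop

If nearly all the cycles go to faulting, cycles divided by page-faults approximates the per-fault cost, including the hardware entry/exit that the in-kernel handler can't account for on its own.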
So out of 82 comments only one notices that the title is awfully wrong, and the 81 others spread their culture like jam on bread: the less you have, the more you spread.
Just for the sake of showing YC is not about posing: does anybody grasp that each of your 4 GHz cores is actually spending 80% of its time at the HW level taking page faults, because the architecture is built this way? It means that only about 800 mega-cycles are actually executed per second (or the core is idling). Actual computers perform about as well as an 800 MHz computer that would never page fault, while sucking down more than 300 W.
Don't these figures seem enormous?
EDIT: it should at least raise some incredulity, and if confirmed, some serious questions about how we measure computer performance vs power efficiency.
He needs asymmetric follows for his rants to get reach, and his rants don't fit in 140 characters. That leaves blogs and Google+, which is a tradeoff between better formatting controls and monetization (blogs) and realtime engagement (Google+). His choice makes perfect sense to me.
[1]: http://larjona.wordpress.com/2011/06/15/numbers-about-the-li...