Hacker News new | comments | show | ask | jobs | submit login
How newlines affect Linux kernel performance (amit.zone)
287 points by tbodt 9 days ago | hide | past | web | favorite | 42 comments

v8 (before v5.9) used to only inline functions that were under 600 characters (and 196 AST nodes). That was another fun way to add fuel to the tab vs. spaces fire: identical functions that used spaces instead of tabs could run significantly slower because they weren't inlined.

> The kernel compiler, GCC, is unaware to the code size that will be generated by the inline assembly. It therefore tries to estimate its size based on newline characters and statement separators (';' on x86).

Seems like a poor decision to use whitespace as a factor in estimating code size. Some people like to space out their code with lots of comments even if it is only a few lines long. This is my naive view, though - does anyone out there have any insight as to why semicolons alone aren't sufficient to estimate code size?

This is measuring code size in inline assembly. The assumption underlying the heuristic is that code size is proportional to the number of assembly instructions and that a line of inline assembly is usually an instruction (although a separator may mean multiple instructions per line). It is not looking at the C code itself, only the inline assembly string.

That heuristic is actually usually a very good one. However, in this case, the inline assembly contains a lot of computation in directives, so it consists of several lines to compute one small instruction.

To answer the question you actually asked rather than what you meant to ask (because the question you asked is interesting in and of itself): compilers throw away the actual source code very quickly. At the stage that inlining is considered, the program is going to consist of a sequence of instructions not unlike the C abstract machine as opposed to a more direct representation of code. The simplest way to estimate a program size is to simply count these instructions, but some instructions stand a good chance of being eliminated (say, a bitcast instruction that changes type without adjusting value), so there's usually some adjustment for "this is really expensive on this machine" or "this instruction is free."

Arguably inlining (or not) should be deferred to the link / output phases and there would be enough information present to perform the operation or not when putting the puzzle of segments together.

The problem with doing this is that many other optimizations depend on inlining. So, if inlining is deferred until link time, thise optimizations will be far less useful.

Take for example auto-vecorization. If we have a loop that just calls a function which increments a value in an array, we may very well have a great candidate to vectorize. However, if the function isn't inlined, the auto vectorization optimization will just see a function call as opposed to an increment and won't be able to do anything with it.

Why isn't optimization a repeated process? Apply some sequence of operations in a loop until diminishing gains are detected? So for example, run inlining again after the code has been lowered., I'm guessing it's actually really complex to try & implement & some optimizations can't run once the code has been lowered?

Similarly, the article talks about how the LLVM inliner clearly is smarter since it seems likely they ask their built-in assembler to provide an estimate of the code size rather than using more coarse heuristics. It seems like there's clearly a better path forward for GCC.

> Why isn't optimization a repeated process? Apply some sequence of operations in a loop until diminishing gains are detected?

Compile time. Compiler optimizations are a tradeoff between code size, running time, and compile time, complicated by the fact that that sneezing at your code can change performance by 40% or more and the fact that correctness comes with steep performance costs and often isn't actually desired (until it really is).

Of all the optimizations, inlining is probably the worst for the difficulty of balancing the triad of objectives and the black magic of its heuristics.

The part of this that I don't understand is: since we will always know more information later in time, why not defer such decisions until runtime?

HP Labs' Dynamo project (~20 years ago) was a JIT for PA-RISC on PA-RISC, and it made most programs run faster, in large part by being able to inline functions at runtime even across library boundaries.

Was there something impractical about this strategy that caused it to end up in the dustbin of computer science history?

> The part of this that I don't understand is: since we will always know more information later in time, why not defer such decisions until runtime?

Latency vs Throughput. And cost of optimizations vs win of optimization.

> The part of this that I don't understand is: since we will always know more information later in time, why not defer such decisions until runtime?

I don't think I fully agree with this. The amount of information available to the JIT compared to an AoT compiler also depends on what input a JIT works on. In the case of Java, I'd imagine that the JVM has a strictly greater amount of information to work with than an AoT compiler because JVM bytecode carries basically all the information available in Java source code. However, a Rust JIT would have semantic information that a lower-level JIT like Dynamo wouldn't (assuming I'm understanding Dynamo right), like aliasing information. Dynamo could infer that a particular memory store probably doesn't affect a later memory load, and optimize based on that, but a Rust JIT could guarantee it, allowing for the same potential optimizations without having to insert deoptimization checks.

This is where profile-guided optimization comes into play, I think. Assuming a representative profile (and that's admittedly a big if), you can get much of the same information that a JIT gets while also getting to work with a potentially semantically-richer source.

> HP Labs' Dynamo project (~20 years ago) was a JIT for PA-RISC on PA-RISC, and it made most programs run faster, in large part by being able to inline functions at runtime even across library boundaries.

I'm curious what kind of code Dynamo was tested on, as the effectiveness of a JIT depends on the particular code patterns of the software being run. For example, JITs can help a lot with code with lots of virtual function calls or indirections that traditional compilers tend to stumble on, but I don't think a JIT would help as much for something like a cache-optimized computational kernel.

Admittedly, my point is weakened by the graph shown in an old Ars Technica article on Dynamo [0], where Dynamo showed significant improvements on the SPEC benchmark suite, which I assume doesn't rely on shared libraries. Now I'm curious whether similar differences can be shown given the nearly 20 years C/C++/Fortran compilers have had to improve.

> Was there something impractical about this strategy that caused it to end up in the dustbin of computer science history?

I'd hardly say that JITs ended up in the dustbin of computer science history; on the contrary, they're quite common. The JVM and the various Javascript JITs are probably the most well-known ones; you also have Pypy (and everything derived from RPython), LuaJIT, Julia, MATLAB, the new JITs showing up in Ruby and PostgreSQL, and the more recent push to develop new JITs (like Graal). The JVM got its JIT in 1998, and the JIT was made the default in 1.3 in 2000, so it's not like all these are recent developments.

Now, between all that, I don't know why a low-level JIT like Dynamo (if I understood the article I found [1] right) didn't become more popular. It certainly seemed like promising technology at the time. I'd love to know why it seemed to die out and whether it's worth pursuing today.

[0]: http://archive.arstechnica.com/reviews/1q00/dynamo/dynamo-3.... [1]: http://archive.arstechnica.com/reviews/1q00/dynamo/dynamo-1....

> Seems like a poor decision to use whitespace as a factor in estimating code size.

It's a heuristic. It's always going to be more poor in some respect that doing something precise. You could probably make it better in a million different ways, but it's suppose to be quick and simple.

Even with a simple heuristic, wouldn't you try to pick this magic number 600 so that it's about as likely to be too big as too small?

Or is the problem very asymmetric, that inlining too-large pieces has dramatic downsides that failing to inline small pieces does not? Or is it just that the magic number was chosen decades ago, and the tradeoffs (perhaps re memory) are different now?

> wouldn't you try to pick this magic number 600...

I guess ideally you'd run a sophisticated machine learning algorithm across a huge corpus of real-world code, on real-world devices to discover the right number... but in reality people just don't have time for that so they pick a number and move on. Can't really blame them.

I had a binary search in mind, if that counts as sophisticated :) Just surprised by orig GGP's assertion that more inlining would automatically be faster.

to inline the parsing must be done. Once parsing is done there is no newlines and semicolons, only AST nodes. Counting AST nodes is very quick and simple and provides for much better heuristic.

> the parsing must be done.

Therein lies the bulk of the problem, gcc doesn't parse the assembly code. The assembly is just copied with some minor string replacements for the registry names, and it's up to the assembler to deal with it and parse it.

> to inline the parsing must be done

Yes... that's the point. For your metric counting AST nodes you need to have parsed the method. The other heuristic works on unparsed source-code. So you don't waste time parsing a method that you won't inline.

> Counting AST nodes is very quick and simple

But parsing is neither. And you need to have done that first. I think (there's a paper I'm sure but can't find it) that parse time for JS applications is very substantial.

You don't want to inline functions that haven't been parsed. If it hasn't been called by the time you are JITing, then it probably isn't going to be on a hot path.

If it's been JITed then the AST may well have been discarded after the JIT completed! Doesn't V8 work like that today? You'll need to pass it again even if you have already parsed it!

It's not hard to think of worst-cases. Think about an enormous function that you would not want to inline. It's been JITed, or it's running in a byte code interpreter, and the AST has been discarded. In your scheme you'll have to parse it to find it's too big. Huge waste of time, if you can see it's too large from the source code.

You can't throw away your interpreter bytecode, because the JIT code needs to be able to bailout if a guard fails (e.g. if you specialized a var as an int and you get a non-int).

So you can always use the size of your bytecode as an inlining heuristic (which is what Chakra does).

Of course that wouldn't work before v8 had an interpreter, but they could have used other heuristics. For example, they could have used the size of the AST as a heuristic and saved that even after the AST gets thrown away.

Read the article - LLVM does it correctly.

You don’t put semicolons between instructions in assembly. A semicolon is used for comments to the end of the line. Instructions go one per line

No, the GNU assembler (unlike some others) does use semicolons as an instruction separator. Comment markers include // and // (like C), as well as potentially #, @, or others depending on the architecture.

Edit: Actually, on some lesser-used architectures it does support using ; to start a comment, e.g. z80 [1], but not the most common ones.

[1] https://sourceware.org/binutils/docs-2.26/as/Z80_002dChars.h...

__builtin_constant_p is a rats nest of edge cases. We're looking into differences between implementation from gcc and clang now for the kernel.

Also, we're working on getting Clang's integrated assembler to assemble the Linux kernel, but there's a long tail of syntactic sugar and simpler basic things Clang has to implement as GAS from binutils is a defacto standard for assembler.

I'd really like to see a comparison with clang's inliner. In my cases [1] gcc constexpr support was always horrible and clang decent. So I'm not sure if using the builtin assembler would fix the gcc situation.

1: https://github.com/rurban/safeclib/blob/master/tests/perf_me...

So I think the inlining decisions in a compiler are some of the most heuristic-y parts of the compiler. There was a paper from Jeff Dean and some other Googlers recently that hinted that machine learned models might be the next leap forward in compiler technology, as a potential means of replacing these heuristics.

Machine learning? If some logic, than precise and reproducible please. Prolog would be nice, not some neural nets. gcc already has some nice lisp. Jeff Dean usually has good ideas, but not this one. ML would be more useful for a JIT, dynamic optimization problems at run-time.

But the problem is only the botched architecture, with non-manageable interdependencies. Of course inlining is the mother of all optimizations, most other optimization should be applied after inlining. But for inlining to be effective it has to run after constant folding. And then after inlining you have to run most optimization passes again to get the real benefits.

This problem is much simpler: Missing as-integration and a botched constexpr API. They should just take it from llvm.

Seems like those competitions where people write one-liners have their merits.

I was wondering why that'd be the case (usually the only reason these urls even pop up was because someone was copying urls from a google results page), then I saw this at the bottom:

>Made with the new Google Sites, an effortless way to create beautiful sites.

The idea is to prevent you from leaking the URL of your internal google sites page through the referer header. Maybe Google does some tracking, but the idea behind the redirect was for privacy, as I understand it. (Disclaimer: former Googler.)

There's usually a good justification like this for features like these. The fact that every single such feature results in Google tracking more of your life is a mere coincidence.

I mean, they could just inject Javascript that detected that you clicked the link if tracking was the only goal. The fact that they prefer an HTTP redirect means there's more than just tracking involved.

It would be nice if they put a link to the privacy policy on the redirect page, so the intent was more clear.

You can disable sending the Referer header with Referrer-Policy:


(Maybe R-P didn't exist when the current solution was concocted.)

Per MDN, Referrer-Policy isn't supported in IE/Edge either, so you'd have to do browser sniffing. This seems like it has the potential for something to go wrong and cause a referer leak.

On the other hand, if it's a public site then obviously there's no concern about the URL leaking, so I'd think this should only be turned on for sites that are set as private.

I'm guessing that the underlying problem here is that you can share things in Google Drive / Google Docs to "anyone with the link" and when that's enabled, you probably didn't actually intend to share it with the owners of the websites that you happened to link to. Safer to just never let the URL leak.

TIL about Google Sites. TBH I'd prefer Google link tracking everywhere over popups every time I visit a la Medium.

Sorry for that. It was the path of least resistance when I set a website.

Alternatively, you can run this in the web console to fix the links:

  for (let a of document.querySelectorAll('a[href*="/url?"')) { a.href = decodeURIComponent(a.href.match(/[?]q=([^&]+)/)[1]); }

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact