Cost of an Integer Cast in Go (boyter.org)
84 points by kiyanwang on Aug 30, 2022 | 78 comments



As a compiler guy, I'd appreciate some look at the layers of abstraction in-between (so, ASM). Microbenchmarks are famously jittery on modern CPUs due to cache effects, branch prediction, process eviction, data dependencies, pipeline stalls, OoO execution, instruction level parallelism, etc.

You have to be really careful to ensure the numbers you're getting are really coming from the thing you're trying to benchmark, and aren't corrupted beyond recognition by your benchmarking harness.

Some of these questions would surely be trivial if I actually knew any Go, but I'm left wondering:

* What does the machine code / assembly look like for this? What does the cast compile down to?

* What's `int` an alias for? I assume 64-bit-signed-integer?

* Are integer casts checked in go? Would an overflowing cast fault?


* https://godbolt.org/z/3dEdha44W

* Architecture based, so 32 or 64 bit signed integer.

* No faults. Signed become -1 and unsigned become MAX.


This is using GCC Go which no-one actually uses.

Edit: why I'm being downvoted: https://go.godbolt.org/z/699d7KjWr


I for one develop on GCC-Go. The only reason I chose Go as my next project's language is because it has a GCC implementation (I have a strict rule about GPL-licensed development toolchains for my own projects).


Anyone who cares about the extra performance that the decades-old, battle-tested GCC backend is able to get out of Go code.


I did not test this, but I assume the Go compiler is "much" faster than the "semi-old" GCC implementation.


As far as I can see, GCCGo implements Go 1.18 as of today [0]. Moreover, as noted on official Golang webpage for GCCGo [1], there are even some advantages for using GCCGo, quoting:

"On x86 GNU/Linux systems the gccgo compiler is able to use a small discontiguous stack for goroutines. This permits programs to run many more goroutines, since each goroutine can use a relatively small stack. Doing this requires using the gold linker version 2.22 or later. You can either install GNU binutils 2.22 or later, or you can build gold yourself."

Considering GCC can optimize well-written code to the point of saturating the target processor, given the correct target flags, I'm not entirely sure that "gc" would be "much" faster than GCCGo. I'm relatively new to Go, but long familiar with GCC, esp. g++, so I'm assuming the optimization prowess carries over to GCCGo.

Last but not least, GCCGo has been part of GCC as a primary language since 4.7, which is an eternity in software terms.

[0]: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgo/VERSION;h=...

[1]: https://go.dev/doc/install/gccgo


This was in 2019 and I doubt it has changed, but the default Go compiler is overall faster than GCCGo: https://meltware.com/2019/01/16/gccgo-benchmarks-2019.html


I was referring to the quality of generated machine code.


I was also talking about that compiled code.


Doesn't look like it, since you mention the Go compiler being faster, which is true but is a different point from the quality of the generated machine code, as the reference compiler doesn't do many of the GCC optimization passes.


I think he meant the performance of the compiled binary, not the build.


All are good points. Also, loop unrolling and induction variable analysis. LLVM is particularly aggressive at trying to optimize inductions. It will literally turn a "sum of i from 0 to N" into "n(n+1)/2", among others.

It's really important to look at the actual machine code.
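
For a concrete picture, here's a minimal Go sketch of the kind of loop being described (the function name is made up, and the folding claim is about LLVM on the equivalent C/C++ code; whether gc or gccgo does the same is not claimed here):

    // sumTo adds 0 + 1 + ... + n with a plain loop.
    func sumTo(n int) int {
        sum := 0
        for i := 0; i <= n; i++ {
            sum += i
        }
        return sum // LLVM can fold the entire loop into n*(n+1)/2 for n >= 0
    }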


An aside.

> It will literally turn a "sum of i from 0 to N" into "n(n+1)/2", among others[1]

Yeah, seen that on a godbolt youtube vid. The question is, should it do this? Or should it force you to use a library, by reporting what you're trying to do and telling you there's an easier way ("sum of 1 to n is <formula>; instead of a loop, use library function 'sum1toN()'")?

I think getting too clever risks hurting the user by not letting them know there's a better way.

[1] actually it seems to do a slightly different version of this to prevent risk of overflow, but same result.


One issue is that an optimization like this could have resulted from many prior inlinings and foldings that had to work together to create the pattern that was finally matched, so it's not obvious to point the user to what they should change their code to. Maybe they actually intended that the code get folded at the end by the compiler, but all of those optimizations were done by templates that will get folded a different, but equally cool way, under different specializations.

Generally I think some of LLVM's optimizations are trying too hard for the amount of complexity and compilation overhead they create. All that complexity comes at the risk of bugs. With optimizations around UB, it becomes downright mind-boggling what could go wrong. But I'm not an LLVM maintainer so what the heck do I know.


Microbenchmarks are especially problematic for real-world code, where you load stuff from random addresses in memory.

If the code after the cast is blocked on a memory load, then you have a lot of free instruction slots while the CPU is waiting for the load to complete. In that case it doesn't matter whether the cast is free or takes a handful of instructions.

Sometimes code becomes faster by using more instructions to make the data more compact so more of the data stays in the caches.
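
As a rough illustration of that last point (names made up): summing a []int32 moves half the bytes of summing a []int64, so on large data that misses the caches the narrower type can win even though it adds a conversion per element.

    // Same arithmetic, half the memory traffic per element.
    func sum64(xs []int64) int64 {
        var s int64
        for _, x := range xs {
            s += x
        }
        return s
    }

    func sum32(xs []int32) int64 {
        var s int64
        for _, x := range xs {
            s += int64(x) // the widening conversion is essentially free next to the load
        }
        return s
    }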


Microbenchmarks are only valid for the one function under test, and only on the current machine; they're all right for optimizing on particular hardware, but not so much for going out into the world and declaring "X is faster than Y".

That said, I did like this website where you could set up JS benchmarks, they would run on your own machine and you could compare how it ran on other people's systems. It wasn't perfect, but it gave a decent indication if X was faster than Y. Of course, it's a snapshot in time, JS engines have gone through tons of optimizations over the years.


My point is that the context of the function can invalidate a microbenchmark completely.

If you only call this function once in a while, then the context is more important than the function.

You can only ignore the context when you do video decoding or matrix inversion or similar "context free" long running code.


int is defined to be the pointer width (or something like this), so probably int64 where the OP is running their code.

integer casts are unchecked.
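
If you want to check on your own machine, a quick sketch (strconv.IntSize is the width of int in bits):

    package main

    import (
        "fmt"
        "strconv"
        "unsafe"
    )

    func main() {
        fmt.Println(strconv.IntSize)           // 64 on most modern platforms, 32 on some
        fmt.Println(unsafe.Sizeof(int(0)) * 8) // same answer, via the size in bytes
    }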


and is it aligned properly


> about 3x the overhead to convert from int to float.

That 3x overhead has nothing to do with converting types.

A CPU can't compute x += y before it has the old value of x. This is called a "data dependency chain".

Modern AMD64 CPUs have 1 cycle of latency to add integers, 3 cycles of latency to add FP32 or FP64 numbers. This means no matter what else is in the loop, a loop which increments a single float or double variable can't possibly be faster than 3 cycles / iteration.

As demonstrated by the OP, Apple M1 is very similar in that regard.
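
A minimal sketch of the dependency chain in question, assuming a loop shaped like the one in the article: every iteration needs the previous value of x, so the loop can't retire an iteration faster than one FP-add latency, no matter how cheap the int-to-float conversion is.

    func sumFloat(n int) float64 {
        var x float64
        for i := 0; i < n; i++ {
            x += float64(i) // each add has to wait for the previous add's result
        }
        return x
    }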


Parallelism can break that speed limit, at least in theory. You could use four different registers to hold x+y, x+2y, x+3y, x+4y and run four instances of the loop in parallel, if there are no other data dependencies.

I'm not sure if hardware microcode can be smart enough to detect that and do it with register renaming and such, but a sufficiently smart compiler could. (Or realistically it would be optimized into SIMD instructions; what I said in the first paragraph is really an ad-hoc implementation of SIMD.)
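
For illustration, a hypothetical hand-unrolled version with four independent accumulators (a sketch, not a claim about what any particular compiler emits):

    func sumFloat4(n int) float64 {
        var a, b, c, d float64
        i := 0
        for ; i+3 < n; i += 4 {
            a += float64(i)     // these four adds don't depend on each other,
            b += float64(i + 1) // so they can be in flight simultaneously
            c += float64(i + 2) // and hide the FP-add latency
            d += float64(i + 3)
        }
        for ; i < n; i++ { // leftover iterations
            a += float64(i)
        }
        return a + b + c + d
    }

The result can differ from the sequential loop in the last bits, which is exactly the associativity point raised below.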


> run four instances of the loop in parallel

Indeed, here’s an example which uses even higher latency FMA instructions to accumulate: https://stackoverflow.com/a/59495197/126995 The example uses 4 independent accumulators to saturate the throughput, instead of stalling on the latency. BTW, on most modern CPUs that code doesn’t saturate the compute throughput, the bottleneck is loads.

> a sufficiently smart compiler could

They tend to avoid doing that for floats, for accuracy reasons: float addition ain’t associative.
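
A minimal demonstration, using variables so the arithmetic happens in float64 rather than in Go's higher-precision untyped constants:

    package main

    import "fmt"

    func main() {
        x, y, z := 0.1, 0.2, 0.3
        fmt.Println((x+y)+z == x+(y+z)) // false
        fmt.Println((x+y)+z, x+(y+z))   // 0.6000000000000001 0.6
    }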


> They tend to avoid doing that for floats, for accuracy reasons: float addition ain’t associative.

Indeed, unless you use -ffast-math (also known as -fidontcareabouttheresults). See https://stackoverflow.com/questions/7420665/what-does-gccs-f...


To be fair, lots of devs haven’t studied IEEE 754 error analysis, and don’t know exactly how much precision they can rely on either way. I know I don’t, I just sort of assume the worst.


Isn't a cast from int to float going to generate CVTSI2SD or a similar instruction, which is slow and high latency? On Ice Lake Xeon these are faster, but as recently as Skylake they had a reciprocal throughput of 2.


Yeah, cvtsi2sd has 3-9 cycles of latency on Zen3, but the throughput ain’t too bad. The table on uops.info says that on Zen3, cvtsi2sd has 1 cycle throughput, meaning a core can complete one of these instructions every cycle. There are no other dependencies, so the pipeline will deal with the latency just fine, by dispatching instructions for multiple loop iterations.

It can’t deal with the latency of the ARM64 equivalent of addss/addsd, due to the dependency chain. I think that’s the bottleneck in the OP’s microbenchmark.


I have to say I don't understand the point of debates like this when it comes to Go (it's like the age-old "+" vs "strings.Builder" debate for concatenation).

The point of Go is that it's quick to learn, quick to get stuff done, and it's got a fantastic stdlib that enables you to get most modern things out of the box.

If someone is worrying about the cost of an integer cast in Go, then either:

    (a) They are using the wrong language and should be using something more low-level (e.g. Rust, C etc.) instead.
    (b) Like many who went before them, they are engaging in premature optimisation
I'd put money on most people falling head-first into (b).


It seems like there are a lot of posts online about how any effort at all to understand how code is executed and what tradeoffs are being made is a case of premature optimization.


I don’t think that’s what people are complaining about. Doing a poor job is.

For example += on floats is slower than += on integers. So does casting actually take 3x as long or is that an artifact of poor test design?


> its like the age-old "+" vs "strings.Builder" debate for concatenation

That's not even a debate. Have you done tests on this? If I remember right, once the string runs over the GC default (I think 1 GB), "+" turns from linear to exponential time. So yeah, I guess for most stuff it doesn't matter, but when it does matter it's a huge time difference.
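
For anyone who wants to measure it, a minimal benchmark sketch (goes in a _test.go file; names made up). Repeated + copies the whole string on every concatenation, so building a string of length n this way is O(n^2) overall, while strings.Builder amortises its growth:

    package concat_test

    import (
        "strings"
        "testing"
    )

    const n = 10000

    func BenchmarkPlus(b *testing.B) {
        for i := 0; i < b.N; i++ {
            s := ""
            for j := 0; j < n; j++ {
                s += "x" // allocates and copies the whole string each time
            }
            _ = s
        }
    }

    func BenchmarkBuilder(b *testing.B) {
        for i := 0; i < b.N; i++ {
            var sb strings.Builder
            for j := 0; j < n; j++ {
                sb.WriteString("x") // amortised O(1) per append
            }
            _ = sb.String()
        }
    }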


> once the string runs over the GC default (I think 1 GB),

Not wishing to sound snarky, but is there even a valid use-case for concatenating GB+ quantities of strings (or even hundreds of MB) ?

I mean surely you'll (potentially) run into memory exhaustion issues and other fun things ?


Of course there is. 1 GB is not even that big these days.


> Of course there is

When? Why not use ropes and avoid the copy?


personally I don't see a need for it, but I'm sure someone will do something dumb that requires it


While not Go, in C# you can avoid a ton of heap allocations by using StringBuilder instead of concatenation, and the crossover point for one being more efficient than the other is at something like 4 concatenations. But people will always say nonsense about "premature optimization" the moment you care about performance.


In regards to Go, in most cases it's a pointless optimization. Unless the string is quite large, it's the same result. Go is nothing like C# in this case.


In Python since 2.5 (and I'm sure in other languages' implementations) the compiler/implementation recognizes this += use case and implements it as a fast path.


By your logic, any attempt to optimize code written in Go makes no sense. But there are many of us who use go for the reasons you described and still care about efficiency, and especially about cases where trivial changes have significant consequences (one can argue if this particular article qualifies or not).


No, not by their logic. The question is presented as a filter. A filter needs to be (1) so obvious that it is learned by osmosis for a good practitioner, or (2) so useful that they are going to need it in their daily work. Why would any of these two hold for a Go dev job?


The post is suggesting that they were testing a surprising answer to the filter. A candidate took a look at the cast and claimed it was inefficient. The author was double checking their intuition that it shouldn’t be.

We don’t know the shape or result of the filter at all from the article.


It's weird to see this analysis without any look at the generated assembly. They've also added a data dependency between benchmark loops, which is going to matter. Depending on the size they're testing with they're also just going to be benchmarking memory bandwidth.


The code does not seem to perform any reads or writes other than to registers, so I would be surprised if memory bandwidth mattered.


IIUC it's iterating over a list of integers? Reading those from memory requires bandwidth.

Edit: never mind, I thought i was read from an array, not just incremented.


The benchmarks are interesting, but that's not what caught my attention.

I'm not convinced finding increasingly esoteric bugs in an interviewer-selected language is an effective gauge of coding ability. I expect this is actually worse than whiteboard coding.


We don't know what the actual coding question was; the claim that casting int to int32 is expensive seems to be an unprompted remark coming from an interviewee. I'm guessing they didn't get the job.


It didn't sound like they were required to find esoteric bugs. This is just something a candidate offered unprompted.


TFA said

> The filter for this is a simple review exercise. We present a small chunk of code and ask them to review it over 15 minutes pointing out any issues they see. The idea is to respect their and our time. It works pretty well and we can determine how much experience someone has by their ability to pick up the obvious vs subtle bugs.

So maybe there's more? But it's definitely a mandatory, initial part of the interview process at the very least.


The code may be pretty bad, and it's interesting to see whether a candidate spots that it's bad (if they don't you're presumably going to get more poor code from them if hired) and what about it in particular they point out. It's an opportunity for them to show you what they know, rather than a box checking exercise where you know the answers.

I actually did an exercise like this about a year ago, in Java. I remember I really cared about iterators at the time, and so even though it's probably not the worst thing wrong with the code, I just kept coming back to those C-style for loops, which I thought were reprehensible. I should ask my interviewer (whose department I now work in) what their impression was as a result of this obsession.


They even went one further and explained the answer they're looking for:

You lose data due to overflows which is what we expect

Seems like a reasonable question, though without more context that may or may not be true.


Why is that?


I had a long answer, but my phone eated it. (thanks, bus jounce + accidental swipe down + hn forgetting the post between reloads)

How about: why is finding increasingly esoteric bugs in an interviewer-specified language a great way of hiring software engineers?


> How about: why is finding increasingly esoteric bugs in an interviewer-specified language a great way of hiring software engineers?

I don't know. You were saying it wasn't, so I wondered why.


What's wrong with asking candidates to find non-esoteric bugs?


While the programmer should produce clean, good and optimised code, finding bugs and errors is more of a job for a pentester. Experienced programmers can easily detect these, but that is not their main job.

A programmer should be able to solve problems and apply the selected language to solving those problems as efficiently as possible.


You think a penetration tester should find bugs in code? They're looking for security weaknesses only, and almost never by looking at the code. Your method may be the most expensive possible.


Every bug is a weakness at some level. Currently, the most efficient way to make money in bug bounties is by reviewing open-source code. Manual testing takes a huge amount of time. You can also automate code review for low-hanging fruit with tools like semgrep or GitHub's code scanning.

Of course programmers should test their code themselves and minimise the bugs, but their job is not to look for them.


> Of course programmers should test their code themselves and minimise the bugs, but their job is not to look for them.

I have to respectfully WAT. Code review should be a part of everybody's workflow. And all programmers involved should be looking out for bugs. The best bugs are those which were never merged in the first place.


> I have to respectfully WAT. Code review should be a part of everybody's workflow.

Exactly, it is one part of their workflow. Not their whole job. Their overall job is to solve problems with code.


What a nightmare.

Ops tech 1: "Hey... my database just dropped an entire table, I lost a week of work."

Ops tech 2: "That's a serious bug, you should escalate."

Ops tech 1: "Hey, you wrote this thing, it dropped my DB table, I lost a week of work..."

Great and Mighty Programmer: "Not my job. I am a programmer, you see, and looking for bugs is beneath me, some peasant task. Now begone, I must solve more problems!"

Ops tech 1: "So what do we do? This doesn't work, like, at all. Completely broken, not even the most rudimentary testing was done by whoever created it."

Ops tech 2: "Stop using the database, we will build an Excel spreadsheet on a shared network drive."


It's a subset of the job, yes. Not to be left to a pen tester.


I would argue their job is to solve problems with working code, which requires finding/avoiding bugs.


Why a pentester and not a QA team more broadly? QA won’t necessarily review the code (haven’t met a team that did), but they will typically hammer a system with test cases and scenarios that expose unusual behavior and uncover bugs.

I’ve had pentesters review code looking for things like insecure hashing or encryption, or low-hanging fruit like creds in the code, but I wouldn’t be inclined to leave what is essentially a QA process to a pentester.


It may be useful to have a look at the assembly. There's probably no instruction corresponding to the cast, you just get a different arithmetic instruction that uses the bottom X bits. For a comparison between code with and without a cast between signed and unsigned ints of the same width, I would expect exactly the same instructions to be used.


The line

    x += int32(i)

produces the assembler line

    0x001e 00030 (.\cast_test.go:10) ADDL CX, DX

In comparison, the code

    x += i

uses

    0x001e 00030 (.\cast_test.go:19) ADDQ CX, DX

The results are definitely different, since ADDL is the 32-bit add and ADDQ is the 64-bit add, and that is how it's achieving the cast. I'm not sure the intent of the benchmark (the cost of a cast) is being captured directly, since we get a specialised add instruction that does the casting for us. But when you think about it, ADDL 0, $Val achieves a cast and it's one instruction, so it's not expensive at all; there may be something better than that for explicit casts, and I doubt there is anything worse. It's basically free. I see no performance difference between them, and that fits with expectations, since casting is just throwing away all the top bits.


Thanks. This is what I said I expected, given that int32 and int are different widths.


I agree... The compiler can make a lot of assumptions about such a simple operation; you have to look at the assembly. In other, more complex cast cases it might be slower, so I would not assume anything.


Not only useful, but required. They have to check that the code is performing the intended operation and, even, that the loop is still there (https://kristerw.blogspot.com/2019/04/how-llvm-optimizes-geo... shows how gcc and clang remove some loops; the loops used here have simple forms too, and could fall to the same type of optimization).


This may also depend on the architecture: what may be free or close to free on AArch64 may not be on x86_64, for example. Also, of course, the relative costs (e.g. 3x for the float cast) may be different as well.


This kind of benchmark makes zero sense. Sorry for being straightforward; it would be okay for someone who is only starting to program, but the "have been doing interviews" part leaves an aftertaste. One must know their shit before they "present a small chunk of code and ask them to review it over 15 minutes pointing out any issues they see".


Honestly I can't even understand how this is real... this seems like exactly the wrong person to have doing interviews, or even working as a developer at my company.


Checking how many language subtleties you can point out is no doubt a good interview if you're looking for people experienced in a specific language.

It strikes me as so different from what the FAANGs are doing: they ask you to write trivial algorithms on the whiteboard, which apparently filters 99% of engineers.

Theoretically you can have a good candidate that doesn't know e.g. that appending to a slice passed in a function argument is a recipe for disaster. You'd show them the door over something that they could learn in a day.


> Theoretically you can have a good candidate that doesn't know e.g. that appending to a slice passed in a function argument is a recipe for disaster.

Huh? This is a common pattern. (hash.Hash).Sum, strconv.AppendInt, etc.

I think you're referring to the fact that append can mutate existing memory (that is, it doesn't always allocate new memory), but in practice, this is rarely an issue, because, by convention, functions that append to a slice always return that slice.
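
A minimal sketch of the footgun (and of the return-the-slice convention that defuses it); whether the caller's memory gets overwritten depends entirely on spare capacity:

    package main

    import "fmt"

    func addZero(xs []int) []int {
        return append(xs, 0) // caller must use the returned slice
    }

    func main() {
        backing := []int{1, 2, 3, 4}
        a := backing[:2]     // len 2, cap 4: spare capacity
        b := addZero(a)      // reuses backing's array, overwrites backing[2]
        fmt.Println(backing) // [1 2 0 4]
        fmt.Println(b)       // [1 2 0]

        c := []int{1, 2}     // len 2, cap 2: no spare capacity
        d := addZero(c)      // allocates a new array; c is untouched
        fmt.Println(c, d)    // [1 2] [1 2 0]
    }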


You are making a lot of assumptions about the interview. Code review is an important part of the development process and seeing candidates who can suggest improvements to code is valuable.

You seem to be assuming that the intended answers are language subtleties, but I don't see evidence of that in the post.


Theoretically, sure, but chances are any good engineer knows that stuff and doesn't need to learn it, and FAANG can filter heavily because they get a lot of applicants.

Basically, if you don't know something you could have learned in a day in preparation for an interview at a top-paying company, that in itself is a sufficient signal.


There are many more “things you can learn in a day” than there are available days to do it, though. Nobody will ever know all the possible tiny simple things.

I would think in general that it’s better to select people who adapt and learn quickly rather than people who seem to know everything but you’re right: we are not fishing in the same pond as the FAANGs.


So all the tests were done on an m1 mac. Is there any difference on x86_64?

I wouldn't expect there to be, but unless this program is only going to be running on new macs, this analysis isn't really complete.


Would the guess at how many instructions the code uses be better assessed by looking at the assembly output (might need to disassemble to get it; I'm not too familiar with the Go toolchain), or with a good debugger?




