
Go 1.8 toolchain improvements - spacey
https://dave.cheney.net/2016/11/19/go-1-8-toolchain-improvements
======
svetly0
MIPS32 support - kudos to Vladimir Stefanovic and Imagination Technologies for
making this happen. Many people from the embedded world will also greatly
appreciate support for soft-float MIPS32 hardware.

~~~
Twirrim
This is honestly the thing that is most interesting to me right now. MIPS32
has a massive embedded user base. Combined with Go's very fast GC, there are
some good opportunities here.

------
0xmohit
There are scores of other optimizations [0] as well:

    
    
      Optimizations:
      
      bytes, strings: optimize for ASCII sets (CL 31593)
      bytes, strings: optimize multi-byte index operations on s390x (CL 32447)
      bytes,strings: use IndexByte more often in Index on AMD64 (CL 31690)
      bytes: Use the same algorithm as strings for Index (CL 22550)
      bytes: improve WriteRune performance (CL 28816)
      bytes: improve performance for bytes.Compare on ppc64x (CL 30949)
      bytes: make IndexRune faster (CL 28537)
      cmd/asm, go/build: invoke cmd/asm only once per package (CL 27636)
      cmd/compile, cmd/link: more efficient typelink generation (CL 31772)
      cmd/compile, cmd/link: stop generating unused go.string.hdr symbols. (CL 31030)
      cmd/compile,runtime: redo how map assignments work (CL 30815)
      cmd/compile/internal/obj/x86: eliminate some function prologues (CL 24814)
      cmd/compile/internal/ssa: generate bswap on AMD64 (CL 32222)
      cmd/compile: accept literals in samesafeexpr (CL 26666)
      cmd/compile: add more non-returning runtime calls (CL 28965)
      cmd/compile: add size hint to map literal allocations (CL 23558)
      cmd/compile: be more aggressive in tighten pass for booleans (CL 28390)
      cmd/compile: directly construct Fields instead of ODCLFIELD nodes (CL 31670)
      cmd/compile: don't reserve X15 for float sub/div any more (CL 28272)
      cmd/compile: don’t generate pointless gotos during inlining (CL 27461)
      cmd/compile: fold negation into comparison operators (CL 28232)
      cmd/compile: generate makeslice calls with int arguments (CL 27851)
      cmd/compile: handle e == T comparison more efficiently (CL 26660)
      cmd/compile: improve s390x SSA rules for logical ops (CL 31754)
      cmd/compile: improve s390x rules for folding ADDconst into loads/stores (CL 30616)
      cmd/compile: improve string iteration performance (CL 27853)
      cmd/compile: improve tighten pass (CL 28712)
      cmd/compile: inline _, ok = i.(T) (CL 26658)
      cmd/compile: inline atomics from runtime/internal/atomic on amd64 (CL 27641, CL 27813)
      cmd/compile: inline convT2{I,E} when result doesn't escape (CL 29373)
      cmd/compile: inline x, ok := y.(T) where T is a scalar (CL 26659)
      cmd/compile: intrinsify atomic operations on s390x (CL 31614)
      cmd/compile: intrinsify math/big.mulWW, divWW on AMD64 (CL 30542)
      cmd/compile: intrinsify runtime/internal/atomic.Xaddint64 (CL 29274)
      cmd/compile: intrinsify slicebytetostringtmp when not instrumenting (CL 29017)
      cmd/compile: intrinsify sync/atomic for amd64 (CL 28076)
      cmd/compile: make [0]T and [1]T SSAable types (CL 32416)
      cmd/compile: make link register allocatable in non-leaf functions (CL 30597)
      cmd/compile: missing float indexed loads/stores on amd64 (CL 28273)
      cmd/compile: move stringtoslicebytetmp to the backend (CL 32158)
      cmd/compile: only generate ·f symbols when necessary (CL 31031)
      cmd/compile: optimize bool to int conversion (CL 22711)
      cmd/compile: optimize integer "in range" expressions (CL 27652)
      cmd/compile: remove Zero and NilCheck for newobject (CL 27930)
      cmd/compile: remove duplicate nilchecks (CL 29952)
      cmd/compile: remove some write barriers for stack writes (CL 30290)
      cmd/compile: simplify div/mod on ARM (CL 29390)
      cmd/compile: statically initialize some interface values (CL 26668)
      cmd/compile: unroll comparisons to short constant strings (CL 26758)
      cmd/compile: use 2-result divide op (CL 25004)
      cmd/compile: use masks instead of branches for slicing (CL 32022)
      cmd/compile: when inlining ==, don’t take the address of the values (CL 22277)
      container/heap: remove one unnecessary comparison in Fix (CL 24273)
      crypto/elliptic: add s390x assembly implementation of NIST P-256 Curve (CL 31231)
      crypto/sha256: improve performance for sha256.block on ppc64le (CL 32318)
      crypto/sha512: improve performance for sha512.block on ppc64le (CL 32320)
      crypto/{aes,cipher}: add optimized implementation of AES-GCM for s390x (CL 30361)
      encoding/asn1: reduce allocations in Marshal (CL 27030)
      encoding/csv: avoid allocations when reading records (CL 24723)
      encoding/hex: change lookup table from string to array (CL 27254)
      encoding/json: Use a lookup table for safe characters (CL 24466)
      hash/crc32: improve the AMD64 implementation using SSE4.2 (CL 24471)
      hash/crc32: improve the AMD64 implementation using SSE4.2 (CL 27931)
      hash/crc32: improve the processing of the last bytes in the SSE4.2 code for AMD64 (CL 24470)
      image/color: improve speed of RGBA methods (CL 31773)
      image/draw: optimize drawFillOver as drawFillSrc for opaque fills (CL 28790)
      math/big: 10%-20% faster float->decimal conversion (CL 31250, CL 31275)
      math/big: avoid allocation in float.{Add, Sub} when there's no aliasing (CL 23568)
      math/big: make division faster (CL 30613)
      math/big: use array instead of slice for deBruijn lookups (CL 26663)
      math/big: uses SIMD for some math big functions on s390x (CL 32211)
      math: speed up Gamma(+Inf) (CL 31370)
      math: speed up bessel functions on AMD64 (CL 28086)
      math: use SIMD to accelerate some scalar math functions on s390x (CL 32352)
      reflect: avoid zeroing memory that will be overwritten (CL 28011)
      regexp: avoid alloc in QuoteMeta when not quoting (CL 31395)
      regexp: reduce mallocs in Regexp.Find* and Regexp.ReplaceAll* (CL 23030)
      runtime: cgo calls are about 100ns faster (CL 29656, CL 30080)
      runtime: defer is now 2X faster (CL 29656)
      runtime: implement getcallersp in Go (CL 29655)
      runtime: improve memmove for amd64 (CL 22515, CL 29590)
      runtime: increase malloc size classes (CL 24493)
      runtime: large objects no longer cause significant goroutine pauses (CL 23540)
      runtime: make append only clear uncopied memory (CL 30192)
      runtime: make assists perform root jobs (CL 32432)
      runtime: memclr perf improvements on ppc64x (CL 30373)
      runtime: minor string/rune optimizations (CL 27460)
      runtime: optimize defer code (CL 29656)
      runtime: remove a load and shift from scanobject (CL 22712)
      runtime: remove defer from standard cgo call (CL 30080)
      runtime: speed up StartTrace with lots of blocked goroutines (CL 25573)
      runtime: speed up non-ASCII rune decoding (CL 28490)
      strconv: make FormatFloat slowpath a little faster (CL 30099)
      strings: add special cases for Join of 2 and 3 strings (CL 25005)
      strings: make IndexRune faster (CL 28546)
      strings: use AVX2 for Index if available (CL 22551)
      strings: use Index in Count (CL 28586)
      syscall: avoid convT2I allocs for common Windows error values (CL 28484, CL 28990)
      text/template: improve lexer performance in finding left delimiters (CL 24863)
      unicode/utf8: optimize ValidRune (CL 32122)
      unicode/utf8: reduce bounds checks in EncodeRune (CL 28492)
    

[0]
[https://github.com/golang/go/blob/master/doc/go1.8.txt](https://github.com/golang/go/blob/master/doc/go1.8.txt)

------
grabcocque
So the language has now spent getting on for 18 months significantly slower
than it used to be, and nobody seems to have a real issue with this?

~~~
zalmoxes
No, the language got "faster" over the last 18 months. This graph is about
compilation speed. Everyone in the Go community wants faster compilation
times, which is why Dave tracks them with every release.

~~~
luibelgo
I consider the Go compiler pretty fast. How big does a project have to be
for the slowdown to be noticeable?

~~~
smarterclayton
As a concrete example, Kubernetes (0.5M first party lines of code, ~0.7M
libraries) can take about 2-3 minutes to compile on a 2013 MBP on Go 1.7. So
probably around 200-300kloc you'd start to say "this feels slow".

~~~
oelmekki
Before golang, my experience with compilation was building software as a
user on Gentoo, plus a few LFS installs. Compilation speed was the first
thing that struck me about golang. But then again, it was the first time I
was actually writing code to be compiled.

Is it that typical open-source C/C++ projects are very big, or am I correct
to assume that the golang compiler processing Go code really is faster than
gcc processing C/C++?

~~~
candiodari
The 10,000-foot view is that C code is optimized more, but compiles slower.

Historically this was always seen as a good tradeoff. You only compile once,
but code runs thousands of times or more. Therefore a large slowdown in
compilation is seen as a good tradeoff for better runtime speed.

Go follows Turbo Pascal-style compilers in that:

1) the language is limited in several ways to make compilation faster (e.g.
compilation speed is given as an argument against generics)

2) the compiler just doesn't try a whole range of optimizations

Unfortunately this is not really what you're seeing. What you're seeing in
this case is a massive difference in the size of the projects you're
compiling.

Go has been written with the explicit purpose of making writing large programs
somewhere between tedious and impossible. By contrast, many C/C++ programs are
huge. Tens to hundreds of megabytes of source code, and when compiling Gentoo,
I bet you're compiling more than a gigabyte total.

~~~
oelmekki
Very insightful. Thanks!

