
I just tried the switch statement:

```
// Uses the same package-level randomBytes and result as the other benchmarks.
func BenchmarkSwitch(b *testing.B) {
	for n := 0; n < b.N; n++ {
		for start := 0; start < len(randomBytes); start++ {
		loop:
			for off, x := range randomBytes[start:] {
				switch x {
				case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
					result = off != 0
					break loop
				case 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
					'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z':
					result = off != 0
					break loop
				case 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
					'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z':
					result = off != 0
					break loop
				}
			}
		}
	}
}
```

It isn't any faster than the || compare, so I suspect the Go compiler doesn't really treat it as a special case.
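
For comparison, an ||-based check in the same harness looks something like this (a sketch, not necessarily the exact code from the post):

```
func BenchmarkOr(b *testing.B) {
	for n := 0; n < b.N; n++ {
		for start := 0; start < len(randomBytes); start++ {
			for off, x := range randomBytes[start:] {
				if (x >= '0' && x <= '9') ||
					(x >= 'a' && x <= 'z') ||
					(x >= 'A' && x <= 'Z') {
					result = off != 0
					break
				}
			}
		}
	}
}
```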


Also add cryptographers to that list. A friend of mine told me that they write 'constant-time' algorithms.

These have to be written in assembly, otherwise the compiler will try to speed up the algorithm. If it does, you can theoretically measure the execution time of the encryption and reverse engineer the possible keys.
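
The timing-leak part is easy to see even at the level of a comparison; a minimal Go sketch using crypto/subtle (this only covers comparisons, not whole algorithms):

```
package main

import (
	"crypto/subtle"
	"fmt"
)

// naiveEqual returns at the first mismatch, so its running time depends on
// how many leading bytes match -- exactly the timing signal an attacker
// can measure.
func naiveEqual(a, b []byte) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	mac := []byte("expected-mac-value")
	guess := []byte("expected-mac-valuX")

	// ConstantTimeCompare touches every byte no matter where the mismatch
	// is, so the timing does not reveal the length of the matching prefix.
	fmt.Println(naiveEqual(mac, guess))                      // false, leaks timing
	fmt.Println(subtle.ConstantTimeCompare(mac, guess) == 1) // false, constant time
}
```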


I'll have another try in Zig later on, and indeed the XMM/YMM/ZMM (and MMX) naming really is confusing; couldn't they just call them V128, V256 and V512...

I suspect it depends: when you write a YAML/JSON parser you can only change the algorithm up to a point. After that you have to start doing some bit fiddling, and then being able to see the assembly can be really valuable.

How many programmers write a YAML/JSON parser vs. use an existing library?

How many of the ones who write their own parser would benefit from using a library more than from reading assembly?

If your answer is "well, the ones writing the library benefit from learning assembly"... think about what percentage of programmers they represent. Not to mention that source-level profiling will still give them better bang for their buck.

As somebody who has read a ton of assembly in their career because those marginal gains mattered: 99% of programmers are better off investing their time elsewhere.


Yes, I agree with that one; most people don't need it and should first use a profiler. Then they can easily improve the performance by 10x.
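
In Go that usually means the built-in pprof tooling; a minimal sketch (the parser package and Parse function here are hypothetical stand-ins):

```
// bench_test.go in a hypothetical parser package.
package parser

import "testing"

func BenchmarkParse(b *testing.B) {
	data := []byte(`{"hello":"world"}`)
	for i := 0; i < b.N; i++ {
		Parse(data) // the hypothetical hot path to profile
	}
}

// Record a CPU profile and see where the time actually goes:
//
//	go test -bench=BenchmarkParse -cpuprofile=cpu.out
//	go tool pprof -top cpu.out
```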

For example, I optimized a calculation with Python dataframes by monkeypatching the 'unique' method so it would skip the sort, since my data was already sorted. This gained me a 5% performance improvement. (There were a few other tricks, which reduced the calculation time from 3h to 20m and made iterating faster.)

So I guess the assembly part is just a personal interest, and it is only useful for the innermost loop of a program that you can't avoid.

It seems that, in general, using SIMD/intrinsics is already part of the very advanced playbook of a developer, just like reflection, classpath scanning, GPU acceleration, etc.

Ideally the standard library would provide the fastest JSON/YAML/CSV parser, so no other attempts need to be made to improve on the standard.

I suspect your argument could even be extended: if you need performance, it might be easier to just switch languages. Somebody was super excited to tell me that he used a JavaScript lib which generated optimized deserialization code for SQL queries at runtime. I bluntly said: well, shouldn't you just use another language to avoid this complexity?

Curious question: why did you read assembly so often in your career?


Reading is indeed the main thing I would like to learn. For my master's thesis I had done some GPU programming (8 years ago), and then it was super useful to read the assembly to reduce the number of steps and to understand the 'execution model'. It lets you make sure your 'optimization' actually optimized anything.

Interesting, I never thought of ASCII as being a table of tables. I'll have another look into the ASCII table's number sequences to better understand it. This thread is really full of gold nuggets.
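
If I understand it correctly, it comes down to the digits, uppercase and lowercase letters each sitting in one contiguous block, with upper and lower case differing only in the 0x20 bit; a quick Go sketch to check:

```
package main

import "fmt"

func main() {
	// The three contiguous blocks in ASCII:
	fmt.Printf("'0'..'9' = 0x%02X..0x%02X\n", '0', '9') // 0x30..0x39
	fmt.Printf("'A'..'Z' = 0x%02X..0x%02X\n", 'A', 'Z') // 0x41..0x5A
	fmt.Printf("'a'..'z' = 0x%02X..0x%02X\n", 'a', 'z') // 0x61..0x7A

	// Upper and lower case differ only in the 0x20 bit, so OR-ing it in
	// folds both letter ranges into one, which halves the comparisons:
	c := byte('G')
	folded := c | 0x20
	fmt.Printf("%c | 0x20 = %c\n", c, folded) // G | 0x20 = g
	fmt.Println(folded >= 'a' && folded <= 'z') // true
}
```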

I'll try it in the follow-up post and probably include some graphics to make clear what it does. It's indeed the real SIMD way of comparing multiple characters in one iteration instead of one character per iteration.

purplesyringa's post below is ~7 assembly instructions without any need for lookup tables.

And the instructions purplesyringa uses are ADD, AND, OR, and compare. Modern CPUs can execute ~4 ADD/AND/OR instructions per clock tick (even if they are SIMD or even AVX-512 instructions), so you're looking at under four-ish clock cycles for purplesyringa's solution.

It's likely the best solution in the whole thread. The main advantage of pshufb is its extreme flexibility due to the lookup table, so there are advantages in learning that methodology.

But in general, sequences of simpler instructions are much faster.

---------

The "hard part" of purplesyringa's code is that its branchless.

Learning to write branchless code (especially using the pcmpeq* line of instructions) is a bit of a hassle, but worth practicing for sure. AVX-512 added explicit mask registers that do it better, but 128-bit SSE / 256-bit AVX is still more widespread today.
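
A scalar Go sketch of the general branchless range-check idea (not purplesyringa's exact instruction sequence; a SIMD version does the same thing 16 or 32 bytes at a time with packed compares and POR):

```
// Each "low <= c <= high" test becomes a single unsigned compare, because
// c-low wraps around for c < low and ends up larger than high-low.
func isAlphanumeric(c byte) bool {
	digit := c-'0' <= 9
	lower := c-'a' <= 'z'-'a'
	upper := c-'A' <= 'Z'-'A'
	// Whether the Go compiler emits branch-free code for this is up to it;
	// the SIMD version is branch-free by construction.
	return digit || lower || upper
}
```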


Hey, I just discovered that my blog post was posted here. I indeed have little experience with assembly, and the idea was just: could I use a 'larger register' to keep the table hot in the CPU, the way you would use a 64-bit register? But sadly I failed, kinda gave up, and published the post as is.

The good news is I will do a 'second' step into assembly programming where I will try to implement the algorithm of `dan-robertson`. There will also be a switch case and other tricks suggested in this thread.

From my experience I noticed that I'm missing breakpoints and unit tests to better understand what each assembly step does, which things read from memory, etc. What also doesn't help is that x86 feels a bit like an 'arbitrary patchwork of instructions', where some instructions exist for floats and doubles but not for ints, etc.

It seems a lot of implicit knowledge is needed to do assembly programming, for example `xor A A` to zero a register. So I probably have to read Hacker's Delight cover to cover.

The only assembly I have done in the past was AVR, which is very limited and easy to understand. I could just attach a JTAG debugger and see the registers directly.

Golang is used because it is a nice balance between high and low level, unlike the JVM/Python (JIT compilers), where you can't just disassemble and see the native code.
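
For reference, the standard Go toolchain way to look at that native code (hypothetical asmdemo package; any small function works as a target):

```
package asmdemo

// IsDigit is just a tiny target function to disassemble.
func IsDigit(c byte) bool {
	return c-'0' <= 9
}

// Print the generated assembly while compiling:
//
//	go build -gcflags=-S .
//
// Or build the package's test binary and disassemble just this symbol:
//
//	go test -c
//	go tool objdump -s IsDigit asmdemo.test
```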

Ideally there would be a 'modern Hacker's Delight' which would also include some simple SIMD tricks.

My goal is to eventually reimplement 'Writing a Compiler/Interpreter in Go' in Zig and Rust to learn those languages. After an hour of trying, it seems Zig makes it 'quite' easy to include assembly as well and to do SIMD. But currently I'm on page 40 of 'Writing a Compiler in Go', so I'm still a long way off. For the Zig/Rust implementations I probably need to include a GC in this simple VM (https://craftinginterpreters.com/garbage-collection.html), so there's still a lot to learn.

Also, one 'clever' idea I implemented in the lexer was to allow full UTF-8, but only in string-literal parsing mode. This prevents emojis from being used as variable and function names.
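
A simplified sketch of how that split looks in a lexer (hypothetical helper names, not the actual code):

```
// isIdentByte only accepts ASCII letters, digits and underscore, so
// identifiers can never contain emojis or other multi-byte runes.
func isIdentByte(c byte) bool {
	return c == '_' ||
		(c >= '0' && c <= '9') ||
		(c >= 'a' && c <= 'z') ||
		(c >= 'A' && c <= 'Z')
}

// readString, by contrast, copies bytes verbatim until the closing quote,
// so any valid UTF-8 (including emojis) is allowed inside string literals.
func readString(src []byte, pos int) (lit string, next int) {
	start := pos
	for pos < len(src) && src[pos] != '"' {
		pos++
	}
	return string(src[start:pos]), pos
}
```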

