An Intel Programmer Jumps over Wall: First Impressions of ARM SIMD Programming (branchfree.org)
167 points by signa11 23 days ago | 38 comments

Author here. Happy to answer questions, endure abuse, or, best yet, be put in my place by someone with a Really Nice Table of ARM Latencies and Throughputs that Could Have Been Found If I Wasn't Such an Idiot.

[ Note the title of the article was "Jumps over The Wall", in keeping with the dish. ]

The best you can do for an ARM latency table is ddging

"aXX optimization guide"

a57: http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Co...

a72: http://infocenter.arm.com/help/topic/com.arm.doc.uan0016a/co...

Notably not public is the a53 version of this document. I did some work getting ORB_SLAM to run on the raspberry pi a couple years ago. My conclusion was that the instruction timings for NEON instructions are more or less the same as the a57, but just remember there are only two instructions issued per cycle and everything is in order.

Edit: replace google with ddg

I've seen those. We also can get A55, A75 but notably not A76 either. And what's there is... terse. But generally OK (not a fan of the blank spot in latency/throughput tables, regardless of who is doing it). Sadly, ARM is the exception - Ampere, Marvell/Cavium and Apple don't have a _whiff_ of information like this out there.

BTW: I would suggest you look into the closely related art of GPGPU programming. OpenCL, CUDA, or whatever. The architecture of GPUs is SIMD on steroids. You'll likely be very happy with the metrics associated with GPU-shared memory (NVidia) or Local-memory (AMD).

NVidia publishes throughput metrics in their PTX page: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

AMD's information is harder to find, but it does exist: http://clrx.nativeboinc.org/wiki2/wiki/wiki/GcnTimings


I haven't been doing too much SIMD programming myself, but I know the general intrinsics. I think ARM's biggest problem is documentation. Intel's intrinsics page is great: https://software.intel.com/sites/landingpage/IntrinsicsGuide...

ARM's intrinsics page is... not as good. But I guess still workable. As you've noted, ARM's intrinsics documentation is missing latency / throughput metrics. https://developer.arm.com/technologies/neon/intrinsics

POWER9, by the way, has latency/throughput metrics published. I'm most happy with POWER9 as an alternative CPU; too bad it's so expensive, though.


Intel + Agner Fog is definitely the best documentation for how things work and optimization information. NVidia CUDA and POWER9 seem to be #2 and #3 with documentation.

Since GPUs are available for cheap these days (under $300 and they plug into any motherboard), GPUs seem to be the best alternative computer to learn on. Since they're decently documented, it's good to explore IMO.

I got my start on performance programming doing GPGPU regex (a terrible idea in our context - networked IPS - but that's for another time). This was in 2006, so this meant doing GPGPU using "Cg", including such delights as using textures as arrays and floats for everything. CUDA 0.9 and an engineering sample G80 card were a revelation.

I keep meaning to go back and do some GPGPU again. I am very fond of architectures that actually exist. GPU is a very different bet than SIMD but a lot of fun.

Incidentally PTX is a fiction - more of an IR than something with true throughput and latency numbers. I don't know if anyone programs GPUs directly but that would be interesting.

> Incidentally PTX is a fiction - more of an IR than something with true throughput and latency numbers. I don't know if anyone programs GPUs directly but that would be interesting.

PTX is a fiction, but it's very close to the underlying "SASS" assembly language of NVidia GPUs. There was some research done into this field here: https://arxiv.org/abs/1804.06826

They reverse engineered the SASS assembly language, all the op-codes match up to PTX opcodes. The main difference is that SASS includes control information, such as stall cycles, write or read barrier information.

In effect, it seems like NVidia's compiler figures out stalls and dependencies at compile-time. Memory-dependencies are written to each SASS assembly instruction in Volta, in a format that has changed with each generation.

The typical assembly programmer doesn't care about these details (CPUs typically handle that logic automatically), so it makes sense to ignore them through the higher-level PTX Assembly language.

It seems like PTX is sufficient for understanding the general execution of an NVidia GPU. There's probably no reason to write the underlying SASS code by hand, especially because PTX matches up to the SASS opcodes. Furthermore, it is clear that NVidia is tweaking the details of that memory-barrier and dependency information, and doesn't want programmers to write against that abstraction level.

I see your simdjson code appears to be C but has a dependency for jsoncpp. There's also a graph which compares several different json libraries except jsoncpp. I've used jsoncpp for some other projects and I'd be curious about your opinion and any comparisons you might have.

That could be clearer, sorry. We compare against RapidJSON and sajson for the bulk of the paper and for the 'splash' graphics as these are the fastest two libraries that validate to a similar extent (gason is a little faster than RapidJSON but accepts some crazy stuff).

In the paper, we do also include measurements against 7 other libraries including jsoncpp. It is an order of magnitude slower than we are on the measurements we took. I am not familiar enough with it to know whether that's meaningful; it may have a lot of other attributes that make it more useful for what you're doing than what we have. We're pretty barebones, especially at this stage.

(the broader comparison is in Table 9 at https://arxiv.org/pdf/1902.08318.pdf)

That's pretty interesting, thanks!

You're using C intrinsics instead of assembly, right? Are you sure the compiler isn't doing the scheduling well enough on its own?

If you're thinking about algorithms in terms of how they might embed into a processor, you need to know the throughput and latency numbers even if the compiler is getting things right.

For example, there are some nice possibilities with the TBL instruction (or PSHUFB, or VPERMB, etc) for doing character class membership tests. Which version you use would have a lot to do with whether you think TBL is going to issue 2/cycle, 1/cycle or 0.5/cycle (just fr'instance) - you might design the algorithm quite differently. It's not just a case of picking the one design choice and then having the compiler schedule that - if you're not thinking about this during design, you're not going to be remotely near peak performance.
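To make the table-lookup idea concrete, here is a minimal sketch of one common way a TBL/PSHUFB-style membership test is structured: two 16-entry tables, one indexed by the low nibble and one by the high nibble, ANDed together. It is written as portable scalar C so it runs anywhere; the names `nibble_class`, `nibble_class_build` and `nibble_class_contains` are made up for illustration, and this is not claimed to be the exact scheme simdjson uses. In a vector build, each of the two table reads would be one TBL or PSHUFB covering 16 or more input bytes, which is why the issue rate of that one instruction shapes the whole design.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two 16-entry tables, the shape a TBL or PSHUFB lookup consumes. */
typedef struct {
    uint8_t lo[16]; /* indexed by the low nibble of the input byte  */
    uint8_t hi[16]; /* indexed by the high nibble of the input byte */
} nibble_class;

/* Build tables for the byte set in `members`. Each distinct high nibble
   gets one bit, so this sketch supports at most 8 distinct high nibbles. */
static nibble_class nibble_class_build(const char *members) {
    nibble_class t;
    int used = 0;
    memset(&t, 0, sizeof t);
    for (const unsigned char *p = (const unsigned char *)members; *p; p++) {
        unsigned h = (unsigned)*p >> 4, l = (unsigned)*p & 0x0Fu;
        if (t.hi[h] == 0) {
            assert(used < 8);            /* out of bits: set is too spread out */
            t.hi[h] = (uint8_t)(1u << used++);
        }
        t.lo[l] |= t.hi[h];              /* low nibble l is valid under high nibble h */
    }
    return t;
}

/* Scalar stand-in for the vector version: two lookups and an AND per byte. */
static int nibble_class_contains(const nibble_class *t, unsigned char c) {
    return (t->lo[c & 0x0F] & t->hi[c >> 4]) != 0;
}
```

Usage would be along the lines of `nibble_class t = nibble_class_build("{}[]:,");` followed by `nibble_class_contains(&t, c)` per byte. The 8-high-nibble limit is a property of this particular sketch, not of the instructions.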

Sure but the compiler can compile your intrinsics to different-but-equivalent instructions, and modern compilers actually do that pretty aggressively. Are you sure the code output was actually using the instructions you expected?

Often times you wind up needing to change the entire algorithm to use different instructions with different latency/throughput numbers. The classic case off the top of my head (for Intel at least) is PCMPESTRI versus completely different approaches for substring search.

Yes. I look at the asm all the time, having not fallen off the turnip truck yesterday.

I'm not an expert assembly programmer and I'm definitely not an expert at using SIMD instructions, but the last time I created a NEON implementation of an algorithm I was able to beat GCC's output (from compiling C intrinsics) after an afternoon of tweaking things by hand. From looking at GCC's output it didn't seem like it was doing a great job interleaving ARM and NEON ops to take advantage of that pipeline, although maybe it's gotten better since GCC 5.

How do you feel working on code that is (will be) used by governments and organizations of all kinds to censor the Internet?

One of the uncomfortable aspects of open source is that you don't get to choose who uses it. The majority of customers we were aware of were security customers looking to stop attacks on machines behind next-gen firewalls. We weren't aware of any "democracy suppression" use cases, but it would be naive to assume it's impossible.

(edit - also worth noting that this question seems to refer to my old job; I don't currently work on Hyperscan, although I may go back to regex one day. I'm sympathetic with the line of questioning, though, so please stop downvoting the parent)

I'm curious about what you think about RISC-V's draft V extension.

Scalar Vector Extensions?!

Used to be Scalable VE

Fixed, thanks. I type the word 'scalar' so much, apparently, I didn't notice.

Props to the author for the objective article. Re CA53 throughput and latencies, no -- I don't have the tables, but I've done my fair share of tbl/tbx measurements: https://www.cnx-software.com/2017/08/07/how-arm-nerfed-neon-...

Welcome to hn, we meet again!

Salut! I've finally decided to register for writing, after years of mute reading ; )

Looks like the author is going to enjoy RISC-V SIMD (once it is finalized).

I'm not sure... I like the look of the bit manipulation stuff, but I'm not really a fan of the variable-length vector approach. I think these systems are built to make the world safe for matrix multiply - and simple math workloads - but the short-vector approach (e.g. the typical x86 SIMD style of doing things) is sometimes exactly what you want.

I keep meaning to write more about this. I have found 3 categories of SIMD use in my own work:

1) Doing one thing a gazillion times (e.g. conventional SIMD). This works well on vector machines, of course.

2) Using SIMD registers to do "more stuff". So in a couple string matchers and regex matchers I've designed, you are really using a SIMD register because it's bigger than a GPR register (duh). But this might be because you want to simulate a 512-bit NFA rather than a 64-bit NFA.

3) Using SIMD operations to do weird, irregular stuff where what you're really getting is a substitute for branchy code. I blogged about an example of this called "SMH" https://branchfree.org/2018/05/30/smh-the-swiss-army-chainsa...
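For readers who haven't met category 2, a rough sketch of the 64-bit case may help. This is Shift-And, a standard bit-parallel technique where each bit of a machine word is one NFA state; it is only an illustration and not the matcher from Hyperscan or any other project mentioned here. The function name `shift_and_find` is invented for the example. The wider-register version keeps the same structure with a bigger state word, and the shift across lanes is the part that needs care.

```c
#include <stdint.h>
#include <string.h>

/* Shift-And: bit i of `state` is set when pattern[0..i] matches the text
   ending at the current byte. One uint64_t holds up to 64 NFA states.
   Returns the offset of the first match, or -1 if there is none. */
static long shift_and_find(const char *pattern, const char *text) {
    size_t m = strlen(pattern);
    uint64_t mask[256] = {0};   /* mask[c]: pattern positions holding byte c */
    uint64_t accept, state = 0;

    if (m == 0 || m > 64)
        return -1;              /* does not fit a 64-bit state word */

    for (size_t i = 0; i < m; i++)
        mask[(unsigned char)pattern[i]] |= (uint64_t)1 << i;
    accept = (uint64_t)1 << (m - 1);

    for (size_t j = 0; text[j] != '\0'; j++) {
        /* Advance every live state by one, start a new attempt at bit 0,
           then keep only the states whose pattern byte matches the input. */
        state = ((state << 1) | 1) & mask[(unsigned char)text[j]];
        if (state & accept)
            return (long)(j + 1 - m);
    }
    return -1;
}
```

The inner loop is one shift, one OR, one AND and one table load per input byte regardless of pattern length, which is the appeal; the cost is that the pattern has to fit in the register, hence the interest in 512 bits.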

I'll grant that a lot of code can't be readily adapted to large vectors, but this doesn't seem like much of a problem for (proposed) RISC-V. If you're creating an algorithm that only works on a specific width, you can just `setvl` and assert that you have enough space available. You're likely targeting a specific CPU or class of CPU where you know there will be support for 256-bit vectors or whatever. If someone tries to run it on a cheap embedded CPU, they'll be disappointed that your code won't run on their 64-bit vectors, but this is no different from trying to run AVX code on an Intel Atom.
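A toy model of the two styles, in plain C since there is nothing to compile real vector code against yet. Everything here is an assumption made for illustration: `MODEL_VLEN_BYTES` stands in for whatever the hardware provides, and `model_setvl` only imitates the "grant min(requested, maximum)" behaviour described for `setvl`; it is not an intrinsic or a real API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware vector width in bytes (an assumption of this model). */
#define MODEL_VLEN_BYTES ((size_t)32)

/* Model of setvl: grant min(requested, hardware maximum) elements. */
static size_t model_setvl(size_t requested) {
    return requested < MODEL_VLEN_BYTES ? requested : MODEL_VLEN_BYTES;
}

/* Width-agnostic elementwise code: strip-mine by whatever vl is granted. */
static void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    while (n > 0) {
        size_t vl = model_setvl(n);
        for (size_t i = 0; i < vl; i++)      /* stands in for one vector add */
            dst[i] = (uint8_t)(a[i] + b[i]);
        dst += vl; a += vl; b += vl; n -= vl;
    }
}

/* Width-specific code: request exactly what the algorithm needs and
   report whether the hardware granted all of it. */
static int have_fixed_width(size_t needed_bytes) {
    return model_setvl(needed_bytes) == needed_bytes;
}
```

A fixed-width algorithm would call something like `have_fixed_width(32)` once up front and bail out (or assert) if it fails, which is the moral equivalent of a CPUID check before running AVX code.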

I suppose it's hard to know until we can actually write code for it, but I haven't imagined a scenario where the RV model is significantly worse than packed SIMD, other than a tiny bit of vector configuration bookkeeping. I think that's a worthwhile tradeoff for getting simple, portable, fast code for elementwise operations and implementation flexibility.

I'd love to be convinced otherwise, though.

There are really two kinds of vector code in my experience. The first kind is the standard vector code that most people probably think of, the loops that boil down to:

   #pragma vectorize_me_to_death
   for (int i = 0; i < N; i++) {
     // ...
   }

But there is also a second kind of vector opportunity: small vector opportunities. SLP vectorization is the ur-example here: you scan a block of code for operations that happen to be doing the same operation on different values and make vector code out of it. For this kind of vector code, there is a lot more focus on horizontal and shuffling code than the wide kind of vector code.
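A small example of the second kind, for anyone who hasn't seen it spelled out. The `rgba` struct and `scale_bias` function are invented for illustration. There is no loop at all, just four independent statements of the same shape on adjacent fields, which is the pattern an SLP vectorizer looks for; whether a given compiler actually packs them depends on the target and flags.

```c
typedef struct { float r, g, b, a; } rgba;

/* Four isomorphic, independent operations on adjacent fields. An SLP
   vectorizer can pack these into one 4-lane multiply and one 4-lane add
   (or a single fused multiply-add where the target has one). */
static rgba scale_bias(rgba p, rgba scale, rgba bias) {
    rgba out;
    out.r = p.r * scale.r + bias.r;
    out.g = p.g * scale.g + bias.g;
    out.b = p.b * scale.b + bias.b;
    out.a = p.a * scale.a + bias.a;
    return out;
}
```

Note that the natural width here is fixed by the data (four floats), not by the machine, which is why this style leans on short fixed-width registers and shuffles rather than long vectors.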

I haven't looked at the ARM SVE or RISC-V vector ISAs in detail, but I imagine that they don't support the latter kind of vectorization very well.

The pain of variant SIMD sizes is real, I admit. In the Intel world this means 3 sizes at the moment: 128-bit for Atom and really old stuff (if you care), 256-bit for most mainstream processors and 512-bit for cutting edge - then pick your baselines _within_ those. Not fun. On the other hand, we can compare the current ARM situation, where everyone is stuck at 128 for now.

A "tiny bit of vector configuration bookkeeping" has my hackles going up. I'm used to SIMD operations where it's quite usual that you have latency = 1 and reciprocal throughput = 3. This is a narrow path to walk and not one where "tiny bit" of extra overheads will be welcome. I guess we'll see - I would like RISC-V to succeed, but worry that all these resizable models will be quite slow for the codes I write now.

I'm not sure I understand how callee-saved registers are going to work under RVV, given the way that dynamic reconfiguration works.

My guess is that either no registers will be callee-saved or one group of 8 registers will be callee-saved (in the current draft, you can group 1, 2, 4 or 8 sequential registers, but only starting on a multiple of the group size, so register groups are always going to be contained within one of the four 8-register maximum-size groups); this will require dynamic stack allocation since the vector length is not fixed.
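The grouping rule is easy to get wrong in your head, so here is a sketch of it as stated above for the current draft: groups of 1, 2, 4 or 8 sequential registers, starting on a multiple of the group size, across 32 registers (four blocks of 8). The function names are hypothetical, and the rule itself may change before the spec is final.

```c
#include <stdbool.h>

/* A group is 1, 2, 4 or 8 sequential registers and must start on a
   multiple of its own size. 32 registers are assumed, per the comment. */
static bool valid_group(unsigned first_reg, unsigned size) {
    bool size_ok = size == 1 || size == 2 || size == 4 || size == 8;
    return size_ok && first_reg < 32 && first_reg % size == 0;
}

/* Index (0..3) of the 8-register block that holds the whole group. The
   alignment rule guarantees a valid group never straddles two blocks. */
static unsigned containing_block(unsigned first_reg) {
    return first_reg / 8;
}
```

So `v16` with a group size of 8 is valid and sits entirely in block 2, while a group of 8 starting at `v12` is not allowed.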

The saving sequence would be something like this (after setting up a frame pointer if needed): "vsetvl t0, x0, e8, m8; sub sp, sp, t0; vse.v v16, (sp)".

Are there any cores available that implement it?

This is by the same person who authored Hyperscan, which is also mentioned on this blog & worth a read.

About documentation, Intel’s is not great either. Otherwise I wouldn’t bother making this: https://github.com/Const-me/IntelIntrinsics

https://software.intel.com/sites/landingpage/IntrinsicsGuide... is kept in my browser history. It gives pretty good documentation of most of the details of intrinsics (including timing information on some processors for some instructions), although it is missing the enum definitions.

I did a similar thing, but the output is Unix man pages: https://github.com/WojciechMula/man-intrinsics

Nice work. By the way, on the x86 simdjson did you use any of the string compare instructions in SSE 4.2? Or was AVX faster?

SSE4.2 is bollocks. It was DOA. There are better ways of doing most of it using PSHUFB and what's more, those ways were better when SSE4.2 arrived. It's only gotten slower, relatively speaking, as it is exiled to the edge of the die and has not been promoted to wider regs (AVX2, AVX512).
