LDM: My Favorite ARM Instruction (keleshev.com)
145 points by one_and_only on Oct 15, 2020 | 127 comments



This instruction makes the design of ARM CPUs harder.

As an example, consider page-fault handling for a situation where part of the access is valid and the other part is not, especially for a store operation. Or out-of-order execution. Or, for that matter, "load multiple" from the PC (which is exposed to the programmer in quite a peculiar way) - you have to support it.

ARM is full of these quirks. I can see why they did that back in the day, but today, or even 30 years ago, these quirks are not good decisions.


Yes, this instruction was removed in 64-bit arm for this very reason.

It has been replaced with LDP (load pair, i.e. load two registers), which is not nearly as nice for software developers but makes life much easier for the ARM designers. For comparison:

https://elixir.bootlin.com/linux/v4.14.52/source/arch/arm64/...

https://elixir.bootlin.com/linux/v4.14.52/source/arch/arm/ke...
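
For a rough feel of the difference, here is a hedged sketch (not taken from those kernel files; the register choices are arbitrary) of a function epilogue in both worlds:

    @ AArch32: restore the callee-saved registers and return, all in one LDM
    ldmfd   sp!, {r4-r8, r11, pc}

    // AArch64: the same restore takes one LDP per register pair, plus an explicit RET
    ldp     x19, x20, [sp, #16]
    ldp     x29, x30, [sp], #32
    ret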

BUT, I don't think we will see real benefits of this unless we remove aarch32 support from ARMv8.


Many aarch64 chips have no aarch32 support for reasons like this


Many? Arm Ltd-designed Armv8-A chips fall into three tiers:

• Support AArch32 at EL0-EL3 (everywhere)

• Support AArch32 at EL0 (user mode)

• No AArch32 support

Most fielded cores fall into the first category. The newer cores - the A76 and its contemporaries, IIRC - fall into the second category. Only its recently disclosed data-center cores fall into the third category.


Well, all the recent Apple chips...


That’s like five chips :P


Very true, but that's the classic 32-bit architecture only; aarch64 is much cleaner. The next big A-class CPUs will be aarch64-only.


If that is "hard", I would hate to see what you think of things like cache coherency across cores...

> As an example, consider page-fault handling for a situation where part of the access is valid and the other part is not, especially for a store operation.

You do a check for the full range before the access, because for best performance you would want to pull in as much as possible in one go anyway.

> Or out-of-order execution.

Everything gets turned into uops, this instruction can emit more than one.

"load multiple" from PC (which is exposed to programmer in quite peculiar way)

What's hard about that?


>If that is "hard", I would hate to see what you think of things like cache coherency across cores...

Cache coherence across cores is easy. The number of corner cases is much smaller.

The number of corner cases in old architectures (MIPS, or SPARC) is staggering. I designed a MIPS core prototype and was done in a single week with all the arithmetic and memory-access instructions, then spent three weeks designing, implementing and debugging branch handling, due to the branch-delay-slot "feature".

>You do a check for the full range before the access, because for best performance you would want to pull in as much as possible in one go anyway.

You cannot do that - pull in as much as possible in one go - in the case where a page boundary is crossed. And the very fact that you have to check before issuing the operation makes things more complex than they could be.

>> "load multiple" from the PC (which is exposed to the programmer in quite a peculiar way)

>What's hard about that?

State handling. Can this instruction be interrupted?

>>Or out-of-order execution.

>Everything gets turned into uops, this instruction can emit more than one.

There are out-of-order execution engines other than the "everything gets turned into uops" kind - scoreboarding, for example, provides most of the benefits at much lower energy cost.


What, you don’t like xoring (or should I say, eoring…thanks ARM) the program counter?


I always preferred EOR. Maybe the cute factor, or maybe because I first played with assembly language on an Acorn Electron (and later a full BBC Master Series), where I think the nomenclature comes from (the 6502 assembler built into BBC BASIC called the instruction EOR). ARM was originally Acorn RISC Machines, and the original designs evolved from the CPUs designed for the Acorn Archimedes range of computers.


> the cute factor

I think PowerPC's EIEIO wins on cute.


ARM got the EOR spelling from the 6502.


I've heard POWER engineers kvetch for the same reason about lm/lmw (Load Multiple Word, which is an equivalent instruction). I recall discussion about whether to deprecate it, but, as far as I'm aware, it's still a legal instruction as of 2020.


> As an example, consider page-fault handling for a situation where part of the access is valid and the other part is not,

The OS has to handle this case anyway for misaligned loads and stores.


The hardware has to handle this case for misaligned loads and stores.

Are load and store (multiple or misaligned) atomic? If so, then hardware has to check things in advance and trap. If not, then there is a possibility for some part of state to be lost.


Unaligned loads and stores are not atomic.

Misaligned loads and stores that cross page boundaries are handled by the OS and hardware together. The OS must ensure that the post-condition of the page-fault handler leaves both pages on either side of the boundary in an accessible state.


What I am looking for is not "accessible" but "consistent".

What you describe above makes hardware design harder (one has to split an unaligned access into two (unnecessary) micro-operations, which precludes, or at least complicates, techniques simpler than an out-of-order execution engine with content-addressable memory). It also makes software handling harder.

One can designate unaligned access as "undefined behavior" and leave it to compiler writers to make things right.


> One can designate unaligned access as "undefined behavior" and leave it to compiler writers to make things right.

Lots of hardware architectures have tried to levy this requirement on the software. Every one has subsequently walked it back and provided unaligned access support in hardware. Even RISC-V, which is as orthodox a RISC as it gets, supports unaligned access in hardware.

So that's got to be the baseline for deciding how much additional work must be performed when supporting something new: Any memory instruction may touch two pages, but the vast majority of them won't.


MIPS has four instructions (two for loads and two for stores) exactly for the unaligned-access situation, when it cannot be handled in software. This is exactly what I am talking about. BTW, MIPS was proud that its ISA was as good as microcode, and the two-instruction solution is exactly that - expose as much of the abstract internals as possible.

Thus, I cannot agree with you that "every hardware architecture walked it back".

https://en.wikipedia.org/wiki/Lexra - for a historical exposé.


MIPS did in fact walk it back with MIPSr6 and nanoMIPS. The old unaligned-half-access instructions were dropped entirely, and most memory access instructions support unaligned access natively.


https://en.wikipedia.org/wiki/MIPS_architecture#MIPS32/MIPS6...

It contains "possibly via trapping", which I am not at all against. What is interesting there is that MIPS introduced the BALIGN instruction, which reminds me of the first generation of the Alpha ISA.

Alpha did not have byte or word (16-bit) memory access. These had to be synthesized using 32-bit word accesses and some shuffling. They walked that back for code-size reasons, and later Alpha ISAs had byte and 16-bit loads and stores. I do not have a reference ready, but I believe they were aligned accesses.


On the original ARM2 this design made a lot of sense because if you understood how the instruction timing worked, you could essentially perform as well as a specialized blitter. I never understood this at the time, and largely blame that on a popular ARM Assembly Language book I had that said that LDM/STM were just convenience instructions, essentially syntactic sugar. They most certainly were not. With the right alignment and using enough registers you could get 2-3x the throughput of using single word load/stores. I honestly think if that book hadn't gotten it so wrong there would have been a lot more decent games for the Archimedes machines.
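
For anyone curious, a minimal sketch of the classic technique (assuming r0 = source, r1 = destination, r2 = byte count, a multiple of 32):

    copy_loop:
        ldmia   r0!, {r3-r10}       @ load 8 words (32 bytes), advancing the source pointer
        stmia   r1!, {r3-r10}       @ store the same 8 words, advancing the destination
        subs    r2, r2, #32         @ count down by 32 bytes
        bne     copy_loop           @ repeat until the count reaches zero

The win comes from the sequential memory cycles inside LDM/STM: after the first access the rest run back-to-back, whereas a loop of single-word loads and stores pays the per-instruction overhead (and an opcode fetch) on every word.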


It's not that they couldn't do it, it's that LDM is a terrible instruction and was rightly removed in the cleanup that is arm64. See also: x86 LOOP...


IDK, it's a fantastic instruction... for in order, cache poor designs. It's awful for the gate counts AArch64 is targeting though.


LOOP was a great instruction until Microsoft messed it up. If the instruction went fast enough, a timing loop in Windows got messed up, and Windows wouldn't boot. People would blame the CPU vendor for the bug, so LOOP was purposely made slow.


This looks like an interesting story; I tried to Google it, but couldn't find any reliable source. Isn't it easier to fix Windows than to hack the processor?

I could find some reasons here, though: https://stackoverflow.com/questions/35742570/why-is-the-loop...



I recall that the affected OS was Windows 95 (could be 98). The first to hit the problem was AMD. Although not pre-internet, it nearly is, so finding info would be hard.

People didn't get automatic updates from fast always-on connections. Lots of people had slow dial-up connections. For some people, even that wasn't an option, so an update would require physical media.


Intel does that with a lot of old instructions, e.g. MMX and SSE instructions are really, really fast on AMD but horrible on Intel. I guess this is to force people to upgrade to the latest and greatest AVX-whatever.


Citation needed? I've never encountered that, and I've used plenty of those.

The main reason old instructions become slow is that the hardware implementation is replaced by a microcoded version.


MMX these days costs extra instructions on both AMD and Intel CPUs. This has been the case since SSE and SSE2 were introduced, and later the AVX extensions.

There has quite possibly been a point in time when certain MMX instructions ran faster on some AMD CPUs and required fewer uOps.

https://agner.org/optimize/ is a pretty good source for uOps and instruction cycles.


I mean, they couldn't have kept it even if they wanted to, since the bitmask itself would take up the entire instruction width. stp/ldp is there, though.


If it didn't have the other issues, there are various ways it might have been saved, for example having it operate on pairs of registers, or splitting it into separate instructions for the low half and high half.


nanoMIPS uses a counted contiguous range of registers in its version of this instruction.


TIL there’s a thing called nanoMIPS.


Tragically LOOP is alive and well in x86-64.


EDIT: Blergh, confused LOOP with REP, but keeping the below comment so the rest of the thread still makes sense.

FWIW, LOOP isn't the worst thing in the world once you have dedicated silicon generating micro-ops in the instruction-decode pathway anyway. It's just a pretty cute run-length encoding scheme for the instruction stream.


It's slow as sin, though. Just straight emulating it using more common instructions is like 4x better in most modern Intel CPUs. For some insane reason, it emits 8 uops on Skylake.


There is a reason why loop is (was made) slow: It was (in the 90s) explicitly made slow because it was used for timing loops. Making it faster would have broken existing software.

Source: https://stackoverflow.com/a/35743699

See also https://stackoverflow.com/questions/35742570/why-is-the-loop...


The things you link to don't seem to say that at all; they seem to say it got slow because it was hard to implement and no one cared about it.


"IIRC LOOP was used in some software for timing loops; there was (important) software that did not work on CPUs where LOOP was too fast (this was in the early 90s or so). So CPU makers learned to make LOOP slow."

"(My opinion: Intel is probably still making it slow on purpose, and hasn't bothered to rewrite their microcode for it for a long time. Modern CPUs are probably too fast for anything using loop in a naive way to work correctly.)"


It's also very fast on AMD (not any slower than the equivalent dec/jnz), so use it if you want your software to run faster on AMD and slower on Intel...


Sure, it doesn't matter anymore because anyone who cares is going through the vector unit to do bulk transfers. But there were issues with doing unaligned base and length memory transfers for the longest time, well through x86_64's original design.


are you confusing LOOP with the REP prefix?


I absolutely am, thanks!


My gripe with LOOP and other crappy instructions is that they use up valuable space in the instruction encoding.


If we're talking old ARM, then my favorite is the if-then (IT) instructions, which conditionally execute up to three instructions based on a 1-3 bit field indicating which instructions to execute if the condition is true or false.

I believe they could be mixed so you could have it.tft - in which case you’d execute the first and third instructions if the condition was true, and the second if it was false.

Needless to say, this is a challenging instruction to deal with in any kind of out-of-order processor, or any kind of branch predictor (logically a single instruction could produce multiple branches).

I know that in modern ARM processors the performance is much worse than a short jump, so I'm guessing that they just give up when they see IT instructions and execute them in order.

Alas, this instruction is gone in ARM64. It's obviously better for CPU designers, but I will miss it forever.


That isn’t quite how it worked: every instruction has a 4 bit condition code field. It is usually AL “always” which is omitted in assembler. There wasn’t any limit on how many conditional instructions you could have in sequence, but a branch is usually faster than three or so conditionals.

It gets really fun when you combine condition codes with comparisons, or with ALU instructions that set the processor status bits: you can do some quite intricate logic in very little space.
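
The textbook example (a sketch in classic AArch32, inputs assumed in r0 and r1) is Euclid's GCD with no branches inside the loop body:

    gcd:
        cmp     r0, r1          @ compare, setting the flags
        subgt   r0, r0, r1      @ if r0 > r1: r0 = r0 - r1 (flags left untouched)
        sublt   r1, r1, r0      @ if r0 < r1: r1 = r1 - r0
        bne     gcd             @ loop until r0 == r1; the GCD ends up in r0
        bx      lr

Only the CMP sets the flags, so the two conditional subtracts and the branch all key off the same comparison.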


You're both right... the 32-bit ARM encoding gives every instruction a 4-bit condition field, but the 16-bit Thumb and mostly-16-bit Thumb2 encodings use the IT instruction to carry the conditions for the next few instructions.
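
In unified Thumb-2 syntax that looks something like this sketch (register choices arbitrary):

    cmp     r0, #0
    ite     ge              @ if-then-else block covering the next two instructions
    movge   r1, #1          @ executed when r0 >= 0
    movlt   r1, #0          @ executed when r0 <  0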


Ah cheers! I didn't think about it being thumb only (I never used raw arm32, so in my brain thumb2 is all that existed :D )


Is a risc still a risc when you have complex instructions like this?


The best answer to any RISC vs CISC question is the analysis John Mashey posted to Usenet comp.arch https://www.yarchive.net/comp/risc_definition.html (Mashey was a CPU designer for MIPS.)

In the analysis he counts instruction-set features like the number of registers, number of addressing modes, and number of memory accesses per instruction. He compares over a dozen architectures.

ARM comes out as the least RISCy RISC, but definitely on that side of the line, and x86 as the least CISCy CISC. (This was before amd64.)


I think the notion of CISC and RISC are practically meaningless in 2020.


Yeah. RISC kinda became a shorthand for "fixed instruction length, load-store" and CISC became a synonym for "x86/amd64".


It's for sure a hybrid, given that these were microcoded on early ARM cores. But it's a really, really useful halfway point: those early ARM cores lacked caches, unlike prototypical RISC chips, and these instructions would otherwise be competing with the memory transfers themselves if they didn't maximize density into a single aligned instruction.


Maybe not, but it's still at least an order of magnitude fewer instructions than x86 :)


It is my understanding that it is "reduced-instruction set computer" rather than "reduced instruction-set computer".

That is, the "reduced" means each instruction does less, rather than specifying the number of them.


Wikipedia[1] both corroborates and disagrees;

> A RISC is a computer with a small, highly optimized set of instructions

but later:

> The term "reduced" in that phrase was intended to describe the fact that the amount of work any single instruction accomplishes is reduced—at most a single data memory cycle—compared to the "complex instructions" of CISC CPUs that may require dozens of data memory cycles in order to execute a single instruction

1. https://en.wikipedia.org/wiki/Reduced_instruction_set_comput...


From this[1] piece it seems the original goal was indeed both:

> Cocke and his team reduced the size of the instruction set, eliminating certain instructions that were seldom used. "We knew we wanted a computer with a simple architecture and a set of simple instructions that could be executed in a single machine cycle—making the resulting machine significantly more efficient than possible with other, more complex computer designs," recalled Cocke in 1987.

[1]: https://www.ibm.com/ibm/history/ibm100/us/en/icons/risc/


IIRC the idea is that the judge of complexity is ultimately how directly it maps to the underlying implementation. For example, VLIW machines follow the same principles but with the focus on superscalar execution, i.e. they favour explicit parallelism defined in the instruction stream as opposed to dynamic circuitry implementing instruction reordering and dependency tracking.


But that's not all!

LDM can access banked user registers. As a special case, this disables base register updates.

LDM can copy SPSR to CPSR, causing an exception return with register banking. This is a way to go from kernel code to user code.
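
For example, the classic AArch32 exception return is a single LDM with the '^' modifier (a sketch; the exact register list depends on what the handler saved):

    @ running in a privileged mode, with the interrupted context on this mode's stack
    ldmfd   sp!, {r0-r12, pc}^     @ restore registers; with pc in the list, '^' also
                                   @ copies SPSR -> CPSR, switching back to the old mode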


Nice to see it mentioned: the explicit pop-into-PC is my favorite way to return from a subroutine.


And it also allows you to pull off some quite elegant ROP attacks! :)


Looks like nice code. I have always liked ARM assembler for its cleanliness. But who does a lot of assembler programming? Those who do might well be horrified by the cycles needed. So I am not sure that nice code is paramount for the success of the instruction set.


Compared to the alternative (up to 16 load instructions separately, with all of their associated fetch bandwidth), the number of cycles it takes looks really good. It's a way of amortizing the overhead of the instruction across all of the loads being issued.


Presuming your CPU 1. is pipelined, and 2. has a clock frequency higher than the memory bus speed, wouldn’t each memory fetch be guaranteed to take more than 1 CPU cycle (let’s say X cycles) to resolve? If the static CPU-side execute-phase overhead for each load instruction is M cycles, wouldn’t the total load time just be M+XN — because the static phase of each successive pipelined op is occurring while the previous ops are waiting around for RAM to get back to them?

Sort of like serially launching N async XHRs. The “serially” part doesn’t really matter, since the time of each task is dominated by the latency of the remote getting back to you; so you may as well think of it as launching them in parallel.

The only practical difference I can see is that LDM results in smaller code and so improves instruction-cache efficiency.


LDM is from a world where single-cycle memory access without caches is a thing. Whether that's classic systems like the ARM1, when DRAM was actually faster than CPU clocks (you could even run multiple CPUs out of phase with each other and just have them both hit the same DRAM), or embedded systems like the Game Boy Advance or Cortex-M cores, where main memory can (but doesn't have to) be single-cycle-access SRAM or NOR flash.


At the time of the ARM1 and ARM2, DRAM typically had a 120ns cycle time (tho you could get 100ns DRAM). The Archimedes machines ran at 8MHz so they ran at the limit of the DRAM bandwidth. There were some early prototypes which could run at 12MHz but that required fast DRAM and careful component selection or the computer would not run reliably - look for “fast A500” at http://chrisacorns.computinghistory.org.uk/Computers/A500.ht...

Other CPUs were not as good as the ARM at using the available DRAM bandwidth.


> up to 16 load instructions separately

That's not the only alternative. Here's AVX2 code that copies 32 bytes with 2 instructions:

    vmovdqu ymm0, ymmword ptr[rsi]
    vmovdqu ymmword ptr[rdi], ymm0
 
Pretty sure ARM NEON can do something similar.
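
Something like this, I believe (a sketch in AArch32 NEON syntax, with r0/r1 as the assumed source/destination pointers):

    vld1.8  {d0-d3}, [r0]      @ load 32 bytes into d0-d3
    vst1.8  {d0-d3}, [r1]      @ store the same 32 bytes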


The point isn't the number of bytes transferred, but being able to dump the integer register file to/from main memory quickly (and issue an indirect jump at the same time!)

Also, the gate counts where these instructions really shine don't tend to have vector units.


But NEON is optional, not available in many "truly" embedded implementations.


Not optional in AArch64 though.


So is AVX2.


So, 1 instruction, but how many clock cycles does it take to load 16 registers?


Who cares? Feel that code density!

More seriously, yup: if you read the architecture manual, this instruction can take a long time. It's also a pain for an OoO (not superscalar!) CPU to keep track of all the dependencies.

It made complete sense back in the early 80s when the instruction set was designed, but like universal condition-code-dependent instruction execution, it was dropped from ARM64 for good reasons.


Superscalar doesn't really matter here (just throw in an interlock), but out of order very much does.


True! I was (inappropriately) thinking of SS as a combination of both.


Why would it be harder for OoO to keep track of this than N mov/load instructions? Is it just that the dependency tracker is working in instruction space?

As mentioned before, these were microcoded anyway, so it may end up "compiling" down to N loads, which should track just as well.


Imagine you hit a bus error on the last memory access of LDM. You already updated 15 registers, so now you need to undo all of that.

If instead you had executed 16 separate LD instructions, the earlier ones would all have retired and updated state by the time the last one gets the bus error.

It’s all doable of course, but it means your OOO algorithm and physical implementation need to handle a lot more state in flight.


Why couldn't you just not undo it, and have it read/write twice? This wouldn't be a problem if you're accessing normal RAM, like the prologue and epilogue of a function. Add a comment to the manual saying not to use this instruction when reading/writing hardware registers where the access itself changes state.


Not all instructions are re-startable. ldm/stm that writes back to the base register is only re-startable if the write-back and the update to PC are last.

The microcontroller-scale cores have a few extra bits in the control/status register that govern restart of an interrupted ldm/stm or IT (if-then, a form of hammock predication).


> Imagine you hit a bus error on the last memory access of LDM. You already updated 15 registers, so now you need to undo all of that.

Irrelevant. "Undo all of that" is just a matter of using an older version of the register alias table. The retirement process has to do that for any and all faults anyway.


My point was that undoing multiple writes is not as easy as undoing one write. Your register file (really, reorder buffer) either needs to add more write ports to it, or you need a state machine to complete that clean-up over multiple cycles.

Without that instruction, your hardware design could have assumed that reverting state can always be done in one cycle. That makes the hardware implementation easier.


Your point is based on a fundamental misunderstanding of what the reorder buffer (ROB) is doing.

Any time there is a fault, it is detected by the retirement stage as it inspects the ROB. There are already hundreds of state changes that must be rolled back: all of the succeeding entries in the ROB. So the work performed by the recovery mechanism isn't changed by the amount of clobbered state because it is completely clobbered! The ROB must rewind it all.

The register file doesn't need any additional ports, because the fault recovery mechanism doesn't touch the physical registers. It only updates the register alias table (RAT) to point to the appropriate registers. But that machine must be capable of restoring 100% of the register pointers very quickly no matter what. Whether the faulting instruction touched one, two, or 16 registers doesn't matter. The 100+ updates following it are enough state that the answer must be: "Everything".


You are correct for complex CPUs with an actual ROB with hundreds of entries. For those micro-architectures, LDM is not particularly stressful. I was focused on the microcontroller case, with much simpler pipelines. There, LDM is one of those instructions that causes the pipeline a lot more headaches than it's worth.


Microcontrollers don't need additional ports either. Take a look at the ARMv7-M status/control register. It has a few bits of state to allow ldm, stm, and if-then to be resumed in-place.

At that scale, those mechanisms aren't micro-coded, they are state-machined. You just have to restore the state machine.


Chances are high it depends on the CPU, whether the memory read/written straddles a cache line, etc.

https://gab.wallawalla.edu/~curt.nelson/cptr380/textbook/adv... says it takes 2 + N cycles, plus 2 if the PC is in the list.

Storing them takes 1 + N cycles, according to that PDF.
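
Taking those figures at face value, and assuming roughly 3 cycles for a single-register LDR on the same generation of cores (an assumption, not something from that PDF), restoring 8 registers amortizes nicely:

    one LDMIA of 8 registers:  2 + 8 = 10 cycles, one opcode fetch
    8 separate LDRs:           8 x 3 = 24 cycles, eight opcode fetches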

(Also, this kind of instruction isn’t a novel idea. 68k had MOVEM, for example (http://68k.hax.com/MOVEM))


According to https://developer.arm.com/documentation/ddi0165/b/instructio... , it's 17 clock cycles if the next instruction uses the last register (r15), otherwise 16


If anyone is interested in why R15 is special-cased here a bit: it's because R15 is the program counter, and that extra cycle is the interlock where they'd otherwise have to put a delay slot.


Hmm, I didn't know about that. Then it would be 20 or something.

(The special case I accounted for is because there's a bubble in the instruction pipeline if the next instruction uses the last register)


Are instructions typically variable in the number of cycles they take?

If, as the article claims, pop is truly aliased to ldm, then I'd expect it to be very fast.


They are; this one takes from 2 to something like 20 cycles, depending on the number of registers, and whether the PC is one of them.


> Are instructions typically variable in the number of cycles it takes?

Very much so, either due to memory latency or just inherent complexity. For example, a single division instruction can easily take the same time as ten adds.


I meant variance for the same instruction. I'd expect a divide to be slower than an add, but I wasn't sure if a single instruction would have a range.


Yes, even in very simple architectures instructions often take different numbers of cycles.


Fewer than 16 individual loads would take, since you're skipping 15 opcode fetches.


Do you know how bools are stored in structs on the ARM Cortex-M4? I am getting faster code with IAR if the bools are at the beginning of the struct, together with the other uint16 and uint8 members. However, the compiler manual claims that bools are 8 bits, which means the code should be faster when the bools are at the end of the struct, due to better alignment of the other members. Also, ARM uses 32-bit memory fetches, so how are bools fetched then? Does that mean loading a bool is not deterministic in terms of the number of cycles?


On m68k my favorite instruction was MOVEM, which was similar except that we had to provide a register range.

MOVEM was the key to fast buffer->screen copies.


ARM looks to be even friendlier than my previous favorite assembly language on the Motorola 680x0 family. I'm tempted to learn it but wonder if I could produce code that would take advantage of processor parallel execution as well as a good optimizing compiler.


I guess this was inspired by the 6809's PSHU/PSHS/PULU/PULS instructions, and indirectly, the VAX's PUSHR/POPR instructions. It's way out of place on a RISC but was a concession to practicality.


The idea of loading/storing multiple registers very much predates microprocessors and very much predates the VAX.

The earliest appearance I know of is in the IBM System/360, first released in 1964. The instruction is probably even older than that.

The ARM did change the instruction, arguably improving it. The IBM instruction loaded/stored multiple registers, but the registers were required to be consecutive. The ARM instruction has a register mask.

See IBM System/360 Principles of Operation, page 26: http://bitsavers.trailing-edge.com/pdf/ibm/360/princOps/A22-...


Pretty much all CISC chips in fact have cute ways of compressing instruction fetch bandwidth when transferring large amounts of data, since every CISC out there was designed in a time before caches. Without separate I/D caches the fetch bandwidth and data bandwidth would otherwise conflict. That's why you see dedicated memcpy, strlen, etc instructions on the more complete CISC designs. The RISC movement in general was very much a recognition of how ISAs could change with gate counts that allowed guaranteed caches.

As you note, it's a place where ARM shows itself to be a hybrid RISC/CISC design, required because the ARM1 didn't have caches.


It slightly reminds me of the x86 instructions PUSHA/POPA.


No exposure to assembly apart from #'disassemble in the REPL; just wanted to praise the beautiful page design. Makes me want to dive into Keleshev's book.


If the author is reading this: would you consider providing an Atom feed for your blog? I'm sure a lot of people here would love to subscribe.


On removal in ARM64: do LDP and STP load and store a fixed pair of registers? e.g. (r0, r1), (r2, r3) ..


LDM/STM are awful for modern cores, and were even microcoded on at least the early ARM cores despite them being, you know, RISC.

As for why it's awful, it's the amount of loads and stores that can be in flight associated with a single instruction when an exception is taken, and how the instruction is restarted afterwards. ARM-M cores make this a little better by exposing "I've executed this many loads so far in the LDM" in architectural state so it can be restarted without going back and reissuing loads (so it can be used in MMIO ranges), but the instruction really only shines in simple, in order designs. They're almost as bad to implement on modern cores as the load indirect instructions you see on some older CISC chips.

Additionally, LDM/STM really shone in either cache-less or cache-poor designs where instruction fetch is competing for memory bandwidth with the transfer itself. That doesn't really apply to these modern cores with fairly Harvard-looking memory access patterns. Therefore, getting rid of these instructions isn't the biggest deal in the world.

So to answer your question, they absolutely could have done that, but chose to use the transition to AArch64 to remove pariahs like LDM/STM from around their neck because they're more trouble than they're worth from a hardware perspective on modern OoO cores. The LDP/STP instructions are the bone they throw you to improve instruction density versus memory bandwidth to/from the registers, but they don't really want each instruction being responsible for more than a single memory transfer for core internal bookkeeping reasons.


LDM/STM were also the source of some really wacky hardware bugs on some STM32 microcontrollers, such as:

> If an interrupt occurs during an CPU AHB burst read access to an end of SDRAM row, it may result in wrong data read from the next row if all the conditions below are met:

> • The SDRAM data bus is 16-bit or 8-bit wide. 32-bit SDRAM mode is not affected.

> • RBURST bit is reset in the FMC_SDCR1 register (read FIFO disabled).

> • An interrupt occurs while CPU is performing an AHB incrementing bursts read access of unspecified length (using LDM = Load Multiple instruction).

> • The address of the burst operation includes the end of an SDRAM row.

(https://www.st.com/resource/en/errata_sheet/dm00068628-stm32..., section 2.3.5)

Admittedly, this was a silicon bug in ST's memory controller. But it was a bug which was only triggered by LDM/STM instructions!


FWIW, disabling the read FIFO like that would be a really goofy choice, but yeah, very good point. These are very special-cased instructions that behave differently even from the DMA transfers you might expect them to look like.


There's a reason for that. Another errata for the same part explains that:

> If an interrupt occurs during an CPU AHB burst read access to one SDRAM internal bank followed by a second read to another SDRAM internal bank, it may result in wrong data read if all the conditions below are met:

> • SDRAM read FIFO enabled. RBURST bit is set in the FMC_SDCR1 register

> • An interrupt occurs while CPU is performing an AHB incrementing bursts read access of unspecified length (using LDM = Load Multiple instruction) to one SDRAM internal bank and followed by another CPU read access to another SDRAM internal bank.


Could it also be the case that this is rendered partially obsolete by vector instructions? Obviously vector loads/stores don't cover all these cases, but I have to imagine they cover quite a few, and without all the bookkeeping (who knew loading one big thing would be so much easier to keep track of than loading a handful of tiny things).


No, because you still want fairly dense dumps of registers out to cache for function prologues. So blits from the integer register file still show up in your profile traces, hence LDP/STP.


According to https://developer.arm.com/architectures/learn-the-architectu... , it looks like it can be any two registers


Seems like a step backward. If the constraint was the number of bits in the mask then each bit could have represented a register pair.


The constraint absolutely was not that; it was that they don't want an instruction that can load more than two cache lines' worth of data, or that can write to more than two registers.


Limiting the registers to be specific pairs is awful if you're implementing a register allocator.


Is it? These are stack-management instructions; you know in advance from the platform ABI which registers are "scratch" and which are "save", so if you allocate any of the "save" registers in your function you emit the corresponding push/pop pairs at the start and end of the function. At worst you push/pop a register you don't need to - but in ARM64, the stack has to be 128-bit aligned, so you have to push/pop pairs.
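
Concretely, a typical AArch64 prologue/epilogue sketch (assuming the function needs two callee-saved registers plus the frame pointer and link register):

    stp     x29, x30, [sp, #-32]!   // push fp/lr, pre-decrementing sp (16-byte aligned)
    stp     x19, x20, [sp, #16]     // save two callee-saved registers as a pair
    // ... function body ...
    ldp     x19, x20, [sp, #16]     // restore the pair
    ldp     x29, x30, [sp], #32     // restore fp/lr and pop the frame
    ret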


Then your allocator would need to know that if it's already decided to use one register of a pair that the other half of the pair gets the save/restore for 'free' and is now better than using a different callee-saves register. I suspect that unless your allocator was designed from the start to be able to deal with that kind of "my choice of register here affects costs and thus my decisions about allocation for a completely different value over there" it's not going to be able to do a great job under that kind of constraint.


Callee-saved registers are saved on function entry, all at once. There is no interaction other than the register allocation step choosing how many values need to be preserved across function calls.


OK, but that only works for prologue/epilogue --- it's better if you can use the instructions elsewhere too.


would a compiler even emit an instruction like this?


The 'original' ARM C compiler (Norcroft) certainly did. As other commentators have pointed out, it was extremely powerful and a big performance boost in the early days of shallow pipelined, in-order, small (or no) cache ARM processors.


Ah, I had assumed that any benefit would be purely in terms of code size/instruction cache usage. That's interesting!


Yes, it's basically a function prologue/epilogue in a single instruction. In practice any disassembly of an AArch32 program will have loads of these. It saves instruction cycles in most cases.

That wasn't the actual design goal, though: Acorn was trying to get away with not shipping a DMA controller, which was quite expensive kit at the time they were shipping the Archimedes. Having an instruction pair that let you use the registers as a DMA buffer let them get similar performance and save a lot of per-unit cost.


Short memcpy can also get lowered to ldm/stm.


Like, all the time. Remember that POP and PUSH are aliased to LDM- and STM-type instructions, so at the very least they're emitted when entering and returning from functions (exceptions apply).



