
There are couple of tricks you can do if you fiddle with the bits of a floating point value using integer arithmetic and binary logic.

That was a thing back in the '90s.
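
For readers who never ran into this class of trick, here is a minimal sketch of the best-known example, the Quake III style approximate reciprocal square root; the magic constant and the single Newton-Raphson refinement are the usual textbook formulation, shown only as an illustration:

    #include <stdint.h>
    #include <string.h>

    /* Approximate 1/sqrt(x) by reinterpreting the float's bits as an
       integer, shifting/subtracting to get a rough guess, then doing one
       Newton-Raphson step. Classic 1990s bit-fiddling. */
    static float fast_rsqrt(float x)
    {
        float half = 0.5f * x;
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);    /* view the float as raw bits */
        bits = 0x5f3759df - (bits >> 1);   /* crude initial estimate */
        memcpy(&x, &bits, sizeof x);
        return x * (1.5f - half * x * x);  /* one refinement step */
    }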

I wonder how big the performance hit from moving values between the integer and float pipelines is nowadays.

The last time I looked into that was on the Cortex-A8 (first-iPhone era). Doing that kind of trick cost around 26 cycles (back and forth) due to pipeline stalls back then.


There are basic integer operations in the FP/SIMD units on most CPUs, so there's generally no need to “move back and forth” unless you need to branch on the result of a comparison, use a value as an address, or do some more specialized arithmetic.

On x86 there is sometimes (depending on the specific microarchitecture) an extra cycle of latency when using an integer operation on an xmm register last used with a float operation.

I have seen it explained as the integer and float ALUs being physically distant, with the forwarding network needing an extra cycle to transport the operands.


This is correct, but it’s happily pretty rare for it to matter in practice (because the domain bypass penalty is small and does not directly impact throughput, only latency).

(For that matter, though, most modern FP/SIMD units have a direct approximate-reciprocal instruction that is single-cycle throughput or better and much more accurate--generally around 10-12 bits, so there's no need for this sort of thing. See, e.g. FRECPE on ARM NEON and [V]RCPP[S/D] on x86.)
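
For the curious, here is a rough sketch of how that kind of approximate-reciprocal instruction is typically reached from C via SSE intrinsics, with the customary single Newton-Raphson refinement; treat it as an illustration rather than a tuned routine:

    #include <xmmintrin.h>   /* SSE intrinsics; x86 only */

    /* RCPPS gives roughly a 12-bit reciprocal estimate of four packed
       floats; one Newton-Raphson step (r' = r * (2 - x*r)) roughly
       doubles the number of accurate bits. */
    static __m128 approx_recip(__m128 x)
    {
        __m128 r = _mm_rcp_ps(x);
        return _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(2.0f),
                                        _mm_mul_ps(x, r)));
    }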

These kinds of tricks are still used today. They're not so useful if you need a reciprocal or square root, since CPUs now have dedicated hardware for that, but it's different if you need a _cube_ root or x^(1/2.4).

I wonder to what extent the dedicated hardware is essentially implementing the same steps but at the transistor level.

The big cores do. They essentially pump division through something like an FMA (fused multiply-add) unit, possibly the same unit that is used for multiplication and addition. That's for the Newton-Raphson steps, or Goldschmidt steps.

In hardware it's much easier to do a LUT-based approximation for the initial estimate rather than the subtraction trick, though.

It's common for CPUs to give 6-8 accurate bits in the approximation. x86 gives 13 accurate bits. Back in 1975, the Cray 1 gave 30 (!) accurate bits in the first approximation, and it didn't even have a division instruction (everything about that machine was big and fast).
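
As a scalar illustration of the Newton-Raphson steps mentioned above (the same recurrence a hardware divider built around an FMA unit would iterate; this is only a sketch, not how any particular CPU lays it out):

    /* Refine a reciprocal estimate r0 of 1/b; each step roughly doubles
       the number of accurate bits. a/b is then computed as a * r. */
    static double refine_recip(double b, double r0, int steps)
    {
        double r = r0;
        for (int i = 0; i < steps; i++)
            r = r * (2.0 - b * r);   /* ideally a pair of fused multiply-adds */
        return r;
    }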


There is also the case with machines that lack FP support, like some ARM Cortex M variants.

I've worked in the payment industry and among other things built a payment terminal, so I know a thing or two about it.

1st: The message overhead during communication is not an issue. It is tiny compared to the time it takes for the credit card to do its crypto thing.

2nd: This won't ever be adopted. There is simply way too much working infrastructure out there built on the APDU protocol. And there are so many companies and stakeholders involved that making any change will take years. If they start aligning on a change, it would be something that makes a real difference, not something that saves time on the order of 5 to 20 milliseconds per payment.


Hi there, author here! I think you've highlighted a couple of things worth clarifying in the doc:

1. Apple having just announced it is opening up NFC to developers means that both major mobile platforms can now act as responding devices; so widely distributing new NFC protocols to consumer devices has become very fast and inexpensive through an update to the OS or installing a third party app.

2. Mobile consumer hardware is sufficiently fast for the application operations (eg. Cryptographic operations) so that these roundtrip and metadata overheads of APDU do actually make a meaningful contribution to the total time it takes to complete a transaction. Experiencing this in my development efforts here was the motivation for designing this alternative.

3. A1 is interoperable with APDU infrastructure and can therefore be adopted by terminals immediately, since reader terminals can attempt an A1 initiation and any APDU response from a legacy device is considered a failure; at which point the terminal can fall back to its default APDU message exchange.

I will update the doc to clarify these points. What do you think?

Given your experience I'd be interested in your detailed feedback, maybe we could jump on a call soon if you have time?


> 1. Apple having just announced it is opening up NFC to developers means that both major mobile platforms can now act as responding devices;

> 2. Mobile consumer hardware is sufficiently fast for the application operations (eg. Cryptographic operations)

You are right here. It is possible to emulate a card using mobile phones. We've been able to shim/emulate any card for much longer.

The thing is: To connect to the payment system you need a certificate. And you simply don't get it unless you can prove that you have all kinds of security measures applied.

For Android and Apple, the actual payment stuff runs inside an isolated little microcontroller which has been certified and is tamper proof/protected. This little thing is fortified to the point that it will destroy itself when you try to open it up. There are alarm meshes, light sensors and much more inside the chip to detect any intrusion, just to protect the certificate.

If you don't have that security, the payment providers will deny you their certificate, plain and simple.

You can build your own thing using card emulation via apps, but you will take all the risks.

How it works in practice is the following: These tamper-proof microcontrollers can run Java Card code. You can write an applet for them and get it installed (if you have the key from Apple/Google - not easy). Then you have two-way communication: your applet inside the secure world communicating with the payment world, and your ordinary mobile app showing things on the screen.


"You can build your own thing using card emulation via apps, but you will take all the risks." right! This is exactly what I've developed. I've built a new NFC application, with its own PKI infrastructure, that's deployed onto users' mobile devices via an app install. It works over APDU, but it would be more efficient if A1 was made available by the two major mobile OS. It would likely shave off >10% of the total time to complete which makes a material difference to the failure rate (ie. customer's lifting away too early or other interference) and therefore improves the overall UX.


Or as simple as using the hardware accelerated CRC32 that we have in our x86 CPUs.

Last time I checked, CRC32 worked surprisingly well as a hash.
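
A minimal sketch of using the SSE4.2 CRC32C instruction as a cheap (non-cryptographic) hash via intrinsics; the function name and the 8-bytes-at-a-time loop are just illustrative:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <nmmintrin.h>   /* SSE4.2; compile with -msse4.2 */

    static uint32_t crc32c_hash(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t crc = ~0u;
        while (len >= 8) {                 /* consume 8 bytes per CRC32 op */
            uint64_t chunk;
            memcpy(&chunk, p, sizeof chunk);
            crc = (uint32_t)_mm_crc32_u64(crc, chunk);
            p += 8;
            len -= 8;
        }
        while (len--)                      /* tail, one byte at a time */
            crc = _mm_crc32_u8(crc, *p++);
        return ~crc;
    }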


Hah, neat.

The 'weird' instructions can often be 3-4 orders of magnitude slower than arithmetic instructions, though I doubt that matters here.


CRC32 is the same speed as an integer multiply, going all the way back to Nehalem (2008). 3 cycles latency, and you can start a new one each cycle (or more than one, on Zen 5).


Sure, in the same way SIMD instructions get a linear speedup in theory.

If you Google the words "CRC32 is slow", you can see hundreds of people complaining about this.


Another bonus quirk, from the 486 and Pentium era...

BSWAP EAX converts from little endian to big endian and vice versa. It was a 32 bit instruction to begin with.

However, we have the 0x66 prefix that switches between 16 and 32 bit mode. If you apply that to BSWAP EAX, undefined funky things happen.

On some CPU architectures (Intel vs. AMD) the prefix was just ignored. On others it did something that I call an "inner swap": of the four bytes stored in EAX, bytes 1 and 2 are swapped.

  0x11223344 became 0x11332244.
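
For contrast, the defined 32-bit behaviour (here via the GCC/Clang builtin, which compiles to BSWAP):

    #include <stdint.h>

    /* Plain BSWAP reverses all four bytes; only the 0x66-prefixed form
       was left undefined. */
    uint32_t swapped = __builtin_bswap32(0x11223344u);   /* 0x44332211 */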


Also known as "bswap ax", and research shows that it does something surprising but consistent on almost all hardware: It zeros the register.

https://www.ragestorm.net/blogs/?p=141

https://gynvael.coldwind.pl/?id=268

However, this page, now gone, suggests that some CPUs (early 486s?) did something different: http://web.archive.org/web/20071231192014/http://www.df.lth....

Unfortunately I have not found any evidence, or reason to believe, that this "inner swap" behaviour you mention exists on some CPU --- except perhaps in some flawed emulators?


Nah, it is not that bad.

Sure, you can mess up your performance by picking bad compiler options, but most of the time you are fine just enabling the default optimizations and letting it do its thing. No need to understand the black magic behind it.

This is only really necessary if you want to squeeze the last bit of performance out of a piece of code. And honestly, how often does this occur in day-to-day coding unless you are writing a video or audio codec?


The main flags to look at (an example invocation follows the list):

* -mtune/-march - specifying a value of native optimizes for the current machine; x86-64-v1/v2/v3/v4 selects a generation, or you can specify a particular CPU (ARM has different naming conventions). Recommendation: use a generation if distributing binaries, native if building and running locally, unless you can get much more specific

* -O2 / -O3 - turn on most optimizations for speed. Alternatively Os/Oz for smaller binaries (sometimes faster, particularly on ARM)

* -flto=thin - get most of the benefits of LTO with minimal compile time overhead

* pgo - if you have a representative workload you can use this to replace compiler heuristics with real world measurements. AutoFDO is the next evolution of this to make it easier to connect data from production environments to compile time.

* math: -fno-math-errno and -fno-trapping-math are “safe” subsets of ffast-math (i.e. don’t alter the numerical accuracy). -fno-signed-zeros can also probably be considered if valuable.
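
Putting those together, a typical invocation might look something like this (flag spellings as listed above; the source and output names are just placeholders):

    clang -O2 -march=x86-64-v3 -flto=thin -fno-math-errno -fno-trapping-math -o app main.c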


Also, I learned recently that there's `-Og`, which enables optimizations suitable for debug builds.


In practice I’ve had limited success with that flag. It still seems to enable optimizations that make debugging difficult.


Agreed. I like to compile most translation units with -O3 and then only compile the translation units that I want to debug with -O0. That way I can often end up with a binary that's reasonably fast but still debuggable for the parts that I care about.


Yup that’s what I’ve resorted to (in Rust I do it at the crate level instead of translation unit). The only downside is forgetting about it and then wondering why stepping through is failing to work properly.


Correct.

And every graphics programmer worth their salt had a copy of "Computer Graphics: Principles and Practice" on the desk and followed whatever came out of the "Graphics Gems" series.

We knew about BSP trees, octrees and all that stuff. It was common knowledge.


> Linked lists are taught as fundamental data structures in programming courses, but they are more commonly encountered in tech interviews than in real-world projects.

I beg to disagree.

In kernels, drivers, and embedded systems they are very common.


Most people who take data structures courses or perform tech interviews don't end up working on kernels, drivers, or embedded systems though. To me, it sounds like the point being made is that there are a large number of programmers who have learned about linked lists but haven't run into many cases where they needed them in the real world, and I think it's accurate.


This was my intention


Agree, I can't recall using anything more complicated than lists/arrays or hash tables (key/value stores) in practice, in many years of (mostly web application) programming. And even those I'm not coding from scratch, I'm using classes or functions that my programming language gives me. For anything more complicated than that, I'm using a database, which of course is using many data structures under the covers but I don't directly touch those.


I used to use them all the time. However, now? I would be hard pressed to not use one of the many built in vector/list/dict/hash items in many languages now. I would have to be truly doing something very low level or for speed to use one.


As a counterpoint, I’ve been working on collaborative text editing. I ended up implementing a custom b-tree because we needed a few features that I couldn’t find in any off the shelf library:

- My values represent runs of characters in the document.

- Inserts in the tree may split a run.

- Runs have a size - 0 if the run is marked as deleted or the number of characters otherwise. The size changes as we process edits.

- Every inserted character has an ID. I need to be able to look up any run by its ID, and then edit it. (Including querying the run’s current position in the tree and editing the run’s size).

It’s an interesting data structure problem, and it took a few weeks to have a good solution (and a few weeks more later rewriting it in safe rust & improving performance in the process).

I love this stuff. I think it’s pretty rare to find a reason to code your own collection types these days, but it certainly comes up from time to time!


> I love this stuff. I think it’s pretty rare to find a reason to code your own collection types these days, but it certainly comes up from time to time!

Absolutely! That is one of the places where you want to use that style of programming, as the base classes and built-in structs do not really cover it yet.

Also, as a counterpoint, sometimes the built-in ones have some very interesting degenerate cases. I had one in an old library that basically doubled its memory footprint every time you exceeded its buffer. That was a point to change it to a fixed allocation or something else. If I had no idea of the fundamentals, I would have been totally in the weeds with no idea why it was doing it.


You need a linked list to write hello world in any Lisp, though.

Seems like the glaring exception to the rule!


As source code, but not necessarily as running code.

SBCL:

    * (defun hello-world () (write-string "hello world"))
    HELLO-WORLD

    * (disassemble #'hello-world)
    ; disassembly for HELLO-WORLD
    ; Size: 36 bytes. Origin: #x100311C85C                        ; HELLO-WORLD
    ; 5C:       AA0A40F9         LDR R0, [THREAD, #16]            ; binding-stack-pointer
    ; 60:       4A0B00F9         STR R0, [CFP, #16]
    ; 64:       EAFDFF58         LDR R0, #x100311C820             ; "hello world"
    ; 68:       570080D2         MOVZ NARGS, #2
    ; 6C:       29EC80D2         MOVZ TMP, #1889
    ; 70:       BE6B69F8         LDR LR, [NULL, TMP]              ; WRITE-STRING
    ; 74:       DE130091         ADD LR, LR, #4
    ; 78:       C0031FD6         BR LR
    ; 7C:       E00120D4         BRK #15                          ; Invalid argument count trap
The actual code for this example is machine code (which references a string, which is a vector), here without linked lists.


No, you don't need to use linked lists to send a string to the standard output port in most Lisps. You just call a function.


(write-string "hello world") has linked list semantics, the fact that compilers can be smart enough to ignore this is not the point I'm making.

(write-string (cdr '(write-string "hello-world"))) also has to work, so it's pretty easy to materialize that semantics at any point.


> has linked list semantics

Nope; it has linked list syntax (that certainly isn't ignored even by very good compilers). Syntax isn't semantics.

The semantics is that a function write-string is called, with a string as its argument.

The second expression has linked list processing in its semantics because you stuck in a cdr, as well as a quote which makes a piece of the program available as run-time list datum. (This is semantics that could be easily optimized away in the executable form, but I would say that it has linked list processing in its abstract semantics.)


> Nope; it has linked list syntax (that certainly isn't ignored even by very good compilers)

We're looking at the same string and seeing different things. You're seeing `(write-string "hello world")` as a program, I'm seeing it as an expression.

It has linked list semantics, which you can preserve until runtime like this `'(write-string "hello world")`. Note that I didn't change the string, I changed its context. If the original were living in a string, and you called read on it, it would become a linked list. If you called eval on that list, it would become a function call. This is basic stuff which I'm well aware you know, so I'm not sure what all the quibbling is about.

You literally need a linked list to write a program in a language in which the code becomes linked lists. And you're going to have a bad time writing Lisp if you don't get the hang of cons cells, early and often.

Is "code is data" true, or false? You're trying to have it both ways here.

The "large number of programmers who have learned about linked lists but haven't run into many cases where they needed them in the world world" include approximately zero programmers who have wielded Lisp in anger, is my point. I thought that was pretty clear from context, but I guess not.


The compiler manipulating linked lists to get your program into executable form, versus that program manipulating linked lists to do its job, are different things.

You're also likely provoking list manipulation by running C hello world. The C grammar has lists, like lists of parameter declarators in a function, or argument expressions in a function call.

By the time you've compiled your C program, you've likely "used", lists, trees and hash tables.

Classic line-numbered BASIC interpreters stored the lines as a linked list. For instance in

  10 PRINT "Hello"
  20 GOTO 10
line 10 is stored as a datum which has a pointer to line 20. Some BASIC implementations implemented a forward GOTO as a linear search through the linked list starting at the current line, and a backwards GOTO as a scan from the beginning.
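
A rough C sketch of the storage scheme being described (names and fields are illustrative, not from any particular interpreter):

    /* Each BASIC line is a node in a singly linked list ordered by line
       number; GOTO resolves a target by walking the list. */
    struct basic_line {
        int number;                  /* e.g. 10, 20 */
        char *text;                  /* tokenized or raw source */
        struct basic_line *next;     /* the following program line */
    };

    static struct basic_line *find_line(struct basic_line *start, int target)
    {
        for (struct basic_line *l = start; l; l = l->next)
            if (l->number == target)
                return l;            /* forward GOTO: scan from here */
        return NULL;                 /* backward GOTO: scan again from the start */
    }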

So, by that measure, in addition to not being able to write a C hello world without linked lists, the same holds for BASIC.

The way you present your idea about Lisp is harmful because it is likely to be misinterpreted and become misinformation in the eyes of those who are not so well informed.

The Lisp community still has to deal with nonsense like that the execution of Lisp programs is slow because linked lists are continuously being traversed, or that the only data structure is a list.

Think about how you might be playing into that. What do you think it looks like when you say that you can't write a hello world without using linked lists?


Really only because they’re so goddamn easy. I find myself using linked lists a lot less since adopting rust for embedded code (even with no_std and no allocator, but especially when alloc-only std data structures are within reach).


Linked lists were heavily used in application software before the appearance of standard libraries and Java, which is when dynamically sizable array-based lists become common. There also wasn't a gap between the performance of linked lists and arrays before CPU became significantly faster than RAM.


Modern processor and cache performance lend themselves to vectors and SSA. Linked lists just don't scale well outside of niche uses.


>In kernels, drivers, and embedded systems they are very common.

Out of all the programmers in the world, what percentage of them do you think work in the kernel/driver/embedded spaces?


First, my only guess is that everyone's guesses are going to be wildly wrong. People who work in such spaces will greatly overestimate. People who don't will greatly underestimate. (This is mostly due to how many comments I've read on HN that implicitly assume that most people's problems and perspectives are the same as the commenter's.)

Second, linked lists are useful in a lot more places than that. Probably a better proxy would be low-level coders. You almost always want a linked list somewhere when you're dealing with memory addresses and pointers. Maybe not for the primary collections, but there are always secondary ones that need to maintain a collection with a different membership or ordering, and vectors of pointers don't have many clear advantages over intrusive linked lists for those secondary collections.


Yeah intrusive collections in C is the biggest use I’ve seen. I played with a physics engine a few years ago (chipmunk2d) which made heavy use of intrusive linked lists to store all the objects in the world model. I suspect there’s some clever data structures out there that might have better performance, but the intrusive linked list approach was simple and fast!


1%


More like 0.01% -- if we consider enterprise programmers, web programmers, and application/game programmers which I'd expect to be the largest groups...


Yep. There aren't many software developers I know who have ever touched {Linux, macOS, FreeBSD, Windows} kernel code except for embedded devs, driver devs, security researchers, hobbyists, and SREs/PEs.

The % who have touched kernel bits, wrote a triangle engine scene renderer, wrote a compiler, touched server metal in production, have worked on ASICs, and can put together ML/AI building blocks shrinks way, way down to a handful of living humans.


If the value really is 0.01%, then the education pipeline needs to be revised. 'Blue collar' programmer positions should be the majority.


This is not about blue collar vs white collar. After all, corporate programmers and web programmers can both be blue collar, and systems programmers can be white collar (if we're using "blue collar" to mean smaller salaries and fewer perks - otherwise programming is a white collar job anyway).

This is about how many people work on kernels/embedded systems/etc. vs more common programming gigs. And that's less about how many are trained to do so than about how many are needed.


There are plenty of good uses for linked lists and their variants. LRU lists come to mind; I couldn't swear it's the most efficient way to implement them, but they're pretty darn good. Then obviously things like breadth-first search need some kind of queue data structure. It often comes down to memory pressure: if you've got gigs to spare, allocating a contiguous block of memory for a list of something isn't a big deal; if memory is tight and maybe fragmented, linked lists can and will get it done. They have their places.

I did start to encounter some fresh grads with degrees that said "computer science" on them who couldn't answer some basic linked list questions. I was beginning to think it was a bad set of questions until I hit those kids. If you claim to know "computer science" and don't know what a linked list is, especially beyond some textbook stuff, I'm probably not interested.


Why? Why would someone reach for a linked list in a kernel, driver, or embedded system?


No memory allocation/reallocation, preallocated resources managed in e.g. a free list. Also for things like packetized networks, lists are handy for filling as you progress down the stack while using fixed sized packet buffers, or reassembling fragments.

In embedded world, memory often needs to be exactly controlled, and allocation failures are fatal without a more complex MMU. In kernel world, I believe the main reason is that allocations can block.


In kernels, it's usually hard to get general-purpose allocation working reliably in all contexts. And you need that for resizable vectors. With lists, you just need to be able to grab an element-sized block. Quite often, it's even done with the memory page granularity.

In addition, a lot of data structures might be shared across multiple cores. Linked lists can be traversed and mutated concurrently (although with a bit of care).


I wonder how much of that is due to the kernel history, and the influence of C idioms, and not because of some inherent design superiority.

I'd be convinced once I see pure Rust kernels geared towards modern machines suddenly using linked lists everywhere. Otherwise I'm leaning towards it being a side-effect of the language choice and culture.

Also because I've seen the same kind of reasoning applied to compilers (e.g. "of course you need linked lists in compilers, they are extremely graph traversal heavy"). But one look at modern compilers implemented in Rust paints a very different picture, with index-based vectors, data-oriented design and flattened ASTs everywhere.


Getting a general memory allocator working in kernel contexts is a hard task. You need to make sure it can't block and is reentrant, that it doesn't result in fragmentation, and that it can be used from multiple threads.

It can be solved (or worked around), but it's understandable that people don't _want_ to do that.


Intrusive lists are really powerful for those kinds of scenarios, and technically are linked lists. They're widely used in the kernel, IIRC.


O(n) iteration but pretty much guaranteed O(1) for every other operation. If that's the semantic you need, then linked lists are your friend.


Any time you have a computer interacting with the outside world in an asynchronous fashion you basically have to have some form of buffering which takes the form of a queue/fifo. A linked list is the most performant/natural way of modeling a queue in our ubiquitous computing infrastructure.

E.g. in a DMA-based Ethernet driver, the Ethernet MAC receives packets asynchronously from the processor, perhaps faster than the processor can ingest them. So the MAC interrupts the processor to give it new packets, and the processor can't sit processing the packets in the interrupt context, so it needs to put them into some ordered list for processing later when it has downtime. In a true embedded system, the memory for this list is going to be fixed or statically allocated, but you still don't really want an array-style list with fixed indexing, as you'll have to manage what happens when the index wraps around back to 0, etc., so instead you just construct a linked list in that pre-allocated memory.

I wouldn't say linked lists aren't really used in high-level applications; as I said, they're used all over the place whenever you have external asynchronous communication. It's just that modern high-level frameworks/libs totally abstract this away from most people writing high-level code.


Easier to avoid allocation errors, e.g. in the Linux kernel. I think Alice Ryhl mentioned it here - https://www.youtube.com/watch?v=CEznkXjYFb4


How do linked lists prevent allocation errors? If anything, they would seem to make them worse.

My experience in embedded, everything is hardcoded as a compile time constant, including fixed size arrays (or vectors of a fixed capacity)


Intrusive linked lists eliminate the allocation entirely. With a vector<Obj>, you have the Obj allocation and then potential vector-related reallocations. With an intrusive linked list, you only have the Obj allocation. So your code that adds/removes list entries does no additional allocation at all, it reuses a pointer or two that was allocated as part of the original Obj allocation. Often the list manipulation happens at a time when allocation failures are inconvenient or impossible to handle.
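
A minimal sketch of the intrusive pattern in C (assuming a circular doubly linked list with a sentinel head, so prev/next are never NULL; names are illustrative):

    /* The links live inside the object itself, so linking and unlinking
       never allocate. */
    struct list_node {
        struct list_node *prev, *next;
    };

    struct request {
        int id;
        struct list_node link;       /* embedded link, no separate node alloc */
        /* ... payload ... */
    };

    /* O(1) unlink given only a pointer to the node; no allocator involved. */
    static void list_remove(struct list_node *n)
    {
        n->prev->next = n->next;
        n->next->prev = n->prev;
    }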


In more complex embedded software you are likely to see free lists used to manage pools of preallocated resources (like event structs etc) or pools of fixed sized memory buffers.


In embedded, you often need message queues.

A common way to implement these is to have an array of messages, sized for the worst case scenario and use this as the message pool.

You keep the unused messages in a singly linked "free list", and keep the used messages in a doubly linked queue or FIFO structure.

That way you get O(1) allocation, de-allocation, enqueue and dequeue operations for your message queue.

Another example of this paradigm is job queues. You might have several actuators or sensors connected to a single interface and want to talk to them. The high-level "business" logic enqueues such jobs and interrupt-driven logic works on these jobs in the background.

And because you only move some pointers around for each of these operations it is perfectly fine to do so in interrupt handlers.

What you really want to avoid is to move kilobytes of data around. That quickly leads to missing other interrupts in time.
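
A compact sketch of that layout (pool size, struct fields and function names are made up for illustration; error handling kept minimal):

    #include <stddef.h>

    #define MSG_POOL_SIZE 16

    struct msg {
        struct msg *next;                    /* free list / queue forward link */
        struct msg *prev;                    /* queue backward link */
        unsigned char payload[32];
    };

    static struct msg pool[MSG_POOL_SIZE];   /* worst-case sized pool */
    static struct msg *free_list;            /* singly linked */
    static struct msg *q_head, *q_tail;      /* doubly linked FIFO */

    void msg_init(void)
    {
        free_list = NULL;
        for (int i = 0; i < MSG_POOL_SIZE; i++) {
            pool[i].next = free_list;        /* push every slot onto the free list */
            free_list = &pool[i];
        }
        q_head = q_tail = NULL;
    }

    struct msg *msg_alloc(void)              /* O(1); NULL when the pool is empty */
    {
        struct msg *m = free_list;
        if (m)
            free_list = m->next;
        return m;
    }

    void msg_free(struct msg *m)             /* O(1) */
    {
        m->next = free_list;
        free_list = m;
    }

    void msg_enqueue(struct msg *m)          /* O(1), append at the tail */
    {
        m->next = NULL;
        m->prev = q_tail;
        if (q_tail)
            q_tail->next = m;
        else
            q_head = m;
        q_tail = m;
    }

    struct msg *msg_dequeue(void)            /* O(1), pop from the head */
    {
        struct msg *m = q_head;
        if (!m)
            return NULL;
        q_head = m->next;
        if (q_head)
            q_head->prev = NULL;
        else
            q_tail = NULL;
        return m;
    }
Every operation only moves a few pointers, which is exactly why this is safe to do from an interrupt handler.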


I'd say most developers don't write kernels/drivers or embedded software, at least from what I've seen. I am not saying that there are not many devs like this, but rather that there are fewer kernel devs than web devs.


I beg to disagree^2. Tasks, threads, and processes are often structured as rings where there is always a "next" to maintain simplicity of task switching. The overall architecture of resources is modelable as cyclic graphs but implemented as rings, deques, single LLs, and other data structures.


I don't do any of those things and I still use lists constantly. Kinda strange to learn that many others don't use them at all it seems.


What kinds of things are you using them for usually? Is it mostly in C/C++?


Linked lists shine when you can perform an O(1) remove operation if you have a reference to an object on the list. This is very common when using C structs and not possible in Java, for example.


These cases are usually cases where you want to use a (hash) set. If you're OK with changing everyone's indexing, the indexing didn't matter.


Hardware existed.

Around 2006 I had some automotive entertainment system from NEC on my table which had one of the Bitboys GPU chips on it. Wrote some vector graphics API for it.

It wasn't bad, honestly. It supported OpenGL ES 1.0. I only had to contact them twice for driver bugs. They resolved those within a few days.


I want this as a paint pigment please...


I wrote thousands of lines of assembler for NASM. It was such a nice upgrade from the good old Turbo Assembler back in the '90s.


Same here... I also come from TASM.


As a Borland fanboy, I found TASM already quite cool; the only thing I missed was MASM being more high level with its macro capabilities, but by the time I got to play with MASM, I was already in the Windows 3.x world.

