I like this site very much; it's simple and old school.
"llvm-exegesis is a benchmarking tool that uses information available in LLVM to measure host machine instruction characteristics like latency, throughput, or port decomposition.
Given an LLVM opcode name and a benchmarking mode, llvm-exegesis generates a code snippet that makes execution as serial (resp. as parallel) as possible so that we can measure the latency (resp. inverse throughput/uop decomposition) of the instruction. The code snippet is jitted and executed on the host subtarget. The time taken (resp. resource usage) is measured using hardware performance counters. The result is printed out as YAML to the standard output.
The main goal of this tool is to automatically (in)validate the LLVM’s TableDef scheduling models. To that end, we also provide analysis of the results."
I advocated for the term to only be used by folks who can build their own computer, language, OS, and apps. Look up Chuck Moore and Niklaus Wirth for examples focusing on simplicity. For a full-stack architect with a team building it, look up Bob Barton of the Burroughs B5000. I can't remember if the early folks writing Lisp interpreters built the CPUs, too. I know Burger did for a Scheme CPU, though.
The focus was on expressive power more than simplicity, though.
If you wish to make an apple pie from scratch, you must first invent the universe.
We all gotta pick our specialties. For me, low-level high-performance stuff is more of a hobby-project thing. I've picked up a few cool things on my own, but it isn't really what I do for a living.
Web isn't some magical beast; it's just rote programming.
Actual business logic is where the categorization starts.
If you want to be humbled, go to a conference on high energy physics.
That isn't what the poster stated.
How about a Java to MIPS compiler? In C++ and then again in Java just to be self-hosting.
Or an all-software OpenGL-compatible pipeline with painter's-algorithm Z-buffering and a trapezoid engine? Quaternions and all that state... fun-fun.
Another mandatory project in college was a non-pre-fork()ed, select()-based caching HTTP/1.0 proxy in POSIX C89. IPC was a PITA back then.
There are still bigger fish, but one can be that much less of a dilettante by digging deeper and mastering the why, not the what or the current fashion.
I guess I better throw most of those assumptions out for modern microarchitectures.
 This is not true for 8- or 16-bit subregisters, causing all kinds of fun issues (see "partial register stalls", also discussed in Agner Fog's materials).
What book is that?
Chunks of code are generated. Sometimes a chunk will need to branch to a chunk that isn't yet generated, or even allocated. Chunks can also be cleared away for various reasons, like eviction to make room, so branches into a chunk may need to be redirected once it's gone.
One strategy is to use indirect branches. When the destination doesn't exist, the branch goes to code that will resolve that. It's a bit like how dynamic linkers work.
Another strategy is to replace the code. The processor may stumble a bit over what is essentially self-modifying code, but then it eventually runs faster. The degree to which the processor will pause is a key consideration here. Anybody know?
BTW, if you like digging into this stuff, we're hiring.
stub 1) upon entry, atomically write the address of stub 2 into the table. Then compile, or enqueue the work onto a compilation thread/process. At the end of compilation, atomically write the newly compiled address to the table.
stub 2) spin on the function-pointer table entry until it no longer equals stub 2's address, then call the new function pointer. If you care about power consumption or timing attacks, the spin can back off at a rate matched to the speed of your JIT compiler. This function should fit into a single cacheline, so it should usually be hot.
inlined thunk: given a function, read its address from the table and call it.
The client uses the thunk to call the function address in the table without knowing if it's compiled or not. On Intel, you shouldn't need a read fence due to MESI rules.
The branch/fetch predictor works by hashing the address of the branch instruction into its own lookup table. So by forcing stub 3 to be inlined, you're relying on the hash to prefetch the correct function address and call it directly.
You should also be able to put the JIT into a separate process that mmaps with the NX bit set, and have the client load without the NX bit set. So it would be difficult to exploit the JIT compiler itself (but the produced code could still be unsafe).
You might want to cacheline-align the function table, or pack related chunks into the same table entry, to prevent false sharing whenever there's an atomic write. Also note that the table adds cache pressure: every call through the trampoline incurs an extra cacheline load.
Anyways, that's my back of the napkin design for how to do it.
That is certainly a different way to think about threading. My instinct would be to keep a separate JIT cache for each thread, keeping the threads from stealing resources from each other but also keeping them from sharing the JIT effort.
I think your approach is pretty much how /lib/ld-linux.so.2 deals with the PLT. I didn't see an explanation for stub 3, but I guess you mean the part of the JIT output that does the indirect jump. I take it that you believe the speed advantage of a direct jump is not enough to overcome the occasional hit caused by modifying it, with all the invalidations (icache, trace cache, etc.) that it might cause. Being similar to the dynamic linker is probably well-aligned with what Intel is trying to optimize for upcoming chips. OTOH, the spectre problem might limit what Intel does.
Putting the JIT in a separate process would have to add latency. Upon hitting a missing chunk of code, translation can't really wait. The idea is to run modern stuff, such as a recent desktop OS, via the JIT. We do use more than one mapping on Linux, as required to keep SE Linux happy.
With regards to latency, if you have a dedicated JIT thread, you can read ahead and JIT the start of all the jump/calls up to the next current ret in parallel to executing the first chunk. Like a super fancy prefetch. Heuristically, I think it's safe to assume that most code is local but I could be wrong.
Also, if you load hwloc on x86-64, you can read your CPU topology at initialization time. If hyperthreading is present (which is somewhat common), you can set affinity for JIT threads to be on the same core as the emulated CPUs. This will minimize read concurrency overhead since you'll be reading/writing to the physical core's L1 almost every time. (This part is crazy, but you might even be able to get away without using atomics due to how cache-associativity works, but I've never tried it and it might not be guaranteed into the future.)
Properly handling all cases means forcing a memory fence whenever you're done writing code. I haven't thought too much about which fence is proper, but at minimum a write fence is needed to ensure all data is written out as expected. You might need a full fence.
EDIT: Maybe an sfence on the writing thread, while the JIT code needs an lfence (to ensure all caches are properly updated). IIRC, fences on both sides are needed in these cases.
The "dynamic linker" idea seems better to my instinct, because it won't require these fences. Of course, testing would be required (Instincts are typically proven wrong in performance questions). IIRC, Intel Skylake and AMD Zen perform branch prediction on dynamic addresses, so you should get okay performance out of the dynamic lookup.
would you mind sharing what PRM sections back up this reading? maybe a careful re-read will help you figure out what "might" require a full fence.
I'm generally willing to hold a technical discussion among peers. But it has to be on even terms with a decent amount of respect given both ways. Your tone in the post above, as well as your posting history, does not give me any hope towards having a good discussion with you.
here's the documentation: https://software.intel.com/sites/default/files/managed/39/c5...
Section 8.1.3 Handling Self- and Cross-Modifying code
(* OPTION 1 *)
Store modified code (as data) into code segment;
Jump to new code or an intermediate location;
Execute new code;

(* OPTION 2 *)
Store modified code (as data) into code segment;
Execute a serializing instruction; (* For example, CPUID instruction *)
Execute new code;
this doesn't strike me as contentious; there is no guideline that states a fence is necessary. you're possibly making the assumption that fences are "serializing instructions" and, well, again i'd question what documentation backs that interpretation up. is an sfence or lfence "more" serializing?
I can't ignore your tone. Sorry. Maybe other posters can, but I can't and I won't.
EDIT: I had a lot of other stuff here. But my attacks vs you were likely unfair. So I've erased them.
Nonetheless, as I stated in my first response to you: I have no hope to have a positive discussion from this thread with you. I do believe you might be a stronger assembly programmer than I am. Maybe next time we can have a good discussion. But the tone was not set correctly in the initial post, so I do not wish to discuss this matter with you any further.
im not claiming to be a "better assembler programmer" im literally quoting the only relevant documentation on a tricky area where you were spreading lies charitably dressed up as half-guesses. if you hadn't barged in with some grossly incorrect theory about fences we wouldn't be talking at all. i still think lying and misleading while putting on airs of educating folks is worse than, uh, quoting documentation? are you thinking the documentation is being too harsh on your pet theory?
will you at least agree to stop spreading this dangerous misinformation? neither fence is serializing (like... you wouldn't want them to be...) and someone building a JIT around your guessing is heading towards an erroneous implementation.
> apparently filled with unhinged personal attacks,
> you were spreading lies
> lying and misleading
Too late. You already failed at this within your very own post. The above words are not technical in any means, and I predicted that this would happen. Let me repeat myself one last time:
> I do not wish to discuss this matter with you any further
Perhaps I was not clear? But let me say it one more time. I DO NOT WISH TO DISCUSS THIS MATTER WITH YOU ANY FURTHER. Am I clear this time?
EDIT: That does not mean I'm unwilling to discuss things with you in the future. As I stated earlier, it seems like you're an avid assembly programmer, and there aren't many of those in the world. So maybe next time, I'll feel confident to have a discussion with you. But in this thread, at this time? I do not wish to hold any technical discussion with you today.
if you can't handle being called out for providing false technical information... don't hand out false technical information? seems easy. it's not, as you so desperately want, a personal attack to identify areas where you're misrepresenting the technology.
im curious, how did you want this to go? DT (incorrectly): "fences are required". jv6: "ah, well, hm, er, here's section 8.1.3, which doesn't mention fences at all and specifies two methods (neither of which is a fence) that ARE required to ensure correct operation". like, can i even tell people the truth? or must your egregious misrepresentation stand without question for you to find the "tone" acceptable?
I know one thing you want: you want to talk about assembly language / various low level details. Unfortunately, I don't want to do that with you in this thread. So I will NOT give you that. Is there anything else you might want from me?
As for what I want: I want to say goodbye to you, at least for now. Perhaps you and I will have a more pleasant discussion in the future. But this current discussion was poisoned from the start and I do not wish for this discussion to continue.
i mean, they don't try to hide this information? Section 8.1.3 gives two options that are guaranteed to work across future iterations of hardware. the sibling comment that assumes a heavy serializing instruction like CPUID is the ONLY method is shooting from the hip, and only really appropriate if you're playing tricks with which LA you're using for the write/read.
you'll probably want to skim "11.6 Self-Modifying Code" as well. SMC is guaranteed to work, not to work fast. Most folks want that second one, and blasting out icache entries is going to put a damper on it.
(This is particularly motivated by the fact that Spectre mitigation requires ordinary indirect jumps in the kernel to use a "retpoline", which prevents the processor speculating the jump target but is very slow - are you using retpolines for your indirect jumps?)
The most recent "static calls" patch set thread: https://lkml.org/lkml/2019/1/9/1112
In terms of latency, having the JIT in a separate process versus a thread is a matter of semantics at that point, so you might as well process-isolate it.
It all depends on how often your JIT fetches instructions outside of L2, given the size of the JIT itself and the code it generates. You can measure this approximately with cachegrind, then multiply the data and instruction L1/L2 miss counts by whatever they cost on your architecture.
Does anyone know if this individual still teaches at the Technical University of Denmark, and if there might be any lectures of theirs online?
I've been watching Onur Mutlu's CMU lectures online and this material is a nice addition to computer architecture self-study.