Brendan answered with the advantages, but as for how: it's still just a software breakpoint, int3/0xCC on x86 as you say. But the round trip of handling that breakpoint is much tighter, because the handler function is called directly from the kernel trap, without even a context switch. Uprobes has about the minimal overhead a software breakpoint can possibly have, whereas involving a userspace debugger requires a bunch of syscalls and context switches every time.
> mov instructions from register to register make up for more than 60% of the time spent in the critical section of the code, while we would expect most of the time to be spent xoring and anding. I have not investigated why this is the case, ideas welcome
If you're not using precise events, then the instruction addresses reported by perf will have some skid: a small CPU delay between when a performance counter overflows and when the interrupt actually freezes state.
You can choose precise sampling for some events, depending on the CPU. Try "-e cycles:pp" for instance.
0,09 │ mov (%rax,%r8,4),%eax
29,32 │ mov %r14,%r8
I think this first mov from memory is likely your true cycle eater, much more so than the second reg-to-reg mov or any single xor/and operation. But don't optimize based on my hunch - measure it precisely first! If memory access proves to be the slowdown, then you can try optimizing your access patterns.
Also, because modern pipelining is so complicated, some instructions that you wouldn't expect to take a long time do take a long time (usually because they're waiting on a mov from memory to finish). In this case, the mov from memory could be raising a hazard that blocks the register-to-register mov behind it.
We don't even have 4 billion possible characters now. The Unicode range is only 0-10FFFF, since UTF-16 can't represent any more than that. So UTF-32 is restricted to that range too, despite what 32 bits would allow, never mind 64.
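That cap is observable in any modern language runtime; a quick check in Rust (used here purely as illustration - the limit itself comes from UTF-16's surrogate-pair arithmetic):

```rust
fn main() {
    // Rust's char is a Unicode scalar value, capped at U+10FFFF.
    assert_eq!(char::MAX, '\u{10FFFF}');
    // 0x110000 and above are simply not encodable.
    assert_eq!(char::from_u32(0x110000), None);
    // The cap comes from UTF-16: 1024 high x 1024 low surrogates address
    // exactly 0x100000 supplementary codepoints on top of the 0x10000 BMP ones.
    assert_eq!(0x10000 + 1024 * 1024, 0x110000);
}
```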
But we don't seem to be running out -- Planes 3-13 are completely unassigned so far, covering 30000-DFFFF. That's nearly 65% of the Unicode range completely untouched, and planes 1, 2, and 14 still have big gaps too.
The issue isn't the quantity of unassigned codepoints; it's how few private use codepoints are available: only about 137,000. Publicly available private use schemes such as ConScript are quickly filling that space, mainly by encoding block characters the same way Unicode encodes Korean Hangul, i.e. by using a formula over a small set of base components to generate all the block characters.
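For reference, the Hangul mechanism alluded to here is a pure formula (the Unicode Standard's Hangul syllable composition): every one of the 11,172 syllables is derived arithmetically from leading-consonant, vowel, and trailing-consonant indices, with no per-character table. A minimal sketch:

```rust
// Unicode composes all 11,172 Hangul syllables by formula over
// 19 leading consonants (L), 21 vowels (V), and 28 trailing slots (T):
//   S = 0xAC00 + (L * 21 + V) * 28 + T
fn compose_hangul(l: u32, v: u32, t: u32) -> Option<char> {
    if l >= 19 || v >= 21 || t >= 28 {
        return None;
    }
    char::from_u32(0xAC00 + (l * 21 + v) * 28 + t)
}

fn main() {
    assert_eq!(compose_hangul(0, 0, 0), Some('가'));  // U+AC00
    assert_eq!(compose_hangul(18, 0, 4), Some('한')); // U+D55C
}
```

A block-character scheme in the private use area works the same way: a formula over base components, so the full combinatorial set costs almost no codepoints individually but a large contiguous range collectively.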
My own surrogate scheme, UTF-88, implemented in Go at https://github.com/gavingroovygrover/utf88 , expands the number of UTF-8 codepoints to 2 billion as originally specified by using the top 75% of the private use codepoints as 2nd tier surrogates. This scheme can easily be fitted on top of UTF-16 instead. I've taken the liberty in this scheme of making 16 planes (0x10 to 0x1F) available as private use; the rest are unassigned.
I created this scheme to help in using a formulaic method to generate a commonly used subset of the CJK characters, perhaps in the codepoints which would be 6 bytes under UTF-8. It would be more difficult than the Hangul scheme because CJK characters are built recursively. If successful, I'd look at pitching the UTF-88 surrogation scheme for UTF-16 and having UTF-8 and UTF-32 officially extended to 2 billion characters.
NFG uses the negative numbers down to about -2 billion as an implementation-internal private use area to temporarily store graphemes, which enables fast grapheme-based manipulation of strings in Perl 6. Such negative-numbered codepoints could only be used for private-use data interchange between third parties if UTF-32 were used, though, because neither UTF-8 (even pre-2003) nor UTF-16 can encode them.
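A toy sketch of the NFG idea (names and representation are mine, not MoarVM/Perl 6's actual internals): each multi-codepoint grapheme is interned once into a side table and the string stores a synthetic negative "codepoint" in its place, so every cell is a fixed-width i32 and grapheme-level indexing stays O(1):

```rust
// Toy sketch of NFG-style strings - illustrative only, not Perl 6 internals.
struct NfgString {
    cells: Vec<i32>,        // >= 0: ordinary codepoint; < 0: grapheme table slot
    graphemes: Vec<String>, // interned multi-codepoint graphemes
}

impl NfgString {
    fn new() -> Self {
        NfgString { cells: Vec::new(), graphemes: Vec::new() }
    }
    // Store a composed grapheme and return its synthetic negative codepoint.
    fn intern(&mut self, g: &str) -> i32 {
        self.graphemes.push(g.to_string());
        -(self.graphemes.len() as i32) // -1, -2, ... down toward -2^31
    }
    fn push_grapheme(&mut self, g: &str) {
        let mut chars = g.chars();
        match (chars.next(), chars.next()) {
            (Some(c), None) => self.cells.push(c as i32), // single codepoint
            _ => {
                let id = self.intern(g);
                self.cells.push(id);
            }
        }
    }
    fn len(&self) -> usize {
        self.cells.len() // counts graphemes, not codepoints
    }
}

fn main() {
    let mut s = NfgString::new();
    s.push_grapheme("a");
    s.push_grapheme("e\u{301}"); // 'e' + combining acute: two codepoints
    assert_eq!(s.len(), 2);      // but one cell each
    assert!(s.cells[1] < 0);     // synthetic negative codepoint
}
```

The negative IDs never leave the process, which is exactly why they're safe internally but useless for interchange.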
It's stated on the official site that it's based on Linux 2.6.33, and it looks like the kernel and userspace are compiled for the Elbrus ISA and run natively, without x86 emulation.
There are no links to sources on their site, and they don't provide datasheets. To request sources under the GPL you need to get the binaries first. I live in Russia, and I've never seen an Elbrus in real use anywhere. It's not marketed or sold to the general public; I think the target market is government security agencies. Of course they get the sources anyway for audit, and they have no incentive to publish them.
Sure. Just in this case variant 2 is very unlikely. I'd say getting sources after buying one or two Elbrus-based servers would be a success for a company not affiliated with the government.
The company I work for once tried to get Linux kernel sources for an embedded system produced by another Russian company. We just got the honest answer: "We have our proprietary module in our Linux tree, so we won't be giving you any sources. Still, we're nice enough to recompile it for you with the options you need." Ugh...
The libelpthread caught my eye -- a "version of libpthread, optimized for operation in hard real time" (according to Bing Translator). And there's a paper in English about it: http://www.mcst.ru/doc/1107/PCS223.pdf
They also state that in realtime mode it's possible to set up different modes of external interrupt handling, "computation scheduling", "disk i/o", and "some other things", whatever that might mean. The kernel is modified to support realtime operation too.
It looks like profile_node.stp is just probing timer.profile, but in principle this could be any event SystemTap is capable of: a function in glibc, a syscall, a kernel function, a perf event on, say, branch misses... there are many, many possibilities. (You might need to tweak other parts of their analysis that assume a time-based event, though.)
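For instance, a hedged sketch of what swapping the probe point might look like (probe names are from SystemTap's standard tapsets; the handler bodies would be whatever profile_node.stp currently does on each timer hit):

```
# instead of:  probe timer.profile { ... }
probe process("/usr/lib/libc.so.6").function("malloc") { /* record stack */ }
probe syscall.read                                     { /* record stack */ }
probe perf.hw.branch_misses                            { /* record stack */ }
```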
let total = (1..).map(|x| x*x).take(10).fold(0, |acc, x| acc+x);
You could replace the fold with ".sum()" if you use the unstable std::iter::AdditiveIterator. Or you could incorporate the squaring map into the fold, but then we're no longer comparing the same thing.
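Concretely, here are the two variants mentioned side by side, plus - for what it's worth on current Rust, where `.sum()` was later stabilized on plain `Iterator` (no `AdditiveIterator` needed) - the sum version:

```rust
fn main() {
    // original: square in a map, then fold
    let total: u32 = (1..).map(|x| x * x).take(10).fold(0, |acc, x| acc + x);
    // squaring folded in directly, as suggested above
    let total2: u32 = (1..=10).fold(0, |acc, x| acc + x * x);
    // on current Rust, Iterator::sum is stable
    let total3: u32 = (1..).map(|x| x * x).take(10).sum();
    assert_eq!(total, 385);
    assert_eq!(total2, 385);
    assert_eq!(total3, 385);
}
```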