The C vs Firefox results are about the same as what I'm seeing in my 8-bit emulator (http://floooh.github.io/virtualkc/). For the Amstrad CPC (currently the most expensive system to emulate) on my 2.8GHz Core i5 MBP I'm seeing about 1.3 to 1.5 ms 'emulator time' per 16.6ms frame for the WASM/asm.js version on FF Nightly, and for the native version (clang -O3) about 1.2 to 1.4ms.
This 'core emulator loop' is pretty much 100% integer code, with a lot of 8- and 16-bit integer bit operations (shifting, masking, ...). No calls into the browser or operating system, and almost no calls into the CRT or C++ stdlib (maybe a small memcpy here and there).
The performance differences for 'pure' C code between the browser and native versions are so small now that they are no longer relevant, at least for the use cases I've encountered; they fall within the normal performance differences you'd see when running the code on a slightly different CPU.
Calling out into HTML5 APIs is a whole different topic though ;)
Languages like Java and JavaScript don't let you lay data out in memory directly, nor do they give you much control over how memory is accessed, so any performance benchmark involving C is entirely superficial.
I can write 2 programs in C, both of which iterate over the same number of elements and perform the same calculations on the same amount of data, and have one take 500ms and the other take 8s. It's all a matter of how you lay things out in memory.
Emscripten uses a TypedArray for the C heap in asm.js, which has exactly the same data layout as natively compiled C code; this gives you (more or less) the same performance advantages for properly arranged data as native code.
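Roughly, the layout looks like this (a hand-written sketch, not actual Emscripten output; the HEAP8/HEAP32/HEAPF64 names just mimic Emscripten's naming convention):

// One big ArrayBuffer acts as the C heap; typed-array views over the same
// buffer give byte/int/double access, so C structs keep their native layout.
var buffer  = new ArrayBuffer(16 * 1024 * 1024);
var HEAP8   = new Uint8Array(buffer);
var HEAP32  = new Int32Array(buffer);
var HEAPF64 = new Float64Array(buffer);

// A C struct { int32_t id; double value; } stored at byte offset p
// (p assumed 8-byte aligned, so 'value' sits at p + 8 after padding):
function readStruct(p) {
  var id    = HEAP32[p >> 2];
  var value = HEAPF64[(p + 8) >> 3];
  return { id: id, value: value };
}

HEAP32[0] = 42;               // id at byte offset 0
HEAPF64[1] = 3.14;            // value at byte offset 8
console.log(readStruct(0));   // { id: 42, value: 3.14 }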
What's amazing is that in some engines, if the JIT can determine that you are within the bounds of the array, it can disable those checks for the duration of the loop.
So something like for (x = 0; x < arr.length; x++) {...} can run at pretty much the same speed as C code.
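A minimal, self-contained version of that kind of loop (names are mine, just for illustration):

// Summing a typed array with the classic bound of arr.length: the JIT can
// prove every access is in range and hoist the bounds check out of the loop.
function sum(arr) {
  var total = 0;
  for (var x = 0; x < arr.length; x++) {
    total += arr[x];
  }
  return total;
}

var data = new Uint32Array(1 << 20);            // contiguous, like a C array
for (var i = 0; i < data.length; i++) data[i] = i;
console.log(sum(data));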
That's a bold claim. Sure, it is theoretically possible to have two versions of the same C program take either 500ms or 8s purely due to memory layout.
But I would like to challenge you to actually do it! I.e. the same number of calculations, on the same amount of data, and a factor of 60 run-time difference, with only the memory layout as the actual difference between the two implementations.
I'll take that challenge. Here's how to do it, with a little bit of sneaky interpretation of 'memory layout':
Have an algorithm which takes a large struct S and looks at a small subset of fields Sf to determine what to do with S: for some values of Sf use all of S, otherwise skip it (e.g. when the distance between 2 particles < threshold, calculate the force).
Now give the filter a very low pass rate, so that total time ~= time taken to read all of the Sf values.
Make Sf a single byte and S >= 64 bytes (page size if you really want to beat it).
And now compare array-of-structs vs struct-of-arrays ;). You should see a ~64x performance difference in the asymptotic case.
AoS will use one useful byte per cache-line read; SoA will use all 64 bytes of each cache line. If you're memory bound this will translate to an almost 64x speed difference. There are other ways to achieve the same effect, but that is the gist of getting 64x performance: using 1 byte of each cache line vs all 64. If you want to go really crazy, use a single bit and a bitfield array for the SoA. If you use a bitfield the passing chance doesn't have to be that low.
I happened to have to present on AoS vs SoA today so I have a less extreme benchmark to show the difference (~16x because it filters over 32 bit ints)
So you can write code that is 64x slower when you specifically optimize for a slowdown. Cool.
Can you run that in WebAssembly and get the same slowdown? If so, most of the same things that allow C/C++ to be faster than other languages natively will also allow them to be faster in the browser.
Yeah, this will affect any language that stores structs as POD (plain old data), as long as it has somewhat sensible alignment rules and allocation (i.e. doesn't hide every struct behind a pointer, doesn't align chars to 8 bytes).
Sorry, I don't quite understand the explanation...
Do you have some simple pseudo-code?
And are you using the same number of instructions in both cases?
Note also: If you are moving data structures of different sizes, you are not using the same number of instructions(calculations), as the challenge required.
>Sorry, I don't quite understand the explanation...
define a

struct S { char flag; char data[63]; };   // 64 bytes: exactly one cache line

then do

std::vector<S> items(big_number);
for (const S& s : items) {
    if (s.flag == 0) {   // flag is set to be very rarely 0
        do_something_with(s.data);
    }
}
This will pull the entire struct into the cache on each iteration, because loads from memory happen at cache-line granularity (i.e. 64 bytes at a time).
then the optimized version is
// define S as a struct of arrays now
struct S {
    char* flags;          // one flag per element
    char (*data)[63];     // one 63-byte payload per element
};

S s = initialize_S(big_number);
for (size_t i = 0; i < big_number; i++) {
    if (s.flags[i] == 0) {   // flag is set to be very rarely 0
        do_something_with(s.data[i]);
    }
}
This one will load 64 flags into the L1 cache at a time. Because the L1 cache is much, much faster than main memory, the flag checks will be ~64x faster. Because it is rare for the loop to enter the if, this means the overall speed will be ~64x faster. As shown by the graph I posted, where it is about 16x faster (because I use an int instead of a byte as the flag).
Syntax is slightly different between SoA and AoS, but functionally they are the same except for how the multiple S's are laid out in memory.
>And are you using the same number of instructions in both cases?
>Note also: If you are moving data structures of different sizes, you are not using the same number of instructions(calculations), as the challenge required.
Using exactly the same number of instructions is an arbitrary and impossible requirement. I'm not going to handwrite assembly for some internet points, I only do that for fun ;). Do they perform the same amount of CPU work, though? Yes. They are functionally the same and use the same types of instructions. That's about as good as anyone can get it. In any case, the program is completely limited by memory throughput, not computation. I could make one of the inner loops a sqrt and the other a nop and the difference would be exactly the same.
If you really want to see it in assembly to prove it uses the same amount of instructions, then it's just a mov rax, [rcx], cmp rax, r8, je INNER_FUNCTION, inc rcx {sizeof(S),1} depending on the method. Add 0x90 where appropriate to match instruction counts. The rest is left as an exercise to the reader.
Any entry level optimization course will make you implement something like that; do a matrix+matrix addition, but traverse it in row-major order in one program, and column-major in the other. If the matrix is big enough, you will easily get a 10x difference. For even larger sizes (once you thrash the TLB), you will get an additional factor.
You might not get up to 60, but this is just a simple matrix addition. Even just matrix multiplication might be enough to get a 60x difference just with optimizing for memory hierarchy. I expect that high-performance memory bound programs are more complex than matrix multiplication, so the parent comment's claim doesn't seem too unlikely to me.
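If you want to try that exercise directly in the browser, here's a rough sketch you could paste into a console (sizes and names are just illustrative; with row-major storage the column-major walk strides 16 KB between accesses):

// Matrices stored row-major in flat typed arrays: element (i, j) is at i * n + j.
var n = 2048;                        // 3 x 32 MB, big enough to spill the caches
var a = new Float64Array(n * n);
var b = new Float64Array(n * n);
var c = new Float64Array(n * n);

function addRowMajor() {             // walks memory sequentially
  for (var i = 0; i < n; i++)
    for (var j = 0; j < n; j++)
      c[i * n + j] = a[i * n + j] + b[i * n + j];
}

function addColMajor() {             // same arithmetic, strided accesses
  for (var j = 0; j < n; j++)
    for (var i = 0; i < n; i++)
      c[i * n + j] = a[i * n + j] + b[i * n + j];
}

console.time('row-major');    addRowMajor();  console.timeEnd('row-major');
console.time('column-major'); addColMajor();  console.timeEnd('column-major');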
Sorry, I didn't mean to question whether it is possible. I just think that a factor of 60 sounds tricky to achieve if we are just talking about RAM (no disk involved).
The first time I encountered this myself was doing texture mapping on the 486 back in the mid 90s. Texels would be laid out by row, and the mapper would draw horizontally. Texture maps that fit within the L1 cache would draw fine no matter which rotation they were drawn at.
However, texture maps that were larger than the L1 cache would be drawn fine as long as the texture map wasn't significantly rotated. But if you rotated the polygon by 90 degrees you'd see a significant drop in frames per second, because you'd skip a whole row between drawing each texel, which would effectively cause a lot of cache misses. OK, I didn't manage to explain that well, but I hope it is understandable.
Obviously, I have encountered this many times since, in many variants (including disk read/write), and I definitely agree that a factor of 10 should be possible.
I guess I am just wondering if the difference between RAM and L1 has reached a factor of 60, and how easy it is to come up with something where the cache misses all the time, or if caches have become smarter.
When N is 100, all fixed costs in the inner loop are multiplied by 1,000,000. So a nanosecond difference is now a millisecond. It's pretty easy to get worse than that by boneheaded data layout (row-major vs column-major is pretty common)
If you use a "bonehead" data structure, each memory reference will force a cache miss, i.e. taking q CPU cycles instead of 1 CPU cycle.
If the algorithm has a running time of O(N^3), it means it uses some number of instructions <= c * N^3 to complete. Let's, for simplicity's sake, say 1 operation is identical to 1 CPU cycle.
Now, if each operation takes q CPU cycles instead of 1 cycles due to cache-miss on every operation, the running time of your algorithm is still bounded by q * (c * N^3) operations.
For the example with N=100: the fast version takes c * 1,000,000 cycles to complete, for some value c. For the slow version, as each operation is now q times slower, it will take q * c * 1,000,000 cycles to complete.
I.e. it takes q times as long, not q * q * q.
What you were saying is that, if you run an O(N) algorithm on a 100 MHz CPU, it will take 10 times as long as on a 1000 MHz CPU, but an O(N^3) algorithm would take 1000 times as long. And that is obviously not true. Both will take 10 times longer.
I'm saying that a poor choice of data structure will have an amplified impact on an O(N^3) algorithm.
Giving it as an example of where you could have the different implementations of the same algorithm have massively different runtimes on the same processor.
If the algorithm has a running time of O(N^3), that implies it takes at most c * N^3 operations to complete, for some constant c, regardless of data structures. That's the definition of asymptotic running time.
If we disregard c, and set N = 100, that is 1,000,000 operations.
We have two situations. 1) Either we use a good layout where each operation is fast because each memory access is from cache. 2) Or we use a bad layout where each operation is slow because we have a cache-miss and have to access the RAM. These access times are independent of the specific algorithm, and only depends on CPU architecture.
Let say that the fast operations each take 1 second.
And the slow operations each take f seconds.
In the fast case, with good memory layout, the algorithm completes in 1,000,000 seconds.
In the slow case, with bad memory layout, the algorithm completes in f * 1,000,000 seconds.
The time of each operation we perform does not depend on the size of N. However, in practice, you'll typically have a lot of cache hits even with a bad memory layout, so the difference may be less than a factor of f for smaller values of N (where you'll have more cache hits in a cubic algorithm). However, it will be upper bounded by f. I.e. there is an upper bound on how much slower you can make an algorithm by changing the memory layout.
Obviously N and f are independent (although they might not be, your N may dictate how often a cache miss occurs).
Right, so now let's assume that our O(N^3) algorithm is simply a triple nested loop. And that we have an expensive operation f, but that through choice of data structures we can choose in which layer that operation occurs. All other costs are constant.
If f is in the outer loop, it happens N times, in the middle loop N^2, in the inner loop N^3.
So our choice of data structure impacts our running times by any factor from linear through cube of N.
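As a toy sketch of that (f here is a hypothetical stand-in for whatever the data structure makes expensive, purely for illustration):

// Same O(N^3) triple loop; the only difference is which layer pays for f.
function f(x) { return Math.sqrt(x + 1); }   // stand-in for the expensive operation

function fInOuterLoop(n) {                   // f runs N times
  var acc = 0;
  for (var i = 0; i < n; i++) {
    acc += f(i);
    for (var j = 0; j < n; j++)
      for (var k = 0; k < n; k++)
        acc += 1;                            // the cheap constant work
  }
  return acc;
}

function fInInnerLoop(n) {                   // f runs N^3 times
  var acc = 0;
  for (var i = 0; i < n; i++)
    for (var j = 0; j < n; j++)
      for (var k = 0; k < n; k++)
        acc += f(k);
  return acc;
}

console.time('outer'); fInOuterLoop(300); console.timeEnd('outer');
console.time('inner'); fInInnerLoop(300); console.timeEnd('inner');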
I was simply answering the question of is it really possible to have two implementations of the same algorithm have such disparate run times, and it's trivially possible.
Another way of saying it, is that the mistake need not look expensive on cursory inspection, a simple way to have very different runtimes is through the accumulation of a large number of small costs.
Maybe the important point is to use a profiler.
I've recently gained a 60x improvement in real-world run time of what is considered in the literature to be the best known algorithm for a particular problem. The speed up didn't come from improving on the algorithm, it came from improving data locality, removing memory allocations, and general modern code hygiene (to be fair the canonical implementation is ~20 years old)
No SIMD support, there's no dynamic memory allocation happening at all, data layout should be mostly ok (the hottest data block is the RGBA8 framebuffer where video decoding goes to, and that's filled linearly).
Wait. You emulate a 3.5MHz machine on a 2.8GHz Core i5, reach ~10x the speed of the original (1.3 to 1.5 ms 'emulator time' per 16.6ms) and therefore proclaim success? :o
Games use every little glitch and quirk of an architecture. Emulating a zillion edge cases on another architecture takes a lot of power. In 2011, someone finally made an accurate SNES emulator but it needed a 3GHz CPU. https://arstechnica.com/gaming/2011/08/accuracy-takes-power-...
Heh, do it better then ;) The CPU emulation alone is about 300x faster than an original Z80 (it runs at 1.2 GHz on the same MBP); most performance is currently burned in the CPC video emulation (way more expensive than the CPU). The CPU and memory system is quite optimized, the other systems not so much.
The video config can be changed by the CPU in the middle of a scanline so you can't decode an entire scanline in a tight loop but need to run video decoding interleaved with the CPU (currently it's ticked after an entire CPU instruction though, not per cycle), and every tick of the video system needs to figure out all the decoding parameters again, because they might have changed since the last tick (and demo-scene demos use this mid-scanline-stuff a lot).
Also the display isn't driven directly by the 6845, HSYNC and VSYNC go through the Gate Array where a few ticks of latency is added, the same is true for the interrupt request that happens every 52 scanlines.
I haven't spent any time at all yet optimizing the video system though; for instance the pixel decoding might benefit from lookup tables. I first want to get it accurate enough for demo-scene gfx demos, and then will take care of performance.
As the author says in sibling comment, impersonating a processor is much easier than making an accurate reproduction of period video and sound hardware.
This depends entirely on the hardware. You can run good emulation of 8-bit computers (Spectrum, C64, CPC) at ~hundreds to thousands of times the original speed using emulators written in C. Using an emulator that barely reaches a 10x factor as a performance benchmark for WebAssembly is questionable.
That's a pretty bold claim considering that (for instance) MAME's CPC emulation eats 25% of one CPU core (checked with vsync enabled). MAME might not be the fastest because they do a lot of the inter-chip-communication via function pointers, but they're also not idling around and the CPC emulation is quite accurate AFAIK. If you can point me to an emulator that's hundreds or thousands of times faster I'd really like to take a look, might steal some ideas ;)
The way I understand it, WebAssembly is all about the size of the binary and parsing overhead. Or, at a higher level, about enabling a level playing field between more languages than just JavaScript.
Speed improvements from a common runtime and bytecode are certainly welcome, but if they are possible with WebAssembly, they are also possible with plain JavaScript, and therefore shouldn't be visible in a comparison between the two.
I, for one, hope that WebAssembly will enable, say, Lua as a first class citizen on the web.
> Speed improvements from a common runtime and bytecode are certainly welcome, but if they are possible with WebAssembly, they are also possible with plain JavaScript
Nope, a static language will execute faster than a dynamic language, because the runtime knows precisely what type everything is and how much space to allocate. Additionally, no faffing around with dictionaries for dynamic types. Currently JS engines have to figure out the types dynamically at runtime, but even then they can't always be sure, so they have to fall back to slower dictionary lookups.
Once we have a static language -> webasm, it should execute an order of magnitude faster than JS -> webasm.
Edit: Additionally, browsers will finally consume less power, CPU time and memory, because the execution runtime process will be so streamlined.
> Edit: Additionally, browsers will finally consume less power, CPU time and memory, because the execution runtime process will be so streamlined.
Don't worry, the software industry (and adtech) will find new cool ways to bloat websites so that their resource usage will be back to the usual high, while the offered features stay more or less the same. That's pretty much how desktop, mobile and web have all been evolving so far.
This can also be seen the other way around. Given the speed at which websites pile on unnecessary shit, just imagine how websites would perform if browser vendors weren't busy optimizing all day long.
That only matters if browser vendors stay ahead of the bloat curve (which I'm not so sure about). Otherwise, one could say that for a given performance point of the system (hardware platform + browser), there's a set level of bloat that software settles at, making the perceived performance constant.
That's not really true, as JS engines will already compile things with the assumption that the types won't change and "bail out" if they do.
So if you can "hint" to the JIT that a variable is an integer, and it will stay an integer, then it will not only "unbox" it and treat it as an integer, it will compile the code very similar to how a static language would.
In asm.js, this is done by using little "tricks" of JS to hint to the compiler that something is (for example) an integer by appending `|0` to it.
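A hand-written sketch of that idiom (my own example, just to illustrate the hint):

// asm.js-style integer hints: the |0 coercions tell the engine these values
// are 32-bit ints, so it can unbox them and compile a plain integer add.
function addInts(a, b) {
  a = a | 0;
  b = b | 0;
  return (a + b) | 0;
}

console.log(addInts(3, 4));                    // 7
console.log(addInts(2000000000, 2000000000));  // wraps like int32: -294967296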
So there really shouldn't be any major difference between "static language -> webasm" and "JS -> webasm" if the JS is written with performance in mind.
In practice you'll probably see greater performance from a static language simply because you "can't" violate those rules, so there doesn't need to be any "guarding" or checking, but you aren't looking at anything like an order-of-magnitude speed increase here; it'd be incremental in most cases.
After all, no JS engine out there would really tune their engines to benefit benchmarks at the expense of real-world performance.
Even with all the dynamic ability that something like JS gives you, most devs still create an object to store stuff, then don't change it. That means that an aggressive policy of "compile it as it is, and bail if it changes" ends up working much more often than it doesn't.
And in the cases where it doesn't, falling back to the "compile it dynamically" isn't going to be any slower than if they did that first, so it's basically a free optimization (as long as the compilation doesn't take that long).
The JIT being able to say "this here is an int, which means that this function takes an int and returns an int always, I'm going to compile it that way" means that now that function is "C-Speed".
If your program were filled with functions similar to that, where the compiler is easily able to optimize the shit out of it, then it will run surprisingly close to "C-Speed".
Custom types (which I took to mean different data storage formats) aren't really something that the compiler can help with (any compiler). If you use a hashmap, you've got a hashmap, if you use a doubly-linked-list you've got a doubly-linked-list. C or JS iterating over an entire hashmap is going to be slower than iterating over memory addresses.
But luckily JS has stuff like TypedArrays which are basically C arrays with bounds checking.
And if you loop over it in a way that the compiler can determine you won't go out of bounds (say, with a simple for loop that runs while x < arr.length), you've got memory access at C speeds.
So you "can" (technically, on paper) make a JS program that would run at C speeds in most cases, but at the end of the day some part of your stack is going to begin using the dynamic parts of the language, and then you get the big speed roadblocks again. (by the way, the above is basically what ASM.js is, a well curated subset of JS which doesn't allow anything dynamic so that the JIT can compile it to run at "C speeds")
The benefits of a static language aren't in what they allow, but what they don't. Tying your hands and saying "No, you can't make an array of half strings and half integers" lets the compiler be simpler and not have to worry about those "edge cases", which means that it can focus on the "fast path" more. (in practice this almost always shows up as a faster language because JITs aren't perfect, so they can get confused by code that could technically be compiled more optimally but won't because the JIT can't identify it)
It's also why adding optional static typing to a language won't make it faster on its own (after all, the JIT already knows what types variables are at call time; telling it that again won't solve anything...)
I don't think you are getting it. Here is a custom type in C#:
public class CustomType
{
public int A;
public int B;
public CustomType2 C;
}
At runtime this means any instance of this type is 32 bits * 3 = 96 bits, plus the size of any metadata. Any code that accesses an instance member such as 'customType.B' simply takes the address of customType, adds a fixed 32-bit offset, and reads the value at that address.
Now let's do the same in JS:
var list = [];
for (i = 0; i < userSpecifiedLength; i++)
{
var customType = {};
customType.A = i;
customType.B = i * 10;
if (i % 2 == 0)
customType.C = { }
list.push(customType);
}
Perfectly normal dynamic JS code, but now at runtime the JS engine has no idea what 'customType' could possibly be until it has attempted some sort of static analysis at runtime to figure out the fields. Additionally 'C' is optional, and whereas in C# that means it may be null, in JS it could be undefined too.
There is also no guarantee that 'customType' is left as is and more fields aren't added to it later. All this leads to having to use dictionaries to store the information, which then require hash lookups every time you access a member. Fast, but not remotely as fast as the C# runtime, which has already mapped every field to an offset.
That's almost the exact scenario I am talking about.
Your custom type there is perfect in JS. The JIT will VERY quickly determine that i is always an integer (actually before the code is first run it will have figured it out), and will "unbox" that to treat it as such (and not a JS "number" type). Objects in JS have a "hidden class" where they can be stamped out literally just like the C# example, so the JIT will be working with the exact same thing. A memory-mapped custom type.
Read up on "Hidden classes" [0] which can give you some insight into how it works. Basically as long as you treat a JS object like you'd treat that C# "custom type" (don't dynamically add or remove properties, stick to one type per property), it will instantly treat it identically to how C# would treat it until it changes (at which point it bails out and tries again, and if that happens too much, it will then default back to fully-dynamic).
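A minimal sketch of that shape rule (makeCustomType is just an illustrative name, not from the article):

// Objects built the same way, in the same property order, share a hidden class,
// so property access compiles down to a fixed-offset load, much like the C# type.
function makeCustomType(a, b, c) {
  return { A: a, B: b, C: c };
}

var fast = makeCustomType(1, 2, null);   // same shape every time: stays on the fast path
var also = makeCustomType(3, 4, null);

// Adding a property later forces a hidden-class transition and can deopt code
// that was specialized for the original shape.
var slow = makeCustomType(5, 6, null);
slow.D = 42;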
But even if that weren't the case, you could still write your own data storage in JS that works like that. You could do something like this:
const customType = new Uint32Array(3)   // typed, contiguous storage for the three fields
const A = 0, B = 1, C = 2               // field offsets into the array
// read and write the fields as customType[A], customType[B], customType[C]
That storage will not only be typed, but will also have LESS overhead than the C# example, at the expense of clarity (after all, that's basically a C array there). This is basically how asm.js works, and what the code looks like when you compile C code into JS. Obviously you don't want to do that in most cases, but if the JS engine is your bottleneck and your data storage is causing slowness, then it might help.
JITs have come a LONG way in the past 5-10 years. It's pretty incredible what they can do, and it's actually not that hard any more to make smaller functions that can run at C speeds in V8 and SpiderMonkey.
> Nope, a static language will execute faster than a dynamic language, because the runtime knows precisely what type everything is and how much space to allocate.
JavaScript is JIT compiled: it has to be parsed, type-checked, interpreted and eventually compiled. WebAssembly is parsed and then translated to machine code (with a much simpler compilation pipeline).
Typical JavaScript VMs have a lot of fast paths for small scripts which also benefits small benchmarks like this. For more realistic workloads, parsing and compiling takes a significant amount of time and you could skip (almost) all of it by precompiling to WebAssembly (not at the moment, since you can't generate JavaScript objects from WebAssembly yet, but that'll happen eventually).
JavaScript in particular also has the problem that its semantics is an utterly horrifying mess... This forces the generated code to be very conservative and include a lot of dynamic checks and escape hatches to the interpreter. If anything, the situation is worse than with Lua, and let me remind you that LuaJIT doesn't actually compile said "escape hatches" because of code size. There are just that many of them.
Javascript doesn't have the same type information as WASM, also it lacks a native 64-bit integer. The JIT can do some type inference to use integers internally in some cases, but it's not optimal.
Predicting what sort of Javascript turns into what kind of native code on different JIT implementations is far from simple, too.
Even if in theory, an equivalent result should be possible in some cases, it's unrealistic to expect that in practice.
> WebAssembly is all about the size of the binary and parsing overhead.
I don't think this is a good assessment.
WebAssembly needs to pass through a complete compiler backend before it can be executed.
WASM needs to be parsed, an in-memory representation is produced (probably LLVM IR), data flow and control flow graphs generated, a whole bunch of optimizations take place after which normal instruction selection, scheduling and register allocation and assembly take place.
This task is comparable in complexity to a JavaScript interpreter with a JIT compiler. However, JavaScript engines have optimized for a fast startup where the interpreter quickly starts running the program (while JIT'ing and optimizing in the background).
WASM has (or should have) two motivators: enabling near-native performance and allowing other languages to be compiled and executed in a browser. Based on this article, there's still a long way to go with performance. Most browser APIs are not accessible from WASM without some wrappers, so there's still work to be done before other languages become a viable replacement for JS.
It's not so much an assessment as one of the main stated goals of the project. The initial point of WebAssembly is to do what asm.js does but with a smaller on-the-wire size and much less processing overhead to get from the on-the-wire format to executable code. And more cross-vendor buy-in, so the results are more reliable.
> an in-memory representation is produced (probably LLVM IR),
It's not LLVM IR in either Firefox or Chrome, fwiw. In the case of Firefox, it's the IR that Ion (the "fastest" tier of the JIT) uses. I expect this to be a pretty common approach across browsers, actually; it lets you leverage your existing code generation infrastructure.
> This task is comparable in complexity to a JavaScript interpreter with a JIT compiler.
Not really. Once you've got the IR you just use your existing codepaths for it; the IR generation from webasm is a much simpler task than IR generation from JS, complete with runtime instrumentation and whatnot.
Most popular dynamic programming languages are well-loved because they have huge standard libraries, which largely cannot be ported to a browser sandbox. This is why I don't see how, say, Python or Ruby would be good sources of WebAssembly.
Lua, on the other hand, is a tiny language. It is remarkably well-designed, with a very simple type system, a very simple syntax, but nevertheless, Lua is extremely flexible.
I have spent a few years with Lua, and I have come to admire it, for it feels as flexible as Ruby, but at the same time more consistent than Python, and faster and simpler than either of those. And, needless to say, none of the inconsistencies of JavaScript.
Lua's stated goal is embeddability. The whole language including the parser and the standard library fit in a few hundred Kb. The bytecode only has a handful of very simple instructions. The source code is exceedingly well-structured and easy to understand.
I think Lua could be a very simple drop-in replacement for JavaScript, doing exactly the same thing that JS does today, but using a much nicer programming language. And it already comes with an object system, modules, and a string manipulation library.
Sailor author here. They probably mean on the front-end! :)
For that we have Starlight (http://starlight.paulcuth.me.uk), which I use on Sailor as well. Very excited for the day we can run Lua natively on the browser!
I think there is no JS version implemented which accesses the bodies' parameters like this:
x = body.x[body_index]
I expect this should be much faster than accessing them like this:
x = body[body_index].x
Because the latter requires a pointer-from-property calculation for every single value accessed (which must be somehow optimised).
The former just requires a pointer-from-property calculation for every array (not element) involved in the nbody loops; the individual values are picked out by array indexing, which is inherently much simpler than property access.
I've used this method for nbody physics. I can modify a test to it and push to the repo.
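For reference, the layout I mean is roughly this (a sketch; field names are mine, not the benchmark's):

// Struct-of-arrays storage for the bodies: one typed array per field,
// indexed by body number, so inner loops read plain array elements.
var N = 5;
var body = {
  x:  new Float64Array(N), y:  new Float64Array(N), z:  new Float64Array(N),
  vx: new Float64Array(N), vy: new Float64Array(N), vz: new Float64Array(N),
  mass: new Float64Array(N)
};

function kineticEnergy(body) {
  var e = 0;
  for (var i = 0; i < body.mass.length; i++) {
    var vx = body.vx[i], vy = body.vy[i], vz = body.vz[i];
    e += 0.5 * body.mass[i] * (vx * vx + vy * vy + vz * vz);
  }
  return e;
}

console.log(kineticEnergy(body));   // 0 until the arrays are filled in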
Well, I got it completing and it's taking about 35% longer than the 'original' method. I do expect index lookups to be faster than property lookups, so I believe there must be some inefficiency in my hasty refactoring.
Or there might be an issue with there being only 5 bodies in the test. I'm a bit puzzled.
I find this is running 20% faster than the fastest on my Chrome browser (an old version), but slightly slower on Firefox, though it catches up a bit with 11 bodies.
I expect it could run faster yet by crunching the code up more and maybe removing objects altogether, but I made as few changes as possible just to compare the difference in addressing.
There is also a function "offsetMomentum" which tweaks the sun's movement according to the momentum of the planets; it seems very extraneous to me, gravitational influence should surely be sufficient in itself regarding conservation of momentum. I suspect that function is more likely a catch for large errors which can result from point grav bodies getting too close to each other.
Thanks. I note a different aspect of SoA is that when different functions or loops involve only a subset of the fields, only the arrays of the involved fields are fetched. With AoS the fields are scrunched in memory next to each other by record. That could benefit algorithms which only touch some of the records; however, every record usually needs to be accessed to see whether it's involved or not - the 'involving field' would need to be stored separately for AoS to have any benefit.
Forgive me a boast about an nbody simulation I've been developing: I got the details of 60 planets and moons etc. from NASA's JPL server and put them into it. Running at a timestep of 30 minutes to hours (virtual time) for a year, the Earth ends up within a moon's orbit of where JPL says it's supposed to go. I think the inaccuracy is due to the limitation of JavaScript's 64-bit float (rather than, say, relativistic effects). Not sure yet, Solar system scales are so huge.
When dt is too large and bodies get too close, they get super-attracted to each other in one step, and by the time of the next step they have travelled too far to get slowed again; this is where I see major errors can occur. It's necessary to put a throttle in the force calculation, or to reduce dt, to deal with that 'near missing'. Or to sub-step, but that seems to require excessive adjustments.
Nice. Agreed on errors with big dt. For me it is usually when shortening the time step to try to minimize for these that floating point errors start to come in.
Now chuck in some velocity dependent forces and start screaming (that's what I've found with physical simulations anyway).
One trick which improves stability heaps is 'tempering' the data to fit the dt. We've got the Verlet integration thing where they say the velocities should be half a step behind before updating forces. I find they should also be reduced a little, as objects are cutting through the analog curve of their orbits in the timestep - as a hypotenuse instead of an arc. When I tempered the data like this, orbits were all much stabilised. It's required for objects if they are added to the model later too. To have their precise velocities measured while tempered, they also need to be 'de-tempered' to tell where they are going precisely. Just calculate their acceleration, move them half a step, and adjust for the ratio of hypotenuse/arc.
"velocity dependent forces"
I have been wondering how to implement a sort of quasi-electromagnetic force without the big expense of maintaining a magnetic field. I find the electro-magnetic force mind blowing compared to newtonian gravity. Like charges seem to be able to attract when travelling together, this would seem to produce patterns and structure more readily than gravitation.
Perhaps when things are very simple, but we can end up with thousands of objects, each with dozens of optional properties. With the fields declared as arrays, only each array needs a GVN.
Our benchmarks show amazing results... but not in speed. WebAssembly for us is about the size of the binary and the speed of parsing more than the speed of execution... A Javascript JIT with enough execution data can be even better than C in theory.
This is called profile-guided optimization.[0] Slightly off-topic, but even Rust has it, courtesy of LLVM.[1]
It doesn't seem very popular unfortunately. Most web/tech companies are obsessive about capturing user interaction data for UX purposes, but it would be nice if that data could be fed back into a profile-guided optimizer.
In theory the compiled WebAssembly output would always be optimized for the latest usage trends, provided there were frequent builds. Depending on the product, it may even be possible to classify users into various high-level categories based on their behavior, and serve them a binary that's optimized for their specific use case. The latter's level of specificity probably wouldn't be worth it though.
If you have dynamic data the best optimizations could change, and you can't feed the profile back to the compiler while the program is running. (Well... maybe in theory)
The JVM has been capable of this for a while. I think the overhead of profiling and applying optimizations has always been greater than any efficiency gains.
'the overhead of profiling and applying optimizations has always been greater than any efficiency gains.'
Not necessarily, especially if you explicitly rely on them.
For example in C++ you deliberately have to avoid overusing virtual functions if you don't want the overhead. In Java, virtual functions are the default, and you usually don't care: if you call them a lot (e.g. in a tight loop) JVM will adaptively give you the proper implementation - without per call indirection - or may even inline it.
If you code C++ in Java style then the JVM will win - so the overhead of (runtime) profiling and applying optimizations is not always greater.
G++ is able to optimize virtual calls to non-virtual calls in most cases (speculative devirtualization) by adding quick checks into the code (like if (obj->vtable == Class::vtable) Class::virtualMethod(obj);), which the JVM would do as well.
So if G++ is able to tell the code paths leading to the virtual call, it does the same optimization as the JVM.
I think that's almost certainly false. JIT'd java is almost certainly faster than bytecode interpreted java.
edit: or do you mean that the overhead of applying optimizations and profiling outweigh the performance gains relative to compiling from something like c directly to machine code?
I mean the collection and application of the optimizations take up more CPU than the optimizations remove. This isn't JIT but real time optimizations (https://en.wikipedia.org/wiki/Adaptive_optimization) while the system is running.
It depends a lot on things like how long the program runs for, and whether you consider spare cores on the machine to be "taking up CPU" or not.
Profile-guided optimisations win you about 20% in most industrial Java, apparently. So if you have a server that starts, warms up in say 30 seconds, and then runs for a week, you can see that it is easily profitable. For a command line app that lives for 500 msec, not so much.
I've always wondered what C running on a JIT would be like. As far as I'm aware all the modern JIT languages are quite high level and do a lot more than C to begin with, so it would be interesting to see if you could eke out even more speed from C using a JIT.
A JIT faster than C? I thought in theory it would always have to be slower because you have the overhead of the JITing. Also, isn't compiling C to native, the best you can hope for anyway? What could be faster than that, except for asm?
I've written all my Advent of Code 2016 puzzles in Rust, which I then compiled to asm.js and WebAssembly, and both asm.js and WebAssembly always outperformed the idiomatic JavaScript solutions by a factor of 10x and sometimes even up to 40x. Firefox has the best asm.js and wasm support, with Edge being the second fastest browser and Chrome being the slowest (I only tested those 3). WebAssembly is always faster by about 10% than asm.js in all the browsers. However asm.js supports vector instructions and wasm doesn't. So when your inner loops are vectorized, your asm.js even runs as fast in Firefox as the same code compiled as native code.
Do you have a link to your puzzles or perhaps a blog entry with a more full write up of your findings? There are not too many examples like this in the wild yet and another data point would be interesting.
I also hoped to see a bigger improvement, but it turned out that Javascript is already very fast for nbody.
Maybe I should have picked a well known numeric benchmark where Javascript is still far behind - any suggestions for that? Or are the Javascript VMs already too good for numeric benchmarks?
You can try increasing `N` in the nbody problem, and also measuring the memory overhead.
You could also try one of the hashing / crypto algorithms in JS. They should involve a lot of integer arithmetic that should make WebAssembly stand out.
More tips:
* for performance measurement, set up the benchmark so that the JS run takes at least 30 seconds. (Increase N, etc)
* close all other applications and tabs
* If you have linux, set the CPU governor to performance
* Measure the CPU temperature and make sure you let the CPU cool down between runs. In modern CPUs, the cores get throttled automatically when they reach a certain temperature.
Probably yes ;) Javascript has been compiled down to machine code for quite a while. And if the code is very 'static' (like asm.js) the JS engine will leave the code alone, and the garbage collector will be inactive. Performance between asm.js and WebAssembly is pretty much identical at the moment, and with a 10% to 20% disadvantage vs native there isn't that much room for drastic improvement. I think WebAssembly will mostly bring improvements for 64-bit integer operations and vector math, but for average integer and floating point code, JS is already very fast.
I think the biggest advantage of WASM will be predictable performance. E.g., no sudden garbage collector pauses, and more predictable across different browser brands (eventually).
Well remember, WebAssembly is just a different representation format for the same instruction set, executed by the underlying virtual machine (v8, spidermonkey, etc). So you shouldn't expect to see much of a performance difference.
The main benefit of WebAssembly is so that we can write code using our favorite languages (not just javascript!) and compile to a common binary format the browser understands. The objectives of WebAssembly never had anything to do with performance.
> The main benefit of WebAssembly is so that we can write code using our favorite languages
JavaScript is already a common format that browsers understand, and there are implementations for a lot of languages translating them to JavaScript. So WebAssembly definitely is about performance, because using other languages is already a solved problem.
As far as I understand you can call JS APIs from WebAssembly, which means you use the already available DOM APIs. I don't know exactly what the calling convention from WASM to JS looks like, but I guess it's specified somewhere.
As a comparison, I wrote an identical program in both C++ and JavaScript. It was doing repeated calls to a kd-tree, and was CPU-limited in both cases. The JavaScript version was slower, but only by a factor of 5-7 or so. I was rather surprised, because I was expecting a factor of 100-500, as I get with C++/CPython.
I don't expect WebAssembly to improve much on the speed, because it is already pretty fast.
Javascript engines are fairly speedy compared to CPython; not only are they industrial-grade JITs (as opposed to CPython being a naive interpreter), but JS is an easier language to optimize. Further, WebAssembly has overhead.
The CPython interpreter has actually seen a lot of performance work, they've long been pushing the boundaries of interpreter optimizations that you can do portably & compatibly. I wouldn't describe it as a "naive interpreter".
The promise that WASM holds IMO is for portability, not (merely) performance. How cool is it that I could write a game or application or library in C/C++/Rust/golang/swift/etc and expect it to be able to run on any modern CPU but also a browser?
I wouldn't say that's "cool", that's just what browsers could do 20 years ago with Java applets and ActiveX. Lots of languages target the JVM, and you can compile C to it as well if you're willing to treat memory as a large byte array (there is even a project that JIT-compiles LLVM bytecode on the JVM).
The article shows that WebAssembly implementations often struggle to match Java. I see no reason to believe that browsers are actually easier to secure than a JVM, especially Firefox which doesn't even really do renderer sandboxing. They seem about the same level of difficulty. All the browser guys have done here is reinvent the same concepts of 20 years ago with a different instruction set that has "Web" in the name.
Worth remembering this is an early version of WebAssembly that is semantically similar to asm.js. Further performance improvements are very likely; you can see some of the planned improvements here:
JavaScript is not the slow beast it used to be; when Google built V8 for the launch of Chrome, JavaScript got seriously fast. A lot of improvements have been made since then.
Just wanted to say thanks for being so thorough in describing the compilation flags, versions, etc.
It's really refreshing :)
You also may just want to s/gcc/clang/; none of the gcc compile commands you are running are using gcc, they are all clang/LLVM.
Apple just aliased the gcc binary to clang on Macs.
I'm surprised Chrome doesn't manage to be the fastest in any of these tests, considering the focus Google has put on performance and the resources I thought they were devoting to Chrome.
It appears as if they'd come out ahead on average, but not by a definitive margin. Especially surprised by FF which I thought had somewhat dropped the ball (and/or simply been overrun once Google's business objectives aligned with the users' interests).
He does talk about the laptop he uses at the end of the post.
>All tests were performed on a 2015 MacBook Pro, 2.5 GHz Intel Core i7, 16 GB 1600 MHz DDR3. For all tests the best of three runs was selected for the result.
I didn't. I believe that it's better to run a few times and take the best run. I paid for the turbo mode CPU and I'd like to know the performance on my machine. There's a lot going on on my machine, and even more when a browser is running. The only thing (besides running on a clean machine) one can do about it is to measure multiple runs. Three runs are maybe a bit too little for scientific results, but I considered them good enough for a casual benchmark.
I really want to like web assembly, but every time I read about it I SMH. We already have a bunch of great runtimes, compilers, and opcodes. I hate to see the effort duplicated yet again. Or maybe I'm missing something obvious?
> We already have a bunch of great runtimes, compilers, and opcodes.
I'm confused. Are you suggesting we, for example, use the JVM for this instead? WASM is being designed specifically with the web in mind, and has a lot of design goals which just weren't a concern for other runtimes. See: http://webassembly.org/docs/high-level-goals/
And Java was designed with the web in mind too. You could access the DOM from applets back in the day, it's not fundamentally so different.
I feel like "designed for the web" doesn't really mean much when you zoom out. When I look at that bytecode set, it is less strongly typed and has more opcodes than JVM bytecode, but otherwise is rather comparable. It isn't based on any actually existing hardware instruction set so it must be interpreted or JIT compiled. It has explicit opcodes for things like popcnt and clz instead of relying on compiler intrinsics, but that's not anything that matters when you need a translation layer anyway. It has bytecodes that do static and virtual method dispatch, just like JVM bytecode.
The main difference seems to be that it doesn't provide GC or locks or other things that higher level languages need.
As far as I can tell, the browser community insists on constantly reinventing the wheel because of a kind of NIH syndrome often justified by security, but how robust is that justification? You can't have regular sockets, you need "WebSockets", although the web has permissioned APIs in many other areas. You can't use any existing VM and bytecode set like the JVM because they're "insecure", although both Firefox and Chrome are constantly patching sandbox escapes too: it just comes with the space. However they're "web" and thus excused, Microsoft/Java aren't "web" and thus guilty. Ditto for Flash. You can't have threads like C#/Java have because Javascript engines happen not to be thread safe, instead you get "Web Workers" which are really just the Visual Basic 6 concurrency model reborn without the benefit of DCOM to assist with the messaging, but now it's "Web" it's hip and cool.
Maybe I'm just the grumpy old man yelling at the clouds, but browser tech is stuck in a hamster wheel where they insist on reinventing things that were already working fine elsewhere in the industry.
Well not exactly, but I feel like the JVM, LLVM, parrot, CLR, or any one of those that have well defined opcodes could be leveraged instead of producing something entirely new. There's quite a bit of investment in those projects... Does that make more sense what I was trying to say?
I am unsure how JVM or CLR are relevant. WebAssembly is not a virtual machine byte code (and neither is LLVM, despite the name). As the name "WebAssembly" suggests, it is like an assembly language level target for "the web" (more precisely, for JavaScript interpreters found in web browsers).
It is true that there is a lot of unfortunate duplication of effort and fracturing of technology happening here, due to historical legacy issues with web browsers. JVM and CLR are indeed highly mature and presumably secure VMs today. If we were to wave a magic wand and rewrite the history of the web browsers, perhaps it would be much better if JavaScript never existed and browsers used an off-the-shelf, standardized VM like JVM or CLR and we wrote web apps in C# or Java (or, at our discretion, F#, etc). This wouldn't solve the problem that WebAssembly is solving (i.e compiling native code, eg C++, for execution in the browser), but then maybe we wouldn't have needed that today if everyone (native and web) had standardized on a single VM target. But that's just not how things turned out, browsers created their own parallel technology stack from scratch so things like WebAssembly need to be created from scratch to work within the constraints that we are stuck with.
I wrote more on this above, so I won't repeat myself in this comment, but yes, WebAssembly is a "virtual machine byte code". It is literally a bytecode language that doesn't target physical machines. It bears no resemblance to x86 or ARM, so it has to be JIT compiled or interpreted.
Saying it doesn't target a VM because it targets "the web" is meaningless.
Fair enough. You are correct, WebAssembly is a bytecode for the underlying machinery of current ECMAScript/JavaScript engines. I think you raise a very good question, why did the WebAssembly committee decide to implement their own stack machine bytecode instead of re-using an existing bytecode like JVM or CLI (which, interestingly, is an ECMA standard)? It's troubling. The closest explanation I could find is:
"Why not a fully-general stack machine?
The WebAssembly stack machine is restricted to structured control flow and structured use of the stack. This greatly simplifies one-pass verification, avoiding a fixpoint computation like that of other stack machines such as the Java Virtual Machine (prior to stack maps). This also simplifies compilation and manipulation of WebAssembly code by other tools. Further generalization of the WebAssembly stack machine is planned post-MVP, such as the addition of multiple return values from control flow constructs and function calls." [1]
If you compare manually optimized JavaScript with manually optimized WebAssembly (which was not available to the OP?), could you expect WebAssembly to be consistently faster than Java on all browsers?