The point on the unicode to_lowercase is quite interesting. I wonder if the javascript version is faster because it's less safe or because the JIT compiler identifies properly that unicode-awareness is useless and specializes the code.
The author has a follow up article[1] on their blog about it. They rewrite the rust code to eliminate the extra allocations they identified at the end of this post. It then performs 8 times faster than the javascript version.
However, I think they could have simply iterated on the code from the first post:
There are a lot of allocations because two strings are created every time they are pushed to the queue (.to_string()):
This is unnecessary, because candidate and token are both a single character.
Rust has a char type that is useful to store single characters. Especially, unlike String which is heap-allocated, char is a primitive value that resides on the stack and CPU registers. Since we are comparing single characters, it's much better to use char.
So if the queue is defined as a Vec<char> instead of a Vec<String>, and the comparison function took as argument char rather than &String, it will spare a lot of allocations and this will speed up the final application.
This has also the advantage of removing the references, making the code cleaner. Using Vec::with_capacity(n) instead of Vec::new() will also help lowering the allocation load by preemptively allocating a large enough memory slice. With Vec::new(), the vector will reallocate a few times.
IMO this is no small part of why C still has its reputation for speed. Think how you'd write this in C. Maybe
/*
Recursively removes doublets from input. A doublet is a pair of adjacent chars,
identical or differing only in capitalization (eg. "aa" or "aA"). The result is
written to output, which must be at least input_length bytes long.
Returns the number of bytes in the result.
*/
size_t remove_doublets(char * output, const char * input, size_t input_length);
and the body is a dozen lines of pointer pushing and tolower. Do you faff around with malloc? unicode case conversions? hashmaps??? Of course not, that would be too hard, so you just write the stupid C code. For all its faults, the C idiom really does push you towards the simple, stupid answer to code like this.
That's also a problem with C though. Because the complex answer is such a pain to write, people go for the simple answer that doesn't handle all the cases that it actually should. Like ignoring Unicode case conversions means that it will only work as expected on English input.
Really, no matter what the language is, people need to actually put some thought into what they're doing.
Both good points. C gives you rope to hang yourself with when it comes to correctness; Rust gives you rope to hang yourself with when it comes to performance. Personally I prefer the latter, but awareness and understanding of what's really going on is important no matter what language you're using.
The other thing is say algorithm selection. Some data structures may be more performant for the job at hand, but do you really want to package or build them yourself? A lot of the time the answer is no.
> I wonder if the javascript version is faster because it's less safe or because the JIT compiler identifies properly that unicode-awareness is useless and specializes the code.
Yes I read the Rust code and just knew it would likely not be dramatically faster than the JS - certainly not as fast as it could be. This is why all languages reward study and depth of knowledge, as you just avoid some of the patterns which kill your performance almost instinctively after a while.
The whole "I wrote it in JS and then ported it" introduction meant I was expecting exactly that kind of problem.
IMO Rust's main performance advantages come from avoiding allocations. Not only does it give you more control over them like C and C++ do, it also lets you do things that you could technically do in those languages, but wouldn't, because they're too dangerous. So you avoid even more allocations in practice (you don't need to defensively clone strings, for example)
> char is a primitive value that resides on the stack and CPU registers
This is inaccurate. Vec<char> will store chars on the heap. & the concept of primitive values exists in Java, whereas in Rust any struct you declare can exist on the stack (or in CPU registers if the optimizer decides to)
But yes, char is a 32 bit integer as opposed to String being a Vec<u8> under the wraps, & Vec<char> is preferable to Vec<Vec<u8>>
Node.js incurs a fixed initialization time of 0.09s on my machine, which in a real-world scenario has already occurred, and isn't a factor in runtime performance. That makes Rust only 5 times faster.
If the author had provided the input data, I would optimize the JavaScript further so that Rust and JS would be more even.
They also call to_lowercase() too many times. They should start out by creating an char array with all letters lowercased and a boolean array to tell if the char originally was lowercased.
This is why no language is "just faster" than another. The languages that are commonly associated with being "fast" can be way slower than any "slow" language if written wrong.
Yes Rust, C, C++ can be fast. Really fast. Impossible-to-beat fast - if you know what you're doing. If you don't, it's going to be worse than any trivial python, lua or js that does the same thing.
Yeah well it's pretty obvious that you can write slow code in any language.
The interesting question is whether the code produced by the average programmer of that language is fast or not. And there language choice matters quite a bit, both because of language design (i.e. how easy is it to generate good machine code for that language) and because of the culture associated with that language (e.g. you can write extremely efficient Java, but typical Java programs chase a lot of pointers and generate a lot of garbage).
>whether the code produced by the average programmer of that language is fast or not
Not just this, but the ceiling of performance improvements matter a lot too.
Rust allows you to write high-level code, where you can "easily" jump into Rust coming from JavaScript. You performance might be on par in your initial versions while you are getting into the language.
It starts to matter a lot when you now have a working application, and you start using it. With JavaScript you will quickly reach a point where your idiomatic JavaScript is about as fast as it'll go.
With Rust, there usually keeps being tricks you can pull to squeeze more performance out of it, and you can do it just for the hot-paths of your code instead of rewriting the entire thing in a new language because you reached your limit.
A good example of this limit is something like the TypeScript compiler. It has huge performance issues with larger codebases, but unfortunately there's not much that can be done without rewriting the entire damn thing in e.g. Rust, because you've reached the limits of JavaScript.
>Not just this, but the ceiling of performance improvements matter a lot too.
But the floor can matter too. For someone trained to write high-performance JavaScript, JavaScript's performance floor is higher than Rust's, and they can write more JavaScript faster.
> A good example of this limit is something like the TypeScript compiler. It has huge performance issues with larger codebases, but unfortunately there's not much that can be done without rewriting the entire damn thing in e.g. Rust, because you've reached the limits of JavaScript.
Actually, you can fix it by rewriting it as a single-pass compiler to improve the algorithms.
>But the floor can matter too. For someone trained to write high-performance JavaScript, JavaScript's performance floor is higher than Rust's, and they can write more JavaScript faster.
I'm not sure what you mean by the floor here? Experience? The author of the article was experienced in JavaScript and was new to Rust for example.
>Actually, you can fix it by rewriting it as a single-pass compiler to improve the algorithms.
I'd say that's on the same scale of work as rewriting it in a new language.
And even then, this is most likely not a realistic solution, I can almost guarantee you there will be stuff they are currently doing that they cannot do in such an architecture.
The first problem with the words "single-pass" and "Javascript" that come to mind are the out-of-order variable declarations, i.e. var, which bubbles up.
It's a really interesting and important question though; Do you measure a languages "performance"
1. by how fast it runs when written by a very experienced developer, or
2. by how fast programs run that are written by the "average" programmer, or
3. by how fast it runs when written by a bad or inexprienced programmer?
For each of these I can think of (at least) one language that would beat the performance of any other. There are languages that are "very difficult to use wrong", and there are languages that are difficult to use right. For the latter, C++ is a great example. You will copy and leak yourself out of memory before you terminate, if you dont know what you're doing.
Maybe we should stop comparing random languages from different domains by how fast they run.
My first take at an answer is, what sort of code do I anticipate writing in each language? And what sort of code do I expect to see from my peers?
Rust makes it surprisingly easy to write code that absolutely thrashes the memory allocator - using String/Box/Vec and adding .clone() everywhere makes a lot of tricky borrowing problems go away. Is that the kind of code I expect to be writing? Or running? And if so, we shouldn’t expect that kind of code to perform much better in most cases than javascript.
You're absolutely right, these are the kinds of questions any team should be asking
* Is language X that we all know suitable?"
* Is there a language Y that we could all learn in an acceptable time optimal?
* What does the tooling look like?
* How would we bring in new people?
C, which is usually the fastest language in all benchmarks and the one others compare to is not inherently fast. In fact it is often considered faster than C++, which makes no sense since you can write C code in C++ (mostly) and the compiler usually has the same internals.
The biggest reason, I think, is that C encourages you to be memory efficient. Allocating memory is not trivial and you have to keep track of all your pointers. On the other hand, it makes it easy to modify things in place, where other languages favor immutability. But if you do a lot of allocations, create a makeshift garbage collector, and memcpy a lot, then your C code will be really slow, because you are basically doing what makes other languages slower, but worse because other languages are heavily optimized for that use cases and your code most likely isn't.
An illustration: the C++ solution takes a std::string by value, incurring an unintentional copy of the buffer. C++ makes it easy to make this mistake. C does not. But for the same reason C++ also makes it easy to use complicated objects and allocations when you need them for performance.
Wow, that's some horrible code. No std::vector::reserve, string by value, memset, "using namespace std", missing includes, c-header includes in c style, ...
Rarely see this kind of obscenity outside of beginner's university C++ classes.
For anyone who isn't well versed in C, int is most of the time 32 bits but it's architecture dependent with some machines being e.g. 16 bit. If you need the size to be exactly 32 bits then you can just use int32_t or similar and let the compiler make that happen regardless of the architecture it's compiling to. The original code should have specified 200 * sizeof(int) or better yet since it's a local variable just ditch the memset and change it to int hasz[200] = {}; which will do the same.
People keep saying C is fast. It’s not. You can’t write a reusable fast sort in C. C++’s std::sort will trounce qsort any day of the week because qsort uses function pointers that can’t be inlined. Without inlined generic algorithms, how can people claim C is performant?
Memory allocation is expensive. C++ pmr allocators provide powerful tools to manage memory.
It is not the case that Rust will tend to end up slower than javascript or python if you don’t know what you are doing. Yes that happened in this particular example but it isn’t a given or even normal
I think there are a few different definitions of "faster" that can all be useful in different situations:
1. faster when written and optimized by an expert, who writes the best blog on how to optimize X
2. faster when written and optimized by a competent-but-not-necessarily-expert programmer, who reads the best blog on how to optimize X
3. faster when written without speed in mind
It seems like C, C++, and Rust are all more or less tied at the head of the pack for #1 and #2. There's obviously a lot of room for debate here, but it seems like many other languages (Go? JavaScript running on V8?) can get pretty close for #1 but struggle for #2.
For #3, it seems like C is the king, because of the very low-level style that it forces on the programmer. Below that things get murky. I'm inclined to believe Rust does well here, because it doesn't usually allocate without being asked to, but I assume the same logic applies to other languages I'm less familiar with (D, Zig). How well C++ does probably depends on how you interpret the question. (Like does the programmer make even the slightest effort to avoid implicitly copying big data structures, to take things by reference, etc?)
I learned this back in uni, trying to optimize things with a friend who was an assembly language whiz. He could take one of my C(++) algorithms and hand code it to be 10x faster than I ever could... but sometimes I could come up with a new algorithm that was 100x faster. There's a reason we say algorithm optimisation >> cycle counting.
What I found is that the more you help the C++ compiler understand what you want the more optimized the code gets. Using std::fill_n instead of memset alone can cause good improvements, depending on the compiler.
PD: Just check the Rust reddit. Tons of case of naive newcomers with massive improvements on speed/performance just porting to Rust.
---
This kind of history is similar to "A virus happened on OS X/Linux!" Then say "No OS is more secure than others!" is a massive flaw. Some are more secure. Only this time is news precisely because is RARE!
If you have to work with C++, I recommend using -fsanitize=undefined so you get runtime-errors when your program runs into undefined behavior. Very handy!
The main flaw in this benchmark is that it's too little work, so the javascript numbers are very close to node's startup time alone.
Running 100 iterations gives you a much better idea of actual performance (JS is over 10x slower) but then you need to think about what compiler optimizations are being applied.
The HotSpot JVM, V8, LuaJIT and probably a few other such projects have become wickedly fast over the years, and using native languages because they're "faster" is folly.
What Rust, C, C++ and the like actually give you is control. It is up to you, as a developer, to leverage that control to get better performance. Control also gives you nearly limitless rope to hang yourself with.
Is this the developer version of motivated reasoning? Throw up two solutions, then spend hours optimising the one you assume should be faster and don't revisit the other one...
One issue I see is that the Rust code seems to use String objects for
single characters instead of e.g. a chars iterator or a single
String that you then slice up.
With a couple changes you could probably eliminate all
allocations after the initial one.
People seem to love coming into the Rust community and then saying - hey, I did this micro benchmark after just learning rust - and XYZ language is faster!
Anyone who's been writing rust for a little bit will notice the String clone like a sore thumb. The "go-to-perf-guide" also very explicitly states than clones on String/Vec/Box etc will cause an allocation and slow down your code.
In every one of these micro benchmarks, the comments will then show that by removing the alloc, the Rust version ends up much much faster.
String cloning is one of those things you learn to avoid early on, and any time you type "clone" it should be obvious that there's a better way to write your algorithm.
I think the following minor optimizations can further decrease the runtime:
* There is no use in first pop()ing a candidate from the "reacts" list and then potentially re-pushing them later. Getting a pointer to the last element, and removing it if it is not needed, is more efficient.
* Reordering the character equality condition could help a tiny bit since the inequality check should be faster than the ignore case check.
Good point. Here it is premature optimization anyway and too dependent on the input. In general, I think a balance between likelihood and cost should be found.
Out of curiosity, I just did a small benchmark (criterion) using the input.txt provided, and reordering results in a 10% difference:
Old Order time:[307.28 us 309.12 us 311.17 us]
Reversed time:[270.82 us 271.18 us 271.54 us]
This feels weird since the input.txt contains few pairs of equal characters, so there is likely something else going on. The inequality is very inexpensive but the ignore case compare should not be that expensive either.
Weirdly enough, PGO equalizes and increases the runtime, where both methods take approximately 330 us. (I simply applied PGO to the full benchmark harness, no idea if this is proper). Them being equal sounds more reasonable, but the increase in runtime doesn't feel right.
This really doesn't tell you much at all, because:
- Only one program is tested, exercising a very small portion of each language (and its built-in APIs).
- The runtime is too small; less than a second in both cases. I'd really need to see examples where the benchmark has been run in a loop multiple times, preferably with a warmup run beforehand, and an execution time measured in tens of seconds.
At least the author made the effort to figure out the reasons behind the differences (based on the hypothesis that Rust would be faster), so credit is due there.
This might be a silly observation, but I do notice in the perf section the author is using "cargo build" without the release flag. This will definitely slow Rust down considerably, although I'm sure the author is aware of this and ran the final benchmarks with Rust in release. Would just be good to note in the post.
I remember when I first discovered this. I was implementing a raytracer and test runs were taking forever. Then I read about enabling compiler optimizations, and with the flip of a switch my performance increased 16x...
While I totally agree that Rust is a great language the point I got from this article is more about how fast NodeJS has become over the years, the work done on V8 is astounding!
The performance of JavaScript VMs is without a doubt in my mind one of our most incredible technological achievements. The JVM may have more hours of work spent (largely due to academic attention) but it seems that JS and Java have roughly the same performance which is outstanding when you consider how much more dynamic JS is.
Just as another data point, trufflejs (that is GraalVM’s implementation of JavaScript on top of a JVM written in Java running on top a traditional JVM) has comparable peak performance to V8.
After reading this blog post, I had a question: why doesn't it take less than millisecond in Rust code?
And the answer is that the author allocates a string for each character, and does more allocation by calling to_uppercase/to_lowercase.
There's a hundredfold improvement in speed by using char or u8 to iterate over a string.
I can't help but feel the way people are supposed to optimize code written in languages like Rust doesn't scale too well. It relies heavily on exploratory work and the inner workings of the compiler and target platform. It can grow very slow very quickly with a large codebase.
This in itself is not the problem. The problem is that you shouldn't be doing this at code level, but at compiler level. The optimizer is supposed to do these manipulations. At code level, you should be able to describe your problem in lots of details, use identities to simplify your logic, but leave the optimizations to the compiler. From the little I read about rust, I believe its type system is powerful enough to enable the compiler to do this. I believe it is a matter of time until the compiler matures to the point of doing the kind of optimization it is required to do.
It's possible to design a standard library which makes it obvious to the user when a function allocates memory under the hood, for instance by taking an allocator argument. That way it becames very clear what "simple" functions are potentially expensive, and writing potentially slow code should be much less convenient than writing potentially fast code so that users are pushed to writing fast code, same as Rust pushes them to write memory-safe code.
In Rust to_lowercase returns a String, which is an owned struct with heap allocated data, so it is reasonably clear that it allocates. In contrast, trim_start for example returns a reference to a string slice with the same lifetime as the source string, so again it is reasonably obvious from the type signature what's going on.
The mistake here really is trying to do case insensitive string comparison by converting both arguments to the same case, then comparing them with case-sensitivity. It is much better to compare them directly, but case-insensitively.
Zig has a very simple and straightforward solution: Every standard library function which needs to allocate takes an allocator argument. That way it's immediately obvious what functions are potentially expensive. It's not even a magic compiler or language feature, just a standard library convention.
That is beside the point. I could produce something in Haskell, but the "magic" would all done in the compiler, so in the context of my original comment I see no reason why I should.
> At code level, you should be able to describe your problem in lots of details, use identities to simplify your logic, but leave the optimizations to the compiler.
1. As far as I'm aware, Haskell isn't able to do any of that at the code level. I'm interested in what a language that is able to do that would look like.
2. I'm curious as to what you consider an "optimization" to be left to the compiler, and what you consider part of the "problem". For example, does the compiler decide which implementation of a set to use?
3. I'm also wondering what you mean by "proving correctness." If you're alluding to formal verification, then that is notoriously hard to scale, which conflicts with the stated goal of making optimization easier to scale.
Some pseudocode would shed light on all these questions.
I did some of the Advent of Code 2020 challenges in both js and Rust. For one of them, the js solution took minutes to finish, while the same algorithm in rust finished in <200 ms. I could probably have optimized the js solution to remove a bunch of allocations- it must have been the gc-but it was really surprising with the speed-up.
Apart from the UNICODE to_lowercase(), what's worrying is the high memory allocation cost, even though it's not very obvious in the source code where those allocations are exactly happening (even if you "know" that String and Vector are allocating under the hood it's not obvious when and where exactly memory is allocated).
This is basically the same problem as in C++: convenient and "idiomatic" standard library features may end up doing tons of expensive hidden allocations. Isn't this a pretty big problem for a "systems programming language"?
Anyway as soon as I saw the problem I thought "this is either completely trivial or insanely difficult depending on whether you want proper unicode handling or not".
I haven't really thought about it but I doubt his solution is fully correct according to unicode. E.g. What happens with `aßSSb`? Who knows what should even happen in that case.
As for the malloc performance, MacOS has a significantly slower system memory allocator than Linux does. If you use jemalloc (https://crates.io/crates/jemallocator) instead I think you will see better performance than Javascript.
Replacing the function call with this (and flipping the checks around) will probably improve performance as well.
I'd also be interesting in the performance improvements from replacing those to_lowercase() calls with a lookup from a static lookup array. Something like:
if ( explosion_map[token1] == token2 )
There would need to be duplicates, explosion_map['a'] = 'A' and explosion_map['A'] = 'a'. But that's a small cost to pay for eliminating three function calls and two comparisons. As an added bonus, branch prediction will work really well on this strategy (since the majority of comparisons will have the same result). And doing that is a more of an actual language performance comparison, instead of a the standard library performance comparison this is.
It should be possible to have perf launch the program itself, avoiding the race condition or pgrep ambiguity, and having to reconfigure the kernel to allow debugging non-child processes:
This looks kind of a another flawed benchmark, because of the use of time, which includes the startup cost for the process. Rust is compiled to native code with an AOT compiler, which should have significantly lower overhead compared to the NodeJS Jit.
In a longer live context(more data crunching) Rust will win by even a greater margin, as performance improvements will compound. I know from benchmarking various Rust vs Go vs Node.js scenarios, and as the volume of data increases Node.js's performance gets worse.
Popping from the source is likely a big source of allocator / copy overhead. It’s probably better to just iterate over an input array.
The algorithm in the article would be quadratic in perl 5; not sure about rust or js, but the point still stands: It’s likely to see asymptotic performance differences in different languages.
(Edit: it’s not quadratic in perl. If it used shift to pull from the front of the array it would be.)
Popping doesn't touch the allocator, as it doesn't resize the backing storage. The only thing being copied is the String struct (24 bytes), which while unnecessarily large here, wouldn't be the bottleneck.
Possibly due to the lack of LTO, Link-Time Optimization. Rust standard library and user program get separately compiled and needs a special procedure to inline them out. I couldn't find Cargo.toml used in this benchmark but it ought to contain `lto = true` somewhere in the profile section to get the optimal performance especially for this kind of tight loops. (It makes compile slower, unfortunately, and no profile currently turns that on by default.)
So let ne be naive, is performance important? Would a user matter?
In my opinion what matters is how much energy is consumed when executed on a smartphone or IoT device. Because there the main question is, when do I have to recharge the next time. For IoT applications this is highly critical. In this regard fast is not automatically energy efficient, if you consider e.g. big.LITTLE cpu architectures...
If writing this in JS took an hour to deliver acceptable performance, and writing it in Rust took 6 hours of reading documentation and googling to produce a similar or faster by a few milliseconds result, then JS is the clear winner here.
More like: If you tune your code Rust (and any native language) is faster, with naive code however there is a good chance that JavaScript can be faster.
'After spending some weeks playing with Rust,'... he is new to Rust and keeps using String, accepting `&String` as args etc this is not tuning. This would be caught by linters like clippy telling you to double check it. I can show you naive Rust vs Naive Node.js and like his follow up article x10 performance over Node.js is not abnormal.
Yep. 8x is the right ballpark I usually see comparing decent C/Rust to javascript.
I saw a 10x slowdown porting the Chipmunk2d physics engine to javascript. (Which dropped to 5x or something after a few weeks of JS optimizations). Porting operational transform heuristics from JS to C got me a ~40x speedup. The code in question used heterogeneous data types which V8 hates (causing both heap thrashing and code deopt). In C I handed it with a tagged union and the code needed 0 allocations in the hot path.
Generally the rule is, if your code has complex data types, heterogeneous data types or a lot objects being created and destroyed it be much slower in javascript compared to C/Rust. A lot of that benefit goes away if the native code does sloppy allocations everywhere.
And on the other hand, if your code is Math heavy, operating on simple data types, V8 will do great work.
Sure but there’s an upper bound on how fast you can make javascript code run. The same amount of optimization work (if you know what you’re doing) will make a far greater difference to the performance of rust code than it will to the performance of javascript.
Carefully written javascript is still pretty slow because you can’t choose to put objects on the stack in JS, or allocate nested objects together in contiguous memory. And that makes a massive difference because you end up thrashing main memory when you could be spending those cycles getting work done.
I find it unsettling how one-sided the article is, if it's a "showdown" I expect the same effort to be put in JS side of things to maintain some pretense of fairness.
> that makes a massive difference because you end up thrashing main memory when you could be spending those cycles getting work done.
That is true, however there is a huge domain of problems where that isn't much of a concern, while development time spent on it is a concern.
Then sacrificing a bit of performance over development speed can be quite worthwhile.
Of course not a high frequency trading platform and not an AAA+ Game, but a small Webservice just taking data from a ("slow") database, doing little transformation, and sending over network .... less of a problem
The author has a follow up article[1] on their blog about it. They rewrite the rust code to eliminate the extra allocations they identified at the end of this post. It then performs 8 times faster than the javascript version.
However, I think they could have simply iterated on the code from the first post:
There are a lot of allocations because two strings are created every time they are pushed to the queue (.to_string()):
This is unnecessary, because candidate and token are both a single character.Rust has a char type that is useful to store single characters. Especially, unlike String which is heap-allocated, char is a primitive value that resides on the stack and CPU registers. Since we are comparing single characters, it's much better to use char.
So if the queue is defined as a Vec<char> instead of a Vec<String>, and the comparison function took as argument char rather than &String, it will spare a lot of allocations and this will speed up the final application.
This has also the advantage of removing the references, making the code cleaner. Using Vec::with_capacity(n) instead of Vec::new() will also help lowering the allocation load by preemptively allocating a large enough memory slice. With Vec::new(), the vector will reallocate a few times.[1] https://cesarvr.io/post/rust-vs-javascript/