Announcing Rust 1.27 (rust-lang.org)
364 points by steveklabnik 4 months ago | 79 comments

SIMD on stable is very exciting news. SIMD unlocks the power of the GPU-esque parallelism that is already inside your CPU. While compilers do try very hard to take advantage of this automatically, it’s not always predictable and can make performance fragile. This is what people are talking about when they refer to “low level control”. Don’t expect that with Rust 1.27 you’ll need to understand SIMD to get anything done. Do expect various libraries to just get faster on stable. For example, BurntSushi, the author of the regex crate and the ripgrep tool, has already enabled it if you’re on Rust 1.27 Stable. [1]

SIMD is another notch towards never needing to use nightly Rust. What this really looks like in practice is that less frequently will a user encounter a fast-mode and a slow-mode for a crate depending on whether they’re using nightly, with all the unstable features enabled, or not.

[1] https://github.com/rust-lang/regex/pull/490

How useful is SIMD on CPU these days given that most of the touted original applications (back in the MMX SSE days) have been moved over to GPUs?

Real-time audio processing is bound to CPUs forever. It's the opposite of the data that works well with GPU computations: very small chunks of data (generally between 64 and 512 floats or doubles) that you have to process sequentially. For those, SIMD is particularly useful. I got a ~7x increase in throughput recently when porting an effect from naive processing to AVX2 intrinsics, which means that artists can then run many more instances of the effect in parallel.

Stuff like compression algorithms, some codecs (not all use GPU), some high-performance parsers, some encryption stuff (e.g. openssl), some databases (often column stores, also redis I think), language VMs etc use SIMD.

More generally, SIMD is useful when you are repeatedly performing the same instruction on a long stream of data but you don't want to send it over to the GPU, either because you don't want to incur the many-orders-of-magnitude slowdown of sending stuff back and forth to another chip, because you can't rely on a GPU being there (e.g. embedded), or because it's just overkill.

> [when] you can't rely on a GPU being there (e.g. embedded)

Also, most servers have only very basic GPUs (if at all). Unless you're on a dedicated GPU server for machine learning etc., you're going to work with the CPU only.

Very useful. SIMD has a much lower barrier to use (doesn't need graphics drivers, GPGPU frameworks etc., almost universally available and fallbacks are easily implemented) and is much easier to target (same language, same toolchain, same memory model).
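To make the "fallbacks are easily implemented" point concrete, here is a minimal sketch using the `std::arch` intrinsics and the `is_x86_feature_detected!` macro that Rust 1.27 stabilized. The function names (`sum`, `sum_scalar`, `sum_avx`) are made up for illustration, and the AVX path assumes an x86_64 target; every other target takes the plain loop.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Sum a slice of f32, using AVX when the running CPU supports it.
/// (Hypothetical helper, not taken from any crate.)
pub fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Sound because AVX support was just verified at runtime.
            return unsafe { sum_avx(xs) };
        }
    }
    // Fallback: same language, same toolchain, no special hardware.
    sum_scalar(xs)
}

fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn sum_avx(xs: &[f32]) -> f32 {
    unsafe {
        // Eight f32 lanes accumulated in parallel.
        let mut acc = _mm256_setzero_ps();
        let chunks = xs.chunks_exact(8);
        let tail = chunks.remainder();
        for chunk in chunks {
            // Unaligned load of 8 floats, then a lane-wise add.
            let v = _mm256_loadu_ps(chunk.as_ptr());
            acc = _mm256_add_ps(acc, v);
        }
        // Spill the lanes and finish with the leftover elements.
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        lanes.iter().sum::<f32>() + tail.iter().sum::<f32>()
    }
}
```

Callers only ever see the safe `sum` function; the `unsafe` and the feature check stay inside it. Note that the SIMD path adds the elements in a different order than the scalar loop, so results can differ slightly for values that are not exactly representable.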

Also notice that the execution model of GPUs and CPUs is quite different. You need a far larger "breadth" of execution to efficiently use a GPU, compared to a CPU.

A GPU is much wider than a single core, but only slightly wider than a server CPU. For example, a 28-core Xeon has dual-issue FMA with 6-cycle latency and 16-wide packed SP registers, thus reaches peak floating point performance with 5376 independent operations in-flight at any instant. It's only about 4x higher for V100, which has higher TDP.
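The 5376 figure is just the product of the four numbers quoted above. A tiny sketch of the arithmetic, with constant names invented here for readability:

```rust
// Figures taken from the comment above; the names are illustrative only.
const CORES: u32 = 28;
const FMA_UNITS_PER_CORE: u32 = 2; // "dual-issue FMA"
const FMA_LATENCY_CYCLES: u32 = 6;
const SP_LANES: u32 = 16; // 16-wide packed single precision

/// Independent operations that must be in flight to keep every FMA unit busy.
pub fn in_flight_ops() -> u32 {
    CORES * FMA_UNITS_PER_CORE * FMA_LATENCY_CYCLES * SP_LANES
}
```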

Heh. You can barely move your mouse across the screen without some layer in the architecture executing some SIMD instruction. It's everywhere, and it's going nowhere.

Everything from counting the length of a string to rasterizing the text you're reading right now.

I wish font rasterization used SIMD more, but it frequently doesn't. The final blitting step usually does, but important things like the Gaussian blur used for subpixel AA color defringing are still not accelerated on Mac, for instance :(

Good news, macOS isn't doing subpixel AA any more as of macOS 10.14! ;)

But seriously, if SIMD would be useful there, why doesn't it use it?

Well, I'm not Apple, so I can only speculate. I wouldn't be surprised if core low-level font rasterization isn't that well maintained, though. Often that stuff was written in the '90s and early aughts and so hasn't been touched.

I thought macs gave up on subpixel aa after the advent of retina displays.

There's a big cost of moving stuff in and out of the gpu. Unless it's a big/heavy workload, the CPU will be faster because of this.

Well, it seems pretty useful - e.g. it helps make ripgrep fast, it can speed up regex. There's a lot it can make faster. How much faster, I'm not sure.

Still really useful. We use NEON on Android all the time since there's not a great API there and spinning up a GPU on a mobile device is not something you want to do unless you absolutely have to.

> We use NEON on Android all the time since there's not a great API there and spinning up a GPU on a mobile device is not something you want to do unless you absolutely have to.

True regarding Android's APIs being awful, but taken literally, your statement implies that even Core Animation on-GPU compositing from 2007 is bad, which I'm sure you didn't mean. :)

Yeah, this is all within the context of compute. Of course GPUs are going to be great at rasterizing (although they're rarely used for blitting/compositing different layers, as there's discrete hardware that's even better for that).

I don't want to worry you, but on a mobile device with a screen the GPU is always spinning. That is where the pixels come from after all.

I get your point that there are energy efficiency tradeoffs to consider. However, for massively data-parallel tasks like image operations, 3D graphics, machine learning, vision, etc., I haven't seen a SIMD implementation come close to GPU compute in terms of efficiency. Perhaps short-lived text processing and file compression, where data access is more random, could benefit from SIMD.

Nope, I'm well aware. I've built and worked on two Android-based 3D graphics stacks at two different companies, and I've worked directly with just about every mobile GPU vendor aside from PowerVR (which is pretty similar to all the other tiled GPUs out there).

Your GPU is actually running much less than you'd expect. Unless pixels are changing, there's a 95% chance that everything on the system is sleeping and just the display controller is being kept spun up. Battery requirements mean that DVFS on any mobile chipset is going to be really aggressive.

Either way, GPU compute has a pretty large overhead in terms of both latency and scheduling. Earlier GPUs didn't schedule nicely (hello, waiting 16.6ms for your next compute request), and generally the places where you use SIMD (3D transforms, audio processing, etc.) are so tightly coupled with other CPU operations that even the act of moving data into the SIMD registers is something you need to consider before diving in. A lot of the time, waiting for some work queue to complete (or adding pipeline latency by waiting for the next frame) just isn't feasible.

Pixels on screen come from the display controller, which on mobile is often a distinct piece of IP from the GPU, unlike desktop systems. Scenarios like watching a video full screen can often happen without GPU involvement at all.

The screen of a mobile device is off just as much (or more) as it is on. More and more processing is happening in the background, when user is not actively doing something.

Have you tried to use Renderscript as well?

No, because RS abstracts away how the computation is queued and what latency is involved, and it has some pretty non-trivial overhead in the framework itself.

Like I mentioned above, earlier GPUs are pretty poor at fine grained scheduling + latency and NEON has pretty ubiquitous support on the target devices we were shipping.

I recently ported some code to use AVX intrinsics. The performance gain is somewhere from 30x to 3000x depending on the size of the array. The larger the array the more performance gain I get. It’s linear algebra but not related to matrix multiplication that I could mention.

Also, if your code has a lot of branching (most of my work wouldn’t benefit from offloading to GPU), or if the data being processed in parallel at a time is too small to make up for memory transfer, it can be the right approach and provide a huge performance boost.

A friend of mine is playing with neural networks and training them to play reversi. He's working with a lot of matrices so he tried AVX extensions and CUDA. AVX runs circles around CUDA, probably because of the setup time of moving things back and forth to the GPU. Also, CUDA can be a big pain in the ass to get working.

In mathematics (and thus HPC/scientific computing/image processing/data science/etc.) it is very common to multiply different matrices together, which SIMD speeds up hugely.

GPUs are sadly not generally usable yet.

Correct my possible misunderstanding but isn't this fabled "power" at best x4 performance gain and not really in the same league as GPU-esque?

It depends on your operations. For example, with small buffers that fit in a cache line, the cost of transfers to GPU memory can push the GPU's speedup below 4x, or even below 1x, whereas SIMD still provides a benefit.

The new syntax for dynamically-dispatched traits, “dyn Trait”, is an example of Rust’s persistently excellent consideration of what should be explicit and what should be implicit. Python’s mantra of “explicit is better than implicit” mostly captures my general feeling, but you can’t make everything explicit. Back before impl Trait existed, just Box<Trait> seemed very clear. Now that there’s an important thing to differentiate it from, it seems better to have dyn Trait and impl Trait both prefixed with keywords.

It also makes for much easier googling. ;)

Python is my go-to language, but let's not forget it disrespects its own mantra very often. Like a certain pirate would say, "it's more what you call a guideline than actual rules".

Duck typing is implicit interface.

Foo().bar() is an implicit Foo.bar(Foo()), as self is only explicit in the declaration, not in the call.

__init__ is called and passed parameters implicitly.

__new__ is an implicit class method.

Sometimes the magic is nice to have. Sometimes it's like yield.

Yield implicitly turns your function into a very different object, which makes everybody go wtf the first few times.

They fixed that with async / await, but I still wish we had a "gen def" to allow yield in bodies, like we have "async def".

> Rust’s trait object syntax is one that we ultimately regret.

Would someone please explain the problem (and the solution) for someone who doesn't know Rust yet?

The linked thread is good, but the short answer:

    Box<Foo>
If Foo is a struct, this is a single pointer to the heap. If Foo is a trait, this is a double pointer: a pointer to the data on the heap, and a pointer to a vtable for its methods. These two things are very different, but look similar. That's the mistake.

This change separates the two. Now:

    Box<dyn Foo>
is always the latter, a "trait object". We have to keep the old syntax working for stability reasons, but can lint against it, so that it nudges you into the better syntax. So in this world,

  // always a single pointer to a struct (or enum)
  Box<Foo>

  // always a double pointer
  Box<dyn Foo>
This is more clear for humans. The compiler can tell with 100% fidelity, of course. But humans aren't compilers. Different things with different costs look different.

As an addendum, the parent mentions another feature, "impl Trait", which can be used with traits. Their point was, originally, Box<Trait> was the only thing that existed like this. With "impl Trait", it was "impl Trait" vs "Trait", and is now "impl Trait" vs "dyn Trait", which is much more clear, at least in many people's opinions. :)

>We have to keep the old syntax working for stability reasons

Will Rust 2018 enable you to remove the old syntax? Or is that a bigger change than the opt-in is allowed to cover?

Rust's built-in lints are configurable with compiler directives, so you'll be able to add

    #![warn(bare_trait_objects)]
to the top of your crate and cause a compiler warning if someone uses the old syntax on your project.

EDIT: just realized I mis-parsed "you" in parent. Oh well.

This kind of change is the kind an edition can make, though I’m not 100% sure if the plan for 2018 is to warn or error on the older syntax.

I feel like a warning for one edition and error on the next would make it straightforward as to how much time is left to update the code.

That’s generally true, the question is, do you warn in 2015 or 2018?

That being said, this particular fix is 100% automatable, and rustfix can already do it, so the time to update is “a few seconds”.

IIRC, it will take two editions before removing something: at the first edition the compiler will lint against it (which is already quite a big change: you get tons of new warnings you need to fix) and one edition later, it's an error.

Edit: I didn't see the answer from steveklabnik below. Apparently, it's not clearly decided yet.

Yes, the next edition will allow breaking changes, while still allowing you to use crates written in the older edition.

Some forms of breaking changes. This one is of that kind, though we cannot make arbitrary ones.

A trait is like an interface. A struct implementing a trait means it takes on the methods of that interface (but there's more to it, because a trait can have default implementations of the trait methods).
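A small sketch of what "default implementations" means here. The trait and type names (`Greet`, `English`, `Pirate`) are invented for the example.

```rust
trait Greet {
    // Required: every implementor must provide this.
    fn name(&self) -> String;

    // Default implementation: implementors get this for free
    // unless they choose to override it.
    fn greet(&self) -> String {
        format!("Hello, {}!", self.name())
    }
}

struct English;
struct Pirate;

impl Greet for English {
    fn name(&self) -> String {
        "world".to_string()
    }
    // `greet` is inherited from the trait's default.
}

impl Greet for Pirate {
    fn name(&self) -> String {
        "matey".to_string()
    }
    // Overrides the default.
    fn greet(&self) -> String {
        format!("Ahoy, {}!", self.name())
    }
}
```

Calling `English.greet()` uses the trait's default body, while `Pirate.greet()` uses its own override.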

If you want to express something like "this is a variable that holds something that fulfils this trait", without knowing the _actual_ type it is, that variable effectively has an unknown runtime size. std::io::Read is an interface for reading bytes of some source, like a file or a socket.

This matters because we're talking about a stack frame. So the size needs to be known at compile time.

    let a: u64 = 42; // ok, because well known size.

    let b: Read = ...; // illegal, because unknown size.
A "trait object" places the object on the heap and has a pointer in its place.

    let b: Box<Read> = ... // legal, because pointer is a known size
However it's a bit more complicated, because this syntax allows for dynamic dispatch at runtime using a vtable. So there's a quite big difference between Box<u64> (a 64 bit unsigned integer on the heap) vs Box<Read> (a runtime dispatched lookup via a vtable).

This difference is not obvious at a glance though. Hence the new syntax: Box<dyn Read>.
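As a rough illustration of the `Box<dyn Read>` case, the sketch below picks between two different concrete reader types at runtime and hides both behind the same boxed trait object. The helper names `open_source` and `read_all` are hypothetical.

```rust
use std::io::{self, Cursor, Read};

/// Returns one of two unrelated concrete types behind the same trait object.
/// Which one is only known at runtime, so the caller cannot know its size.
pub fn open_source(use_cursor: bool) -> Box<dyn Read> {
    if use_cursor {
        // An in-memory buffer: Cursor<Vec<u8>>
        Box::new(Cursor::new(vec![b'h', b'i']))
    } else {
        // A reader that is always at EOF: io::Empty
        Box::new(io::empty())
    }
}

/// Works with any boxed reader; `read_to_string` is dispatched through the vtable.
pub fn read_all(mut src: Box<dyn Read>) -> io::Result<String> {
    let mut out = String::new();
    src.read_to_string(&mut out)?;
    Ok(out)
}
```

The stack slot holding the `Box<dyn Read>` has a fixed size no matter which branch ran, which is what makes the `let` binding legal.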

(I think I got that right)

You did, except that it’s not one pointer, it’s two. This is one way that rust is different than C++; the vtable isn’t stored with the data, but separately, with the “trait object” itself being two pointers to the two things.
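The one-pointer versus two-pointer difference can be observed with `std::mem::size_of`. This is a sketch with a made-up `Shape` trait and `Square` struct; the sizes shown are what current rustc produces, not something this example should be read as a formal layout guarantee for.

```rust
use std::mem::size_of;

trait Shape {
    fn area(&self) -> f64;
}

struct Square(f64);

impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

/// Box of a concrete struct: a single (thin) pointer to the heap data.
pub fn thin_box_size() -> usize {
    size_of::<Box<Square>>()
}

/// Box of a trait object: data pointer plus vtable pointer (a fat pointer).
pub fn fat_box_size() -> usize {
    size_of::<Box<dyn Shape>>()
}

/// The method call below goes through the vtable half of the fat pointer.
pub fn boxed_area() -> f64 {
    let s: Box<dyn Shape> = Box::new(Square(3.0));
    s.area()
}
```

On a 64-bit target this gives 8 bytes for `Box<Square>` and 16 bytes for `Box<dyn Shape>`.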

>This is one way that rust is different than C++; the vtable isn’t stored with the data

Which “data” is the vtable stored with in C++? An object contains only a pointer to its vtable, not the vtable itself...

A C++ object with virtual methods and some members looks like this, if you put it in terms of C:

  struct ClassVTable {
   void (*firstMethod)();
   void (*secondMethod)();
  };

  struct Class {
   struct ClassVTable *virt;
   int firstMember;
   int secondMember;
  };
and a pointer to it looks like:

  struct Class *objectRef;
A language that uses fat pointers, like rust, has this separated completely:

  struct Class {
   int firstMember;
   int secondMember;
  };

  struct FatPointerToClass {
   struct ClassVTable *virt;
   struct Class *data;
  };

  struct FatPointerToClass objectRef; // note lack of pointer: it would be stored directly on the stack, for example
This means a much simpler object layout in exchange for passing around larger pointers, and it's a good fit for trait-based typing since it doesn't require you to know all possible interface subtypes of the object in order to describe or use its layout.

(please note that I'm using C struct to be precise about the concept, and this should not be taken as a perfectly verbatim description of the actual object layouts in memory)

Oh, I didn't know that: this is also great for serialization code where you might have an array of `Class`, since it would still be a POD type, to use C++ terminology.

My understanding is that, generally (because, again in my understanding, this is all compiler-specific, not mandated by the standard), in C++ there's a single pointer that points to the combo of the vtable and then the data after it. See http://www.drdobbs.com/cpp/storage-layout-of-polymorphic-obj... for example. While a Shape is a position, outline, and fill, if you have virtual functions, the pointer points to a vtable pointer, and then all of the data.

In other words, in Rust, a pointer to a shape is

  (pointer to vtable, pointer to data)
whereas, in C++, a pointer to a shape is

   (pointer to shape)
where shape is

   (pointer to vtable, data)
If that's incorrect, I'm quite happy to be corrected! This isn't an area I'm an expert in.

That’s absolutely correct.

I get it now, that you meant the “pointer to the vtable” under “vtable” in your original post.

Ah, I see. Yes, I should probably have been more clear there. Thanks for pointing that out.

Good to know!

The linked post says that it has proven to be a bad design decision specifically in the context of "impl Read" etc. Can you give an example of that, and why it was confusing before?

Reading up on static vs dynamic dispatch may help - https://en.wikipedia.org/wiki/Dynamic_dispatch.

Essentially Box<Foo> is static dispatch. You know at compile time which exact methods you are going to call. Box<dyn Foo> is dynamic dispatch. Since Foo is a trait, at compile time you won't know which methods are called; it depends on the type of the object passed in (as long as it implements the Foo trait).

Your example is misleading. If Foo is a trait, Box<Foo> and Box<dyn Foo> are the same thing. impl Foo is for static dispatch.

Illustrating the need to dump the old unqualified syntax.

Ah ok, I'm back to being confused again then!

Is this right:

If Foo is a struct then Box<Foo> is static dispatch. If Foo is a trait Box<dyn Foo> is dynamic dispatch, but then so is Box<Foo> - but that is the regrettable point. So from now on, if Foo is a trait we should be writing Box<dyn Foo>, which is largely syntactic sugar to make the situation more clear?

I don't fully understand what the release notes are saying about using impl Trait. When would you use Box<impl Foo> vs Box<dyn Foo>?

You are right.

It’s not that you would use Box<impl Foo>, but when talking about the two features, it’s easier to compare them, as they’re actually distinct syntactically.
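For a side-by-side feel of the two keywords, here is a minimal sketch using `std::fmt::Display` as the trait. The function names are invented. With `impl Trait` the compiler knows the single concrete type behind the return value (static dispatch); with `dyn Trait` the concrete type can differ per call and is resolved through a vtable (dynamic dispatch).

```rust
use std::fmt::Display;

/// Static dispatch: exactly one concrete type (here i32) hides behind
/// `impl Display`, and the compiler knows which one.
pub fn static_label() -> impl Display {
    42
}

/// Dynamic dispatch: the two branches return different concrete types,
/// so they need a trait object and a heap allocation.
pub fn dynamic_label(as_number: bool) -> Box<dyn Display> {
    if as_number {
        Box::new(42)
    } else {
        Box::new("forty-two")
    }
}
```

Trying to return both `42` and `"forty-two"` from the `impl Display` version would not compile, because `impl Trait` in return position must resolve to one type.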


Yes!! I have been waiting for the new time helpers to arrive in stable. It may be a small thing but is so much nicer and makes Rust feel more like a higher level language when I can do `some_time.subsec_millis()` rather than `some_time.subsec_nanos() / 1_000_000`.
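A quick sketch of the before and after, using `Duration::subsec_millis`, which became stable in 1.27. The wrapper function names are made up for the comparison.

```rust
use std::time::Duration;

/// Pre-1.27 style: derive milliseconds from the nanosecond part by hand.
pub fn millis_old(d: Duration) -> u32 {
    d.subsec_nanos() / 1_000_000
}

/// 1.27 style: ask for the millisecond part directly.
pub fn millis_new(d: Duration) -> u32 {
    d.subsec_millis()
}
```

For `Duration::new(5, 123_456_789)` both return 123; the companion `subsec_micros()` returns 123_456 for the same value.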

Heh, was just bemoaning this two days ago. Glad I can use them now.

I don't understand your comment. It feels like a higher level language because you can use a different unit for measuring time?

Having written enough boilerplate to do millisecond timestamps, or pulled in the fairly large chrono library just for a nicer interface to a few time operations, I find these additions quite luxurious.

SIMD! For anyone curious about the performance impact of this feature and a real-world implementation, check out the PR adding support to the regex library: https://github.com/rust-lang/regex/pull/456

And just because this kind of thing is fun, if you use the right kind of pattern on a big enough file, SIMD can be quite noticeable:

    $ rg-with-simd --version
    ripgrep 0.8.1 (rev 223d7d9846)
    +SIMD +AVX
    $ rg-without-simd --version
    ripgrep 0.8.1
    -SIMD -AVX
    $ time cat OpenSubtitles2016.raw.en > /dev/null
    real    0m1.280s
    user    0m0.020s
    sys     0m1.257s
    $ time wc -l OpenSubtitles2016.raw.en
    336602465 OpenSubtitles2016.raw.en
    real    0m4.303s
    user    0m3.132s
    sys     0m1.167s
    $ time rg-with-simd -c 'Sherlock Holmes|John Watson|Professor Moriarty' OpenSubtitles2016.raw.en
    real    0m2.099s
    user    0m1.750s
    sys     0m0.347s
    $ time rg-without-simd -c 'Sherlock Holmes|John Watson|Professor Moriarty' OpenSubtitles2016.raw.en
    real    0m4.128s
    user    0m3.781s
    sys     0m0.343s
    $ time rg-with-simd -c 'Sherlock Holmes|John Watson|Irene Adler|Inspector Lestrade|Professor Moriarty' OpenSubtitles2016.raw.en
    real    0m1.989s
    user    0m1.621s
    sys     0m0.366s
    $ time rg-without-simd -c 'Sherlock Holmes|John Watson|Irene Adler|Inspector Lestrade|Professor Moriarty' OpenSubtitles2016.raw.en
    real    0m18.417s
    user    0m18.000s
    sys     0m0.403s

Looks like `cat` is still faster, so there's some room for improvement. ;-) With a single pattern, we're almost there:

    $ time rg -c 'Sherlock Holmes' OpenSubtitles2016.raw.en
    real    0m1.333s
    user    0m0.974s
    sys     0m0.357s
This one is mostly thanks to glibc's memchr implementation (which uses SIMD of course), and the regex crate's frequency based searcher.

Of course, I'm presenting best cases here. Plenty of inputs can make ripgrep run quite a bit more slowly than what's shown here!

The crazy thing is that we're still only barely scratching the surface. Check out Intel's Hyperscan project for some truly next level SIMD use in regex searching!

Uf da, that's still some good speedups.

BTW as a happy daily user of rg thanks for all the work you put into it, definitely shows.

Anyone happen to know when inline asm will hit stable?

There are a lot of inline-asm related bugs in the compiler. I'm hitting them almost daily (I'm writing a kernel, so a fair bit of asm is going on here). You can see them in the bug tracker[0].

Stabilizing inline asm in its current form would be a mistake. It's simply not ready.

[0]: https://github.com/rust-lang/rust/issues?q=is%3Aopen+is%3Ais...

"Not ready, bugs" is a perfectly good reason to hold it back. Thanks.

Not slated any time soon. There are major questions about the approach; the current implementation is "do whatever LLVM does" which is not in itself stable, and so cannot be our final implementation.

clang manages with "Do whatever gcc does"; I wonder why that isn't good enough. Doing nothing useful now because we might want to do something more useful later seems a bit self-defeating, but I may be missing the point because I want to use inline assembly language in unsafe blocks.

The problem is, rust has stability guarantees to uphold. As such, depending on the unstable behavior of an upstream component is not possible in the stable branch of rust.

Furthermore, there are worries that depending on the llvm behavior might impede work on alternative rust backends, such as cretonne.

Clang isn't using LLVM's inline asm directly, it translates from the (stable) GCC format into the internal LLVM one. rustc could theoretically do the same translation, but implementing that is significantly more work than just stabilizing the current implementation of asm!().

That and GCC's format is not exactly great. I'm generally looking unfavorably towards NIH but in this instance I really hope they figure out something better.

GCC's format mostly exposes parts of GCC's internal machine description language (to the point that, in the past, most of it was documented only in the "GCC internals" part of the manual). It comes from a simpler time.

What exactly are you wanting to do with it? There are a few different things you might want, and for many of them there may be alternatives that work just as well, if not better (e.g. intrinsics for operations like SIMD: https://github.com/rust-lang-nursery/embedded-wg/issues/63).

Any idea what SIMD in Rust via WebAssembly will look like?

Wasm doesn’t support SIMD yet, so no way to tell.

Thanks, wasn't able to easily discern how far along the browser SIMD wasm implementations were.

Do you see a future where wasm has its own std::arch module?

Yeah, SIMD is on the wasm roadmap; we’ll see what form that takes for exact details, but that’s what I’d expect.
