
My Favorite Rust Function Signature - brundolf
https://www.brandonsmith.ninja/blog/favorite-rust-function
======
steveklabnik
You don't have to write the lifetimes here, this can just be

    
    
      fn tokenize(code: &str) -> impl Iterator<Item=&str> {
    

but including them to be more explicit does make sense for the post.

~~~
milliams
In that case, how does the caller know that the results will last as long as
the input? Does it assume `'static` or is there some cleverness?

~~~
pornel
Elision never assumes 'static, since that's an unusual case that has very few
real-world uses.

This leaves only one possibility: references in the return type must have the
same lifetime as references in the function's arguments. Without a GC, or
leaking memory, you can't make a Rust reference/lifetime out of thin air, so
there's always such a relationship.

When a function takes multiple arguments by reference, the compiler will ask
you to disambiguate which one is used for the return type. The borrow checker
intentionally checks borrowing against function prototypes, not function
bodies.
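
A minimal sketch of those two cases. The whitespace-splitting body and the `first_token` helper are assumptions made up for illustration, not the article's tokenizer:

```rust
// One reference argument: elision ties the returned references to `code`.
fn tokenize(code: &str) -> impl Iterator<Item = &str> {
    code.split_whitespace()
}

// The same signature with the lifetime written out explicitly.
fn tokenize_explicit<'a>(code: &'a str) -> impl Iterator<Item = &'a str> {
    code.split_whitespace()
}

// Two reference arguments: the compiler will not guess which one the
// result borrows from, so the lifetime has to be named on `code`.
fn first_token<'a>(code: &'a str, _delimiters: &str) -> Option<&'a str> {
    code.split_whitespace().next()
}
```

Dropping the `'a` annotations from `first_token` should produce a "missing lifetime specifier" error, which is the compiler asking for the disambiguation described above.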

~~~
IshKebab
> Elision never assumes 'static

I'm pretty sure boxed trait objects (maybe all boxed objects?) are assumed to
be 'static. That is, Box<dyn MyTrait> is shorthand for Box<dyn
MyTrait<'static>>. If you want non-static lifetimes you have to do something
like Box<dyn MyTrait<'_> + '_>

(I only just learnt this so take it with a pinch of salt.)

~~~
estebank
You might be interested to know that the + '_ is inferred from the lifetime of
the trait bound it comes from. By default if no lifetime is involved what will
be inferred is 'static (namely, an owned thing). This means that if you write
Box<dyn MyTrait<'_>> that's enough. If MyTrait _doesn't_ have a lifetime
parameter then you might have to write Box<dyn MyTrait + '_> to "flip the
inference" from the default, but that only comes up in practice on return
types, particularly with impl Trait.

TL;DR: Trait<'a> is actually understood as Trait<'a> + 'a.

~~~
IshKebab
I don't think so. Try removing the `+ '_` from this code:

[https://play.rust-lang.org/?version=stable&mode=debug&editio...](https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=97692026a1d749b0e3fba760a2b1759a)

~~~
estebank
Oh, you're right. Clearly I'd forgotten the rule. `'a` is not inferred for
`dyn Trait<'a>`, it is inferred for `&'a dyn Trait`[1].

[1]: [https://github.com/rust-lang/rfcs/blob/master/text/0599-defa...](https://github.com/rust-lang/rfcs/blob/master/text/0599-default-object-bound.md)
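
For anyone skimming, here is a small sketch of the defaults as I understand them from that RFC. The `Shape` trait and both square types are hypothetical, made up only to show where the bound is and isn't needed:

```rust
trait Shape {
    fn area(&self) -> f64;
}

// Owns its data, so it is valid for 'static.
struct Square(f64);
impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

// Borrows its side length, so it is only valid for 'a.
struct BorrowedSquare<'a>(&'a f64);
impl<'a> Shape for BorrowedSquare<'a> {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

// `Box<dyn Shape>` defaults to `Box<dyn Shape + 'static>`.
fn owned_shape(side: f64) -> Box<dyn Shape> {
    Box::new(Square(side))
}

// Boxing borrowed data needs the explicit bound to override that default.
fn borrowed_shape<'a>(side: &'a f64) -> Box<dyn Shape + 'a> {
    Box::new(BorrowedSquare(side))
}

// `&'a dyn Shape` defaults to `&'a (dyn Shape + 'a)`, so no bound is needed.
fn as_shape<'a>(sq: &'a BorrowedSquare<'a>) -> &'a dyn Shape {
    sq
}
```

Removing the `+ 'a` from `borrowed_shape` should fail to compile, because the default `'static` bound cannot be met by a type that holds a borrow.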

~~~
IshKebab
Aha! Thanks for the link. Why isn't this in the official documentation
anywhere?

Also I'm going to point people here next time someone tries to claim that Rust
is simple.

~~~
estebank
If you look at the date, that RFC is from before my time. My guess would be
that you don't _need_ to know the rule: 95% of the time you don't need to use
`+ '_` anywhere (these rules predate the ability to use `'_` to begin with,
which is why the RFC talks about `+ 'a` only) because the behavior matches the
user's expectation (not having to write `&'a Foo<'a> + 'a`), and, nowadays,
in the 5% of cases where you _do_ need `'a`, the compiler will tell you about it:

    
    
      help: to declare that the trait object captures data from argument `self`, you can add an explicit `'_` lifetime bound
         |
      14 |     fn things(&self) -> Box<dyn Things<'_> + '_> {
         |                                            ^^^^
    

Edit: having said that, the docs _do_ explain these rules[1], reachable from
the trait object documentation[2], while it seems that The Book doesn't.

I'm also noticing that my confusion was likely due to `&'r Ref<'q, Trait>`
being interpreted as `&'r Ref<'q, Trait+'q>`, which is behavior that was
introduced in a follow up RFC[3].

[1]: [https://doc.rust-lang.org/reference/lifetime-elision.html#de...](https://doc.rust-lang.org/reference/lifetime-elision.html#default-trait-object-lifetimes)

[2]: [https://doc.rust-lang.org/reference/types/trait-object.html?...](https://doc.rust-lang.org/reference/types/trait-object.html?highlight=trait,object#trait-object-lifetime-bounds)

[3]: [https://github.com/rust-lang/rfcs/blob/master/text/1156-adju...](https://github.com/rust-lang/rfcs/blob/master/text/1156-adjust-default-object-bounds.md)

------
quietbritishjim
> The downside is that, if you perform this as a separate pass, your parser
> now has to iterate over all of the source code twice. This may not be the
> end of the world: tokenizing isn't the most expensive operation. But it
> isn't ideal, so some parsers combine the two passes into a single one,
> saving cycles at the expense of readability.

This paragraph is a bit muddled:

* If you keep the two passes separate, then the second pass is over the token stream. So it's irrelevant that "tokenizing isn't the most expensive operation" because you're never going to do the actual tokenising a second time.

* If you combine the two passes then you're not "saving cycles" because you're _still doing both of the loops_ - it's just that they're now happening in parallel rather than in sequence. This confusion is repeated later when he says the combined one is "without the runtime overhead of a second loop".

* What you _are_ saving is memory, because you don't have to keep the whole token stream in memory at once.

But really this is a nitpick on a small note. The overall article is great.
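
To make the memory point concrete, a rough sketch of the two shapes. The whitespace tokenizer and the `count_numbers` "parser" are stand-ins I made up, not the article's code:

```rust
// Separate passes: every token is stored in a Vec before parsing starts.
fn tokenize_eager(code: &str) -> Vec<&str> {
    code.split_whitespace().collect()
}

// Combined passes: tokens are produced one at a time, as the parser asks.
fn tokenize_lazy(code: &str) -> impl Iterator<Item = &str> {
    code.split_whitespace()
}

// Stand-in for the parser pass: counts tokens that parse as integers.
fn count_numbers<'a>(tokens: impl Iterator<Item = &'a str>) -> usize {
    tokens.filter(|t| t.parse::<i64>().is_ok()).count()
}
```

Both `count_numbers(tokenize_eager(src).into_iter())` and `count_numbers(tokenize_lazy(src))` visit every token exactly once; only the eager version holds the whole token stream in memory at the same time.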

~~~
Rotten194
Nitpick^2: You are probably saving a few cycles due to loop overhead around
pointer offsets and bounds checking, + not having to load the items into cache
twice, if the sequence is too long to fit into cache.

~~~
quietbritishjim
Agreed! I definitely wasn't telling the whole truth by saying the amount of
cycles spent on looping is ultimately the same. Although the difference is
complex enough that I could imagine it going either way (but you're right that
the cache effects would probably be the clincher).

------
csears
Really enjoyed seeing this broken down and explained. Is a function like that
considered idiomatic Rust?

~~~
steveklabnik
Yep! See my comment about the lifetimes, though, you may not actually write
them out in the real code, but conceptually, this is Rust's bread and butter.

------
db48x
That function really is a good example for explaining lifetimes in Rust.

~~~
rob74
The function may be a good example, but this blog post doesn't really do a
good job at explaining it, at least not to people unfamiliar with Rust (which
I assume it's aimed at). If I understood correctly, if you assign the lifetime
_'a_ (or _'x_, _'l_, whatever) to all parameters, you tell the compiler
that they all have the same lifetime? And if you wanted to have different
lifetimes, you could use _'a_ and _'b_? But the article doesn't explicitly
mention that, so you could end up thinking that _'a_ is some kind of a
magical constant that you have to use...

~~~
brundolf
The target audience was programmers who are at least familiar with the basic
concepts of stack, heap, and pointers, including but not limited to newer Rust
programmers.

You make a good point about 'a; I could have made that clearer. In practice it
works the same way as a generic type (the letter is arbitrary, and it's
declared inside the < >), which I assumed people would be familiar with, but I
can see how the connection between the two may not have been obvious.
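
For illustration, a sketch with two independently named lifetimes. The `strip_keyword` function and the names `'src` and `'pat` are hypothetical; any identifiers would do:

```rust
// Lifetime names are arbitrary and declared inside < >, like generic types.
// The result borrows only from `code`, so `keyword` may be dropped earlier.
fn strip_keyword<'src, 'pat>(code: &'src str, keyword: &'pat str) -> &'src str {
    code.strip_prefix(keyword).unwrap_or(code)
}
```

Because the return type mentions only `'src`, a caller can build `keyword` in an inner scope, let it go out of scope, and keep using the returned slice. Giving both parameters the same `'a` would instead tell the compiler the result may borrow from either one.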

~~~
db48x
You can still edit it to improve it in that regard. Maybe add a footnote or
something.

~~~
brundolf
Yes, I plan to go back and make several improvements this evening after I'm
off work :)

------
wwright
It'll be fun when we get first-class generators and you can write a simple,
imperative function and no helper types and still use this type signature :)

~~~
steveklabnik
Until then, [https://doc.rust-lang.org/stable/std/iter/fn.from_fn.html](https://doc.rust-lang.org/stable/std/iter/fn.from_fn.html) can handle many cases.
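
A rough sketch of what that could look like for a tokenizer. The whitespace-splitting rule is an assumption for illustration, not the article's grammar:

```rust
// Keeps the same signature, with no helper struct or Iterator impl.
fn tokenize(code: &str) -> impl Iterator<Item = &str> {
    // The closure owns this cursor and advances it on every call.
    let mut rest = code;
    std::iter::from_fn(move || {
        rest = rest.trim_start();
        if rest.is_empty() {
            return None; // no more tokens
        }
        // A token runs up to the next whitespace, or to the end of input.
        let end = rest.find(char::is_whitespace).unwrap_or(rest.len());
        let (token, tail) = rest.split_at(end);
        rest = tail;
        Some(token)
    })
}
```

The closure plays the role of `next()`, and the captured `rest` slice is the only state, so each yielded token is still a borrow into the original input.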

~~~
brundolf
Oh that's really neat, I didn't know about that

------
haskellandchill
> This is something that you cannot do both a) this safely and b) this
> efficiently in any other language.

This is wrong. Pretty sure you could do this in ATS for example.

~~~
brundolf
I may have overstated that. What I really meant was "any other language that
I'm passingly familiar with", which includes the majority of mainstream
languages.

~~~
jrpelkonen
Substrings in older versions of Java used to reuse the character array of the
original string instead of making a copy. The flip side is that the original
string must be retained in memory until all the substrings have “become
garbage”. So there is a trade-off.

~~~
ameliaquining
The Rust version has that same tradeoff; the string can't be deallocated until
all the tokens' lifetimes have ended. (In this particular use case, that's not
really a cost, since you probably need to keep the string available the whole
time anyway.)

Go, which is garbage-collected, also does the slice-by-reference trick (but
fails criterion #1 because it doesn't really have iterators).

------
timvisee
You might find generics useful as well, to support any parameter that can be
referenced as a `&str`, such as a `String`:

    
    
        fn tokenize<S: AsRef<str>>(code: S) -> ...
            code.as_ref()
    

Edit: whoops, was too quick, apparently this isn't a great idea here.

~~~
alilleybrinker
In cases like this, it can be useful to have the generic function call a
separate, private function which does the implementation and is _not_ generic.
Generic functions are monomorphized, meaning copies are generated for each
concrete type they're called with. Separating them out in this way can reduce
the size of the generated code after monomorphization.
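
A sketch of that pattern, using a hypothetical `count_tokens` that returns an owned value rather than borrowed tokens:

```rust
// Thin generic shim: one small copy is generated per caller type.
pub fn count_tokens<S: AsRef<str>>(code: S) -> usize {
    // Non-generic inner function: the real work is compiled only once.
    fn inner(code: &str) -> usize {
        code.split_whitespace().count()
    }
    inner(code.as_ref())
}
```

Both `count_tokens("a b c")` and `count_tokens(String::from("a b"))` compile, and each instantiation only does the `as_ref()` call before handing off to the shared `inner`.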

~~~
Arnavion
In cases like this, the generic function cannot return an Iterator of &str
that borrow from the input because the input is being consumed, so it's a bad
suggestion. The input needs to be &str to be able to return borrows from it.

If they had tried to complete their code and compile it, they would've noticed
there's no way the function could return `code.as_ref()`.

~~~
alilleybrinker
That's true that in this case their plan of making the input generic wouldn't
work, as it relies on the output being borrowed. The technique of reducing the
size of monomorphized functions would be useful in other contexts, but not
this one.

------
ridiculous_fish
Tokenizers usually need to retain some information about where each token came
from, like the range in the source.

In C++ I might do it like this:

    
    
        struct token {
            string::iterator start;
            string::iterator end;
        };
    

and then the range in the original source can be recovered via:

    
    
        size_t offset = tok.start - src.begin();
    

What is the preferred way to do this in Rust? Is there something better than
using indexes?

Rust doesn't appear to have a way to say "give me the location of this &str in
that &str" (even though it seems like it would be safe).

~~~
masklinn
> What is the preferred way to do this in Rust? Is there something better than
> using indexes?

I don't think so.

> Rust doesn't appear to have a way to say "give me the location of this &str
> in that &str" (even though it seems like it would be safe).

'course there is: str::find ([https://doc.rust-lang.org/std/primitive.str.html#method.find](https://doc.rust-lang.org/std/primitive.str.html#method.find))

It looks a bit abstract because it takes a Pattern rather than an `&str`, but
a char, a slice of chars, an &str, or a predicate function, are all patterns.

~~~
ridiculous_fish
str::find performs a substring search, which is incorrect if the same token
appears more than once.

What I want is the literal offset of a string slice in another, aka pointer
arithmetic in C:

    
    
       fn derp() -> usize {
          let src = "xxxxxxx";
          let tok = &src[3..4];
          let offset = tok.as_ptr() - src.as_ptr();
          offset // should be 3
       }
    

this should be doable safely; I think it's just an annoying hole in the API.

~~~
steveklabnik
Cast both of those as_ptr()s to usize and this does what you want.
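
A sketch of that, wrapped in a hypothetical helper (`offset_in` is my name, not a std API) with a range check added so an unrelated slice returns `None`:

```rust
// Byte offset of `tok` within `src`, or None if `tok` is not a subslice.
fn offset_in(src: &str, tok: &str) -> Option<usize> {
    // Compare addresses as plain integers; no unsafe code is needed.
    let src_start = src.as_ptr() as usize;
    let tok_start = tok.as_ptr() as usize;
    let inside = tok_start >= src_start
        && tok_start + tok.len() <= src_start + src.len();
    if inside {
        Some(tok_start - src_start)
    } else {
        None
    }
}
```

With `let src = "xxxxxxx"; let tok = &src[3..4];`, `offset_in(src, tok)` gives `Some(3)`. The result is a byte offset, not a character count.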

~~~
defen
I know it's just an example, but in case anyone tries to grab the code - it's
worth noting that this will panic if you try to slice into the middle of a
UTF-8 sequence, and will give byte offsets and not char offsets, so it might
not give the answer you expect if src contains multibyte chars before tok.
Also I'm fairly new to Rust so it's possible there's some other way to do it
that I just don't know about.

------
mrweasel
I really don’t like usage of ‘ in the function signature, it simply looks
wrong.

In general that is my main complaint as I am trying to learn Rust: the
syntax seems to be all over the place. Maybe that feeling will fade away as I
get more used to the language.

~~~
steveklabnik
Almost none of it is unique to Rust, but it does draw inspiration from a wide
variety of influences, so it can feel like that for sure.

This syntax was taken from OCaml.

Incidentally, nobody really loves it, but nobody could come up with anything
clearly better either.

------
jchook
This might be a good thread to mention nom[1], a parser combinator in Rust.

1. [https://github.com/Geal/nom](https://github.com/Geal/nom)

------
vlovich123
You can actually kind of do this with `std::shared_ptr` [0], but it's going to
be nowhere near as efficient[1] as the Rust mechanism. Maybe a single-threaded
implementation of it could come close but you'd have to roll it yourself or
use something like `boost::local_shared_ptr`. You may also need to go to
intrusive pointers to make sure there's memory alignment.

Of course the Rust variant is not only cheaper (no ref counts), it's also
guaranteed thread-safe (which again, C++ would have to use something like the
atomic `shared_ptr` to achieve)

TLDR: Even when C++ can kind of do it, you have to give up concurrency safety
to get close while adding memory pressure & still not being as fast while Rust
gives you guaranteed concurrency safety for free regardless of accidental
misuse of your API.

[0] Currently that would be constructor #8 on
[https://en.cppreference.com/w/cpp/memory/shared_ptr/shared_p...](https://en.cppreference.com/w/cpp/memory/shared_ptr/shared_ptr)
[1] Not only are you always paying for atomic ref counting even when you don't
need it, aliased `shared_ptr` will be as expensive to take a ref for as
normally constructing it - there's possibly no way to `make_shared` with
aliasing. This means the aliased shared_ptr will be even more expensive every
time you need to access the ref count since the control block loses cache
coherency. On the other hand if you structure it more carefully maybe you
could vend `const std::shared_ptr<std::string>&` in your iterator so that in
the single-threaded case your only overhead is constructing the aliased
shared_ptr once & you leave it to the user to take a copy when transferring to
another thread.

~~~
boris
Nobody in their right mind will use `shared_ptr` in this context. It will be
`string_view` in and out with lifetime management left as an exercise to the
programmer.

~~~
steveklabnik
I read your parent as saying "assume we start from needing this to be
checked," and then saying that this is the best that can be done. You're
absolutely 1000% right that nobody would write that ever and would use
string_view and not have it statically checked.

~~~
vlovich123
That’s exactly it. Just showing how trying to get the same safety guarantee in
C++ is an either-or between performance and safety.

------
noncoml
why do we need the lifetime after the function name? i.e. `tokenize<'a>`

~~~
dpbriggs
To make the function generic over some lifetime 'a, similar to how generic
functions can be generic over some type T (e.g. foobar<T>). This is usually
not necessary to write as it is elided in common use cases.

~~~
noncoml
Thanks! I think what was confusing me was that in some examples I see it
omitted and in some not. Now it's clear.

