Working with strings in Rust (2020) (fasterthanli.me)
38 points by tigerlily on Sept 23, 2022 | 49 comments



I found the writing style, the presentation as a "naive beginner" discovering facts about programming by writing test programs, to be insufferable. E.g.:

> Okay, so, argv is an array of addresses, and at those addresses, there is.. string data. Something like that:

> It looks like every argument is terminated by the value 0. Indeed, C has null-terminated strings.

> If I'm, uh, reading this correctly, "é" is not a char, it's actually two chars in a trenchcoat. That seems... strange.

The inclusion of all the "naive test programs" serves no purpose but to inflate the article's length to ridiculous levels.


I'm the opposite. As someone less familiar with C-style string manipulation, I find all of these examples that map onto my current knowledge great.

Lots of complete working examples also let me test the mental model I develop while learning.


Yeah I learned so much from this article. Generally I find the OP pretty verbose and not that interesting but I really loved this one.

It may or may not be relevant that I've been reading TAOCP assembly programs recently so may be thinking more in terms of pointers.


I think the article's introduction is unhelpful. If it made it clearer early on who the intended audience is, it would do better.

(I think the intended audience is something like "people who want to read about things they already sort-of-know, set out as a nice story to help everything fit into place". The apparent audience of "people who are comfortable reading C but don't know that C strings are null-terminated" is presumably a rather small one.)


I find this style of writing fantastic. The author pretending to realize shows exactly when there is a realization to be had. Examples and counterexamples, problems and their resolution. I have read a number of these and they are all brilliant, I think. The length, density and "rhythm" are just right.


Half of the article is below the level of /r/coding as far as C goes, only to boast for the billionth time about how much safer Rust is.


What I hear you saying is: "this article goes over a bunch of basic C gotchas that I've hit enough times to reflexively avoid, C isn't that hard, you just have to stop making newb mistakes to use it moderately correctly".

Is that an accurate summary of your stance?


I'd say no, because this leads to the "macho programmer", "leet programmer" or "cowboy programmer" clichés and it is not very constructive.

If you want me to go there, I will say that indeed, saying "just pay attention" is not an answer - "find ways to not let yourself make those mistakes" is already a more valuable piece of advice.

I will also say that one should try this style (that is, program stuff in a type-less language like assembly) rather than dismiss it as "old" or "obsolete". Lack of safety will never be obsolete, so learning how to deal with it is a valuable skill.


one core thesis of rust is that history has adequately demonstrated that humans do not have sufficient vigilance


I think this is obvious if you've ever worked in a large project.

Yes, you can always avoid making mistakes, but you can't make others avoid the same mistakes. It just never scales.

The software industry on the other hand is scaling heavily. Which is why what worked in the 80s isn't good enough anymore.


My line of thought - which is perhaps partly and indirectly based on "Small is beautiful" [1] applied by others to software - is that if "large project" is what gets you into (various) troubles, then you should avoid doing this. And indeed, the idea of breaking large projects into small, human-size components is a common "good practices" recommendation.

OTOH I experience quite the opposite every day in the software industry, so I understand one could reject that position as too idealistic.

[1] https://en.wikipedia.org/wiki/Small_Is_Beautiful


> [I]f "large project" is what gets you into (various) troubles, then you should avoid doing this.

This makes perfect sense and I do agree. But I also wonder what you would say to someone writing, say, an operating system or a browser? Projects that are inherently large and where the stakes of errors are high.

I do think you need great safety and abstractions to raise your complexity ceiling to the level of these projects, if that's what you've set out to do.


An OS is actually not that "large" - at least if one looks at Linux. If you do a simple LoC count you do get an answer in the millions, yet the core is much smaller and the rest is in a bazillion drivers.

Browsers are the typical case of a project/product that became bloated over time; they began as relatively simple document downloaders and viewers, but now they are also application downloaders and execution environments. This causes all kinds of problems for users and implementors. The answer is somewhat eye-rolling: we should go back to Flash Player times (with something better than Flash Player though). Let browsers handle HTML/CSS, and delegate everything else (PDF, big media files,...) to dedicated applications chosen by the end user. One can still do that, indeed, except it has become more and more difficult - even for simple text-and-images documents - because websites are abusing JavaScript capabilities.

"we have to handle complexity better" is the common way of thinking, and I think it is wrong as a first reaction. Trying to lower the complexity first is better.


I agree that it's better to reduce complexity and that browsers have been subject to mission creep, but a.) I'm not really convinced that adding more integration points will reduce complexity and b.) the mission creep already happened - I don't think we get to reverse it. Creating an application platform _is_ the mission of a browser now. I think there would be value in an alternative that was just a simple document viewer and form filler, but it'd be a different kind of application and customers would need to be sold on it from scratch.

Is your position that no project needs to be large if we employ the right strategy, or that we should look at our complexity ceiling and say, "okay, that's how large a project we can take on?" Or something else?


After skimming that Wikipedia article, I think I understand better that your criticism is that these are inappropriate technologies & that we're increasing our complexity budget instead of making technologies that are more appropriate. And I do fully agree with that.

Thank you for mentioning this book, I look forward to reading it.


> below /r/coding

What does this mean?


A reference to the Reddit subforum: https://www.reddit.com/r/coding/


I believe they're asking, "what are you implying?"


No. What does “below <subforum>” mean? Below their standards?


Yes. You can call me smug/elitist, I deserve it for that unnecessary belittling.


Ok, I see now it says “below the level of”, which it apparently didn’t originally say according to my copy-paste.


Comparing a modern language (Rust) to an ancient one like C seems like a straw man argument. For Rust to succeed it needs to be better than other current alternatives, so for an article written in 2020, C++17 would seem a more relevant comparison.

"Rust is safer than C" - meh, so what? So is C++ ...


This is a straw man of the article though. It's a discussion of string representation and Rust language features - decisions that were motivated by C.

This isn't an article about why you should use Rust, it's about strings in Rust.


> This isn't an article about why you should use Rust, it's about strings in Rust.

Well, the article literally has more C code in it than Rust, so it sure seems to be about comparing Rust to C.

Why the author thought it would be useful to compare Rust's strings to C, a language that literally doesn't even have a string type is beyond me. Kind of like saying "Look how fast I can run! Even faster than a walrus!"


Like I explained, it's difficult to understand Rust's language design - especially why it has multiple string types - without understanding C. Really, it's difficult to understand the decisions of most languages, without understanding C.


Rust hype seems to be reaching critical levels. Is it really that perfect?


It's a very nice language, and not just because of the language features. The tooling is awesome and modern. Compiler errors are top class, cargo is a great package manager, rust-analyzer is a great LSP.

However, I think the way the language has implemented `async` is going to cause a rift in the usage. Because they didn't standardize the executor interface, you're essentially forced to use a single runtime (Tokio), since library developers would otherwise have to code for each runtime. Also, because of the async keyword, library developers would also have to make a different sync/async of each function they want to declare.

These are not small issues and will only continue to get worse as people are forced to choose one route, leaving others to use a different language.


> Also, because of the async keyword, library developers would also have to make a different sync/async of each function they want to declare.

No. If you are an async program calling a sync library, you can use `tokio::task::spawn_blocking`, and Tokio will run your sync function on a worker thread that's allowed to block https://docs.rs/tokio/latest/tokio/task/fn.spawn_blocking.ht...

The `reqwest` library offers a sync API by just wrapping its own async API: https://docs.rs/reqwest/0.11.12/src/reqwest/blocking/client....

For libraries that don't do this convenience wrapping, you can just create an async runtime like `tokio::runtime::Runtime` and use a function like `block_on`: https://docs.rs/tokio/latest/tokio/runtime/struct.Runtime.ht...

Since all Rust programs have sync entrypoints, there is always some way to call async functions from sync functions. It's not like C# or Go where the async runtime is a singleton built deep into the language. You always have control of it.

In practice, if I'm doing anything with networking, I always end up wanting async, so I prefer to have the runtime and use spawn_blocking for sync APIs.
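
To make the "sync entrypoint calling async" idea concrete without pulling in any crate, here is a minimal sketch of a `block_on` built only on the standard library. `ThreadWaker`, this `block_on`, and `fetch_length` are made-up names for illustration, not Tokio's implementation, and it only suffices for futures that don't need a reactor (real timers or network I/O still want a proper runtime).

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Hypothetical waker for this sketch: waking just unparks the blocked thread.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

// Toy `block_on`: poll the future on the current thread until it completes.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            // Sleep until the waker unparks us, then poll again.
            Poll::Pending => thread::park(),
        }
    }
}

// Stand-in for an async-only library function (assumed for illustration).
async fn fetch_length(s: &str) -> usize {
    s.len()
}

fn main() {
    // A plain sync `main` driving an async function to completion.
    let n = block_on(fetch_length("hello"));
    println!("{n}"); // prints 5
}
```

With Tokio the shape is the same, except the runtime's own `block_on` also drives I/O and timers.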


That's async calling sync. But if you're in sync and the only option is to call an async library because the author didn't want to make two versions then you have to pull in the async runtime as well.

Not everyone wants to pull in an async runtime just to make a call to an API.

My whole point is async puts a huge burden on library authors that will force users to diverge in their approaches and that will ultimately hurt the community


Rust certainly wants to standardize the executors interface. People are working on it. See https://www.ncameron.org/blog/portable-and-interoperable-asy...


> The plan

> Kidding, I don't have a plan.

:|


> [B]ecause of the async keyword, library developers would also have to make a different sync/async of each function they want to declare.

This is a problem with async in general, no? Go avoids this by just having everything be async, but that seems unsuitable for Rust since you won't have control over when your code executes and what thread it executes on. That's not a problem for applications, but it would seem to rule out things like kernel modules and firmware, as well as require every type to be Send + Sync (which ultimately means the language must be garbage collected).

In Python for example I've written async and sync versions of a library, and some magic testing that makes sure both implementations return the same results.

> These are not small issues and will only continue to get worse as people are forced to choose one route, leaving others to use a different language.

Honestly this doesn't seem like a problem to me. I think there's room for many languages in this world. Rust is a wonderful language but it isn't a "one true language." People should choose a different language if it suits them better. That's no more an indictment of Rust than it's an indictment of Python when I embed Rust in a web app - that's not a place where Python shines.


Sadly, I agree. It's a lot like C#'s quest against nullability: the feature was introduced too late into a stable design in a form that is obviously incomplete, tries too hard to maintain compatibility with existing code and fails.


I've never worked with C# so I need to look into that.

The one saving grace with Rust is if everyone decides to say "screw async" and just builds synchronous APIs, then we use something like [May](https://github.com/Xudong-Huang/may) for green threading.


It's not, but it's very good for writing system software. Modern C++ is also good, probably good enough for the majority of the cases, but Rust is there for you when you can't afford a GC and you must be absolutely sure you're not introducing memory management errors.


If you're using modern C++ then you really have to be screwing up to introduce memory management errors. For 99% of programs high level (STL-based) types are sufficient, and if you really need to be allocating things yourself, then smart pointers and RAII make it pretty hard to mess up, unless you choose to circumvent the scope-based lifetime they provide and start doing things like ptr.release() instead.

To me C++ pretty well follows the "make simple things easy/safe, and hard things possible" philosophy. I'm glad it still does support low level stuff, but the use cases for that are minimal and I do wish there was a compiler flag to reject full C backwards compatibility and only support a modern safe subset, so that projects could protect themselves from junior developers shooting themselves in the foot.


Smart pointers and other managed types are nice, but memory errors are still easy to make, e.g. pushing into a std::vector from several threads.


Well, yeah, but that's not really a memory error - it's an error of not using mt-safe data structures for mt programming. It's no different than using int vs std::atomic<int> in an mt context.

It's a shame that C++ doesn't provide mt-safe versions of STL types in the standard library, although it's trivial to wrap them yourself (1 line of code per method, using std::lock_guard).

Are all Rust data types mt-safe? Does unsafe mode provide faster unprotected versions if you need those?


> Are all Rust data types mt-safe?

Yes, but not in the sense you probably think, given your second sentence:

> Does unsafe mode provide faster unprotected versions if you need those?

Some data types can be safely shared between threads, and some cannot. Rust checks at compile time if you try and use a non-thread safe data structure from multiple threads, and if you do, will give you an error. So in that sense, all of them are safe, yes.

You don’t use unsafe to get access to non-thread safe data structures, you may have both kinds, and the compiler checks you use them correctly.
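
A small sketch of what that looks like in practice, using only the standard library (the helper name `sum_on_other_thread` is made up for the example): the thread-safe `Arc` can be moved into another thread, while the commented-out `Rc` lines are rejected by the compiler rather than failing at runtime.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical helper: sums the vector on a freshly spawned thread.
fn sum_on_other_thread(data: Arc<Vec<i32>>) -> i32 {
    // `Arc<Vec<i32>>` is `Send`, so moving it into the closure compiles.
    let handle = thread::spawn(move || data.iter().sum::<i32>());
    handle.join().unwrap()
}

fn main() {
    let shared = Arc::new(vec![1, 2, 3]);
    let total = sum_on_other_thread(Arc::clone(&shared));
    println!("{total} (original still usable: {:?})", shared);

    // `Rc` is the single-threaded counterpart. No `unsafe` is needed to use it,
    // but uncommenting these lines fails to compile because `Rc` is not `Send`:
    // let local = std::rc::Rc::new(vec![1, 2, 3]);
    // thread::spawn(move || local.len());
}
```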


Interesting. So I assume there are at least both basic and thread-safe versions of basic types like string/list/vector/map ?

Is there any provision for building your own thread-safe types (e.g. a structure composed of other types) out of non-thread-safe types and mutexes, and if so how does that work in terms of compile-time errors ?


> I assume there are at least both basic and thread-safe versions of basic types like string/list/vector/map ?

To my knowledge, no. If you want to push to a vector from multiple threads, you "wrap" the vector in a `Mutex`. The difference between C++'s std::mutex or std::lock_guard and Rust's Mutex, is that the compiler refuses to let you touch the data protected by the mutex unless you have a lock acquired on the mutex.
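
Roughly, and only as a sketch (the function name `push_from_threads` is invented for the example), pushing to one vector from several threads looks like this. The point is that the `Vec` is only reachable through the guard returned by `lock()`:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical helper: each of `n_threads` threads pushes its index into one shared Vec.
fn push_from_threads(n_threads: usize) -> Vec<usize> {
    let shared = Arc::new(Mutex::new(Vec::<usize>::new()));

    let handles: Vec<_> = (0..n_threads)
        .map(|i| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                // There is no way to reach the Vec without taking the lock first.
                shared.lock().unwrap().push(i);
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    // Push order depends on scheduling, so sort before returning.
    let mut result = shared.lock().unwrap().clone();
    result.sort();
    result
}

fn main() {
    println!("{:?}", push_from_threads(4)); // prints [0, 1, 2, 3]
}
```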

> Is there any provision ...

Yes!

1. In Rust, types can implement "traits".

2. There are two traits that control thread safety: `Send` and `Sync`. Basically, any type that "implements the trait" `Send` is safe to send across threads. And a reference to any type that implements `Sync` is safe to send across threads.

3. `Send` and `Sync` are "auto-traits", which means that if you make a data structure out of primitives that all implement `Send`, your data structure will also implement `Send`, same for `Sync`.

4. There are a bunch of thread-safety primitives that you can use (like `Arc` (atomic reference counters), `Mutex`, ..etc) to build thread safe data structures.

> how does that work in terms of compile-time errors?

The compiler will not let you send a type across threads if it doesn't implement `Send`, and it won't let you send a reference to a type if the type doesn't implement `Sync`!

That way, if you avoid using the `unsafe` keyword, and the compiler agrees to compile your code, you can be sure you won't have data races!
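
A minimal sketch of the auto-trait behaviour described above. The `assert_send` / `assert_sync` helpers and the two structs are hypothetical, written just for this example; the helpers do nothing at runtime and exist only so the compiler checks the trait bound.

```rust
use std::rc::Rc;
use std::sync::{Arc, Mutex};

// Hypothetical helpers: a call only compiles if T satisfies the bound.
fn assert_send<T: Send>() {}
fn assert_sync<T: Sync>() {}

// Built only from Send + Sync parts, so both auto traits are implemented for it.
#[allow(dead_code)]
struct Counter {
    name: String,
    hits: Arc<Mutex<u64>>,
}

// Contains an `Rc`, so the compiler does not implement Send or Sync for it.
#[allow(dead_code)]
struct LocalOnly {
    data: Rc<Vec<u8>>,
}

fn main() {
    assert_send::<Counter>();
    assert_sync::<Counter>();

    // Uncommenting the next line is a compile error, not a runtime failure:
    // assert_send::<LocalOnly>();
    println!("Counter is Send + Sync");
}
```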


There are mutexes & other synchronization primitives. The stdlib doesn't ship a thread safe Vec or HashMap, but there are many on offer from the community. Rust's stdlib is small; lots of functionality you might expect is in community provided libraries (at least presently - I imagine thread safe collections like that will be added in the future, when the community comes to a consensus about how they should look). This works out better than you might expect, and avoids hazards that, say, Python encounters, where parts of the stdlib are so bad there is no safe use case (looking at you, `csv.Sniffer`).

In Rust, there are traits (you might know them as interfaces or protocols from other languages) called Send and Sync which tell the language whether or not something can be sent to a different thread, or shared between threads.

Vec<T> is Send (as long as T is) - you could pass ownership of it to a different thread, surrendering your access to it in the process. It is also Sync when T is, but a shared reference only allows reads, so two threads couldn't both mutate it. Almost everything is Send, unless it contains something like an Rc or references to some thread local state.

Mutex<Vec<T>> is Sync - you can share it between threads. Basically if you take the lock, you'll get back a smart pointer to your Vec<T>.

These traits are generally implemented automatically; you can implement them on a type yourself, using unsafe, but you'd only do that if you were writing your own synchronization primitives or thread safe data structures.

So, the compiler is able to infer which types are and are not thread safe, and what flavor of thread safety each has. It's then able to use that to check for thread safety violations at compile time.

There's more to it than that, in particular it's possible to share a read-only reference to a Vec<T> between threads (with a compile time guarantee that no one has a mutable reference to it), but I'd refer you to the book[1] or to other articles on HN if you wanted the specifics.

[1] https://doc.rust-lang.org/stable/book/
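
As a rough illustration of the read-only sharing case, here is a sketch using `std::thread::scope` (the function name `sum_and_max` is made up for the example). Both threads borrow the same data immutably; trying to push to the vector inside either closure would not compile.

```rust
use std::thread;

// Hypothetical helper: two scoped threads read the same borrowed data concurrently.
fn sum_and_max(data: &[u32]) -> (u32, Option<u32>) {
    thread::scope(|s| {
        // Both closures only hold shared (read-only) borrows of `data`.
        let sum = s.spawn(|| data.iter().sum::<u32>());
        let max = s.spawn(|| data.iter().copied().max());
        (sum.join().unwrap(), max.join().unwrap())
    })
}

fn main() {
    let values: Vec<u32> = vec![1, 5, 3];
    // `&values` coerces from &Vec<u32> to &[u32]; no Mutex or Arc is needed for reads.
    let (sum, max) = sum_and_max(&values);
    println!("{sum} {max:?}"); // prints 9 Some(5)
}
```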


Far from it. Its main selling point is that it's friendly.

I had C/C++ in college and spent most of my career in front-end.

In comparison to the Cs, Rust at least tries to get out of your way. Stuff generally compiles and when it doesn't, it tells you what might be wrong using human language.

I mean, it's the first time I've seen a compiler error start with "perhaps..." to tell you what steps could be taken to fix it.


It's great to have a compiler give useful diagnostics, but it has to be said that's mostly a function of the compiler vs language.

LLVM-based compilers such as the rust compiler and Clang (C++) tend to be way better than g++ in this regard, although Clang competition has forced g++ to be a little better in recent years.


It's got about the same hype that node or Go got, so it's not like you need a high level of perfection for that. Never mind that some of that is coming from a crowd I'm not too fond of.

Personally, I don't think it's worth all that just to avoid a garbage collector, but it's not like there are that many decent compiled languages with GC currently on the hype train.


This is FAR too long. The first 10 pages are all about C strings. Just get to the point of why Rust has these 2 types and give people a link to a piece on Unicode and UTF-8 if that background is going to help. I didn't even get to the Rust part before giving up. I then scanned it, found a little Rust stuff but still couldn't easily pick out the answer to the top-level question. Now I don't care.


For anyone who does care, the answer is that String is not only mutable, it's always heap allocated and can grow, whereas &str is just an immutable reference and so it could be pointing at the text in your String, but equally it could point to the stack, or your text segment, or a ROM.

C++ programmers might do OK to think of &str as like a C++17 string_view, except it's promised to always be UTF-8 text, and because it was provided from the outset in the language this type is always used where appropriate, so the string literal "Test" is a &'static str, and all the APIs you use work with &str and never some archaic pointer type.

Rust's String is a bit like C++ std::string, though deliberately lacking the Small String Optimisation but again always UTF-8, and String implements the Trait Deref<Target=str> which means most read-only methods are defined on &str but work fine on String anyway.

[edited: A handful of read-only methods make sense specifically for String, like asking how much space is allocated for the String, so those are in fact methods on String]
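
A short sketch of the above, in case it helps (the `first_word` function is just an invented example of an API that takes &str):

```rust
// Takes &str, so it accepts literals, slices of a String, and &String alike.
fn first_word(text: &str) -> &str {
    text.split_whitespace().next().unwrap_or("")
}

fn main() {
    // A literal is a &'static str pointing into the program binary.
    let literal: &'static str = "Test string";

    // A String owns a growable heap buffer of UTF-8 bytes.
    let mut owned = String::from("héllo");
    owned.push_str(" world");

    // A &str can also point into a String's buffer. "é" takes two bytes in UTF-8,
    // so "héllo" is the first 6 bytes.
    let view: &str = &owned[0..6];

    // &String coerces to &str through Deref<Target = str>.
    println!("{} / {} / {}", first_word(literal), first_word(&owned), view);

    // len() counts bytes, chars() counts Unicode scalar values.
    println!("{} bytes, {} chars", owned.len(), owned.chars().count());

    // capacity() is one of the methods that only makes sense on String.
    println!("capacity is at least {}", owned.capacity().min(owned.len()));
}
```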


Rust has multiple string types _because_ of the legacy of C strings.


Can you explain how Rust's string types would be different if there was no "legacy of C strings" to deal with?



