C++ still isn't cutting it (da-data.blogspot.com)
190 points by dgacmu on Oct 18, 2020 | 205 comments



To summarize:

1) Rust's standard idioms (in this case, Results and Iterators) encourage less bug-prone or ambiguous APIs (standard and otherwise)

2) Rust's ecosystem makes it easy to find and integrate solutions not offered by the standard API

3) And then finally, Rust's borrow checker prevented a threading issue
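
As a rough illustration of #1 (my own sketch, not code from the article): Result plus the `?` operator force the error path to be acknowledged in the signature, instead of leaving it to the caller's discipline.

    use std::fs;
    use std::io;

    // The signature admits failure up front; `?` propagates errors
    // instead of letting them be silently ignored.
    fn count_lines(path: &str) -> io::Result<usize> {
        let contents = fs::read_to_string(path)?;
        Ok(contents.lines().count())
    }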

Others are rightly pointing out that only #3 is fundamentally unique to Rust. All of these other things could be done for C++. But, importantly, they haven't.

This raises an interesting dimension to the comparison which is that many of Rust's advantages don't really come down to its unique traits (heh) but to the simple fact that it's "C++ without baggage". It had a fresh start when it came to establishing standard idioms for everyone to use, providing cross-platform tooling for everyone to use, providing centralized package management for everyone to use, etc. If you introduced to C++ a Result type, or a package manager (I assume people have done this already), most C++ developers wouldn't end up using them. Most libraries wouldn't be using the same vocabulary. It would be arduous to even move the standard library over, because it would be a breaking change. Much of the tooling out there would probably never get specialized support.

I don't think these network effects and cultural forces get talked about enough. Sure, this stuff is "just a library", and C++ has libraries. But the culture and the baggage and the stakeholders around a language have a huge impact on what ends up being practical to do with it, independent from what's technically possible.


> 2) Rust's ecosystem makes it easy to find and integrate solutions not offered by the standard API

This is a double-edged sword. I'm a huge fan of Rust's tooling, but when it's so easy to add dependencies, this inevitably leads to deep dependency trees, slower compile times, and less knowledge about what's actually happening inside most rust code bases. Oftentimes when asking how to solve a problem in Rust, the answer comes in the form of "use crate X" rather than an explanation of how to solve the problem using the language itself.


> Oftentimes when asking how to solve a problem in Rust, the answer comes in the form of "use crate X" rather than an explanation of how to solve the problem using the language itself.

Do you think this is a bad thing? How often would you say it's a better use of a software engineer's time to carefully reimplement a commoditized library that already exists, whether for a speed improvement or just to understand it?

I don't really understand arguments that you shouldn't introduce convenience and quality of life features because programmers will lean on them too much. Leaning on them is the point; the implicit thesis is that programmers no longer have to understand what's happening under the hood. It's specialization and division of labor applied to programming.

And if you really care, you can probably just look at the source code or pick up a book to learn how it works?


I think it's a subtle problem, and it takes more than a few sentences (let alone the one-line "zingers" found elsewhere in this very thread) to make the required nuance in the argument clear. But just as one example, see my comment elsewhere in this thread about the use of globwalk in this example. That's bringing in a ton of code when compared to just using walkdir and checking the file extension directly. The crate ecosystem encourages this kind of emergent behavior.

People make the mistake of treating this issue as black and white: either you're for or against code reuse. But the reality is far more nuanced. Often, a dependency will solve a much more general problem than what you actually need, and thus, avoiding the dependency might result in solving a considerably simpler problem than what the dependency does. In exchange, you use less code, which means less to audit/review and less to compile.

Given my position as the author of a few core crates, I actually often find myself advocating against the use of those very crates when the problem could be solved nearly as simply without bringing in the dependency. (I did not author globwalk, but I did author its 'ignore' and 'walkdir' dependencies.)
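
Roughly, the walkdir-only version would look something like this (an untested sketch, assuming `walkdir = "2"` in Cargo.toml):

    use walkdir::WalkDir;

    fn main() {
        // Walk the tree, skip unreadable entries, and keep only .txt files.
        for entry in WalkDir::new(".")
            .into_iter()
            .filter_map(Result::ok)
            .filter(|e| e.path().extension().map_or(false, |ext| ext == "txt"))
        {
            println!("{}", entry.path().display());
        }
    }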


I would throw in that it's very hard to design APIs well and it's very hard to not couple things and include the kitchen sink of features.

Let's say someone makes an XML parser (just trying to pick an example). IMO a bad XML library would read files itself. Instead it should, at most, take some abstract interface for reading text, and outside the library the docs should provide examples of how to provide a file reader, a network reader, an in-memory reader, etc...
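
A hypothetical sketch of that design in Rust terms (names and the placeholder body are mine, just to show the shape): the library only knows about an abstract reader, and callers decide where the bytes come from.

    use std::io::Read;

    // The library never opens files itself; it accepts anything readable.
    // (Real parsing elided; this placeholder just slurps the text.)
    fn parse_xml<R: Read>(mut input: R) -> std::io::Result<String> {
        let mut text = String::new();
        input.read_to_string(&mut text)?;
        Ok(text)
    }

    fn main() -> std::io::Result<()> {
        // Outside the library, callers supply the source:
        let _ = parse_xml(std::fs::File::open("doc.xml")?)?; // a file
        let _ = parse_xml(&b"<root/>"[..])?;                 // an in-memory buffer
        Ok(())
    }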

But, I rarely see that. Instead (this is more the npm experience) such a library would include as many integrations as possible: parse XML from files, from the network, from fetch, from web sockets; there would be a command-line utility to parse XML to JSON, and some watch option that continually parses any new or updated XML file and spawns a task to do something with it, all integrated into this "XML Library". Parts of it will respond to environment variables, it will have prettifiers integrated with ANSI color codes, you'll be able to choose themes, and it will have a progress bar integrated into the command-line tool for large files.

And the worst part is the noobs all love it. They download this 746-dependency XML library and then ask for even more things to be integrated.

Maybe someday there will be a language with a package manager/community/guidelines that mostly somehow rejects bad bloated packages. It seems like a nearly impossible task though.

Note: I don't know Rust well, but I tried to fix a bug in the docs. The docs are built with Rust. The doc builder downloaded a ton of packages, which to me at least was not a good sign.


Personally, I don't see this particular problem as widespread in the Rust crate ecosystem. It's definitely one way in which dependencies can proliferate, but not a terribly common one in my experience. There's a lot of focus on inter-operation using common interfaces.


Very true.

Yet another possibility that I think has a place in this nuance conversation is adapting existing code with changes. It's interesting that we don't culturally today have an accepted way of doing copying + adaptation, but over most of the history of computing it has been very common. It has the obvious downsides of not easily getting improvements and bug fixes of later versions of the upstream code, and rightly is passed over in most cases, but in some cases it's still a win.


It's interesting that this line of argument perhaps leads to the conclusion "keep libraries small, to allow more granular selection of dependencies".

But this is the kind of reasoning that leads to the situation in npm of thousands of tiny libraries such as "leftpad" that many systems programmers are so derisive of.


I'm not sure that's the only choice. I think another choice is to help educate folks when a dependency could be removed in favor of a little code. Take regex or aho-corasick for example. aho-corasick is a lot of code and only has one tiny dependency itself (memchr). There's really not much left to break apart. But I wouldn't recommend the use of aho-corasick for every case in which you need to search for multiple strings in one pass. A trivial solution using multiple passes is quite serviceable in a number of cases.
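
For illustration, a sketch of the trivial multi-pass alternative (my own, under the assumption that the pattern count is small and the inputs aren't huge):

    // One pass over the haystack per needle; serviceable in many cases,
    // with no multi-pattern automaton dependency at all.
    fn contains_any(haystack: &str, needles: &[&str]) -> bool {
        needles.iter().any(|n| haystack.contains(n))
    }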


it would be great to have an https://alternativeto.net/ style recommendation engine integrated into crates.io (or npm) that can narrow down the minimal crate required for a specific use case.


It's an interesting idea, but would take a fair bit of creativity to execute well I think. It's really hard to enumerate this grey area! But I think something that elaborates on this grey area with lots of examples would be great. Sounds like a blog post I should write. :)


> It's an interesting idea, but would take a fair bit of creativity to execute well I think.

yeah for sure, it's a hard problem. but at least a high-level listing of similar libraries in the same category, with # of recursive deps and total bloat size clearly visible, would make it easy to shop around and sort by whatever attribute the developer wants to prioritize.

it would be similar to category-specific filters & comparison tables for features of e.g. SLR cameras or usb3 power supplies on amazon.


The problem is that walkdir isn't really an alternative to globwalk. It's only a good alternative in this specific simple case. globwalk even depends on walkdir. :)


i don't see a problem with listing both, as long as the high-level featureset is sufficiently enumerated in a comparison table for me to quickly discard or add the lib to my to-try list.


There's no problem in listing both. The point is that a high level feature list comparison wouldn't have necessarily helped here.

Having lists of alternative libraries to solve a particular problem is great. It is valuable on its own. But it doesn't really fix the nuance that I'm focusing on here in this example. And my general claim is that this sort of nuance is a fairly common problem that leads to unnecessary dependencies.


What about a less ambitious goal: fuzzy keyword search that displays the dependency graph and cumulative build time of each node in the package graph (activity stats on hover)? Users can assess for themselves whether a simpler library meets their needs, but this would make the relevant information more accessible.


Sure. Also sounds useful. crates.rs (an alternative UI for crates.io) has the beginnings of this by listing the binary size of the crate's dependencies. e.g., https://lib.rs/crates/globwalk vs https://lib.rs/crates/walkdir


I think that Clojure does something interesting here, but I can't quite put my finger on it. Apparently, it uses very small and abstract libraries and seems to lead to narrowed down dependencies.


It does (inevitably) really depend on the quality of the library. We really saw this with, say, SSL deprecation.

If you made good choices (e.g. maybe you depend on Python's Requests), you had to go out of your way to make problems for yourself: by default the software understands that it should prefer shiny modern cryptography, allow anything that's probably fine, and reject nasty broken stuff. What counts as "shiny modern" or "nasty broken" will vary over time, but it's not likely that you, as the application developer, were best placed to decide that when you wrote the code, so unless you specifically did so, the runtime decides.

In contrast Microsoft shipped not one but several distinct C++ APIs and .NET APIs where you'd need to go back and periodically modify old code (e.g. to specify newer TLS versions) or now it doesn't work because there was no provision for programmers to just let the runtime do the Right Thing.


For almost any real-world application, where people's data and businesses are at stake, developer convenience is a distant second to security. Of course it's great when we can have both, and making secure solutions the most convenient is key, but Rust doesn't quite get this right yet. If there's a security issue in one of those dependencies you are importing, how do you even know that you are impacted? What about when it's a dependency of a dependency? Making it easy for developers to suck in a huge TCB at build time has some serious downsides.


The ecosystem also includes stuff like cargo-audit[0], with which you can easily add dependency auditing to your CI pipelines or whatever. People are definitely thinking about this problem.

[0]: https://github.com/rustsec/cargo-audit
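
Typical usage (one-time install, then run against the project's lockfile):

    $ cargo install cargo-audit
    $ cargo audit    # checks Cargo.lock against the RustSec advisory database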


I very much agree with your point, and want to add that in certain contexts, the same goes for performance.

I do think that Rust making it easy doesn't necessarily mean that you have to use it. It's fine to write applications using only the stdlib, and only link to some known libs like openssl or libpq.

In Rust's defense, it is easy to list all dependencies, and they don't automagically upgrade from one checkout to another.
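
For instance, with a recent Cargo (`cargo tree` is built in as of 1.44), and with Cargo.lock pinning exact versions between checkouts:

    $ cargo tree                 # print the full dependency graph
    $ cargo tree --duplicates    # crates pulled in at multiple versions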


When security is a concern, developers can simply avoid using untrusted crates. But concerns over security are not a reason for a package manager to not even exist. Pretty much every major language and operating system has a package manager. While supply chain attacks are real, they are not unique to Rust in any way.


As someone who's spent a few decades working in C++ codebases I'm not sure how this is a Rust specific problem.

Last large-scale C++ codebase I worked on took 60+ minutes to compile when you touched the precompiled header.

In fact, I've found it to be common to have binary-only library dependencies in C++ which makes it harder to tell what those downstream dependencies do.


I think you just run into it faster in Rust: I tried making a simple REST HTTP server (like 60 LoC) using actix-web and it pulled in over 200 sub-dependencies and took something like 10 minutes to compile.

This toy program written using Boost.Beast or something similar would compile much faster. Of course I would have spent all day setting up the library instead of waiting for it to compile...


> this inevitably leads to deep dependency trees, slower compile times

I've never thought that "having more friction when adding dependencies" is a valid strategy for preventing overuse of third-party code. Unlike other "trust the developer" scenarios, dependency bloat is a minor issue at best and is very easy to detect and diagnose. IMO you should cut down your dependencies when they actually become a problem that's bigger than the one they're solving.

> and less knowledge about what's actually happening inside most rust code bases

I actually think the package ecosystem (along with macros) is crucial for the breadth and accessibility that Rust provides. Rust is a low-level language, but you have high-level libraries at your fingertips if that's the level where you need to be working. This means that instead of writing hot-paths in a low level language and application code in a high-level language (and dealing with the added complexity in terms of FFI, builds, tooling, developer knowledge, etc), it's very possible to just write the entire thing in Rust. Again in this case, I think "forcing devs to learn how everything works" is a weak argument for increasing the friction for adding dependencies. I'm pretty sure that everything on crates.io is required to include source, not just a binary, so it's trivial to dive in and learn how it works if you're so inclined.


I think it's important to say the silent part out loud: "just use crate x and you can look up how the crate works behind the scenes if you are curious".


The worst part is how easy it is to add derive and proc macros that kill compile times. I have a ~3k LOC project that blows up to over 45k LOC after `cargo expand` due to a dozen or so macros and the compile time is really starting to hurt iteration speed. Sadly, macros are by far one of my favorite Rust features.

I ended up paying for Clion just to get better debugger integration and a quick action for cargo expanding a file in the UI so I can copy paste the macro results into my codebase. I'm hoping that will improve incremental compilation times until I buy a Threadripper workstation.
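
(For anyone following along, the expansion step I mean is the third-party cargo-expand subcommand:)

    $ cargo install cargo-expand
    $ cargo expand some::module    # print the source after macro expansion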


Does the expanded code take longer to compile than the equivalent handwritten code? Is it just the expansion itself that takes a long time?

In the latter case, maybe cargo needs an intermediate macro-expansion cache (instead of just a crate-level build cache)


Often, you would write a lot less code if you were doing it by hand.


I adore CLion, it's brilliant for coding. At the moment I'm using it with Python and Git.


Compile times in Rust are really the only problem with this. Rust with LTO on is really good about removing unused code, so it doesn't hurt you at runtime.

Whenever you run anything there are millions of lines involved across the OS APIs, directly or indirectly, so in practice it doesn't matter what's happening behind the scenes. In languages with fast compile times like Go and Java, people add dependencies without even caring. And I think that's how it should be.

All our Java backends at work are like 200MB and it doesn't matter and nobody cares. It still compiles in less than ten seconds, and all that code barely affects startup time.


I'm not a rust/cpp guy but I have a feeling that people skilled enough to need rust or cpp can probably use public libs for rapid prototyping and then recreate a smaller version if need be.


This statement is a bit confusing: are you saying that there is such a thing as too much code reuse? That it should be necessarily a bit difficult to integrate a third party library in your codebase?


There's always a balance point. For something like JSON parsing, or an HTTP client, for sure I want to use a library. But when code reuse means pulling in a giant library to do one specific thing, or else introducing a runtime like tokio where it otherwise might not be needed, I would probably rather implement a small tailored solution specifically for my use-case.


Ideally, there should be no harm in pulling in a library for just 1% of its functionality. A combination of careful library design and tooling should take care of dependency bloat ("shaking the tree"). In practice of course, libraries don't always consider this aspect, and the tooling never seems quite up to the task.


Yeah I think it’s also a matter of API design. Like a crate which by necessity has to handle the “general case” to some degree will often have more API complexity than a solution which is tailored to one specific use-case. For instance, you might have to concern yourself with configuration parameters which have zero relevance to your use-case.


In one sense, the way that larger crates in Rust are fragmented does address some of that.

This does result in large dependency trees, but it does also mean that if you need that one specific functionality, you may not need to pull in the full library, but only that smaller section.


> All of these other things could be done for C++. But, importantly, they haven't.

I think programmers often under-emphasize the importance of culture, so I want to boost this line.

Lots of "debates" around programming languages are, I think, better understood as differences in beliefs about what a language should prioritize. C++ could add an allocation checker that provides stronger memory guarantees by outlawing certain operations (or requiring they be marked à la `unsafe`). They don't add it because, I think, they view other additions to the language as having a higher "return" on allowing people to create high quality programs.

It seems to me that a lot of this comes back to first-principles about what makes a good programming language. These ideas are, to some degree, set before a line of code is written. The rust folks thought memory and concurrency safety were very important when they were designing rust - even though they couldn't have formed those opinions by using the kind of language they were making (because nothing quite like it existed yet).

I also think about "infamous" qualities of languages (generics in go, the GIL in python). These ideas exist within wider cultural beliefs about what makes a good programming language. In a sense, it's inevitable that most of the people working on cpython aren't too concerned about the GIL, because they chose to work on a project that has developed around the GIL and believe its limitations are worth the gains from being able to rely on its coordinating function.

The implications of this are, to me, that when we advocate for languages to adopt features we'd like to see, we should remember the wider cultural context. It seems silly to expect the current golang team to implement generics "the way the community wants" because they've spent years building a fully featured language that functions without generics. Culture runs deep, and it's not an intellectual or professional flaw to value improvements differently than our peers.


>nothing quite like it existed yet.

Cyclone has been around since about 2002, according to Wikipedia. I don't think a similar language (RAII + annotations + a mostly-mandatory validation step in the compiler) had a well-funded and sustained marketing campaign until Rust, though. RAII also predates Rust by a couple decades.

Rust is also not very similar to itself from its early days.


Oh, I didn't know about Cyclone - that's cool! I wasn't trying to say that rust isn't influenced by languages that came before it.

I mean that, when they decided it was time to make a new language, the team would have had some understanding of what the language would be like. That understanding couldn't come from using the language itself (it doesn't exist yet). Instead, teams do cultural work to communicate & come to shared understandings about what the language they are making "should be like." That shared understanding survives past when the language is working and influences decisions to improve the language.

It's just worth considering and talking about project culture! It's common for there to be many possible improvements to a language and have the culture of the project (or wider community) be what tips the scales.


> 2) Rust's ecosystem makes it easy to find and integrate solutions not offered by the standard API

This made me think back to something Rich Hickey said in his "Are we there yet?" talk. In short, Rust handles memory management for you, and so can better function as a library language, similar to how Java did back in the 90's. Rich's quote below.

>> And it is a big problem. I think the lack of garbage collection really impeded C++ in one of its design objectives, which is: it was supposed to be a library language. All the original design stuff and any time you heard Stroustrup talk about it, it is like, "C++ is going to be a library language". But it only ever ended up being a parochial library language. Every shop had a library, but there were not a lot of libraries, and still are not a lot of libraries, that go between places, because of this problem.

https://github.com/matthiasn/talk-transcripts/blob/ccc4a0172...


C++ multi-threading is easy if you don't share data. Just have each thread produce a result and have a thread that accumulates the results via std::future.

I've converted many single-threaded C++ algorithms this way.


> Others are rightly pointing out that only #3 is fundamentally unique to Rust. All of these other things could be done for C++. But, importantly, they haven't.

Except they have, but some people aren't keeping up with C++ news, only bashing it.

1) can easily be done with a bounds-check-enabled STL, as modern C++ compilers do in debug builds, or by force-enabling those checks in release builds.

2) Conan and vcpkg, which are better than cargo because they support binary dependencies.


I tested most cpp pkg managers (including vcpkg and conan); they suck. They suck so much that I reached the conclusion that using git submodules was the way to go, and saw "just copy the header file" in readmes of modern projects. The pkg managers are hard to install and configure, don't integrate well with the tooling, don't support most stuff, have out-of-date documentation, and are just buggy.

Cpp API design also seems to suck. Doing a simple curl request is bloody complicated.


Yet millions of devs are adopting them every day; they must be real masochists, like myself every time I rebuild my Rust compile-time check from scratch: compiling everything from scratch takes 18 minutes on an average laptop, versus 5 minutes in C++ (thanks to binary libs).

Cargo keeps failing my litmus test for usability of compiled languages.

I just hope that something like Cranelift will fix it.


>because they support binary dependencies

Without the ability to tweak build flags. Which makes them useless. And since they are not really cross-platform either, you'd better stick with portage, which is still far behind cargo.


So useless that they get used daily by millions of C++ developers, and binary dependencies were one of the top requested features for vcpkg.

Sword fighting gets tiresome after a while, and not all businesses enjoy shipping source code anyway.


> If you introduced to C++ a Result type

I don't know too much about the details of rust, but: std::optional? Whether people will use it is of course up to them, but my feeling is that they will.


Not the same: std::optional is like Option in Rust; it's about existence, not success or failure.

I believe std::expected is the correct answer but it’s unclear to me if it actually made it into the standard yet or not. I thought it had but can only find proposals...
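
To make the existence-vs-failure distinction concrete, a small Rust sketch (hypothetical functions of my own):

    // Option models absence: "no user with that id" is not an error.
    fn find_user(_id: u32) -> Option<String> {
        None
    }

    // Result models failure: the caller learns *why* the operation failed.
    fn load_user(_id: u32) -> Result<String, std::io::Error> {
        Err(std::io::Error::new(std::io::ErrorKind::NotFound, "no such user"))
    }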


AFAIK it's not yet in any standard, possibly because the semantics are really complex (I don't know how Rust specifies them) - useful discussion here: https://www.reddit.com/r/cpp/comments/c75ipk/why_stdexpected...


Isn't Optional used in a lot of cases where empty means failure?


It might be, but that doesn’t make it semantically the correct thing.

As always, it depends. But traditionally, most languages that use this strategy have both, and they aren’t interchangeable. I can see this happening more when you only have optional, though.


That's not enough.

Hacking failure reporting into an existence API will make your code confusing, and will leave you no way to report multiple types of errors.


std::optional is like Rust's Option. But even then, without compile-time pattern-matching you leave the error to be checked at runtime, which defeats the purpose of this type. On top of that, it wouldn't be C++ if there wasn't some UB involved.


operator*() and operator->() on an empty std::optional is Undefined Behavior. This is mind-boggling. It defeats the whole purpose of using a wrapper type to enforce bulletproof error checking. It's the worst of both worlds: you have to use a cumbersome type, and it still fails like a nullable pointer.
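
For contrast, a quick sketch of the Rust side, where the same misuse is a defined panic rather than UB:

    let empty: Option<i32> = None;
    // empty.unwrap();  // would abort with a clear panic message, never UB
    match empty {
        Some(v) => println!("{}", v),
        None => println!("nothing there"), // the compiler makes you write this arm
    }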


yes, it is a somewhat baffling feature. the only practical use I can imagine for it is to avoid calling the copy constructor with an invalid instance of the type and/or to avoid a heap allocation. otherwise, it would be much more natural (and no less safe) for a c++ programmer to just use some flavor of pointer. in the case of pair<bool,T>, one would expect an optimizing compiler to skip the constructor call for an invalid instance anyway if the object was only accessed after checking the bool (unless the constructor has side-effects, yikes!).


I think most of this is just because C++ doesn't have very good pattern matching and didn't want to introduce new operators. (FWIW, using operator*() and operator->() feel very natural, although they are unsafe without using operator bool() as you have mentioned.)


from a safety perspective, std::optional is not much different than std::unique_ptr. both will happily allow you to invoke UB if you don't explicitly check for validity.


There are proposals for std::outcome and std::expected. I think the internal workings for one of them have also been passed to the C committee for standardisation.


> If you introduced to C++ a Result type, or a package manager (I assume people have done this already), most C++ developers wouldn't end up using them. Most libraries wouldn't be using the same vocabulary. It would be arduous to even move the standard library over, because it would be a breaking change. Much of the tooling out there would probably never get specialized support.

CMake kinda makes me think there's definite interest in this kind of thing in the C++ community. It's janky and horrible, but it has tackled the module/package/library issue a little bit (not in an elegant sense, but it's a little more standardized).


There’s no real thing preventing #1 from being done in C++, but it is incredibly hard to remove bad patterns from a language that allows them. This is both problematic because these languages don’t provide first class support to remove unwanted features, and because huge amounts of existing code will use patterns you’re hoping to excise. In order to successfully do this you have to do a huge amount of both linting and dependency checking, and the resulting requirement that you not use any existing libraries makes using Result in C++ about as hard as doing a progressive rewrite into Rust.


> or a package manager (I assume people have done this already),

Conan, the C/C++ Package Manager [1]

[1] https://conan.io/


> but to the simple fact that it's "C++ without baggage".

That's how I've always viewed it. Or rather a better replacement for C++.


I don't understand this article.

I think there is no language-level feature in Rust to prevent TTCTTOU. TTCTTOU can happen in both C++ and Rust.

for example:

    let f = File::open("username.txt");
    /* what if the file is deleted here? */
    let mut f = match f {
        Ok(file) => /* what if the file is deleted here? */ file,
        Err(e) => return Err(e),
    };
    /* what if the file is deleted here? */

I mean even if you write it like: let f = File::open("username.txt")?;

it looks atomic, but it is not.

The C++ example, on the other hand, can do such filtering too, using:

http://www.cplusplus.com/reference/fstream/ifstream/is_open/

I mean Rust forces you to do error handling, which is nice. But that's not enough for preventing TTCTTOU. I don't think the first example is a strong argument.


File::open("username.txt")? looks fine to me, given the correct mental model of file systems—it returns an error if opening the file fails. Reading is then a different operation that can yield its own set of errors. Therefore each read is fallible; hence reader.lines().filter_map(Result::ok), which says “read it line by line, and just ignore any errors”. If the file is deleted before you’re done reading it, further reads may fail (depends on the file system’s behaviour), and be properly handled as specified.

If you want atomic open-and-read-the-whole-file, you can reach for something like std::fs::read_to_string (https://doc.rust-lang.org/stable/std/fs/fn.read_to_string.ht...).
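
i.e., roughly:

    // One fallible call that opens and reads the whole file in a single step.
    let contents = std::fs::read_to_string("username.txt")?;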


> /* what if the file is deleted here? */

Nothing. In all major operating systems I'm aware of, opening a file creates a handle to the file contents. This handle is distinct from the handle implied by the existence of the path in the filesystem. Some filesystems allow you to create another handle to the same file by a different name (called a hardlink). Removing a file only deletes the handle implied by the path; if another handle to it exists (via a hardlink or the file currently being open in a process) the data itself is unaltered.

In general, TOCTTOU attacks are caused by introspection being done on paths instead of file handles. Once you have handles, and you are querying exclusively using handles, your TOCTTOU issues largely go away. There is still a related issue that filesystem atomicity guarantees (or rather, a general lack thereof) is still a major headache, but it's not the same issue as TOCTTOU attacks.


On POSIX filesystems, deleting a file removes the name from the directory list, but the actual file is still available to open file descriptors.

The point is, if I go by names, there is no guarantee of identity. File descriptors ensure that.

The same could be done with C++, but the libraries are not designed that way.

Still, I don't find it terribly convincing.


OP says you can't really avoid it in Rust either, but argues that because there's not a separate check it's more robust. I agree w/ you though that it's not a strong argument, and that you can do the same thing in C++; it's a confusing/arguably incorrect point IMO.

More broadly, this is kind of a bad example. Handling files is surprisingly difficult and that's really where the dragons lie here. IME the dragons exist for both Rust and C++, because they're platform problems rather than stdlib/language problems.

NB: the correct thing to do here for POSIX platforms is to call fstat, which is true regardless of systems language.


I found the filesystems complaint about C++ bizarre too, because that is a system-dependent behaviour. How an open file behaves when it is deleted has very little to do with the programming language. And it does not matter if that language is Python, Rust or C++. These ecosystems interact with the underlying OS and report back what the OS layer tells them.

I absolutely understand some of the niceness about Rust preventing programmers from shooting themselves in the foot (it is still possible nonetheless), but sometimes it feels people just want to complain about C++ for the sake of complaining.


I am not sure what TTCTTOU is, but what you are saying needs transaction-like semantics in filesystems. So it's more a property of the FS than of the language.

All the common filesystems on Linux (ext*, XFS) do not have these.


“time of check to time of use”

As you intuited, a problem that arises in environments that lack transactions/locks/etc.


> All the common filesystems on Linux (ext*, XFS) do not have these

You know that files are nameless on Unix, right?


> do not have these

I was referring to the transaction semantics


> let f = File::open("username.txt");

> /* what if the file is deleted here? */

you don't need transaction semantics here.

On Unix a file is basically an inode with a reference counter. If it has no record (aka name) in a directory but a non-zero counter, it still exists.

So on open(2) the counter is bumped up and is equal to 2 (or higher if there are hardlinks): one for the directory record and one for the program that opened it.

If you remove the record from the directory (aka `rm filename`), the counter decreases by 1, but it is still non-zero; the program still has the file open, can read its content, etc. Only after close(2) does the counter drop to 0, and the file is gone.

Just tested it on Ubuntu 20.04 with a simple C program.
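
The same experiment, sketched in Rust (Unix semantics assumed; the file name is made up):

    use std::fs::{self, File};
    use std::io::Read;

    fn main() -> std::io::Result<()> {
        fs::write("demo.txt", "still here\n")?;
        let mut f = File::open("demo.txt")?;
        fs::remove_file("demo.txt")?;       // unlinks the *name* only
        let mut contents = String::new();
        f.read_to_string(&mut contents)?;   // the open descriptor still works
        assert_eq!(contents, "still here\n");
        Ok(())
    }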


What do you imagine is a more thorough handling of the problem than forced error checking? It's not a problem that can be prevented. It's caused by an external process.


That is not a particularly good example of a TOCTTOU (2-3 Ts, not 4), but the Rust standard library does fall a bit short for filesystem operations in that regard. E.g. it doesn't expose the *at syscall family via some sort of directory handle. Doubly so if you want to perform the atomic-write dance securely inside a specific directory without being subject to symlink substitution.


In equivalent C++ on Windows, f is a file handle without delete share permissions, so the act of successfully opening the file locks out delete operations until the file handle is closed, preventing TTCTTOU issues. I am assuming Rust would be backed by the same mechanism.


On Windows Rust calls out to `CreateFileW`[0], and by default uses the `FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE` share permission[1].

[0] https://github.com/rust-lang/rust/blob/master/library/std/sr...

[1] https://doc.rust-lang.org/std/os/windows/fs/trait.OpenOption...


That's a design difference from C++ then; however, even if the file is deleted, on Windows the deletion doesn't take effect for existing open handles until the file is closed.


> what if the file is deleted here?

You know that files are nameless on Unix, right?


C++'s more recent changes, in the past decade, have made it easier to write. However, adding more to C++ is never going to make it simpler. C++ still carries all of the baggage of C and all of the baggage of the older C++. None of these can be removed without breaking millions of programs, but all of these things can still bite C++ devs and will still exist in older C++ code.

For example, undefined behavior by default, value categories (rvalues, lvalues, etc), exceptions in destructors, SFINAE, and so much more.

Even if C++ were to end up making its syntax closer and closer to Python, which it seems to be doing, beneath the sugar is still a haggard mound of a ridiculously complex and unsafe language. The only way to change that is to start fresh, as Rust did. Rust is complex, indeed, but it removes a whole lot of the safety concerns and legacy crud.

Presumably, 20 years from now something else will be created which improves upon Rust in such profound ways and we can all talk about how complex Rust is and how its idioms aren't as <something> as they could be. Maybe we'll be using machine learning to write and compile code for us and Rust will be shat on for being so complex and manual. But, for now, even with its complexity, it should be a breath of fresh air for anyone working on non-trivial, multi-threaded C++ applications.

There will be those who don't dig it, just as there were those who have stayed with C all this time. But how else would we keep the vulnerability finders busy? Everyone needs to eat.


I think it would be totally possible for C++ to adopt an "epochs" model where you mark your units of compilation to support specific language features, and that would make life tremendously easy. You could even mark epochs where legacy features get marked as deprecation warnings, etc.


From an ecosystem p.o.v. Rust has a big advantage. The strictness is built into the compiler, and the entire ecosystem is subjected to the same high standards. One has to fix the errors.

The same strictness in C++ can arguably be availed by using static linters. This has 3 problems:

1. Static linter errors are inherently optional to fix. One has to have some CI rules to block check-ins if the linter flags something

2. Your project might follow best practices but not your dependencies.

3. Static linters are not foolproof (last I checked, 4-5 years back)


As a Rust user, I'm afraid I disagree with these following points:

> The strictness is built into the compiler, and the entire ecosystem is subjected to the same high standards.

> 2. Your project might follow best practices but not your dependencies.

The crates ecosystem is the weakest part of Rust and it is not always subjected to the 'same' high standards, despite the 'strictness' of the compiler.

A simple command-line app or library (even worse) can depend on hundreds of other third-party libraries, and cargo reminds one of the same ills as npm to some extent, as developers throw unnecessary libraries into your project. Point (2) can still be said about Rust, given that the developer can still abuse unsafe{} or write dangerous .unwrap()s. Just look at the immense scrutiny that actix-web was under because of this.


Just the fact that a large amount of scrutiny is placed on crates using unsafe in the ecosystem is a huge step up from C/C++. Compare to how critical openssl was and how few people were actually looking at the code before heartbleed.


It's not a big step up when the software was written over a weekend by a single person who did it to learn Rust and doesn't understand the problem domain. Software doesn't need to use unsafe blocks to have dangerous bugs and vulnerabilities.

I would take openssl over most (though not all, for sure) Rust cryptographic code. OpenSSL's bugs have not been RCEs for the most part, and many of them could happily exist in Rust. The cult-like perspective that language choice can make software automatically safe in the absence of things like domain expertise or long-term production usage increases risk even where the language itself theoretically reduces it. (See also: https://en.wikipedia.org/wiki/Risk_compensation)

Especially with the deeply nested dependencies, the Rust ecosystem seems to me like a ticking time bomb for security disasters. The same bad practices exist in NPM and have already produced a number of high-profile incidents; as this repeats in Rust, there really won't be any excuse but to admit that it resulted from an excess of hubris.


Rust has a higher floor set by the compiler, which still might not be up to your standards. That doesn’t mean that either C++ or Rust can’t have dependencies all over the quality spectrum, but it does mean that there are certain problems that Rust crates can’t have.


And an npm style of libraries, which does very little to improve quality versus C++.


Specifically how does it make it worse than any other library package manager?


Having a thin standard library makes reaching out to third-party libraries a must-have in any project, many of which are far from reaching 1.0 and not necessarily available on all deployment targets of the compiler.

There is stuff in the ISO C++ standard library that requires third-party libraries in Rust; yet another thing to certify in security-sensitive domains.


Why is this downvoted?

Being politically incorrect to web developers?


Alternative title: mediocre C++ is arguably worse than mediocre rust.

I’d agree with that? I think the power and danger of C++ is that it assumes the programmer knows best. Rust assumes the programmer is wrong if it looks dangerous. So.. yes?


Two points should be considered.

First, trivial: what exactly is mediocre in the Rust example? I don't think it's mediocre at all.

Now the second, which is significant: what constitutes a "mediocre" C++ program? The author is the engineer who built the Meson build system - his code represents the output of an experienced engineer.

This brings us to the usual C/C++ criticism: it makes it easy to make a certain class of mistakes, even for experienced programmers.


> First, trivial: what exactly is mediocre in the Rust example? I don't think it's mediocre at all.

A better way to phrase it would be "code produced by a middle-level engineer". It just so happens that in Rust a regular, adequate, middle-level engineer can easily write the best possible code because the language guides him to it, but in C++, a regular engineer still wouldn't know about some gotchas that would make his code inferior.

It's unknown unknowns: Rust shows most of them to you, while C++ stays silent.


Yes, there’s nothing wrong with C++ with very good programmers who know the language well and what they’re doing.

Unfortunately, not every programmer is one, or knows their limitations.


> very good programmers who know the language well and what they’re doing

And never make a mistake.

Honestly, it's probably just because I'm not good enough, but I prefer to use my focus on improving the functionality of my code instead of ensuring any change I make is correct.

I've successfully maintained some 100k lines of C++ code in the past. I would rather avoid that in the future.


Such programmers are mythical for practical purposes. Look, e.g., at the constant stream of memory safety vulnerabilities in browsers and operating systems.


There are too many changes and refactorings for C++.

Additionally, if the project is owned or influenced by a corporation, the mythical programmers have little influence:

Some authoritarian ex-mediocre-programmer-turned-middle-manager calls the shots and is more anxious to "grow the team" than to ship robust software.


Nothing saves you in that situation. For example, some authoritarian ex-mediocre-programmer-turned-middle-manager who cares more about growing the team than shipping robust software could be the one choosing what language to write it in.


I'd argue that even very good C++ programmers make mistakes that can be harmful, and that those mistakes will be hard to catch by the average programmer who will deploy the new code. That's, in my opinion, why on a small-to-medium project where objects are not needed, it is better to use C than C++.


> even very good C++ programmers make mistakes that can be harmfull

Are you seriously saying that this isn't also true for C (and indeed for all programming languages)?

A challenge - try writing a program in C to read a text file containing lines of unspecified length, sort them, and output them in sorted order. Compare the C++ and C code and tell me which you think is the cleaner and safer.


Yes, very good programmers can make mistakes.

To meet your challenge, think about the problem. How about this?

  file size + 1 => s
  allocate char buffer b[s]
  read file to b (all data must be in memory, or merge sort. if on linux and allocation worked, read may fail! oom will get us, or we bail on read failure)
  add final newline if missing
  count newlines => n
  allocate line pointers p[n] (if line pointers will not fit, we bail -- we could use offsets and fancy virtualization, if this is not a Z80)
  fill pointer array p
  sort => sorted[n] (if sorting fails, bail)
  output sorted lines
Same for C and C++. UTF-8 is ok.

Do you want more data than will fit into memory? Even then, C++ doesn't bring much to the table. Separating the sort algorithm? Yes C++ BEGINs to pull ahead by a bit.

safe? yes. easy to write? yes. no fancy bits? yes. Transcribe to C or C++ as you will. Java? won't be as pretty. Rust, Go? don't know. Javascript? Not even close.

Since this is a "beginning programmer" problem, give a design that works for another language that makes sense to a beginning programmer.


Safe? If you never use any C functions that expect strings to end in a null character, maybe. (In your approach, the strings end in newlines, including, critically, the last string in the buffer, so the buffer is not null-terminated either.) So, your sort needs to not use strcmp, and your output needs to not use printf (or even puts).

I would argue that this is setting up for catastrophic failure. If not when writing, sometime later when maintenance happens.


That is not C or C++. And you cannot "transcribe" pseudo code like this into a real programming language.


But also, the Rust compiler teaches programmers to know their limitations, so the Rust programmer will be forced to become a good programmer more easily than the C++ programmer.


As always new languages have a huge benefit of having new code bases. If you take a C++ job now, 95% of the time it will be a code base that is at least 10, and likely 20-30 years old. If you start a Rust job it will be modern.


Very unfamiliar with Rust, very familiar with C++. As a reader of this Rust program, how can I figure out what's happening? I don't see anything in the GlobWalker documentation that even slightly implies that the filter_map calls that follow are appropriate. Also, why the first filter_map(Result::ok)? Does it imply the iterator emits Results? Hard to follow, to be honest.


So, this[0] is the doc page for the `GlobWalker` type. One of the traits it implements, and the only one relevant here, is the `Iterator` trait from the standard library[1]. That's how you'd know that the `filter_map` function is appropriate.

The implementation there also tells you that the `Item` type of the iterator is a `Result<DirEntry, WalkError>`, presumably because any of the file system reads can fail. The `WalkError` type is from the globwalk library.

Going back to `filter_map`[2] that function takes a function or closure that maps from `Item` to the `Option` enum, and then just discards anything that is the `None` variant of the enum, filtering them out. `Result::ok` is a function on `Result` that converts the `Result` to an `Option`, converting `Err` to a `None`. So this is discarding the errors without reporting them.

As someone who is familiar with Rust, this wasn't particularly hard to follow, but that comes down to familiarity with these adaptor functions and how they can be plugged into each other.

[0] https://docs.rs/globwalk/0.8.0/globwalk/struct.GlobWalker.ht... [1] https://doc.rust-lang.org/stable/std/iter/trait.Iterator.htm... [2] https://doc.rust-lang.org/stable/std/iter/trait.Iterator.htm...
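
As a tiny illustration of that adaptor chain (my own toy values, not from the post):

    // An iterator of Results, with the Err items silently discarded.
    let items: Vec<Result<i32, String>> = vec![Ok(1), Err("oops".into()), Ok(3)];
    let kept: Vec<i32> = items.into_iter().filter_map(Result::ok).collect();
    assert_eq!(kept, vec![1, 3]);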


Aha, I didn't notice the little [+] widget next to the Iterator trait in the docs. You have to click it to get the Item type, otherwise it's hidden. Thanks!


You're welcome. The implementation in the OP uses a more functional style, but that's not the only way Rust is commonly written. I also did an implementation[0] (because I have nothing better to do) using more of an imperative style, which you may be more familiar with. Admittedly I did go a bit overkill on string normalization and avoiding allocations while doing so.

It also includes a test case which the OP's implementation will get wrong because the order of the combining diacritics is different in both words. If you don't normalize them you'll get two different words.

[0] https://gist.github.com/Measter/e2e287ee21311d34ea8eb8cd9d57...


IMO, globwalk is inappropriate to use here. It's bringing in a ton of unnecessary dependencies for such a simple task. Just using walkdir directly (which globwalk depends on) and filtering on the file extension would be nearly as simple and use oodles less code.

This is IMO a good example of poor choices that an "npm-like ecosystem" can encourage. Obviously there are deep and fundamental trade offs here, but other than watching crates compile, you're almost completely removed from just how much code you're bringing in relative to what is really necessary.


Thanks - I appreciate this feedback. I've updated the post with a modified version at the end that directly uses walkdir (and I agree with you).

To the GP, the place it shows up in the docs is mostly in its example: if let Ok(img) = img <-- this is a sign they're unpacking a Result type.


Awesome! And globwalk is a great crate, when you really do need high performance glob walking. I think there is room to improve it so that it doesn't use so much code, but it's a bit of work to do that. (I see this as largely my failure, as the 'ignore' crate is primarily to blame here.)


The author talks a big game about safety and security, but fails to mention the ENORMOUS attack surface brought by the fact that the Rust variant uses third party code, written and hosted by <An Ostensibly Nice Person Who Never Gets Hacked>.


Security

First of all, the proliferation of header-only libraries that people vendor into their Cpp projects has the same attack surface. Additionally, there is less tooling to help you track vulnerabilities in Cpp dependencies and upgrade when fixes come out. If you do rely on external cpp deps, then you must have a code review process before those deps are whitelisted for use.

Now, you can extend this code review process to rust crates.

Run a security review of a crate, after which that crate is pushed to an internal registry. Every external crate upgrade can go through the same security review process; until then, developers build with the old version present in the internal registry.

In the context of a large enough organisation, an increasing number of crates will become internal, thus solving the trust/responsibility issues.

By no means do I suggest that crates.io is unhackable. That's why I want to raise awareness of already-existing infrastructure and procedures to vet code before including it in your systems.

Safety

Before we get to ownership and some thread-safety, how do you explain the fact that integer addition and number conversion are safe operations in Rust, which throw/panic when they fail instead of the default Cpp behaviour, which _silently_ corrupts data?


The C++ article did not use any header-only libraries, though, it just used the standard library, unlike the post about Rust. And Rust's behavior on overflow is "safe" for a certain definition of safe–in release builds by default it will wrap, which is often undesirable.


I assumed "enormous attack surface" was referring to real systems, not toy examples. AFAIK, even C++ needs more than the std library to solve business problems in the domains I am familiar with. There are many ways to manage dependencies including vendoring in header-only libraries, where my point about the lack of tooling and clunky UX still stands.

I just want to use tools that help me get stuff done. C++ has had a 30-year head start; why can they not see the overall value-add of a default build-and-test tool and a format for package declaration and management?

All integer types have "checked_" arithmetic operations, which return None in case of overflow. https://doc.rust-lang.org/std/?search=checked_
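
For example:

    let x: u8 = 200;
    assert_eq!(x.checked_add(100), None);     // 300 overflows a u8
    assert_eq!(x.checked_add(50), Some(250)); // in range, so Some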


I think most C++ developers do see the value of something like Cargo…but it must also be considered that making it easy to bring in dependencies does not also mean the standard library can be poor and you can farm everything out to third parties.

C++ does not have a standard checked arithmetic operation, so I'll concede with you there. It really should, although most people just use the fairly widespread compiler builtins that behave similarly. (That being said, having it not be the default means that people who really need it won't use it, which is the same situation that Rust is in.)


It seems like this is more a critique of the design of the standard library than the language itself. You could create a library that behaved similar to the Rust version, offering the same compile-time safety checks, in C++. I'm sure it would also be equally trivial to parallelize the C++ version with Boost or some other library (except that you wouldn't get the nice error about the unlocked hash map, so I have to hand it to Rust on that point).

I think the author's problem is that C++ even allows you to do it the wrong way in the first place, but I'm sure that you could design an unsafe FS API with Rust as well (though it might be a little harder).


Do you truly feel that the standard library is not part of cpp?

In general, I think a lot of valid criticism of cpp is dismissed with "But you could do it correctly with X".

Perhaps, but if everything is a special case, what does that say about cpp?


The article does seem to nitpick a C++ design decision and then highlight the same feature in Rust. The way getline failed there is not "accidental" but the design of fstreams/getline; Rust is designed so that "less-careful" programmers don't hit those corners.

Thing is, all of the solutions to the issue here would be a real PITA if done as one function, as both examples are. That is why example code often skips checks: to highlight the point of the article, not hide it. This is even more true with filesystem-related code, as filesystems are not reliable. One example of how many corners filesystems have is the unit tests for SQLite; they are amazing.


Sure, you're right. In practice nobody actually does what you're saying though.


Another, perhaps more important difference: the Rust version converts to lowercase in a Unicode-safe way, whereas the C++ version really can't work for multibyte character encodings.


There is no such thing as converting to lowercase in a Unicode-safe way. Or rather, there are many ways.


This article suggests it refutes the linked article. That was more about the developer experience than about safety. That rust is better for safety is not news.


Title needs to be fixed:

C++ still isn't cutting it for a Rust fanboy


Just curious: can we try a C version?


C does not support strings, regex and maps natively. You have external libraries that can do it, but from my experience, they are painful to use. String manipulation in general is painful in C and probably not the best language for this demo.

However, if you really want high performance, you can go ahead and write it in C with custom matching code and a custom data structure for keeping your word count. This will most likely result in hundreds of lines of code, and enough performance to max out a PCIe SSD. However, that would be a different exercise.


First, if Rust's solution gets to pull in crates, then a C program can use regexps, with pcre or cre2.

Second, it's not clear to me at all why this problem asks for regexps. Both examples used a regexp to check a file extension, and then to find word boundaries. Both are trivial† to code directly.

Am I misunderstanding the problem here? It seems like there's hardly any string manipulation in it at all.

† admittedly, I didn't bother being careful about word boundaries


It shouldn’t require them. I thought I saw a comment by the author explaining why they chose to use them, but there’s been like six different threads on three different forums and I can’t find it right now...

I know some C++ folks were annoyed it’s using regex due to criticisms of the standard regex libraries that can’t be fixed, or something, so it’s also not like folks agree that the original solution was optimal. I don’t think it was trying to be, so seems fine, but it is what it is.


The idea here is that it is a toy problem. The author is not trying to make the best word counter, he is comparing two languages by implementing the same algorithm on both and giving us his conclusions.

Using a different technique, for example by not using regexps, is a bit like cheating in that context. You are not comparing two languages; you are comparing two different solutions to the problem.

That's what I meant in my previous post. Either you do it as specified in the article, with regex and maps, which requires pulling in libraries and working in a way that is much less convenient than the C++ and Rust examples, for no good reason; or you reimplement it using different techniques, and then it is not the point of the article.

Or to put it simply, the example in the article is not good for C.


The program is going to look substantially the same if I use pcre to check for ".txt" and split up words. And, if you read the original C++ article, it is not at all about comparing regex interfaces!

So, no, I don't think you're right about this.

More to the point, though: I'm talking about regex libraries because the parent comment is. My point: the Rust example uses 3p dependencies, so what C does "out of the box" is already out the window.


I assumed the original example was a bit of an overkill/lazy coding just to demonstrate that regexes and other conveniences are readily available, so the language would work nicely not just for this trivial example, but for bigger things you may want to write too.

I'm fed up with C dependencies, which, like Makefiles, always seem to be very easy in principle and then kill by a thousand cuts (like a regex library that defines a symbol that happens to conflict with a POSIX regex function, which I didn't even use, but it corrupted the memory of a completely different dependency elsewhere).


You definitely can't manage a C program with dependencies the way you would a Rust program with Cargo. You'd ordinarily bring in a couple major deps --- OpenSSL, zlib, pcre, &c --- and then maybe vendor in the small stuff. Most large C projects don't dep in things like logging or error handling the way Rust programs all do.

This particular program needs no third party dependencies at all; in fact, I bet it'd get longer if I added them (like a glib hash table, or pcre).


Thanks. How about Perl? I hear it's good at string manipulation, but I've never used it.


Sure.

https://gist.github.com/tqbf/4de61a3e34d2e4664044666c107abe7...

(I don't vouch for this code; I wrote it off the top of my head, and, in keeping with the exercise, I wrote it in pico).

It's not 10x longer. I'm honestly not sure why either Rust or C++ bothered with regexps for this problem.

Anyways, you get the gist of what this looks like in C now. Obviously, don't write things like this in C.


You have/had a bug in your code. The realloc() needs to multiply its second argument csz by sizeof *counts or sizeof(struct wordcount).


Of course I have a bug in this code. Thanks, fixed! :)

With that fixed, this dumb program does my whole (very large) homedir in about 10 seconds, for some definition of "does" that may or may not include counting every word in every txt file. For the curious:

   0. get / 566006
   1. the / 158168
   2. pkg / 119419
   3. syscall / 105828
   4. const / 98259
   5. and / 86670
   6. ideal-int / 43078
   7. that / 31609
   8. for / 31129
   9. this / 24980
   10. text / 22907
:P


Just trying to help, not criticize. As to correctness, if you change the tokenization to:

  while (fgets(buf, 1024, fp)) {
    char *c, word[1024], *w = word;
    for (c = buf; *c; c++) {
      *c = tolower(*c);
      if (*c >= 'a' && *c <= 'z')
        *w++ = *c;
      else {
        *w = '\0';
        count(word, (size_t)(w - word));
        w = word;
      }
    }
    *w = '\0';
    count(word, (size_t)(w - word));
  }
and have that `count` function skip words with len < 2, then I can match results with the other 3 (Nim, Rust, C++) versions. Also, your C version runs in just 7 ms on Tale Of Two Cities, about half the time of the PGO-optimized Nim.


Oh, no, you don't sound critical at all. I'm making fun of myself. Thank you for actually reading it!


Cool. You could probably simultaneously speed it up and simplify it by changing from linked-list collision handling to open addressing with linear probing (but you may need a larger TSZ). Besides being more cache-friendly for larger files, at the end you can then just qsort your table instead of first building a flat copy. :-) You could also use the unreduced hash value as a comparison prefix (only calling strcmp on different values). But it's not bad for potato code. ;-)


I didn't do open addressing because I didn't want to write code to resize the whole table. :)

I thought about also keeping the top 10 as I go instead of copying the whole table. But I'm guessing that virtually all the time this program spends is in I/O.


I kind of suspected that might be the reason... but you do still have a scalability problem for very large vocabularies. :-)

I just did a profile and saw about 15% in strcmp in the hot-cache run, but sure, if it's not in RAM then I/O is likely the bottleneck.


(You could also just save the len and use that as a comparison prefix, but the hash value is less likely to yield a false positive.)


Isn't this code kind of supporting the first point in the article? I mean, the article started by pointing out how Rust "protected" you from a "TOCTTOU" error.

This C code is not only not protecting you against that, but it also seems to have little to no error checking. What happens if fopen fails? It just skips to the next file and happily ignores all the entries in the file. What if fgets fails? Again, it stops processing the file and goes to the next.

I won't even get into what happens if calloc fails and returns null. Or if ++ wraps around and you get a negative value (let's hope we are using 64 bits).

This program will likely work, but if it doesn't you will just get an invalid number and never find out.

Don't get me wrong, I understand this is "not serious" code, written on a Sunday and for fun. And I would be OK with it if the article wasn't about language safety.


Try as hard as you like, you aren't going to find anyone to argue that C is as safe as Rust.

As for the TOCTTOU issue --- which is kind of silly, and had really nothing to do with Rust --- the diff to fix that problem is tiny:

https://gist.github.com/tqbf/4de61a3e34d2e4664044666c107abe7...

On a sane system, what's going to happen if calloc (or realloc, or strdup) returns NULL is the same as what's going to happen in the Rust program: the program will abort. Checking malloc returns is a bad idea, and causes more problems than it solves.

Similarly: this code does essentially the same thing with errors that the Rust one does: it swallows them and continues.

At any rate, while this C++ vs Rust discussion might be about language safety (I vote Rust, like any other software security person would), this subthread isn't.


So, actually, this entire arc of TOCTTOU is a little bit off track, or at least incomplete. If someone replaces an ordinary file with a FIFO/named pipe "foo.txt" during the traversal (or simply if your recursive file tree walk does not filter out FIFOs), then any program will block during the open(2) of the FIFO. Things will never advance to the fstat() to check its file type.

I just checked and whatever default behavior dga's Rust code uses (both for traversal and for the open call) does indeed just block forever if there is a FIFO present with a name ending in .txt.

There is a way out: adding O_NONBLOCK to the OS flags for the open call. { You could also stat() just before open(2) instead of fstat() just after, but then you still have a race condition, just with a much smaller time window. }
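
In Rust, a minimal sketch of that fix might look like this (assuming the libc crate for the flag constant; not dga's actual code):

  use std::fs::{File, OpenOptions};
  use std::io;
  use std::os::unix::fs::OpenOptionsExt;
  use std::path::Path;

  // Open with O_NONBLOCK so a FIFO can't hang us, then fstat() the handle.
  fn open_if_regular(path: &Path) -> io::Result<Option<File>> {
      let file = OpenOptions::new()
          .read(true)
          .custom_flags(libc::O_NONBLOCK) // open(2) returns immediately
          .open(path)?;
      if file.metadata()?.is_file() {     // fstat(2): checks the open handle
          Ok(Some(file))
      } else {
          Ok(None)                        // FIFO, device, etc.: skip it
      }
  }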


I love this thread so much.


Well, then you may also appreciate changing my Nim program from the `continue` to the final output `for` to:

    let mf = mopen(path)
    if mf == nil:
      stderr.write "wc: \"",path,"\" missing/irregular\n"
    else:
      var word = ""
      for c in toOpenArray(cast[ptr UncheckedArray[char]](mf.mem), 0, mf.len-1):
        let c = c.toLowerASCII
        if c in {'a'..'z'}:
          word.add c
        else:
          count.maybeCount word
          if word.len > 0: word.setLen 0
      count.maybeCount word # file ends in a word
      mf.close
You also need to add ",cligen/mfile" to the import line and install a version control head of the cligen package.

With that the Nim time goes down to 6.1 ms (no PGO) and 5.0 ms (PGO). Just a quick after dinner fix-up.

{ There is actually a trick related to ASCII where we could add `, 'A'..'Z'` to the character set and then `or` in a bit to make the letter lowercase (because 'a' - 'A' is 32), but I just did the above to test that cligen/mfile worked with FIFOs in the mix, not to maximize performance. I suspect pre-sizing the table by adding e.g., 10_000 or 20_000 to the init would make more difference.}


Beyond the above couple ideas, the next steps to make it faster would be to get rid of almost all dynamic memory allocation by having a table of byte-offset,len,count triples (or maybe add in a hashcode for a quadruple) and rolling your own "memcasecmp" if your stdlib does not provide one. This could probably make the histograms so small that they are likely fully L2 resident..roughly (8..12) times vocabSize bytes is 192 KiB for a 16384 vocabSize. https://github.com/c-blake/adix/blob/master/tests/anaPrime.n... has an example of how to do this in Nim (for values in a table, though in our case here it would be for keys and so need `hash` and `==`).

At that point if you had a lot of data, the better way to parallelize it is also not either of the two ways DGA mentions in his original blogpost with one shared table, but rather one histogram per CPU with no locks or memory contention at all and a final merge of the CPU-private histos (after processing all files) that can probably be single-threaded, though you could also organize it as a "tree" of parallel merges (e.g. first merge every pair, then every pair of the results and so on).
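
The shape of that lock-free approach, sketched in Rust (hypothetical helper names; this is the single-threaded-merge variant):

  use std::collections::HashMap;
  use std::thread;

  // One private histogram per worker, merged once at the end:
  // no shared table, no locks, no memory contention.
  fn count_parallel(chunks: Vec<Vec<String>>) -> HashMap<String, u64> {
      let handles: Vec<_> = chunks
          .into_iter()
          .map(|chunk| {
              thread::spawn(move || {
                  let mut histo = HashMap::new();
                  for word in chunk {
                      *histo.entry(word).or_insert(0u64) += 1;
                  }
                  histo
              })
          })
          .collect();
      let mut total: HashMap<String, u64> = HashMap::new();
      for h in handles {
          for (word, n) in h.join().unwrap() {
              *total.entry(word).or_insert(0) += n;
          }
      }
      total
  }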

Whether to parallelize over files or over big chunks of files would depend upon how big your files are. The big-chunks approach leads to more regular work balance, because making all subproblems the same size is easy, but it does need "clean-up" for any words that span a chunk boundary, which is a bit annoying; there are only as many fix-ups as you have threads, though, which is small. It would be unsurprising to me if Tale of Two Cities could be done in like 0.1 ms with 32 cores, up to 500x faster than that initial Rust version (maybe omitting thread start-up overheads).

This particular problem is actually so fast that you might need to use all of Dickens cat'd together or more to get numbers less polluted by system noise and microsecond-scale thread start-up times and small L2 histo merges or do min-of-more-than-5 runs for times, but all the principles I've mentioned apply more broadly (except maybe the ASCII case trick).

Anyway, these are all sort of obvious points on a topic that was more about prog.langs, but a good systems programming language + ecosystem ought to make it "easy-ish" to get that last factor of 10..500x that is easy enough to express in a HackerNews comment.


>you aren't going to find anyone to argue that C is as safe as Rust.

That's good, I was starting to doubt humanity :)

>As for the TOCTTOU issue --- which is kind of silly, and had really nothing to do with Rust

Agreed. But well, that was what was claimed in the article.

>On a sane system...

I disagree with this one, although I know I am in the minority. A malloc failure is not an unrecoverable error, and the only reason to abort on a memory error is that if you allow malloc errors, then virtually every line of code can fail, because it is either allocating memory or calling some other method that allocates memory. (I imagine you've already read http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p070... section 4.3, but I mention it just in case someone hasn't heard of it.)

Something similar happens with integer overflow, although we've collectively decided that we prefer to have 2,147,483,647 + 1 = -2,147,483,648 (in both C++ and Rust in release mode). But if we assumed that i++ can cause an error, it would mean that every line of code could possibly be an error.

But well, back to memory allocation: Just a month ago I had a situation where using a language (C#) which just throws an exception on memory overflow saved my day. I left a program running overnight that builds and creates setups for the programs I make. There are about 190 different setups (win32, 64, linux, and variations), all compressed using max settings in 7zip. Those builds run in parallel in a thread pool, and each 7zip compression takes a lot of memory. So much that during the night one thread raised an out-of-memory error. When I woke up the next day, I had 189 setups created correctly, and one that failed. I reran that missing setup and was ready to publish them. Had I used a "sane" language, the full app would have aborted and I would have had to run all 190 setups again.

But seriously, it can happen a lot, especially in embedded. You are processing images and one image is too big. Do you abort the app, or report the error and move on to the next image?

>this code does essentially the same thing with errors that the Rust one does: it swallows them and continues

That was my point in the answer: Rust and C++ don't swallow the error. You have the line: reader.lines().filter_map(Result::ok)

And then:

  if let Err(e) = scanfiles() {
      println!("Error: Something unexpected happened: {:#?}", e);
  }

So, if there is an error, it will report it to you, not swallow it. And that to me is the full point: the guy who coded the Rust version would have to go out of his way to swallow the error, while in C it is the default.


The Rust program surfaces errors in globwalk, but swallows the fopen errors you were talking about (it filter_maps them away, which, to be fair, is what I'd do too).
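
A toy illustration (made-up data) of what filter_map(Result::ok) does to the error channel:

  fn main() {
      let lines: Vec<Result<&str, &str>> =
          vec![Ok("one"), Err("I/O error"), Ok("two")];
      let kept: Vec<&str> = lines.into_iter().filter_map(Result::ok).collect();
      // The Err item is silently dropped; nothing is reported anywhere.
      assert_eq!(kept, ["one", "two"]);
  }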

None of the counters in that C program are going to wrap. If it makes you feel better, you can replace size_t with "unsigned long long", but that's already what it is on modern systems.

That Rust program will also abort on allocation failures, just like I'd expect my C program to, so recovering from memory exhaustion isn't really in the scope of the discussion here.

If you care about security, you abort a C program when allocation fails. If you ask programmers to check failures, you get malloc checking bugs. Especially in embedded systems, where offset-from-NULL pointers can get attackers something useful. There are exceptions to every rule (at least in C programming), but if the task is "count words in files", you're nuts if you do anything but blow up when malloc fails.


> it filter_maps them away, which, to be fair, is what I'd do too

Ok, I don't know much Rust and I imagined filter_map reported to the main app. If it is swallowing them, then to me that is a big + for C++ in that area. It seems I will have to agree to disagree here, but swallowing the error is not what I would do at all. This program has the possibility of giving you a completely wrong result, and you would never know. You could use that result in something else, say, publish a study where you claim there were only 100 words in the files you processed when there were 1000, because your program crashed and didn't warn you.

> If you care about security, you abort a C program when allocation fails.

Well, that's why I personally prefer C++ and exceptions. They abort by default, and let you handle it if you want/can.

But I think my original point was lost. My complaint was that this code wasn't checking for failure at all. Shouldn't it be something like:

  if (!w->next) w->next = calloc(1, sizeof(*w));
  if (!w->next) abort();
  w = w->next;
I mean, in this particular case it will indeed blow up, because when it goes back to the while(w->word) w will be null and you will dereference a null pointer. But that's mostly luck, and it might not be true anymore if you make changes to the code in the future. If you care about security, you should check all allocs, and abort if null. And that's my beef with C: it swallows errors and continues instead of crashing or reporting when there is an error.

> but if the task is "count words in files", you're nuts if you do anything but blow up when malloc fails.

I won't disagree here, if the task is "count words in files". I was just trying to explain why, imo, not all "sane" programs should abort on a failed malloc.


In serious C programs, you rig malloc to abort rather than return NULL. With jemalloc, for instance, you'd do it with opt.xmalloc.


> A malloc failure is not a unrecoverable error

It's not, but it's also a fairly difficult error to recover from effectively. You really need to know what you're doing to get out of this; I think someone did a study of how many STL implementations handled std::bad_alloc correctly (i.e. unwound and didn't accidentally take down the program), and I think the number was zero.

> we've collectively decided that we prefer to have 2,147,483,647 + 1 = -2,147,483,648

Not always; overflowing is often exploitable, since it lets you bypass less-than checks.


>although we've collectively decided that we prefer to have 2,147,483,647 + 1 = -2,147,483,648. (in both C++ or rust in release mode)

Signed integer overflow is undefined behavior in C++. Only unsigned integers are defined to wrap like that.

https://stackoverflow.com/a/16188846
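
For reference, Rust's side of this is defined rather than undefined, and the explicit alternatives make the choice visible (a small sketch):

  fn main() {
      let x: i32 = i32::MAX;
      // `x + 1` panics in debug builds and wraps in release builds
      // (the latter unless overflow checks are enabled explicitly).
      assert_eq!(x.wrapping_add(1), i32::MIN);   // always wraps
      assert_eq!(x.checked_add(1), None);        // always detects overflow
      assert_eq!(x.saturating_add(1), i32::MAX); // clamps at the limit
  }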


Thanks, I got the gist. Doesn't look too bad IMO.


You can try, but it would likely be at least 10x the length of the C++ (or rust) program.


Maybe 5x, with most of that increase being a handrolled associative array.


You also need at least a dynamic string of some sort. Actually, I would estimate 20x; I was being kind with 10x. But feel free to prove me wrong.


So it turns out that factor is just over 2.5x: https://gist.github.com/saagarjha/00faa1963023206a8ccd987798.... You can split hairs about it being POSIX and inefficient or not checking all the errors, but it works and isn't really all that massive.


Why do you need a dynamic string of any sort for this problem?

(I think you estimated badly, for what it's worth.)


You'd need it to read words, unless you're using a libc function that does the allocation for you. (Not that you actually need a full string, realloc works just fine for the code I'm writing.)


Yeah, I just used strdup and strsep, like a C programmer. You don't need a dynamic string to count words.

:P


Pulling this up out of a subthread, if you happen to have some sort of notification set up for these: https://gist.github.com/saagarjha/00faa1963023206a8ccd987798.... Not intended to be robust or even correct beyond "I tried it a little and it seems to work".


TIL <ftw.h>!

You can golf out about 10 lines of this by getting rid of the table free. :)


Yes, ftw.h has a couple of nice functions–it's an apt name in this case ;)

I could have done a number of things to keep the line count down; in the spirit of the challenge I tried to keep it clean, pedantic, idiomatic POSIX C, because if I hadn't, I'm sure someone would have jumped on it for not being that :P So it's written in nano in the style of how I might write a homework assignment, rather than something I specifically golfed. In retrospect the fixed-depth buckets probably added more complexity than they saved, and using a linked list or open addressing would probably have been much easier, since it'd clean up some of the resizing and traversal code. Plus, the load factor was in general fairly poor: the final table (when I ran it on my Downloads folder) had just over 8 million buckets of depth 5 allocated and only about 600k got used at all, and of those, 570k were filled just once.

Oh, and here's the results on my Downloads folder:

  514464 Developer
  425236 Xcode
  386616 saagarjha
  386075 Users
  361324 Library
  337374 build
  333056 Wno
  306758 WebKit
  302892 DerivedData
  296179 eugbibmfmfphgbhczsxiimkhynol
I have a couple WebKit build logs that dominated the results…


C doesn't even have a standard filesystem library!


Every modern OS filesystem interface was designed in C. The places where filesystems are hard in C, they're hard in Rust too. In practice: no, filesystems are easy in C.


FWIW, the Nim version is (no deps besides stdlib):

  import os, strutils, tables, heapqueue

  iterator topByVal*[K, V](c: Table[K, V], n = 10, min = V.low): (K, V) =
    var q = initHeapQueue[(V, K)]()
    for key, val in c:
      if val >= min:
        let elem = (val, key)
        if q.len < n: q.push(elem)
        elif elem > q[0]: discard q.replace(elem)
    var y: (K, V)         # yielded tuple
    while q.len > 0:      # q now has top n entries
      let r = q.pop
      y[0] = r[1]
      y[1] = r[0]
      yield y             # yield in ASCENDING order

  template maybeCount(count, word: untyped) =
    if word.len > 1: count.mgetOrPut(word, 0).inc

  proc main() =
    var count = initTable[string, uint]()
    for path in ".".walkDirRec:
      if not path.endsWith(".txt"):
        continue
      for line in path.lines:
        var word = ""
        for c in line.toLower:
          if c in {'a'..'z'}:
            word.add c
          else:
            count.maybeCount word
            if word.len > 0: word.setLen 0
        count.maybeCount word # line ends in a word
    for word, cnt in count.topByVal:
      echo cnt, " ", word

  main()
Doing a gcc-10 profile-guided-optimization build on this, I get the min time of 5 trials running on Charles Dickens' Tale Of Two Cities { whose opening paragraph is very relevant to optimization claims! :-) }, available here: http://www.textfiles.com/etext/AUTHORS/DICKENS/dickens-tale-...

   7.3 ms  EDIT- tptacek's C version (w/bug-fix & new tokenization)
  15.8 ms  Above Nim 1.4 PGO --gc:arc
  24.8 ms  Above Nim 1.4 --gc:arc
  27.3 ms  Original C++ PGO
  30.2 ms  Original C++ -O3 only
  42.6 ms  dga's Rust with rustc-1.47.0 build --release
This is on an i7-6700K, so a Skylake core and, probably more importantly, 8 MiB of L3. Also, FWIW, I could not reproduce the Rust/C++ timing ratios with the above-mentioned input file. I would also point out that not only is globwalk unnecessary, as @burntsushi points out, but neither are regexes nor even sorting. A heap is a better way to do "top N", though you may need to reverse the answer if you care. I mean, maybe the point of the original C++ is to also bench/exhibit the use of regex engines, and part of the Rust point is utf8 or some such, but to me that seems more like a distraction.
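
For reference, a minimal Rust sketch of the heap-based "top N" (std only; the function name is mine):

  use std::cmp::Reverse;
  use std::collections::{BinaryHeap, HashMap};

  // Keep the n largest counts in a min-heap; pop evicts the current minimum.
  fn top_n(counts: &HashMap<String, u64>, n: usize) -> Vec<(u64, String)> {
      let mut heap = BinaryHeap::new();
      for (word, &cnt) in counts {
          heap.push(Reverse((cnt, word.clone())));
          if heap.len() > n {
              heap.pop();
          }
      }
      // Heap drain order is arbitrary; sort the n survivors descending.
      let mut out: Vec<_> = heap.into_iter().map(|Reverse(p)| p).collect();
      out.sort_unstable_by(|a, b| b.cmp(a));
      out
  }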

Anyway, maybe my Rust build environment is screwy. Hence my inclusion of an exact input .txt file for others to try out if they care. I had to add some extern crates and [dependencies] for anyhow, lazysort, regex, and globwalk. I've never done a PGO build with Rust, either, so that may help.


Interesting. I've used this same file for a test on:

   - I7-7920HQ, MacOS
       Rust version:  90ms    (rustc 1.47.0 (18bf6b4f0 2020-10-07))
       C++ version:  340ms   (Apple clang version 12.0.0 (clang-1200.0.32.2))

   - Xeon(R) Gold 6130  (skylake, Ubuntu 20.04)
      Rust:  76ms  (rustc-1.47.0)
      C++:  60ms   (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)), -O3, no PGO
Seems quite CPU- and compiler-dependent. Odd that the results on macOS were so horrible for C++.

And thanks - I've updated the post to include this more-reproducible benchmark, and included both macOS and linux results.

One note: I modified the single-threaded one to use walkdir, but that shouldn't affect the time in a major way. The macOS timings were about the same.

And yes, I agree about the top N part. I deliberately tried to remain "algorithm-compatible" with the C++ example; there are lots of tricks to use to speed this up more. Getting rid of the line-at-a-time processing would be a good start, for example - it results in an unnecessary double-scan of the input.
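
One way that might look (a sketch, not the actual change): read the whole file once and scan bytes directly, instead of line-at-a-time.

  use std::collections::HashMap;
  use std::path::Path;

  fn count_file(path: &Path, counts: &mut HashMap<String, u64>) -> std::io::Result<()> {
      let buf = std::fs::read(path)?; // one read, one scan
      let mut word = String::new();
      // The trailing space flushes a word that ends exactly at EOF.
      for &b in buf.iter().chain(std::iter::once(&b' ')) {
          let c = (b as char).to_ascii_lowercase();
          if c.is_ascii_alphabetic() {
              word.push(c);
          } else if !word.is_empty() {
              *counts.entry(std::mem::take(&mut word)).or_insert(0) += 1;
          }
      }
      Ok(())
  }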


Cool. And, yeah... mmap()ing the whole file would also auto-reject non-regular files at the OS layer, fixing that TOCTTOU issue... I almost did that, but I'm sure there are various other tricks, too.

I mostly thought Nim deserved to be seen, and then it happened to also be faster... perhaps giving folks a slight Bayesian update on presumptions of performance. :-)


Would you be willing to test my implementation[0] also? It'd be interesting to see the overhead of the (overkill) string normalization I do, along with the much reduced pressure on the allocator compared to yours.

[0] https://gist.github.com/Measter/e2e287ee21311d34ea8eb8cd9d57...


Well, I got 93 ms on the same Tale Of Two Cities file, 2x slower than dga's version and 18x slower than my last 5.0 ms Nim version (done to properly avoid hanging forever if someone does a "mknod foo.txt p") mentioned elsewhere in this thread ( https://news.ycombinator.com/item?id=24822429).

Really, though, all the code for all 6 versions (two Nim, one C, two Rust, one C++) as well as the input file is available to all. So you should/could double-check it yourself. As dga mentions in his updated blog post, there is a lot of compiler/CPU sensitivity.


Huh... that's actually faster than I expected. I figured that the normalization processing overhead would be higher than that.


I don't think that the Rust version is as different as the author seems to believe.

For example: the author writes "What happens if the file has been replaced by, e.g., a pipe or other not regular_file?" But it seems to me that the same problem -- replacing the file, which is regular during the directory walk, with a pipe just before said pipe is opened -- can exist in the Rust version too.

It doesn't matter what the syntax looks like; I suspect the equivalent system calls are the same. Rust will also walk the directory at T1 obtaining metainformation, at T2 open the file, even if at that moment it is actually a pipe, and then at T3 check whether the metainformation obtained at T1 says it's a regular file ("file" in the source?), which could still be the case. That says to me that both versions can suffer from the same bug.

Luckily it's something that's not going to happen too often in practice, given that typical programs don't use the same file name for a pipe and a regular file in rapid succession.


The Rust version in fact doesn't have that problem. It opens the file, then checks the type of the open file.

You could do the same in C++, but in Rust it's the default. So...


How is it the default in Rust? Like C++, the Rust stdlib gives you both options (std::path::Path::metadata). It boils down to calling stat vs fstat (on POSIX systems), and it's up to the programmer to make the right choice.
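
The two options side by side, as a sketch:

  use std::fs;
  use std::path::Path;

  fn main() -> std::io::Result<()> {
      let path = Path::new("foo.txt");

      // Path::metadata -> stat(2): queries the *name*; the file can be
      // swapped out between this check and a later open().
      let racy = fs::metadata(path)?.is_file();

      // File::metadata -> fstat(2): queries the already-open handle,
      // so no swap can change what you're looking at.
      let file = fs::File::open(path)?;
      let safe = file.metadata()?.is_file();

      println!("{} {}", racy, safe);
      Ok(())
  }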


It's about nudging the developer to do the right thing out of the box, even if they aren't really aware of all the intricacies. Ergonomic design like this will become a big deal in the future: even if you have very little idea of what you are doing, the program will actually be reliable and do the right thing, provided you did not stray too far from the path the language (and the library) nudges you towards.


I agree that that is generally what Rust and the Rust stdlib try to do, but I don't think it holds in this case.

Iterating through a directory generally yields a Path or a DirEntry, both of which have a metadata method readily available, and neither will nudge you to open the file first and then call metadata on the open file.


Well, I don't see that the "metadata" is actually obtained from the opened file handle in the Rust version?

The metadata request must happen at a different time point than the open of the file, and the API, as far as I can see, works on the path, not on the handle:

https://doc.rust-lang.org/std/fs/fn.metadata.html

which says to me that the Rust version can manifest the same kind of bug -- there are different time points at which different independent information is checked, and it just can't be assured that all the calls correspond to the same opened file handle?

As far as I understand, only checking the properties of the opened file via the same file handle is safe against the described kind of bugs, nothing else.


> Metadata request must happen at the different time point than the open of the file and the API as far as I see works on the path, not on the handle

std::fs::File::metadata operates on the open file. If the file is replaced with something else, your open file handle will still refer to the (orphaned) file, and the metadata still reflects its properties.


Then, if C++ could also obtain metadata using the file handle, and the Rust programmer must also take care to call the right "metadata" (which, even though it does a different thing, is named exactly the same, and the code as written doesn't make it obvious whether the right overloaded metadata is called), how is Rust better? I.e., how does Rust in this example help me, as a reader of the code, be sure that the right metadata function was called?

Call me old-fashioned, but only C is sure to be explicit there, as it wouldn't allow two different functions to have the same name. If in C++ the corresponding names are also overloaded, then both languages "aren't cutting it" when the obviousness is not there.


I never claimed that Rust was better in this regard, I merely pointed out that getting metadata from a path is not the only way.

Filesystem handling is not trivial, and some knowledge is required beyond the language itself. I do think that File::metadata and Path::metadata make a nice API together (better than lstat, stat, and fstat).

As for C not allowing two functions with the same name, sure, but then it doesn't have namespaces. In a nicely designed C library for file handling, these functions might well be called std_path_metadata and std_file_metadata, which boils down to the same thing.


You haven't answered my question: how do I, as a reader, see which of the identically named functions with different semantics was used in the Rust version, and how does Rust help me with that?


Ah right. I misunderstood your question.

No, Rust won't help you with that. On the contrary, Rust has excellent type inference, and in this case the name belongs to a type, which itself is inferred. As a casual reader, you'd have to track the documentation on the preceding calls.


That's not different from C++, however.


You can get it from a File too: https://doc.rust-lang.org/std/fs/struct.File.html#method.met...

(This doesn’t invalidate your point, of course, just saying you can do this.)


The Rust code gets the metadata from the opened file handle, which internally uses fstat (on Unix).

It would have been just as possible to write the code incorrectly by using the path version to get the metadata.


The post that this is responding to was discussed here: https://news.ycombinator.com/item?id=24805717


The quality of a specific program (that is not a good C++ program) is hardly a reflection of the quality of the language.


It is after the compiler has accepted it with no warnings or errors.


Though I've been a Rust fanboy for quite a while (I just switched jobs, and since Thursday I work at a company, and more specifically on a team, where Rust is a primary language), I'm missing the point here. Happy days and all, but... I don't think that C++ isn't cutting it here. I'll grossly oversimplify it, but tl;dr: you can achieve roughly the same results with C++, it will just take more effort. That has more to do with the nature and the design of both languages (and their respective standard libraries) than anything else.


And yet another "C++ sucks, Rust is better" thread.

Why are these C++-bashing threads practically always about Rust? I have nothing against the language, but I do feel the Rust devs have some sort of inferiority complex and a compulsive need to convince everyone of the superiority of their tool.

Guys, these endless pissing contests are not really productive.


I think it's because Rust was started by a high-profile, respected company specifically because they felt C++ was inadequate for their use case, and they had the expertise to be authoritative about that inadequacy.

That means Rust is a technologically natural counterpoint to C++ because it was literally designed to be a C++ replacement, and coming from a high-profile company means its entire target audience found out about it very quickly, so any discussion relevant to Rust can find plenty of participants. The rest is just due to all the usual reasons programmers bicker about things, and we've seen this language bickering again and again before, it's just a different language's turn in the spotlight.

I'd love to see these discussions include a more coherent analysis of when a language is appropriate, instead of assuming everyone should always make the same choice. If I'm starting greenfield, when should I pick C++ over Rust? If I'm looking at C++ alternatives, how would I choose between Rust and Golang? Is D ever a better choice? When? Why do we act like you can only pick one, when IPC and FFI and microservices mean you can split the difference if a language is only best for a portion of your project?


(Have [mostly embedded] rust experience, only dabbled in go)

> Is D ever a better choice? When?

Probably not, very small community, few resources, etc. …

> Why do we act like you can only pick one, when IPC and FFI and microservices mean you can split the difference if a language is only best for a portion of your project?

FFI is a huge pain and pretty slow in Go if not already done for you. IPC & microservices do have a lot of overhead, so you probably don't want to use them for, e.g., your text parser. It also makes it harder to use established knowledge in another language.

> If I'm starting greenfield, when should I pick C++ over Rust?

- Do you have experienced C++ developers/are you one?

- Is your environment geared for C/C++ code, e.g. kernel modules or RTOSes, maybe game development?

- Do you have dependencies that strongly expect you to use a certain language, e.g. Qt or (less so) GTK? They might be a pain to use with Rust

> If I'm looking at C++ alternatives, how would I choose between Rust and Golang?

- Rust is more speed/correctness-focused, while Go feels a bit like a faster scripting language. So something that might have started in Python is probably a better match for Go (a web server, something script-like)

- Go really isn't meant for constrained environments (although solutions do exist), like small Linux IoT devices or even bare-metal stuff

I feel like Go & Rust are pretty good complements, and while you can do a lot of things with both, they are clearly very different. Here are some examples, with what I'd recommend, not necessarily what I'd use, since I have a huge fondness for Rust and not so much for Go:

- Normal web backend: Go

- High performance web server: Rust

- DB: Rust

- App server: Go

- Crypto (the non-money kind) stuff: Rust

- WASM (it's pretty embedded-like): Rust


> > Is D ever a better choice? When?

> Probably not, very small community, few resources, etc. …

Have you used both? Not seriously, I'm going to say, or else you wouldn't say something like that.

As one example: one reason you'd be looking for a C++ alternative is to interact with an existing scientific computing codebase written in C. There's a good chance you won't care about all the stuff that gives Rust its steep learning curve, and you don't want to mix weird symbols all over your code. I used Rust first, then moved to D. There's no way someone used to programming in C is going to feel more comfortable with Rust than with D. (They might still choose Rust, but there's just no way they'd do it because of comfort.)


The first person to write a response post to the original post used Go, with an even more aggressive title than this (IMHO) http://jmoiron.net/blog/cpp-deserves-its-bad-reputation/

A bunch of people have been doing it in a bunch of languages. It’s kind of the natural way to explore the topic.


It's an article about Rust, why wouldn't people be talking about Rust in the comments?


I'm sure the implicit question was "why are these articles always about Rust".


Because the alternatives are either not mature, or too old and not C-like enough to be cool, I guess


Another plausible explanation is that Rust job numbers continue to be rather thin, which gives Rust fans on average more free time to write blog posts and open-source libraries. :)


Don't worry, it's not always about Rust. Sometimes it's about D :)


>"What happens if the file has been deleted between the directory listing and the open?"

Did the author even know that one can check the stream state or have it throw an exception? It is a very poor attempt at a pissing contest.



