For context: the author (see his other posts) is exploring the possibilities of writing C with no C runtime to avoid having to deal with it on Windows. He began to kind of treat it as a new language, with the string type, arenas and such, which help avoid memory bugs (and from my experience, are very useful).
This is a pretty cool hack. Makes me want to write a regex library again.
TBH, most of the C stdlib is quite useless anyway because the APIs are firmly stuck in the 80's and have never been updated for the new C99 language features and more recent common OS features (like non-blocking IO) - and that's coming from a C die hard ;)
> TBH, most of the C stdlib is quite useless anyway because the APIs are firmly stuck in the 80's (...)
This. A big reason Rust got some traction from the outset was how it presented itself as an alternative to C for systems programming that offered a modern set of libraries, designed with the benefit of decades of usability research.
Completely agreed. My Rust origin story wasn't about memory safety, fearless concurrency, a modern type system, or anything else like that. Not that I didn't care about those things — I did — but none of them were what convinced me to start learning Rust.
What did convince me was being able to prototype things for the C project I was working on while having access to a standard library that included basic data structures, synchronisation primitives, and I/O handling in a way that used best practices from recent decades. Everything else was just a bonus that I got to learn and use as I went.
Truth be told, I wouldn't mind crowdfunding an alternate stdlib for C for that reason. Everyone has a wishlist of patterns discovered after K&R bestowed upon us System V, that mesh perfectly well with C's philosophy and very minimal environments, and make modern programming much easier when you're dealing with lots of dependencies (e.g. bounded strings/small strings, arena allocation, error chaining/coalescing, etc.).
Many of those benefits were already present in Modula-2, Ada, Object Pascal, among others.
Unfortunately they came up when UNIX and C were the cool kids on the block, and who cares about a buffer overflow or a few miscompilations due to UB, any good coder is going to get them right anyway, no need for straitjacket programming languages.
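To make the "bounded strings" item concrete, here is a minimal sketch of a pointer-plus-length string in C. The names `Str`, `S`, `str_equal`, and `str_cut` are made up for illustration and are not taken from the article or from any existing library; the point is only that slicing needs no allocation and no null terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical string type: a view into memory owned by someone else. */
typedef struct {
    const char *data;
    ptrdiff_t   len;
} Str;

/* Wraps a string literal; sizeof includes the terminator, so subtract 1. */
#define S(lit) ((Str){ (lit), (ptrdiff_t)sizeof(lit) - 1 })

/* Returns 1 when both views have the same length and the same bytes. */
int str_equal(Str a, Str b)
{
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.data, b.data, (size_t)a.len) == 0);
}

/* Returns the part of *rest before the first sep and advances *rest past
 * the separator. Without a separator, returns everything and leaves
 * *rest empty. Nothing is copied. */
Str str_cut(Str *rest, char sep)
{
    Str head = *rest;
    for (ptrdiff_t i = 0; i < rest->len; i++) {
        if (rest->data[i] == sep) {
            head.len = i;
            rest->data += i + 1;
            rest->len  -= i + 1;
            return head;
        }
    }
    rest->data += rest->len;
    rest->len = 0;
    return head;
}

/* Usage: split "key=value" in place. */
void str_demo(void)
{
    Str line = S("key=value");
    Str key  = str_cut(&line, '=');
    assert(str_equal(key, S("key")));
    assert(str_equal(line, S("value")));
}
```

Because the length travels with the pointer, every helper can bounds-check without scanning for a terminator.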
I don't think everything has to be "modernized" and "updated." When I look at software from the 80s that is still with us, I think: "This is robust, keeps working, and has withstood the test of time" not "This must be changed." I still use C and the standard C library because I know how it worked in the past, I know it works today, and I know it will work for decades to come.
(minus the known foot-guns like strcpy() that we learned long ago were not great)
In contrast, when I see software from the 80s that is not a security and performance disaster, it’s because of continued investment: most of the basics like string handling have been replaced with bespoke or third-party libraries, and the IO heavy lifting is done with OS-specific interfaces anyway.
This is all well and good, but just because something came from before doesn’t mean it was a good idea then, or especially now. You’re basically citing survivorship bias. Of course something still used from the 80s is well made, otherwise it would have been replaced 30 years ago.
Most of the networking stuff from the 80s at least wasn’t particularly well made, it’s just been maintained and significantly reworked to not have massive security vulnerabilities.
You are right in that the C stdlib is mostly useful as an SDK for writing simple UNIX command line tools. But for other things it's better to go down to OS-specific APIs or up to POSIX (if a POSIX environment is available) - which isn't a great deal to be honest. One of the greatest features of C is that it doesn't depend too much on its stdlib.
> What is grep going to do while it waits for data?
Two things:
- Search the data it’s already read in. If data is coming in fast enough, it’s better to read and search in parallel rather than alternating.
- If this is a recursive grep, then list, open, or read from additional files.
Even so, thread pools work fine for this kind of thing. An optimized grep already wants to use threads to split the CPU work into parallelizable chunks (where helpful), so using threads for syscalls too shouldn’t make much difference.
However, for recursive grep, you might need to open large numbers of small files, in which case syscall latency might be a big enough factor that something like io_uring would be significantly faster.
Disclaimer: I’m mostly thinking about tools like ripgrep that are only grep-like. I’m not aware of any actual grep implementations that use parallelism to the same extent. But there’s no reason a grep implementation couldn’t do that; it’s just that most grep implementations were written in the age of single-core processors. Also note that I don’t actually know much about ripgrep’s implementation, so this post is mostly speculative.
> It’s just that most grep implementations were written...
...with a design philosophy of composition. Rather than a hundred tools that each try to make too-clever predictions about how to parallelize your work, the idea is to have small streamlined tools that you can compose into the optimal solution for your task. If you need parallelization, you can introduce that in the ways you need to using other small, streamlined tools that provide that.
It had nothing to do with some prevalence of "single-core processors" and was simply just a different way of building things.
That just pushes the task of optimising the workload up to you, complete with opportunities to forget about it & do it badly.
I don't relish the idea of splitting sections of a file up into N chunks and running N greps in parallel, and would much rather have that kind of "smarts" in the grep tool.
I propose the search tool decide how to split up the region I want searched, rather than me trying to compose simpler tools to try to achieve the same result.
POSIX isn't the C stdlib though, that's mostly a confusion caused by UNIXes where the libc is the de facto operating system API (and fully implements the POSIX standard).
TBF though, I guess one can implement non-blocking IO in C11 with just the stdlib by moving blocking IO calls into threads.
Isn't threading generally handled by POSIX as well? The p in pthreads?
If you're writing C code in 2024 and your target is a system that has an OS, then it's safe to use select and poll. They're going to be there. This hand wringing over "oh no, they aren't supported on every platform" is silly because the only platforms where they don't exist are the ones where they don't make sense anyway.
AFAIK select() and poll() are still not supported in MSVC though. IIRC at least select() is provided by 'WinSock' (the Berkeley socket API emulation on Windows), but it only works for socket handles, not for C stdlib file descriptors.
In general, if you're used to POSIX, Windows and MSVC is a world of pain. Sometimes a function under the same name exists but works differently, and sometimes a function exists with an underscore, and still works differently. It's usually better to write your own higher level wrapper functions which call into POSIX functions on UNIX-like operating systems, and into Win32 functions on Windows (e.g. ignoring the C stdlib for those feature areas).
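A rough sketch of what such a wrapper could look like for "wait until readable". The name `plat_wait_readable` and the `plat_socket` alias are hypothetical. The POSIX branch uses `poll()`; the Windows branch uses `WSAPoll()`, which only accepts sockets and assumes `WSAStartup()` has already been called and the program links against Ws2_32. I have only reasoned through the Windows branch, not built it.

```c
#include <assert.h>

#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET plat_socket;   /* hypothetical alias */
#else
#include <poll.h>
#include <unistd.h>
typedef int plat_socket;      /* hypothetical alias */
#endif

/* Waits up to timeout_ms for the OS to report an event on s.
 * Returns 1 if an event is pending (readable, hang-up, or error),
 * 0 on timeout, -1 if the underlying call failed. */
int plat_wait_readable(plat_socket s, int timeout_ms)
{
#ifdef _WIN32
    WSAPOLLFD pfd;
    pfd.fd = s;
    pfd.events = POLLRDNORM;
    pfd.revents = 0;
    int rc = WSAPoll(&pfd, 1, timeout_ms);   /* sockets only */
#else
    struct pollfd pfd;
    pfd.fd = s;
    pfd.events = POLLIN;
    pfd.revents = 0;
    int rc = poll(&pfd, 1, timeout_ms);      /* sockets, pipes, ttys, ... */
#endif
    if (rc < 0) {
        return -1;
    }
    return rc > 0 ? 1 : 0;
}
```

The rest of the program only ever sees `plat_wait_readable`, so the `#ifdef` stays in one file instead of spreading through the code base.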
Mostly because before Satya took over, C on Windows was considered a done deal, and everything was to be done in C++, with C-related updates only to the extent required by ISO C++ compliance.
Eventually the change of direction in Microsoft's management made them backtrack on that decision.
Additionally, as far as C++ compliance goes, they are leading up to C++23, while everyone else is still missing full modules, some concepts stuff, parallel STL,...
Although something has happened, as the VC++ team has switched away to something else, maybe due to the Rust adoption, .NET finally being AOT proper with lowlevel stuff in C#, or something else.
I guess the difference is that you need maybe 5 peeps to keep the C compiler frontend and stdlib uptodate (in their spare time), but 500 fulltime to do the same thing for C++, and half of those are needed just to decipher the C++ standard text ;)
I mean, no. The most basic of uses on Windows require #ifdef'ing as the prototypes, types, error codes and macros in wsock2 aren't exactly the POSIX ones (WSAPoll instead of poll, etc.).
Also, some software still targets Windows XP, which doesn't even have poll, only select.
poll() and select() are POSIX-isms that are not necessarily going to be present in every system's C standard library.
The only reason they happen to be available on Windows is that Microsoft, in an uncharacteristically freak stroke of respect for existing standards, decided to make WinSock (mostly) function-for-function compatible with BSD sockets.
In practice poll and select exist everywhere it makes sense. There aren't a ton of independent Unix vendors each with their own expensive and broken C implementation running around anymore.
Both are available if you use --std=c89 on gcc and clang. At this point it is safe to assume they are available unless you're doing something weird like writing C for some tiny microcontroller. Practically speaking this has been true for at least 20 years.
> Both are available if you use --std=c89 on gcc and clang.
This is irrelevant - that switch doesn’t have anything to do with controlling functionality not provided by the C standard library. Since poll and select aren’t part of it to begin with it doesn’t affect their availability.
epoll and signalfd will be available with that switch as well (on a gcc target where they are available, obviously); I don’t think that makes them C standard library functionality.
select/poll differ enough across common platforms. In MSVC, poll isn’t there; sure, you can emulate it, but now the goalpost moving is getting ridiculous. The arguments to select are only superficially compatible. (A practical example is that POSIX select supports pipes, but this will not work on Windows outside of specific environments or 3rd party implementations.)
It's called "WSAPoll" because it's not a POSIX poll, even though the signature is the same and it works similarly for sockets; I guess Microsoft thought it through better 15 years later.
But the original claim is just that select/poll are "everywhere it makes sense", but this depends on what you define select/poll to be. I think something close to POSIX is what makes sense. If you can't use it on pipes and FIFOs (and even regular files without getting an error) that seems like a pretty contrived definition.
The whole moving around of definition of what is the C standard library just because of popularity seems unproductive. ("You can do nonblocking IO using the C std library." -- no, you can't) Most popular 3rd party libraries support or can easily support the most popular targets due in large part to them being popular, that means "in practice, they exist everywhere it makes sense." Does this mean all popular 3rd party libraries are part of the C standard library?
Colloquially redefining terms like what is the C standard library just sows confusion with no benefit (as was illustrated earlier in regards to threads; C11 threads are not pthreads), just say what you mean in this case.
Yes, it is remarkable what little you actually get when you strictly stick to the actual C89/C99/C11 definitions of what's in "The C Standard Library". I still get tripped up on it, and often have to double check: Surely sigaction() is part of the C Standard! NOPE it's POSIX, but signal() is! Surely strptime() is part of the standard! NOPE, but strftime() is. termios stuff? NOPE. It's a minefield out there.
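For what it's worth, this is roughly the whole signal story you get from ISO C alone: `signal()`, `raise()`, and a `volatile sig_atomic_t` flag. A minimal sketch; `on_signal`, `got_signal`, and `signal_demo` are illustrative names. Anything like masking or `SA_RESTART` needs `sigaction()`, which is POSIX.

```c
#include <assert.h>
#include <signal.h>

/* The only kind of object a strictly portable handler may write to. */
volatile sig_atomic_t got_signal = 0;

void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Installs a handler with ISO C signal(), raises SIGINT against the
 * current process, and restores the default disposition.
 * Returns the flag value, or -1 if the handler could not be installed. */
int signal_demo(void)
{
    if (signal(SIGINT, on_signal) == SIG_ERR) {
        return -1;
    }
    raise(SIGINT);             /* handler runs before raise() returns */
    signal(SIGINT, SIG_DFL);
    return got_signal;
}
```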
Yeah, and on those platforms the functions exist if they make sense. If you're talking about some tiny embedded thing with no MMU or OS then it might not, but programming on those is a specialized task anyway so it doesn't really matter.
Thanks for that explanation! I have occasionally fantasized about a similar project - what could C be like, if one abandoned its ancient stdlib and replaced it with something suited to current purposes? - so I'm looking forward now to reading more of this author's writing.
Thank you for the context. I wouldn't have read the article without it. I mean, it's a pretty good idea for "no runtime," but when I saw the article title, I thought at first "Why????" Honestly, I'm glad I read it.
This comprehensive article goes over the problems of memory allocation, how programmers and educators have been trained to wrongly think about the problem, and how the concept of arenas solves it.
As someone who spends most of his time in garbage collected languages, this was wildly fascinating to me.
So bad is the performance of gcc std::regex that I reimplemented part of it using regex(3). Of course, I didn’t discover the problem until I’d committed to the interface, so I put mine in namespace dts, just in case one day the supplied implementation becomes useful.
As it stands, std::regex should come with a warning label. It’s fine for occasional use. As part of a parser, it’s not. Slow is better than broken, until slow is broken.
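For readers who haven't used regex(3): a minimal sketch of matching with the POSIX calls `regcomp`, `regexec`, and `regfree`. The wrapper name `posix_match` is made up here and is not the interface from my `dts` namespace. Note that `<regex.h>` is POSIX, not ISO C, so it is not available with plain MSVC.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Compiles pattern as a POSIX extended regex and tests it against text.
 * Returns 1 on match, 0 on no match, -1 if the pattern does not compile.
 * Recompiling on every call is wasteful; a real caller would keep the
 * regex_t around. */
int posix_match(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return -1;
    }
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

/* Usage: letters followed by exactly two digits. */
void regex_demo(void)
{
    assert(posix_match("^[a-z]+[0-9]{2}$", "abc42") == 1);
    assert(posix_match("^[a-z]+[0-9]{2}$", "abc4") == 0);
}
```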
To be fair, the GNU implementation of std::regex has to conform to the API defined by ISO/IEC 14882 (The C++ Programming Language). If you don't have to provide that API purely in a header file, it gets pretty easy to write something bespoke that is faster, or smaller, or conforms to some special esoteric requirement, or does something completely different than what the C++ standard library specification requires.
The purpose of the C++ standard library is to provide well-tested, well-documented general functionality. If you have specific requirements and have an implementation or API that meets your requirements better than what the C++ standard library supplies, that's great. You're encouraged to use that instead.
If you have an implementation of std::regex that meets all the documented requirements and is provably faster under all or most circumstances than my implementation is, then submit it upstream. It's Free software and it wouldn't be the first time improved implementations of library code have been suggested and accepted by that project. Funny how no one has done that for std::regex in over a decade though, despite the complaints.
Around 30 years ago, STL introduced an allocator template parameter everywhere to let you control allocation. Here in 2024 we read about making use of the, erm, strange semantics of dynamic linking to force standard C++ code to allocate your way
TFA links to what arenas are and where they come from, how some bits included here would not really be part of this library but assumed part of the project using these techniques, does explain the general point of the exercise, and how this isn't even strictly a suggestion for a library but a "potpouri of techniques".
They are fully aware of -lre and assume that everyone else is too. This isn't about just achieving regex somehow. It's about avoiding the crt and gc and c++ in general while using an environment that normally includes all that by default.
You don't redefine new just to get regex. Obviously there must be some larger point and this regex is just some zoomed-in detail example of existing and operating within that larger point.
This is fun and impressive, but it feels like the author kind of misses out on explaining in the intro why it would be wrong to just ... use C's regex library [1]?
I guess the entire post could be seen as an exercise in wrapping C++ to C with nice memory-handling properties and so on, but it would also be fine to be open and upfront about that, in my opinion.
> The regex engine allocates everything in the arena, including all temporary working memory used while compiling, matching, etc.
I do something quite different. I design the API so any data returned by the library function is allocated by the caller. This means the caller has full control over what style of memory management works best.
For example, you can then choose to use stack allocation, RAII, malloc/free, the GC, static allocation, etc.
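A small sketch of that style, assuming an `snprintf`-like convention where the function reports how much space the full result needs. `join_path` is a hypothetical function invented for this example, not part of any library discussed here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical API: writes "dir/name" into dst, which has room for cap
 * bytes including the terminator. Returns the length the full result
 * needs, excluding the terminator, so the caller can detect truncation
 * or do a measuring pass with dst == NULL and cap == 0.
 * The function itself never allocates. */
size_t join_path(char *dst, size_t cap, const char *dir, const char *name)
{
    int n = snprintf(dst, cap, "%s/%s", dir, name);
    return n < 0 ? 0 : (size_t)n;
}

/* Usage: the same call serves stack storage and heap storage. */
void join_path_demo(void)
{
    char on_stack[32];
    size_t need = join_path(on_stack, sizeof on_stack, "src", "regex.c");
    assert(need < sizeof on_stack);           /* not truncated */

    need = join_path(NULL, 0, "src", "regex.c");   /* measure only */
    char *on_heap = malloc(need + 1);
    assert(on_heap != NULL);
    join_path(on_heap, need + 1, "src", "regex.c");
    assert(strcmp(on_stack, on_heap) == 0);
    free(on_heap);
}
```

The cost of this design is that results with an unknown size usually need two calls: one to measure, one to fill.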
Isn't giving the caller control over the memory exactly what this API does? The caller just passes in a block of memory that will be used for all of the internal allocations as well as the strings returned by the API.
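As a sketch of what "passes in a block of memory" can mean: a bump allocator over a caller-owned buffer. The `Arena` struct and `arena_alloc` below are illustrative names and a simplified design, not the article's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical arena: a window over memory the caller owns. */
typedef struct {
    char *beg;   /* next free byte */
    char *end;   /* one past the last usable byte */
} Arena;

/* Returns zeroed storage for count objects of the given size, aligned to
 * align (a power of two), or NULL if the arena cannot hold them. */
void *arena_alloc(Arena *a, size_t size, size_t align, size_t count)
{
    size_t pad   = (size_t)(-(uintptr_t)a->beg & (align - 1));
    size_t avail = (size_t)(a->end - a->beg);
    if (pad > avail || (size != 0 && count > (avail - pad) / size)) {
        return NULL;   /* out of space; the division also guards overflow */
    }
    void *p = a->beg + pad;
    a->beg += pad + size * count;
    return memset(p, 0, size * count);
}

/* Usage: the caller decides where the memory lives (here, the stack). */
void arena_demo(void)
{
    char buffer[256];
    Arena a = { buffer, buffer + sizeof buffer };

    int *xs = arena_alloc(&a, sizeof(int), _Alignof(int), 8);
    assert(xs != NULL && xs[7] == 0);

    /* "Freeing" everything is just resetting the pointer. */
    a.beg = buffer;
}
```

Everything the library allocates internally comes out of that one block, so the caller controls both the lifetime and the upper bound on memory use.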
If you read to the end of the article he actually lists the pros/cons where this is mentioned. That aside, the point of the article is maybe not so much using C++ regex but a technique to integrate C++ code into C code.