jmillikin's comments | Hacker News

I use CBZ to archive both physical and digital comic books so I was interested in the idea of an improved container format, but the claimed improvements here don't make sense.

---

For example they make a big deal about each archive entry being aligned to a 4 KiB boundary "allowing for DirectStorage transfers directly from disk to GPU memory", but the pages within a CBZ are going to be encoded (JPEG/PNG/etc) rather than just being bitmaps. They need to be decoded first, the GPU isn't going to let you create a texture directly from JPEG data.

Furthermore the README says "While folders allow memory mapping, individual images within them are rarely sector-aligned for optimized DirectStorage throughput" which ... what? If an image file needs to be sector-aligned (!?) then a BBF file would also need to be, else the 4 KiB alignment within the file doesn't work, so what is special about the format that causes the OS to place its files differently on disk?

Also in the official DirectStorage docs (https://github.com/microsoft/DirectStorage/blob/main/Docs/De...) it says this:

  > Don't worry about 4-KiB alignment restrictions
  > * Win32 has a restriction that asynchronous requests be aligned on a
  >   4-KiB boundary and be a multiple of 4-KiB in size.
  > * DirectStorage does not have a 4-KiB alignment or size restriction. This
  >   means you don't need to pad your data which just adds extra size to your
  >   package and internal buffers.
Where is the supposed 4 KiB alignment restriction even coming from?

There are zip-based formats that align files so they can be mmap'd as executable pages, but that's not what's happening here, and I've never heard of a JPEG/PNG/etc image decoder that requires aligned buffers for the input data.

Is the entire 4 KiB alignment requirement fictitious?

---

The README also talks about using xxhash instead of CRC32 for integrity checking (the OP calls it "verification"), claiming this is more performant for large collections, but this is insane:

  > ZIP/RAR use CRC32, which is aging, collision-prone, and significantly slower
  > to verify than XXH3 for large archival collections.  
  > [...]  
  > On multi-core systems, the verifier splits the asset table into chunks and
  > validates multiple pages simultaneously. This makes BBF verification up to
  > 10x faster than ZIP/RAR CRC checks.
CRC32 is limited by memory bandwidth if you're using a normal (i.e. SIMD) implementation. Assuming 100 GiB/s throughput, a typical comic book page (a few megabytes) will take like ... a millisecond? And there's no data dependency between file content checksums in the zip format, so for a CBZ you can run the CRC32 calculations in parallel for each page just like BBF says it does.
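
To make that concrete, here's a minimal sketch of per-page parallel CRC32 (assuming the pages have already been read into memory as byte buffers, and using the crc32fast crate with scoped threads; none of this is taken from the BBF code):

  use crc32fast::Hasher;

  // Each page's CRC32 depends only on that page's bytes, so the checksums
  // can be computed independently -- here with one scoped thread per page.
  fn parallel_page_crcs(pages: &[Vec<u8>]) -> Vec<u32> {
      std::thread::scope(|scope| {
          let handles: Vec<_> = pages
              .iter()
              .map(|page| {
                  scope.spawn(move || {
                      let mut hasher = Hasher::new();
                      hasher.update(page);
                      hasher.finalize()
                  })
              })
              .collect();
          handles.into_iter().map(|h| h.join().unwrap()).collect()
      })
  }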

But that doesn't matter because to actually check the integrity of archived files you want to use something like sha256, not CRC32 or xxhash. Checksum each archive (not each page), store that checksum as a `.sha256` file (or whatever), and now you can (1) use normal tools to check that your archives are intact, and (2) record those checksums as metadata in the blob storage service you're using.
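
A minimal sketch of that whole-archive approach (assuming the sha2 crate; plain `sha256sum *.cbz > archives.sha256` from coreutils does the same job):

  use sha2::{Digest, Sha256};
  use std::{fs, io};

  // Hash an entire archive file; the hex digest can be stored next to it as
  // e.g. "MyComic.cbz.sha256" and re-checked later with standard tools.
  fn archive_sha256(path: &str) -> io::Result<String> {
      let mut file = fs::File::open(path)?;
      let mut hasher = Sha256::new();
      io::copy(&mut file, &mut hasher)?;
      let digest = hasher.finalize();
      Ok(digest.iter().map(|b| format!("{:02x}", b)).collect())
  }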

---

The Reddit thread has more comments from people who have noticed other sorts of discrepancies, and the author is having a really difficult time responding to them in a coherent way. The most charitable interpretation is that this whole project (supposed problems with CBZ, the readme, the code) is the output of an LLM.


> the pages within a CBZ are going to be encoded (JPEG/PNG/etc) rather than just being bitmaps. They need to be decoded first, the GPU isn't going to let you create a texture directly from JPEG data.

It seems that JPEG can be decoded on the GPU [1] [2]

> CRC32 is limited by memory bandwidth if you're using a normal (i.e. SIMD) implementation.

According to smhasher tests [3] CRC32 is not limited by memory bandwidth. Even if we multiply CRC32 scores x4 (to estimate 512 bit wide SIMD from 128 bit wide results), we still don't get close to memory bandwidth.

The 32 bit hash of CRC32 is too low for file checksums. xxhash is definitely an improvement over CRC32.

> to actually check the integrity of archived files you want to use something like sha256, not CRC32 or xxhash

Why would you need to use a cryptographic hash function to check the integrity of archived files? A quality non-cryptographic hash function will detect corruption due to things like bit-rot, bad RAM, etc. just the same.

And why is 256 bits needed here? Kopia developers, for example, think 128 bit hashes are big enough for backup archives [4].

[1] https://docs.nvidia.com/cuda/nvjpeg/index.html

[2] https://github.com/CESNET/GPUJPEG

[3] https://github.com/rurban/smhasher

[4] https://github.com/kopia/kopia/issues/692


Maybe the CRC32 implementations in the smhasher suite just aren't that fast?

[1] claims 15 GB/s for the slowest implementation (Chromium) they compared (all vectorized).

> The 32 bit hash of CRC32 is too low for file checksums. xxhash is definitely an improvement over CRC32.

Why? What kind of error rate do you expect, and what kind of reliability do you want to achieve? Assumptions that would lead to a >32bit checksum requirement seem outlandish to me.

[1] https://github.com/corsix/fast-crc32?tab=readme-ov-file#x86_...


From the SMHasher test results, the quality of xxhash seems higher. It has less bias / higher uniformity than CRC.

What bothers me about probability calculations is that they always assume perfect uniformity. I've never seen any estimates of how bias affects collision probability, or how to modify the probability formula to account for the non-perfect uniformity of a hash function.


It doesn't matter, though. xxhash is better than crc32 for hashing keys in a hash table, but both of them are inappropriate for file checksums -- especially as part of a data archival/durability strategy.

It's not obvious to me that per-page checksums in an archive format for comic books are useful at all, but if you really wanted them for some reason then crc32 (fast, common, should detect bad RAM or a decoder bug) or sha256 (slower, common, should detect any change to the bitstream) seem like reasonable choices and xxhash/xxh3 seems like LARPing.


> both of them are inappropriate for file checksums

CRCs like CRC32 were born for this kind of work. CRCs detect corruption when transmitting/storing data. What do you mean when you say that it's inappropriate for file checksums? It's ideal for file checksums.


Uniformity isn’t directly important for error detection. CRC-32 has the nice property that it’s guaranteed to detect all burst errors up to 32 bits in size, while a b-bit hash only does that with probability about 1 − 2^−b, of course. (But it’s valid to care about detecting larger errors with higher probability, yes.)
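
A toy demonstration of that guarantee (a sketch assuming the crc32fast crate; any CRC-32 implementation behaves the same way):

  use crc32fast::Hasher;

  fn crc32(data: &[u8]) -> u32 {
      let mut hasher = Hasher::new();
      hasher.update(data);
      hasher.finalize()
  }

  fn main() {
      let original = vec![0x5au8; 4096];

      // Flip a contiguous 32-bit burst (4 bytes) somewhere in the middle.
      let mut corrupted = original.clone();
      for byte in &mut corrupted[100..104] {
          *byte ^= 0xff;
      }

      // CRC-32 is guaranteed to detect any burst error of <= 32 bits, so the
      // two checksums can never collide for this kind of corruption.
      assert_ne!(crc32(&original), crc32(&corrupted));
  }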

> Uniformity isn’t directly important for error detection.

Is there any proof of this? I'm interested in reading more about it.

> detect all burst errors up to 32 bits in size

What if errors are not consecutive bits?


There’s a whole field’s worth of really cool stuff about error correction that I wish I knew a fraction of enough to give reading recommendations about, but my comment wasn’t that deep – it’s just that in hashes, you obviously care about distribution because that’s almost the entire point of non-cryptographic hashes, and in error correction you only care that x ≠ y implies f(x) ≠ f(y) with high probability, which is only directly related in the obvious way of making use of the output space (even though it’s probably indirectly related in some interesting subtler ways).

E.g. f(x) = concat(xxhash32(x), 0xf00) is just as good at error detection as xxhash32 but is a terrible hash, and, as mentioned, CRC-32 is infinitely better at detecting certain types of errors than any universal hash family.


This seems to make sense, but I need to read more about error correction to fully understand it. I was considering the possibility that data could also contain patterns where error detection performs poorly due to bias, and I haven't seen how to include these estimates in the probability calculations.

> The 32 bit hash of CRC32 is too low for file checksums.

What makes you say this? I agree that there are better algorithms than CRC32 for this usecase, but if I was implementing something I'd most likely still truncate the hash to somewhere in the same ballpark (likely either 32, 48, or 64 bits).

Note that the purpose of the hash is important. These aren't being used for deduplication where you need a guaranteed unique value between all independently queried pieces of data globally but rather just to detect file corruption. At 32 bits you have only a 1 out of 2^(32-1) chance of a false negative. That should be more than enough. By the time you make it to 64 bits, if you encounter a corrupted file once _every nanosecond_ for the next 500 years or so you would expect to miss only a single event. That is a rather absurd level of reliability in my view.


I've seen a few arguments that with the amount of data we have today the 2^(32-1) chance can happen, but I can't vouch that their calculations were done correctly.

The README in the SMHasher test suite also seems to indicate that 32 bits might be too few for file checksums:

"Hash functions for symbol tables or hash tables typically use 32 bit hashes, for databases, file systems and file checksums typically 64 or 128bit, for crypto now starting with 256 bit."


That's vaguely describing common practices, not what's actually necessary or why. It also doesn't address my note that the purpose of the hash is important. Are "file systems" and "file checksums" referring to globally unique handles, content addressed tables, detection of bitrot, or something else?

For detecting file corruption the amount of data alone isn't the issue. Rather what matters is the rate at which corruption events occur. If I have 20 TiB of data and experience corruption at a rate of only 1 event per TiB per year (for simplicity assume each event occurs in a separate file) that's only 20 events per year. I don't know about you but I'm not worried about the false negative rate on that at 32 bits. And from personal experience that hypothetical is a gross overestimation of real world corruption rates.


It depends on how you calculate statistics. If you are designing a file format that hundreds of millions of users will use over its lifetime (storing billions of files), what are the chances that a 32-bit checksum won't be able to catch at least one corruption? During transfer over an unstable wireless internet connection, storage on a cheap flash drive, a poor HDD with a higher error rate, unstable RAM, etc. We want to avoid data corruption if we can, even in less than ideal conditions. The cost of going from 32-bit to 64-bit hashes is very small.

No, it doesn't "depend on how you calculate statistics". Or rather you are not asking the right question. We do not care if a different person suffers a false negative. The question is if you, personally, are likely to suffer a false negative. In other words, will any given real world deployment of the solution be expected to suffer from an unacceptably high rate of false negatives?

Answering that requires figuring out two things. The sort of real world deployment you're designing for and what the acceptable false negative rate is. For an extremely conservative lower bound suppose 1 error per TiB per year and suppose 1000 TiB of storage. That gives a 99.99998% success rate for any given year. That translates to expecting 1 false negative every 4 million years.

I don't know about you but I certainly don't have anywhere near a petabyte of data, I don't suffer corruption at anywhere near a rate of 1 event per TiB per year, and I'm not in the business of archiving digital data on a geological timeframe.

32 bits is more than fit for purpose.


I can't say I agree with your logic here. We are not talking about any specific backup or anything like that. We are talking about the design of a file format that is going to be used globally.

A business running a lottery has to calculate the odds of anyone winning, not just the odds of a single person winning. Similarly, a designer of a file format has to consider the chances for all users: what percentage of users will be affected by any design decision?

For example, what if you offered a guarantee that a 32-bit hash will protect you from corruption, and generously compensated anyone who got this type of corruption; how would you calculate the probability then?


If you offer compensation then of course you need to consider your risk exposure, ie total users. That's similar to a lottery where the central authority is concerned with all payouts while an individual is only concerned with their own payout.

Outside of brand reputation issues that is not how real world products are designed. You design a tool for the specific task it will be used for. You don't run your statistics in aggregate based on the expected number of customers.

Users are independent from one another. If the population doubles my filesystem doesn't suddenly become less reliable. If more people purchase the same laptop that I have the chance of mine failing doesn't suddenly go up. If more people deep fry things in their kitchen my own personal risk of a kitchen fire isn't increased regardless of how busy the fire department might become.


  > It seems that JPEG can be decoded on the GPU [1] [2]
Sure, but you wouldn't want to. Many algorithms can be executed on a GPU via CUDA/ROCm, but the use cases for on-GPU JPEG/PNG decoding (mostly AI model training? maybe some sort of giant megapixel texture?) are unrelated to anything you'd use CBZ for.

For a comic book the performance-sensitive part is loading the current and adjoining pages, which can be done fast enough to appear instant on the CPU. If the program does bulk loading then it's for thumbnail generation which would also be on the CPU.

Loading compressed comic pages directly to the GPU would be if you needed to ... I dunno, have some sort of VR library browser? It's difficult to think of a use case.

  > According to smhasher tests [3] CRC32 is not limited by memory bandwidth.
  > Even if we multiply CRC32 scores x4 (to estimate 512 bit wide SIMD from 128
  > bit wide results), we still don't get close to memory bandwidth.
Your link shows CRC32 at 7963.20 MiB/s (~7.77 GiB/s) which indicates it's either very old or isn't measuring pure CRC32 throughput (I see stuff about the C++ STL in the logs).

Look at https://github.com/corsix/fast-crc32 for example, which measures 85 GB/s (GB, GiB, eh close enough) on the Apple M1. That's fast enough that I'm comfortable calling it limited by memory bandwidth on real-world systems. Obviously if you solder a Raspberry Pi to some GDDR then the ratio differs.

  > The 32 bit hash of CRC32 is too low for file checksums. xxhash is definitely
  > an improvement over CRC32.
You don't want to use xxhash (or crc32, or cityhash, ...) for checksums of archived files, that's not what they're designed for. Use them as the key function for hash tables. That's why their output is 32- or 64-bits, they're designed to fit into a machine integer.

File checksums don't have the same size limit so it's fine to use 256- or 512-bit checksum algorithms, which means you're not limited to xxhash.

  > Why would you need to use a cryptographic hash function to check the
  > integrity of archived files? A quality non-cryptographic hash function will
  > detect corruption due to things like bit-rot, bad RAM, etc. just the same.
I have personally seen bitrot and network transmission errors that were not caught by xxhash-type hash functions, but were caught by higher-level checksums. The performance properties of hash functions used for hash table keys make those same functions less appropriate for archival.

  > And why is 256 bits needed here? Kopia developers, for example, think 128
  > bit hashes are big enough for backup archives [4].
The checksum algorithm doesn't need to be cryptographically strong, but if you're using software written in the past decade then SHA256 is supported everywhere by everything so might as well use it by default unless there's a compelling reason not to.

For archival you only need to compute the checksums on file transfer and/or periodic archive scrubbing, so the overhead of SHA256 vs SHA1/MD5 doesn't really matter.

I don't know what kopia is, but according to your link it looks like their wire protocol involves each client downloading a complete index of the repository content, including a CAS identifier for every file. The semantics would be something like Git? Their list of supported algorithms looks reasonable (blake, sha2, sha3) so I wouldn't have the same concerns as I would if they were using xxhash or cityhash.


> which can be done fast enough to appear instant on the CPU

Big scanned PDFs can be problematic, and could benefit from more efficient processing (if the reader had HW support for such a technique).

> Your link shows CRC32 at 7963.20 MiB/s (~7.77 GiB/s) which indicates it's either very old or isn't measuring pure CRC32 throughput

It may not be the fastest implementation of CRC32, but it was also measured on an old Ryzen 5 3350G at 3.6 GHz. Below the table are results from different hardware. On an Intel i7-6820HQ, CRC32 achieves 27.6 GB/s.

> measures 85 GB/s (GB, GiB, eh close enough) on the Apple M1. That's fast enough that I'm comfortable calling it limited by memory bandwidth on real-world systems.

That looks incredibly suspicious since Apple M1 has maximum memory bandwidth of 68.25 GB/s [1].

> I have personally seen bitrot and network transmission errors that were not caught by xxhash-type hash functions, but were caught by higher-level checksums. The performance properties of hash functions used for hash table keys make those same functions less appropriate for archival.

Your argument is meaningless without more details. xxhash supports 128 bits, which I doubt would have missed the error in your case.

SHA256 is an order of magnitude or more slower than non-cryptographic hashes. In my experience the archival process usually has a big enough effect on performance to care about it.

I'm beginning to suspect your primary reason for disliking xxhash is that it's not a de facto standard like CRC or SHA. I agree that this is a big one, but you keep implying there's more to why xxhash is bad. Maybe my knowledge is lacking; care to explain? Why wouldn't 128-bit xxhash be more than enough for file checksums? AFAIK the only thing it doesn't do is protect you against tampering.

> I don't know what kopia is, but according to your link it looks like their wire protocol involves each client downloading a complete index of the repository content, including a CAS identifier for every file. The semantics would be something like Git? Their list of supported algorithms looks reasonable (blake, sha2, sha3) so I wouldn't have the same concerns as I would if they were using xxhash or cityhash.

Kopia uses hashes for block-level deduplication. What would be the issue if they used a 128-bit xxhash instead of the 128-bit cryptographic hash they use now (assuming we don't need protection from tampering)?

[1] https://en.wikipedia.org/wiki/Apple_M1


> What would be the issue if they used a 128-bit xxhash instead of the 128-bit cryptographic hash they use now (assuming we don't need protection from tampering)?

Malicious block hash collisions, where the colliding block was introduced in some way other than tampering (e.g. storing a file created by someone else).


That's a good example. Thanks! It would be kind of an indirect tampering method.

> The most charitable interpretation is that this whole project (supposed problems with CBZ, the readme, the code) is the output of an LLM.

Do LLMs perform de/serialization by casting C structs to char-pointers? I would've expected that to have been trained out of them. (Which is to say: lots of it is clearly LLM-generated, but at least some of the code might be human.)

Anyway, I hope that the person who published this can take all the responses constructively. I know I'd feel awful if I was getting so much negative feedback.


Most of the code in WebP and AVIF is shared with VP8/AV1, which means if your browser supports contemporary video codecs then it also gets pretty good lossy image codecs for free. JPEG-XL is a separate codebase, so it's far more effort to implement and merely providing better compression might not be worth it absent other considerations. The continued widespread use of JPEG is evidence that many web publishers don't care that much about squeezing out a few bytes.

Also from a security perspective the reference implementation of JPEG-XL isn't great. It's over a hundred kLoC of C++, and given the public support for memory safety by both Google and Mozilla it would be extremely embarrassing if a security vulnerability in libjxl led to a zero-click zero-day in either Chrome or Firefox.

The timing is probably a sign that Chrome considers the Rust implementation of JPEG-XL to be mature enough (or at least heading in that direction) to start kicking the tires.


> The continued widespread use of JPEG is evidence that many web publishers don't care that much about squeezing out a few bytes.

I agree with the second part (useless hero images at the top of every post demonstrate it), but not necessarily the first. JPEG is supported pretty much everywhere images are, and it’s the de facto default format for pictures. Most people won’t even know what format they’re using, let alone that they could compress it or use another one. In the words of Hank Hill:

> Do I look like I know what a JPEG is? I just want a picture of a god dang hot dog.

https://www.youtube.com/watch?v=EvKTOHVGNbg


I'm not (only) talking about the general population, but major sites. As a quick sanity check, the following sites are serving images with the `image/jpeg` content type:

* CNN (cnn.com): News-related photos on their front page

* Reddit (www.reddit.com): User-provided images uploaded to their internal image hosting

* Amazon (amazon.com): Product categories on the front page (product images are in WebP)

I wouldn't expect to see a lot of WebP on personal homepages or old-style forums, but if bandwidth costs were a meaningful budget line item then I would expect to see ~100% adoption of WebP or AVIF for any image that gets recompressed by a publishing pipeline.


It’s subsidized by cheap CDN rates and dominated by video demand.

Any site that uses a frontend framework or CMS will probably serve WebP at the very least.

The https://github.com/blackjetrock/ghidra-6303 repository your post links to (containing a SLEIGH spec for the HD6303) is no longer available, did you happen to save a local clone that could be re-uploaded somewhere?


Thank you very much for pointing this out! Fortunately I still have the code locally. I'll try to raise another PR to get 6303 support into Ghidra.



Thank you for finding this! Depili does great work! In another comment I mentioned that I've been working on the Casio CZ-101, which uses the NEC μPD7810 processor. Depili created a processor spec for the μCOM-87 architecture, which I've continued working on in this PR: https://github.com/NationalSecurityAgency/ghidra/pull/7930


There's at least one proprietary platform that supports Git built via a vendor-provided C compiler, but for which no public documentation exists and therefore no LLVM support is possible.

Ctrl+F for "NonStop" in https://lwn.net/Articles/998115/


Shouldn't these platforms work on getting Rust to support it rather than have our tools limited by what they can consume? https://github.com/Rust-GCC/gccrs


A maintainer for that specific platform was more into the line of thinking that Git should bend over backwards to support them because "loss of support could have societal impact [...] Leaving debit or credit card authorizers without a supported git would be, let's say, "bad"."

To me it looks like big corps enjoying the idea of having free service so they can avoid maintaining their own stuff, and trying the "too big to fail" fiddle on open source maintainers, with little effect.


It's additionally ridiculous because git is a code management tool. Maybe they are using it for something much more wild than that (why?) but I assume this is mostly just a complaint that they can't do `git pull` from their wonky architecture that they are building on. They could literally have a network mount and externally manage the git if they still need it.

It's not like older versions of git won't work perfectly fine. Git has great backwards compatibility. And if there is a break, seems like a good opportunity for them to fork and fix the break.

And let's be perfectly clear. These are very often systems built on top of a mountain of open source software. These companies will even have custom patched tools like gcc that they aren't willing to upstream because some manager decided they couldn't just give away the code they paid an engineer to write. I may feel bad for the situation it puts the engineers in, but I feel absolutely no remorse for the companies because their greed put them in these situations in the first place.


> Leaving debit or credit card authorizers without a supported git would be, let's say, "bad".

Oh no, if only these massive companies that print money could do something as unthinkable as pay for a support contract!


Yes. It benefits them to have ubiquitous tools supported on their system. The vendors should put in the work to make that possible.

I don’t maintain any tools as popular as git or you’d know me by name, but darned if I’m going to put in more than about 2 minutes per year supporting non-Unix.

(This said as someone who was once paid to improve Ansible’s AIX support for an employer. Life’s too short to do that nonsense for free.)


As you're someone very familiar with Ansible, what are your thoughts on it in regards to IBM's imminent complete absorption of RedHat? I can't imagine Ansible, or any other RedHat product, doing well with that.


I wouldn’t say I’m very familiar. I don’t use it extensively anymore, and not at all at work. But in general, I can’t imagine a way in which IBM’s own corporate culture could contribute positively to any FOSS projects if they removed the RedHat veneer. Not saying it’s impossible, just that my imagination is more limited than the idea requires.


IBM has been, and still is, a big contributor to a bunch of Eclipse projects, as their own tools build on those. The people there were both really skilled, friendly and professional. Different divisions and departments can have huge cultural differences and priorities, obviously, but “IBM” doesn’t automatically mean bad for OSS projects.


I'm sure some of RedHat stuff will end up in the Apache Foundation once IBM realizes it has no interest in them.


There isn't even a Nonstop port of GCC yet. Today, Nonstop is big-endian x86-64, so tacking this onto the existing backend is going to be interesting.


That platform doesn’t support GCC either.


Isn’t that’s what’s happening? The post says they’re moving forward.


[flagged]


On the other hand: why should the entire open-source world screech to a halt just because some new development is incompatible with the ecosystem of a proprietary niche system developed by a billion-dollar freeloader?

HPE NonStop doesn't need to do anything with Rust, and nobody is forcing them to. They have voluntarily chosen to use an obscure proprietary toolchain instead of contributing to GCC or LLVM like everyone else: they could have gotten Rust support for free, but they believed staying proprietary was more important.

Then they chose to make a third-party project (Git) a crucial part of that ecosystem, without contributing time and effort into maintaining it. It's open source, so this is perfectly fine to do. On the other hand, it also means they don't get a say in how the project is developed, and what direction it will take in the future. But hey, they believed saving a few bucks was more important.

And now it has blown up in their face, and they are trying to control the direction the third-party project is heading by playing the "mission-critical infrastructure" card and claiming that the needs of their handful of users is more important than the millions of non-HPE users.

Right now there are three options available to HPE NonStop users:

1. Fork git. Don't like the direction it is heading? Then just do it yourself. Cheapest option short-term, but it of course requires investing serious developer effort long-term to stay up-to-date, rather than just sending the occasional patch upstream.

2. Port GCC / LLVM. That's usually the direction obscure platforms go. You bite the bullet once, but get to reap the benefits afterwards. From the perspective of the open-source community, if your platform doesn't have GCC support it might as well not exist. If you want to keep freeloading off of it, it's best to stop fighting this part. However, it requires investing developer effort - especially when you want to maintain a proprietary fork due to Business Reasons rather than upstreaming your changes like everyone else.

3. Write your own proprietary snowflake Rust compiler. You get to keep full control, but it'll require a significant developer effort. And you have to "muck around" with Rust, of course.

HPE NonStop and its ecosystem can do whatever it wants, but it doesn't get to make demands just because their myopic short-term business vision suddenly leaves them having to spend effort on maintaining it. This time it is caused by Git adopting Rust, but it will happen again. Next week it'll be something like libxml or openssl or ssh or who-knows-what. Either accept that breakage is inevitable when depending on third-party components, or invest time into staying compatible with the ecosystem.


At this point maybe it's time to let them solve the problem they've created for themselves by insisting on a closed C compiler in 2025.


[flagged]


>> insisting on a closed C compiler in 2025.

> Everything should use one compiler, one run-time and one package manager.

If you think that calling out closed C compilers is somehow an argument for a single toolchain for all things, I doubt there's anything I can do to help educate you about why this isn't the case. If you do understand and are choosing to purposely misinterpret what I said, there are a lot of much stronger arguments you could make to support your point than that.

Even ignoring all of that, there's a much larger point that you've kind of glossed over here by:

> The shitheads who insist on using alternative compilers and platforms don't deserve tools

There's frequently discussion around the expectations between open source project maintainers and users, and in the same way that users are under no obligation to provide compensation for projects they use, projects don't have any obligation to provide support indefinitely for any arbitrary set of circumstances, even if they happen to for a while. Maintainers will sometimes weigh the tradeoff between supporting a minority of users and making a technical change they feel will help them maintain the project better in the long term differently than the users would. It's totally valid to criticize those decisions on technical grounds, but it's worth recognizing that these types of choices are inevitable, and there's nothing specific about C or Rust that will change that in the long run. Even with a single programming language within a single platform, the choice of what features to implement or not implement could make or break whether a tool works for someone's specific use case. At the end of the day, there's a finite amount of work people spend on a given project, and there needs to be a decision about what to spend it on.


For various libs, you provide a way to build without it. If it's not auto-detected, or explicitly disabled via the configure command line, then don't try to use it. Then whatever depends on it just doesn't work. If for some insane reason git integrates XML and uses libxml for some feature, let it build without the feature for someone who doesn't want to provide libxml.

> At the end of the day, there's a finite amount of work people spend on a given project

Integrating Rust shows you have too much time on your hands; the people who are affected by that, not necessarily so.


> Integrating Rust shows you have too much time on your hands; the people who are affected by that, not necessarily so.

As cited elsewhere in this thread, the person making this proposal on the mailing list has been involved in significant contributions to git in the past, so I'd be inclined to trust their judgment about whether it's a worthwhile use of their time in the absence of evidence to the contrary. If you have something that would indicate this proposal was made in bad faith, I'd certainly be interested to see it, but otherwise, I don't see how you can make this claim other than as your own subjective opinion. That's fine, but I can't say I'm shocked that the people actually making the decisions on how to maintain git don't find it convincing.


Weighted by user count for a developer tool like Git, Rust is a more portable language than the combination of C and bash currently in use.


> There's at least one proprietary platform that supports Git built via a vendor-provided C compiler, but for which no public documentation exists and therefore no LLVM support is possible.

That's fine. The only impact is that they won't be able to use the latest and greatest release of Git.

Once those platforms work on their support for Rust they will be able to jump back to the latest and greatest.


It's sad to see people be so nonchalant about potentially killing off smaller platforms like this. As more barriers to entry are added, competition is going to decrease, and the software ecosystem is going to keep getting worse. First you need a lib C, now you need lib C and Rust, ...

But no doubt it's a great way for the big companies funding Rust development to undermine smaller players...


It's kind of funny to see f-ing HPE with 60k employees somehow being labeled as the poor underdog that should be supported by the open-source community for free and can't be expected to take care of software running on their premium hardware for banks etc by themselves.


I think you misread my comment because I didn't say anything like that.

In any case HPE may have 60k employees but they're still working to create a smaller platform.

It actually demonstrates the point I was making. If a company with 60k employees can't keep up then what chance do startups and smaller companies have?


> If a company with 60k employees can't keep up then what chance do startups and smaller companies have?

They build on open source infrastructure like LLVM, which a smaller company will probably be doing anyway.


Sure, but let's not pretend that doesn't kill diversity and entrench a few big players.


The alternative is killing diversity of programming languages, so it's hard to win either way.


HP made nearly $60b last year. They can fund the development of the tools they need for their 50 year old system that apparently powers lots of financial institutions. It's absurd to blame volunteer developers for not wanting to bend over backwards, just to ensure these institutions have the absolute latest git release, which they certainly do not need.


Oh they absolutely can, they just choose not to. To just make some tools work again there's also many slightly odd workarounds one could choose over porting the Rust compiler.


> It's sad to see people be so nonchalant about potentially killing off smaller platforms like this.

Your comment is needlessly dramatic. The only hypothetical impact this has is that whoever uses these platforms won't have upgrades until they do something about it, and the latest and greatest releases will only run if the companies behind these platforms invest in their maintenance.

This is not a good enough reason to prevent the whole world from benefiting from better tooling. This is not a lowest-common-denominator thing. Those platforms went out of their way to lag in interoperability, and this is the natural consequence of those decisions.


Maybe they can resurrect the C backend for LLVM and run that through their proprietary compilers?

It's probably not straightforward but the users of NonStop hardware have a lot of money so I'm sure they could find a way.


Rust has an experimental C backend of its own as part of rustc_codegen_clr https://github.com/FractalFir/rustc_codegen_clr . Would probably work better than trying to transpile C from general LLVM IR.


Some people have demonstrated portability using the WASM target, translating that to C89 via w2c2, and then compiling _that_ for the final target.


Given that the maintainer previously said they had tried to pay to get GCC and LLVM ported multiple times, all of which failed, money doesn’t seem to have helped.


Surely the question is how much they tried to pay? Clearly the answer is "not enough".


I mean at one point I had LLVM targeting Xbox 360, PS3, and Wii so I'm sure it's possible, it just needs some imagination and elbow grease :)


Why should free software projects bend over backwards to support obscure proprietary platforms? Sounds absurd to me


Won't someone think of the financial sector?


Reminds me of a conversation about TLS and how a certain bank wanted to insert a backdoor into all of TLS for their convenience.


Sucks to be that platform?

Seriously, I guess they just have to live without git if they're not willing to take on support for its tool chain. Nobody cares about NonStop but the very small number of people who use it... who are, by the way, very well capable of paying for it.


I strongly agree. I read some of the counter-arguments, like this will make it too hard for NonStop devs to use git, and maybe make them not use it at all. Those don’t resonate with me at all. So what? What value does their using git provide to the git developers? I couldn’t care less if NonStop devs can use my own software at all. And since they’re exclusively at giant, well-financed corporations, they can crack open that wallet and pay someone to do the hard work if it means that much to them.


"You have to backport security fixes for your own tiny platform because your build environment doesn't support our codebase or make your build environment support our codebase" seems like a 100% reasonable stance to me


> your build environment doesn't support our codebase

If that is due to the build environment deviating from the standard, then I agree with you. However, when it's due to the codebase deviating from the standard, then why blame the build environment developers for expecting codebases to adhere to standards? That's the whole point of standards.


Is there a standard that all software must be developed in ANSI C that I missed, or something? The git developers are saying - we want to use Rust because we think it will save us development effort. NonStop people are saying we can't run this on our platform. It seems to me someone at git made the calculus: the amount that NonStop is contributing is less than what we save going to Rust. Unless NonStop has a support contract with git developers that they would be violating, it seems to me the NonStop people want to have their cake and eat it too.

According to git docs they seem to try to make a best effort to stick to POSIX but without any strong guarantees, which this change seems to be entirely in line with: https://github.com/git/git/blob/master/Documentation/CodingG...


An important point of using C is to write software that adheres to a decades old very widespread standard. Of course developers are free to not do that, but any tiny bit of Rust in the core or even in popular optional code amounts to the same as not using C at all, i.e. only using Rust, as far as portability is concerned.

If your codebase used to conform to a standard and the build environment relies on that standard, and now your codebase doesn't anymore, then it's not the build environment that deviates from the standard, it's the codebase that breaks it.


Had you been under the impression that any of these niche platforms conform to any common standard other than their own?

Because they don’t. For instance, if they were fully POSIX compliant, they’d probably already have LLVM.


I expect them to conform to the C standard or to deal with the deviation. I don't think POSIX compliance is of much use on an embedded target.


I’m sold.


How is this git's concern?


They enjoy being portable and like things to stay that way so when they introduce a new toolchain dependency which will make it harder for some people to compile git, they point it out in their change log?


I don't think "NonStop" is a good gauge of portability.

But, I wasn't arguing against noting changes in a changelog, I'm arguing against putting portability to abstruse platforms before quality.


I don’t think staying portable means you have to make concessions on quality. It merely limits your ability to introduce less portable dependencies.

But even then, Git doesn’t mind losing some platforms when they want to move forward on something.


Git's main concern should, of course, be getting Rust in, in some shape or form.


I am curious, does anyone know what is the use case that mandates the use of git on NonStop? Do people actually commit code from this platform? Seems wild.


Nonstop is still supported? :o


Among ecosystems based on YAML-formatted configuration, defaulting to YAML 1.1 is nearly universal. The heyday of YAML was during the YAML 1.1 era, and those projects can't change their YAML parsers' default version to 1.2 without breaking extant config files.

By the time YAML 1.2 had been published and implementations written, greenfield projects were using either JSON5 (a true superset of JSON) or TOML.

  > While JSON numbers are grammatically simple, they're almost always distinct
  > from how you'd implement numbers in any language that has JSON parsers,
  > syntactically, exactness and precision-wise.
For statically-typed languages the range and precision is determined by the type of the destination value passed to the parser; it's straightforward to reject (or clamp) a JSON number `12345` being parsed into a `uint8_t`.
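
For example, here's a sketch of that rejection in Rust (assuming serde and serde_json, with a `u8` field standing in for the `uint8_t` case; other strongly typed deserializers behave similarly):

  use serde::Deserialize;

  #[derive(Deserialize, Debug)]
  struct Config {
      retries: u8,
  }

  fn main() {
      // 12345 does not fit in a u8, so deserialization fails with an
      // "invalid value" error instead of silently truncating.
      let result: Result<Config, _> = serde_json::from_str(r#"{"retries": 12345}"#);
      assert!(result.is_err());
  }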

For dynamically-typed languages there's less emphasis on performance, so using an arbitrary-precision numeric type (Python's Decimal, Go's "math/big" types) provides lossless decoding.

The only language I know of that really struggles with JSON numbers is, ironically, JavaScript -- its BigInt type is relatively new and not well integrated with its JSON API[0], and it doesn't have an arbitrary-precision type.

[0] See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe... for the incantation needed to encode a BigInt as a number.


Arguably the root problem was lack of user namespacing; the incident would have been less likely to happen in the first place if the packages in question were named "~akoculu/left-pad" and "~akoculu/kik".


That's right, and probably a lot fewer people would have used left-pad because it looks like a package for a specific org.


I think that statement is parsed as "npm was the first incredibly accessible package manager for [server-side JavaScript, which at the time was] an emergent popular technology,"


I get that, but there was plenty of prior art to learn from anyway.


You wrote "water has great compressive strength", sk5t directly (and correctly) refuted that claim. What is there to think about?

Are you confusing "compressive strength" with compressibility?


I think his point is that things very rarely experience purely compressive forces. Just being compressed induces tension in other directions, like water being squished out between your clapping hands. So even though water has great compressive strength, in practice this isn't very useful.


Exactly.

Many materials have good compressive strength simply by being relatively incompressible.

But most loads have a (troublesome) tensile component. Fundamentally, the ability of a rigid material to resist deformation (in the most general sense) is what is most important, and that requires tensile strength.

See this comment elsewhere in this sub-thread that explains it probably better than I did: https://news.ycombinator.com/item?id=43904800


Look up the Wikipedia definition [1] of compressive strength:

> In mechanics, compressive strength (or compression strength) is the capacity of a material or structure to withstand loads tending to reduce size (compression). It is opposed to tensile strength which withstands loads tending to elongate, resisting tension (being pulled apart).

Google search AI summary states:

> Compressive strength is a material's capacity to resist forces that try to reduce its volume or cause deformation.

To be fair, compressive strength is a complex measure. Compressibility is only one aspect of it. See this Encyclopedia Britannica article [2] about how compressive strength is tested.

[1] https://en.wikipedia.org/wiki/Compressive_strength

[2] https://www.britannica.com/technology/compressive-strength-t...


Please tell me how to make a water prism to test compressive strength and deformation resistance. Water is an incompressible fluid; that is different.

These are well understood terms in the field. Unfortunately, this illustrates the bounds of AI in subfields like materials science: it confuses people.


I'm not saying water meets the strict definition of a material with high compressive strength (though it does meet some criteria, since it resists forces that attempt to decrease its volume well). I am just using it as an extreme example of the issues with the concept of compressive strength.


lower the temperature


Nothing that you wrote here indicates you understand what is being discussed.

Water has very low compressive strength, so low that it freely deforms under its own weight. You can observe this by pouring some water onto a table. This behavior is distinct from materials with high compressive strength, such as wood or steel.

(I say "very low" instead of "zero" because surface tension could be considered a type of compressive strength at small scales, such as a single drop of water on a hydrophobic surface)


Your comment betrays a lack of comprehension and understanding. Please read my comments and the linked definitions carefully.

See this comment elsewhere in this sub-thread that explains it probably better than I did: https://news.ycombinator.com/item?id=43904800


  > use SECCOMP_SET_MODE_STRICT to isolate the child process. But at that
  > point, what are you even doing? Probably nothing useful.
The classic example of a fully-seccomp'd subprocess is decoding / decompression. If you want to execute ffmpeg on untrusted user input then seccomp is a sandbox that allows full-power SIMD, and the code has no reason to perform syscalls other than read/write to its input/output stream.

On the client side there's font shaping, PDF rendering, image decoding -- historically rich hunting grounds for browser CVEs.
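
A rough sketch of the shape of such a worker (assuming Linux and the libc crate; a real decoder would pre-allocate all of its buffers before entering strict mode, since even malloc can be blocked afterwards):

  // After SECCOMP_SET_MODE_STRICT the only allowed syscalls are read(),
  // write(), _exit(), and sigreturn() -- enough for a pure decode loop
  // over stdin/stdout, and nothing else.
  fn main() {
      // Pre-allocate everything before locking down: strict mode can make
      // further allocation (brk/mmap) fail.
      let mut input = vec![0u8; 1 << 20];
      let mut output = vec![0u8; 1 << 20];

      let rc = unsafe {
          libc::prctl(
              libc::PR_SET_SECCOMP,
              libc::SECCOMP_MODE_STRICT as libc::c_ulong,
              0,
              0,
              0,
          )
      };
      assert_eq!(rc, 0, "failed to enter seccomp strict mode");

      loop {
          let n = unsafe {
              libc::read(0, input.as_mut_ptr() as *mut libc::c_void, input.len())
          };
          if n <= 0 {
              break;
          }
          // Placeholder "decode": a real worker would run the decoder here,
          // touching only the buffers allocated above.
          let n = n as usize;
          output[..n].copy_from_slice(&input[..n]);
          unsafe {
              libc::write(1, output.as_ptr() as *const libc::c_void, n);
          }
      }

      // Strict mode whitelists plain exit(), not exit_group(), so bypass the
      // normal runtime shutdown path.
      unsafe { libc::syscall(libc::SYS_exit, 0) };
  }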


The classic example of a fully-seccomp'd subprocess is decoding / decompression.

Yes. I've run JPEG 2000 decoders in a subprocess for that reason.


Well, it seems that lately this kind of task wants to write/mmap to a GPU, and poke at font files and interpret them.


I flagged this for being LLM-generated garbage; original comment below. Any readers interested in benchmarking programming language implementations should visit https://benchmarksgame-team.pages.debian.net/benchmarksgame/... instead.

---

The numbers in the table for C vs Rust don't make sense, and I wasn't able to reproduce them locally. For a benchmark like this I would expect to see nearly identical performance for those two languages.

Benchmark sources:

https://github.com/naveed125/rust-vs/blob/6db90fec706c875300...

https://github.com/naveed125/rust-vs/blob/6db90fec706c875300...

Benchmark process and results:

  $ gcc --version
  gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  $ gcc -O2 -static -o bench-c-gcc benchmark.c
  $ clang --version
  Ubuntu clang version 14.0.0-1ubuntu1.1
  $ clang -O2 -static -o bench-c-clang benchmark.c
  $ rustc --version
  rustc 1.81.0 (eeb90cda1 2024-09-04)
  $ rustc -C opt-level=2 --target x86_64-unknown-linux-musl -o bench-rs benchmark.rs

  $ taskset -c 1 hyperfine --warmup 1000 ./bench-c-gcc
  Benchmark 1: ./bench-c-gcc
    Time (mean ± σ):       3.2 ms ±   0.1 ms    [User: 2.7 ms, System: 0.6 ms]
    Range (min … max):     3.2 ms …   4.1 ms    770 runs

  $ taskset -c 1 hyperfine --warmup 1000 ./bench-c-clang
  Benchmark 1: ./bench-c-clang
    Time (mean ± σ):       3.5 ms ±   0.1 ms    [User: 3.0 ms, System: 0.6 ms]
    Range (min … max):     3.4 ms …   4.8 ms    721 runs

  $ taskset -c 1 hyperfine --warmup 1000 ./bench-rs
  Benchmark 1: ./bench-rs
    Time (mean ± σ):       5.1 ms ±   0.1 ms    [User: 2.9 ms, System: 2.2 ms]
    Range (min … max):     5.0 ms …   7.1 ms    507 runs

Those numbers also don't make sense, but in a different way. Why is the Rust version so much slower, and why does it spend the majority of its time in "system"?

Oh, it's because benchmark.rs is performing a dynamic memory allocation for each key. The C version uses a buffer on the stack, with fixed-width keys. Let's try doing the same in the Rust version:

  --- benchmark.rs
  +++ benchmark.rs
  @@ -38,22 +38,22 @@
   }
 
   // Generates a random 8-character string
  -fn generate_random_string(rng: &mut Xorshift) -> String {
  +fn generate_random_string(rng: &mut Xorshift) -> [u8; 8] {
       const CHARSET: &[u8] = b"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
  -    let mut result = String::with_capacity(8);
  +    let mut result = [0u8; 8];
   
  -    for _ in 0..8 {
  +    for ii in 0..8 {
           let rand_index = (rng.next() % 62) as usize;
  -        result.push(CHARSET[rand_index] as char);
  +        result[ii] = CHARSET[rand_index];
       }
   
       result
   }
   
   // Generates `count` random strings and tracks their occurrences
  -fn generate_random_strings(count: usize) -> HashMap<String, u32> {
  +fn generate_random_strings(count: usize) -> HashMap<[u8; 8], u32> {
       let mut rng = Xorshift::new();
  -    let mut string_counts: HashMap<String, u32> = HashMap::new();
  +    let mut string_counts: HashMap<[u8; 8], u32> = HashMap::with_capacity(count);
   
       for _ in 0..count {
           let random_string = generate_random_string(&mut rng);
Now it's spending all its time in userspace again, which is good:

  $ taskset -c 1 hyperfine --warmup 1000 ./bench-rs
  Benchmark 1: ./bench-rs
    Time (mean ± σ):       1.5 ms ±   0.1 ms    [User: 1.3 ms, System: 0.2 ms]
    Range (min … max):     1.4 ms …   3.2 ms    1426 runs
 
... but why is it twice as fast as the C version?

---

I go to look in benchmark.c, and my eyes are immediately drawn to this weird bullshit:

  // Xorshift+ state variables (64-bit)
  uint64_t state0, state1;

  // Xorshift+ function for generating pseudo-random 64-bit numbers
  uint64_t xorshift_plus() {
      uint64_t s1 = state0;
      uint64_t s0 = state1;
      state0 = s0; 
      s1 ^= s1 << 23; 
      s1 ^= s1 >> 18; 
      s1 ^= s0; 
      s1 ^= s0 >> 5;
      state1 = s1; 
      return state1 + s0; 
  }
That's not simply a copy of the xorshift+ example code on Wikipedia. Is there any human in the world who is capable of writing xorshift+ but is also dumb enough to put its state into global variables? I smell an LLM.

A rough patch to put the state into something the compiler has a hope of optimizing:

  --- benchmark.c
  +++ benchmark.c
  @@ -18,25 +18,35 @@
   StringNode *hashTable[HASH_TABLE_SIZE]; // Hash table for storing unique strings
   
   // Xorshift+ state variables (64-bit)
  -uint64_t state0, state1;
  +struct xorshift_state {
  +       uint64_t state0, state1;
  +};
   
   // Xorshift+ function for generating pseudo-random 64-bit numbers
  -uint64_t xorshift_plus() {
  -    uint64_t s1 = state0;
  -    uint64_t s0 = state1;
  -    state0 = s0;
  +uint64_t xorshift_plus(struct xorshift_state *st) {
  +    uint64_t s1 = st->state0;
  +    uint64_t s0 = st->state1;
  +    st->state0 = s0;
       s1 ^= s1 << 23;
       s1 ^= s1 >> 18;
       s1 ^= s0;
       s1 ^= s0 >> 5;
  -    state1 = s1;
  -    return state1 + s0;
  +    st->state1 = s1;
  +    return s1 + s0;
   }
   
   // Function to generate an 8-character random string
   void generate_random_string(char *buffer) {
  +    uint64_t timestamp = (uint64_t)time(NULL) * 1000;
  +    uint64_t state0 = timestamp ^ 0xDEADBEEF;
  +    uint64_t state1 = (timestamp << 21) ^ 0x95419C24A637B12F;
  +    struct xorshift_state st = {
  +        .state0 = state0,
  +        .state1 = state1,
  +    };
  +
       for (int i = 0; i < STRING_LENGTH; i++) {
  -        uint64_t rand_value = xorshift_plus() % 62;
  +        uint64_t rand_value = xorshift_plus(&st) % 62;
   
           if (rand_value < 10) { // 0-9
               buffer[i] = '0' + rand_value;
  @@ -113,11 +123,6 @@
   }
   
   int main() {
  -    // Initialize random seed
  -    uint64_t timestamp = (uint64_t)time(NULL) * 1000;
  -    state0 = timestamp ^ 0xDEADBEEF; // Arbitrary constant
  -    state1 = (timestamp << 21) ^ 0x95419C24A637B12F; // Arbitrary constant
  -
       double total_time = 0.0;
   
       // Run 3 times and measure execution time
  
and the benchmarks now make slightly more sense:

  $ taskset -c 1 hyperfine --warmup 1000 ./bench-c-gcc
  Benchmark 1: ./bench-c-gcc
    Time (mean ± σ):       1.1 ms ±   0.1 ms    [User: 1.1 ms, System: 0.1 ms]
    Range (min … max):     1.0 ms …   1.8 ms    1725 runs
  
  $ taskset -c 1 hyperfine --warmup 1000 ./bench-c-clang
  Benchmark 1: ./bench-c-clang
    Time (mean ± σ):       1.0 ms ±   0.1 ms    [User: 0.9 ms, System: 0.1 ms]
    Range (min … max):     0.9 ms …   1.4 ms    1863 runs
But I'm going to stop trying to improve this garbage, because on re-reading the article, I saw this:

  > Yes, I absolutely used ChatGPT to polish my code. If you’re judging me for this,
  > I’m going to assume you still churn butter by hand and refuse to use calculators.
  > [...]
  > I then embarked on the linguistic equivalent of “Google Translate for code,”
Ok so it's LLM-generated bullshit, translated into other languages either by another LLM, or by a human who doesn't know those languages well enough to notice when the output doesn't make any sense.


> my eyes are immediately drawn to this weird bullshit

Gave me a good chuckle there :)

Appreciate this write up; I'd even say your comment deserves its own article, tbh. Reading your thought process and how you addressed the issues was interesting. A lot of people don't know how to identify or investigate weird bullshit like this.


So glad I had read the 2nd agreement by Don Miguel Ruiz lol.

