PCHs leave a sour taste in my mouth after working on a project that very liberally added commonly imported headers to one huge PCH. In practice, it meant each TU was indirectly including many more headers than needed, and it was harder for humans (or IDEs) to reason about the real dependency chain. While builds were faster, they also started using much more memory.
You can also end up needing to rebuild the world if you touch a header that is part of the PCH, even if it isn't really needed by all the targets.
Modules and header units were supposed to solve these problems a lot more cleanly, but they're still not well supported.
A company using React Native recently used bounties to solicit bug-fixes to RN issues their app was hitting.
A lot of positives came out of it, and it did improve framework quality. There are challenges with the model, though. More changes than not are high quality, but some aren't, or are just inherently risky, and it's especially tricky to discern when first-time contributors touch systems that might no longer have an active maintainer. Unlike someone employed full-time, there isn't the opportunity to establish long-term trust, and the contributor might not be around to support their change if something goes wrong.
A lot of changes fell through the cracks, or needed maintainer time that wasn't there, which creates a bad situation where someone could have done great work but isn't getting paid. Knowing that someone is losing money if you don't accept a PR can also be guilt-inducing as a maintainer.
That's my experience with bounties: someone does the job because they get paid, not because they have a particular interest in the issue, and then they instantly move on, leaving the submission to rot.
I apologize for the repeated posts. The reason I posted it 6 times is that I aim to announce every release and significant commit. The reason my entire history is centered around this benchmark is because I wanted to introduce my project to the community, potentially with some bias. I began my Hacker News journey at that time and wanted to share what I was working on.
Generally, these overly "proactive" moves to artificially gain attention for your own GitHub projects make me less likely to test it out, so I'd rather stick with the mainstream options.
Yeah... this, combined with the fact that this benchmark happens to rank their cloud offering the highest by a wide margin, sounds a bit like they are submitting it to market themselves.
Worth pointing out: there has been quite a bit of contention around this change, both technical objections and some accusations of plagiarism/miscrediting here. https://github.com/ggerganov/llama.cpp/pull/711
> This PR was written in collaboration with @slaren. This PR is also rebased on PR #586 so please do not squash merge! Use either merge or rebase.
jart made sure that the other user got credit, in addition to making sure that their name was properly attributed in the commit log. Given all this, it feels like the drama shouldn't exist? Like, if there's an issue with attribution, it's not because of bad faith, and I feel like a good-faith conversation could have just resolved this, instead of bringing in trolls.
That's not the original PR. jart was working on a malloc() approach that didn't work, and slaren wrote all the code actually doing mmap, which jart then rebased into a random new PR and changed to add an unnecessary version change, magic numbers, a conversion tool, and WIN32 support, when that was already working in the draft PR. https://archive.ph/Uva8c
slaren replied to jart on HN asking her why she was doing and saying those things, and she didn't bother to reply to him, despite replying to others in that subthread within minutes. https://archive.ph/zCfiJ
Hmm, based on what you've quoted here and knowing nothing else but a few messages on AI Twitter I would invest in jart.
This is BillG-style product skill -- there is a ton of work that goes into representing a piece of software as something important and valuable that people should buy into.
Jart is a pretty exceptional engineer, even if she wrote this patch single-handedly it would hardly be a footnote in her list of professional accomplishments. This is the author of Cosmopolitan libc, redbean and APE we're talking about, after all.
That being said, it's important to attribute work properly. It can be easy to mix things up (eg. "my patch" is excusable), but repeatedly insisting on authorship when you're not the author of the change just seems disingenuous. I'm sure it was in good faith, but since they didn't address the issue or clear anything up, it's come to this.
Dramatic, and hardly the conclusion people wanted to the story of a free performance improvement. It's not entirely contrived though, and I think the maintainer handled this exceptionally well given the circumstances.
I'm all for detracting from suspicious authors, but it's unlikely Justine just steals other people's code whole cloth. She's been an active community member for a while, and wrote a lot of impressive software before LLMs and script kiddies democratized the whole process.
In this specific instance, jart had a communication error that she failed to clarify, and so things compounded from there. The part that she didn't author is clearly defined in Git, and the most plausible explanation is an honest mistake. Assuming ill intent requires you to ignore the original context of the disagreement and focus on the outrage, which pretty much says it all.
That being said, I'd love to hear what evidence you have to the contrary. Maybe you've got a link to an FTP server from 2001 with the Blinkenlights source code on it, I can't say for sure. A fraud probably doesn't write in-depth patch breakdowns on their personal blog for fun, though.
> > This PR was written in collaboration with @slaren. This PR is also rebased on PR #586 so please do not squash merge! Use either merge or rebase.
I read that PR (didn't click any links) and posted a "Great work" to jart here on HN. The reason I did that is precisely because those final lines in the PR came across as an upfront acknowledgement that some people helped out. I also got the impression that jart was a co-owner of the project with all the "we"s that were thrown around.
If I was writing that PR, it would be something like "this PR consolidates slaren's mmap approach with additional work done for ... by myself". After hearing about the drama, actually reading slaren's PR, and reviewing jart's comments in issues and the PR and the hn show and tell, I am now convinced this is someone who wants to steal other people's thunder. Heck, even this front page article is yet another PR stunt. I suspect "faster fork of llama.cpp" posts will follow.
Georgi Gerganov remains for me the hacker hero here as far as LLMs are concerned -- mmap is kiddie stuff to be frank, but anyone who gets whisper and llama to work on my laptop with a handful of files (many thanks to you sir) has my technical respect. And I think he has made the right call regarding the project.
Also worth pointing out that you can follow the thread’s link to Rentry, which links to a 4chan (?) archived thread, where you can see anons getting worked up over jart being a trans internet celebrity. And unless you’re playing dumb, you have to admit they were looking for an excuse to troll jart. Unless you seriously want me to believe they were all that mad about… mmap
I can understand these folks struggling with what mmap is actually doing. But this isn't a new discussion about the merits of mmap versus file-based I/O, etc. That said, many of the comments are quite wrong.
I feel significantly dumber for reading that merge request.
The one thing to understand is that the performance implications of mmap are subtle and only work when you have much more RAM than the files you're mapping in.
> only work when you have much more RAM than the files you're mapping in.
Really depends on what you're doing, like memory access patterns. I've definitely seen scenarios, when mapping hundreds of gigabytes of data on dozens of gigabytes of RAM, where mmap has been an almost absurd performance boost over traditional I/O, both immediately and asymptotically, as all the most frequently accessed data ends up in cache and the least accessed data is paged out.
I don't disagree with the subtlety part though. It's very difficult to reason about I/O performance in general. Modern systems are like an onion of hidden performance optimization tricks and caching layers (both in software and hardware).
Yeah, and on top of that, different systems (software and hardware combos) behave differently, so I can see the performance of this depending on the implementation of mmap on the system and the implementation of caches and virtual memory on the architecture. When I've debugged stuff like this, it's either been for myself, in which case I know what combo I'm running on, or it's been for work, where we know which combinations we target and we run regression tests to observe perf implications.
Yes, I have 35 years of experience with UNIX and used to use mmapping with BLAST, a sequence search tool, as well as my own codes.
I'll repeat myself: mmap is subtle. If what you mmap is larger than your host RAM, only some of the pages will be loaded at any time, and depending on access patterns, that can lead to significant paging.
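To make the access-pattern point concrete, here's a rough sketch (plain POSIX C, nothing from llama.cpp, function name made up) of hinting the kernel when the mapping may be bigger than RAM:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a (possibly huge) read-only file and tell the kernel we'll scan it
       sequentially, so it reads ahead and can evict pages behind us. */
    static void *map_for_scan(const char *path, size_t *len_out)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return NULL; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return NULL; }

        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                        /* the mapping stays valid after close */
        if (p == MAP_FAILED) { perror("mmap"); return NULL; }

        /* For random lookups you'd hint POSIX_MADV_RANDOM instead, so the
           kernel doesn't waste memory on readahead it will never use. */
        posix_madvise(p, st.st_size, POSIX_MADV_SEQUENTIAL);

        *len_out = (size_t)st.st_size;
        return p;
    }

Whether the hint actually helps depends on the kernel, which is kind of the point about subtlety.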
I may have parsed your statement incorrectly, but I'm assuming you are talking about the copy of data when using either mmap or file I/O (memcpy versus write). Whether you do file I/O or mmap, there's going to be a copy. With files, the copy occurs within kernel space with data being copied into the pages in the buffer cache, with mmap the copy occurs in userspace with data being copied into the address space. Swapping can occur in the buffer cache or mmap, which is why so many databases implement their own buffer cache to ensure specific data isn't flushed, leaving them in an inconsistent state.
> With files, the copy occurs within kernel space with data being copied into the pages in the buffer cache, with mmap the copy occurs in userspace with data being copied into the address space.
There is no copy with mmap; the page is either unwritable or CoW. There's always a copy with read(). (But read() can still be faster and more memory-efficient nevertheless.)
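To sketch what that means in code (just an illustration, assuming a read-only file already opened as fd, not anyone's actual patch):

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* read(): the data already sits in the kernel page cache, and the kernel
       copies it a second time into this user-space buffer. */
    static char *load_with_read(int fd, size_t len)
    {
        char *buf = malloc(len);
        if (!buf) return NULL;
        size_t off = 0;
        while (off < len) {
            ssize_t n = read(fd, buf + off, len - off);
            if (n <= 0) { free(buf); return NULL; }
            off += (size_t)n;
        }
        return buf;
    }

    /* mmap(): no copy; the pointer aliases the page-cache pages themselves.
       With PROT_READ they're simply unwritable; a MAP_PRIVATE write would
       fault and copy just the touched page (CoW). */
    static const char *load_with_mmap(int fd, size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        return p == MAP_FAILED ? NULL : (const char *)p;
    }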
You are right, if you are directly modifying the mmapped region. I always internally model my data as staging my changes to be synchronized to the mmapped region, so that's my mistake there.
> the page is either unwritable or CoW.
This is not universally true, or maybe I'm confused by this statement. MAP_SHARED exists, but maybe you are referencing a specific kernel's implementation of how it achieves coherence between file-backed shared memory regions in two processes? I'm not sure.
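For what it's worth, the MAP_SHARED behaviour I have in mind is something like this toy sketch (POSIX, file name made up): the child's store lands in the shared, file-backed pages, so the parent sees it immediately.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("shared.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, 4096) < 0) { perror("setup"); return 1; }

        /* MAP_SHARED: stores go to the shared, file-backed pages, visible to
           every process mapping the file (and eventually written back). */
        void *m = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (m == MAP_FAILED) { perror("mmap"); return 1; }
        volatile char *p = m;

        if (fork() == 0) {       /* child writes through its inherited mapping */
            p[0] = 'X';
            _exit(0);
        }
        wait(NULL);
        printf("parent sees: %c\n", p[0]);   /* prints 'X'; MAP_PRIVATE wouldn't */
        return 0;
    }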
> Darwin kernel does though.
Sure, we can always point to a kernel that has implemented some feature or another, which is why I said typically you don't see it.
To be entirely honest, I'm not sure why the kernel doesn't use better routines here; I think on ARM at least it saves the entire NEON state on context switch…
Unfortunately Justine has attracted a peculiar fanbase+haterbase. As their numbers swell, the collective intelligence and technical understanding diminish.
So the discussions end up gravitating towards weird drama. I wish you hadn't linked this thread. There's going to be a bunch of stupid comments here as well about how great/awful jart is.
I'm not a fan or a hater, I didn't even know who this person was until this thread.
Does the change deserve a blog post or wild claims like "llama.cpp is 100x faster and uses half the memory!"? No. The original PR looks like a decent addition, but the blog post reads as incredibly narcissistic (e.g. lots of language like "We spent several weeks volunteering" and "our project"), uh, whatever. It also breaks backwards compatibility when there's no technical reason it couldn't have been optional or put behind a feature flag, plus a ton of condescending language in the PR. Not really the kind of work I'd be proud of or would be advertising in a blog post.
The claim that it uses half the memory was probably an honest mistake. The ensuing disappointment that it did not in fact halve memory usage, and the drama, attracted trolls and white knights, and it's icky. The discussion around mmap, I suppose, is subtle, and when emotion abounds it can no longer be had. :/
Is mmap really that broken on Windows? Or is the poster just confused that the data stays in the page cache? But that’s what the page cache does - that memory will be used for other things if needed, but if the memory is not needed it might as well keep the old data in cache.
No, mmap on Windows is fine. A generous, charitable statement would be that the OP on that thread is very confused, but based on some comments elsewhere on this thread about jart attracting a chorus of haters, it seems more likely that they're just trolling.
There's a weird breed of programmer who only wants to see the free memory column in top maximized. I bought all this RAM and I want to make sure none of it is used, in case I want to use it later.
I maybe should not be surprised, given that we live in the era of Unity and Electron, but using mmap() to load large files should not be seen as rocket science.
And this is basically available on almost any platform with an MMU and a kernel.
Using memory mapped files is not always the right answer.
Memory mapped files have their disadvantages. The biggest disadvantage is that any disk read error (or yanking the USB drive) becomes an access violation exception (also known as a crash), just as if you had read from a bad pointer. You need to have robust exception handling, which is a taller order than just checking a return value.
Another disadvantage is that even when you have your pages mapped into memory, calling the page fault handler and getting your page has a cost of ~1200 CPU cycles on Windows just to do the User<->Kernel mode transition, plus the cost of actually performing the IO. "Just reading the file" skips many User<->Kernel mode transitions, so it's one per read call rather than one per page fault.
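For the first disadvantage, the usual workaround on POSIX systems is a SIGBUS handler plus sigsetjmp, which is a lot clunkier than checking read()'s return value. A rough, non-production sketch (all names made up):

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>

    static sigjmp_buf io_error_jump;

    static void on_sigbus(int sig)
    {
        (void)sig;
        siglongjmp(io_error_jump, 1);    /* bail out of the faulting access */
    }

    /* Sum the bytes of a mapped region, surviving a disk error or a yanked
       drive. With read(), this failure mode would just be -1 and EIO. */
    static long checked_sum(const unsigned char *map, size_t len)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigbus;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);    /* real code would restore the old handler */

        if (sigsetjmp(io_error_jump, 1) != 0) {
            fprintf(stderr, "I/O error while touching the mapping\n");
            return -1;
        }

        long sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += map[i];               /* may fault if the backing read fails */
        return sum;
    }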
Although it's true that many hardware problems exhibit as SIGBUS on memory-mapped memory, remember that this is an API and implementation written for high-performance disk drives on important servers; for example, the Ingres server on Berkeley's research VAX (IIRC mmap became widely used after one of the BSD 4.3 subreleases was released). I.e., at the time, the idea of a drive that could be easily detached being used for production computing would have been crazy, so I think crashing the app when a drive is removed is not completely insensible.
The fault will also raise a signal if there is an error reading the sector from the drive (what would be an EIO from read()). Lack of error handling in mmap isn't only a problem for removable media.
Yes, that sounds like a good idea to me. Like I said: if you use mmap, the expectation is that the drive will not bork, and if it does, it should terminate the application.
I think there just hasn't been a consumer application that is really resource-constrained for a long time now; only things for enthusiasts have been. LLMs have product-market fit, and running a useful one client-side is resource-constrained, but instead of it truly being a consumer hardware limitation, it just turns out they were never optimized to begin with - coming from the perceived "top AI/ML minds" at FAANGs, while some of the most basic optimizations are seemingly a lost art.
On the other hand, it's only been a few weeks, so maybe I should ignore this absurdity and just wait.
Probably a combination of (a) ML framework people not paying much attention to CPU inference due to already having GPUs/TPUs lying around for training - CPU inference is just for very quick experiments, (b) research code never being the best optimized for performance, and (c) ML people generally not being systems programmers, while a lot of systems programmers are afraid to mess with the ML code outside of low-level computation kernels (it doesn't help that ML code is notoriously unreproducible).
It's indeed a very different world. This model was trained on thousands of GPUs. The weird file format corresponds to the train-time sharding of the weights. And really nobody is doing CPU inference with all the GPUs we have. Also, the "CLI" use case seems contrived to me. If you plan to interact several times with the model and want to keep the weights in RAM, why don't you start a REPL or spin up a server?
> while some of the most basic optimizations are seemingly a lost art
mmap isn't relevant to anyone except CPU-using programmers because other hardware doesn't have virtual memory paging. Firmware programmers don't care, GPU programmers don't care.
AFAIK CUDA offers unified memory, which basically works with the virtual address space and page-faults data in from main memory. There is also the IOMMU in general.
Many of us would like to get rid of the host CPU and have ML trainers that are just GPUs and drives and NICs all attached to a northbridge. The GPU has everything required to make disk requests over the bus, and ideally the drive can receive network messages that get plumbed straight to the drive (I'm only partially joking).
Word embeddings were big for their time (especially with subword embeddings like fastText). We mmaped word embeddings for similar reasons. But yeah, I was kinda surprised that one post about LLaMa.cpp mmap support talked about a 'fairly new technique'. mmap has been in a UNIX programmer's tool belt for literally decades.
I'm in a grad program for Software Engineering. At my university, the only difference between the Comp Sci and Software Engineering degrees is that comp sci requires an advanced algorithms class, whereas software engineering has a capstone class where you have to work with a team to build an MVP that is unit tested, uses CI/CD, and obviously works.
I say this to highlight the parent comment. I'm essentially in a computer science program, and we have learned absolutely zero about paging or memory in any of my required courses. We practically don't touch anything OS-related in any of the classes. That's not to say the courses for that aren't offered, but they aren't part of the core curriculum, and over my time in my program they've mostly not been offered due to lack of student interest.
I did learn how to use linked lists like a champion though!
I actually don't understand this. I think a lot of regular HN commenters are just avoiding these threads [given the questionable circumstances surrounding the related PRs and the "drama"].
We've had regular discussions on HN about various storage engines, how the latencies are cut down, etc. I share your surprise at hearing 'wow, mmap!' and all the debates in the issues as to what it actually does.
Part of the problem is that this is the domain of computer engineering and not computer science.
Self-respecting computer engineering curricula will cover MMUs, page tables, TLBs, hardware interrupts, and page caches; once you know about those, mmap is fairly simple to understand.
The fundamentals really haven’t changed much in the past 40 years.
> Regarding the version comment - yes, the plan was to bump versions and not the magic. But I'm ok to change the magic to commemorate the significance of this update. In fact, maybe we can make this a thing and everybody who makes a significant contribution to the project will get their initials appended to the version. What do you think? :)
Greg didn't change it; it was changed in Jart's pull request. Also, they can't just make a PR to fix it, because the models were already converted to that magic string that was changed for no reason.
Well if it wasn't Greg, then I feel like I need to ask--why the heck would you use he/him to refer to Justine? I'm glad you've changed your mind and referred to her with they/them above, though!
> they can't just make a PR to fix it, because the models were already converted to that magic string that was changed for no reason.
I _think_ the magic string was changed _because_ of versioning issues--it sounds like you're arguing that the magic version number _instead_ of the magic string should've been changed...but it sounds like Justine _was_ concerned with versioning, even if the versioning wasn't done in what you're saying is the best possible way. I just don't think that "one magic number was changed instead of another magic number" really warrants this level of animosity.
This sort of statement really doesn't contribute anything to the discussion and in fact greatly distracts from the technical content. We don't need to hear your opinions about trans people in this thread.
I previously worked at Microsoft in an unrelated area, but had a friend who worked on CosmosDB, and later a different part of Azure.
There are some Microsoft products I genuinely love, but some are terrible. To an extent it is a reflection of the inconsistency in internal teams. Culture, values, skill level, and quality bar are all over the place depending on who you talk to, even compared to other large companies.
From what I had heard, CosmosDB was not a healthy team, and I would not consider using it as a product.
With the funding of MS, how could this be turned around? Is it necessary to build a new team, or is it enough to replace leadership and let them bring in new members?
It's a good question well above my pay grade :). Changing culture isn't an easy problem.
I suspect the way Microsoft does interviewing and performance management (very local to the specific team) contributes to the inconsistency.
MSFT has also been fairly open with its employees that it does not try to compete with companies like Google, Meta, or even Amazon in terms of compensation. So it isn't really trying to get the best engineers, so long as it can continue to print money.
There are still folks there who are incredible, but the floor is shockingly low at times. Folks will self-select, so you will then get teams which are more homogeneously good or bad.
That is an odd position, given that -Wstringop-overflow also depends heavily on the optimizer (and will frequently generate false positives!) yet not only remains in GCC but is enabled by default (even without any -Wall/-Wextra).
Things like this are why it pays to compile your project with as many compilers as possible (as well as static analysis tools).
Not all of them. msimg32.dll has no certificate and many system processes attempt to load that. There are more dlls in system32 if you look. Neither does Wldap32.dll, which gets loaded into lsass and is part of the knowndlls...
That's unfortunate. I remember 7 years ago I specifically sought out an Atheros-based WiFi card for my computer because of Linux support, and more recently purchased a high-end router based on OpenWRT support. And a few months ago I specifically got a $450 AMD GPU because Nvidia doesn't like free drivers. Seems like Intel is the only option left for networking...
Qualcomm being closed source is the biggest issue with running good (user-respecting) software on phones, since basically every single Android has a Qualcomm processor/GPU. Samsung has some Exynos stuff, but not in the USA. For that reason alone, I'm looking into the PinePhone and Librem 5, which can run mainline kernels.
I realize I'm far from a normal user, but the amount of ill will these companies develop is pretty massive. Broadcom for example has been permanently associated with 'this will be a pain in the ass to get working, and will break your system on updates.' And when I was buying the motherboard / wifi card / wifi router recently I immediately disqualified anything that had a Broadcom chip or didn't specify a chip.
For consumer sales, technologically aware people like us have an amplified impact because people trust us for recommendations. The AMD GPU purchase doubled to $900 because I suggested it to my brother when he needed a new one last month, even though he uses Windows. I've heard this also credited for the rise of services like Google & GMail: 'techies' used them, and their influence on family & friends was a big amplifier that helped bring in users. So I wonder why Qualcomm and Nvidia are so hostile.
> Qualcomm being closed source is the biggest issue with running good (user-respecting) software on phones, since basically every single Android has a Qualcomm processor/GPU.
Actually there are kernel drivers for several Snapdragon platforms now, to the point that a Poco F1 can boot both GNU and AOSP-based systems off mainline.
I haven't touched Ruby in a while, but are there any common multithreaded use cases? It seemed like the direction was to go multi-process for web workloads (e.g. with Unicorn).
Both Puma (https://github.com/puma/puma, a popular server these days) and Passenger Enterprise (paid) provide multithreaded web support. Also on the background jobs side, Sidekiq https://sidekiq.org is very popular.
Those are both solid choices for servers. However, neither of them has a scalability model suitable for HTTP/2 or WebSockets. That's something I wanted to try to address.
I used EventMachine a lot about 8-10 years ago. I'm excited to see Ruby getting some concurrency love again. What are the goals and improvements of your underlying design in general, and especially those that make HTTP/2 and WebSockets work?
Concurrency is one of those core features which is hard to add after-the-fact, and so the initial design strongly determines the course of the language's life. It requires re-opening such fundamentals as what does it mean to call a function, or assign to a variable.
"Nobody is using Ruby for multithreading" is both cause and effect.
That's why I'm not terribly optimistic about projects like this (or the proposed Swift 6). That's not how these things work. Can you imagine a language which features good concurrency support today (like Erlang or Clojure) having been launched without it, and then announcing 5 (or 25) years later "We're going to address concurrency now"?
Completely agree with you and to me that's why it's an exciting challenge. I'm not expecting to solve every problem, but I'm trying to carve out a solution which I think works for these legacy issues. Even if we didn't have a solution for the last 25 years, no harm in adding one now! :)
Not a common use case, but a GUI with background processing is terrible without parallelization.
Even if the background process is I/O-intensive (which means it should spend most of its time waiting, freeing the CPU for the foreground process), that doesn't mean it won't still end up blocking (I've experienced this with filesystem operations).
Actually, I was looking at how audio loops work, and it seems like the low context-switching overhead of fibers could be really great for stacks of effects and filters. Because the overhead is very small and predictable, and fibers are easier to deal with ergonomically, it could make for a really nice interface.
Microsoft internally uses something pretty similar to Bazel. I'm not familiar enough with the two to fully understand the motivations for the divergence. It supposedly initially ran poorly on Mac, just due to differences in what is cheap on different platforms. I wonder what kind of inherent performance differences you would find in something "Windows first" vs "Linux/macOS first".
https://github.com/microsoft/BuildXL
The differences between bazel and BuildXL are less about platform and more about philosophy. Bazel is about opinionated builds - declare everything, and you get fast correct builds. Hard to adopt, but super powerful. BuildXL is more about meeting codebases where they are - take an existing msbuild|cmake|dscript build with its under-specified dependencies, and figure out how to parallelize it safely. This delivers quick wins, but comes with real costs as well (notably: no caching; labour-intensive magic to guess an initial seed of deps, with a performance cliff when something changes).
(Disclaimer: Googler, work closely with the bazel team. I like bazel; I like BuildXL for expanding the space of build ideas, but have long-term concerns it settles in a local maxima that'll be hard to get out of).
I think there's nothing insurmountable in getting Bazel to work well on Windows. It's just a chicken and egg problem: few people use it there, so it's not up to par, so few people use it there because it's not up to par.