I can't really understand what's going on after a few minutes poking around (best I can tell an argument over attribution?), but I certainly sympathize with the project author having to deal with some stupid internet drama that takes away from a cool project. This seems to be one of the less-discussed challenges with popular open-source.
The blog post is a narcissistic cringe ride, and the PR did not give credit to the level that slaren appears to have been involved. Also, the change seems to cause a performance regression in some cases, which someone had also pointed out. It seemed to optimize for the case where only a subset of the weights are needed at the expense of the case where all of them are eventually needed. Seems like the kind of thing that should have been tested prior to checking in.
Overall, these people would be better off taking their drama on Twitter or LinkedIn. ggerganov did the right thing kicking them out.
> @slaren made 7 commits in his fork, which @jart then squashed down into one
Good lord, it's terrible when the peanut gallery feels like they have to comment on development practice.
Why would the number of commits be a relevant metric in an open-source project? Of course squashed commits are easier to handle during rebases and such, and when that work can be squashed to a single "initial mmap support" commit, then that's fine.
> @jart rewrote @slaren's code, which slaren wrote first
Click the 4chan links and you’ll see in their own words what this was really about (trans maintainer, channers worried about their waifu bot getting cucked.) Most seem to not know what they’re talking about- some admit to being retarded. Highly suspicious of this being tech related.
Also note the stats on GH subscribers and stuff. This is a lolcow dossier…
Events like this make me glad I don’t contribute OSS. I’ll keep my coombots proprietary.
I didn’t know Justine was trans. Her wikipedia article doesn’t mention it, and has another female middle name which suggests that her parents gave the names to her.
I am not very familiar with her work except for the impressive Cosmopolitan / Redbean mentioned on HN in the past. But she seems to be quite a controversial figure who is for some weird technocracy and against democracy and leftists, despite being a leader in the Zuccotti Park protests… in short, someone who is no stranger to drama and controversy, and actively courts it:
All true, stuff I didn’t like about her, but also ancient history (article is from 2014!)
And like I said, you can follow the links and see the brigade discussing something else entirely.
Personally I don’t want an OSS ecosystem that banishes trans people or people who had weird proto-alt-right politics pre-Trump. If you’re gonna banish anyone, banish the ones posing existential risks to projects by their brigading against contributors they don’t like.
Yeah, I definitely prefer to be part of very inclusive and open OSS ecosystems, that do things in good faith.
I am not part of the YCombinator or West Coast ecosystem, but didn't it banish gay people with weird pro-alt-right politics pre-Trump, and then supported Trump? Like, for some reason there was a movement to banish Peter Thiel: https://mashable.com/article/peter-thiel-y-combinator
As a country, we ignore the Yemen war, and are told to only clutch pearls about taking money from Russia due to the Ukraine war. I imagine that YC stopped taking Yuri Milner's money a decade ago, partly because of his ties to the Kremlin, but probably it was just a natural parting of ways eventually: https://news.ycombinator.com/item?id=15631084
Anyway, just saying ... ecosystems aren't always perfect.
The issue is that mmap was unilaterally (or close to unilaterally) implemented and made the only way of loading files. Users do not have an option to continue to not use mmap.
Probably - if sadly - the right decision. Despite being a great feature and improvement, something simply "happened" to the project vibe and to the community, apparently as a direct response to that PR.
Similarly, the GH issue response to that occurrence, despite making valid points in a reasoned manner, carried some of the same kind of "happening" in response - which is understandable, but does nothing to defuse the situation.
It ultimately doesn't matter who contributes, if someone truly believes in the project, they'll be just as happy to step away from it if they're affecting its momentum, even through no fault of their own or just a misunderstanding.
Momentum is important - and at an early stage like this, when the vibe is building and the community is forming, it can be very fragile. I hope ggerganov continues to make these difficult decisions characteristic of clear leadership.
There's something extremely odd about the drama that occurs around Justine in particular. It makes me feel there's some iceberg of things I don't know going on and I genuinely don't have the faintest clue what it is. I just know they make some cool software, like APE, cosmopolitan libc, and I believe landlock-make, which are inspirational projects to people who love clever yet practical hacks.
As for these LLaMA changes, I ran it on my machine for fun, and it worked perfectly. I wound up re-converting my models, but it doesn't take terribly long to do so even for 65B. After that, generation starts nearly instantaneously, which is very impressive. I wouldn't be surprised if there are legitimate problems with the change. Obviously people who deleted their local copy of the original model to save disk space are probably displeased, and maybe it is a massive performance reduction in some cases.
I wish I understood, and yet I fear I don't really want to know at the same time.
edit: At least in this case, it seems like it's mostly drama around attribution and unnecessary changes. Kind of sad that an otherwise really useful code change wound up being marred by probably-avoidable drama, but such is life ¯\_(ツ)_/¯ Honestly, I don't have any input, I just hope everyone can resolve their gripes amicably in due time.
Worth pointing out, there has been quite a bit of contention around this change, both technical and in the form of accusations of plagiarism/miscrediting: https://github.com/ggerganov/llama.cpp/pull/711
> This PR was written in collaboration with @slaren. This PR is also rebased on
PR #586 so please do not squash merge! Use either merge or rebase.
jart made sure that the other user got credit, in addition to making sure that their name was properly attributed in the commit log. Given all this, it feels like the drama shouldn't exist? Like, if there's an issue with attribution, it's not because of bad faith, and I feel like a good-faith conversation could have resolved this, instead of bringing in trolls.
That's not the original PR. jart was working on a malloc() approach that didn't work and slaren wrote all the code actually doing mmap, which jart then rebased in a random new PR, changed to support an unnecessary version change, magic numbers, a conversion tool, and WIN32 support when that was already working in the draft PR. https://archive.ph/Uva8c
slaren replied to jart on HN asking her why she was doing and saying those things, and she didn't bother to reply to him, despite replying to others in that subthread within minutes. https://archive.ph/zCfiJ
Hmm, based on what you've quoted here and knowing nothing else but a few messages on AI Twitter I would invest in jart.
This is BillG-style product skill -- there is a ton of work that goes into representing a piece of software as something important and valuable that people should buy into.
Jart is a pretty exceptional engineer, even if she wrote this patch single-handedly it would hardly be a footnote in her list of professional accomplishments. This is the author of Cosmopolitan libc, redbean and APE we're talking about, after all.
That being said, it's important to attribute work properly. It can be easy to mix things up (e.g. "my patch" is excusable) but repeatedly insisting on authorship when you're not the author of the change just seems disingenuous. I'm sure it was in good faith, but since they didn't address the issue or clear anything up, it's come to this.
Dramatic, and hardly the conclusion people wanted to the story of a free performance improvement. It's not entirely contrived though, and I think the maintainer handled this exceptionally well given the circumstances.
I'm all for scrutinizing suspicious authors, but it's unlikely Justine just steals other people's code whole cloth. She's been an active community member for a while, and wrote a lot of impressive software before LLMs and script kiddies democratized the whole process.
In this specific instance, jart had a communication error that she failed to clarify, and so things compounded from there. The part that she didn't author is clearly defined in Git, and the most-plausible explanation is an honest mistake. Assuming ill-intent requires you to ignore the original context of the disagreement and focus on the outrage, which pretty much says it all.
That being said, I'd love to hear what evidence you have to the contrary. Maybe you've got a link to an FTP server from 2001 with the Blinkenlights source code on it, I can't say for sure. A fraud probably doesn't write in-depth patch breakdowns on their personal blog for fun, though.
> > This PR was written in collaboration with @slaren. This PR is also rebased on PR #586 so please do not squash merge! Use either merge or rebase.
I read that PR (didn't click any links) and here on HN posted a "Great work" to jart. The reason I did that is precisely because those final lines in the PR came across as an upright acknowledgement that some people helped out. I also got the impression that jart was a co-owner of the project with all the "we"s that were thrown around.
If I was writing that PR, it would be something like "this PR consolidates slaren's mmap approach with additional work done for ... by myself". After hearing about the drama, actually reading slaren's PR, and reviewing jart's comments in issues and the PR and the hn show and tell, I am now convinced this is someone who wants to steal other people's thunder. Heck, even this front page article is yet another PR stunt. I suspect "faster fork of llama.cpp" posts will follow.
Georgi Gerganov remains for me the hacker hero here as far as LLMs are concerned -- mmap is kiddie stuff to be frank, but anyone who gets whisper and llama to work on my laptop with a handful of files (many thanks to you sir) has my technical respect. And I think he has made the right call regarding the project.
Also worth pointing out that you can follow the thread’s link to Rentry, which links to a 4chan (?) archived thread, where you can see anons getting worked up over jart being a trans internet celebrity. And unless you’re playing dumb, you have to admit they were looking for an excuse to troll jart. Unless you seriously want me to believe they were all that mad about… mmap
I can understand these folks struggling with what mmap is actually doing. But this isn't a new discussion; the qualities of mmap versus file-based IO etc. have been debated forever. Still, many of the comments are quite wrong.
I feel significantly dumber for reading that merge request.
The one thing to understand is that the performance implications of mmap are subtle and only work when you have much more RAM than the files you're mapping in.
> only work when you have much more RAM than the files you're mapping in.
Really depends on what you're doing, like memory access patterns. I've definitely seen scenarios when mapping hundreds of gigabytes of data on dozens of gigabytes of ram where mmap has been an almost absurd performance boost over traditional I/O, both immediately but also asymptotically as all the most frequently accessed data ends up in cache and the least accessed data is paged out.
I don't disagree with the subtlety part though. It's very difficult to reason about I/O performance in general. Modern systems are like an onion of hidden performance optimization tricks and caching layers (both in software and hardware).
Yeah and on top of that, different systems (software and hardware combos) are different, so I can see the performance of this depending on the implementation of mmap on the system and the implementation of caches and virtual memory on the architecture. When I've debugged stuff like this, it's either been for myself in which case I know what combo I'm running on or it's been for work where we know which combinations we target and we run regression tests to observe perf implications.
Yes- I have 35 years experience with UNIX and used to use mmapping with BLAST, a sequence search tool, as well as my own codes.
I'll repeat myself: mmap is subtle. If what you mmap is larger than your host RAM, only some of the pages will be loaded at any time, and depending on access patterns, can lead to significant paging.
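If it helps, the knobs that matter most here are the madvise hints; a rough sketch of steering the kernel for a mapping that's bigger than RAM (file name made up, error handling omitted):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        int fd = open("weights.bin", O_RDONLY);   // hypothetical file, possibly larger than RAM
        struct stat st;
        fstat(fd, &st);

        void *base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        // Tell the kernel how the mapping will be touched. With a file much larger
        // than RAM, this is largely what decides whether you thrash or not:
        //   MADV_RANDOM     - don't read ahead, keep hot pages around
        //   MADV_SEQUENTIAL - read ahead aggressively, drop pages behind the cursor
        //   MADV_WILLNEED   - start faulting the range in now
        madvise(base, st.st_size, MADV_RANDOM);

        // ... touch only the pages you need; the kernel pages the rest in and out ...

        munmap(base, st.st_size);
        close(fd);
    }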
I may have parsed your statement incorrectly, but I'm assuming you are talking about the copy of data when using either mmap or file IO (memcpy versus write). Whether you do file IO or mmap, there's going to be a copy. With files, the copy occurs within kernel space with data being copied into the pages in the buffer cache, with mmap the copy occurs in userspace with data being copied into the address space. Swapping can occur in the buffer cache or mmap, which is why so many databases implement their own buffer cache to ensure specific data isn't flushed, leaving them in an inconsistent state.
> With files, the copy occurs within kernel space with data being copied into the pages in the buffer cache, with mmap the copy occurs in userspace with data being copied into the address space.
There is no copy with mmap, the page is either unwritable or CoW. There's always a copy with read(). (But read() can still be faster and more memory efficient nevertheless.)
You are right, if you are directly modifying the mmaped region. I always internally model my data as staging my changes to be synchronized to the mmaped region, so that's my mistake there.
> the page is either unwritable or CoW.
This is not universally true, or maybe I'm confused by this statement. MAP_SHARED exists, but maybe you are referencing a specific kernel's implementation of how it achieves coherence between file-backed shared memory regions in two processes? I'm not sure.
> Darwin kernel does though.
Sure, we can always point to a kernel that has implemented some feature or another, which is why I said typically you don't see it.
To be entirely honest I'm not sure why the kernel doesn't use better routines here, I think on ARM at least it saves the entire NEON state on context switch…
Unfortunately Justine has attracted a peculiar fanbase+haterbase. As their numbers swell, the collective intelligence and technical understanding diminish.
So the discussions end up gravitating towards weird drama. I wish you wouldn't have linked this thread. There's going to be a bunch of stupid comments here as well about how great/awful jart is.
I'm not a fan or a hater, I didn't even know who this person was until this thread.
Does the change deserve a blog post or wild claims like "llama.cpp is 100x faster and uses half the memory!"? No. The original PR looks like a decent addition but the blog post reads as incredibly narcissistic (i.e. lots of language like "We spent several weeks volunteering" and "our project") uh whatever. It also breaks backwards compatibility when there's no technical reason it couldn't have been optional or put behind a feature flag, plus a ton of condescending language in the PR. Not really the kind of work I'd be proud of or would be advertising in a blog post.
The claim that it uses half the memory was probably an honest mistake. The ensuing disappointment that it did not in fact halve memory usage, and the drama it attracted from trolls and white knights, is icky. The discussion around mmap, I suppose, is subtle and when emotion abounds can no longer be had. :/
Is mmap really that broken on Windows? Or is the poster just confused that the data stays in the page cache? But that’s what the page cache does - that memory will be used for other things if needed, but if the memory is not needed it might as well keep the old data in cache.
No, mmap on Windows is fine. A generous, charitable statement would be that the OP on that thread is very confused, but based on some comments elsewhere on this thread about jart attracting a chorus of haters, it seems more likely that they're just trolling.
There's a weird breed of programmer who only wants to see the free memory column in top be maximized. I bought all this RAM and I want to make sure none of it is used in case I want to use it later.
I maybe should not be surprised, given that we live in the era of Unity and Electron, but using mmap() to load large files should not be seen as rocket science.
And this is basically available on almost any platform with a MMU and a kernel.
Using memory mapped files is not always the right answer.
Memory mapped files have their disadvantages. The biggest disadvantage is that any disk read error (or yanking the USB drive) becomes an access violation exception (also known as a crash), just like you read from a bad pointer. You need to have robust exception handling, which is a taller order than just checking a return value.
Another disadvantage is that even when you have your pages mapped into memory, calling the page fault handler and getting your page has a cost of ~1200 CPU cycles on Windows just to do the User<->Kernel mode transition, plus the cost of actually performing the IO. "Just reading the file" skips many User<->Kernel mode transitions, so it's one per read call rather than one per page fault.
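To make the first point concrete, the usual way to survive that on POSIX systems is a SIGBUS handler plus sigsetjmp around the reads; a minimal sketch (illustration only, not what llama.cpp does, and the file name is made up):

    #include <csetjmp>
    #include <csignal>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static sigjmp_buf read_fault;

    static void on_sigbus(int) {
        siglongjmp(read_fault, 1);   // only async-signal-safe work belongs here
    }

    int main() {
        int fd = open("model.bin", O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        const char *p = (const char *)mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        struct sigaction sa = {};
        sa.sa_handler = on_sigbus;
        sigaction(SIGBUS, &sa, nullptr);

        long sum = 0;
        if (sigsetjmp(read_fault, 1) == 0) {
            for (off_t i = 0; i < st.st_size; i++)
                sum += p[i];                        // may fault if the disk yields an I/O error
            printf("checksum %ld\n", sum);
        } else {
            fprintf(stderr, "I/O error while reading mapped file\n");  // recovered, not crashed
        }
        munmap((void *)p, st.st_size);
        close(fd);
    }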
Although it's true that many hardware problems exhibit as SIGBUS on memory-mapped memory, remember that this is an API and implementation written for high-performance disk drives on important servers; for example, the Ingres server on Berkeley's research VAX (IIRC mmap became widely used after one of the BSD 4.3 subreleases was released). I.e., at the time, the idea of a drive that could be easily detached being used for production computing would have been crazy, so I think crashing the app when a drive is removed is not completely insensible.
The fault will also raise a signal if there is an error reading the sector from the drive (what would be an EIO from read()). Lack of error handling in mmap isn't only a problem for removable media.
yes, that sounds like a good idea to me. Like I said: if you use mmap, the expectation is that the drive will not bork and if it does, it should terminate the application.
I think there just hasn't been a consumer application that is really resource constrained for a long time now; only things for enthusiasts have been. LLMs have product-market fit, and running a useful one client-side is resource constrained, but instead of it truly being a consumer hardware limitation, it turns out they were never optimized to begin with - coming from the perceived "top AI/ML minds" at FAANGs, while some of the most basic optimizations are seemingly a lost art.
On the other hand, it's only been a few weeks, so maybe I should ignore this absurdity and just wait.
Probably a combination of (a) ML framework people not paying much attention to CPU inference due to already having GPUs/TPUs lying around for training - CPU inference is just for very quick experiments (b) research code has never been the best optimized for performance (c) ML people are not generally systems programmers, and a lot of systems programmers are afraid to mess with the ML code outside of low-level computation kernels (doesn't help that ML code is notoriously unreproducible).
It's indeed a very different world. This model was trained on thousands of GPUs. The weird file format corresponds to the train-time sharding of the weights. And really nobody is doing CPU inference with all the GPUs we have. And also the "CLI" use case seems contrived to me. If you plan to interact several times with the model and want to keep the weights in RAM, why don't you start a REPL or spin up a server?
> while some of the most basic optimizations are seemingly a lost art
mmap isn't relevant to anyone except CPU-using programmers because other hardware doesn't have virtual memory paging. Firmware programmers don't care, GPU programmers don't care.
AFAIK CUDA offers unified memory which basically works with virtual address space and page faulting in data from main memory. There is also IOMMU in general.
Many of us would like to get rid of the host CPU and have ML trainers that are just GPUs and drives and NICs all attached to a northbridge. The GPU has everything required to make disk requests over the bus, and ideally the drive can receive network messages that get plumbed straight to the drive (I'm only partially joking).
Word embeddings were big for their time (especially with subword embeddings like fastText). We mmaped word embeddings for similar reasons. But yeah, I was kinda surprised that one post about LLaMa.cpp mmap support talked about a 'fairly new technique'. mmap has been in a UNIX programmer's tool belt for literally decades.
I'm in a grad program for Software Engineering. At my university, the only difference between the Comp Sci and Software Engineering degree is that comp sci requires an advanced algorithm class whereas software engineering has a capstone class where you have to work with a team to build a MVP that is unit tested, uses CI/CD, and obviously works.
I say this to highlight the parent comment. I'm essentially in a computer science program and we have learned absolutely 0 about paging or memory in any of my required courses. We practically don't touch OS anything in any of the classes. That's not to say the courses for that aren't offered but they aren't part of the core curriculum and over my time in my program, they've mostly not been offered due to lack of student interest.
I did learn how to use linked lists like a champion though!
I actually don't understand this. I think a lot of regular hn commentators are just avoiding these threads [given the questionable circumstances surrounding the related PRs and the "drama"].
We've had regular discussions on HN about various storage engines, how the latencies are cut down, etc. I share your surprise at hearing 'wow, mmap!' and all the debates in the issues as what it actually does.
Part of the problem is that this is the domain of computer engineering and not computer science.
Self respecting computer engineering curriculums will cover MMUs, page tables, TLBs, hardware interrupts, and page caches which once you know about mmap is fairly simple to understand.
The fundamentals really haven’t changed much in the past 40 years.
> Regarding the version comment - yes, the plan was to bump versions and no the magic. But I'm ok to change the magic to commemorate the significance of this update. In fact, maybe we can make this a thing and everybody who makes a significant contribution to the project will get their initials appended to the version. What do you think? smile
Greg didn't change it, it was changed in Jart's pull request. Also they can't just make a PR to fix it, because the models were already converted to that magic string that was changed for no reason.
Well if it wasn't Greg, then I feel like I need to ask--why the heck would you use he/him to refer to Justine? I'm glad you've changed your mind and referred to her with they/them above, though!
> they can't just make a PR to fix it, because the models were already converted to that magic string that was changed for no reason.
I _think_ the magic string was changed _because_ of versioning issues--it sounds like you're arguing that the magic version number _instead_ of the magic string should've been changed...but it sounds like Justine _was_ concerned with versioning, even if the versioning wasn't done in what you're saying is the best possible way. I just don't think that "one magic number was changed instead of another magic number" really warrants this level of animosity.
This sort of statement really doesn't contribute anything to the discussion and in fact greatly distracts from the technical content. We don't need to hear your opinions about trans people in this thread.
The original change made intuitive sense, and some of the arguments against seem a bit weird - asserting that mmaping the file could mean the memory sticks around after the program stops... no.
Suggesting that mmap limits things to the size of RAM - well, no as well; paging may happen, but then we are just back out to the file.
Honestly, some of the weird assertions wouldn't take long for people to double-check and verify (or falsify).
I still don't understand: why was the magic number changed in addition to the file version number?
Edit: can someone running llama.cpp ask it whether it thinks it's a good idea to concatenate a running list of vanity initials of important developers into a magic filetype constant?
It was a significant change, and Greg - the original author of llama.cpp - was fine with it:
> Regarding the version comment - yes, the plan was to bump versions and no the magic. But I'm ok to change the magic to commemorate the significance of this update. In fact, maybe we can make this a thing and everybody who makes a significant contribution to the project will get their initials appended to the version. What do you think? smile
Since the post is from today, were the improvements all ‘real’?
I didn’t follow closely but I remember multiple points people brought up earlier like: is the memory counting correct, why aren’t all the weights accessed for a query, whether quantisation is a problem etc.
Were all these fixed?
There are no memory improvements; people were not measuring correctly. The giant improvement is the load time after the first run (if you do not invalidate your caches). Quantization to 4-bit is a big gain; the loss appears to be minimal from benchmarks. So with quantization you gain the ability to try a bigger model. If you have the hardware to fit the biggest model then you can skip it, but for most people the goal is to fit the biggest model possible in VRAM or RAM.
Unless the prior code was using O_DIRECT, the data was getting loaded into the kernel's page cache, and then the application was copying it into its own anonymous memory. Now the copy isn't happening. There are some subtleties involved [1] but it's not crazy to claim approximately half the RAM usage, even before bringing multiple processes into the picture.
[1] The kernel doesn't necessarily load the whole thing into page cache at once and keep it around indefinitely. It might have been recognizing a sequential loading pattern before and basically discarding pages almost immediately, whereas now it might be keeping them for much longer. Or it might now be essentially skipping loading the whole thing in at once and doing it page-by-page on demand, which could be more RAM-efficient but slower. To some extent, you can control these behaviors with madvise, mlock, MAP_LOCKED, MAP_POPULATE, as well as various sysctls. Also, if it had to page out before, the anonymous memory was "dirty" and thus had to be swapped (written out to disk), whereas the mmap()ed bytes are "clean" and can simply be discarded and (if needed to be paged back in later) reread from the existing file unchanged.
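To make the difference concrete, the two loading strategies look roughly like this (schematic, not the actual llama.cpp code):

    #include <cstdlib>
    #include <sys/mman.h>
    #include <unistd.h>

    // Copy-based load: the kernel fills its page cache AND we fill an anonymous
    // buffer, so the weights can transiently exist twice, and the buffer is dirty
    // (it must be written to swap, not just dropped, under memory pressure).
    void *load_by_copy(int fd, size_t size) {
        void *buf = malloc(size);
        size_t off = 0;
        while (off < size) {
            ssize_t n = pread(fd, (char *)buf + off, size - off, off);
            if (n <= 0) { free(buf); return nullptr; }
            off += (size_t)n;
        }
        return buf;
    }

    // mmap-based load: one clean, file-backed mapping. Pages come in on demand (or
    // eagerly with MAP_POPULATE) and can simply be discarded and re-read later.
    void *load_by_mmap(int fd, size_t size) {
        void *p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        return p == MAP_FAILED ? nullptr : p;
    }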
Thanks for the extra clarifications, but the claims were something impossible, like a 23 GB model only using 6 GB with this change. So maybe before this change it would have used a lot more than 23 GB. I was referring to those miracle memory reductions, which are unfortunately not possible. I would like to try 3-bit quantizations when models and software are ready (found none in my searches today).
Yes, those claims were a bit much, and in fairness jart chimed in to say so too. [1]
fwiw, I'm not a ML person, but it doesn't seem entirely crazy to me to think that SSDs are becoming fast enough that you could avoid keeping a huge model in RAM in some cases. Especially if "computational SSDs" (SSDs that can do some basic first-stage computation without transferring the input data over PCIe) ever become common. (I think some of the ML accelerators for sale today might be approximately this.)
Much of performance work in computing is about moving data around the memory hierarchy in ways that are inconvenient to programmers.
I made an SSD into a spare swap device, and basically treated my system as having RAM+SSD's worth of RAM. It allowed me to finish a few big jobs (~96GB RAM) overnight that wouldn't have otherwise.
> There are no memory improvements, people were not measuring correct.
Using filebacked pages instead of anonymous memory is a real improvement because it doesn't have to get swapped out if there's memory pressure. And this program probably isn't the only thing running on the machine.
I meant that you would not gain any memory; there was no magic compression that would let you use a bigger model on the same hardware. There were some wild claims made, but that was some people measuring memory usage wrong. You are correct, though, that there might be some small memory improvements and some speed improvements.
Well, you can use a bigger model now, it will "just" be really slow. This is different from GPUs, which would just fail to load larger models than VRAM because they don't support paging (unless you build that yourself.)
The initial claims about memory savings and sparse models weren’t correct at all. This was immediately clarified by people who tested it, but the headline had already moved toward the top of HN.
I have got Vicuna-13B working on an RTX 3090 Ti + OpenCL + CPU with 90% of the weights on the GPU (otherwise running out of memory) at around 500 ms per token.
This model is really good for a (semi-)open source model. I think this may be the first locally runnable model that I will actually use for real stuff rather than just play around for fun.
It's not ChatGPT level but it's not that far behind. It will draw ASCII art HUDs for a text adventure or analyze data or recognize languages or write stories. AFAIK it's been trained on ChatGPT discussions so makes sense.
This AI still gets uppity sometimes about offensive content but unlike ChatGPT, you can edit the prompts to put words in its mouth to encourage it to answer properly.
I only got it working at all yesterday and there's no nice UX at all. Not sure I recommend trying to use this as llama.cpp will probably have this in no time with a much better user experience, although I am also trying to make it more usable.
If you follow the instructions on the Vicuna page on how to apply the deltas, and you can compile the project, then you could run:
Where /models/vicuna13b is the HuggingFace-compatible model. This will put 90% of the weights on the GPU and the remaining 10% on the CPU, which is just barely enough to not run out of GPU memory (on a 24 GB card).
Create a text file 'prompt' with the prompt. I've been using this template:
You are a helpful and precise assistant for checking the quality of the answer.###Human: Can you explain nuclear power to me?###Assistant:
(the model seems to use ### as delimiters to distinguish Human and Assistant). The "system prompt" is whatever text is written at the beginning.
The feature to load a % to the GPU is novel and amazing! I couldn't get the project up and running myself (requires a nightly rust build) but I love this particular innovation.
> We release Vicuna weights as delta weights to comply with the LLaMA model license. You can add our delta to the original LLaMA weights to obtain the Vicuna weights.
Officially, only as deltas against LLaMa weights, and needing a complicated and resource-intensive conversion procedure. Unofficially, yes, a pre-converted llama.cpp compatible ggml file is available, but obviously I won't publish the link here to avoid violating the Y Combinator's terms of use.
I recently found this list of models that works with llama.cpp: https://rentry.org/nur779 (with dl links, albeit given llama's licensing gray area, use at your own risk)
The latest so far would be Vicuna, whose weights were just recently released.
> I don't think I've ever seen a high-level library that's able to do what mmap() does, because it defies attempts at abstraction.
I'm not sure what this means but I'm pretty sure I can name several "high level libraries" that mmap things. None of those are the STL, but it's not exactly perfect design.
I read it to mean mmap is irreplaceable. There is no other sophisticated dance of system calls or userspace trickery that can achieve what mmap can achieve. She's saying that everything up and down the stack, including high level libraries, do just call mmap, because there would be no DIY alternative with similar cost-benefit.
Except it’s not irreplaceable, at least on Linux. userfaultfd allows you to define custom page fault handling. With it, you can even do crazy things like “mmap” a remote resource by making HTTP range requests on a read fault.
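If anyone is curious, the skeleton of that looks something like this on Linux (error handling stripped, needs -pthread; newer kernels may require vm.unprivileged_userfaultfd=1 or a capability, and the memset stands in for whatever remote fetch you'd actually do):

    #include <linux/userfaultfd.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <cstring>
    #include <cstdio>

    static int uffd;
    static long page_size;

    // Resolve faults by synthesizing page contents; this is where an HTTP range
    // request (or any other backing store) could go instead of memset().
    static void *fault_server(void *) {
        void *staging = mmap(nullptr, page_size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct uffd_msg msg;
        while (read(uffd, &msg, sizeof msg) > 0) {
            if (msg.event != UFFD_EVENT_PAGEFAULT) continue;
            memset(staging, 'x', page_size);            // "fetch" the page
            struct uffdio_copy copy = {};
            copy.src = (unsigned long)staging;
            copy.dst = msg.arg.pagefault.address & ~(page_size - 1);
            copy.len = page_size;
            ioctl(uffd, UFFDIO_COPY, &copy);
        }
        return nullptr;
    }

    int main() {
        page_size = sysconf(_SC_PAGESIZE);
        uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC);

        struct uffdio_api api = {};
        api.api = UFFD_API;
        ioctl(uffd, UFFDIO_API, &api);

        size_t len = 16 * page_size;
        char *region = (char *)mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct uffdio_register reg = {};
        reg.range.start = (unsigned long)region;
        reg.range.len = len;
        reg.mode = UFFDIO_REGISTER_MODE_MISSING;
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        pthread_t th;
        pthread_create(&th, nullptr, fault_server, nullptr);

        printf("%c\n", region[0]);                      // first touch -> handled by fault_server
        printf("%c\n", region[5 * page_size]);
        return 0;
    }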
You’re right. Irreplaceable was a stronger way of putting it than the original, I think, so that’s more my mistake than hers, and I think the contrast with userspace stands.
mmap sits at this lovely intersection between virtual memory and the disk, and it’s been around for a long time. By now there are other means of playing within that nice intersection, but mmap is the pop classic.
Windows has vectored exception handlers, which are a bit like UNIX signal handlers but much more sanely designed. You can use that to redirect control flow when a page fault occurs, and you can check the faulting address in your handler to scope it to a particular region of memory.
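Concretely, something along these lines (a sketch; a real handler would check whether the faulting address falls inside its mapped view and recover instead of just logging):

    #include <windows.h>
    #include <cstdio>

    static LONG CALLBACK OnInPageError(PEXCEPTION_POINTERS info) {
        if (info->ExceptionRecord->ExceptionCode == EXCEPTION_IN_PAGE_ERROR) {
            void *addr = (void *)info->ExceptionRecord->ExceptionInformation[1];
            fprintf(stderr, "I/O error faulting in mapped page at %p\n", addr);
        }
        return EXCEPTION_CONTINUE_SEARCH;   // otherwise let the normal crash path run
    }

    int main() {
        PVOID cookie = AddVectoredExceptionHandler(1 /* call us first */, OnInPageError);
        // ... CreateFileMapping / MapViewOfFile and reads from the view go here ...
        RemoveVectoredExceptionHandler(cookie);
        return 0;
    }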
It might be the non-obvious problem with mmap : failures are surfaced through signals. So, a failure signal handler may run at any time. Your program needs to be resilient to this, and it's not trivial to do so. It's not a local change to one class.
Boost.interprocess is an example of an abstraction over mmap which solves some of the things the blog post mentions. It abstracts away the difference between mmap() and CreateViewOfFile(), and gives you smart pointers and container types which are close to being drop-in replacements for std::vector and std::map that can be stored in a memory-mapped file.
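From memory, the usage looks something like this (file name, size, and type names are just for illustration):

    #include <boost/interprocess/managed_mapped_file.hpp>
    #include <boost/interprocess/containers/vector.hpp>
    #include <boost/interprocess/allocators/allocator.hpp>

    namespace bip = boost::interprocess;

    using FloatAlloc   = bip::allocator<float, bip::managed_mapped_file::segment_manager>;
    using MappedFloats = bip::vector<float, FloatAlloc>;

    int main() {
        // Backed by weights.cache on disk; the same code works on Windows and POSIX.
        bip::managed_mapped_file file(bip::open_or_create, "weights.cache", 64 * 1024 * 1024);

        // A vector that lives inside the mapped file and survives process restarts.
        MappedFloats *weights = file.find_or_construct<MappedFloats>("weights")(
            file.get_segment_manager());

        if (weights->empty())
            weights->push_back(0.5f);   // first run populates it; later runs see it already there
    }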
The post is a technical article that describes a very cool systems engineering approach to a problem that is usually coded away in proprietary code from nvidia.
No wonder a lot of people see it as "mmap, nothing new". But this is not the case in a lot of libraries where the norm is to just budget for a lot of time moving things to/from gpu and just relying on someone else's code.
Instead of accumulating technical debt, the owner of the repo decided to merge this, accept a breaking change, and move on. When there was some community backlash to the breaking changes, there was a pull request trying to revert all changes instead of working through the issues (it was a net win for several users but not all; some configurations with slower drives were better served by the older approach). There was an ugly back and forth and the repo owner decided to ban both the person who did the pull request and the author of this post.
This article brings the conversation back to the technical merits and the roadmap, credits the original authors, and tones down the ownership language that may have pissed off some community members.
That pull request has now been closed by the owner of the repo. They are trying to move on and be productive, let's do the same.
You can just read it (at least ctrl-f jart) before telling people what is/isn't appropriate. As I recall, in short order people started arguing about jealousy/credit, the file format initials being changed in jart's honor, "I'm not technical and don't know what's going on but..", etc. Goofy stuff.
> One of the downsides of the Linux cp command, is copying a file larger than RAM will destroy every existing entry in the file cache. Under normal circumstances this is a good thing, since a least recently used strategy usually works. However it can be problematic if you're just organizing your files on a production system where you don't want to disrupt performance. As far as I know, no standard command line utility offers a way to exploit this functionality.
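For what it's worth, one way to get that behavior is posix_fadvise(POSIX_FADV_DONTNEED); a rough sketch of a cache-friendly copy loop (illustration only, short writes and error handling ignored):

    #include <fcntl.h>
    #include <unistd.h>

    // Copy src -> dst while telling the kernel to drop the copied ranges from the
    // page cache behind us, so a huge copy doesn't evict everything else.
    int cache_friendly_copy(const char *src, const char *dst) {
        int in  = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        static char buf[1 << 20];
        off_t done = 0;
        ssize_t n;
        while ((n = read(in, buf, sizeof buf)) > 0) {
            write(out, buf, (size_t)n);
            done += n;
            posix_fadvise(in, 0, done, POSIX_FADV_DONTNEED);
            fdatasync(out);                                   // dirty pages can't be dropped until written
            posix_fadvise(out, 0, done, POSIX_FADV_DONTNEED);
        }
        close(in);
        close(out);
        return n < 0 ? -1 : 0;
    }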
This is from today, Apr 5, saying the mmap change loads twice-as-big models with a 100x speed-up - is this not a blatant lie?
Wasn’t it discovered last week that loading larger models was an error in measurement and the speed up was from keeping things in memory after the first loading?
Justine knows this and it is stated right there on the page:
> The first time you load a model after rebooting your computer, it's still going to go slow, because it has to load the weights from disk. However each time it's loaded afterwards, it should be fast (at least until memory pressure causes your file cache to be evicted).
"Blatant lie" seems a bit strong. Running a large model for a second time in a row is a pretty common use case and that speedup strikes me as real in that common case. Attribution may have been wrong but the time saved is real.
mmap() will keep things in memory after first loading, but the page cache will _also_ keep things in memory after first loading. The difference is that, to re-use what's in the page cache, you still need to read the file and store your own copy (requiring 2x memory), instead of just doing a memory access. This has two consequences:
* 2x memory. A 20G data set requires 40G (20 for page cache and 20 for LLaMA)
* Things would be _even slower_ if they weren't in page cache after first loading. mmap is fast because it does not require a copy and reduces the working set size
There was a post a few months ago about a developer whose job it was to rework ML stuff into actual efficient code. This reminds me of that post because it seems that lots of ML stuff is just plain inefficient...in that they use way too many resources given the problem.
And they're arguing about mmap, something that's been around forever.
It reminds me of that speedup in a package manager because they were reading uncached byte-at-a-time off of disk. You need to explicitly turn buffered reads off...but why would you do that in the first place? Unbuffered reads are almost never a good idea, ever.
It makes me wonder what other weird sub-optimal stuff is lying underneath the resource behemoth that is ML.
> it seems that lots of ML stuff is just plain inefficient
This is just like any other technology. Use it wrong, you will get burned. Doesn't matter how shiny the container is.
"ML stuff" also (mostly?) includes purely statistical methods from the 90s that are deterministic and arguably the best way to solve a large variety of non-generative problems.
In fact, unless generation of arbitrary output is a major objective, it's likely you can solve whatever ML task on a workstation from 2010 that uses intel integrated graphics.
I see https://docs.kernel.org/filesystems/proc.html describes FilePmdMapped as "Page cache mapped into userspace with huge pages", consistent with what you are saying. I don't fully understand the distinction between that and FileHugePages: "Memory used for filesystem data (page cache) allocated with huge pages". I wouldn't think it'd be possible to map it into userspace as huge pages if the kernel hasn't allocated it as contiguous physical memory (and consistently aligned with the userspace virtual addresses), so there's something I'm missing.
What kernel version did that output come from? Do you happen to know if Firefox did anything special to set that up? What filesystem type is this?
Huh. I did a little digging through kernel source. There's been a CONFIG_READ_ONLY_THP_FOR_FS since 2019. It's still marked as experimental and isn't enabled on the precompiled kernel I'm using (with Ubuntu 22.10). Is that option set on your kernel, or is this something else?
I also see something about MADV_COLLAPSE that supposedly supports file-backed pages. [1]
I have no idea. I was pointing out that the transparent huge pages documentation is not authoritative for requesting huge pages explicitly with mmap.
I didn't find any documentation that would indicate that explicit huge pages didn't work with on-disk filesystems, but sure enough, it doesn't seem to work on ext4.
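For reference, the failure is easy to reproduce: explicit huge pages via MAP_HUGETLB only work for anonymous memory or files on hugetlbfs, so mapping an ordinary ext4-backed file fails (sketch, file name made up):

    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        int fd = open("weights.bin", O_RDONLY);   // regular file on ext4
        void *p = mmap(nullptr, 2 * 1024 * 1024, PROT_READ,
                       MAP_PRIVATE | MAP_HUGETLB, fd, 0);
        if (p == MAP_FAILED)
            perror("mmap(MAP_HUGETLB) on ext4");  // typically fails with EINVAL
        close(fd);
    }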
jart is a genius. What they’ve done with Blink, Cosmopolitan C, Redbean, and now llama.cpp is incredible. It gives me hope for the future of systems/low-level programming.
I suspect you're trolling, but jart is not a genius for co-authoring a 300-line change to use mmap(). This is a pretty typical systems-engineering optimization that goes on in big companies, and it's usually framed as a performance improvement for certain use cases, with pros and cons, not as a "Hacker News front page 100x revolutionary performance improvement".
I think reducing this to "using mmap is basic" is pretty unfair.
The trick she did overriding malloc & friends to validate that the optimization would be worth doing is, in my mind, one of the high-points of the paper. It's a very clever way of making a meaningful measurement, which was the keystone of the entire change.
I've never heard of, thought of, or used that trick, and the fact she had it in her arsenal to apply to this very specific situation is pretty impressive, to me at least.
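I don't know exactly how she instrumented it, but the usual shape of that trick on Linux is an LD_PRELOAD shim that wraps malloc and tallies what the loader asks for; a rough sketch (bootstrap corner cases around dlsym are glossed over, and the build/run lines are just an example):

    // build: g++ -shared -fPIC -o mallocspy.so mallocspy.cpp -ldl
    // run:   LD_PRELOAD=./mallocspy.so ./your-program
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <dlfcn.h>
    #include <atomic>
    #include <cstddef>
    #include <cstdio>

    static std::atomic<size_t> total{0};

    extern "C" void *malloc(size_t size) {
        static void *(*real_malloc)(size_t) = nullptr;
        if (!real_malloc)
            real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
        total += size;                    // tally every allocation the program asks for
        return real_malloc(size);
    }

    __attribute__((destructor)) static void report() {
        fprintf(stderr, "total malloc'd: %zu bytes\n", total.load());
    }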
Systems programming is itself a niche field when compared to other fields of interest in software development. So let's not trivialize anything, because it can be amazing to one person but not to another. It varies from person to person.
it's as much engineering as is gluing mmap, recv and send... there are people who can do that and can't piece together a dataframe pipeline, no need to be passive-aggressive here
> "it's as much engineering as is gluing mmap, recv and send"
As you can see from comments up this thread (ref. various GitHub issues) - the people "gluing mmap" don't actually have a single clue. They can't properly measure memory consumption (they don't understand what the numbers they're seeing actually mean). They don't understand how paging, swapping or virtual memory work. They don't actually understand the concept of memory-mapped files, why they're there and how they work. They can't explain why their code behaves differently when using memory-mapped files.
Moments like this are here to remind you that there's actual knowledge and skill to building scalable and efficient software, and that hustling and copy-pasting StackOverflow examples will only get you so far, as will "piecing together dataframe pipelines" in Python.