Hacker News
Memory Allocation (samwho.dev)
815 points by thecoppinger 10 hours ago | 118 comments

> When we free memory, we should make sure that if the block we return to the free list is next to any other free blocks, we combine them together. This is called "coalescing."

A little offtopic but the default Delphi 7 memory allocator did this, except that it also merged blocks that it obtained from different OS allocation calls.

This worked fine for regular usage, but if that memory was ever used for Bitmaps for UI stuff, it wouldn't work: Since Windows does some of its UI stuff in kernel mode, before doing anything Windows would attempt to lock the entire allocation's VAD entry to prevent you from messing with it in another thread while it was using it. If the Bitmap you were working on happened to belong to two different OS-level allocations, this lock function would fail since it wasn't meant to handle that case.

This would lead to random, extremely cryptic errors such as ERROR_NO_SCROLLBARS "The window does not have scroll bars." since the lock call was deep in the stack and the callers replaced the error with another one as it bubbled up.

In the end we had to replace the allocator to avoid the issue. That was a fun day of debugging.


You mean a bug this deep took one day to debug?

Yes and no: it took me from 8am to 3am once we decided it needed to get fixed, but really it sat in the app for years. It only happened in a background process that sent print jobs on a timer; since it used Windows GDI to compose the image we sent to the printer, it was affected (our "frontend" should've been affected too but never was, I guess because it had a different memory usage pattern).

We just had it restart itself and try again whenever it got one of those errors when printing but eventually we wanted to add a feature that required the process to not die, and by that time I was already 99% sure that it wasn't something in our code and I had already ruled out threading issues.

I ended up putting it in a VM with a kernel debugger attached and having a script make a snapshot and make it print over and over until it errored, then following along in IDA until I saw what was going on.

Having a way to trigger it (by restoring the snapshot) on demand helped a lot, otherwise it would have taken forever to make sense of it as it could sit without crashing for nearly an hour.


How do you attach kd to VM?

While I was at MS, it was such a big PITA - we just had a bunch of IT managed machines with KVM console access and KDNET for debugging.


I think the far weirder part of this was the kernel-side handling of scrollbars.

If I recall correctly the kernel part of things would return an out of memory error which the user mode DLL translated to that weird error (sometimes, other times it would just say "out of system resources", it depended on what widget the bitmap that overlapped the two memory regions belonged to).

Here's a 2003 forum post from someone else having the same problem: http://www.delphigroups.info/2/1/749064.html


Until Windows 95, Windows was essentially just a DOS application that grabbed the framebuffer and ran an event loop where it drew "controls" (which includes windows, buttons, text views, and yes, scrollbars.) That was the whole point of it. It wasn't an "OS" per se; DOS was the OS. Windows was what a Linux-head would think of as a combination of an X server and window manager. And Windows loaded your "application" as essentially a DLL, with the Windows global event loop calling into your application's event-loop delegate handler (WndProc) whenever it has an interesting event that your application might like to react to.

(Your application wasn't even a "process" per se. Until Windows 95, everything was just happening in one shared address space, in real mode. In fact, it was only in Windows 3.1 where user applications stopped running in ring 0!)

If you think about it, this "the kernel is a game engine and your application is the game" approach isn't necessarily a bad design... for a single-tasking OS's library exokernel, like the Wii's https://wiibrew.org/wiki/IOS.

But, of course, Windows claimed to be a multitasking OS. But it actually wasn't! And I don't mean the obvious thing about it not having pre-emption. Lots of multitasking OSes didn't have pre-emption.

No, what I mean is that the concurrency primitive for the cooperative scheduling wasn't the "task" (i.e. the process or thread. Which, again, there weren't any of.) Instead, the concurrency primitive was the window. Until Windows 95, Windows was a multi-windowing OS.

Each control was owned by a window. Each window had a WndProc. If your Windows executable (i.e. application delegate module) set up two windows, then each window participated separately in the global Windows event loop, up-to-and-including things like having its own set of loaded fonts, its own clipboard state, and its own interned strings table. In OOP terms†, your application was just a dead "class object", running no logic of its own save for one-time load and unload hooks. It was the windows themselves that were the "instances" of your class.

This might make you realize why MDI (or Multiple Document Interface, where there are multiple small per-document "windows" inside one big window) was so popular back then. The MDI "windows" weren't actually windows — they didn't have their own WndProc. They were just controls, like a tab view is a control. Only the big container window was a real window, and so all the resources within that big window were shared between all the virtual windows. MDI was a memory-saving trick!

---

† The actual more interesting analogy is that Windows was essentially a (single-threaded, cooperatively-scheduled) actor system, where windows were the actors. There is a very close parallel between (Windows 3.1 executables, Windows 3.1 windows) and (Erlang modules, Erlang processes).


> This might make you realize why MDI (or Multiple Document Interface, where there are multiple small per-document "windows" inside one big window) was so popular back then. The MDI "windows" weren't actually windows — they didn't have their own WndProc. They were just controls, like a tab view is a control. Only the big container window was a real window, and so all the resources within that big window were shared between all the virtual windows. MDI was a memory-saving trick!

MDI may have saved some memory - I can't say one way or the other on that - but the mechanism you describe is incorrect.

Every MDI child window was a window of its own with its own WndProc. Every control inside those windows was also a window with its own WndProc. Every dialog box was also - yes - a window with its own WndProc.

You wouldn't always be aware of the WndProc in your code, but it was there.

If you ran WinSight or Spy++, you could see the entire window hierarchy including all the child windows, child control windows, and so on.

Later on, a few applications implemented "windowless controls" to save memory, but this was uncommon, especially in the early days. For example, there was an optional windowless version of the Rich Edit control:

https://learn.microsoft.com/en-us/windows/win32/controls/win...

Fun fact: an early informal name for MDI was "Mac in a box", because if you maximized the top level window, you had a somewhat Mac-like environment, with one menu bar at the top that was shared by the child windows.

Source: I was the author of WinSight and the VBX (Visual Basic eXtension) API.


> Until Windows 95, everything was just happening in one shared address space, in real mode. In fact, it was only in Windows 3.1 where user applications stopped running in ring 0!

Windows 3.0 and its predecessors ran on processors which had no concept of "ring 0", so that should not be surprising at all...

> Your application wasn't even a "process" per se

I think this is a bit of a "modernistic", "win32y" view of the definition of a process. Surely there are processes/tasks in 3.x: you can launch multiple instances of a module, each of them can allocate memory separately, each of them has different resources/handles, and the OS will clean up for you once each instance terminates (cleanly). Obviously, without any type of virtual address space or memory protection, any such process can write to and destroy the memory of any other process, but they are still processes. The existence of Yield/DirectedYield, which do not take/need a window, is also a hint of that. (Note there are no threads in 3.x.)

Many platforms (that either predate VM or decide not to use VM for e.g. power usage concerns) worked like this. MacOS, Windows CE, PalmOS, etc.


When writing C, I tend to avoid calling malloc and free directly.

* https://github.com/DaveJarvis/mandelbrot/blob/master/memory....

I then apply this same principle of "opening" and "closing" structures throughout the application. Generally, I can quickly verify that the calls are balanced:

* https://github.com/DaveJarvis/mandelbrot/blob/master/threads...

What's nice about this pattern is that the underlying implementation of exactly how the memory is maintained for a particular data structure (e.g., whether malloc or gdImageCreateTrueColor is called) becomes an implementation detail:

* https://github.com/DaveJarvis/mandelbrot/blob/master/image.c

The main application opens then closes structures in the reverse order:

* https://github.com/DaveJarvis/mandelbrot/blob/master/main.c

This means there's only one call to malloc and one call to free throughout the entire application (third-party libraries notwithstanding), allowing them to be swapped out with ease.
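The gist of it is roughly this shape (names here are illustrative, not the exact ones in the repo):

```c
#include <stdio.h>
#include <stdlib.h>

/* Single choke point for allocation: every structure's "open" goes
 * through memory_open, every "close" through memory_close. Swapping
 * allocators then means editing exactly one place. */
void *memory_open(size_t size) {
  void *p = calloc(1, size);
  if (p == NULL) {
    fprintf(stderr, "memory_open: out of memory\n");
    exit(EXIT_FAILURE);
  }
  return p;
}

void memory_close(void *p) {
  free(p);
}

/* Each structure pairs its own open/close around the choke point. */
struct point { double x, y; };

struct point *point_open(double x, double y) {
  struct point *p = memory_open(sizeof *p);
  p->x = x;
  p->y = y;
  return p;
}

void point_close(struct point *p) {
  memory_close(p);
}
```

Verifying balance then reduces to grepping for `_open` and `_close` pairs.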

As an aside, logging can follow this same idea by restricting where any text is written to the console to a single location in the code base:

* https://github.com/DaveJarvis/mandelbrot/blob/master/logging...


The article claims that an allocator that splits memory based on allocation size is called a "buddy allocator". That's misleading: an allocator that allocates an area for each size class is usually called a "slab allocator", while a "buddy allocator" is one that when needed subdivides a memory area with a power of two size into two half-sized areas that are "buddies", does so recursively to satisfy allocations, and coalesces them again when they are free.
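A toy sketch of the buddy scheme, with illustrative names (a fixed 1 KiB pool, one free list per power-of-two order; real implementations track block orders rather than taking them as arguments):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdalign.h>

#define POOL_ORDER 10   /* 2^10 = 1024-byte pool */

/* Free blocks store their "next" pointer in their own first bytes,
 * so the minimum block order must be at least sizeof(void *). */
static alignas(void *) uint8_t pool[1 << POOL_ORDER];
static void *free_lists[POOL_ORDER + 1];   /* one free list per order */

/* A block's buddy is found by flipping the bit for its size. */
static size_t buddy_of(size_t off, int order) { return off ^ ((size_t)1 << order); }

void buddy_init(void) {
  for (int k = 0; k <= POOL_ORDER; k++) free_lists[k] = NULL;
  *(void **)pool = NULL;
  free_lists[POOL_ORDER] = pool;           /* start with one big free block */
}

/* Allocate 2^order bytes, splitting bigger blocks as needed. */
void *buddy_alloc(int order) {
  int k = order;
  while (k <= POOL_ORDER && free_lists[k] == NULL) k++;
  if (k > POOL_ORDER) return NULL;         /* nothing big enough */
  uint8_t *block = free_lists[k];
  free_lists[k] = *(void **)block;
  while (k > order) {                      /* split: keep one half, free the other */
    k--;
    uint8_t *half = block + ((size_t)1 << k);
    *(void **)half = free_lists[k];
    free_lists[k] = half;
  }
  return block;
}

/* Free a block, coalescing with its buddy whenever the buddy is free too. */
void buddy_free(void *p, int order) {
  size_t off = (size_t)((uint8_t *)p - pool);
  while (order < POOL_ORDER) {
    void *bud = pool + buddy_of(off, order);
    void **prev = &free_lists[order];
    while (*prev != NULL && *prev != bud)  /* is the buddy on the free list? */
      prev = (void **)*prev;
    if (*prev == NULL) break;              /* buddy in use: stop coalescing */
    *prev = *(void **)bud;                 /* unlink buddy, merge one level up */
    if ((size_t)((uint8_t *)bud - pool) < off) off = (size_t)((uint8_t *)bud - pool);
    order++;
  }
  *(void **)(pool + off) = free_lists[order];
  free_lists[order] = pool + off;
}
```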

E.g. the Linux kernel used (not sure if it's still like this) a buddy allocator to allocate pages and power-of-two blocks of pages and slab allocators to subdivide those pages and allocate data structures.

Another thing that the article doesn't mention that is important is that most production allocators make use of thread-local storage and either have per-thread caches of free blocks or sometimes whole per-thread memory regions. This is to reduce lock contention and provide memory that is more likely to be in the current core's cache.


You're absolutely right, I've corrected this. Thank you!

I had originally written about threading and locality but it made the post too long and complicated, so I cut it out for the final draft. You can see remnants of it if you check the HTML comments in the post :D


Linux had a bitmap-based "buddy allocator" (power of two); now it is not bitmap based anymore (the complexity was no longer worth it performance-wise, namely simplicity was restored).

Then Linux has various slab allocators (slub/slob/slab), built on top of the "buddy allocator".

User-level code should use non-virtual-address-stable mmap-ed regions (slabs + offsets). Legacy "libc" services were built as virtual-address-stable services... which are kind of expensive to manage on a modern paginated system. Virtual-address-stable regions should be kept to a minimum (that horrible ELF static TLS). There is a workaround though (but the Linux overcommit default policy could kill your process): a user process would query the amount of RAM on the system and mmap a region of roughly (care of the overcommit policy) the same size, only once per process lifetime. Then you could have a virtual-address-stable region which could use most if not all the available RAM (excluding hot-memory addition...). Should be very easy to manage with lists.


This is absolute gold. When I use things like this, I am reminded how powerful digital learning can be. So much more capable than just text or video. But very little content is this well put together. Probably because it is so time intensive.

This feedback has made my day. Thank you.

I'm inspired by Bartosz Ciechanowski and Julia Evans. The web is such a powerful toolbox. So many concepts are more complex than text alone can hope to explain. Those two are so creative and full of energy.

And you're right, it's incredibly time intensive to put these together. Thousands of lines of code, plus the text content, plus reaching out to domain experts for reviews (shout out Chris Down, kernel wizard extraordinaire).


Such great work. I learned things and have a clearer understanding that I think I will come back to in the future. And I say that as someone who has programmed for 15 years! I think your effort paid off, and the inspiration shows through.

One more name I'd add to this list is Mike Bostock. The care and attention he gives to data visualization examples comes through in the finished product.

Communicating complex subjects through interactive visual displays is very effective -- it's worth the effort. Thank you!


Also shout out Anton Verinov (https://anton.codes/): the only reason this web page doesn't drain your battery before you get to the end of it.

Why is that, mind you?

The writing is very clear and the concepts are explained well. Not too much information and not too little. Excellent post.

Deja vu. :D

Great job Sam

If you're not yet familiar with it - https://ciechanow.ski/archives/ is for you (and everyone!)

The master of this artform.

Agreed, I wish I had more things like this growing up.

Oh man, this brings back the days when I wrote special debug-version malloc and free code to try to track down heap corruption due to malloc / free misuse (in code I had contributed to). Stuff like kbyte-long boundary buffers with bit patterns in them, and all sorts of lookaside lists run in parallel with libc's default code. Those bug-detectors worked OK. Hard-nosed code inspection worked far better.

As an old-timer in writing code, I think my generation's most-challenging legacies (=== the things we f**ked up) are the non-robust malloc/free concept and null-terminated text strings. Bugs in code using those conventions have been exploitable to a really damaging extent. I learned to program in C from K&R. And getting data-structure code right, and safe to deploy, in that language and its derivatives is HARD.

The C inventors are (were, in Dennis Ritchie's case) brilliant and Bell Labs was great. Their ideas shaped the stuff the global internet runs on. But these parts of what they invented..... ouch. (All OSs had the same problems.)

I wish the wonderful article presented here carried a more prominent warning about this. Many will read it as they learn to code. The history of our profession can teach about what NOT to do as well as what to do.


Sam, this is such a wonderful resource that you've put out into the world. Thank you for the time and care you've put into each paragraph and interactive component. You've not only given me a lot to think about in terms of my basic assumptions about memory, but also about how to write and teach better online. I'm really excited to start following you!

Excellent, excellent article! I have a question though.

> Couldn't we rearrange the memory to get a block of 6 contiguous bytes? Some sort of defragmentation process?

> Sadly not. Remember earlier we talked about how the return value of malloc is the address of a byte in memory? Moving allocations won't change the pointers we have already returned from malloc. We would change the value those pointers are pointed at, effectively breaking them. This is one of the downsides of the malloc/free API.

But why not? Couldn't we store information about old pointers somewhere and match them with new addresses when defragmenting? Some kind of virtual memory driver that would map old pointers to new addresses transparently for the programs? Or would it be too much overhead for too little benefit?


You’d need cooperation between your malloc implementation and the application code. It’s possible, but tricky. If your malloc returned a pointer to a pointer, and promised to keep the first level pointer up to date, and was able to lock your application whenever it moved things around, it could work.
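A rough sketch of that double-indirection idea, with purely illustrative names (locking omitted):

```c
#include <stdlib.h>
#include <string.h>

/* The allocator hands out handles: slots in a table it owns. It can
 * move the underlying storage and update the slot, and the application
 * (which always dereferences twice) never notices. */
#define MAX_HANDLES 64

static void *handle_table[MAX_HANDLES];

void **handle_alloc(size_t size) {
  for (int i = 0; i < MAX_HANDLES; i++) {
    if (handle_table[i] == NULL) {
      handle_table[i] = malloc(size);
      return &handle_table[i];         /* the handle, not the memory itself */
    }
  }
  return NULL;                         /* table full */
}

void handle_free(void **h) {
  free(*h);
  *h = NULL;                           /* slot becomes reusable */
}

/* The allocator can relocate a block; existing handles stay valid. */
void handle_move(void **h, size_t size) {
  void *fresh = malloc(size);
  if (fresh == NULL) return;           /* leave the block where it was */
  memcpy(fresh, *h, size);
  free(*h);
  *h = fresh;                          /* handle now points at the new spot */
}
```

This is essentially what classic Mac OS did with its "handles", with the same catch you mention: the application must never cache the inner pointer across a point where the allocator might move things.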

Someone else already mentioned, but garbage collected languages do actually do this. Because they’re fully in control of memory (the language exposes no raw pointers), they’re able to create the layer of indirection you’ve suggested and move things around as they please. I know at minimum the JVM does this. The term to search for is “heap compaction.”

There’s also the weird and wonderful work of Emery Berger et al with their “Mesh” malloc implementation, which blows my mind: https://youtu.be/XRAP3lBivYM.


To do that you either need a structure that you update every time a pointer is created, copied, moved or deleted (too much overhead), or you need a way to scan the entire memory and get all the pointers. And at the point where you have a piece of code that knows where every pointer is, you already know which pointers aren't used anywhere anymore so it's a waste to not have it also call free() for you.

Once you have it call free() for you, your piece of code is now a compacting GC, like Java's for example.


> Some kind of virtual memory driver that would map old pointers to new addresses transparently for the programs

You would need hardware support for this, since the hardware is what decides what gets returned when a program attempts to read from a memory location.

Hardware already does support virtual memory but the granularity is the page (which are a minimum of 4KiB in most OSs).


In a language like C that isn't really possible because the language can't keep track of all of the places that memory address is stored.

If malloc were to return something like an address that holds the address of memory allocated there is nothing preventing the program from reading that address, doing math on it, and storing it somewhere else.


Well, that's what some GCs do, and they do indeed defragment the heap.

Thank you for this, this is helpful.

I wrote a JIT compiler and I didn't bother calling free much, I just let the operating system free up all allocated memory.

I got into this situation often:

   return_struct = do_something(mystruct);
   return_struct->inner_struct = malloc(sizeof(struct my_inner_struct));
Now, who owns inner_struct? Who is responsible for freeing it? Do I free it when I assign to it?

I feel this ownership complicates cross-language FFI API calls, because who is responsible for freeing structures depends on the application and the platform you're running under. For example, Rust code being called from Erlang. You have to be extra careful when memory is managed by a different language runtime.

I feel I am at the beginning of intuitively understanding what memory really is: memory is just a huge contiguous region of numbered locations. Bump allocators allocate in one direction and free all at once. Arena allocators allocate to a preallocated region, I think.

Memory is a logistical problem of how you arrange and allocate finite resources.

I am thinking of alternative visualizations of understanding memory, for example, I started writing an animation of binary search:

https://replit.com/@Chronological/ProgrammingRTS

The idea is that you see values and memory locations move around, with the final goal of being able to command memory to move around and be computed, like a real-time strategy game.

I think if we could visualise memory as cars on a road, we would see obvious traffic jams.


Yeah, I hear you. I've not done a lot of FFI stuff directly, it scares me.

Arena allocators are cool. The idea is you allocate a large-ish region of memory and sub-allocate into it (often with a fast, simple allocator like a bump allocator), then free the large-ish block when you're done. It's a way of exploiting the fact that you know how much memory you'll need as a whole, optimising it down to a single malloc/free pair.
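A minimal sketch of the idea (illustrative names, not any particular library's API):

```c
#include <stdlib.h>
#include <stddef.h>

/* One malloc up front, a bump pointer for sub-allocations,
 * one free for everything at the end. */
typedef struct {
  unsigned char *base;
  size_t used, capacity;
} arena;

arena arena_create(size_t capacity) {
  arena a = { malloc(capacity), 0, capacity };  /* base is NULL on failure */
  return a;
}

void *arena_alloc(arena *a, size_t size) {
  size = (size + 15) & ~(size_t)15;             /* keep allocations 16-byte aligned */
  if (a->base == NULL || a->used + size > a->capacity) return NULL;
  void *p = a->base + a->used;
  a->used += size;
  return p;
}

void arena_destroy(arena *a) {                  /* frees every sub-allocation at once */
  free(a->base);
  a->base = NULL;
  a->used = a->capacity = 0;
}
```

Note there's no per-allocation free at all: individual frees are a no-op by construction, which is what makes the fast path so cheap.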

You may enjoy looking through https://www.cs.usfca.edu/~galles/visualization/Algorithms.ht....


Thanks for the link to the animations.

I want an extremely performant deep copy solution, I've been thinking of using an allocator to implement it.

If we have a tree data structure or a nested hashmap, we want to copy it cheaply. There is copy-on-write, but most copies of hashmaps are slow because they instantiate every child object in a recursive loop.

So I want to be able to memcpy a complicated data structure for cheap copies.


> I feel I am at the beginning of intuitively understanding what memory really is: memory is just a huge contiguous region of numbered locations.

There might be an analogy here that could help you reason about your nested structure allocations…

Memory is an array of bytes owned by the OS. While there are all kinds of implementation details about addressing and storage and performance and paging and virtual memory, it’s really just an array. The OS gives you a way to reserve pieces of the array for your own use, and you’re responsible for giving them back if you want to play nice and/or run for a long time, otherwise (as a safety net) the OS will take them back as soon as you exit.

This is, in a sense, very similar to the question you posed. An outer routine owns the outer structure, and an inner routine allocates some inner structure. The simplest, most intuitive, and generally best advice is that whoever allocates is also responsible for freeing memory. In other words, one way to define ownership of memory is by who allocates it. Implicitly and automatically, the responsibility to free that memory belongs to the owner that allocated it. It's okay to explicitly transfer ownership, but that can easily get complicated and unintuitive.

You can also consider letting the parent free your struct to be similar to not calling free() in your JIT compiler: it's a 'lazy' optimization to have the parent clean up. And I don't mean that in a judgemental sense; I mean it's valid to let the parent handle it, if you know that it will, and this can be done without getting confused about who owns the memory and who was actually responsible for freeing it.

Note that when you leave the parent to clean it up, you are foregoing the ability to re-use the memory. This is true in your JIT compiler, and it's true for malloc() and free() as well. If you let the OS handle it, you're in effect declaring that you believe you do not need to recycle the memory allocated in your program during its execution. (This might be true, and it might stay that way, but it's always worth asking if it will remain true, since lots of people have eventually been bitten and had to retroactively refactor for memory management when their requirements changed.)
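To make the "whoever allocates, frees" convention concrete, a small illustrative sketch (hypothetical names):

```c
#include <stdlib.h>

/* Convention: each *_create pairs with a *_destroy, and the outer
 * structure owns whatever it allocated, releasing the inner parts first. */
struct inner { int value; };
struct outer { struct inner *in; };

struct outer *outer_create(void) {
  struct outer *o = malloc(sizeof *o);
  if (o == NULL) return NULL;
  o->in = malloc(sizeof *o->in);       /* outer allocates inner, so outer owns it */
  if (o->in == NULL) { free(o); return NULL; }
  o->in->value = 0;
  return o;
}

void outer_destroy(struct outer *o) {
  if (o == NULL) return;
  free(o->in);                         /* inner first, then the struct holding it */
  free(o);
}
```

With this shape, the caller of outer_create never has to know or care that an inner allocation exists, which answers the "who frees inner_struct?" question by construction.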


> You have to be extra careful when memory is managed by a different language runtime.

While it would be nice to have next to no overhead for FFI, it's not always tractable. That's why you have to serialize across boundaries, the same as if you're serializing across processes or the network. At least in a single virtual memory space you can have a caller allocate a buffer and the callee fill it, with the caller being responsible for deserializing and freeing later. That gets you pretty far, and is relatively safe.

The alternative is to be callee managed, and for the callee to return things by handle and not necessarily by pointer, but that is also fraught.


> Now, who owns inner_struct?

return_struct does since it is the only thing that knows the address.

> Who is responsible for freeing it?

return_struct is, unless you hand that responsibility over to something else.

> Do I free it when I assign to it?

Yes, unless you want leaks.

> I think if we could visualise memory as cars on a road, we would see obvious traffic jams.

That visualisation is helpful for threads, where the program is the road/map and the cars are the threads. I don't see how it's useful for memory.


All credit goes to my wonderful friend Sam Who: https://twitter.com/samwhoo

He recently published a similar guide on load balancing that is equally intuitive and insightful: https://samwho.dev/load-balancing/

I can't wait to see what he puts out next!


https://news.ycombinator.com/item?id=35588797

We never missed good things lol!


A delightful page, written in a wonderfully interactive way.

My high congratulations for creating a most friendly, readable and useful lesson!


> "There's no shortage of information about memory allocators on the Internet, and if you've read this far you should be well-placed to dive in to it. Join the discussion on Hacker News! https://news.ycombinator.com/item?id=36029087 "

Interesting to use hacker news as the blog's own comment section in this way.


The only thing that confused me is how it said we can know the location of the block after and before by calculating:

    address + <value at address>
    address - <value at address-1>
Shouldn't this be?

    address + <value at address> + 3
    address - <value at address-1> - 3

Well shit. I think you're right.

Oh another thing, I'm not a fan of the premise:

"As a general-purpose memory allocator, though, we can't get away with having no free implementation."

I have a belief that the future of software is short-lived programs that never free memory. Programs allocate and terminate. Short-lived programs communicate with each other via blocking CSP-style channels (see Reppy's Concurrent Programming in ML).

If you could also educate me on why this is a bad idea, I would appreciate it.


So basically garbage collection via just terminating and letting the OS handle it?

With a simple enough allocator, freeing things could maybe be beneficial even for short-lived programs, purely from the memory being in cache already instead of needing to ask the OS for more (incl. page faulting & zeroing, besides being limited by RAM throughput). For a buddy allocator without coalescing, a free() that just adds the argument to the corresponding freelist can be as simple as 5 or so x86-64 instructions (the fast path of the allocator being ~8 or so instructions; certainly more than a bump allocator, but not by much, and the reuse benefits can easily be pretty big).
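That fast path, sketched in C rather than instruction counts (illustrative; size-class selection and the slow path are omitted, `freelist` stands in for one size class's list head):

```c
#include <stddef.h>

/* A freed block's own first word becomes the "next" pointer,
 * so free is just a list push and alloc a list pop. */
static void *freelist = NULL;

static void tiny_free(void *p) {
  *(void **)p = freelist;              /* link block onto the list */
  freelist = p;
}

static void *tiny_alloc(void) {
  if (freelist == NULL) return NULL;   /* would fall through to the slow path */
  void *p = freelist;
  freelist = *(void **)p;
  return p;
}
```

Both paths compile down to a handful of loads and stores, which is where the "5 or so instructions" figure comes from.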

My first large scale web application was a webmail service built in C++ (!) where I early on decided we'd ditch nearly all freeing of memory, as it was running as a CGI, and it was much faster to just let the OS free memory on termination. The exception was for any particularly large buffers. Coupled with statically linking it, it reduced the overhead sufficiently that running it as a CGI performed well enough to save us the massive pain of guaranteeing sufficient isolation and ensuring we were free of memory leaks.

Especially in a request/reply style environment, long-running application servers are largely a workaround for high startup costs, and it's only a "bad idea" in the instances where removing that high startup cost is too difficult to be practical. Overall I love avoiding long-running programs.


My take on this is that code should always match up malloc and free, but your application may use an allocator where free is a no-op, if that's appropriate for the application you write. This way your code is more generic and can be reused in another application with different constraints.

And as soon as you are replacing free, you can replace malloc as well to be optimized for your use case. No need to build difficult bookkeeping hierarchies when they will never get used.


I agree with your point but disagree with your reasoning. I think programs should always free memory at some point because then it's easier to reason about debugging memory leaks.

Practically speaking though, there are arena allocators that do exactly this - you allocate a bunch of memory at once, assign like-typed instances to "slots" in that memory region, and then deallocate everything all at once. Thus, the individual instance `free()` is a no-op.


It's not general purpose, and lots of programs that were designed to be short-lived often end up not being so in the future. People used to point at compilers as a typical example of this kind of thing, well, now we have compilers as libraries sitting resident in every popular developer tool.

It's funny, I saw and retweeted this while writing this post: https://twitter.com/samwhoo/status/1650572915770036225?s=20

Not sure the future you describe is where we'll end up, haven't given it a huge amount of thought. Would be interesting to see, though.

Things like web servers could probably get away with doing some sort of arena allocation per request (I'd be surprised if some don't already do this).


Apache does this! And I do this in my own C web framework:

https://github.com/williamcotton/express-c/blob/master/deps/...


Even if the programs don't free memory, something has to allocate and free memory for the programs and channels.

Well otherwise, I learned a lot and the basics are much simpler than I expected, thank you for the article.

This also trapped me.

> Others will return what's called a "null pointer", a special pointer that will crash your program if you try to read or write the memory it points to.

This is not strictly true, it depends on the environment you're using. Some older operating systems and some modern embedded systems have memory mapped at the zero address.


This is incredibly well done. I've never seen malloc/memory allocation explained so clearly. I'd buy a book written like this.

I have some bad news for you about books.

Seems to be a bug on the first interactive graph, at least for me. Unless I'm misunderstanding the point of the graph, `malloc(7)` only allocates 2 bytes.

I came here to see if anyone else noticed this and am confirming that there is a bug in the first slider on malloc(7). Indeed it only allocates two bytes instead of seven.

Good spot! Thank you. Fix on its way out now. :)

Thank you for building this! It's helped a lot for me to wrap my head around the concept even more deeply than before.

True, it might be cut off by the screen?

Nah, I just failed at basic arithmetic :D

I wrote this:

  <div class="memory" bytes="32">
    <malloc size="4" addr="0x0"></malloc>
    <malloc size="5" addr="0x4"></malloc>
    <malloc size="6" addr="0x9"></malloc>
    <malloc size="7" addr="0xa"></malloc>
    <free addr="0x0"></free>
    <free addr="0x4"></free>
    <free addr="0x9"></free>
    <free addr="0xa"></free>
  </div>

Instead of this:

  <div class="memory" bytes="32">
    <malloc size="4" addr="0x0"></malloc>
    <malloc size="5" addr="0x4"></malloc>
    <malloc size="6" addr="0x9"></malloc>
    <malloc size="7" addr="0xf"></malloc>
    <free addr="0x0"></free>
    <free addr="0x4"></free>
    <free addr="0x9"></free>
    <free addr="0xf"></free>
  </div>

I love this so much, thank you for putting this together!

My only piece of feedback would be for the "Inline Bookkeeping" section (https://samwho.dev/memory-allocation/#inline-bookkeeping), it took a while for me to grok the numbered list to figure out which block corresponded to address + X. I wonder if there is a better way to visualize the 4 numbered bullet points? Maybe just arrows and text pointing to the visualization?

Thanks again for this wonderful article!


Yes, this is one of the things I look back on as a weak point in the writing. I wanted to make this a lot more clear but by the time I'd gotten to this point in the article, I'd sort of coded myself into a corner with the visualisation code.

In another universe I'd hook in to the page scroll and highlight each block being referred to as I talk about it in the text. I'm probably not explaining that well here, but imagine something like the way this article works: https://pudding.cool/2017/05/song-repetition/.


Couldn't you highlight each block being referred to as you mouse over / click on relevant text? (The latter might not be terribly hard to do with <a> and CSS :target, if you can give the blocks ids.)

The blocks aren't HTML elements, they're drawn on to a canvas. It _could_ be possible regardless, though. I may revisit this. Thanks for the idea! :)

This is a nice read too (the whole blog actually)

http://igoro.com/archive/volatile-keyword-in-c-memory-model-...

It's a bit old by now (2010), but I always remembered the mental model Igor created.


I think it would be helpful if it discussed internal fragmentation, where the payload is smaller than the allocated block. I found this important for understanding the purpose of various memory allocation algorithms when doing the malloc lab in college.

A great example of why I'll always read a post over watching a video.

Glad that talented people and great guides like this always exist! They help countless learners along the way. Timeless, priceless.

Interesting. Is there a book that traces the evolution of allocators, so we can follow along and code allocators of increasing difficulty?

Someone needs to make an awesome list for visual guides

Absolutely, this is so helpful

Great webpage, but it somehow left me more confused.

It seems to imply that malloc/free works by boundary tag? Which I don't think is the case? (And if not, it leaves the reader confused as to how it actually works.)

I know some languages use the tag technique (e.g. Julia), but the article seems to imply that this also applies to the C code above. Or am I wrong, and C also makes use of such a scheme when you use pointer arithmetic? (I don't see how that would work with arrays if that's the case, though.)


I'm sorry you're more confused than you were when you started!

The post is intending to talk about the topic of memory allocation in a general sense. The way that malloc works on any given system is implementation dependent, it's not possible to say "malloc works in this way" because Debian can differ from Arch can differ from BSD, etc.

It's not my goal to tell you exactly how modern malloc/free implementations work. I could probably be more clear about this at the start of the post.
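For what it's worth, the size-tag idea itself is easy to sketch, even though real malloc implementations vary. Here's a toy illustration in C (`toy_malloc` and `toy_usable_size` are invented names, and the word-addressed heap is purely for simplicity) of stashing a size header just before the pointer handed to the caller, so a later call can recover the block size with pointer arithmetic alone:

```c
#include <stddef.h>

/* Toy inline bookkeeping: one word of header (the size tag) sits
   immediately before each returned pointer. Real allocators store
   more (flags, neighbour sizes for coalescing), but the mechanism
   is the same. */
#define HEAP_WORDS 128
static size_t heap[HEAP_WORDS];  /* word-addressed toy heap */
static size_t heap_top = 0;      /* index of next free word */

void *toy_malloc(size_t size) {
    /* round the payload up to whole words */
    size_t words = (size + sizeof(size_t) - 1) / sizeof(size_t);
    if (heap_top + 1 + words > HEAP_WORDS) return NULL;
    heap[heap_top] = size;             /* inline size tag */
    void *user = &heap[heap_top + 1];  /* caller never sees the tag */
    heap_top += 1 + words;
    return user;
}

size_t toy_usable_size(void *ptr) {
    /* step back one word to read the tag behind the pointer */
    return ((size_t *)ptr)[-1];
}
```

This bump-only sketch never reuses memory; a usable allocator would thread freed blocks onto a free list and coalesce neighbours, as the article describes.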


Nice article, but highly editorialized title, not sure what is intuitive about it.

This is a great article. I also recommend K&R Chapter 8 if people wanna know more. Do the exercises and it’ll just click.

It even has a playground! https://samwho.dev/allocator-playground/

How I wish I had something like that when I first learned C.


The playground was the inspiration for the post. I always wanted to be able to fiddle with malloc implementations and see how they work in this way.

Admittedly the playground is hard to understand at first, and a touch janky in places. But the workloads are taken from for-real malloc/free traces on programs!


When you think that I (and probably the vast majority of developers) used a pen and a paper for the first few years every time I tried to visualize more complex memory, then that's a big upgrade.

Especially because you can scroll through all the steps.


So good! I wish I had an article like this to learn from when I implemented my language's memory allocator a few months ago! Wonderful, thank you!

I love this! I wish it existed back when I was writing my first memory allocators in university or when building a custom EntityComponentSystem implementation.

I'd love to also see applications of custom memory allocations. I know about usecases in building game engines and the importance of hitting cache there, but I'm not sure where else in the world this would be as useful.


Robotics, embedded systems, CUDA/gpgpu, AI/Deep learning/Reinforcement Learning/LLM (PyTorch) for example.

This is really, really well done. Also, the allocator playground[0] is really cool. Will be my go-to when teaching this topic moving forward :)

[0] https://samwho.dev/allocator-playground/


Thanks so much, I really appreciate it.

I'm glad you like the playground. If you don't mind me asking, what/where/how do you teach? I was actually hoping to get the attention of educators with this tool to see if it would make sense in, e.g., undergrad CS courses.


Just friends and colleagues, nothing formal. I wish you the best of luck with it, though!

Ahh yes. Back at university we had a class called efficient programming where we had to implement a Unix utility and optimize it for speed, meaning cpu cycles.

While aggressively optimizing we replaced malloc in the end with a function we called "cooloc", that traded portability and security for speed. The debug tool here would have been useful.


Somewhat relevant about how GCs work, with also very great visuals: https://spin.atomicobject.com/2014/09/03/visualizing-garbage...

I really like the website design, above all. I'd love to pen my thoughts online, but haven't had the energy to learn static site generation...

Can you highlight the specific malloc() calls in the interactive part? It confused me when it said malloc(5) because it looks almost exactly like malloc(4). Specifically highlighting the malloc(5) allocation would make that a bit clearer I think.

I completely agree with you on this, though I couldn't find a great way to do it. It was suggested to me to put an outline around the allocation or free that just happened, but I struggled to get the box to look good when the allocation was split across 2 lines.

I've started writing my next post and have learnt a bit more about PixiJS/GSAP3, and think I know a way to do it that would work nicely but would require changing some of the underlying code. I can't promise I'll revisit this soon, but I would like to in future.


It's somewhat incomplete without a discussion of how one actually gets the memory allocated to a program.

Could you elaborate on what you mean?

It talks about how a simple malloc implementation would dole out entries in a buffer of memory and manage free lists, but not how the implementation actually gets that buffer from the operating system, or returns it to the system when done (mmap, munmap, for example).

Ahh yes, I'm with you now.

I made a deliberate choice to not include that, mostly due to post size/complexity reasons. I decided that the focus of the piece was going to be on the task of effectively managing the memory you have in the face of calls to malloc/free. I decided against talking about any environment-specific information (brk, mmap, cache locality, multithreading, etc.)

May not have been the right call, but it kept the length fairly reasonable and I think it gives people enough of a nudge to dip their toes into the playground if they have time.

If you check the HTML comments on the page you'll find quite a lot of cut content. I wanted to do deep dives on real-world malloc implementations, but it ended up being very difficult to cover without making the post take over an hour to read.


Love the dog. A bit like the “Cool Bear” asides in Amos’s articles (fasterthanli.me), a chance to pause & consolidate.

Don't tell Amos, but Cool Bear is the inspiration for Haskie, my little husky companion.

While I do like the article, I wish it simply used multiple images and/or animated gifs (instead of javascript) for the pictorials.

It would make the site much more accessible and clear in the event you didn't realize you had to click forward.


The accessibility of this technique is something that greatly worries me. I try and be quite thoughtful about things like colour palette. I'm not sure how to achieve the level of interactivity I want while still being accessible to screen readers, without writing two completely separate articles with and without JavaScript.

Definitely open to ideas.


Oh, if only I had this in college! This is a fantastic explanation.

[1] is not very far from that. IMHO, [1] is better.

[1] https://pages.cs.wisc.edu/~remzi/OSTEP/vm-freespace.pdf


This is an example of why interactive blog posts are awesome. I loved the illustrative sliders.

This is wonderful! I'm definitely going to be sharing this with my students (college sophomores studying CS).

If I were to make some suggestions, based on how I know they would receive this:

- I would make explicit reference to heap and stack. Students who are learning this material are learning about the heap/stack dichotomy, and I think it would really improve the exposition to make clear that not all memory is allocated this way in a program.

- I would remove this line: "At the end of this post, you should know everything you need to know to write your own allocator." I can confidently say that my students will not be able to write an allocator after reading this. It's nothing to do with your piece; it's just that the intersection of people who don't understand hex and people who could build an allocator after a short blog post is very, very small. Students who read this post and at the end still feel like they can't build an allocator will end up discouraged, with a feeling that maybe they are missing something.

- Consider rethinking how you handle hex numbers throughout. You introduce them and say they are distinguished by a preceding "0x", but then you immediately omit the 0x to save space in the figure. In my experience, students have a lot of trouble with understanding that hex and dec have the same underlying representation. They will not be able to distinguish between hex and dec bases implicitly, so from a pedagogical standpoint, it's better to be consistent throughout and include the prefix.

- Finally, at the end I would mention that there are other strategies for memory management, to encourage further exploration. Other languages do it differently, and for students it's important to know which other avenues are out there. Otherwise they might think heap-based malloc/free is the only way to do things, the same way they often think imperative programming is the only way to program.

Anyway, thank you for creating this, and I'm sure it will really help my students. In a just world, "seeing" the memory like this would ideally be first-class tooling for languages like C.


Really appreciate you taking the time to write this, thank you.

I tried a couple of different ways to introduce the stack and the heap but it always felt like it made the post too long and complicated. In the end I decided to take a pure, idealistic view of memory in order to focus on the algorithms used to pack allocations effectively. You can see some of my abandoned efforts as HTML comments in the post :D

Introducing the 0x prefix and immediately dropping it hurt me as well, but I didn't have a better way to make the visualisation work on mobile. I completely agree with you that it's not ideal.

I'd like to do a post of this style about garbage collection at some point.


Please do, this excellent post is already a good start on the issues involved in creating compacting garbage collectors.

Maybe make this a series? I think another post just like this on the stack is in order. You could show allocating stack frames with the slider! Then you can put them both together in a third post and show the entire memory layout of a C program.

Would definitely like to see more thoughts from those cute corgis.


A few people suggested a series, but there's a human problem: my attention jumps between topics quite a lot! At the end of this post, and my load balancing post, my main feeling was relief in that I could start looking into a new topic.

Maybe after more time has gone by I can revisit earlier posts and do follow-ups :)


Thanks. Well written and presented.

Nice article on general-purpose memory allocation approaches. While the article doesn't explicitly discuss it, these all seem to be list-allocation mechanisms?

> "List allocators used by malloc() and free() are, by necessity, general purpose allocators that aren’t optimized for any particular memory allocation pattern... To understand, let’s examine commonly used list allocation algorithms: first-fit, next-fit, best-fit and quick-fit."

That's from an article on custom memory management (aimed at embedded programming issues) I've found pretty useful, and it picks up where this leaves off:

https://www.embedded-software-engineering.de/dynamic-memory-...

You can practice writing custom memory managers in an application that runs on an operating system by only using the stack (just create a big array of int etc. and call that your memory space):

> "For the safety-critical tasks, the developer can replace the standard allocator with a custom allocator that sets aside a buffer for the exclusive use of that task, and satisfies all memory allocation requests out of that buffer... The remainder of this paper presents four examples of custom allocator algorithms: the block allocator, stack allocator, bitmap allocator and thread-local allocator. Custom memory managers can use any one of these algorithms or a combination of different algorithms."
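A minimal version of the "big array as your memory space" idea is the stack allocator that paper mentions: allocation is a pointer bump, and freeing rolls the whole stack back to a saved mark, so frees must happen in reverse (LIFO) order. A sketch, with invented names and an arbitrary arena size:

```c
#include <stddef.h>

/* Fixed arena, aligned for any object type. */
#define ARENA_BYTES 4096
static _Alignas(max_align_t) unsigned char arena[ARENA_BYTES];
static size_t arena_top = 0;

/* Bump allocation: just advance the top-of-stack index. */
void *stack_alloc(size_t size) {
    size = (size + 7) & ~(size_t)7;   /* keep allocations aligned */
    if (arena_top + size > ARENA_BYTES) return NULL;
    void *p = &arena[arena_top];
    arena_top += size;
    return p;
}

/* Roll the stack back to a previously returned pointer, freeing
   everything allocated after it in one step. */
void stack_free_to(void *mark) {
    arena_top = (size_t)((unsigned char *)mark - arena);
}
```

This trades generality for speed in exactly the way the "cooloc" anecdote above describes: no free list, no coalescing, just an index.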


Love this. It is so important to know these fundamentals concepts. I would like to see a similar version for SQL database indexes as 99% of the engineers I work with have no idea how to use them.

If more people took the time to run EXPLAIN ANALYZE, they'd see indexes in action and hopefully learn from a few variations, especially pre/post variations of indexes. Or so I'd hope :).

It's kind of not important though, except for a tiny group of developers.

I am slightly embarrassed I didn’t know overallocate was the term for that. TIL!

The asides with the dog are great. I've seen a few blogs use this technique -- I'm not big on Greek, but I think Plato used this technique too.

It instills the idea that you should be asking a question at this point. Some information has been provided that should generate more questions in your head, if you're keeping pace.



