The final approach in the article is an array for each unique object size plus tagged indices/pointers. This would take one byte per uint8_t and doesn't suffer from the problems you mention, though it does have others. If memory pressure is your main problem it's a big win.