Your packed union eliminates pure padding, yes, but it still costs 9 bytes to store a u8 value, and it misaligns accesses for wider elements. Both are significant misses on "solving the problem." You also need to copy each element on access, which could be expensive for larger plain-old-data types and especially for non-POD types.
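To make the 9-byte and misalignment points concrete, here is a rough sketch in C. The type and field names are made up for illustration, the widest member is assumed to be 8 bytes, and `__attribute__((packed))` is the GCC/Clang spelling, so treat this as one possible layout rather than the code under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed tagged union; names are illustrative only. */
typedef struct __attribute__((packed)) {
    uint8_t tag;            /* records which member of `as` is live */
    union {
        uint8_t  u8;
        uint32_t u32;
        uint64_t u64;       /* assumed widest member: 8 bytes */
    } as;
} PackedValue;

/* Packing drops the padding after `tag`, but every element still
 * reserves room for the widest member: 1 tag byte + 8 payload bytes. */
static_assert(sizeof(PackedValue) == 9, "1 tag byte + 8 payload bytes");

/* The payload starts at offset 1, so the u32 and u64 members are not
 * naturally aligned, even in element 0 of an array. */
static_assert(offsetof(PackedValue, as) == 1, "payload starts at offset 1");

/* Bytes used by an array of n elements, whatever they actually hold. */
size_t packed_array_bytes(size_t n) {
    return n * sizeof(PackedValue);
}
```

Under these assumptions, a thousand u8 values stored this way occupy 9000 bytes, of which 1000 carry data.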
The final approach in the article is an array for each unique object size, plus tagged indices/pointers. This would take one byte per uint8_t and doesn't suffer from the problems you mention, though it does have others. If memory pressure is your main problem, it's a big win.
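For reference, a minimal sketch of the array-per-size idea in C. Everything here is an assumption for illustration and not the article's code: the `Store` and `Handle` names, the fixed capacity, the three element widths, and the choice to put the kind tag in the low two bits of a 32-bit index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { KIND_U8 = 0, KIND_U32 = 1, KIND_U64 = 2 };

/* Tagged index: low 2 bits hold the kind, the rest is the position
 * inside that kind's array. (Hypothetical encoding.) */
typedef uint32_t Handle;
#define KIND_BITS 2u
#define KIND_MASK ((1u << KIND_BITS) - 1u)

#define CAP 16  /* fixed capacity keeps the sketch short */

/* One densely packed, naturally aligned array per element size. */
typedef struct {
    uint8_t  u8s[CAP];  size_t n8;
    uint32_t u32s[CAP]; size_t n32;
    uint64_t u64s[CAP]; size_t n64;
} Store;

static Handle make_handle(unsigned kind, size_t index) {
    return (Handle)((index << KIND_BITS) | kind);
}

Handle store_push_u8(Store *s, uint8_t v) {
    assert(s->n8 < CAP);
    s->u8s[s->n8] = v;
    return make_handle(KIND_U8, s->n8++);
}

Handle store_push_u32(Store *s, uint32_t v) {
    assert(s->n32 < CAP);
    s->u32s[s->n32] = v;
    return make_handle(KIND_U32, s->n32++);
}

Handle store_push_u64(Store *s, uint64_t v) {
    assert(s->n64 < CAP);
    s->u64s[s->n64] = v;
    return make_handle(KIND_U64, s->n64++);
}

/* Decode the handle and read from the matching array, widening the
 * result to 64 bits so one accessor serves every kind. */
uint64_t store_get(const Store *s, Handle h) {
    size_t i = h >> KIND_BITS;
    switch (h & KIND_MASK) {
    case KIND_U8:  return s->u8s[i];
    case KIND_U32: return s->u32s[i];
    default:       return s->u64s[i];
    }
}
```

Each u8 occupies exactly one byte in `u8s`, and the wider arrays stay naturally aligned. One cost of this layout is that anything referring to an element has to hold a handle, and insertion order across kinds is no longer implicit in the storage.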