The fact that all the allocators have the same tail performance suggests the measurement is polluted by something like thread preemption or processor power state transitions.
TLSF has a worst case of about 30 microseconds. The second-best worst case is 200 microseconds, shared by three allocators (HeapAlloc, dlmalloc, and rpmalloc), which implies that something inside the OS call is occasionally very slow.
TLSF is the only tested allocator that uses a pre-allocated pool and stays in user space, and it has noticeably better tail performance.
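One way to check whether a tail is caused by the environment (preemption, power state changes) or by the allocation path itself is to time a no-op with the same loop and compare the two tails. The sketch below is only an illustration of that method: `percentile`, `tail_summary`, `time_calls`, and `noop` are hypothetical helper names, and `bytearray(4096)` is a stand-in workload, not one of the allocators discussed here. Python adds its own interpreter and garbage collector noise, so the absolute numbers mean little; the comparison between baseline and workload is the point.

```python
import math
import time


def percentile(sorted_samples, p):
    """Nearest-rank percentile (0 < p <= 100) of an already sorted list."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(len(sorted_samples) * p / 100))
    return sorted_samples[rank - 1]


def tail_summary(samples_ns):
    """Median, 99th percentile, and worst case of a list of latencies."""
    s = sorted(samples_ns)
    return {"p50": percentile(s, 50), "p99": percentile(s, 99), "max": s[-1]}


def time_calls(fn, iterations=100_000):
    """Time each call to fn individually and return the latencies in ns."""
    clock = time.perf_counter_ns
    samples = []
    for _ in range(iterations):
        start = clock()
        fn()
        samples.append(clock() - start)
    return samples


def noop():
    """Baseline: does no work, so any large sample is measurement noise."""
    pass


if __name__ == "__main__":
    # If the baseline's max is as large as the workload's max, the tail
    # is more likely preemption or similar than the workload itself.
    print("baseline", tail_summary(time_calls(noop)))
    print("workload", tail_summary(time_calls(lambda: bytearray(4096))))
```

If the no-op baseline shows the same worst case as the allocator under test, the measurement is probably polluted. If the baseline stays low while the allocator spikes, the slow path is more likely inside the allocator or the OS call it makes.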