I think it's widely understood that a little-endian prefix VarInt, the leading alternative to LEB128, decodes much faster than LEB128 does. You can implement the whole thing without a loop, so you don't have to exercise branch and loop prediction resources at all.
- Count leading zeros (or ones, depending on the first byte's tagging preference).
- Do an unaligned load expressed as memcpy. Let the compiler's optimizer collapse it into a single unaligned load instruction.
- Shift the loaded value by an amount indicated by the leading zero count.
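For illustration, here is a minimal C sketch of those steps. The byte layout is an assumption, not something specified above: the unary length tag sits in the low bits of the first byte, so this version counts trailing zeros rather than leading ones (a layout tagged in the high bits would use a byte-swapped load and a leading-zero count instead). The function name `prefix_varint_decode` is hypothetical, `__builtin_ctz` is a GCC/Clang builtin, the host is assumed to be little-endian, and the caller is assumed to guarantee 9 readable bytes at `p`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout (hypothetical): the number of trailing zero bits in the
 * first byte equals the number of extra bytes that follow.
 *   xxxxxxx1            1 byte,  7 payload bits
 *   xxxxxx10 + 1 byte   2 bytes, 14 payload bits
 *   ...
 *   10000000 + 7 bytes  8 bytes, 56 payload bits
 *   00000000 + 8 bytes  9 bytes, 64 payload bits
 * Returns the number of bytes consumed. Little-endian host assumed. */
size_t prefix_varint_decode(const uint8_t *p, uint64_t *out)
{
    uint8_t first = p[0];

    /* Full-width case: the tag byte carries no payload at all. */
    if (first == 0) {
        memcpy(out, p + 1, sizeof *out);
        return 9;
    }

    /* Step 1: the zero count gives the encoded length (1..8 bytes). */
    unsigned len = (unsigned)__builtin_ctz(first) + 1;

    /* Step 2: unaligned load written as memcpy; compilers typically
     * turn this into a single load instruction. */
    uint64_t word;
    memcpy(&word, p, sizeof word);

    /* Drop the bytes that belong to whatever follows this varint. */
    word &= UINT64_MAX >> (64 - 8 * len);

    /* Step 3: shift out the len-bit tag, leaving the payload. */
    *out = word >> len;
    return len;
}
```

There is no loop here, only one branch for the 9-byte case. Under this assumed layout, the value 300 would be stored as the two bytes `B2 04`.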
I.e. a one-byte tag followed by the value? It's definitely an option for me, but the main issue is that values under 128 now take 2 bytes instead of 1. I guess I could do something like what CBOR does and use 0-251 = literal value, 252 = u8 follows, 253 = u16 follows, 254 = u32 follows, 255 = u64 follows.
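To make that tag scheme concrete, here is a rough C sketch of an encoder and decoder for it. The function names `tagged_encode` and `tagged_decode` are hypothetical, and the little-endian payload byte order is an assumption (the scheme as described doesn't fix one); the code also assumes a little-endian host and enough readable or writable bytes at `p`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tag byte: 0..251 is the literal value; 252, 253, 254, 255 mean a
 * u8, u16, u32, u64 follows. Payload byte order is assumed little-endian.
 * Writes up to 9 bytes at p and returns the number written. */
size_t tagged_encode(uint64_t v, uint8_t *p)
{
    if (v < 252) {
        p[0] = (uint8_t)v;
        return 1;
    }
    /* Pick the smallest payload width that holds v: k = 0..3. */
    unsigned k = v <= UINT8_MAX ? 0 : v <= UINT16_MAX ? 1
               : v <= UINT32_MAX ? 2 : 3;
    size_t width = (size_t)1 << k;      /* 1, 2, 4 or 8 bytes */
    p[0] = (uint8_t)(252 + k);
    memcpy(p + 1, &v, width);           /* low bytes first on an LE host */
    return 1 + width;
}

/* Reads one value from p and returns the number of bytes consumed. */
size_t tagged_decode(const uint8_t *p, uint64_t *out)
{
    uint8_t tag = p[0];
    if (tag < 252) {
        *out = tag;
        return 1;
    }
    size_t width = (size_t)1 << (tag - 252);
    uint64_t v = 0;
    memcpy(&v, p + 1, width);           /* fills the low bytes on an LE host */
    *out = v;
    return 1 + width;
}
```

With this layout, values up to 251 stay at 1 byte, 252 to 255 take 2 bytes, and anything needing more than 32 bits takes 9 bytes.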
- Zig-zag decode for signed integers
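The zig-zag step is small enough to show in full. This is a sketch of the usual mapping (0, -1, 1, -2, ... stored as 0, 1, 2, 3, ...); the function names are hypothetical, and the final cast to `int64_t` relies on two's-complement conversion, which is what mainstream compilers do.

```c
#include <stdint.h>

/* Map signed to unsigned so small magnitudes get small codes:
 * 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... */
uint64_t zigzag_encode(int64_t v)
{
    uint64_t u = (uint64_t)v;
    /* (0 - sign bit) is all ones for negative inputs, zero otherwise. */
    return (u << 1) ^ (0 - (u >> 63));
}

/* Inverse mapping: the low bit selects the sign, the rest is magnitude. */
int64_t zigzag_decode(uint64_t n)
{
    return (int64_t)((n >> 1) ^ (0 - (n & 1)));
}
```

Both directions are branch-free, so this adds only a shift, a negate and an xor after the varint has been decoded.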