TTF files have a 4-byte field, where the font manufacturer can put "information about himself" (like the identification). The Adobe company puts an ASCII string "ADBE" into these four bytes.
There is another field for the font manufacturer, which has only two bytes. Guess what Adobe puts into these two bytes? 0xadbe :D
The author also says "In UTF-8 all characters after 0x79 are at least two bytes long." That's also wrong. All characters after 0x7f get encoded as two or more bytes.
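A quick way to check where the boundary sits is to encode a few code points and count the bytes. This is a minimal sketch; `utf8_len` is a helper name made up for illustration.

```python
def utf8_len(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a single code point."""
    return len(chr(code_point).encode("utf-8"))

# Code points on either side of the one-byte / two-byte boundary.
boundary = {hex(cp): utf8_len(cp) for cp in (0x79, 0x7A, 0x7F, 0x80, 0x7FF, 0x800)}

for cp_hex, n in boundary.items():
    print(cp_hex, n)
```

Everything up to and including 0x7F is one byte; 0x80 is the first code point that needs two.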
When PUTing the attachment, the appropriate Content-Type header must be set. If both of these things are done properly, I see no obvious reason why they'd run into the encoding issue mentioned. That makes me suspect they're not using the attachment feature properly or not setting the MIME type correctly.
Or they're doing something weird when grabbing the binary data from the user, like not using FileReader.readAsArrayBuffer() from their JS code and instead getting it as text. readAsArrayBuffer is specifically designed to deal with binary data, usually used with images in web context.
And decode unicode:
Thankfully Python 3 has removed all this madness:
AttributeError: 'bytes' object has no attribute 'encode'
AttributeError: 'str' object has no attribute 'decode'
(For those not familiar, Python 3 switched around the notation for unicode/bytes. In Python 2, "abcd" is a bytes literal and adding u makes it unicode; in Python 3, "abcd" is a unicode literal and adding b makes it bytes.)
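A small Python 3 sketch of that split, for anyone who wants to poke at it. The variable names are just for illustration.

```python
text = "abcd"    # str: a sequence of Unicode code points
data = b"abcd"   # bytes: a sequence of raw byte values

# Encoding only goes str -> bytes, decoding only goes bytes -> str.
encoded = "é".encode("utf-8")      # two bytes: 0xC3 0xA9
decoded = encoded.decode("utf-8")  # back to the one-character str

# The "wrong direction" methods simply do not exist any more,
# which is where the AttributeErrors above come from.
bytes_has_encode = hasattr(data, "encode")
str_has_decode = hasattr(text, "decode")
```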
PHP would deal with uploaded files by itself and write them correctly and directly to disk (unlike some Java implementations like Nexus which buffers in RAM, you can guess what happens).
As for ordinary POST/GET parameters, it stores them in a string, i.e. a byte stream, which you can then post-process, e.g. by converting to UTF-8 based on the encoding the browser declares. So basically the only way to shoot yourself in the foot is by using substr and friends on user input instead of the mb_ variants.
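The substr vs. mb_ difference is easiest to see by slicing the same string once by bytes and once by characters. This is a Python analogy, not PHP: slicing a `bytes` object stands in for byte-oriented functions like substr, and slicing a `str` stands in for the mb_ variants.

```python
text = "naïve"
raw = text.encode("utf-8")  # 6 bytes, because "ï" takes two

# Byte-oriented cut: the third byte is only the first half of "ï".
byte_cut = raw[:3]
try:
    byte_cut.decode("utf-8")
    byte_cut_is_valid = True
except UnicodeDecodeError:
    byte_cut_is_valid = False

# Character-oriented cut: three whole characters, still valid text.
char_cut = text[:3]
```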
PHP, generally speaking, doesn't do Unicode at all. Outside of functions which explicitly do encoding conversions (mbstring, iconv, etc), all "strings" are just handled as a bag of bytes.
The main footguns I'm aware of are the "utf8_encode" and "utf8_decode" functions, which, despite their names, actually do lossy UTF-8 <-> ISO-8859-1 conversions.
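Roughly what goes wrong with those two, approximated in Python. The function names `utf8_encode_like` and `utf8_decode_like` are made up here, and PHP's handling of invalid input differs in the details, so treat this as a sketch of the failure mode rather than a faithful port.

```python
def utf8_encode_like(data: bytes) -> bytes:
    """Treat the input as ISO-8859-1 and re-encode it as UTF-8."""
    return data.decode("latin-1").encode("utf-8")

def utf8_decode_like(data: bytes) -> bytes:
    """Convert UTF-8 to ISO-8859-1, replacing what does not fit with '?'."""
    return data.decode("utf-8").encode("latin-1", errors="replace")

# Feeding already-UTF-8 data to the "encode" direction double-encodes it.
already_utf8 = "é".encode("utf-8")
double_encoded = utf8_encode_like(already_utf8)
mojibake = double_encoded.decode("utf-8")

# The "decode" direction loses anything outside ISO-8859-1.
lossy = utf8_decode_like("€".encode("utf-8"))
```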
Many programming languages internally represent chars as UTF-8 or UTF-16, so when you use libraries that read bytes into chars, everything gets mangled.
Check out this guide for a more in-depth look at the mangling that can happen. http://cweb.github.io/unicode-security-guide/background/
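Here is a minimal Python sketch of that kind of mangling: arbitrary bytes are pushed through a text decode and then written back out. The sample bytes are invented for illustration and only loosely resemble a binary file header.

```python
# Made-up binary payload containing bytes that are not valid UTF-8.
data = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0xFE, 0x00, 0x80])

# Reading it "as text": each invalid byte becomes U+FFFD, so the
# original values are gone by the time we encode it again.
as_text = data.decode("utf-8", errors="replace")
mangled = as_text.encode("utf-8")
round_trip_ok = mangled == data

# ISO-8859-1 maps every byte to exactly one code point, so it happens
# to round-trip, but keeping binary data as bytes is the safer choice.
latin1_round_trip_ok = data.decode("latin-1").encode("latin-1") == data
```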
Turns out, we forgot to check content type, and valid emails according to the regex we had used were surprisingly common in binaries.
The most confusing technically correct statement of the year.
Edit: Sorry, scratch "technically correct". Need more coffee.
EDIT: I suppose this doesn't include 0x7A-0x7F either so it's not even technically correct under that definition.