
The Magic 0xC2 - koehr
https://log.koehr.in/2017/04/09/the-magic-0xc2
======
IvanK_net
It reminds me of a fun fact, which I noticed when writing a TTF font parser:
https://github.com/photopea/Typr.js

TTF files have a 4-byte field, where the font manufacturer can put
"information about himself" (like the identification). The Adobe company puts
an ASCII string "ADBE" into these four bytes.

There is another field for the font manufacturer, which has only two bytes.
Guess what Adobe puts into these two bytes? 0xadbe :D

~~~
koehr
That must have been fun times!

------
cornstalks
Am I the only one struggling to understand the hex dump? The author says "0x79
is the z in the ASCII table." That's wrong. 'z' is 0x7a.

The author also says "In UTF-8 all characters after 0x79 are at least two
bytes long." That's also wrong. All characters after 0x7f get encoded as two
or more bytes.
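
Both corrections are easy to check. A minimal Python sketch (the helper name `utf8_len` is made up for illustration):

```python
# Check where 'z' sits in ASCII and where UTF-8 switches to multi-byte sequences.

def utf8_len(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one code point."""
    return len(chr(code_point).encode("utf-8"))

print(hex(ord("y")), hex(ord("z")))    # 0x79 0x7a
print(utf8_len(0x79), utf8_len(0x7F))  # 1 1
print(utf8_len(0x80))                  # 2
print(chr(0x80).encode("utf-8"))       # b'\xc2\x80'
```

The last line also shows where a 0xC2 comes from: it is the UTF-8 lead byte for code points 0x80 through 0xBF.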

~~~
mnarayan01
He skips all the numbers where the hex-encoding has an A-F as the least
significant digit. This makes everything bizarre.

~~~
jwilk
But after 99, they are no longer skipped. Go figure.

------
koehr
So what most probably happened is that FileReader.readAsBinaryString()°
falls back to FileReader.readAsText()°° since it's deprecated. At least that is
what I saw in Chromium. As soon as I used readAsArrayBuffer() the problem went
away.

°) https://developer.mozilla.org/en-US/docs/Web/API/FileReader/readAsBinaryString

°°) https://developer.mozilla.org/en-US/docs/Web/API/FileReader/readAsText
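
One way a stray 0xC2 can appear, sketched in Python rather than with the browser API: if every byte of a file ends up as a character with the same code point, and that string is later encoded as UTF-8, then each byte from 0x80 to 0xBF gains a 0xC2 prefix and each byte from 0xC0 to 0xFF becomes 0xC3 plus a second byte. This is an assumption about the mechanism, not something confirmed for Chromium, and `mangle` is a hypothetical helper.

```python
def mangle(raw: bytes) -> bytes:
    """Treat each byte as a code point (Latin-1), then re-encode as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")

png_magic = bytes([0x89, 0x50, 0x4E, 0x47])  # first four bytes of a PNG file
print(mangle(png_magic).hex(" "))            # c2 89 50 4e 47
print(mangle(b"\xff\xd8").hex(" "))          # c3 bf c3 98 (JPEG start marker)
```

Bytes below 0x80 pass through unchanged, which is why only parts of such a file look damaged.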

------
derpadelt
If you dive head-first into Python's string behaviour, you'll eventually learn
the hard UnicodeDecodeError-way what the difference is between a stream of
bytes/octets and a text made of Unicode code points. Much the same as learning
that a timestamp without a timezone is not worth much, a text as a stream of
bytes is not worth much without the encoding it is in. PHP also has nice
footguns in that area.
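
A small illustration of that difference, as a sketch (the sample word is arbitrary):

```python
data = "café".encode("utf-8")    # text -> bytes; the encoding is chosen explicitly
print(data)                      # b'caf\xc3\xa9'

print(data.decode("utf-8"))      # café
print(data.decode("latin-1"))    # cafÃ©  (wrong encoding, silently wrong text)

try:
    data.decode("ascii")         # bytes above 0x7F are not valid ASCII
except UnicodeDecodeError as exc:
    print("decode failed at byte offset", exc.start)
```

The same five bytes give three different outcomes; nothing in the bytes themselves says which one is right.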

~~~
mschuster91
> PHP also has nice footguns in that area.

PHP would deal with uploaded files by itself and write them correctly and
directly to disk (unlike some Java implementations like Nexus which buffers in
RAM, you can guess what happens).

As for ordinary POST/GET parameters, it stores them in a string aka a byte
stream which you can then post-process e.g. by translating to UTF-8 based on
the browser encoding header. So basically the only way to shoot yourself is if
you're doing substr and friends on user input instead of using the mb_
variants.
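
The same pitfall exists outside PHP. As a rough Python analogue (not PHP code): slicing the encoded bytes can cut a multi-byte character in half, while slicing the decoded text works on whole code points.

```python
text = "héllo"
raw = text.encode("utf-8")    # b'h\xc3\xa9llo': 6 bytes for 5 characters

byte_cut = raw[:2]            # b'h\xc3': ends in the middle of 'é'
char_cut = text[:2]           # 'hé'

print(len(text), len(raw))    # 5 6
# The truncated sequence cannot be decoded; "replace" substitutes U+FFFD for it.
print(byte_cut.decode("utf-8", errors="replace"))
print(char_cut)               # hé
```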

~~~
wahern

> unlike some Java implementations like Nexus which buffers in
> RAM, you can guess what happens

I can't guess. It's 2017 and everybody knows that RAM is unlimited and nobody
need ever worry about an OOM condition. Assuming, for the sake of argument,
that RAM isn't unlimited, it's unrecoverable, anyhow--modern language
designers and Linux kernel architects have made sure of it.

------
gnrlist
Sounds like you're confusing the integer code point and the integer
representation of characters.

Many programming languages internally represent chars as UTF-8 or UTF-16, so
when using libraries to read bytes into chars everything gets mangled.

Check out this guide for a more in-depth look at the mangling that can happen:
http://cweb.github.io/unicode-security-guide/background/

------
tossandturn
Isn't this why FTP had separate Binary and Text transfer modes?

~~~
jharger
It's similar, yeah, but actually it wasn't binary or text mode, it was IMAGE,
ASCII and EBCDIC modes. Image was essentially binary, but the other two would
translate text encodings so that you could convert documents to a format that
was readable on other machines.

~~~
lloeki
IIRC some FTP (client-side? server-side?) implementations even "helpfully" did
the DOS to Unix line ending conversion and back when ASCII was used and it
involved a Windows machine.
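
Why that conversion hurts binary files, as a minimal sketch. `to_crlf` is a hypothetical stand-in for what an ASCII-mode transfer might do, not the behaviour of any particular FTP implementation:

```python
def to_crlf(data: bytes) -> bytes:
    """Naive Unix-to-DOS conversion: every LF becomes CR LF."""
    return data.replace(b"\n", b"\r\n")

text = b"line one\nline two\n"
print(to_crlf(text))                # b'line one\r\nline two\r\n'

# The 8-byte PNG signature happens to contain LF bytes as well.
binary = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
damaged = to_crlf(binary)
print(len(binary), len(damaged))    # 8 10
print(damaged.hex(" "))             # 89 50 4e 47 0d 0d 0a 1a 0d 0a
```

For text the result is what the receiver wants; for the image, two bytes were inserted and the file no longer starts with a valid signature.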

------
MatthewWilkes
I had a similar thing with glitched images once. We had to retrofit a
middleware onto a site that would obfuscate email addresses. It used a regex
to spot valid emails and replaced them with a hash. It also knew what urls and
form parameters expected emails and used a lookup table to translate them
back. This was sufficient to anonymise usernames without breaking any
functionality on the site.

Turns out, we forgot to check content type, and valid emails according to the
regex we had used were surprisingly common in binaries.
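
How a binary can contain a "valid email", as a sketch. The pattern, the sample bytes and the `is_text` helper are all made up for illustration; the actual regex and middleware are not shown in the comment above.

```python
import re

# A deliberately loose pattern of the kind often used to spot email addresses.
EMAIL = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_text(content_type: str) -> bool:
    """Only bodies whose Content-Type is text/* should be rewritten."""
    return content_type.split(";")[0].strip().lower().startswith("text/")

blob = b"\x00\x10q7@Zk.xy\xff\xd9"    # made-up bytes standing in for image data
print(EMAIL.findall(blob))            # [b'q7@Zk.xy']
print(is_text("image/jpeg"))          # False
print(is_text("text/html; charset=utf-8"))  # True
```

Replacing that eight-byte match with a hash of a different length or content would corrupt the file, which is why the content-type check matters.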

------
chuckdries
So the server was erroneously treating the images as text?

~~~
koehr
Not the server but the browser. It looks like FileReader.readAsBinaryString is
falling back to readAsText since it's deprecated. Just an assumption though. I
couldn't find any better evidence.

------
mnarayan01
First guess: The author is running into something related to
https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Sending_and_Receiving_Binary_Data#Receiving_binary_data_in_older_browsers.

------
yongjik
> 0x79 is the z in the ASCII table.

The most confusing technically correct statement of the year.

Edit: Sorry, scratch "technically correct". Need more coffee.

~~~
tekknolagi
How is it technically correct?

~~~
michaelhoffman
If you define "the z of _x_" as the last element of _x_, then it is correct.
Whether such a definition is reasonable (especially when _x_ already contains
an element specifically named "z") is left as an exercise to the reader.

EDIT: I suppose this doesn't include 0x7A-0x7F either so it's not even
technically correct under that definition.

~~~
ChristianBundy
Base 10 society strikes again! This happens to me too, it's hard to think of
[0-9] being only 62.5% of base 16.

