The Magic 0xC2 (log.koehr.in)
62 points by koehr on April 11, 2017 | 29 comments

It reminds me of a fun fact I noticed while writing a TTF font parser: https://github.com/photopea/Typr.js

TTF files have a 4-byte field where the font manufacturer can put information about itself (an identification tag). Adobe puts the ASCII string "ADBE" into these four bytes.

There is another field for the font manufacturer, which has only two bytes. Guess what Adobe puts into these two bytes? 0xadbe :D
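Packing the two vendor fields looks like this (a quick sketch in Python; which exact TTF fields these are is per the comment above, not verified here):

```python
import struct

# The 4-byte field can hold an ASCII tag directly; the 2-byte field
# can only hold a number - so Adobe picked the number that spells
# its name in hex.
four_byte = struct.pack('>4s', b'ADBE')  # ASCII tag fits in 4 bytes
two_byte = struct.pack('>H', 0xADBE)     # 2-byte field, as a big-endian number

assert four_byte == b'ADBE'
assert two_byte == b'\xad\xbe'           # reads as "ADBE" if you squint
```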

that 4 byte field really should have been 0xAD0BEAD0BE

That must have been fun times!

Am I the only one struggling to understand the hex dump? The author says "0x79 is the z in the ASCII table." That's wrong. 'z' is 0x7a.

The author also says "In UTF-8 all characters after 0x79 are at least two bytes long." That's also wrong. All characters after 0x7f get encoded as two or more bytes.

He skips all the numbers where the hex-encoding has an A-F as the least significant digit. This makes everything bizarre.

But after 99, they are no longer skipped. Go figure.
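The 0x7f boundary is quick to verify in a Python 3 shell (my sketch, not from the article):

```python
# Code points up to U+007F are one byte in UTF-8; everything above
# needs at least two, and U+0080..U+00BF all start with the byte 0xC2.
assert 'y'.encode('utf-8') == b'\x79'         # 0x79 is 'y', not 'z'
assert 'z'.encode('utf-8') == b'\x7a'         # 'z' is 0x7a
assert '\x7f'.encode('utf-8') == b'\x7f'      # last single-byte code point
assert '\x80'.encode('utf-8') == b'\xc2\x80'  # first two-byte sequence
assert '\xbf'.encode('utf-8') == b'\xc2\xbf'  # still led by the magic 0xC2
```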

Yeah, it doesn't make a lot of sense. I feel like the author must be doing something wrong with CouchDB in the first place if his binary uploads are being converted to UTF-8. Not to mention that kind of thing should always be checked application-side anyway.

If you want to place binary data in CouchDB that's entirely possible. However, you have to create a document and then add an attachment to it. Attachments themselves can have multiple revisions.

When PUTting[1] the attachment, the appropriate Content-Type header must be set. If both of these things are done properly, I see no obvious reason why they'd run into the encoding issue mentioned, which makes me suspect they're either not properly using the attachment feature or not correctly setting the MIME type.

Or they're doing something weird when grabbing the binary data from the user, like not using FileReader.readAsArrayBuffer()[2] in their JS code and instead getting it as text. readAsArrayBuffer is specifically designed to deal with binary data and is usually used with images in a web context.

[1]: http://docs.couchdb.org/en/2.0.0/api/document/attachments.ht...

[2]: https://developer.mozilla.org/en-US/docs/Web/API/FileReader/...

Perhaps it's UCS-2 not UTF-8. Not interested enough to figure it out :-o

So what most probably happened is that FileReader.readAsBinaryString()° falls back to FileReader.readAsText()°° since it's deprecated. At least that is what I saw in Chromium. As soon as I used readAsArrayBuffer the problem went away.

°) https://developer.mozilla.org/en-US/docs/Web/API/FileReader/...

°°) https://developer.mozilla.org/en-US/docs/Web/API/FileReader/...
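The corruption pattern would look something like this (a hypothetical Python 3 simulation of a text-mode read, not the actual browser code):

```python
# If binary data is read as text (effectively a Latin-1 decode) and
# later re-encoded as UTF-8, every byte in 0x80-0xBF gains a 0xC2
# prefix - which would explain the magic 0xC2 from the article.
png_header = b'\x89PNG\r\n\x1a\n'           # real PNG magic bytes
as_text = png_header.decode('latin-1')      # what a text-mode read amounts to
mangled = as_text.encode('utf-8')           # what then gets stored
assert mangled == b'\xc2\x89PNG\r\n\x1a\n'  # 0x89 became 0xc2 0x89
```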

If you dive head-first into Python's string behaviour, you'll eventually learn, the hard UnicodeDecodeError way, what the difference is between a stream of bytes/octets and text made of Unicode code points. Much like learning that a timestamp without a timezone is not worth much, text as a stream of bytes is not worth much without knowing the encoding it is in. PHP also has nice footguns in that area.

Yeah, especially in Python 2 you can get into fun messes if you don't really understand what you're doing. Python 2 lets you encode bytes:

>>> 'abcd'.encode('ascii')
'abcd'

And decode unicode:

>>> u'abcd'.decode("ascii")
u'abcd'

It's nonsensical.

Thankfully Python 3 has removed all this madness:

>>> b'abcd'.encode('ascii')

AttributeError: 'bytes' object has no attribute 'encode'

>>> 'abcd'.decode('ascii')

AttributeError: 'str' object has no attribute 'decode'

(For those not familiar, Python 3 switched around the notation for unicode/bytes. In Python 2 "abcd" is a bytes literal, adding u makes it unicode, in Python 3 "abcd" is a unicode literal, adding b makes it bytes.)
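In Python 3 the two types and the two directions line up cleanly (a quick sketch):

```python
# str.encode() -> bytes, bytes.decode() -> str, and nothing else exists.
data = 'abcd'.encode('ascii')   # str -> bytes
text = b'abcd'.decode('ascii')  # bytes -> str
assert data == b'abcd'
assert text == 'abcd'
assert 'abcd'.encode('utf-8').decode('utf-8') == 'abcd'  # clean round trip
```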

> PHP also has nice footguns in that area.

PHP deals with uploaded files by itself and writes them correctly and directly to disk (unlike some Java implementations like Nexus, which buffer in RAM; you can guess what happens).

As for ordinary POST/GET parameters, it stores them in a string, i.e. a byte stream, which you can then post-process, e.g. by translating to UTF-8 based on the browser's encoding header. So basically the only way to shoot yourself is by doing substr and friends on user input instead of using the mb_ variants.
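The substr footgun, sketched in Python terms rather than PHP: a byte-oriented slice can cut a UTF-8 character in half.

```python
s = 'héllo'
raw = s.encode('utf-8')      # b'h\xc3\xa9llo': 6 bytes for 5 characters
broken = raw[:2]             # byte-oriented slice, like PHP's substr
assert broken == b'h\xc3'    # ends mid-character
try:
    broken.decode('utf-8')
    raise AssertionError('should not decode')
except UnicodeDecodeError:
    pass                     # the truncated sequence is invalid UTF-8
assert s[:2] == 'hé'         # character-aware slicing, mb_substr-style
```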

  unlike some Java implementations like Nexus which buffers in
  RAM, you can guess what happens
I can't guess. It's 2017 and everybody knows that RAM is unlimited and nobody need ever worry about an OOM condition. Assuming, for the sake of argument, that RAM isn't unlimited, it's unrecoverable, anyhow--modern language designers and Linux kernel architects have made sure of it.

> PHP also has nice footguns in that area.

PHP, generally speaking, doesn't do Unicode at all. Outside of functions which explicitly do encoding conversions (mbstring, iconv, etc), all "strings" are just handled as a bag of bytes.

The main footguns I'm aware of are the "utf8_encode" and "utf8_decode" functions, which actually do lossy UTF8 <-> ISO8859-1 conversions.
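The behaviour of that pair can be sketched in Python (hypothetical analogues of the PHP functions, written by me for illustration):

```python
def utf8_encode(b: bytes) -> bytes:
    # PHP's utf8_encode: treat input as ISO-8859-1, produce UTF-8.
    return b.decode('latin-1').encode('utf-8')

def utf8_decode(b: bytes) -> bytes:
    # PHP's utf8_decode: UTF-8 back to ISO-8859-1, lossy above Latin-1.
    return b.decode('utf-8').encode('latin-1', errors='replace')

assert utf8_encode(b'\xe9') == b'\xc3\xa9'          # é: Latin-1 -> UTF-8
assert utf8_decode('é'.encode('utf-8')) == b'\xe9'  # Latin-1 chars round-trip
assert utf8_decode('€'.encode('utf-8')) == b'?'     # € is lost: not in Latin-1
```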

Coincidentally, I just read this two-year-old post on Swift's string handling. Relevant.


Sounds like you're confusing the integer code point with the integer representation of characters.

Many programming languages internally represent chars as UTF-8 or UTF-16, so when using libraries to read bytes into chars everything gets mangled.

Check out this guide for a more in-depth look at the mangling that can happen: http://cweb.github.io/unicode-security-guide/background/

Isn't this why FTP had separate Binary and Text transfer modes?

It's similar, yeah, but it actually wasn't binary or text mode; it was IMAGE, ASCII, and EBCDIC modes. Image was essentially binary, but the other two would translate text encodings so that you could convert documents to a format that was readable on other machines.

IIRC some FTP (client-side? server-side?) implementations even "helpfully" did the DOS to Unix line ending conversion and back when ASCII was used and it involved a Windows machine.

I had a similar thing with glitched images once. We had to retrofit a middleware onto a site that would obfuscate email addresses. It used a regex to spot valid emails and replaced them with a hash. It also knew what urls and form parameters expected emails and used a lookup table to translate them back. This was sufficient to anonymise usernames without breaking any functionality on the site.

Turns out, we forgot to check content type, and valid emails according to the regex we had used were surprisingly common in binaries.

So the server was erroneously treating the images as text?

Not the server but the browser. It looks like FileReader.readAsBinaryString is falling back to readAsText since it's deprecated. Just an assumption, though; I couldn't find any better evidence.

First guess: The author is running into something related to https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequ....

> 0x79 is the z in the ASCII table.

The most confusing technically correct statement of the year.

Edit: Sorry, scratch "technically correct". Need more coffee.

Unfortunately it's not. I made a mistake here (I guess I mixed things up in my head). I meant the character 'z', but 0x79 is actually a 'y'. I will fix the issues later today.

How is it technically correct?

If you define "the z of x" as the last element of x, then it is correct. Whether such a definition is reasonable (especially when x already contains an element specifically named "z") is left as an exercise to the reader.

EDIT: I suppose this doesn't include 0x7A-0x7F either so it's not even technically correct under that definition.

Base 10 society strikes again! This happens to me too; it's hard to think of [0-9] as being only 62.5% of base 16.
