
I deal with file formats that, like plain text files and zip, do not specify an encoding and use different encodings depending on where they come from. I think the generic approach is to guess, which means trying encodings until one "successfully" converts the unknown input to garbage Unicode, resulting in output that is both wrong and different from the original input. Most of the time I can just treat the text contents as a byte array, with a few exceptions that are specified to use ASCII-compatible names.

So you can throw in your emoji, and they might not show up correctly in the XML logging metadata I write, because I don't care. But they will end up in the processed file the same way they came in, instead of <?> or some random Chinese or Japanese symbol that the guessing algorithm thought appropriate.
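
To illustrate the guessing problem described above, here is a minimal sketch, assuming Python (the thread does not name a language). The `guess_decode` helper and its candidate list are hypothetical, not taken from any real library. It accepts the first encoding that decodes without raising an error, which is not the same as the correct one.

```python
# Bytes produced as UTF-8 by some source that does not declare its encoding.
original = "caf\u00e9".encode("utf-8")  # b"caf\xc3\xa9"

def guess_decode(data, candidates=("ascii", "latin-1", "utf-8")):
    """Hypothetical guesser: return the first encoding that decodes cleanly."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")

encoding, text = guess_decode(original)

# ascii fails on the non-ASCII bytes, but latin-1 maps every byte value to a
# code point, so it "succeeds" on any input and wins before utf-8 is tried.
# Writing the guessed text back out as UTF-8 no longer matches the input.
roundtrip = text.encode("utf-8")
```

Here the guess is `latin-1`, the decoded text is two-character mojibake in place of the accented letter, and `roundtrip` differs from `original`. That matches the complaint: the output is both wrong and different from what came in.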

In that case you should be opening files in binary mode ("b"); then you will be operating on bytes.
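
A minimal sketch of what that looks like, assuming Python's built-in `open()` (the language is my assumption; the file names and the `copy_bytes` helper are made up for illustration). In binary mode nothing is decoded, so bytes that are not valid in any particular encoding pass through unchanged.

```python
import os
import tempfile

def copy_bytes(src_path, dst_path):
    """Copy a file without decoding it: 'rb'/'wb' read and write raw bytes."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(src.read())

# Payload with bytes that are not valid UTF-8 (\xff\xfe) next to UTF-8 text.
payload = b"name=\xff\xfe caf\xc3\xa9\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.dat")   # hypothetical file names
    dst = os.path.join(tmp, "out.dat")
    with open(src, "wb") as f:
        f.write(payload)
    copy_bytes(src, dst)
    with open(dst, "rb") as f:
        result = f.read()
```

`result` is byte-for-byte identical to `payload`, including the invalid sequence.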

Also, there's no guessing happening in this instance. The locale configured in your environment variables is used if you open files in text mode.
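
A rough sketch of how to check this, again assuming Python: `locale.getpreferredencoding(False)` reports the locale's encoding, and a file object opened in text mode exposes the encoding it actually uses via `.encoding`. The file name is hypothetical, and the values you see depend on your platform, environment, and interpreter settings.

```python
import locale
import os
import tempfile

# Encoding derived from the current locale settings (no guessing from content).
default_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Text mode without an explicit encoding: the default is picked up
    # from the environment, not from the file's bytes.
    with open(path, "w") as f:
        f.write("plain ascii text\n")
        implicit_encoding = f.encoding

    # Passing encoding= makes the choice explicit and independent of locale.
    with open(path, "r", encoding="utf-8") as f:
        explicit_encoding = f.encoding
        content = f.read()
```

Printing `default_encoding` and `implicit_encoding` on your own machine shows what text mode will use there. Passing `encoding=` explicitly avoids depending on it.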
