Would not a better solution be to process the file as a byte string?

I don't think so. If you want to detect and operate on only the data that could represent ASCII characters, you could certainly process it as a byte string, but you'd have to track the presence of non-ASCII byte values yourself, and keep state around to represent whether you were in the middle of a multibyte character as you read through the bytes.
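As a rough sketch of the bookkeeping involved, here is what that state tracking might look like if you assume the multibyte encoding is UTF-8 (an assumption; the thread is about files whose encoding is unknown). The function name and structure are illustrative, not from any library:

```python
def split_ascii(data: bytes) -> bytes:
    """Return only the ASCII bytes, skipping UTF-8 multibyte sequences.

    Illustrates the state you must carry by hand when working on raw
    bytes: how many continuation bytes the last lead byte still "owes".
    """
    expected = 0              # continuation bytes still expected
    ascii_run = bytearray()
    for b in data:
        if expected:
            if 0x80 <= b <= 0xBF:     # a continuation byte, as expected
                expected -= 1
                continue
            expected = 0              # malformed sequence; resynchronize
        if b < 0x80:                  # plain ASCII
            ascii_run.append(b)
        elif 0xC0 <= b <= 0xDF:       # lead byte of a 2-byte character
            expected = 1
        elif 0xE0 <= b <= 0xEF:       # lead byte of a 3-byte character
            expected = 2
        elif 0xF0 <= b <= 0xF7:       # lead byte of a 4-byte character
            expected = 3
        # else: stray continuation byte (e.g. a file that starts
        # mid-character); silently skipped here
    return bytes(ascii_run)
```

Note how the stray-continuation-byte case (the "file starts in the middle of a character" edge case) has to be handled explicitly; getting all of these branches right is exactly the work the latin1 approach avoids.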

If done right, it would be a (probably much slower) re-implementation of what happens when you use the latin1 trick mentioned above. You have to get it right, though; sneaky edge cases abound (what if the file starts in the middle of an incomplete multibyte character?).
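For context, the latin1 trick works because latin1 maps every byte value 0x00-0xFF to the Unicode code point of the same value, so decoding can never fail and the round trip is lossless. A minimal demonstration with arbitrary made-up bytes:

```python
# Arbitrary bytes of unknown encoding (values chosen for illustration).
data = b'caf\xc3\xa9 \xff\xfe'

# latin1 decoding never raises: each byte becomes one code point.
text = data.decode('latin1')

# The round trip is lossless, so nothing is destroyed by decoding.
assert text.encode('latin1') == data

# Bytes in the ASCII range come through as the expected characters,
# so ordinary string operations on the ASCII portion behave correctly.
assert text.startswith('caf')
```

The multibyte edge cases still aren't *interpreted*, of course; the non-ASCII bytes just ride along unchanged until you re-encode.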

TL;DR this could technically work but is a poor idea.

This is talking about the case where you don't know the encoding, so you don't know which byte sequences are multibyte characters. Whether you use latin1 or bytes, the edge cases are exactly the same, and they don't get handled.

You wouldn't be able to use any APIs that only accept string (unicode) objects.
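One concrete instance of such a str-only API in the standard library is `unicodedata.normalize`, which rejects bytes outright; a latin1-decoded string passes, while the raw byte string does not:

```python
import unicodedata

# A latin1-decoded str can flow through str-only stdlib APIs.
s = b'caf\xe9'.decode('latin1')
normalized = unicodedata.normalize('NFC', s)   # works

# The same data as bytes is rejected with a TypeError.
try:
    unicodedata.normalize('NFC', b'caf\xe9')
except TypeError:
    print('bytes rejected, as expected')
```

This is the practical cost of staying in bytes: any processing step built on the str type forces a decode eventually anyway.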
