
It's a disappointment that we never used different conceptual "types" for text files with different encodings. You can sort of get there with MIME types and encoding data attached, but imagine if you had a filename convention like '.txt.ASCII', '.txt.UTF8LE', or '.txt.ISO8859-1', so software would know exactly what to expect rather than sniffing byte-order marks, guessing the encoding with heuristics, and fixing mojibake after the fact.
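
To make the idea concrete, here's roughly what a reader could do with such a convention. A sketch in Python; the suffix table and the read_text_by_suffix helper are made up for illustration, this isn't an existing standard:

    from pathlib import Path

    # Hypothetical mapping from a trailing filename suffix to a codec name.
    SUFFIX_TO_CODEC = {
        ".ASCII": "ascii",
        ".UTF8": "utf-8",
        ".ISO8859-1": "latin-1",
    }

    def read_text_by_suffix(path):
        """Decode a file using the encoding named in its final suffix,
        falling back to UTF-8 when no known suffix is present."""
        p = Path(path)
        codec = SUFFIX_TO_CODEC.get(p.suffix, "utf-8")  # ".ASCII" for "notes.txt.ASCII"
        return p.read_text(encoding=codec)

    # read_text_by_suffix("notes.txt.ISO8859-1") would decode as Latin-1,
    # with no BOM sniffing or guessing involved.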



I think if someone had tried to make encoding-in-filename a common convention it would have just resulted in people writing “.txt.ASCII” without understanding what that meant, and we’d have the same charset sniffing we have now plus another mess on top. (For example: when you have a “config.ini.UTF-16” and a “config.ini.WINDOWS-1252”, which should your program prefer?)

---

An anecdote: I was writing some code to send data to another company as an XML file. When I sent them a test file to validate, I got a complaint that it didn’t declare its encoding as UTF-8.

“Yes, our database uses UCS-2 for exports,” I replied. “Why, can you not read it?”

“Our system imported it just fine,” they explained, “but it needs to say ‘<?xml version="1.0" encoding="UTF-8"?>’ at the top, and when I looked at it with a text editor yours says something else there instead.”

“Yes, it doesn’t say UTF-8 because it’s not UTF-8. Sounds like your system is OK with that, though?”

“Yes, the rest of the file is fine, but we can’t move forward with go-live until you conform to the example in our implementation guide...”

Actually changing the encoding proved difficult (for reasons I no longer recall), so I ended up shoving in a string replacement to make the file claim the wrong encoding, along with an apologetic comment; their end apparently did sniffing (and the BOM was unambiguous), so this worked out fine. But it goes to show just how confused people can get about what character encoding a file actually is.
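
For the curious, the receiving side's behaviour is easy to approximate: trust the byte-order mark if there is one, and only fall back to the declared encoding otherwise. A hypothetical sketch in Python (not the actual integration code):

    import codecs

    def detect_xml_encoding(raw):
        """Prefer the BOM over the declared encoding; if the two disagree
        (as with the file above), the BOM wins."""
        if raw.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"
        if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
            return "utf-16"  # Python's utf-16 codec uses the BOM to pick endianness
        # No BOM: fall back to whatever <?xml ... encoding="..."?> claims.
        header = raw[:100].decode("ascii", errors="ignore")
        if 'encoding="' in header:
            return header.split('encoding="', 1)[1].split('"', 1)[0]
        return "utf-8"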


I don't think having extensions as part of the file name was a good idea in the first place; I see it more as a practical compromise.

HTTP does not care about extensions in URLs at all, and you can serve any content type (plus information about its encoding) independently of the contents. Early file systems did not have capabilities this advanced, and through decades of backward compatibility we're kind of stuck with .jpg (or is it .jpeg? or .jpe?), .html (or .htm?), or .doc (is this MS Word? or RTF?).
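
As a quick illustration of that separation: in HTTP the charset rides in the Content-Type header rather than in the path, so the URL can end in anything (or nothing) and clients still know how to decode the body. A minimal sketch with Python's built-in http.server:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # The encoding is declared in the header, independent of the path.
            body = "<p>héllo</p>".encode("iso-8859-1")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=ISO-8859-1")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # HTTPServer(("localhost", 8000), Handler).serve_forever()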


> Early file systems did not have capabilities this advanced, and through decades of backward compatibility we're kind of stuck with

Some did. Some mainframes did, and Apple pre-OS X (HFS, I think, was the file system) did too.

The problem isn’t just backwards compatibility with DOS. It’s that most file transfer protocols don’t forward that kind of metadata. They don’t even share whether a file is executable, a system file, or hidden (though some do have an attribute for hidden files). So the file type ended up needing to be encoded in the file name, if only to preserve it.


FTP (and friends) indeed are a factor as well, but what stuck in my mind was how Forth initially was supposed to be named "Fourth", but the OS didn't support file names this long. https://en.wikipedia.org/wiki/Forth_(programming_language)#H...

HFS (and HFS+, and APFS) indeed have "resource forks", and supporting (or just not tripping over) them on non-Apple filesystems comes with a bag of surprises, even on Apple platforms. Just last week we ran into a problem with a script that globbed for "*.mp4" files to feed them into ffmpeg, but choked on the "._*.mp4" files (which is how macOS stores resource forks on FAT32/exFAT). The script worked fine on every Mac I tested it on, and everything was smooth until the customer plugged in an external HD to process videos off of. https://en.wikipedia.org/wiki/Resource_fork
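
The fix amounted to skipping anything whose name starts with "._" before handing it to ffmpeg; something like this (a sketch, not the actual script):

    from pathlib import Path

    def real_mp4s(folder):
        """Yield actual videos, skipping the "._*" AppleDouble companion
        files that macOS leaves behind on FAT32/exFAT volumes."""
        for p in sorted(Path(folder).glob("*.mp4")):
            if not p.name.startswith("._"):
                yield p

    # Example: for clip in real_mp4s("/Volumes/EXTERNAL"): feed clip to ffmpeg.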

So, unfortunately, the most reliable way of knowing exactly what you're working with is lots of heuristics and very defensive coding.


Resource forks are something different; the parent is referring to type and creator codes, which are part of the metadata for a file. (Type codes were also used to identify types inside the resource fork, which is maybe what you’re thinking of?)

The type code didn’t carry encoding information, though I believe the later https://en.wikipedia.org/wiki/Uniform_Type_Identifier can.


I didn’t mean FTP specifically, just file transfer protocols in general. But I do agree that FTP, specifically, is a garbage protocol.


> UTF8LE

There is a big-endian UTF8?!


All UTF-8 is big-endian, at least in some sense of the term. Is little-endian UTF-8 a thing that gets used?
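
One way to see what I mean: for a given code point, UTF-8 produces a single fixed byte sequence, while UTF-16 genuinely comes in LE and BE flavours. E.g. in Python:

    text = "€"  # U+20AC EURO SIGN

    print(text.encode("utf-8").hex())      # 'e282ac' -- one fixed byte sequence
    print(text.encode("utf-16-le").hex())  # 'ac20'
    print(text.encode("utf-16-be").hex())  # '20ac'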



