> don’t know the bit structure of ASCII and the meaning of the odder control characters in it.
If it were up to me, I'd remove the more useless old codes that take up precious 1-byte UTF-8 codes and replace them with common characters. Like "Record Separator": if it's that useful to have a record separator character, why aren't we using this one instead of e.g. commas for comma separated values?
I find that the degree symbol (°) is a glaring omission from ASCII.
> Like "Record Separator": if it's that useful to have a record separator character, why aren't we using this one instead of e.g. commas for comma separated values?
I've done ETLs to/from systems that do use these control characters. It's a joy compared to CSV. I have nothing to escape and no complex parsing logic. Embedded CR/LF-- no problem. Fields containing commas-- no problem.
We should be using these control codes for their purpose but nobody knows about them anymore.
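To make the "nothing to escape" point concrete, here is a minimal sketch in Python. It assumes the usual ASCII convention of Unit Separator (0x1F) between fields and Record Separator (0x1E) between records, and it assumes the data itself never contains those two bytes. The function names `dump_records` and `load_records` are made up for illustration and are not from any particular ETL system.

```python
US = "\x1f"  # ASCII Unit Separator: placed between fields (assumed convention)
RS = "\x1e"  # ASCII Record Separator: placed between records (assumed convention)


def dump_records(rows):
    """Serialize rows (lists of strings) with no quoting or escaping."""
    return RS.join(US.join(fields) for fields in rows)


def load_records(blob):
    """Parse the format written by dump_records with two plain splits."""
    if not blob:
        return []
    return [record.split(US) for record in blob.split(RS)]


# Fields with commas, quotes, pipes and embedded CR/LF survive a round trip
# unchanged, because none of them is a delimiter in this format.
rows = [
    ["1", "Smith, John", "line one\r\nline two"],
    ["2", 'He said "hi"', "a | b"],
]
assert load_records(dump_records(rows)) == rows
```

The whole parser is two `split` calls. The trade-off is that the guarantee rests entirely on the data never containing 0x1E or 0x1F.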
I love using FS and RS in my shell scripts, esp. when I'm processing text data exported from a database. As long as the data doesn't include binary data (such as images), I can be pretty certain that the data doesn't include FS and RS characters since they don't appear on a keyboard -- therefore I can preserve things like line breaks in text fields, and don't have to worry about whether someone inserted a " | " character in the contents of the data.
Of course, I do a pre-pass to strip out FS / RS in case they got in accidentally, and it helps to know the purpose of the data to be sure they shouldn't be in the text. But so far that has made my scripts a lot more reliable. The alternative is to do the lightweight processing in a heavier scripting language that can deal with structured data natively, but setting FS and RS is often a bit more expedient for me.
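The pre-pass could look something like the sketch below. It is written in Python rather than shell, and it assumes it is safe to simply drop all four ASCII separator codes (FS, GS, RS, US, i.e. 0x1C through 0x1F) from text fields; whether that holds depends on what the data is for. `strip_separators` and `sanitize_rows` are hypothetical helper names.

```python
import re

# ASCII File, Group, Record and Unit Separator (0x1C-0x1F).
# Assumption: none of these is meaningful inside a text field.
SEPARATORS = re.compile(r"[\x1c-\x1f]")


def strip_separators(value):
    """Remove any stray separator control codes from one field."""
    return SEPARATORS.sub("", value)


def sanitize_rows(rows):
    """Apply the pre-pass to every field of every row."""
    return [[strip_separators(field) for field in fields] for fields in rows]
```

Line breaks, commas and pipes are left untouched; only the delimiter bytes themselves are removed, so the later split on them stays unambiguous.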
It pains me greatly that Hive still can't ingest FS/RS-separated (or \001/\002-separated) data nor does it correctly handle CSV because someone hardcoded \n as the record separator so deep they can't make it configurable.