> don’t know the bit structure of ASCII and the meaning of the odder control cha...

EvanAnderson · on Sept 29, 2023

> Like "Record Separator": if it's that useful to have a record separator character, why aren't we using this one instead of e.g. commas for comma separated values?

I've done ETLs to/from systems that do use these control characters. It's a joy compared to CSV. I have nothing to escape and no complex parsing logic. Embedded CR/LF-- no problem. Fields containing commas-- no problem.
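As a rough illustration of why nothing needs escaping, here is a minimal sketch that round-trips rows through the ASCII Unit Separator (0x1F) and Record Separator (0x1E). The `dump` and `load` helpers are hypothetical names made up for this example, not part of any particular ETL system, and the sketch assumes the data itself never contains those two bytes.

```python
US = "\x1f"  # ASCII Unit Separator, used here between fields
RS = "\x1e"  # ASCII Record Separator, used here to terminate records


def dump(rows):
    """Serialize rows (lists of strings); every record is terminated by RS."""
    return "".join(US.join(fields) + RS for fields in rows)


def load(text):
    """Parse text produced by dump() back into rows."""
    records = text.split(RS)
    records.pop()  # every record ends with RS, so the final piece is always empty
    return [record.split(US) for record in records]


# Commas, quotes and embedded CR/LF pass through untouched.
rows = [
    ["Smith, John", "line one\r\nline two"],
    ["Doe", 'said "hi"'],
]
assert load(dump(rows)) == rows
```

There is no quoting or escape handling anywhere in the sketch; the only requirement is that the separator bytes stay out of the field values.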

We should be using these control codes for their purpose but nobody knows about them anymore.

derekp7 · on Sept 29, 2023

I love using FS and RS in my shell scripts, esp. when I'm processing text data exported from a database. As long as the data doesn't include binary data (such as images), I can be pretty certain that it doesn't include FS and RS characters, since they don't appear on a keyboard. That means I can preserve things like line breaks in text fields, and don't have to worry about whether someone inserted a " | " character in the contents of the data.

Of course, a pre-pass should strip out FS / RS in case they got in accidentally, and you need to know the purpose of the data well enough to be sure they shouldn't be in the text. But so far that has made my scripts a lot more reliable. The alternative is to do the lightweight processing in a heavier scripting language that can deal with structured data natively, but setting FS and RS is often a bit more expedient for me.
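The pre-pass described above might look something like the following sketch. It is written in Python rather than shell, and it assumes you want to drop all four ASCII separator bytes (FS 0x1C, GS 0x1D, RS 0x1E, US 0x1F), which goes slightly beyond the FS / RS mentioned in the comment. `sanitize` is a hypothetical helper name.

```python
# Translation table that deletes the four ASCII separator control characters:
# FS (0x1C), GS (0x1D), RS (0x1E) and US (0x1F).
_STRIP_SEPARATORS = dict.fromkeys(range(0x1C, 0x20))


def sanitize(field):
    """Remove stray separator bytes from one field value before export."""
    return field.translate(_STRIP_SEPARATORS)


# Line breaks and pipes are ordinary data and are left alone.
cleaned = sanitize("first line\nsecond\x1e line | with pipe\x1f")
assert cleaned == "first line\nsecond line | with pipe"
```

Whether silently deleting the bytes is acceptable depends on the data; rejecting the record instead would be an equally reasonable choice.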

orthoxerox · on Sept 29, 2023

It pains me greatly that Hive still can't ingest FS/RS-separated (or \001/\002-separated) data nor does it correctly handle CSV because someone hardcoded \n as the record separator so deep they can't make it configurable.

gorgoiler · on Sept 29, 2023

Handy tip if you’re parsing the output of a command in your programming language of choice:

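  output = run(
    "list-cats",
    "--format",
    "%(name)s\x1f%(age)s\x1e",
  )
  # Every record ends with \x1e, so drop the empty trailing piece.
  records = [r for r in output.split("\x1e") if r]
  # Each record splits into a (name, age) pair.
  cats = dict(
    r.split("\x1f")
    for r in records
  )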

In reality it’s a bit fastidious to use RS and US. I tend to just use "\1" and "\2".
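For a version of this tip that runs against a real command, here is a sketch using `git log`, whose `--format` placeholders include `%x1f` and `%x1e` for raw bytes. The choice of git is an assumption for illustration (the `list-cats` command above is a stand-in), and `parse_records` and `commit_messages` are hypothetical helper names.

```python
import subprocess

US, RS = "\x1f", "\x1e"


def parse_records(output):
    """Split RS-terminated records into lists of US-separated fields."""
    records = []
    for record in output.split(RS):
        record = record.lstrip("\n")  # git prints a newline after each entry
        if record:
            records.append(record.split(US))
    return records


def commit_messages(repo="."):
    """Map commit hash to full commit message, embedded newlines included."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--format=%H%x1f%B%x1e"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(parse_records(out))
```

Calling `commit_messages()` inside a repository returns a dict keyed by commit hash. Multi-line messages need no special handling because the newline is no longer a delimiter.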