
Ask HN: Why aren't ASCII codes 28-31 used more often to serialize tabular data? - derekp7
Typically, data is exported from apps in tab, csv, or pipe delimited format, or in a more structured format such as JSON or XML. The problem is that the more structured formats aren't as useful for processing data through Unix command line utilities that process a line at a time, and any delimited format needs to handle cases where that delimiter is present in the exported data.

It would seem that as long as the data exported is regular text that a user would enter into an app, it would only contain ASCII characters that have matching keyboard keys. That would leave ASCII 31 (Unit Separator) perfect for delimiting fields, and 30 (Record Separator) for delimiting records instead of a newline. That way you wouldn't have to worry about escaping anything in the data (of course, binary data would still need special handling, but in this case I'm referring to predominately text data). So why isn't this practice more common?
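
To make the idea concrete, here is a minimal sketch in Python of what such a format could look like. The function names `dump_records` and `load_records` are made up for illustration, and the sketch assumes every field is a string that never contains ASCII 30 or 31 (it refuses to write one that does).

```python
US = "\x1f"  # ASCII 31, Unit Separator: goes between fields
RS = "\x1e"  # ASCII 30, Record Separator: goes after each record


def dump_records(rows):
    """Join rows of string fields into one US/RS-delimited string."""
    for row in rows:
        for field in row:
            # The scheme only works if the data never contains the separators.
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return "".join(US.join(row) + RS for row in rows)


def load_records(text):
    """Split a US/RS-delimited string back into rows of fields."""
    records = text.split(RS)
    # Every record ends with RS, so the piece after the last RS is empty.
    if records and records[-1] == "":
        records.pop()
    return [record.split(US) for record in records]


rows = [["id", "note"], ["1", "tabs\tcommas, and\nnewlines are fine"]]
assert load_records(dump_records(rows)) == rows
```

Tabs, commas, and newlines inside a field pass through untouched, which is the whole appeal. The cost is the check in `dump_records`: the format has no escape mechanism, so data that does contain a separator has to be rejected or handled some other way.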
======
niftich
People are lazy.

The ASCII C0 separators -- FS, GS, RS, US -- are non-printable (by design),
but this impacts their usability by laypeople, because they don't show up as
obvious symbols in a plaintext editor, and most critically, they visually seem
to be absent from keyboards, so additional domain knowledge is required for
people to figure out how to produce them. So instead, people developed all
sorts of formats that are subject to delimiter collision.

Also, ASCII is recognized as a widely-deployed standard now, but this wasn't
always the case. Computers used dozens of different codepages to represent
characters with bytes. The 26 English letters, 0-9, and some punctuation were
always present, but control characters seldom had equivalents in a different
codepage. Interchange was therefore a problem: most machines' native codepages
had no such delimiters.

ASCII actually abbreviates 'American Standard Code for Information
Interchange', but it largely came to be used for printable characters only --
"plain text", and not as a format for structured data.

By that point, the ship on C0 delimiters had largely sailed. To compound the
chicken-and-egg problem, some codepages developed after ASCII discarded the
notion of control characters entirely and redefined their byte values as
additional printable characters. Windows-1252 was a notable offender [1],
though what it reassigns is the C1 range (0x80-0x9F) rather than the C0
separators themselves.

[1]
[https://en.wikipedia.org/wiki/Windows-1252](https://en.wikipedia.org/wiki/Windows-1252)

~~~
derekp7
As far as them showing up, at least in Vim they show as a caret followed by
the character at 64+code (so ^^ and ^_ for RS and US), and can be entered with
Ctrl-V Ctrl-^ and Ctrl-V Ctrl-_. Kind of awkward to enter though. Of course,
like you said, this is domain-specific knowledge.
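
The 64+code rule is easy to check with a few lines of Python. This is just an illustration of the arithmetic, with a made-up helper name `caret_notation`; it is not how Vim itself implements the display.

```python
def caret_notation(code):
    """Return the caret form (^X) for an ASCII control code in 0-31."""
    if not 0 <= code <= 31:
        raise ValueError("not a C0 control code")
    # The displayed character sits 64 positions above the control code.
    return "^" + chr(code + 64)


# The four C0 separators and their caret forms.
for name, code in [("FS", 28), ("GS", 29), ("RS", 30), ("US", 31)]:
    print(name, code, caret_notation(code))

assert caret_notation(30) == "^^"  # RS
assert caret_notation(31) == "^_"  # US
```

The same offset explains the key chords: the character after the caret is the key pressed together with Ctrl.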

I've converted much of my workflow to use these (I have a script that
unescapes data from mysql, and replaces tabs/newlines with US and RS). Most
filter commands can have an arbitrary field delimiter specified, but not all
of them can have something other than newline or null for a line terminator.
For these cases I've had to maintain local copies of the commands (sed, sort,
join).
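
For anyone curious, a conversion script along those lines might look roughly like the Python below. This is a sketch, not the actual script: the name `tsv_to_separated` is hypothetical, and it assumes the input is tab/newline-delimited text in which embedded tabs, newlines, NULs, and backslashes were written as `\t`, `\n`, `\0`, and `\\`.

```python
import re

US = "\x1f"  # ASCII 31, Unit Separator
RS = "\x1e"  # ASCII 30, Record Separator

# Assumed escape set: backslash followed by t, n, 0, or another backslash.
_ESCAPES = {"t": "\t", "n": "\n", "0": "\0", "\\": "\\"}


def unescape(field):
    """Undo backslash escapes in one field; unknown escapes are left as-is."""
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(0)), field)


def tsv_to_separated(text):
    """Convert escaped tab/newline-delimited text to US/RS-delimited text."""
    # Split on "\n" only: str.splitlines() would also break on RS, FS, and GS.
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline
    out = []
    for line in lines:
        fields = [unescape(f) for f in line.split("\t")]
        out.append(US.join(fields) + RS)
    return "".join(out)


sample = "1\tline one\\nline two\n2\ta\\tb\n"
assert tsv_to_separated(sample) == "1\x1fline one\nline two\x1e2\x1fa\tb\x1e"
```

Once the data is in this form, the embedded tabs and newlines are literal again and nothing downstream needs to know about the escaping, but every tool in the pipeline has to accept RS as its record terminator.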

