
ASCII Delimited Text – Not CSV or TAB delimited text - fishy929
https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/
======
Pxtl
I've done this.

Everybody hated it. Most text editors don't display anything useful with these
characters (either hiding them altogether or showing a useless "unknown"
placeholder), and spreadsheet tools don't support the record separator
(although they all let you provide a custom entry separator, so the "unit"
separator can work). And that's besides the obvious problem that there's no
easy way to type the darned things when somebody hand-edits the file.

~~~
3JPLW
It's a shame. The solution is in the charset, but the tools never developed
support for it, so we don't use it. But I'd wager that if tools had
historically supported them, the situation would be no different than it is
with tab.

There are representational glyphs for tab, return, and others (⇥, ↵), and
editors can show them in 'show whitespace' modes. There could be
representational glyphs for these control characters, too. I'm not sure about
the history of these symbols, but I imagine they were initially on keyboards.
But if these control characters were on keyboards and had a representation in
text then they'd be just as useless as the tab character is today.

Precisely what makes them valuable is their difficulty to type or display.

~~~
Pxtl
> Precisely what makes them valuable is their difficulty to type.

Again, if they'd caught on you'd imagine there would be some eventual
convention in text-editors for what keybind would be used to enter them. Too
bad the AltGr key (intended for entering rarely-used glyphs) doesn't appear on
pure-English keyboards.

~~~
markeganfuller
I'm looking at my English keyboard and it has Alt Gr...

~~~
hnal943
Living in the US, I'd never heard of an AltGr key until this discussion.

~~~
spoiler
I'm from Croatia, and we have the AltGr key.

However, I discovered that Alt + Control = AltGr when I needed to use it at
work[1], so it's simply a shortcut, I think.

[1]: We use different keyboards at work (I recently switched to a UK one,
because it suits me better), because the Croatian layout (all Slavic layouts,
to be honest) is _horrendously_ counterproductive for programming. Google the
layout and you'll realise why. An example: you need to press AltGr+B for `{`
(if I remember correctly).

~~~
chinpokomon
Perhaps, but I think it still goes against the original intent. Ctrl-~ or
Ctrl-^ should give you a record separator (RS) and Ctrl-Del or Ctrl-_ should
give you a unit separator (US). For the same reason, Ctrl-m or Ctrl-M should
give you carriage return (CR). This is because ASCII values 00-1F are
_control_ characters, formed by effectively grounding the two most
significant bits, 7 and 6. Shift similarly would toggle or ground bit 6,
depending on the implementation.
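
In Python terms, that masking looks like this (a quick illustration of the
idea, not how any real terminal implements it):

        def ctrl(key):
            # Ctrl "grounds" the two high bits: AND the code with 0x1F
            return ord(key) & 0x1F

        assert ctrl('^') == ctrl('~') == 0x1E  # RS, record separator
        assert ctrl('_') == 0x1F               # US, unit separator
        assert ctrl('m') == ctrl('M') == 0x0D  # CR, carriage return
        assert ctrl('d') == ctrl('D') == 0x04  # EOT, still closes a shell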

What happened was that the Ctrl key became synonymous with "command" after
Teletype, so it became more about doing something. Think about Ctrl-x, Ctrl-c,
and Ctrl-v as an example, but you still see some relics like Ctrl-d as End of
Transmission (EOT) to close a shell or terminal. Alt is like a shift, but it
is actually closer to the Fn key on most laptop keyboards. It was an
alternative function of that particular key, so where the shift key provided
you with an alternate case, Alt was more akin to an entirely different key...
it isn't Alt plus an 'a' key, it is Alt-a.

AltGr was like another Alt key. It was originally there to allow you to enter
an alternate glyph, especially line drawing characters available in extended
ASCII, B0-DF. I thought it was a mapping closer to flipping the most
significant bit to 1, but it doesn't exactly overlay the lower ASCII range, so
that might be another change that evolved on the way to the modern keyboard.

To your original point, Microsoft Windows will now usually treat the chord
Ctrl-Alt as AltGr. I don't know if that is with all layouts, or just those
keyboards that lack AltGr. I find that most Linux distributions tend to follow
Microsoft's lead and provide similar mappings but now they even repurposed the
Win key as Meta or sometimes called Super. So it is likely that Ctrl-Alt is
commonly the equivalent of AltGr.

For the purpose of this discussion, I think it'd be better if Ctrl could be
used to type these text separators, but given the way modern operating
systems map their modern keyboards, it might be difficult to ever reach
consensus on how this should be done.

~~~
gpvos
This hasn't been true since IBM keyboards became popular. For example, on
older keyboards shift+number would simply toggle a bit, so shift+2 would be a
double quote, etc., but this hasn't been common for decades now.
Unfortunately.

~~~
chinpokomon
Probably on the path to scan codes. By using scan codes, they could abstract
what a particular key meant and thereby remap the keys so that they didn't
have to match the ASCII table layout. I still don't understand why we evolved
scan codes the way we did. This requires the OS to be in sync to be able to
map them back.

------
dxbydt
Don't do this. TSV has won this race, closely followed by CSV. Anything else
will cause untold grief for you and your fellow data scientists and
programmers. I say this as someone who routinely parses 20 GB text files,
mostly TSVs and occasionally CSVs, for a living. The solution you are
proposing is definitely superior, but it isn't going to get adopted soon.

~~~
radicaledward
I was surprised to see you list TSV as more common than CSV. I encounter CSVs
on a pretty regular basis, but I don't think I've had to parse a TSV in the
past 3 or 4 years. As a junior web developer, I don't have much experience,
though. 9 times out of 10, the CSV is coming from or going to Excel, or a
system that was designed to support Excel. If you don't mind my asking, what
types of data do you regularly work with that are in TSV format?

~~~
l-p
Your comment disturbs me a little… One of my gripes with Excel was that it
imported and produced TSV data by default when you asked for CSV.

~~~
shrikant
Excel actually doesn't 'care'. It uses the list separator defined in your
Windows "Regional Settings", and the default there differs by system locale.

------
mikestew
Anyone that's ever had to parse arbitrary data knows of the approximately 14
jiggityzillion corner cases involved when sucking in or outputting CSV/TAB
delimited formats. Yet much like virtual memory and virtual machines, we find
that a solution has existed since the 60s. For those wondering about the
history and use of all those strange characters in your ASCII table:
[http://www.lammertbies.nl/comm/info/ascii-
characters.html](http://www.lammertbies.nl/comm/info/ascii-characters.html)

~~~
joosters
Interesting web page! Despite many years of using ASCII and knowing some of
the more common control codes, I had never even thought about what the other
mysterious 0-31 codes were defined as.

Something that the page doesn't mention is that CR+LF were originally two
separate control codes because the action of returning the print head to the
left hand side would take too long with a standard line printer. Therefore,
separating the actions into two codes meant that the printer would not miss
out any printable characters.

(At least, I read that somewhere on the internet and assumed it was true!)

~~~
iclelland
It's more likely that it's because they are two separate physical actions
(returning the head to the left, and advancing the paper one line). They could
be used independently: you could print a line in _bold_, for instance, by
issuing a CR without an LF and then printing the same line again.

A carriage-return operation takes _much_ longer than a single character, or
even two or three. It doesn't make sense to issue two characters just to take
up time. The printers always had to have some internal buffer memory (and
handshaking over the communication lines to say when the buffer is full) in
order not to lose any characters.

~~~
drivers99
"You could print a line in bold, for instance, by issuing a CR without an LF
and then printing the same line again."

Last I checked, this still works even on laser printers (at least on a
LaserJet), when sending data to it as plain text. It's not actually printing
over itself, but it knows to make the repeated characters bold.

~~~
mzs
less (among other unix tools) does this too (but you have to do one character,
bs and the character again). There are more, like _, bs, character underlines
(like cat there is ul that handles this specifically). If your terminal
supports os (overstrike) in it's terminal description it handles that
natively.

------
mjn
Alas, I don't think this works with the standard Unix tools, which is the main
way I process tab-delimited text. Changing the field delimiter to whatever you
want is fine, since nearly everything takes that as a parameter. But newline
as record separator is assumed by nearly everything (both in the standard set
of tools, and in the very useful Google additions found in
[http://code.google.com/p/crush-tools/](http://code.google.com/p/crush-
tools/)). Google's defaults are ASCII (or UTF-8) 0xfe for the field separator,
and '\n' for the record separator. I guess that's a bit safer than tabs, but
the kind of data I put in TSV really shouldn't have embedded tabs in a
field... and I check to make sure it doesn't, because they're likely to cause
unexpected problems down the line. Generally I want all my fields to be either
numeric data, or UTF-8 strings without formatting characters.

Not to mention that one of the advantages of using a text record format at all
is that you can _view_ it using standard text viewers.

~~~
sheetjs
Awk lets you set both:

    
    
        $ echo -n "1,2,3|4,5|6|7,8,9,0" | awk 'BEGIN{FS=","; RS="|"} {print NF, $0}'
        3 1,2,3
        2 4,5
        1 6
        4 7,8,9,0
    

In fact, you can also specify the output delimiters as well:

    
    
        $ echo -n "1,2,3|4,5|6|7,8,9,0" | awk 'BEGIN{FS=","; RS="|";OFS="foo";ORS="bar"} {print NF, $0}'
        3foo1,2,3bar2foo4,5bar1foo6bar4foo7,8,9,0bar

~~~
groovy2shoes
Yup. It wasn't fun to type (using Ctrl-V in bash to input the raw control
characters), but it works fine with the ASCII separators as well:

    
    
        $ echo -n 'a^_1^^b^_2^^c^_3^^' |awk 'BEGIN{FS="^_"; RS="^^"} {print $1": "$2}'
        a: 1
        b: 2
        c: 3

------
baddox
My guess is that TSV/CSV won out simply because anyone can easily type those
characters from any standard keyboard on any platform.

~~~
csixty4
TSV/CSV characters were also pretty much guaranteed to exist no matter what
kind of terminal was used, and not cause any side-effects. No doubt some
teletypes & dumb terminals used those FS, GS, RS etc. characters for special
features since they weren't likely to appear in printed data. And I know those
characters are used for other things in PETSCII and ASCII. 0x1C, the File
Separator in ASCII, is used to turn text red in PETSCII.

~~~
csixty4
Sorry, that should say "PETSCII and ATASCII"

------
sigil
Meh. What if some data has ASCII 28-31 in it? If you're not using a "real"
escaping mechanism, and instead relying on the assumption that certain
characters don't appear in your data, then I don't see anything wrong with
using \t and \n (ie TSV). Either way, you know your data, and you're using
whatever fits it best.

If you need something that's never, ever going to break for lack of escaping,
might I suggest doing percent-encoding (aka URL encoding) on tabs ("%09"),
newlines ("%0a"), and percent characters ("%25")? Percent encoding and
decoding can be made very fast, is recognizable to most developers, and can
be used to escape and unescape anything, including Unicode characters --
unlike C-escaping, which doesn't generalize or accommodate these things
nearly as well.
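
A minimal sketch of that scheme (helper names are mine; only the three bytes
named above are escaped):

        def encode_field(s):
            # Escape the escape character first, then the delimiters.
            return (s.replace("%", "%25")
                     .replace("\t", "%09")
                     .replace("\n", "%0a"))

        def decode_field(s):
            # Reverse order: decode the delimiters first, "%25" last.
            return (s.replace("%0a", "\n")
                     .replace("%09", "\t")
                     .replace("%25", "%"))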

~~~
scintill76
I think the answer is that those shouldn't occur within your data. If you're
dealing with binary data, why are you using a text-based file format? If your
data is textual, it shouldn't have control character delimiters within it, as
they are reserved for that context.

So, strip them out of your data if you have to. If you think they need to be
preserved or escaped, IMO you're doing something wrong.

~~~
sigil
It's nice to be able to use the unix toolset (grep, cut, sort, join, etc) on
all kinds of data, not just strictly "textual" data.

They're often my tool of last and only resort when dealing with very large
datasets. Sure, you could wait for that dump of all of wikipedia to import
into a nice indexed and queryable database, but why not start grepping it
immediately? Maybe you want to sort by a key that's textual, but there's
satellite data that's non-textual. sort(1) is a pretty amazing program in
terms of resource usage; it parallelizes, it makes efficient use of available
memory and disk when merge-sorting.

Anyway, there are plenty of examples!

~~~
scintill76
If you've actually got, say, a JPG file embedded in the middle of a CSV or
something, the format just isn't designed for that, IMO. But if you use the
reserved control characters, you can at least output any text data without
escaping (at least if "textual" is defined as "a string of characters that
are not ASCII delimiter control chars", which ought to be a safe assumption
unless something is corrupted or deliberately trying to mess things up). I
think you can actually grep more easily and reliably, because you don't need
state to know whether a comma byte is a comma character or a delimiter. You
simply use a comma when you mean a literal comma, and the control char when
you want the delimiter. Anyway, as others have pointed out, there are other
obstacles to widespread adoption of these control chars.

I agree that cut, sort, etc. are good to be familiar with. Someone else[1]
linked a "csvquote" utility that pre-chews (and un-chews at the end of the
text-processing pipeline) CSV data to make it work better with standard UNIX
utilities. Looks neat, so I'll be keeping it in mind next time I'm processing
CSV with UNIX utils.

[1]:
[https://news.ycombinator.com/item?id=7475793](https://news.ycombinator.com/item?id=7475793)

------
rwmj
This is factually wrong about CSV, which can store any character including
commas and even \0 (zero byte), provided it's implemented correctly (a rather
large proviso admittedly, but you should _never_ try to parse CSV yourself).
Here is a CSV parser which does get all the corner cases right:

[https://forge.ocamlcore.org/scm/browser.php?group_id=113](https://forge.ocamlcore.org/scm/browser.php?group_id=113)

~~~
tbrownaw
_you should never try to parse CSV yourself_

Why? Writing a correct parser is not significantly harder than figuring out
how to interface to an existing parser library, and allows cool things like
heuristic parsing of malformed files.

OTOH it's _shocking_ how many people can't write a correct CSV _generator_,
even after being explicitly told what they're doing wrong (which is always
either "you need to put quotes around the data" or "you need to double any
quotes that are part of the data") and given examples.
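
Those two rules really are small; a minimal Python sketch (per RFC 4180):

        def csv_field(s):
            # Quote the field if it contains a delimiter, a quote, or a
            # line break, and double any quotes inside the data.
            if any(c in s for c in ',"\r\n'):
                return '"' + s.replace('"', '""') + '"'
            return s

        def csv_row(fields):
            return ",".join(csv_field(f) for f in fields) + "\r\n"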

~~~
rwmj
If you knew enough about CSV to be able to write a correct parser, then you'd
know enough not to write one lightly.

Here are some surprising valid CSV files:

[https://forge.ocamlcore.org/plugins/scmgit/cgi-
bin/gitweb.cg...](https://forge.ocamlcore.org/plugins/scmgit/cgi-
bin/gitweb.cgi?p=csv/csv.git;a=tree;f=tests;h=b8d1a991cf7eb5c2290dc8ec027c0c50261667dc;hb=HEAD)

The test program in the same directory shows the semantic content of each.

~~~
tbrownaw
testcsv6.csv at that link is malformed. DQUOT is used to (1) escape itself,
and (2) enclose strings. It is not a generalized escape character the way
backslash is in C-family languages.

Interpreting it as a generalized escape causes two problems. One, if you
generate files that way, they will be unreadable by parsers written according
to the RFC. Two, if you read files that way, you will silently garble files
generated by someone who forgot to escape the quotes that were part of their
data.

~~~
rwmj
I'm afraid you're wrong about this. Excel generates and parses `"0` as a zero
byte. The RFC doesn't discuss how CSV files work in the real world. This is
exactly what I was talking about in my comment above.

------
JoshTriplett
Leaving aside the pain of displaying and typing such characters...

> Then you have a text file format that is trivial to write out and read in,
> with no restrictions on the text in fields or the need to try and escape
> characters.

Phrases like that lead to lovely security bugs.

~~~
dfc
The thing that leads to "lovely security bugs" is the nonchalant mindset; it
has nothing to do with the simple text format. The same attitude paired with
ASN.1 data has caused just as many vulnerabilities.

~~~
JoshTriplett
It's not just the nonchalant mindset; it's the thought that because you pick
something you don't expect to form part of your input domain, you don't have
to escape. Either you have to actually _restrict_ your input domain, or you
need escaping.

~~~
userbinator
This, very much this!

Escaping should _always_ be a consideration. Not thinking about it, thinking
"it'll never happen", etc. is what leads to things like HTML and SQL injection
vulnerabilities.

If you're inputting or outputting data in any format, always keep in mind
things like "what are the delimiters? What if the data in the input/output
contains them?"

~~~
dfc
How is this any different than reading ASN.1 data and not worrying about the
size of integers?

------
rgarcia
How about everyone just start following the CSV spec?
[https://tools.ietf.org/html/rfc4180](https://tools.ietf.org/html/rfc4180)

Doesn't allow for tab-delimited or any-character-delimited text and handles
"Quotes, Commas, and Tab" characters in fields.

~~~
oneeyedpigeon
I love the way that, in the _frickin' formal spec_, the presence of a header
row is ambiguous, so every tool that ever deals with CSV has to ask a human
whether or not a header row is present. Great design decision, that.

~~~
ajanuary
It's an artifact of building a standard around what people are already doing.

------
yardshop
There actually are glyphs assigned to these characters, at least in the
original IBM PC ASCII character set:

Ascii table for IBM PC charset (CP437) - Ascii-Codes

[http://www.ascii-codes.com/](http://www.ascii-codes.com/)

They correspond to these Unicode characters

    
    
        28  FS  ∟  221f  right angle
        29  GS  ↔  2194  left right arrow
        30  RS  ▲  25b2  black up pointing triangle
        31  US  ▼  25bc  black down pointing triangle
    

They may not be particularly intuitive symbols for this purpose though.

see also: IBM Globalization - Graphic character identifiers:
[http://www-01.ibm.com/software/globalization/gcgid/gcgid.htm...](http://www-01.ibm.com/software/globalization/gcgid/gcgid.html)
(then search for a code point, eg U00025bc)

Unicode code converter [ishida >> utilities]:
[http://rishida.net/tools/conversion/](http://rishida.net/tools/conversion/)

[http://en.wikipedia.org/wiki/Code_page_437](http://en.wikipedia.org/wiki/Code_page_437)

------
trebor
Now I feel silly for having glossed over the control characters since I was a
kid. Those characters are decidedly useful on a machine level, though the
benefit of CSV/TSV is that it's human friendly.

~~~
chiph
Trivia: Carriage return and Line feed are separate characters because they
used to be separate operations for devices like Teletypes. Want double-spaced
text? CRLFLF. Working with a slow device? CRCRCRLF to give the carriage time
to return.

Baudot4Life, yo.

~~~
wglb
Or DEL characters more likely.

~~~
chiph
The operators could have sent LTRS (all holes punched) but they never did --
their finger was already on CR, so they would just hit it a couple of times.
Same net effect - delay until the carriage could return.

Which, BTW, was an indication that your machine needed service. The spring
should have been wound tight enough and the track clean & oiled well enough to
get the carriage back to the first column in time to not drop any characters.
A pneumatic piston ("dash pot") slowed the carriage down as it approached the
first column so it wouldn't crash into the stops and get damaged.

------
Roboprog
CSV is a solved problem - RFC 4180:
[http://tools.ietf.org/html/rfc4180#section-2](http://tools.ietf.org/html/rfc4180#section-2)

As used by Lotus 1-2-3 and undoubtedly others before there was an Excel.

Example record:

    
    
        42,"Hello, world","""Quotes,"" he said.","new
        line",x
    

Now go write a little state machine to parse it... (hint: track odd/even
quotes, for starters)
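
For the curious, a minimal sketch of such a state machine in Python
(RFC 4180-ish; CR stripping and error reporting are left out):

        def parse_csv(text):
            records, record, field = [], [], []
            in_quotes = False
            i, n = 0, len(text)
            while i < n:
                c = text[i]
                if in_quotes:
                    if c == '"':
                        if i + 1 < n and text[i + 1] == '"':
                            field.append('"')  # doubled quote -> literal
                            i += 1
                        else:
                            in_quotes = False  # closing quote
                    else:
                        field.append(c)  # anything goes inside quotes
                elif c == '"':
                    in_quotes = True
                elif c == ',':
                    record.append(''.join(field)); field = []
                elif c == '\n':
                    record.append(''.join(field)); field = []
                    records.append(record); record = []
                else:
                    field.append(c)
                i += 1
            if field or record:  # flush a final unterminated record
                record.append(''.join(field))
                records.append(record)
            return records

Fed the example record above, it returns one record of five fields,
including the embedded newline.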

~~~
rquirk
Coincidentally I did this just this week. I believe there's no need to track
quotes, just if it has an opening quote or not. That RFC really explains it
all very well. The only edge case I had was with empty records (foo,,bar) and
that was probably due to my implementation.

------
mikeash
It doesn't solve the problem, although it does make it _far_ less likely to
run into it.

For a trivial example, try building an ASCII table using this format, with
columns for numeric code, description, and actual character. You'll once again
run into the whole escaping problem when you try to write out the row for
character 31.
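
Concretely, a tiny Python illustration of the failure:

        US = "\x1f"

        # The row for code 31 must carry the unit separator itself as
        # data, so the three-column row splits into four fields.
        row = ["31", "unit separator (US)", "\x1f"]
        encoded = US.join(row)
        assert encoded.split(US) == ["31", "unit separator (US)", "", ""]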

~~~
mantrax3
For CSV forbidding commas in data is not practical.

For ASCII delimiters, forbidding ASCII delimiters in data _is_ practical.

Sure - you can't, say, nest ASCII tables into one another due to this
limitation.

But for simple structure, it doesn't hurt to have ASCII separators in the
toolbox.

The only big problem I see is that they're rendered as invisible characters,
which will make debugging harder. If we hadn't abandoned and forgotten the
special ASCII chars, this wouldn't be the case.

If your dev tools show special chars (like mine do), then it's perfectly fine
to use them.

~~~
Pxtl
> Sure - you can't, say, nest ASCII tables into one another due to this
> limitation.

In hindsight it's too bad we don't have similar characters that follow a more
sexpr-ish layout - say, ListStart, ListEnd, and Delimiter. Then you could tree
them endlessly. If you wanted to be _really_ fancy you could add an
"assignmentSeparator" character to officially bless key-value-pairs and
encompass a nice JSON-ish format, but Lisp pretty-well demonstrates that isn't
necessary.

But in hindsight it's just too bad we don't use these control characters _at
all_.

~~~
derefr
You know what's nicer than delimiting beginnings and ends of things? Length
prefixing. _Protocol message formats_ and _data encoding formats_ both already
know what they're going to say before they say it, and so know its octet
length.

The only reason to use delimiters, ever, is for user-modifiable data (e.g.
source code) where you might want to insert or delete characters and have the
containing block remain valid.
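
A minimal sketch of that kind of framing in Python (4-byte big-endian length,
then payload; the names are mine):

        import struct

        def write_record(stream, payload):
            # Length first, so the reader never scans for a delimiter.
            stream.write(struct.pack(">I", len(payload)))
            stream.write(payload)

        def read_record(stream):
            header = stream.read(4)
            if len(header) < 4:
                return None  # clean end of stream
            (length,) = struct.unpack(">I", header)
            return stream.read(length)

Any bytes payload works, delimiters included, with no escaping anywhere.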

---

And now, a fun tangent, to prove that how deeply-rooted this confusion is in
CS: user-modifiable data was originally the sole use-case for \0-terminated "C
strings" in C.

C has two separate types which get conflated nowadays: char arrays, and
\0-terminated strings. Most "strings"--as we'd expect to find them in other
languages--were, in C, actually char arrays: you knew their length, either
because they were string literals and you could sizeof them, or because you
had #defined both FOO and FOO_LEN, or because you had just allocated len bytes
on the heap for foo, so you could just pass len along with foo. Because you
knew their length, you didn't need to use the string.h functions to manipulate
them. It was idiomatic (and perfectly-safe) C, when dealing with char arrays,
to just iterate through them with a for loop.

The concept of \0-termination, and thus what we think of as "C strings", only
applied to _string buffers_: fixed-size, stack-allocated, uninitialized char
arrays. The string.h functions are all meant to be employed to manipulate
string buffers, and the \0 is intended to mark where the buffer stops being
useful data, and starts being uninitialized garbage.

The strings in string buffers had short lifetimes, and didn't usually outlive
the stack frame the buffer was declared in. Generally, you'd declare a string
buffer, populate it using some combination of string literals, strcat(3),
sprintf(3), and system calls, and then pass the string--still sitting inside
the buffer--_to_ a system call like fstat(2) to get what you're really after.
That would be the end of both the string buffer's, and the string's, lifetime.

If you ever _did_ want to preserve the contents of a string buffer into
something you could pass around, though, this would be idiomatic:

    
    
        int give_me_a_path_string(char **out)
        {
          char buf[MAX_PATH];
    
          /* ... */
    
          int len = strlen(buf);
          *out = memcpy(malloc(len), buf, len);
    
          return len;
        }
    

Note that, after this function returns, the pointer it has written to
_doesn't_ point to a "C string": instead, it's a plain pointer to a
heap-allocated
array of char, with exactly enough space to hold just those characters. If you
want to know how big it is, you look at the return value.

So:

• C has "C strings", but they were only intended as buffers.

• C also has "char arrays", which are really what you should think of as C's
equivalent to a "string" datatype. char arrays, not "C strings", are the
fundamental data structure for representing and persisting strings in C.

• char arrays are less like "C strings" than they are like Pascal strings:
they come in two parts, a block of memory N chars wide, and an int containing
N. You don't examine the block to determine the length; the length is
explicit.

• Pascal (and thus most modern languages with strings) put both the length and
the character-block on the heap as a unit. C puts the character-block on the
heap, but puts the length _on the stack._ This is more efficient under C's
Unix-rooted assumptions: you need the length on the stack if you want to work
with it to immediately shove the string through a pipe.

~~~
Pxtl
The problem: I have never encountered length-prefixed data. Ever. Every data
interchange file I've ever dealt with has been either delimited or fixed-width
fields (and the widths are not defined anywhere in the file).

~~~
derefr
Examples of length-prefixed data abound in protocols and formats defined by
systems and telecom engineers (e.g. the IETF). IP packets are length-prefixed.
ELF-binary tables and sections are length-prefixed. PNG chunks are length-
prefixed.

It's just these worse-is-better text-based protocols like HTTP, created by
application developers, that toss all the advantages of length-prefixing away.
(And, even then, HTTP _bodies_ are length-prefixed, with the Content-Length
header. It's just the headers that aren't.)

~~~
mikeash
The only problem with length prefixing is that it interferes with streaming
data, because you need to know the full length in advance. Thus HTTP chunked
encoding. Still, it works great in most scenarios.

My favorite way to deal with this stuff is Consistent Overhead Byte Stuffing:

[http://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffi...](http://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffing)

In short, you take the data and encode it with a clever scheme that
effectively escapes all the zero bytes. The output data contains no zeroes,
but results in almost no overhead, with the worst case being an increase of
1/254 over the original size, and the best case being zero increase. (Compare
to e.g. backslash escapes of quotes in quoted strings, where the worst case
doubles the output size.) You then use the now-eliminated zero byte as your
record separator. This lets you stream data (with a small amount of buffering
to perform the encoding) while still easily locating the ends of chunks.

I've played around with COBS but never used it in a real product, so this is
not entirely the voice of experience here. But it is a nifty system.
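
For reference, a Python sketch of the encoder and decoder (the decoder
assumes well-formed input):

        def cobs_encode(data):
            out, block = bytearray(), bytearray()
            for b in data:
                if b == 0:
                    out.append(len(block) + 1)  # code byte counts itself
                    out += block
                    block = bytearray()
                else:
                    block.append(b)
                    if len(block) == 254:  # longest run without a zero
                        out.append(255)
                        out += block
                        block = bytearray()
            out.append(len(block) + 1)
            out += block
            return bytes(out)  # contains no zero bytes

        def cobs_decode(data):
            out, i = bytearray(), 0
            while i < len(data):
                code = data[i]
                out += data[i + 1 : i + code]
                i += code
                if code < 255 and i < len(data):
                    out.append(0)  # the zero this code byte replaced
            return bytes(out)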

~~~
penguindev
that is just freaking cool. took me about 4 times to grok it. it sort of
reminds me of utf-8, and how you can synchronize that easily.

------
htp
Took some time to figure out how to type these on a Mac:

1. Go to System Preferences => Keyboard => Input Sources

2. Add Unicode Hex Input as an input source

3. Switch to Unicode Hex Input (assuming you still have the default keyboard
shortcuts set up, press Command+Shift+Space)

4. Hold Option and type 001f to get the unit separator

5. Hold Option and type 001e to get the record separator

6. (Hold Option and type a character's code as a 4-digit hex number to get
that character)

Sadly, this doesn't seem to work everywhere throughout the OS: I can get
control characters to show up in TextMate, but not in Terminal.

~~~
quesera
In Terminal, they are:

    
    
      FS: Control-\ 0x1c (file sep)
      GS: Control-] 0x1d (group sep)
      RS: Control-^ 0x1e (record sep)
      US: Control-_ 0x1f (unit sep)
    

(These control key equivalents have always been the canonical keystrokes to
generate the codes)

But they have to be preceded by a Control-V (like in vi) to be treated as
input characters. Control-V is the SYN code (synchronous idle), but has no
special meaning in an interactive context, which is presumably why it was
chosen.

The full set of control codes (0x00 - 0x1f) and their historical meanings are
why Apple added the open/closed Apple keys, eventually the Command key. They
wanted a set of keystrokes that were unambiguously distinct from the data
stream.

Control-S, e.g., will pause text output in the Terminal (also xterm, etc).
This was super useful in the days before scrollback. :) Control-Q to resume
(actually flush all the buffered output).

Overloading Control sequences was an unforgivable sin committed by Microsoft.

...if I remember the history correctly, Apple decided that having both
open/closed Apple keys was confusing, and having the Apple logo on the
keyboard was tacky, so they renamed the key for the Mac, and Susan Kare
selected a new glyph, which is a Scandinavian "point of interest" wayfinding
symbol.

...as a further aside, Control-N and Control-O are the cause of the bizarre
graphical glyphs you sometimes see if you do something silly like cat a binary
file. Control-N initiates the character set switch, and Control-O restores it.
This can be used to fix your Terminal when things go awry. Most people just
close the window, but I hate losing history. :)

0x20 - 0x7e, unshifted:

    
    
       !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
    

0x20 - 0x7e, shifted:

    
    
       !"#$%&'()*→←↑↓/▮123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_◆▒bcde°±▒☃┘┐┌└┼⎺⎻─⎼⎽├┤┴┬│≤≥π≠£·
    

...works in Firefox. YMMV.

Terminal-charset-quickfix: at shell, type "echo ^O". To get the literal ^O,
use Control-V then Control-O.

~~~
graywh
Most terminals will let you use ctrl-6 for ctrl-^ and ctrl-7 (and sometimes
ctrl-/) for ctrl-_.

------
slavik81
This seems a little better than it is. Those control characters are appealing
because they're rarely used. Making them important by using them in a common
data exchange format will dramatically increase the rate at which you find
them in the data you're trying to store.

Ultimately, this is a language problem. If we invent new meta-language to
describe data, we're going to use it when creating content. That means the
meta-language will be used in regular language. Which means you're going to
have to transform it when moving it into or out of that delimited file.

There is no fixed-length encoding you can use to handle meta-information
without imposing restrictions on the content. You're always going to end up
with escape sequences.

------
tokenrove
I think people are missing the fact that you have a "control" key on your
keyboard in order to type control characters. (Of course, control is now
heavily overloaded with other uses.)

------
csixty4
Pick databases have used record marks, attribute marks, value marks, sub-value
marks, and sometimes sub-sub-value marks in ASCII 251-255 since the late
1960s. Like the control characters this blog post recommends, the biggest
obstacle for Pick developers working on modern terminals is how on Earth to
enter or display these characters. There's also the question of how to work
with them in environments that strip out non-printable characters.

This isn't some clever new discovery. It's begging us to repeat the same
mistakes that led to the world adopting printable ASCII delimiters in the
first place.

~~~
pacaro
Awesome! I was going to make a comment about Pick but you beat me to it. The
challenge we had with Pick-style delimiters involved customers using
codepages that required these characters in text.

Encodings aside, the principle of having a hierarchy of delimiters can be
hugely powerful

------
revelation
It is now 2014. The world doesn't use ASCII, you will still need escaping for
binary or misformatted data, and overall the idea of mapping control
characters and text into one space is _dead and dusted_. Don't do it, don't
let other people do it, use a reasonable library that handles the bazillion
edge cases safely if you need to parse or write CSV and its ilk.

~~~
eggie
In science, a lot of people use ASCII and flat files. I used to really dislike
it, but over time I understood that there are certain practical reasons to do
this which deserve respect.

Due to the volume and novelty of data that we work with, we are often pushed
into a corner between human time and machine time. Each data set comprises a
new set of concepts, and each is huge. In this corner, sometimes a character-
delimited file is the best solution. There is not time to carefully craft a
binary format and then document it so it will not be forgotten later, nor is
there time to wait for a general-purpose format parser to operate on tens of
billions of records. We need a solution that can be designed in 1 minute and
be legible by all of our tools without modification.

Typically, I have used tabs in the place of the ASCII separators. This ensures
readability without any kind of parsing. Also, this lets me use the default
behaviors of well-worn, bug-free tools in the core of the Unix toolchain for
basic data processing tasks. Frankly, this is not a bad compromise.

If you are passing messages around a web stack, JSON, XML, and friends are
ideal solutions. If you have to occasionally deal with CSV, use a parser. I
just want to note that for many tasks in data analysis, it's OK to simply use
the _dead and dusted_ convention of mixed delimiters and data.

As these things develop, I will be trying to investigate how to use more
modern formats such as binary JSON representations in my work, and I'd be
curious what solutions people here suggest for working with very large data
(e.g. many trillions of observations).

------
eli
Makes sense, but it's practically a (very simple) binary storage format at
that point. You can't count on being able to edit a document with control
characters in a text editor. And I wouldn't trust popular spreadsheet software
with it either.

------
susi22
Related: This tool:

[https://github.com/dbro/csvquote](https://github.com/dbro/csvquote)

will convert all the record/field separators (such as tabs/newlines for TSV)
into non-printing characters and then in the end reverse it. Example:

    
    
        csvquote foobar.csv | cut -d ',' -f 5 | sort | uniq -c | csvquote -u
    

It's underrated IMO.

~~~
dbro
Thanks for bringing up csvquote. I wrote it last year, and am happy to hear
that other people find it useful.

It is indeed a simple state machine (see
[https://github.com/dbro/csvquote/blob/master/csvquote.c](https://github.com/dbro/csvquote/blob/master/csvquote.c)),
and it translates CSV/TSV files into files which follow the spirit of what's
described in the original article in this thread.

But instead of using control characters as separators, it uses them INSIDE the
quoted fields. This makes it easy to work with the standard UNIX text
manipulation tools, which expect tabs and newlines to be the field and record
separators.
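
The substitution is roughly this (a rough Python sketch of the idea, not the
actual C implementation linked above):

        def sanitize(text, restore=False):
            # Restore pass: bring the stashed delimiters back.
            if restore:
                return text.replace("\x1f", ",").replace("\x1e", "\n")
            out, in_quotes = [], False
            for c in text:
                if c == '"':
                    in_quotes = not in_quotes
                elif in_quotes and c == ",":
                    c = "\x1f"  # stash quoted commas
                elif in_quotes and c == "\n":
                    c = "\x1e"  # stash quoted newlines
                out.append(c)
            return "".join(out)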

The motivation for writing the tool was to work with CSV files (usually from
Excel) that were hundreds of megabytes. These files came from outside my
organization, and often from nontechnical people - so it would have been
difficult to get them into a more convenient format. That's the killer feature
of the CSV/TSV format: it's readable by the large number of nontechnical
information workers, in almost every application they use. I can't think of a
file format that is more widely recognized (even if it's not always
consistently defined in practice).

------
Falling3
This reminds me of a depressing bug I run into frequently. I do a bit of work
integrating with an inventory management program. Their main method of
importing/exporting information is via CSV. The API also imports and exports
via CSV, except whoever wrote the code that handles the imports decided not
to use any sort of sensible library. Instead, they use a built-in function
that splits the string on commas with absolutely no way of escaping, so
there is no way to include a comma in a field.

It's led to many a headache.

~~~
chiph
I deal with a vendor who occasionally sends us files without the double-quote
character escaped. I feel your pain.

~~~
flomo
I used to work in an industry where different vendors passed around massive
CSV files. If there was a way to abuse CSV, someone had done it, no two of
them were exactly alike.

------
hrjet
I have created a (work-in-progress) Vim plugin [1], that uses Vim's conceal
feature to visually map the relevant ASCII characters to printable characters.

It sort of works, but there are known issues which I have listed in the
README.

[1] : [https://github.com/hrj/vim-adtConceal](https://github.com/hrj/vim-
adtConceal)

------
mmasashi
It does not solve the problem. Here are the points as I see them.

1. Control characters are not supported in almost any text editor.

2. Control characters are not human friendly.

3. The text may contain control characters in the field values.

In any format, we cannot avoid escape characters, so I think the CSV/TSV
format is reasonable.

~~~
stronglikedan
You are correct in that it does not solve a problem. Furthermore, the article
tries to create a problem with CSV that does not exist.

> CSV breaks depending on the implementation on Quotes, Commas and lines

CSV does not break; the implementation is broken if it doesn't parse CSV
properly. With a proper implementation, CSV solves every problem that will
arise from this method.

~~~
__david__
The problem with CSV is that it looks so simple that nobody ever uses a real
library to do it—they just roll their own. So you end up with a million
implementations that are all buggy in various different ways. If you receive a
CSV-formatted file, you can never be sure if it's actually good, valid CSV,
or some invalid crap from a programmer who reinvented the wheel because it
was "so easy".

~~~
saalweachter
And as a side effect, if you are relying on lots of data files provided by
other people, you inevitably end up with a library of 57,000 parsers, 55,000
of which are for different, slightly broken CSV files.

------
omarforgotpwd
"Alright let me get you some quick test data. Just need to find the 0x29 key
on my keyboard... or 0x30? Wait is this a new row or a new column? What was
the vim plugin for this?"

And then someone wrote an open source CSV parsing library that handles edge
cases well and everyone forgot these characters existed.

------
dsjoerg
This is a good illustration of how the hard part isn't "solving the problem"
-- it's getting everyone to adopt and actually _use_ the standard.

Reminding everyone that an unused, unloved standard exists is just reminding
everyone that the hard part went undone.

------
tracker1
I actually really appreciate this article, though I've known about it for
decades now. In fact, I used to return javascript results in a post target
frame back in the mid-late 90's and would return them in said delimited
format... field/record/file separated, so that I could return a bunch of data.
Worked pretty well with the ADO Recordset GetString method.

Of course, I was one of those odd ducks doing a lot of Classic ASP work with
JScript at the time.

------
christiangenco
Here's an implementation of ASCII Delimited Text in Ruby using the standard
csv library:
[https://gist.github.com/christiangenco/73a7cfdb03e381bff2e9](https://gist.github.com/christiangenco/73a7cfdb03e381bff2e9)

The only trouble I ran into was that the library doesn't like getting rid of
your quote character[1], and I don't see an easy way around it[2].

That said, I really don't like this format. The entire point of CSV is that
you have a serialization of an object list that can be edited by hand. Sure
using weird ASCII characters compresses it a bit because you're not putting
quotes around everything, but if you're worried about compression you should
be using another form of serialization - perhaps just gzip your csv or json.

In Ruby in particular, we have this wonderful module called Marshal[3] that
serializes objects to and from bytes with the super handy:

    
    
        serialized = Marshal.dump(data)
        deserialized = Marshal.load(serialized)
        deserialized == data # returns true
    

I cannot think of a single reason to use ASCII Delimited Text over Marshal
serialization _or_ CSV.

1. ruby/1.9.1/csv.rb:2028:in `init_separators': :quote_char has to be a
single character String (ArgumentError)

2. [http://rxr.whitequark.org/mri/source/lib/csv.rb](http://rxr.whitequark.org/mri/source/lib/csv.rb)

3. [http://www.ruby-doc.org/core-2.1.1/Marshal.html](http://www.ruby-doc.org/core-2.1.1/Marshal.html)

------
yitchelle
The big problem with the two markers mentioned in the post is that they are
not part of the visible character set. Using a comma delimiter is good
because it is visible; you can just use a basic text viewer to see it.

A tab delimiter is not preferable as it is not visible, and can be problematic
to parse via command line tools (ie what do I set as the delimiter
character?).

I think the whole point of having ASCII delimited text files is to have
human-readable data in them.

~~~
NoodleIncident
If you're using command line tools, others have posted how to use them.

C-v-shift-_ and C-v-shift-^ both work for me.

They print a little strangely, but if you were really dedicated to the idea,
you could alias the tools you use to use these by default for their input and
output separators.

------
peterwwillis
You know where this is useful? Databases.

No, please, put the gun down... let me explain. _Sometimes_ you have a
database that's so complex and HUGE that changing tables would be a nightmare,
or you just don't have the time. You have a field that you want to shove some
serialized data into in a compact way, without having to think about
formatting. You could use JSON, tabs, or CSV, but those all require a parser.

With these ASCII delimiters you can serialize a set of records quickly and
shove them into a string, and later extract them and parse them with virtually
no logic other than looking for a single character. And because it's a control
character, you can strip it out before you input the data, or replace control
characters with \x{NNN} or similar, which is still less complex than
tab/csv/json parsing.
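
A minimal sketch of that approach in Python (helper names are mine):

        US, RS = "\x1f", "\x1e"  # unit and record separators

        def pack(records):
            # Strip stray separators from the data first, as suggested.
            clean = lambda s: s.replace(US, "").replace(RS, "")
            return RS.join(US.join(clean(f) for f in rec)
                           for rec in records)

        def unpack(blob):
            return [rec.split(US) for rec in blob.split(RS)] if blob else []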

Granted, the utility of this is extremely limited, probably mainly for
embedded environments where you can't add libraries. But if you just need to
serialize records with the simplest parsing imaginable, this seems like an
adequate solution.

~~~
csixty4
You just described a Pick or "multivalue" database. They were a nightmare to
work on, but I'll admit that's mostly because of the tools (or lack thereof).
It led to people storing all sorts of different data in one table and the
queries got really messy because multivalue fields had to be treated
differently than regular ones.

------
DEinspanjer
I never knew about these, which is a bit embarrassing considering how long
I've been in the data munging field. :)

I agree with several other comments that the biggest issue is not being able
to represent them in an editor. If you use some form of whitespace, then it is
likely to lead to confusion with the whitespace characters you are borrowing
(i.e. tab and line feed). If you use special glyphs, then you have to agree on
which ones to use, and it still doesn't solve the problem of readability.
Without whitespace such as tab and line feed, all the data would be a big
unreadable (to humans) blob, and with whitespace, it would lend confusion
about what the separator actually is. Someone might insert a tab or a
linefeed, intending to make a new field or record, and it wouldn't work. If
the editor automatically accepted a tab or linefeed and translated it to US
and RS, then there would have to be an additional control to allow the user to
actually insert the whitespace characters that this is supposed to enable. :/

------
htns
CSV if you do it as in RFC 4180 [1] already has everything the link describes,
plus pretty good interoperability with most things out there. If you abused
CSV you could even store binary data, while ASCII has no standard way to
escape the delimiter characters.

1: [http://tools.ietf.org/html/rfc4180](http://tools.ietf.org/html/rfc4180)

------
Pitarou
While we're on the subject, we should probably be using control code 16 (Data
Link Escape) instead of the backslash character to escape strings.

The problem is, of course, that we can't see it (no glyph) and we can't
"touch" it (no key for it) so people won't use it. Ultimately, we're all still
stick-wielding apes.
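
For what it's worth, the mechanics are tiny; a Python sketch of DLE-style
byte stuffing, as in the old synchronous protocols (the delimiter choice
here is mine):

        DLE, RS = 0x10, 0x1e

        def stuff(data):
            # Precede DLE and the delimiter with DLE on the way out.
            out = bytearray()
            for b in data:
                if b in (DLE, RS):
                    out.append(DLE)
                out.append(b)
            return bytes(out)

        def unstuff(data):
            out, escaped = bytearray(), False
            for b in data:
                if b == DLE and not escaped:
                    escaped = True
                    continue
                out.append(b)
                escaped = False
            return bytes(out)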

~~~
XorNot
All this is making me realize that we also could have easily avoided having
confusion about what's a command line argument separator and what's part of a
file name if some of these were keyboard keys. "Field separators" to break up
your command line, and regular spaces as just regular spaces? Hell yes please.

------
ChuckMcM
I've used these in ASCII files and they are quite useful. But as most folks
point out, actually using control characters for "control" conflicts with a
lot of legacy usage of "some other control." Which is kind of too bad. Maybe
when the world adopts Unicode we'll solve this, oh wait...

~~~
maxerickson
I bet someone stubborn would find a way to screw it up anyway (I guess
stubborn people are the biggest problem with csv...).

------
chinpokomon
Having read through all the comments, I think the only real benefit to using
the control characters is in the original intent: a flat file that represents
a file system, with file separators (FS) between files, group separators (GS)
between groups (which are like tables), record separators (RS) between
records, and unit separators (US) to identify the fields in each record,
storing only printable ASCII values.

This isn't intended to be a data exchange format, it is a serial data storage
format. In this way, there may be some valid usages, but modern file systems
do not need this sort of representation and it has no real benefit over *SV
formats for most use cases. I suppose it could still be used for limited
exchange, but since it can't be used for storing binary, much less Unicode
(except perhaps UTF-8), other formats are less ambiguous and more capable.
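
As a toy Python illustration of that original hierarchy (the sample data is
mine):

        # Units within records, records within groups (tables), groups
        # within files -- four levels, no escaping needed. FS would
        # separate whole files in the same way.
        FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"

        tables = [
            [["id", "name"], ["1", "alice"], ["2", "bob"]],  # one group
        ]
        blob = GS.join(RS.join(US.join(rec) for rec in grp)
                       for grp in tables)

        # Reading it back is three nested splits:
        parsed = [[rec.split(US) for rec in grp.split(RS)]
                  for grp in blob.split(GS)]
        assert parsed == tables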

------
Balgair
Haha! Oh wow, I just finished a project dealing with this exactly. The obvious
problem is that most editors make dealing with the non-standard keyboard keys
very difficult. As a consequence, most programs (Python, MatLab, etc) really
don't like anything below 0x20. I was reading in binary through a serial port,
and then storing data for records and processing. Any special character got
obliterated in the transfer to MatLab, Python, etc. I ended up storing it as a
very long HEX string and then parsing that sucker. I'd have loved to use
special characters to have it auto sort into rows and columns, but that meant
having it also escape things and wreck programs. Ces la vie.

~~~
sp332
C'est la vie ;)

~~~
Balgair
Gratzie

~~~
Yetanfou
Grazie |-)

~~~
Balgair
Day nada

------
hamburglar
I will be sure to use this if I ever encounter data that is guaranteed to be
pure ASCII again.

------
binarymax
Protip: if it doesn't appear on keyboards, you can use ALT+DDD (DDD being 000
to 255) to enter a control character. For those on Windows, drop into a
command prompt and hold ALT while pressing 031 on the numpad. You will see it
produce a ^_ character.

~~~
chrisBob
On what computer? Is this windows only? Now that I use a mac this might be the
only thing I miss from windows computers.

~~~
JadeNB
As htp mentioned above
([https://news.ycombinator.com/item?id=7474951](https://news.ycombinator.com/item?id=7474951)),
you can do this on a Mac by using Unicode Hex Input as your input source.

------
wisty
Devil's advocate - CSV is superior, because edge case bugs (a comma in the
data) are likely to be tested.

The edge case bugs in ASCII codes could still crop up. It shouldn't, but then,
valid SQL shouldn't crop up in a web form either. And when it does, we'll need
escape codes just like CSV, only it won't be well tested in all the tools
(because it's not going to frequently happen).

It's like all the OSS advocates laughing at Microsoft's idiotic "My
Documents" folder. It's not there because they didn't realise how much
trouble it would cause programmers; it's there because they _wanted_ to force
people to deal with edge cases.

------
gwu78
Why not use the non-printing char as the comma instead of the record
separator.

1. Replace all the commas in the text with the unique non-printing char
before converting to CSV.

2. Convert this char back to a comma when processing the CSV for output to be
read by humans.

Because commas in text are usually followed by a space, the CSV may still even
be readable when using the non-printing char.

I must admit I've never understood why others view CSV as so troublesome vis-
a-vis other popular formats.

    
    
        in:  sed 's/,/%2c/g'
        out: sed 's/%2c/,/g'
    

I guess I need someone to give me a really hairy dataset for me to understand
the depth of the problem with CSV.

Meanwhile, I love CSV for its simplicity.

~~~
dbro
that's exactly what
[https://github.com/dbro/csvquote](https://github.com/dbro/csvquote) does for
commas and newlines both.

~~~
gwu78
Why use this instead of sed, awk, flex, lua, etc.?

sed does the job and on almost all UNIX clones it never needs to be installed.

Because it's already there.

------
rcthompson
If you use unprintable characters in your file, it's no longer human-editable
as text. It may as well be XML (i.e. technically text-based but not
practically human-readable).

~~~
JadeNB
> It may as well be XML (i.e. technically text-based but not practically
> human-readable).

Is this really a standard complaint about XML? I thought the main complaint
was that it wasn't human- _writeable_. I wouldn't want to read novels in XML,
but I've never had a problem opening up an XML file in a text editor to get at
bits of it.

~~~
rcthompson
Well, you'll often get XML files with a single line and no whitespace between
elements, and that makes things a lot more interesting. Basically, you can't
rely it being practical to quickly poke around in the text of an XML file. I
always feel sad when I have to read the text of an XML file to get
information.

~~~
JadeNB
> Well, you'll often get XML files with a single line and no whitespace
> between elements, and that makes things a lot more interesting.

That can be fixed with `xmllint --format`; but I agree that, once you need to
bring in external tools, it's not clear that calling it 'human-readable' is
really appropriate any more.

------
notimetorelax
Would it work if the file were encoded in UTF-8 or UTF-16?

~~~
bkyan
I had the same question and found this:

[http://en.wikipedia.org/wiki/Unicode_control_characters](http://en.wikipedia.org/wiki/Unicode_control_characters)

Seems like the same control characters are present.

------
zenbowman
Interestingly, Apache Hive uses control characters for column and collection
delimiters by default. I commend them for that decision.

------
co_dh
How do you enter them? In the console? In an editor? Since they are
invisible, how do you tell if you have entered a wrong character?

~~~
thebelal
You can type them in console and vim with

    
    
        File separator - C-v C-\
        Group separator - C-v C-5
        Record separator - C-v C-6
        Unit separator - C-v C-7
    

They are all visible characters in both vim and emacs by default.

You can see them on the terminal with `cat -v`

It would be nice if more tools were built to take advantage of these
characters, but there are some that do.

~~~
tmalsburg2
In Emacs, one way to enter these characters is to use `M-x ucs-insert` and
then enter the hex code of the separator:

    
    
        M-x ucs-insert 1c
        M-x ucs-insert 1d
        M-x ucs-insert 1e
        M-x ucs-insert 1f
    

for file, group, record, and unit separators. Is there an easier way?

~~~
fshaun
C-q (quoted-insert) will insert the next input character ignoring any other
bindings. So: C-q C-\ C-q C-] C-q C-^ C-q C-_

------
snorkel
Hah! Wonderful example of a forgotten feature.

It's not often that the tab-delimited format is problematic -- at least,
nothing that a simple string-replace operation can't solve -- so it's not
worth trying to convince every existing text reader and text processor to
recognize these long-forgotten record separators instead.

------
neves
Wow. I can't count the number of times I've seen a bug due to a newline, a
comma, or quotation marks inside a field.

------
sitkack
We need a font to display the hidden characters and a keyboard with 4 more
keys. Problem solved.

------
Splendor
I tried to use it, but one of the tools I rely on (BeyondCompare) resorts to
hex comparisons when it detects these characters. In contrast, it treats CSV
files better than anything, letting you declare individual fields as key
values, etc.

------
rcfox
I've recently had to deal with exclamation mark separated values. It sure is
exciting, especially when there are empty fields:

    
    
        foo!bar!baz!
        a!b!c!
        d!!!
        e!f!!

------
mncolinlee
I thought I was one of the few using a pipe-delimited format in my tools. You
can handle many of the incidental problems by having both pipes and quotes,
like CSV.

------
kyllo
This is awesome, but sadly no one is going to use it until Microsoft Excel
allows you to export spreadsheets in ASCII delimited format.

------
efalcao
mind. blown.

------
Eleutheria
EDI is actually a wonderful and simple ASCII format for complex documents, in
use for over 30 years.

The underlying mapping formats for specific industries are a pain to parse,
but everything is easily formatted using stars or pipes as field separators:

    
    
        ST|101
        NAM|john|doe
        ADR|123 sunset blv|sunrise city|CA
        DAT|20140326|birthday
    

Ah, the joy of simplicity.

~~~
commandar
HL7 used in healthcare is similar.

Just grabbing the first few segments from the example message in the wiki
article:

    
    
      MSH|^~\&|MegaReg|XYZHospC|SuperOE|XYZImgCtr|20060529090131-0500||ADT^A01^ADT_A01|01052901|P|2.5
      EVN||200605290901||||200605290900
      PID|||56782445^^^UAReg^PI||KLEINSAMPLE^BARRY^Q^JR||19620910|M||2028-9^^HL70005^RA99113^^XYZ|260 GOODWIN CREST DRIVE^^BIRMINGHAM^AL^35209^^M~NICKELL’S PICKLES^10000 W 100TH AVE^BIRMINGHAM^AL^35200^^O
    

The delimiters are defined at the beginning of the opening MSH (message
header) segment. HL7 is zero-indexed, but your zero index is always the
segment label, so it's easy for non-technical people to count naturally to get
the field identifier without having to explain counting n-1 to them.

The one exception to that is the MSH segment. Things get a little screwier
there because the first instance of the field delimiter is _also_ counted as a
full field in the spec, so it tends to trip people up. So even though "^~\&"
above looks like it should be MSH.1, it's actually MSH.2, etc.

The delimiters used in the wiki example are the most common you encounter, but
some systems do things differently because reasons. The primary HIS at my
hospital uses colons and semicolons, for example (and I want to poke out my
eyes with ice picks every time I have to look at the messages coming from it
as a result). But since it's all defined right in the message header, it's
trivial to convert between delimiters when you need/want to.

Either way, this is how the vast majority of electronic medical records are
transmitted today.

~~~
Eleutheria
And the good thing is that you can generate very complex structured
documents, not only flat lists as in CSV.

Of course, XML and JSON do the same, but they are more verbose.

ASCII formats like EDI were invented when every byte in transmission counted.

