ASCII Delimited Text – Not CSV or TAB delimited text (ronaldduncan.wordpress.com)
695 points by fishy929 on Mar 26, 2014 | 280 comments

I've done this.

Everybody hated it. Most text editors don't display anything useful with these characters (either hiding them altogether or showing a useless "unknown" placeholder), and spreadsheet tools don't support the record separator (although they all let you provide a custom entry separator, so the "unit" separator can work). And that's besides the obvious problem that there's no easy way to type the darned things when somebody hand-edits the file.

It's a shame. The solution is in the charset, but tools never developed to use it so we don't use it. But I'd wager that if tools had historically supported them, then the situation would be no different than it is with tab.

There are representational glyphs for tab, return, and others (⇥, ↵), and editors can show them in 'show whitespace' modes. There could be representational glyphs for these control characters, too. I'm not sure about the history of these symbols, but I imagine they were initially on keyboards. But if these control characters were on keyboards and had a representation in text then they'd be just as useless as the tab character is today.

Precisely what makes them valuable is their difficulty to type or display.

> Precisely what makes them valuable is their difficulty to type or display.

Exactly. A naive[1] character-delimited format is only robust when its delimiter isn't on the keyboard. Which means it's practical to read and write only in specialized applications, not in a text editor. Which sort of defeats the point of a character-delimited format anyway. If your use cases are constrained to specialized applications, you may as well just use JSON or something similar, instead of a character-delimited format. (I'm imagining a world where we get to pick our ideal formats, not one where Excel happens to have an almost-working CSV implementation.)

[1] By "naive" I mean one where the format specification describes record and line characters only. Thus the format has no escaping system.
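For what it's worth, a naive format like that fits in a few lines of Python (the function names and the round-trip are mine, not from the article):

```python
US = "\x1f"  # unit separator: between fields
RS = "\x1e"  # record separator: between records

def dumps(rows):
    """Serialize a list of rows (lists of strings) with no escaping."""
    return RS.join(US.join(row) for row in rows)

def loads(text):
    """Parse back; naive, so fields must not contain US or RS."""
    return [rec.split(US) for rec in text.split(RS)] if text else []

# Commas, tabs and quotes in fields need no escaping at all:
rows = [["name", "note"], ["Ann", "said \"hi\",\tthen left"]]
assert loads(dumps(rows)) == rows
```

Which is exactly the appeal: as long as the delimiters can't sneak into the data, there is no escaping layer at all.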

Is this a problem that's solved with a "movement" ? I'm willing to contribute parsers and perhaps a text editor plugin if other people are willing to chip in. But indeed, this is a problem that's solvable by making the tools support it. Since most of the tools are OSS, this is mostly solvable.

You're missing my point. If this were a "solved problem," then it would be no better than the status quo with TSV. Tabs are a similar kind of control character, except they can be typed and displayed easily. And TSV files are a mess because tabs end up in the fields themselves. It's a catch-22. (I'm exaggerating; this would still be somewhat better since there will never be a big key on your keyboard dedicated to these ASCII values.)

Tabs are not a record separator. They are a formatting code. That's why a mess is created when they are used to separate records.

You just gave further proof that tabs are a catch-22.

They are if you're saving formatted text in a tsv- thus all the discussion about this rather mediocre attempt to replace them.

Yes, and he also specified why they are different from record separators in this regard.

It's no different: if it can be displayed on the screen and it's on the keyboard, then the user will type it into the text file for how it looks, not because they really want to use it as a record separator. And if the user can type it into the content, then you have to escape it, which is back to the same problem as with tabs.

But if it's not on the keyboard, then the user won't type it. So it doesn't get used.

It's very different.

Tab, comma, and semicolon are actual characters people DO use and need to use inside records. So not only can they be typed into records, they absolutely HAVE to be for most textual records.

ASCII record separators are not needed inside records at all. If someone adds them "for how it looks", it's his problem.

One (not using a record separator inside a record) is a matter of choice and doing the right thing. The other (not using comma, semicolon or tab) is a non-starter.

Obviously there's a difference.

I'd suggest that the most common tool used to generate and manipulate tabular data is Excel, which has terrible CSV export.

> Precisely what makes them valuable is their difficulty to type.

Again, if they'd caught on you'd imagine there would be some eventual convention in text-editors for what keybind would be used to enter them. Too bad the AltGr key (intended for entering rarely-used glyphs) doesn't appear on pure-English keyboards.

I'd bet that they were on the eleventyzillion key IBM terminal keyboards.

I'm looking at my English keyboard and it has Alt Gr...

Qwerty keyboards intended specifically for the US market (opposed to the UK for instance) frequently do not have AltGr keys. The thinkpad I'm using right now doesn't have one, and the Das Professional I have next to me doesn't have it either. Come to think of it, I'm not sure if I've ever owned a keyboard with an AltGr key..

To get around this, I have taken to using xmodmap to turn my right alt key into altgr.

I'm surprised to learn that the Das doesn't have it.

How would you tell…?

Ha. Only the Das Ultimate doesn't have printed keys. I got the Professional which has printed keys (http://www.daskeyboard.com/model-s-professional/). I'm not that show-offy about touch-typing. ;)

Tip: I've got both, the blank one isn't worth the hassle. I can type fine on it, except when I need to find & or something. Then I have to press all the numeric keys to figure it out.

Here in Belgium we have to AltGr-the-hell out of our keyboard during development. These characters can only be entered using the AltGr key on a Belgian Azerty keyboard (be-latin1): '|' '@' '#' '^' '{' '}' '[' ']' '\' and '~'. It's not really a problem, as long as you're used to it :-)

Of those, |, @, #, ^, {, } and ~ are all keys that you need shift for on US Qwerty keyboards, so I imagine that the typing experience for those is relatively comparable.

IF those characters are actually displayed on the corresponding keys. If it's like the # on the Mac keyboard - not displayed anywhere - it's a right pain to learn in the first place without any visual cues.

I'm looking at a couple Macs sitting around the house and they all have a pretty obvious # right above the 3. Perhaps you mean some other key?

(I do really miss the days when Mac keyboards had the weird symbol they use for "alt" in menus on the key.)

Which keyboard do you have? The standard aluminium full size one doesn't label it, at least in the uk

On US Mac keyboards, # is Shift-3. On UK Mac keyboards Shift-3 is the Pound (Sterling) sign. # becomes Alt-3 and is unlabelled on the keyboard.

In the past, I've bought US Mac keyboards just for the #, or switched the keyboard layout in software to US.

I've become used to the alt-3 combo now, but it took a long time. I type # a lot more than I ever type £. $, too.

All my machines have US keyboards.

I fixed my keyboard -- @,#,^,{,}.... no shift by default. Have to press shift to get 2,3,6,[,],9,0,...

Must be annoying to have to press shift to get those symbol chars while you're coding, eh?

> Must be annoying to have to press shift to get those symbol chars while you're coding, eh?

Honestly no. It's all muscle memory for me, I don't think about it any more than I think about typing capital letters.

I think it's french keyboards that are like that by default. Symbols are the default and you have to press the Shift-key to get numbers.

Yeah, and some people complain that they have to use shift to get numbers.

I just think people will criticise their local layout no matter what. I use both a French AZERTY and a Québec QWERTY everyday (at work/home) and I think they are simply equally good for both typing French and for coding. The Québec keyboard (maybe actually Canadian multilingual or something) might have an edge because it more easily allows typing accented letters in uppercase, but on the other hand it doesn't let me type the € sign, so...

So most Belgian developers are not switching layouts when coding? I thought most of us non-US developers are "bi-lingual" when it comes to keyboard layouts.

That's nothing. Standard Italian layout lacks tilde and backtick completely… and one needs three fingers to type “{” or “}”. I use only US layout for programming.

Actually, I switch between three layouts on my machine (the third one is for my mother tongue), and I also use AltGr for typographic characters.

Living in the US, I've never heard of an AltGr key until this discussion.

I'm from Croatia, and we have the AltGr key.

However, I discovered that Alt + Control = AltGr when I needed to use it at work[1], so it's simply a shortcut, I think.

[1]: At work we use other layouts (I recently switched to a UK keyboard, because it suits me better), because the Croatian layout (all Slavic-language layouts, to be honest) is horrendously counterproductive for programming. Google the layout, and you'll realise why. An example: you need to press AltGr+B for `{` (if I remember correctly).

Perhaps, but I think it still goes against the original intent. Ctrl-~ or Ctrl-^ should give you a record separator (RS) and Ctrl-Del or Ctrl-_ should give you a unit separator (US). For the same reason Ctrl-m or Ctrl-M should give you carriage return (CR). This is because ASCII values from 00-1F are control characters, and the Ctrl key effectively grounded the most significant bits (7 and 6). Shift similarly would toggle or ground bit 6, depending on the implementation.

What happened was that the Ctrl key became synonymous with "command" after Teletype, so it became more about doing something. Think about Ctrl-x, Ctrl-c, and Ctrl-v as an example, but you still see some relics like Ctrl-d as End of Transmission (EOT) to close a shell or terminal. Alt is like a shift, but it is actually closer to the Fn key on most laptop keyboards. It was an alternative function of that particular key, so where the shift key provided you with an alternate case, Alt was more akin to an entirely different key... it isn't Alt plus an 'a' key, it is Alt-a.

AltGr was like another Alt key. It was originally there to allow you to enter an alternate glyph, especially line drawing characters available in extended ASCII, B0-DF. I thought it was a mapping closer to flipping the most significant bit to 1, but it doesn't exactly overlay the lower ASCII range, so that might be another change that evolved on the way to the modern keyboard.

To your original point, Microsoft Windows will now usually treat the chord Ctrl-Alt as AltGr. I don't know if that is with all layouts, or just those keyboards that lack AltGr. I find that most Linux distributions tend to follow Microsoft's lead and provide similar mappings but now they even repurposed the Win key as Meta or sometimes called Super. So it is likely that Ctrl-Alt is commonly the equivalent of AltGr.

For the purpose of this discussion, I think it'd be better if Ctrl could be used to type these text separators, but given the way modern operating systems map their modern keyboards, it might be difficult to ever reach consensus on how this should be done.

This hasn't been true since IBM keyboards became popular. For example, on older keyboards shift+number would simply toggle a bit, so shift+2 would be a double quote, etc., but this hasn't been common for decades now. Unfortunately.

Probably on the path to scan codes. By using scan codes, they could abstract what a particular key meant and thereby remap the keys so that they didn't have to match the ASCII table layout. I still don't understand why we evolved scan codes the way we did. This requires the OS to be in sync to be able to map them back.

I'm a Spanish living in UK and I love using UK keyboards for the same reason: the layout is way better for programming!

Windows used to have a "Czech (Programmers)" keyboard layout. As I remember, it was one of the few layouts that _just worked_. First thing I would do on new installs is make it the only available layout.

Almost nobody calls it AltGr, it's just the goddamn "right Alt key" :) ...and yes, it has a different key code from the left one, even if in most software it works just like another Alt.

The clever thing is how some international keyboard layouts use it like a kind of "second shift" for typing characters with accents/decorations: like AltGr+a => "ă", AltGr+q => "â", AltGr+s => "ș" etc. ...but not even these keyboard layouts are popular, and they're usually marked as "alternative" or "programmers' layout for language XYZ", because people are stupid and refuse to learn how to use this, preferring a funky national-language layout over a US English keyboard with an AltGr that would solve 99% of special-character problems.

If all the keyboards in the world were just US English Standard keyboards with an AltGr (most US English keyboards I've seen do have an AltGr!), all Latin-alphabet languages with special characters would be easy to type, and we polyglots could easily use the same keyboard for typing in multiple languages without having to remember which keys' positions have radically changed on each layout... but people are stupid and refuse to learn even simple key combinations.

Oh, and somebody should shoot the British (and French) for adding that annoying extra key to the right of the left Shift that I always have to disable (and making the Shift much smaller), and for creating extra confusion by branding them as "british international" or "us english business" keyboards.

Living in the UK, I've always seen it and for some reason never questioned what the Gr meant!

so.. what does it mean?

According to wikipedia:

The meaning of the key's abbreviation is not explicitly given in many IBM PC compatible technical reference manuals. However, IBM states that AltGr is an abbreviation for alternate graphic, and Sun keyboards label the key as Alt Graph.

Apparently, AltGr was originally introduced as a means to produce box-drawing characters, also known as pseudographics, in text user interfaces. These characters are, however, much less useful in graphical user interfaces, and rather than alternate graphic the key is today used to produce alternate graphemes.

If there isn't a glyph, you can always use text representation for control codes. Here's an example using Notepad++. http://i60.tinypic.com/nps3d1.png

> but I imagine they were initially on keyboards

I wouldn't be surprised if the last time they were on a keyboard, it was a teletype keyboard or a keyboard that punched cards!

Although, actually, were they just typed using the control key from the start?

ASCII 0-31 are called "control" characters so it should come as no surprise that you could type them using the Control key.

Unit Separator is Control-_ (underscore) and Record Separator is Control-^ (caret), for instance.

Most modern text editors won't pass through every control character. Vim lets me type the unit separator, but not the record separator, for instance.

Control-C and Control-D, End of Text and End of Transmission, still have utility in most shells, and it goes back directly to these ASCII control characters.


> Unit Separator is Control-_ (underscore) and Record Separator is Control-^ (caret), for instance.

> Most modern text editors won't pass through every control character. Vim lets me type the unit separator, but not the record separator, for instance.

Vim has bindings for some control key combos, which is why you can't type them directly. I can type control-_ without issue, but to get an ASCII 30 (RS) to show up, I have to type control-v first (just like when at the shell prompt).

Useful hint: if you look at the output of man ascii (at least on linux with man-pages-3.22), find the control character you want to type in the left column, then look in the same row in the right column to find the letter/key to use with the control key.

For example, to type NAK, it's <control-u>. Vertical tab is <control-k>:

   $ echo <control-v><control-k> | od -c
   0000000  \v  \n
This is useful for control characters that don't have backslash escape expansions.

...and that's why, when running a command that's reading what you're typing to stdin, you press Ctrl-D to tell it that you're done. Because, as your "man ascii" trick shows, this generates the "end of transmission" character.

Similarly, Ctrl-L clears the screen in most Unix apps because that generates the "form feed" character, and on a line printer or paper-based teletype terminal, "form feed" means to advance to the next page -- a nice empty clear piece of paper.

> when running a command that's reading what you're typing to stdin, you press Ctrl-D to tell it that you're done. Because, as your "man ascii" trick shows, this generates the "end of transmission" character.

This is slightly different than being able to actually generate those characters as input. If you put a control-d in a file, nothing will stop reading the file when it sees the control-d, it's just another byte. readline knows how to interpret a bunch of control characters. There's also the terminal driver that has interpretations, which you can see with `stty -a`. When the tty layer sees a control-d or a control-c, it closes stdin or generates SIGINT, respectively. But even this can be dependent on if you're on a pty or attached directly to a serial line. There's nothing special about the mapping of control-d to end-of-transmission character, that's just how the layer that sees it is interpreting it. You can, for example, change how control-c is interpreted by `stty intr ^B`, which will make control-b generate SIGINT. And there's a way to put the terminal driver in complete transparent/passthru mode.

Sure, the default is overridable, and the character has no magic effect when found in a binary file, but there's a reason Ctrl-D was chosen as the standard keystroke sequence to end a stdin transaction, and it has everything to do with the fact that "D" shows up next to "end of transmission" in `man ascii`.

Thank you for clarifying something that no one was disputing or event mentioned.

Yeah, I hate it when people point out interesting things I didn't know before too.

I'm sure raldi can repeat, again, making it 3 times, that control-d means end of transmission and that's why it was chosen for signalling end of stdin, on a thread about how to type the control characters.

I am going to love not having to type out 'clear' all the time - thank you!

what about Ctrl-C to break out of stuff? Does that correspond to an ascii control character?

Ctrl-C stops the program because the TTY driver watches for it and sends an interrupt signal when it appears. It's a slightly different mechanism.

There are more details in "man 3 termios", but be warned, the TTY layer is not a pleasant subject for reading.

I was disappointed that Ctrl-S and Ctrl-Q don't correspond (at least on the first ASCII chart I googled) to anything, because Ctrl-S stops text on my Unix terminals the exact same way it did on my brother's Apple ][e in 1984.

> I was disappointed that Ctrl-S and Ctrl-Q don't correspond (at least on the first ASCII chart I googled) to anything

I'm pretty sure every possible combination of 7 bits is assigned SOME name in ASCII, and Ctrl plus any letter (case-insensitive) gives a defined 7-bit value.

This chart says control-s is `DC3 (Device Control, X-OFF)` and control-q is `DC1 (Device Control, X-ON)`. Yup, that's what they do alright.


Don't forget everyone's favorite control-G, BEL, useful sending beeps across the wire on old school IRC, BBSen, and chat services. Or, you know, telegraph machines or something.

For anyone wondering, X-ON and X-OFF are flow control characters. A device sends X-OFF to say "my buffer is full, stop sending data" and X-ON when it's ready to resume.

On most Unix terminals, these have the effect of pausing and resuming the display when a bunch of data is scrolling by.

...and they're helpfully disabled with stty -ixon

Oh, thank you. The reference I found[1] only showed them as "device control." I thought I imagined them having that.

Also, the classic control-G is still in C. "\a" is "alert (beep)."

[1] http://www.asciitable.com/

I think you mean your brother's Apple //e - they switched from the original ][ to // for later models. That's what I started programming on too - except in my case it was my Dad's.

Ah, but the boot screen said "APPLE ]["

Ctrl-C was originally "End of Text".

That's a terrific hint. So Ctrl-<character with code n+64> gives you <character with code n>.
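That rule (Ctrl clears bits 6 and 7, i.e. code & 0x1F) can be checked in a couple of lines of Python; the helper name is mine:

```python
def ctrl(ch):
    """Return the control character the Ctrl key would produce:
    the key's ASCII code with bits 6 and 7 masked off."""
    return chr(ord(ch) & 0x1F)

assert ctrl("^") == "\x1e"  # Ctrl-^ -> RS, record separator (30)
assert ctrl("_") == "\x1f"  # Ctrl-_ -> US, unit separator (31)
assert ctrl("M") == "\r"    # Ctrl-M -> CR, carriage return (13)
assert ctrl("D") == "\x04"  # Ctrl-D -> EOT, end of transmission (4)
```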

> Unit Separator is Control-_ (underscore) and Record Separator is Control-^ (caret), for instance.

And this is in fact how they show up in vim (or at least my fairly uncustomized vim), using ^ to stand for ctrl as was once conventional: `^_` and `^^`

The legacy MARC binary format still used for library data uses ASCII 29, 30, and 31 -- although, probably owing to some bizarre historical definition, it uses them DIFFERENTLY than ASCII defines them.

0x1D == 29 == ASCII group separator == MARC record separator

0x1E == 30 == ASCII record separator == MARC field terminator

0x1F == 31 == ASCII unit separator == MARC subfield separator
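As a rough illustration of that separator usage only (a real parser must handle the MARC leader and directory, which this deliberately ignores), splitting on those three bytes might look like:

```python
# Illustrative splitter, not a MARC parser: 0x1D ends a record,
# 0x1E ends a field, 0x1F separates subfields within a field.
GS, RS, US = b"\x1d", b"\x1e", b"\x1f"

def split_marc(data):
    records = []
    for rec in data.split(GS):
        if not rec:
            continue  # trailing empty chunk after the final 0x1D
        fields = [f.split(US) for f in rec.split(RS) if f]
        records.append(fields)
    return records

blob = b"title\x1fsubtitle\x1eauthor\x1d"
assert split_marc(blob) == [[[b"title", b"subtitle"], [b"author"]]]
```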


Back in the day with a numeric keypad you could type Alt-(number) to get any ASCII character, but not sure if that still works. In Firefox apparently Alt-1 takes you to the first tab, Alt-2 second tab...

In most graphical *nix environments that I'm familiar with, if you hold Control and Shift¹ and type U, you can type any Unicode character by its (hex) code point. For instance, if you want an em dash, which is U+2014, hold Control and Shift, and type "U2014".

¹You can release Control and Shift after typing "U", in which case the character will appear after you type a space. Or, you can hold Control and Shift while typing the code point, in which case the character will appear after you release either modifier key.

I just tried in Debian, works! Very handy.

On Windows yes, just precede the number with a 0 to get Unicode. See the "How to enter Unicode characters in Microsoft Windows" page [1] for more details, including one of my favorite tricks, using Alt-X in WordPad. It is especially useful when you want to know what the code point is for a Unicode character.

[1] http://www.fileformat.info/tip/microsoft/enter_unicode.htm

Turn on Num Lock and it still works, including in Firefox.

Alt-(3 digits)

Vim will let you type the record separator, but you may have to precede it with Ctrl-V. In other words, type the Ctrl-V Ctrl-^ key sequence.

At least, that worked for me. Interesting that the unit separator does not have the same requirement to precede it with Ctrl-V.

More info in the vim docs:


Yes, Ctrl-V in {insert,command} mode lets you input literal control characters, as well as numeric codes (31 for unit separator, 30 for record separator). If you're using vim on Windows, however, note that the default vimrc loads mswin.vim which changes Ctrl-V to paste. In that case, you can simply remove mswin.vim from your vimrc or use Ctrl-Q instead. (See :h i_Ctrl-V).

You can also use Ctrl-K to enter RFC1345 digraphs, in which case unit separator is 'Ctrl-K US' and record separator is 'Ctrl-K RS'. (See :h Ctrl-K).

Emacs has similar features, as well as a nifty TeX input mode: http://stackoverflow.com/q/6269618

Huh. I wonder how many of those are completely disused, or would be meaningless in a file?

Yes, but nowadays this is just another bootstrap problem: editors don't support them because no documents use them, no documents use them because editors don't support them, and users scream and bawl because their cheese has moved. Using an editor with some good support for control-character-separated data would be pleasant, more pleasant than the usual experience of fiddling with CSV in a text editor. That said, another nasty aspect of the bootstrap process would be running control-character-separated documents through standard text-processing tools and pipelines and finding out how many of them strip non-tab control characters, turn them to spaces or mangle them pseudo-randomly.

I think the more fundamental problem with using the ASCII control characters is that there's no way (or rather, no obvious and standard way AFAICS) to use them to define (most) recursive data structures. That's not a problem when you stick to simple CSV-like arrays-of-arrays, but it's enough of a restriction to make one wonder if bringing ASCII back is worth the battles.

Absolutely. Too bad they went for a flat format instead of a sexpr-like format. You wouldn't even need 4 characters - just start,end, and separator.
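A hedged sketch of that sexpr-like idea: three control characters as start, end and separator are enough for arbitrary nesting. The choice of SO (0x0E) and SI (0x0F) as the brackets here is purely illustrative.

```python
START, END, SEP = "\x0e", "\x0f", "\x1f"

def dump(value):
    """Recursively serialize nested lists of strings (naive: leaf
    strings must not contain any of the three delimiters)."""
    if isinstance(value, list):
        return START + SEP.join(dump(v) for v in value) + END
    return value

def load(text):
    pos = 0
    def parse():
        nonlocal pos
        if pos >= len(text) or text[pos] != START:
            start = pos          # leaf: read up to a delimiter
            while pos < len(text) and text[pos] not in (SEP, END):
                pos += 1
            return text[start:pos]
        pos += 1                 # consume START
        items = []
        while text[pos] != END:
            items.append(parse())
            if text[pos] == SEP:
                pos += 1
        pos += 1                 # consume END
        return items
    return parse()

tree = ["a", ["b", "c"], "d"]
assert load(dump(tree)) == tree
```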

if the object is given a UUID, the recursive data structure can just point to the UUID when it's serialized to this? what's the problem?

Then you'd have 2 problems.

If I'm serializing nodes to ascii, and the rows are objects of type Node, and their contents were their UUID then VAL then NEXT, NEXT being a UUID, I don't see how you have 2 problems now.

It's pretty easy using Emacs. Just type C-q C-_ or C-q C-^, alternatively C-x 8 RET [037|036] RET.

Emacs makes it really easy to insert arbitrary unicode as well, even if you don't remember the code point. C-x 8 RET s n o w m a n RET will insert a ☃ (snowman).

In the standard CP437 PC charset, those characters are up/down triangular arrows. The standard MS-DOS editor also easily supports entering them with Alt+30/31, so maybe 20 years ago a proposal to do this might've been more accepted than it is today.

Don't do this. Tsv has won this race, closely followed by Csv. Anything else will cause untold grief for you and fellow data scientists and programmers. I say this as someone who routinely parses 20gb text files, mostly Tsv's and occasionally Csv's for a living. The solution you are proposing is definitely superior but isn't going to get adopted soon.

I was surprised to see you list tsv as more common than csv. I encounter csv's on a pretty regular basis, but I don't think I've had to parse a tsv in the past 3 or 4 years. As a junior web developer, I don't have much experience though. 9 times out of 10, the csv is coming from or going to Excel, or a system that was designed to support Excel. If you don't mind my asking, what types of data do you regularly work with that are in tsv format?

Your comment disturbs me a little… One of my gripes with Excel was that it imported and produced TSV data by default when you asked for CSV.

Excel actually doesn't 'care'. It uses the list separator defined in your Windows "Regional Settings", and the defaults there differ for each system locale.

TSV is nicer for output (on stderr/stdout or a logfile), so it tends to crop up if you want to parse the output/logfile of something. I haven't seen Excel in use at my workplace yet.

Surely only if you're dealing with fixed-width fields.

Floating point values with a given precision and some integers. We’ll have to buy a proper supercomputer before the latter take more than seven digits :\

Ugh, this format is a pain in the ass. Seriously, anything can happen in there...

There tends to be less overhead in TSV. Unless you want to represent text that has embedded tabs it seems unnecessary. It works with standard *nix tools. Not a bad compromise and part of the reason that people whose standard "file" is 100Gb prefer it.

You're telling the HN crowd not to do something because it might cause confusion and... disruption? Good luck! ;)

TLDR: it's superior, but don't do it...

That's unfortunately a very accurate summary :) Real estate data, traffic data, weather data, population demographics, stock prices, tweets - I've parsed all that and more. Every one of them was a giant Tsv (except the finance ones, which were csv's because Excel). Say you purchase the database containing every single home sold/bought in California for the past decade. That's 11 20gb Tsv's with 250 tab-separated columns, plus 1 data dictionary which tells you what each of the 250 columns means. That's what Reology sells you - gigantic txt files with tabs, that are easy to handle with awk, cut, sed and more.
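For those without the shell tools handy, the same cut-style column slicing can be sketched as a streaming Python generator (the column indexes and sample data here are made up for illustration):

```python
import io

def pick_columns(lines, indexes):
    """Stream a TSV line by line, yielding only the wanted columns --
    roughly what `cut -f` does, without loading the file into memory."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield [fields[i] for i in indexes]

sample = io.StringIO("addr\tcity\tprice\n1 Main St\tSF\t900000\n")
assert list(pick_columns(sample, [0, 2])) == [
    ["addr", "price"], ["1 Main St", "900000"]]
```

Because nothing is escaped, this works precisely as long as no field contains a tab, which is the whole thread's point.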

I could preface a lot of this with 'kids these days', but...

What you write is so true. So many large companies use text files to shuttle around data. I worked at one place that used pipe-delimited 10+GB files. It's not sexy, and using awk/sed/cut seems like a hack at first, then you realize that it works and it is the simplest solution to the problem.

awk, cut, sed and less

It's simply too late. There was MAYBE a chance 20 years ago to push adoption of this into major text editors and spreadsheet software.

But now it's like harping on the benefits of HDDVD or Bluray.

It is strictly less expressive, because it can't handle nesting. This makes it inferior.

What would make one pair of ASCII characters (comma/linefeed or tab/linefeed) handle nesting any better than another pair (unit separator/record separator)?

Because CSV actually has three special characters. The field separator is ',' (or tab for TSV). The record separator is '\n' (or "\r\n"). And the quote/escape character is '"'. Commas or newlines which are part of a quoted string are data rather than control characters. Quote characters can be escaped by preceding them with a second quote character. There is an RFC that describes this.

You could use ESC (\x1b) to escape itself and either of your delimiter characters, but of course now you've gotten back all that complexity you were trying to avoid by using non-printable characters.
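A sketch of that ESC-based scheme (the names and choices are mine, not any standard): each delimiter or ESC inside a field gets an ESC prepended, and the reader drops the ESC and keeps the next character.

```python
ESC, US, RS = "\x1b", "\x1f", "\x1e"

def escape(field):
    """Prefix ESC to any delimiter or ESC, so any string is a legal field."""
    out = []
    for ch in field:
        if ch in (ESC, US, RS):
            out.append(ESC)
        out.append(ch)
    return "".join(out)

def unescape(field):
    """Inverse of escape(); assumes well-formed input (no trailing lone ESC)."""
    out, it = [], iter(field)
    for ch in it:
        out.append(next(it) if ch == ESC else ch)
    return "".join(out)

tricky = "a\x1fb\x1bc"
assert unescape(escape(tricky)) == tricky
```

Of course the splitter now has to be escape-aware too, which is exactly the complexity the non-printable delimiters were supposed to avoid.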

I think CSV is more common, since it allows for escaping whereas with TSV I don't believe there is any method of escaping.

I had to deal with a system using TSV once, with that "feature", and that point made it so that we had to do the escaping at some higher level, with \t and \\.

Still, most *SV parsers can use arbitrary char or even regexp for a separator.

It's hardly too late. If you just look at the use case of application logging for example, the latest fashion is logging using JSON format log files, which is INSANE for a different set of reasons. I have servers that generate TB of log files on a regular basis. ASCII delimited log file format standard could be adopted by the application logging space, could result in some uniform tools that provide better streaming support for log shipping, and gain adoption in other adjacent use cases from there.
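As a hypothetical sketch of what such a log format could look like (the field layout of timestamp, level, message is my assumption, not a standard):

```python
US, RS = "\x1f", "\x1e"

def log_record(timestamp, level, message):
    """One log record: unit-separated fields, record-separator terminated,
    so embedded newlines in the message don't break line-oriented parsing."""
    return US.join((timestamp, level, message)) + RS

rec = log_record("2014-03-26T12:00:00Z", "INFO", "started\npid=42")
fields = rec.rstrip(RS).split(US)
assert fields == ["2014-03-26T12:00:00Z", "INFO", "started\npid=42"]
```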

yeah, so many people use horses, don't make the horses angry

There is another: fixed width. It doesn't cause grief.

Sure it does - any change in supported field length requires a schema change.

You trade simplicity of parsing for rigidity of schema.

Anyone that's ever had to parse arbitrary data knows of the approximately 14 jiggityzillion corner cases involved when sucking in or outputting CSV/TAB delimited formats. Yet much like virtual memory and virtual machines, we find that a solution has existed since the 60s. For those wondering about the history and use of all those strange characters in your ASCII table: http://www.lammertbies.nl/comm/info/ascii-characters.html

> One might question why all control codes in the ASCII character set have low values, but the DEL control code has value 127. This is because this specific character was defined for deleting data on paper tapes. Most paper tapes at that time used 7 holes to code the data. The value 127 represents a binary pattern where all seven bits are high, so when using the DEL character on an existing paper tape, all holes are punched and the existing data is erased.

I love this, it shows just how old the roots of ASCII are.

There's a problem with that explanation: the binary representation of "deleted data" has a different meaning than the control code "delete data".

Interesting web page! Despite many years of using ASCII and knowing some of the more common control codes, I had never even thought about what the other mysterious 0-31 codes were defined as.

Something that the page doesn't mention is that CR+LF were originally two separate control codes because the action of returning the print head to the left hand side would take too long with a standard line printer. Therefore, separating the actions into two codes meant that the printer would not miss out any printable characters.

(At least, I read that somewhere on the internet and assumed it was true!)

It's more likely that it's because they are two separate physical actions (returning the head to the left, and advancing the paper one line). They could be used independently: You could print a line in bold, for instance, by issuing a CR without an LF and then printing the same line again.

A carriage-return operation takes much longer than a single character, or even two or three. It doesn't make sense to issue two characters just to take up time. The printers always had to have some internal buffer memory (and handshaking over the communication lines to say when the buffer is full) in order not to lose any characters.

"You could print a line in bold, for instance, by issuing a CR without an LF and then printing the same line again."

Last I checked, this still works even on laser printers (at least on a LaserJet), when sending data to it as plain text. It's not actually printing over itself, but it knows to make the repeated characters bold.

less (among other unix tools) does this too, though per character: you emit the character, a backspace, then the character again. There are more, like underscore, backspace, character for underlines (alongside cat there is ul, which handles this specifically). If your terminal supports os (overstrike) in its terminal description, it handles that natively.

> You could print a line in bold, for instance, by issuing a CR without an LF and then printing the same line again.

True, but mildly redundant: "overprinting" was explicitly the purpose of 0x08 backspace (which had nothing, originally, to do with 0x7F deletion.)

To overprint a whole line using 0x08, you'd need one 0x08 for each character in the line. So an N-character line overprinted that way would take 3N characters in memory.

Using CR, you'd need 2N + 1 characters.

I had a daisy wheel printer in the late '80s that had a few characters of buffer, and you had to know that every CR took quite some time. It would do the naive bold of the full line with CR, but if it got X^HX it would strike the X and then slide the head over a bit to the right to smear and get the bold effect, which looked much better. That was not uncommon, and it's another reason a lot of code did it the BS way for overstrike-capable hardcopy devices.

Ugh, wish I had checked this after sending it, now it's too late to edit. I think the gist of it is clear, but to make sure:

Overprinted with 0x08: requires 3N characters

Overprinted with CR: requires 2N+1 characters

Even ignoring additional time needed to send those extra characters, it also was a lot faster than those control-H's, and (I guess) caused way less wear on your printer.

That presumes you're going like A^HAB^HB; you could just emit foo, and then strlen(foo) ^Hs.

I wonder how precise the backspacing was - how many times could you print a character and backspace before the head had drifted by a full point?

Yes. Separate Carriage Return & Line Feed date back to Murray's 1901 variant of Baudot encoding, and ASCII was created to standardize the various teletype encodings out there so it inherited this way of doing things.

In the days of mechanical line printers CR by itself allowed overprinting for special effects like underlining and bold.

Depends on your platform. Even today a Unix/Linux line ending is almost always just LF.

See: http://en.wikipedia.org/wiki/Newline

Unix = LF
Mac = CR
DOS = CRLF

The DOS line endings are inherited from previous systems, and while they precisely convey carriage and paper movement of a printer it can be a pain to deal with today.

Unix = LF
Mac = CR
DOS = CRLF

Please add to that list also



However, I have yet to see a single webserver that does not accept LF alone instead of CRLF.

Somewhat ironic that this detail of how teletypewriters work was not carried over into Unix whereas every intricacy of how video terminals worked still is with us today.

Only old Macs. They switched to POSIX-style LF with OS X.

True, but since Mac OS X is already known to be *nix-based, at least by this audience, I felt like that might just add to confusion. There are certainly other OSs like BeOS that I didn't mention, in part because I didn't think they added to the discussion, and also, while I have fond memories of BeOS, I honestly don't remember what it used. I want to say it was LF, but I no longer remember.

A much more complete history is detailed here: http://web.archive.org/web/20120213005708/http://www.transba...

Agreed. A few days ago, I learned that spaces are illegal between separators in CSV. For example, `"val1", "val2"` is illegal (it should be `"val1","val2"`). Kind of unintuitive given that in most languages, non-delimited spaces are insignificant.

I don't think CSV is a rigidly-defined format - I'm sure some implementations will happily accept spaces between the comma and the opening quote.

This is one reason why I prefer tab delimited files... The format is pretty simple with few edge cases. There's really only one caveat to worry about - fields with tab characters. And that's extremely rare in my field.

CSV on the other hand has a few different variations.

That's one reason I prefer TSV... the other one is that you can't use any of the regular Unix tools reliably on CSVs.

Can you explain how tab is any easier than comma? If you have to deal with escaping a character, then certainly it doesn't matter which character it is? For the general case, that is.

Who said anything about escaping a character? For most datasets, having actual tab characters is rare, especially if you can just replace them with spaces. Same with newlines - they just aren't needed in a lot of data. If you're dealing with user-derived text content, tab delimited files might not be the best choice. However, it's great for tabular data.

With CSV, you have to escape quotes, commas, and newlines. With tab delimited, you only have to escape tabs and newlines - and that's if they can't be sanitized out to begin with.

I mean for the general case. Sure, more datasets may be tab-safe than comma-safe. CSV doesn't need quotes if there's no commas. TSV needs quotes or something if the data contains tabs.

Or, you just escape the tab. \t style. (Well, then you need to escape \\, and \n, etc). Really, it just gets messy at that point. Which is why I try to avoid free text in tab delimited files. It's not much better in CSV if you allow newlines within cells.

The thing that gets to me is that the CSV RFC's number is over 4000, and it's less than 10 years old.

There is RFC 4180, which is probably the best formal description of the format we have.

My own CSV parsers (I have written a few by now) usually parse that as if the space before (or after) the quote wasn't there. It's nonetheless something to avoid when writing CSV files (Postel's law, etc.).

I wonder if a 'whitespace separated values' format would make sense, assuming a) the separator is 'any run of whitespace except for a single space (ascii 32)' b) all whitespace in values is folded to a single space (which covers 99.99% of CSV usage). This would make parsing trivial, visual formatting (e.g. alignment) possible, and escaping separators a non-issue.
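For what it's worth, here's a rough Python sketch of that idea (hypothetical wsv_split/wsv_join names, just to exercise the two rules above):

```python
import re

def wsv_split(line):
    """Split on any whitespace run except a single plain space."""
    fields, cur = [], ""
    for tok in re.split(r"(\s+)", line.strip()):
        if tok == " ":          # a lone space stays inside the field
            cur += " "
        elif tok.isspace():     # any other whitespace run is a separator
            fields.append(cur)
            cur = ""
        else:
            cur += tok
    fields.append(cur)
    return fields

def wsv_join(fields, sep="\t"):
    """Fold internal whitespace in each value to a single space, then join."""
    return sep.join(re.sub(r"\s+", " ", f.strip()) for f in fields)
```

The nice property is that "first name" needs no quoting at all, while two-plus spaces still permit visual column alignment.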

Spaces are valid in fields, even at the start or end. Furthermore you cannot start quoting in the middle of a field, which is why this doesn't work. For unquoted fields you can put as many spaces as you like after the separator, but they become part of the field, then.

Parsing and outputting CSV can definitely be a pain, and there are a lot of corner cases. I think that this is certainly one of those times where rolling your own should be avoided if at all possible. I am a big fan of Text::CSV or the Python csv module, but even using Text::CSV have had different behavior from different versions.

Or just google for "ASCII table" and open the first hit:


But your link does have more in-depth explanations (including historical info) for some of the control characters.

> man ascii

Just sayin'

Alas, I don't think this works with the standard Unix tools, which is the main way I process tab-delimited text. Changing the field delimiter to whatever you want is fine, since nearly everything takes that as a parameter. But newline as record separator is assumed by nearly everything (both in the standard set of tools, and in the very useful Google additions found in http://code.google.com/p/crush-tools/). Google's defaults are ASCII (or UTF-8) 0xfe for the field separator, and '\n' for the record separator. I guess that's a bit safer than tabs, but the kind of data I put in TSV really shouldn't have embedded tabs in a field... and I check to make sure it doesn't, because they're likely to cause unexpected problems down the line. Generally I want all my fields to be either numeric data, or UTF-8 strings without formatting characters.

Not to mention that one of the advantages of using a text record format at all is that you can view it using standard text viewers.

Awk lets you set both:

    $ echo -n "1,2,3|4,5|6|7,8,9,0" | awk 'BEGIN{FS=","; RS="|"} {print NF, $0}'
    3 1,2,3
    2 4,5
    1 6
    4 7,8,9,0
In fact, you can also specify the output delimiters as well:

    $ echo -n "1,2,3|4,5|6|7,8,9,0" | awk 'BEGIN{FS=","; RS="|";OFS="foo";ORS="bar"} {print NF, $0}'

Yup. It wasn't fun to type (using Ctrl-V in bash to input the raw control characters), but it works fine with the ASCII separators as well:

    $ echo -n 'a^_1^^b^_2^^c^_3^^' |awk 'BEGIN{FS="^_"; RS="^^"} {print $1": "$2}'
    a: 1
    b: 2
    c: 3

> Alas, I don't think this works with the standard Unix tools.

It kind of works with standard unix tools.

    cut -d$'\37' -f ...
    sort -t$'\37' -k ...
    join -t$'\37' ...
Those will parse ASCII-31-separated fields. But records are still newline separated, no way to change that AFAIK, short of running everything through

    tr '\036' '\n'
first. Which defeats the purpose of choosing "weird" delimiters in the first place.

(Also note that the $'\..' syntax is bash-specific and doesn't exist in POSIX sh.)

My guess is that TSV/CSV won out simply because anyone can easily type those characters from any standard keyboard on any platform.

TSV/CSV characters were also pretty much guaranteed to exist no matter what kind of terminal was used, and not cause any side-effects. No doubt some teletypes & dumb terminals used those FS, GS, RS etc. characters for special features since they weren't likely to appear in printed data. And I know those characters are used for other things in PETSCII and ASCII. 0x1C, the File Separator in ASCII, is used to turn text red in PETSCII.

Sorry, that should say "PETSCII and ATASCII"

Meh. What if some data has ASCII 28-31 in it? If you're not using a "real" escaping mechanism, and instead relying on the assumption that certain characters don't appear in your data, then I don't see anything wrong with using \t and \n (ie TSV). Either way, you know your data, and you're using whatever fits it best.

If you need something that's never, ever going to break for lack of escaping, might I suggest doing percent-encoding (aka url encoding) on tabs ("%09"), newlines ("%0a") and percent characters ("%25")? Percent encoding and decoding can be made very fast, is recognizable to most developers, and can be used to escape and unescape anything, including unicode characters. Unlike C-escaping, which doesn't generalize and accommodate these things nearly so well.
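A sketch of that scheme in Python (hypothetical helper names; note the ordering: '%' must be escaped first and unescaped last, so the three sequences never collide):

```python
def pct_escape(field):
    # Escape only the three characters that matter for a TSV container.
    return field.replace("%", "%25").replace("\t", "%09").replace("\n", "%0a")

def pct_unescape(field):
    return field.replace("%09", "\t").replace("%0a", "\n").replace("%25", "%")

def write_row(fields):
    # After escaping, tab and newline are unambiguous separators.
    return "\t".join(pct_escape(f) for f in fields)

def read_row(line):
    return [pct_unescape(f) for f in line.split("\t")]
```

Worst case the data grows 3x (all percents/tabs/newlines), but typical text grows hardly at all.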

I think the answer is that those shouldn't occur within your data. If you're dealing with binary data, why are you using a text-based file format? If your data is textual, it shouldn't have control character delimiters within it, as they are reserved for that context.

So, strip them out of your data if you have to. If you think they need to be preserved or escaped, IMO you're doing something wrong.

It's nice to be able to use the unix toolset (grep, cut, sort, join, etc) on all kinds of data, not just strictly "textual" data.

They're often my tool of last and only resort when dealing with very large datasets. Sure, you could wait for that dump of all of wikipedia to import into a nice indexed and queryable database, but why not start grepping it immediately? Maybe you want to sort by a key that's textual, but there's satellite data that's non-textual. sort(1) is a pretty amazing program in terms of resource usage; it parallelizes, it makes efficient use of available memory and disk when merge-sorting.

Anyway, there are plenty of examples!

If you've actually got, say, a JPG file embedded in the middle of a CSV or something, it's just not designed for that IMO. But if you use the reserved control characters, you can at least output any text data without escaping (at least if "textual" is defined as "a string of characters that are not ASCII delimiter control chars", which ought to be easy to assume, unless something is corrupted or deliberately trying to mess things up.) I think you can actually more easily and reliably grep, because you don't need state to know whether the comma byte is a comma character or a delimiter. You simply use a comma when you mean a literal comma, and the control char when you want the delimiter. Anyway, as others have pointed out, there are other obstacles to widespread adoption of these control chars.
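To illustrate the statelessness (a toy Python sketch with hypothetical helper names; it assumes, per the scheme above, that fields simply never contain the separator bytes):

```python
US, RS = "\x1f", "\x1e"   # ASCII unit (field) and record separators

def write_records(rows):
    # No quoting, no escaping, no parser state needed.
    return RS.join(US.join(row) for row in rows)

def read_records(blob):
    return [rec.split(US) for rec in blob.split(RS)]
```

Commas, quotes, tabs, and even embedded newlines pass through untouched, so a grep hit is always a real hit.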

I agree that cut, sort, etc. are good to be familiar with. Someone else[1] linked a "csvquote" utility that pre-chews (and un-chews at the end of the text-processing pipeline) CSV data to make it work better with standard UNIX utilities. Looks neat, so I'll be keeping it in mind next time I'm processing CSV with UNIX utils.

[1]: https://news.ycombinator.com/item?id=7475793

What if your delimited file needs to contain other delimited files?

Then you should probably use something like JSON or XML, or at least do the file combining/splitting with something like tar or zip.

This is factually wrong about CSV, which can store any character including commas and even \0 (zero byte), provided it's implemented correctly (a rather large proviso admittedly, but you should never try to parse CSV yourself). Here is a CSV parser which does get all the corner cases right:


You can find a CSV parser that gets nearly all, if not all, the corner cases right for just about any language.

The problem is that the data you will be given to parse, by some third party agency, will quite likely not have been produced in such a way to get all the corner cases right.

As anyone who routinely has to parse such data is unpleasantly aware. So now you've got to manually fix data, or have a parser that _doesn't_ get the corner cases 'right' but instead uses heuristics to try to get what was intended out of your particular idiosyncratic and illegal data.

you should never try to parse CSV yourself

Why? Writing a correct parser is not significantly harder than figuring out how to interface to an existing parser library, and allows cool things like heuristic parsing of malformed files.

OTOH it's shocking how many people can't write a correct CSV generator, even after being explicitly told what they're doing wrong (which is always either "you need to put quotes around the data" or "you need to double any quotes that are part of the data") and given examples.

If you knew enough about CSV to be able to write a correct parser, then you'd know enough not to write one lightly.

Here are some surprisingly valid CSV files:


The test program in the same directory shows the semantic content of each.

There is no such thing as a valid CSV file.

The RFC declares itself "informational" and says things like there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files. This section documents the format that seems to be followed by most implementations:

There are certainly reasonable arguments about the useful subset of rules.

testcsv6.csv at that link is malformed. DQUOT is used to (1) escape itself, and (2) enclose strings. It is not a generalized escape character the way backslash is in C-family languages.

Interpreting it as a generalized escape causes two problems. One, if you generate files that way, they will be unreadable by parsers written according to the RFC. Two, if you read files that way, you will silently garble files generated by someone who forgot to escape the quotes that were part of their data.

I'm afraid you're wrong about this. Excel generates and parses "0 as a zero byte. The RFC doesn't discuss how CSV files work in the real world. This is exactly what I was talking about in my comment above.

Thanks, I came here to make sure someone said this. Seriously, escaping special characters in CSV isn't all that difficult if you do it right from the start.

Leaving aside the pain of displaying and typing such characters...

> Then you have a text file format that is trivial to write out and read in, with no restrictions on the text in fields or the need to try and escape characters.

Phrases like that lead to lovely security bugs.

The thing that leads to "lovely security bugs" is the nonchalant mindset; it has nothing to do with the simple text format. The same attitude paired with ASN.1 data has caused just as many vulnerabilities.

It's not just the nonchalant mindset; it's the thought that because you pick something you don't expect to form part of your input domain, you don't have to escape. Either you have to actually restrict your input domain, or you need escaping.

This, very much this!

Escaping should always be a consideration. Not thinking about it, thinking "it'll never happen", etc. is what leads to things like HTML and SQL injection vulnerabilities.

If you're inputting or outputting data in any format, always keep in mind things like "what are the delimiters? What if the data in the input/output contains them?"

How is this any different than reading ASN.1 data and not worrying about the size of integers?

Yes, but simple filtering is a lot easier than escaping.

How about everyone just started following the CSV spec? https://tools.ietf.org/html/rfc4180

Doesn't allow for tab-delimited or any-character-delimited text and handles "Quotes, Commas, and Tab" characters in fields.

I love the way that, in the frickin' formal spec, the presence of a header row is ambiguous, so every tool that ever deals with CSV has to ask a human whether or not a header row is present. Great design decision, that.

It's an artifact of building a standard around what people are already doing.

Nah. Two split() calls are much easier.

There actually are glyphs assigned to these characters, at least in the original IBM PC ASCII character set:

Ascii table for IBM PC charset (CP437) - Ascii-Codes


They correspond to these Unicode characters

    28  FS  ∟  221f  right angle
    29  GS  ↔  2194  left right arrow
    30  RS  ▲  25b2  black up pointing triangle
    31  US  ▼  25bc  black down pointing triangle
They may not be particularly intuitive symbols for this purpose though.

see also: IBM Globalization - Graphic character identifiers: http://www-01.ibm.com/software/globalization/gcgid/gcgid.htm... (then search for a code point, eg U00025bc)

Unicode code converter [ishida >> utilities]: http://rishida.net/tools/conversion/


Now I feel silly for having glossed over the control characters since I was a kid. Those characters are decidedly useful on a machine level, though the benefit of CSV/TSV is that it's human friendly.

Trivia: Carriage return and Line feed are separate characters because they used to be separate operations for devices like Teletypes. Want double-spaced text? CRLFLF. Working with a slow device? CRCRCRLF to give the carriage time to return.

Baudot4Life, yo.

Or DEL characters more likely.

The operators could have sent LTRS (all holes punched) but they never did -- their finger was already on CR, so they would just hit it a couple of times. Same net effect - delay until the carriage could return.

Which, BTW, was an indication that your machine needed service. The spring should have been wound tight enough and the track clean & oiled well enough to get the carriage back to the first column in time to not drop any characters. A pneumatic piston ("dash pot") slowed the carriage down as it approached the first column so it wouldn't crash into the stops and get damaged.

CSV is a solved problem - RFC 4180: http://tools.ietf.org/html/rfc4180#section-2

As used by Lotus 1-2-3 and undoubtedly others before there was an Excel.

Example record:

    42,"Hello, world","""Quotes,"" he said.","new
Now go write a little state machine to parse it... (hint: track odd/even quotes, for starters)

Coincidentally I did this just this week. I believe there's no need to track quotes, just if it has an opening quote or not. That RFC really explains it all very well. The only edge case I had was with empty records (foo,,bar) and that was probably due to my implementation.
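For reference, a minimal RFC 4180-ish state machine fits in a screenful of Python (a toy sketch, not the parser mentioned above; the only state really needed is "am I inside quotes"):

```python
def parse_csv(text):
    """Tiny RFC 4180-style parser: inside quotes, "" is a literal quote
    and commas/newlines are data; outside, they delimit fields/records."""
    rows, row, field = [], [], []
    in_quotes = False
    i = 0
    while i < len(text):
        c = text[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(text) and text[i + 1] == '"':
                    field.append('"'); i += 1   # escaped quote
                else:
                    in_quotes = False           # closing quote
            else:
                field.append(c)                 # commas/newlines are data here
        elif c == '"':
            in_quotes = True
        elif c == ',':
            row.append("".join(field)); field = []
        elif c == '\n':
            row.append("".join(field)); rows.append(row); row, field = [], []
        elif c != '\r':                         # tolerate CRLF line endings
            field.append(c)
        i += 1
    if field or row:                            # file without trailing newline
        row.append("".join(field)); rows.append(row)
    return rows
```

Feeding it the example record above (completing the truncated last field with a hypothetical second line) yields the four fields, with the embedded comma, quotes, and newline intact.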

Took some time to figure out how to type these on a Mac:

1. Go to System Preferences => Keyboard => Input Sources

2. Add Unicode Hex Input as an input source

3. Switch to Unicode Hex Input (assuming you still have the default keyboard shortcuts set up, press Command+Shift+Space)

4. Hold Option and type 001f to get the unit separator

5. Hold Option and type 001e to get the record separator

6. (Hold Option and type a character's code as a 4-digit hex number to get that character)

Sadly, this doesn't seem to work everywhere throughout the OS- I can get control characters to show up in TextMate, but not in Terminal.

In Terminal, they are:

  FS: Control-\ 0x1c (file sep)
  GS: Control-] 0x1d (group sep)
  RS: Control-^ 0x1e (record sep)
  US: Control-_ 0x1f (unit sep)
(These control key equivalents have always been the canonical keystrokes to generate the codes)

But they have to be preceded by a Control-V (like in vi) to be treated as input characters. Control-V is the SYN code (synchronous idle), but has no special meaning in an interactive context, which is presumably why it was chosen.

The full set of control codes (0x00 - 0x1f) and their historical meanings are why Apple added the open/closed Apple keys, eventually the Command key. They wanted a set of keystrokes that were unambiguously distinct from the data stream.

Control-S, e.g., will pause text output in the Terminal (also xterm, etc). This was super useful in the days before scrollback. :) Control-Q to resume (actually flush all the buffered output).

Overloading Control sequences was an unforgivable sin committed by Microsoft.

...if I remember the history correctly, Apple decided that having both open/closed Apple keys was confusing, and having the Apple logo on the keyboard was tacky, so they renamed the key for the Mac, and Susan Kare selected a new glyph, which is a Scandinavian "point of interest" wayfinding symbol.

...as a further aside, Control-N and Control-O are the cause of the bizarre graphical glyphs you sometimes see if you do something silly like cat a binary file. Control-N initiates the character set switch, and Control-O restores it. This can be used to fix your Terminal when things go awry. Most people just close the window, but I hate losing history. :)

0x20 - 0x74, unshifted:

0x20 - 0x74, shifted:

...works in Firefox. YMMV.

Terminal-charset-quickfix: at shell, type "echo ^O". To get the literal ^O, use Control-V then Control-O.

Most terminals will let you use ctrl-6 for ctrl-^ and ctrl-7 (and sometimes ctrl-/) for ctrl-_.

This seems a little better than it is. Those control characters are appealing because they're rarely used. Making them important by using them in a common data exchange format will dramatically increase the rate at which you find them in the data you're trying to store.

Ultimately, this is a language problem. If we invent new meta-language to describe data, we're going to use it when creating content. That means the meta-language will be used in regular language. Which means you're going to have to transform it when moving it into or out of that delimited file.

There is no fixed-length encoding you can use to handle meta-information without imposing restrictions on the content. You're always going to end up with escape sequences.

It doesn't solve the problem, although it does make it far less likely to run into it.

For a trivial example, try building an ASCII table using this format, with columns for numeric code, description, and actual character. You'll once again run into the whole escaping problem when you try to write out the row for character 31.

Sure, but this is a very special application. Whitespace and '"' are much more common in normal text, so using dedicated characters for telling entries apart should be superior in almost all cases.

(The problem reminds me of what it's like using "/" as a delimiter when using `sed` to edit file paths.)

Right, but the "almost all" gives me pause.

I'm wary of anything that solves a problem partially while still remaining vulnerable in the end, because it can discourage properly solving the problem. Rather than play musical chairs with the separator character to try to minimize the chance of a conflict, I'd rather see a sensible encoding/escaping scheme used to eliminate that chance entirely.

For CSV forbidding commas in data is not practical.

For ASCII delimiters, forbidding ASCII delimiters in data is practical.

Sure - you can't, say, nest ASCII tables into one another due to this limitation.

But for simple structure, it doesn't hurt to have ASCII separators in the toolbox.

The only big problem I see is that they're rendered as invisible characters, which will make debugging harder. If we wouldn't have abandoned and forgotten the special ASCII chars, this wouldn't be the case.

If your dev tools show special chars (like mine do), then it's perfectly fine to use them.

> Sure - you can't, say, nest ASCII tables into one another due to this limitation.

In hindsight it's too bad we don't have similar characters that follow a more sexpr-ish layout - say, ListStart, ListEnd, and Delimiter. Then you could tree them endlessly. If you wanted to be really fancy you could add an "assignmentSeparator" character to officially bless key-value-pairs and encompass a nice JSON-ish format, but Lisp pretty-well demonstrates that isn't necessary.

But in hindsight it's just too bad we don't use these control characters at all.

You know what's nicer than delimiting beginnings and ends of things? Length prefixing. Protocol message formats and data encoding formats both already know what they're going to say before they say it, and so know its octet length.

The only reason to use delimiters, ever, is for user-modifiable data (e.g. source code) where you might want to insert or delete characters and have the containing block remain valid.
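The whole idea is only a few lines (a Python sketch with hypothetical frame/unframe names; 4-byte big-endian length prefix):

```python
import struct

def frame(payload: bytes) -> bytes:
    # Length prefix, then payload verbatim: no byte value can collide
    # with a delimiter, so no escaping is ever needed.
    return struct.pack(">I", len(payload)) + payload

def unframe(buf: bytes):
    """Yield each complete message from a buffer of framed messages."""
    off = 0
    while off + 4 <= len(buf):
        (n,) = struct.unpack_from(">I", buf, off)
        if off + 4 + n > len(buf):
            break                   # incomplete tail; wait for more data
        yield buf[off + 4 : off + 4 + n]
        off += 4 + n
```

The reader seeks past each message in O(1) instead of scanning every byte for a delimiter.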


And now, a fun tangent, to prove that how deeply-rooted this confusion is in CS: user-modifiable data was originally the sole use-case for \0-terminated "C strings" in C.

C has two separate types which get conflated nowadays: char arrays, and \0-terminated strings. Most "strings"--as we'd expect to find them in other languages--were, in C, actually char arrays: you knew their length, either because they were string literals and you could sizeof them, or because you had #defined both FOO and FOO_LEN, or because you had just allocated len bytes on the heap for foo, so you could just pass len along with foo. Because you knew their length, you didn't need to use the string.h functions to manipulate them. It was idiomatic (and perfectly-safe) C, when dealing with char arrays, to just iterate through them with a for loop.

The concept of \0-termination, and thus what we think of as "C strings", only applied to string buffers: fixed-size, stack-allocated, uninitialized char arrays. The string.h functions are all meant to be employed to manipulate string buffers, and the \0 is intended to mark where the buffer stops being useful data, and starts being uninitialized garbage.

The strings in string buffers had short lifetimes, and didn't usually outlive the stack frame the buffer was declared in. Generally, you'd declare a string buffer, populate it using some combination of string literals, strcat(3), sprintf(3), and system calls, and then pass the string--still sitting inside the buffer--to a system call like fstat(2) to get what you're really after. That would be the end of the both string buffer's, and the string's, lifetime.

If you ever did want to preserve the contents of a string buffer into something you could pass around, though, this would be idiomatic:

    int give_me_a_path_string(char **out) {
      char buf[MAX_PATH];

      /* ... */

      int len = strlen(buf);
      *out = memcpy(malloc(len), buf, len);

      return len;
    }
Note that, after this function returns, the pointer it has written to doesn't point to a "C string": instead, it's a plain pointer to a heap-allocated array of char, with exactly enough space to hold just those characters. If you want to know how big it is, you look at the return value.


• C has "C strings", but they were only intended as buffers.

• C also has "char arrays", which are really what you should think of as C's equivalent to a "string" datatype. char arrays, not "C strings", are the fundamental data structure for representing and persisting strings in C.

• char arrays are less like "C strings" than they are like Pascal strings: they come in two parts, a block of memory N chars wide, and an int containing N. You don't examine the block to determine the length; the length is explicit.

• Pascal (and thus most modern languages with strings) put both the length and the character-block on the heap as a unit. C puts the character-block on the heap, but puts the length on the stack. This is more efficient under C's Unix-rooted assumptions: you need the length on the stack if you want to work with it to immediately shove the string through a pipe.

The problem: I have never encountered length-prefixed data. Ever. Every data interchange file I've ever dealt with has been either delimited or fixed-width fields (and the widths are not defined anywhere in the file).

Examples of length-prefixed data abound in protocols and formats defined by systems and telecom engineers (e.g. the IETF). IP packets are length-prefixed. ELF-binary tables and sections are length-prefixed. PNG chunks are length-prefixed.

It's just these worse-is-better text-based protocols like HTTP, created by application developers, that toss all the advantages of length-prefixing away. (And, even then, HTTP bodies are length-prefixed, with the Content-Length header. It's just the headers that aren't.)

The only problem with length prefixing is that it interferes with streaming data, because you need to know the full length in advance. Thus HTTP chunked encoding. Still, it works great in most scenarios.

My favorite way to deal with this stuff is Consistent Overhead Byte Stuffing:


In short, you take the data and encode it with a clever scheme that effectively escapes all the zero bytes. The output data contains no zeroes, but results in almost no overhead, with the worst case being an increase of 1/254 over the original size, and the best case being zero increase. (Compare to e.g. backslash escapes of quotes in quoted strings, where the worst case doubles the output size.) You then use the now-eliminated zero byte as your record separator. This lets you stream data (with a small amount of buffering to perform the encoding) while still easily locating the ends of chunks.

I've played around with COBS but never used it in a real product, so this is not entirely the voice of experience here. But it is a nifty system.
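The algorithm is small enough to sketch from the description (my own toy Python transcription, not from any library; each code byte records the distance to the next zero):

```python
def cobs_encode(data: bytes) -> bytes:
    """COBS: output contains no zero bytes; max overhead 1 byte per 254."""
    out, block = bytearray(), bytearray()
    for b in data:
        if b == 0:
            out.append(len(block) + 1)   # code byte for this zero-free run
            out += block
            block.clear()
        else:
            block.append(b)
            if len(block) == 254:        # full run: code 0xFF, no implied zero
                out.append(255)
                out += block
                block.clear()
    out.append(len(block) + 1)           # final (possibly empty) run
    out += block
    return bytes(out)

def cobs_decode(enc: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(enc):
        code = enc[i]
        out += enc[i + 1 : i + code]
        i += code
        if code < 255 and i < len(enc):  # 0xFF code carries no implied zero
            out.append(0)
    return bytes(out)
```

You'd then append the now-unused 0x00 after each encoded chunk as the record separator.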

that is just freaking cool. took me about 4 times to grok it. it sort of reminds me of utf-8, and how you can synchronize that easily.

In contrast, I just got done designing an internal protocol today that has a length prefix.

My team pretty much length prefixes everything. :)

For ASCII delimiters, forbidding the delimiter characters in the data is practical.

Data formats that can't handle recursion are never practical. They only work until they don't, at which point they're entrenched and impossible to replace.

You may have heard about spreadsheets or relational databases...

Yes. I've seen spreadsheets where some of the cells contain CSV data. I've heard recent news of relational databases adding support for JSON columns and seen them used to store XML, and know that a LOB field could be used to store an image of another database. I've seen hierarchical key/document databases which contain other hierarchical key/document databases (and relational databases, and anything else you can imagine).

If only they had included an official escape character in ascii ... oh wait

That ESC is there to allow more control functions to be defined than were initially put into the standard. Consider, for instance, how ANSI colors are written to the terminal: they are control sequences introduced by the escape character (ESC; I'm not sure if the original designers meant to be meta), identified by the prefix ESC [.
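To illustrate (a trivial sketch of my own): an ANSI color sequence is just the ESC byte (0x1B) followed by '[', some parameters, and a command letter.

```python
ESC = "\x1b"  # the ASCII escape character, code 27

# ESC [ 31 m selects a red foreground; ESC [ 0 m resets attributes.
warning = f"{ESC}[31mwarning{ESC}[0m"
print(warning)  # shows "warning" in red on an ANSI-capable terminal
```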

I'm pretty sure that's later development. Otherwise it would be called "prefix" or "extend".

Unfortunately I cannot quickly find a better source than:

"The "escape" character (ESC, code 27), for example, was intended originally to allow sending other control characters as literals instead of invoking their meaning."


I think people are missing the fact that you have a "control" key on your keyboard in order to type control characters. (Of course, control is now heavily overloaded with other uses.)

Pick databases have used record marks, attribute marks, value marks, sub-value marks, and sometimes sub-sub-value marks in character codes 251-255 since the late 1960s. Like the control characters this blog post recommends, the biggest obstacle for Pick developers working on modern terminals is how on Earth to enter or display these characters. There's also the question of how to work with them in environments that strip out non-printable characters.

This isn't some clever new discovery. It's begging us to repeat the same mistakes that led to the world adopting printable ASCII delimiters in the first place.

Awesome! I was going to make a comment about Pick but you beat me to it. The challenge we had with Pick style involved customers using codepages that required these characters in text.

Encodings aside, the principle of having a hierarchy of delimiters can be hugely powerful.

It is now 2014. The world doesn't use ASCII, you will still need escaping for binary or misformatted data, and overall the idea of mapping control characters and text into one space is dead and dusted. Don't do it, don't let other people do it, use a reasonable library that handles the bazillion edge cases safely if you need to parse or write CSV and its ilk.

In science, a lot of people use ASCII and flat files. I used to really dislike it, but over time I understood that there are certain practical reasons to do this which deserve respect.

Due to the volume and novelty of data that we work with, we are often pushed into a corner between human time and machine time. Each data set comprises a new set of concepts, and each is huge. In this corner, sometimes a character-delimited file is the best solution. There is not time to carefully craft a binary format and then document it so it will not be forgotten later, nor is there time to wait for a general-purpose format parser to operate on tens of billions of records. We need a solution that can be designed in 1 minute and be legible by all of our tools without modification.

Typically, I have used tabs in the place of the ASCII separators. This ensures readability without any kind of parsing. Also, this lets me use the default behaviors of well-worn, bug-free tools in the core of the Unix toolchain for basic data processing tasks. Frankly, this is not a bad compromise.

If you are passing messages around a web stack, JSON, XML, and friends are ideal solutions. If you have to occasionally deal with CSV, use a parser. I just want to note that for many tasks in data analysis, it's OK to simply use the dead and dusted convention of mixed delimiters and data.

As these things develop, I will be trying to investigate how to use more modern formats such as binary JSON representations in my work, and I'd be curious what solutions people here suggest for working with very large data (e.g. many trillions of observations).

Unicode has the initial 32 control characters, so this is technically still relevant and useful information for processing text data, even in 2014.

Makes sense, but it's practically a (very simple) binary storage format at that point. You can't count on being able to edit a document with control characters in a text editor. And I wouldn't trust popular spreadsheet software with it either.

Related: This tool:


will convert all the record/field separators (such as tabs/newlines for TSV) into non-printing characters and then in the end reverse it. Example:

    csvquote foobar.csv | cut -d ',' -f 5 | sort | uniq -c | csvquote -u
It's underrated IMO.

Thanks for bringing up csvquote. I wrote it last year, and am happy to hear that other people find it useful.

It is indeed a simple state machine (see https://github.com/dbro/csvquote/blob/master/csvquote.c), and it translates CSV/TSV files into files which follow the spirit of what's described in the original article in this thread.

But instead of using control characters as separators, it uses them INSIDE the quoted fields. This makes it easy to work with the standard UNIX text manipulation tools, which expect tabs and newlines to be the field and record separators.

The motivation for writing the tool was to work with CSV files (usually from Excel) that were hundreds of megabytes. These files came from outside my organization, and often from nontechnical people - so it would have been difficult to get them into a more convenient format. That's the killer feature of the CSV/TSV format: it's readable by the large number of nontechnical information workers, in almost every application they use. I can't think of a file format that is more widely recognized (even if it's not always consistently defined in practice).
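For readers curious about the approach, here is a simplified Python sketch of the csvquote idea (the real tool is a C program and handles more corner cases; `sanitize` is my own hypothetical name):

```python
# Inside quoted fields, swap the delimiter characters for non-printing
# substitutes so that line-oriented Unix tools see one record per line,
# then restore them afterwards.
SUBS = {",": "\x1f", "\n": "\x1e"}

def sanitize(text: str, restore: bool = False) -> str:
    if restore:
        # The substitutes never occur elsewhere, so restoring
        # needs no quote tracking at all (like csvquote -u).
        inverse = {v: k for k, v in SUBS.items()}
        return "".join(inverse.get(c, c) for c in text)
    out, in_quotes = [], False
    for c in text:
        if c == '"':
            in_quotes = not in_quotes
            out.append(c)
        elif in_quotes and c in SUBS:
            out.append(SUBS[c])
        else:
            out.append(c)
    return "".join(out)

s = 'a,"b,c"\n1,"d\ne"\n'
assert sanitize(s) == 'a,"b\x1fc"\n1,"d\x1ee"\n'
assert sanitize(sanitize(s), restore=True) == s
```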

This reminds me of a depressing bug I run into frequently. I do a bit of work integrating with an inventory management program. Their main method of importing/exporting information is via CSV. The API also imports and exports via CSV, except whoever wrote the code that handles the imports decided not to use any sort of sensible library. Instead they use a built-in function that splits the string based on commas with absolutely no way of escaping, so that there is no way to include a comma in a field.

It's led to many a headache.

I deal with a vendor who occasionally sends us files without the double-quote character escaped. I feel your pain.

I used to work in an industry where different vendors passed around massive CSV files. If there was a way to abuse CSV, someone had done it, no two of them were exactly alike.

I have created a (work-in-progress) Vim plugin [1], that uses Vim's conceal feature to visually map the relevant ASCII characters to printable characters.

It sort of works, but there are known issues which I have listed in the README.

[1] : https://github.com/hrj/vim-adtConceal

It does not solve the problem. Here are the points as I see them:

1. Control characters are not supported in almost all text editors.

2. Control characters are not human friendly.

3. The text may contain control characters in the field values.

In any format, we cannot avoid escape characters, so I think even the CSV/TSV format is reasonable.

You are correct in that it does not solve a problem. Furthermore, the article tries to create a problem with CSV that does not exist.

> CSV breaks depending on the implementation on Quotes, Commas and lines

CSV does not break; the implementation is broken if it doesn't parse CSV properly. With a proper implementation, CSV solves every problem that will arise from this method.

The problem with CSV is that it looks so simple that nobody ever uses a real library to do it—they just roll their own. So you end up with a million implementations that are all buggy in various different ways. If you receive a CSV formatted file you can never be sure if it's actually good, valid CSV, or some invalid crap from that some programmer that reinvented the wheel because it was "so easy".

And as a side effect, if you are relying on lots of data files provided by other people, you inevitably end up with a library of 57,000 parsers, 55,000 of which are for different, slightly broken CSV files.

"Alright, let me get you some quick test data. Just need to find the 0x1F key on my keyboard... or 0x1E? Wait, is this a new row or a new column? What was the vim plugin for this?"

And then someone wrote an open source CSV parsing library that handles edge cases well and everyone forgot these characters existed.

This is a good illustration of how the hard part isn't "solving the problem" -- it's getting everyone to adopt and actually _use_ the standard.

Reminding everyone that an unused, unloved standard exists is just reminding everyone that the hard part went undone.

I actually really appreciate this article, though I've known about it for decades now. In fact, I used to return javascript results in a post target frame back in the mid-late 90's and would return them in said delimited format... field/record/file separated, so that I could return a bunch of data. Worked pretty well with the ADO Recordset GetString method.

Of course, I was one of those odd ducks doing a lot of Classic ASP work with JScript at the time.

Here's an implementation of ASCII Delimited Text in Ruby using the standard csv library: https://gist.github.com/christiangenco/73a7cfdb03e381bff2e9

The only trouble I ran into was that the library doesn't like getting rid of your quote character[1], and I don't see an easy way around it[2].

That said, I really don't like this format. The entire point of CSV is that you have a serialization of an object list that can be edited by hand. Sure using weird ASCII characters compresses it a bit because you're not putting quotes around everything, but if you're worried about compression you should be using another form of serialization - perhaps just gzip your csv or json.

In Ruby in particular, we have this wonderful module called Marshal[3] that serializes objects to and from bytes with the super handy:

    serialized = Marshal.dump(data)
    deserialized = Marshal.load(serialized)
    deserialized == data # returns true
I cannot think of a single reason to use ASCII Delimited Text over Marshal serialization or CSV.

1. ruby/1.9.1/csv.rb:2028:in `init_separators': :quote_char has to be a single character String (ArgumentError)

2. http://rxr.whitequark.org/mri/source/lib/csv.rb

3. http://www.ruby-doc.org/core-2.1.1/Marshal.html

The big problem with the two markers mentioned in the post is they are not part of the visible character set. Using a comma delimiter is good as it is visible, you can just use a basic text view to see it.

A tab delimiter is not preferable as it is not visible, and can be problematic to parse via command line tools (ie what do I set as the delimiter character?).

I think the whole point of having ASCII delimited text files is to have human readable data in them.

If you're using command line tools, others have posted how to use them.

C-v-shift-_ and C-v-shift-^ both work for me.

They print a little strangely, but if you were really dedicated to the idea, you could alias the tools you use to use these by default for their input and output separators.

You know where this is useful? Databases.

No, please, put the gun down... let me explain. Sometimes you have a database that's so complex and HUGE that changing tables would be a nightmare, or you just don't have the time. You have a field that you want to shove some serialized data into in a compact way and not have to think about formatting. You could use JSON, you could use tabs or csv, but both of those require a parser.

With these ascii delimiters you can serialize a set of records quickly and shove them into a string, and later extract them and parse them with virtually no logic other than looking for a single character. And because it's a control character, you can strip it out before you input the data, or replace control characters with \x{NNN} or similar, which is still less complex than tab/csv/json parsing.

Granted, the utility of this is extremely limited, probably mainly for embedded environments where you can't add libraries. But if you just need to serialize records with the simplest parsing imaginable, this seems like an adequate solution.
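A sketch of that approach in Python (names are mine; the key point is that parsing is a bare `split`, with the separators stripped from incoming data instead of escaped):

```python
RS, US = "\x1e", "\x1f"  # ASCII record separator and unit separator

def serialize(records):
    # Strip the separators from incoming data, as described above,
    # so the format needs no escaping at all.
    clean = lambda s: s.replace(RS, "").replace(US, "")
    return RS.join(US.join(clean(field) for field in rec) for rec in records)

def parse(blob):
    # Parsing is "virtually no logic": two nested splits.
    return [rec.split(US) for rec in blob.split(RS)] if blob else []

rows = [["1", "widget", "a,b"], ["2", "gadget", 'say "hi"']]
assert parse(serialize(rows)) == rows  # commas and quotes need no handling
```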

You just described a Pick or "multivalue" database. They were a nightmare to work on, but I'll admit that's mostly because of the tools (or lack thereof). It led to people storing all sorts of different data in one table and the queries got really messy because multivalue fields had to be treated differently than regular ones.

I never knew about these, which is a bit embarrassing considering how long I've been in the data munging field. :)

I agree with several other comments that the biggest issue is not being able to represent them in an editor. If you use some form of whitespace, then it is likely to lead to confusion with the whitespace characters you are borrowing (i.e. tab and line feed). If you use special glyphs, then you have to agree on which ones to use, and it still doesn't solve the problem of readability. Without whitespace such as tab and line feed, all the data would be a big unreadable (to humans) blob, and with whitespace, it would lend confusion about what the separator actually is. Someone might insert a tab or a linefeed, intending to make a new field or record, and it wouldn't work. If the editor automatically accepted a tab or linefeed and translated it to US and RS, then there would have to be an additional control to allow the user to actually insert the whitespace characters that this is supposed to enable. :/

CSV, if you do it as in RFC 4180 [1], already has everything the link describes, plus pretty good interoperability with most things out there. If you abused CSV you could even store binary data, while ASCII delimited text has no standard way to escape the delimiter characters.

1: http://tools.ietf.org/html/rfc4180
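For example, Python's standard csv module applies RFC 4180-style quoting, so commas, quotes, and even newlines in fields round-trip with no custom delimiter:

```python
import csv
import io

buf = io.StringIO()
# QUOTE_MINIMAL (the default) quotes only the fields that need it.
csv.writer(buf).writerow(["plain", "has,comma", 'has "quote"', "has\nnewline"])
data = buf.getvalue()

rows = list(csv.reader(io.StringIO(data)))
assert rows == [["plain", "has,comma", 'has "quote"', "has\nnewline"]]
```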

While we're on the subject, we should probably be using control code 16 (Data Link Escape) instead of the backslash character to escape strings.

The problem is, of course, that we can't see it (no glyph) and we can't "touch" it (no key for it) so people won't use it. Ultimately, we're all still stick-wielding apes.

All this is making me realize that we also could have easily avoided having confusion about what's a command line argument separator and what's part of a file name if some of these were keyboard keys. "Field separators" to break up your command line, and regular spaces as just regular spaces? Hell yes please.

I've used these in ASCII files and they are quite useful. But as most folks point out, actually using control characters for "control" conflicts with a lot of legacy usage of "some other control." Which is kind of too bad. Maybe when the world adopts Unicode we'll solve this, oh wait...

I bet someone stubborn would find a way to screw it up anyway (I guess stubborn people are the biggest problem with csv...).

Having read through all the comments, I think the only real benefit to using the control characters is in the original intent: a flat file that represents a file system, with file separators (FS), group separators (GS) for table-like groupings, record separators (RS), and unit separators (US) to identify the fields in a record, storing only printable ASCII values.

This isn't intended to be a data exchange format; it is a serial data storage format. In this way, there may be some valid usages, but modern file systems do not need this sort of representation and it has no real benefit over *SV formats for most use cases. I suppose it could still be used for limited exchange, but since it can't be used for storing binary data, much less Unicode (except perhaps UTF-8), other formats are less ambiguous and more capable.
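That four-level hierarchy maps naturally onto nested splits; a Python sketch (the function name is mine):

```python
# ASCII file, group, record, and unit separators.
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"

def parse_stream(text):
    # file separator -> group (table) -> record (row) -> unit (field)
    return [[[rec.split(US) for rec in grp.split(RS)]
             for grp in f.split(GS)]
            for f in text.split(FS)]

# One file, one group, two records of two fields each:
assert parse_stream("a\x1fb\x1ec\x1fd") == [[[["a", "b"], ["c", "d"]]]]
```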

Haha! Oh wow, I just finished a project dealing with this exactly. The obvious problem is that most editors make dealing with the non-standard keyboard keys very difficult. As a consequence, most programs (Python, MatLab, etc) really don't like anything below 0x20. I was reading in binary through a serial port, and then storing data for records and processing. Any special character got obliterated in the transfer to MatLab, Python, etc. I ended up storing it as a very long HEX string and then parsing that sucker. I'd have loved to use special characters to have it auto sort into rows and columns, but that meant having it also escape things and wreck programs. Ces la vie.

C'est la vie ;)


Grazie |-)

Day nada

I will be sure to use this if I ever encounter data that is guaranteed to be pure ASCII again.

Protip - if it doesn't appear on keyboards, you can use ALT+DDD (DDD being 000 to 255) to enter a control character. For those on Windows, drop into a command prompt and hold ALT while pressing 031 on the numpad. You will see it produce a ^_ character.

On what computer? Is this windows only? Now that I use a mac this might be the only thing I miss from windows computers.

As htp mentioned above (https://news.ycombinator.com/item?id=7474951), you can do this on a Mac by using Unicode Hex Input as your input source.

Yes, I think that is windows only.

So on Mac OS X Mavericks: http://support.apple.com/kb/PH13867

Devil's advocate - CSV is superior, because edge case bugs (a comma in the data) are likely to be tested.

The edge case bugs in ASCII codes could still crop up. It shouldn't, but then, valid SQL shouldn't crop up in a web form either. And when it does, we'll need escape codes just like CSV, only it won't be well tested in all the tools (because it's not going to frequently happen).

It's like all the OSS advocates laughing at Microsoft's idiotic "My Documents" folder. It's not there because they didn't realise how much trouble it would cause programmers; it's there because they wanted to force people to deal with edge cases.

Why not use the non-printing char as the comma instead of the record separator?

1. Replace all the commas in the text with the unique non-printing char before converting to CSV.

2. Convert this char back to a comma when processing the CSV for output to be read by humans.

Because commas in text are usually followed by a space, the CSV may still even be readable when using the non-printing char.

I must admit I've never understood why others view CSV as so troublesome vis-a-vis other popular formats.

    in:  sed 's/,/%2c/g'
    out: sed 's/%2c/,/g'

I guess I need someone to give me a really hairy dataset for me to understand the depth of the problem with CSV.

Meanwhile, I love CSV for its simplicity.

that's exactly what https://github.com/dbro/csvquote does for commas and newlines both.

Why use this instead of sed, awk, flex, lua, etc.?

sed does the job and on almost all UNIX clones it never needs to be installed.

Because it's already there.
