
The fact that ASCII does not have balanced quotes is one of the great catastrophes of computing. It makes everything more complicated than it needs to be, from embedding code in strings, to parsing CSV files, to regexps. For example, if I want to embed a quoted string in another quoted string, I have to escape the inner quotes like so:

"This is a string containing an embedded \"quoted\" string"

Then I have to think about whether or not the system I'm going to send that string to is going to "helpfully" remove the backslashes, in which case I need to write:

"This is a string containing an embedded \\\"quoted\\\" string"

God help you if you want to go two levels deep.

All this horrible complexity could have been avoided if we could just write:

«This is a string containing an «embedded» quoted string»
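To make the "two levels deep" pain concrete, here is a minimal Python sketch. It uses json.dumps as a stand-in for "a system that adds one layer of quoting", which is my assumption and not something the comment above specifies. Each layer has to escape both the quotes and the backslashes of the layer below, so the backslash count grows quickly.

```python
import json

inner = 'This is a string containing an embedded "quoted" string'

# Each pass through json.dumps adds one level of quoting and escaping.
level1 = json.dumps(inner)
level2 = json.dumps(level1)

print(level1)
print(level2)

# Count the backslashes needed at each level.
print(level1.count("\\"), level2.count("\\"))

# Undoing both levels recovers the original text.
assert json.loads(json.loads(level2)) == inner
```

With a nestable quote pair, the inner quotes would not need any of these backslashes at all.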


The complexity might be minimized, but not avoided. You would still need an escape mechanism for something like «She said «The \» key on the server doesn't work.»»

ASCII did add <>, [], and {}, any of which could have been used for quoted strings, had the programming language designers chosen that option.

https://en.wikipedia.org/wiki/String_literal#Paired_delimite... points out that PostScript and Tcl have a string literal which allows matched quotes.

  PostScript: (The quick (brown fox))
  Tcl: {The quick {brown fox}}

Ruby lets you use arbitrary delimiters for string literals with %q{} (where the braces can be swapped for a bunch of other characters). I wish more languages would adopt this tbh.

C++11 has this feature too [1], e.g.:

    const char * str = R"*^*(This is a string containing an embedded "quoted" string)*^*";
[1] http://en.cppreference.com/w/cpp/language/string_literal

Apache Groovy also had that in its early 1.0 betas, but it was removed before their official 1.0 release party.

Ruby lifted that from Perl.

  say qq<I can do '" in here>;

C++ has something similar.

> You would still need an escape mechanism for something like...

Yes, but that's a pretty rare case, much more so than embedded strings.

Even that case could be solved by having two different quotes, like Python which allows both 'string' and "string". So you could do:

«This is a string that mentions the ” character without escaping it»

“This is a string that mentions the « character without escaping it”

Yes, there are still some edge cases, like embedding both “ and « in the same string. But that's really rare.

You don't want any 'rare' cases at all. That's the point.

Stop using "punctuation" when you are attempting to "delimit" text. Use a character that is not punctuation, specifically designed for "field delimiter" purposes.

Trying to do two things at once is ridiculous.

If your text is always valid UTF-8, there are various illegal UTF-8 octets available for this purpose: 0xff, 0xfe, and so on. Unlike null terminators or record separator characters, these characters are guaranteed not to exist in your string by the UTF-8 validation code you're already running.
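A rough Python sketch of that idea, assuming every field is a str that encodes cleanly to UTF-8. The helper names join_fields and split_fields are made up for illustration. The byte 0xFF cannot occur in valid UTF-8, so it is safe as a separator as long as the fields are validated (here, by encode and decode).

```python
DELIM = b"\xff"  # 0xFF never appears in well-formed UTF-8

def join_fields(fields):
    """Encode each field as UTF-8 and join with the 0xFF byte."""
    return DELIM.join(f.encode("utf-8") for f in fields)

def split_fields(blob):
    """Split on 0xFF; decode() raises if a field is not valid UTF-8."""
    return [part.decode("utf-8") for part in blob.split(DELIM)]

fields = ['He said "hi"', "commas, tabs\t and\nnewlines", "«guillemets»"]
blob = join_fields(fields)
print(split_fields(blob) == fields)
```

The obvious cost is that the result is no longer a valid UTF-8 text file, so ordinary text tools may refuse it.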

I've been trying to push TSV (tab separated values) as a standard response/implementation when they ask for CSV. "Yes but it's comma separated!" Sure it is, but text can contain commas... I have seen issues with Google Spreadsheets not recognizing the tabs, however... Excel doesn't know what to do with a TSV either. But both have a complete wizard for parsing CSV...

> but text can contain commas

Erm.. text can contain tabs, too. This problem was solved so, so long ago when all the various ANSI/ASCII/whatever encodings were compiled by specifically reserving not one but two characters precisely to serve as field and record delimiters.

Codes 30 and 31 solved not only the problem of commas or tabs in your text preventing you from treating them as field delimiters, but also let you include new lines and carriage returns in your fields, too!

30 (0x1E) is the record separator and 31 (0x1F) is the unit separator (aka field delimiter).

I _believe_ there was a record key on some standardized keyboard layout back in the day, too.

Edit: sorry, I originally wrote these as 0x30 and 0x31; they are decimal, not hex. Thanks @jrochkind1
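For what it's worth, a small Python sketch of using the ASCII separators: US (decimal 31, 0x1F) between fields and RS (decimal 30, 0x1E) between records. The function names dump_table and load_table are hypothetical, and the sketch assumes the data itself never contains either control character.

```python
US = "\x1f"  # unit separator, decimal 31: between fields
RS = "\x1e"  # record separator, decimal 30: between records

def dump_table(rows):
    """Join fields with US and records with RS; no quoting needed."""
    return RS.join(US.join(row) for row in rows)

def load_table(text):
    """Inverse of dump_table for data free of US and RS characters."""
    return [record.split(US) for record in text.split(RS)]

rows = [
    ["full name", "address"],
    ["Homer Simpson", "742 Evergreen Terrace,\nSpringfield"],
]
print(load_table(dump_table(rows)) == rows)
```

Commas, tabs, quotes, and newlines all pass through untouched because none of them is a delimiter here.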

I find no matter what you do you will _sometimes_ need escaping. There will, eventually, come a time when you want to embed an ASCII 30 RS (Record Separator; "0x" usually means hex, and this is decimal 30, hex 0x1E) in some record delimited by RS. So you'll need some method of escaping anyway. Or it'll be annoying.

I have spent some time working with MARC 21 binary encoding (used for library cataloging records) which uses ASCII 0x1D, 0x1E, and 0x1F as delimiters. I would def not call it appreciably more _convenient_ than a more modern 'text' record format. If it has benefits, convenience isn't really one of them.

I think it's common to use ESC (0x1b) and then set the high bit on the next byte, so ESC itself would be sent as 0x1b, 0x9b.
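A hedged sketch of that ESC scheme in Python, assuming the only reserved bytes are ESC (0x1b) and RS (0x1e); the set of reserved bytes and the function names are my own choices for illustration. A reserved byte is sent as ESC followed by the same byte with the high bit set, so 0x1b becomes 0x1b, 0x9b.

```python
ESC = 0x1B
RS = 0x1E
RESERVED = {ESC, RS}  # assumption: only these two bytes need escaping

def escape(payload: bytes) -> bytes:
    """Replace each reserved byte with ESC followed by (byte | 0x80)."""
    out = bytearray()
    for b in payload:
        if b in RESERVED:
            out.append(ESC)
            out.append(b | 0x80)
        else:
            out.append(b)
    return bytes(out)

def unescape(data: bytes) -> bytes:
    """Undo escape(): after ESC, clear the high bit of the next byte."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b == ESC:
            nxt = next(it, None)
            if nxt is None:
                raise ValueError("dangling ESC at end of data")
            out.append(nxt & 0x7F)
        else:
            out.append(b)
    return bytes(out)

sample = b"field one\x1efield \x1b two"
print(escape(sample))
print(unescape(escape(sample)) == sample)
```

After escaping, a raw RS byte never appears inside a payload, so a reader can split on RS first and unescape each piece afterwards.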

Yes, but at that point the text file is basically binary - it contains exotic characters that confuse most text editors and can't be typed.

I know XML et al are frustrating, but I'd rather see them than a "creative" solution. It seems like 60% of the reason we still have to deal with archaic flat formats is support for Excel.

I have to suspect the fatal flaw is that these code points don’t look like anything and can’t be found on the keyboard.

Granted, that’s the whole point, but it also makes authoring and instruction harder. (And we all know how many programmers are really just competent copy-pasters.)

Right. And if you add 28 (file separator, 0x1C) and 29 (group separator, 0x1D) to the mix, then you have a whole set of nice options to concatenate multiple data files into a single stream, etc.

There is actually an ASCII character that doesn't appear in strings, and is meant to be used as such a separator. Actually two of them, record (30) and unit separator (31).

Sadly, eventually someone will want to enter one document as a field in another document and then you end up needing escaping anyways. Using a rare symbol for the delimiter would still be nice for typing documents by hand, but it would have to be available on modern keyboards to be convenient.

>Sadly, eventually someone will want to enter one document as a field in another document and then you end up needing escaping anyways.

Yes, but CSV files are record collections, they are not in 99% of cases recursive like that.

If a column contains escaped secondary documents, there's something wrong.

Excel will convert any tabulated text file into a spreadsheet regardless of the delimiters, or even lack of, as you can set which character(s) to delimit by or even just go by column numbers for tables of fixed widths. This is actually one of the few things Excel gets right with regards to CSV files as I've found it a horrid tool if you need to save any changes and preserve the original formatting of the CSV file (even the data itself gets altered!!)

Also most CSV parsers support quotation marks and escaping to get around the comma and new line et al problems. eg:

    "full name", "address"
    "Homer Simpson", "742 Evergreen Terrace,\nSpringfield"
    "Bart \"El Barto\" Simpson", "742 Evergreen Terrace,\nSpringfield"
Granted it's not the prettiest and some spreadsheets really love to break the formatting upon save (cough Microsoft Excel cough) but it does work.

As a side note, the best spreadsheet I've found for manipulating CSV data without breaking the formatting upon saving was OpenOffice Calc. This was a few years back before the LibreOffice fork was created as I've thankfully not needed to deal with CSV files large enough to warrant a full blown spreadsheet editor, but I would assume LibreOffice Calc would behave the same.

Your CSV would actually look like this instead:

    full name,address
    Homer Simpson,"742 Evergreen Terrace,
    Springfield"
    "Bart ""El Barto"" Simpson","742 Evergreen Terrace,
    Springfield"
(Omitted optional quotes for fields that don't need them). Quotes are escaped with "", and line breaks don't need escaping, they just have to be in a quoted field. And there is no space after a comma, unless you want that space to be part of the field's value.
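As a quick check of those rules, here is a small Python sketch using the standard csv module with its default dialect, which follows the same conventions (doubled quotes, raw line breaks inside quoted fields). The sample data is the Simpsons example from this thread.

```python
import csv
import io

text = (
    'full name,address\r\n'
    'Homer Simpson,"742 Evergreen Terrace,\nSpringfield"\r\n'
    '"Bart ""El Barto"" Simpson","742 Evergreen Terrace,\nSpringfield"\r\n'
)

# newline="" keeps the embedded line breaks exactly as written.
rows = list(csv.reader(io.StringIO(text, newline="")))
for row in rows:
    print(row)
```

The reader returns three rows: the doubled quotes collapse to single quotes, and the address keeps its embedded newline.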

Thanks for the correction regarding escaping, but I think you went a little overboard on the other alterations:

> Omitted optional quotes for fields that don't need them

I think it's good policy to always wrap your contents in quotes regardless of whether you have a delimiter that needs quoting. And in fact many CSV marshallers will do just this.

> And there is no space after a comma

That was added purely for readability on HN. I agree it's not how you'd normally marshal the contents.
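For the "always quote" policy, a minimal sketch with Python's csv module, which exposes it as the QUOTE_ALL option. This is just one marshaller; others may behave differently.

```python
import csv
import io

rows = [
    ["full name", "address"],
    ["Homer Simpson", "742 Evergreen Terrace,\nSpringfield"],
    ['Bart "El Barto" Simpson', "742 Evergreen Terrace,\nSpringfield"],
]

buf = io.StringIO(newline="")
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)  # quote every field
writer.writerows(rows)
output = buf.getvalue()
print(output)

# Reading the output back gives the original rows.
assert list(csv.reader(io.StringIO(output, newline=""))) == rows
```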

If it's for my own programs, I use pipe (|) separated values. They're visually appropriate and even less likely than tabs.

ASCII does contain control characters set aside for record and unit separators (codes 30 and 31 respectively). Sometimes I wish they got more use than they do.

Except pipe is really easy to get as a typo. It's right next to the enter key. And then you're dealing with escaping characters and before you know it you've rolled your own file format.

Been there. Use a lib that implements a documented standard, even a bad one. Only problem is Excel, which basically standardizes on CSV and occasionally mangles your data into malformed dates because reasons anyways.

What's the problem you're having with Excel reading TSV? Works fine here.

Most *SV importers will actually accept any character as the delimiter, it's just people insist on believing CSV is utterly trivial and thus not worth using a real library for it.
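As one illustration, Python's csv module takes the delimiter as a parameter, so the same library code handles tab and pipe separated data, including quoted fields that contain the delimiter. The parse helper and the sample data are made up for this sketch.

```python
import csv
import io

def parse(text, delimiter):
    """Parse delimiter-separated text with the standard csv reader."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))

tsv = 'name\tnote\r\nHomer\t"likes\tdonuts"\r\n'
psv = 'name|note\r\nBart|"a|b"\r\n'

print(parse(tsv, "\t"))
print(parse(psv, "|"))
```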

> You would still need an escape mechanism for ...

I think this is actually desirable, since in your case the escape denotes different semantics. The unescaped pairs act like quotation operators while the escaped version is a character literal.

Also Ruby:

  %q{This is a string with an %q{embedded quote}.}

Powershell: "This is a string with an 'embedded quote'."

It's helpful to remember that quotes will interpret the variables inside, while apostrophes will not. Very useful for scripting the creation of scripts. Example:

"It is $time" > It is 15:22

'It is $time' > It is $time

"'$time' is $time" > '$time' is 15:22

That's not a string where the same thing used for quoting is used inside the string without escaping, nor is it an example of the distinct begin-vs-end quote pairs approach under discussion.

But, yes, having single and double quoted strings is another way to avoid escaping (one that Ruby and a number of the other languages discussed here also support).

Not sure why you were downvoted. Your comment is relevant and the convention can be useful. (PHP worked exactly the same way as your first two examples, although the third would have produced <'15:22' is 15:22>.)

Actually, ASCII has a mechanism for solving the problem that you describe, with the control codes FS, GS, RS and US.

I disagree. Sure it might regex better, but my typing speed and typo rate would be much worse if I had to type separate open and close quotes for all my strings.

>All this horrible complexity could have been avoided if we could just write

Only if there were no chance that unbalanced quotes would need to appear in the string.

You'd still have escape sequences for those cases.

If ASCII had balanced quotes then they would be used by programming languages to delimit strings and we would be back to square one with regards to escaping them!

You don’t need escaping in «This is a string containing an «embedded» quoted string».

«Hi. How do I open a quote?»

«Oh, you just use the « character.»

Parse error. Unexpected EOF.

That wouldn’t be a good idea but you could adapt your parser to support that case. You can write `/* /* */` in C, for instance.

To clarify: we’d still need escaping but in fewer cases.

Debatable whether that would actually work in practice.

    /* nested /* comments */ don't work */

They don't work in C, but that's an arbitrary decision made by its designers. There are many languages that have balanced comments that can be nested. In OCaml, for example, this is legal:

    (* nested (* comments *) work *)

This is possible in D using /+ comments +/.


This allows commenting out code containing comments, which can be useful when debugging or giving usage examples in the code.

I think it would make the parsing a tiny bit more inconvenient though

Yup, suddenly your parser needs to keep a count of how many nested strings deep it is.

It's not hard. Obviously the parser already has to do that for things that aren't a single token, like parenthesized expressions.

Even for things where the nesting does happen during lexical analysis, it's pretty trivial to keep a count in your lexer. Lots of languages support nesting comment syntax or string interpolation, which both have equivalent difficulty.
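A rough sketch of such a counting lexer in Python, for a hypothetical language that delimits strings with « and » and uses backslash as the escape character. None of this is taken from a real language; the function name and the escape rule are assumptions for illustration.

```python
OPEN, CLOSE, ESCAPE = "«", "»", "\\"

def read_nested_string(src, start=0):
    """Read one «...» literal starting at src[start].

    Returns (contents, index just past the closing quote). Nested quote
    pairs are kept in the contents; a backslash makes the next character
    literal, so it does not affect the nesting depth.
    """
    if start >= len(src) or src[start] != OPEN:
        raise ValueError("expected opening quote")
    depth = 0
    out = []
    i = start
    while i < len(src):
        ch = src[i]
        if ch == ESCAPE and i + 1 < len(src):
            out.append(src[i + 1])  # escaped character, taken literally
            i += 2
            continue
        if ch == OPEN:
            depth += 1
            if depth > 1:           # the outermost quote is not content
                out.append(ch)
        elif ch == CLOSE:
            depth -= 1
            if depth == 0:
                return "".join(out), i + 1
            out.append(ch)
        else:
            out.append(ch)
        i += 1
    raise ValueError("unterminated string")

print(read_nested_string("«This is a string containing an «embedded» quoted string»"))
print(read_nested_string("«To end a string, use the \\» character.»"))
```

An unbalanced, unescaped quote such as «Oh, you just use the « character.» still fails with "unterminated string", which matches the "Unexpected EOF" case raised earlier in the thread.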

Don't forget that "parser" also includes human brains, which tend to not be that great at parsing nested things.

To use formal language theory, strings containing escape characters are regular, i.e. parseable with a finite-state machine. Allowing nesting means you need a stack to find the matching pair.
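To illustrate the first half of that claim, here is a sketch of a regular expression for a double-quoted string with backslash escapes, using Python's re module. The pattern is a common idiom rather than any particular language's grammar: it uses no recursion or counting, only alternation and repetition, which is what "regular" means here.

```python
import re

# A quote, then any run of (non-quote, non-backslash) characters or
# backslash-escaped characters, then the closing quote.
STRING_RE = re.compile(r'"(?:[^"\\]|\\.)*"')

source = r'"an embedded \"quoted\" string" and the rest'
match = STRING_RE.match(source)
print(match.group(0))
```

No comparable pattern exists for arbitrarily deep «nested «quotes»», because matching them requires remembering the current depth.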

It’s trivial to do; parsers handle nested things quite easily.

Ruby does that. Computers are pretty fast now.

So, how would you encode this in a string:

  To end a string, use the » character.


You escape it. What I was saying was nested strings wouldn’t need escaping.

Yeah, seen a ton of tools that auto-format to left/right quote automatically but then output ASCII and mangle the conversion.

let's rewrite social idioms to use < > as quotes.

«These characters» are the usual way of quoting in several languages.

See https://en.wikipedia.org/wiki/Guillemet

Well then it's a good thing C used its ASCII equivalent, the one that's accessible on anglophone keyboards, for bit shifting, so if any programming language tried to use <<strings>> you'd have C grognards screaming about lshift and rshift.

> The fact that ASCII does not have balanced quotes is one of the great catastrophes of computing.

