
It seems that so many programming headaches have the same root cause: the set of characters that compose "text" is the same set that we use to talk about text. Hence the nightmares with levels of quoting and escaping. The use of out-of-band characters like NULLs to separate pieces of text does help, but I don't think there is a complete solution. Because, eventually, we want to explain how to use these special characters, which means we must talk about them, by including them in text....



> Hence the nightmares with levels of quoting and escaping.

PostgreSQL has an interesting approach to this problem that I've found really straightforward; it lets me express text as text without getting into strange characters. They allow a character sequence for quoting rather than relying on a single character. The starting point is a sequence that is unlikely to appear in actual text: $$. It's called dollar quoting. Beyond plain $$, you can insert a word between the two dollar signs to allow for nesting. Better explained in the docs:

https://www.postgresql.org/docs/current/static/sql-syntax-le...

The key here is that I can express string literals in PostgreSQL code (SQL & PL/pgSQL) using all of the normal text characters without escaping, and the $$ quoting hasn't come with the additional cognitive load that complex escaping can (before dollar quoting, PostgreSQL had nightmarish escaping issues). I wish other languages had this basic approach.


Perl's had something like that for a long time: quote operators. You can quote a string using " or ' (which mean different things), and you can quote a regex using /. But for each of these you can change the quote character by using a quote operator: qq for the double-quote behavior, q for the single-quote behavior, and qr for the regex behavior. (There are a few others too, but I used these most often.)

    my $str1 = qq!This is "my" string.!;
    my $str2 = qq(Auto-use of matching pairs);
    $str2 =~ qr{/url/match/made/easy};
The work I did with Perl included a LOT of url manipulation, so that qr{} syntax was really helpful in avoiding ugly /\/url\/match\/made\/hard/ style escaping.


Perl is still, I think, the gold standard for quoting and string manipulation syntax. I am to this day routinely perplexed by the verbosity and ugliness of simple operations on strings in other languages.

(Of course, this may also be one of the reasons that programmers in its broad language family have a pronounced tendency to shoehorn too many problems into complex string manipulation, but I suppose no capability comes without its psychological costs.)


Yup, the 8085 CPU emulator in VT102.pl[1] uses a JIT which is essentially a string-replacement engine.

[1]: http://cvs.schmorp.de/vt102/vt102 (note - contains VT100 ROM as binary data, but opens in browser as text)


Perl also supports heredocs, i.e. blocks of full lines ended by an explicit terminator line:

  print '-', substr(<<EOT, 0, -1), "!\n";
  Hello, World
  EOT
Prints:

  -Hello, World!
iirc sh-shells also have that.


This seems like an awesome feature. I wish Python had something like it.


Python has triple-quoted strings which generally do the trick, and uses prefixes for "non-standard" string behaviours (though it doesn't have a regex version IIRC, Python 3.6 adds interpolation via f-strings)

    str1 = f"""This is "my" string."""
    str2 = """Auto-use of matching pairs"""
    str3 = r"""/url/match/made/easy"""


Yes, I've belatedly caught on to using triple-quotes to avoid some escaping. But I didn't know about the f-strings - thanks! (I'll be using those when I start using 3.6.)


Interesting, especially as I use PostgreSQL. Unfortunately, "$$" is very common in actual text (millions of TeX documents, for example), as is $TAG$. But this could still work if you were careful to use TAGs that would never be found in text. But what if the document that you linked to itself had to be quoted? Would that lead to a problem?


I think it would be wrong to call it a perfect system, or one created with the intention of being perfect. I'm sure in some disciplines, especially technical disciplines, you may well come across those sequences much more often... which sounds like your experience. Most of what I do is in mid-range business systems, and in 20 years of professional life it's something I've never come across. I suspect those sequences are fairly rare outside of specific domains, which is presumably why the PostgreSQL developers made that choice.

I don't understand your question about self-referential documents and linking; maybe an example would help. The PostgreSQL dollar sign quoting feature is simply a way to use single quotes (important in SQL) without having as many escaping issues. So instead of:

  SELECT 'That''s all folks!';
You could write:

  SELECT $$That's all folks!$$;  
or

  SELECT $BUGS$That's all folks!$BUGS$;
And where it starts to save you in PostgreSQL is with something like (PL/pgSQL):

  DO
      $anon$
          BEGIN
              PERFORM $quote$That's all folks!$quote$;
              PERFORM 'Without special quoting';
          END;
      $anon$;
Note: this code produces nothing, it just should run without error (I ran it on PostgreSQL 9.4). In PL/pgSQL, the body of the procedural code is simply a string literal... but that means any SQL related single quoting would have to be escaped if we used single quotes. So using normal single quotes the previous code example would look something like:

  DO
      '
          BEGIN
              PERFORM ''That''''s all folks!'';
              PERFORM ''Without special quoting'';
          END;
      ';
And it gets worse as you get into less trivial scenarios... which is why I suspect this dollar quoting system was created to begin with.


I agree this solution handles a lot of common cases and makes the code easier to read than when forced to escape everything. I wasn't clear in my comment about self-reference. I meant that, suppose you are storing the text of articles in the DB (not a great idea, but it happens). The article (the one that you linked to) explains the $$ mechanism by showing how it works, so it's full of $$ sequences - the very sequences that we are assuming won't be encountered in normal text. That's what I meant in my beginning comment when I said that handling text that talks about our quoting conventions will lead to problems.


Ah, that's clearer for me.

There are a couple of ways to handle this, depending on the scenario. If I were dealing with static text under my control, say a direct insert of the text, I would either just enclose it all in traditional ' characters or come up with some unique quote text between the $$.
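
To make the "unique quote text" idea concrete, here is a minimal Python sketch that picks a dollar-quote tag that cannot collide with the text being wrapped. The function name dollar_quote and the tag naming scheme (q, q1, q2, ...) are my own assumptions for illustration, not anything PostgreSQL provides.

```python
import itertools


def dollar_quote(text: str, base: str = "q") -> str:
    """Wrap text in a dollar-quoted literal whose tag cannot end early.

    Hypothetical helper: tries the tags q, q1, q2, ... until the closing
    delimiter first appears exactly at the end of the quoted body.
    """
    for n in itertools.count():
        tag = base if n == 0 else f"{base}{n}"
        delim = f"${tag}$"
        # The first occurrence of the delimiter in (text + delim) must be
        # the real terminator; this also rejects text ending in "$q".
        if (text + delim).find(delim) == len(text):
            return f"{delim}{text}{delim}"


# An article that itself talks about dollar quoting.
article = "Use $$ or $q$ to quote: SELECT $$That's all folks!$$;"
print(dollar_quote(article))
```

Since the article already contains $q$, the helper moves on to the next tag. For machine-generated SQL this is probably overkill compared with parameter binding, but it shows that the self-reference problem only pushes you to a tag the text doesn't contain.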

If I'm dealing with arbitrary text coming from, say, a blogging website, I would either handle traditional SQL escaping in my input sanitizing code (or thereabouts), since I have to do that anyway, or create an inserting PL/pgSQL function that takes the article text as a parameter. ($$ is great for handwritten code where escaping introduces cognitive load, but it isn't necessarily important for machine-generated code.) With the function approach, the text gets handled without my having to do anything, assuming I simply insert it directly from the parameter.
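
A rough sketch of the "pass the text as a parameter" idea, in Python. Note the swap: I'm using the standard library's sqlite3 module as a stand-in so the example runs anywhere; it is not PostgreSQL or PL/pgSQL, and the table name articles and the function store_and_fetch are made up for illustration. The point is only that a bound parameter never passes through the SQL quoting rules at all.

```python
import sqlite3


def store_and_fetch(article_text: str) -> str:
    """Insert text through a bound parameter and read it back unchanged."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE articles (body TEXT)")
        # The ? placeholder carries the value separately from the SQL text,
        # so quotes and $$ inside the article need no escaping.
        conn.execute("INSERT INTO articles (body) VALUES (?)", (article_text,))
        (body,) = conn.execute("SELECT body FROM articles").fetchone()
        return body
    finally:
        conn.close()


tricky = "SELECT $$That's all folks!$$; -- and a lone ' quote"
print(store_and_fetch(tricky) == tricky)
```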


> The use of out-of-band characters like NULLs to separate pieces of text does help, but I don't think there is a complete solution.

NULL is actually in-band, not out-of-band, and in fact it illustrates the issues with in-band communication you mention. That's what, presumably, ESC was for: a way to signal that the following character was raw and did not hold its normal meaning.

You can still devise a pretty good protocol with straight ASCII over a wire, using SYN to synchronise the signal, the separator characters to separate data values and ESC to escape the following character (like '\' is used in many programming languages).
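
Something like the following Python sketch, assuming US (0x1F, unit separator) between values and ESC (0x1B) meaning "take the next byte literally". I've left SYN out, and the function names are just for illustration.

```python
from typing import List

ESC = 0x1B  # escape: the next byte is raw data
US = 0x1F   # unit separator between data values


def encode_fields(fields: List[bytes]) -> bytes:
    """Join fields with US, prefixing any literal ESC or US byte with ESC."""
    out = bytearray()
    for index, field in enumerate(fields):
        if index:
            out.append(US)
        for byte in field:
            if byte in (ESC, US):
                out.append(ESC)
            out.append(byte)
    return bytes(out)


def decode_fields(data: bytes) -> List[bytes]:
    """Reverse encode_fields: split on unescaped US, drop the escaping ESC."""
    fields = [bytearray()]
    escaped = False
    for byte in data:
        if escaped:
            fields[-1].append(byte)
            escaped = False
        elif byte == ESC:
            escaped = True
        elif byte == US:
            fields.append(bytearray())
        else:
            fields[-1].append(byte)
    return [bytes(field) for field in fields]


message = [b"plain", b"has \x1f separator", b"has \x1b escape"]
print(decode_fields(encode_fields(message)) == message)
```

The cost is the usual one for in-band escaping: the data can grow, and every reader has to apply the same rule.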


> That's what, presumably, ESC was for: a way to signal that the following character was raw and did not hold its normal meaning.

Yes, ESC is a code extension mechanism: it means that the following character(s) are not to be interpreted according to their plain ASCII meaning, but some other pre-arranged meaning. Ultimately a shared alternate meaning for terminal control was standardized as ISO 6429 aka ECMA-48 aka “ANSI”. Free reading: https://www.ecma-international.org/publications/standards/Ec...
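
As a small illustration of ESC as code extension, here is a Python sketch that builds an ECMA-48 SGR sequence: ESC followed by [ is the 7-bit form of CSI, and the final byte m selects graphic rendition (1 is bold, 0 resets). The helper names are mine.

```python
ESC = "\x1b"
CSI = ESC + "["  # 7-bit representation of Control Sequence Introducer


def sgr(*params: int) -> str:
    """Build a Select Graphic Rendition sequence such as ESC [ 1 m."""
    return CSI + ";".join(str(p) for p in params) + "m"


def bold(text: str) -> str:
    """Wrap text in bold-on and reset sequences."""
    return sgr(1) + text + sgr(0)


# The bytes after ESC are ordinary printable ASCII with a pre-arranged meaning.
print(repr(bold("hello")))
```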

That this gave us keyboards with an Escape key that GUIs would repurpose to mean ‘get me out of here’ is a coincidence. (Plain ASCII had Cancel = 0x18 = ^X for that.)

MIT culture for historical non-ASCII reasons also referred to Escape as ‘Altmode’, which is ultimately how EMACS and xterms ended up with their Alt-key/ESC-prefix clusterfjord.


DLE not ESC https://en.wikipedia.org/wiki/C0_and_C1_control_codes#DLE

ESC is used for introducing C1 control sequences


We recently had to deal with an issue like this. My decision was to just sort of punt on the issue, and just base-64 encode the text. So there would be no shenanigans with escape character processing and such. The loss in efficiency was considered acceptable.
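
Roughly like this, as a Python sketch (the function names pack and unpack are illustrative, and UTF-8 is an assumption about the text encoding):

```python
import base64


def pack(text: str) -> str:
    """Encode text so the result contains only A-Z, a-z, 0-9, +, / and =."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def unpack(encoded: str) -> str:
    """Recover the original text from pack() output."""
    return base64.b64decode(encoded).decode("utf-8")


original = "SELECT $$That's \"all\" folks!$$; \\ and a trailing backslash"
packed = pack(original)
print(unpack(packed) == original)
```

The efficiency loss is four output characters for every three input bytes, plus padding.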



