ASCII and Unicode quotation marks (cam.ac.uk)
176 points by anschwa 71 days ago | 186 comments



I hate the "" -> “” thing with a passion. I don't know how much productivity the world has lost to that “” shit.

It doesn't look that much better, and it always fucks with me at random times. That shit is on the list of annoying problems that shouldn't exist in the first place, along with the \n vs \r\n thing, the txt-saved-as-RTF thing, and the UTF-8 encoding-character-at-the-beginning-of-the-file or whatever it is called.

Someone complains that your program gives them an error when they open a CSV file you sent them. You tested your program; it works. You get on the phone with them for 30 minutes, trying to figure out what the fuck is going on. There it is: the file was opened in a program that meddles with the "" and replaces it with the “” shit.

Also, there has to be at least one time you were fucked by the "" -> “” snobbery: you go to some random Wordpress site, paste the command it tells you to run into the command line, and realize it doesn't work. You pull your hair out for a couple of minutes, and there it is, that sneaky ” thing. Wordpress does that to anything it doesn't think is code (inb4 ”good programmers don't paste commands from Wordpress to GNU+bash“).

One of the first things I do when I set up a new Mac is to turn that damn "" -> “” ““““feature”””” off.


You forgot to mention leading apostrophes getting autocorrected to left_single_quote instead of right_single_quote, as in

John Doe ‘42

An abomination ‘cause it's a damn apostrophe, not opening a quotation.


For anyone wondering, this is how to turn that feature off.

System Preferences -> Keyboard -> Text -> Use smart quotes and dashes (uncheck).


On a US-English keyboard on the Mac, you can always type the proper glyphs directly. That's always been harder on Windows, where one would have to type some numerical character code each time they wanted a non-ASCII character. That's where the auto-replace originated.

On the Mac:

    Option [          “      (English open quote)
    Option Shift [    ”      (English close quote)
A lesser-known feature is that some other quotations are also possible from that same US-English keyboard:

    Option \          «      (French open quote)
    Option Shift \    »      (French close quote)
    Option Shift W    „      (German open quote)


> at-the-beginning-of-the-file

That thing's the BOM.


Which you don't really need with UTF-8; it only has a purpose for UTF-16 and UTF-32.


It's useful to tell a text editor "This is UTF-8, not Windows-1252 or ISO-8859-1 or whatever you might be used to".


No, just do 8-bit clean, don't SCREW with the encoding if you weren't asked to.


An editor can't "just do 8-bit clean"; it has to display the characters. The same bytes will sometimes be displayed differently in UTF-8 and (e.g.) ISO-8859-1.

I'm not sure if a BOM is a good way to handle it, but saying 'just do 8-bit clean' doesn't work when you're displaying or printing the characters for humans to understand.


[flagged]


I never said I hate the 69 quotation thing. I said I hate the automatic conversion of the straight quote to the 69 quote behind my back.

Imagine something automatically replacing "fi" with the styled "fi" ligature behind your back, breaking text searches and normal text files.

You can be snobby all you want, as long as you initiated it yourself and know you aren't breaking programs' assumptions. No one says you can't have nice shit with LaTeX and PageMaker when you need it.

Here, people who just open the file and save it, or copy and paste the text, don't intend any of it, yet it happens without them ever knowing.


Regarding fi, it just shows how broken most search engines are.

As a reader, what is the difference between ﬁ and fi? Or between a and а?

If the search engine is distinguishing those because of Unicode characters, it has failed completely.


Unicode turns trivial string comparison problems into pretty much AI-complete ones. Same visual characters, different meaning. Different characters, but same visual meaning for the person who initiated the query. Up/down-case not well defined, and even if it is, it's not always reversible (i.e. tolower(toupper(tolower(string))) is not equal to tolower(string)). There's lots of that.
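
A quick Python 3 sketch of a few of those traps (just illustrative; the standard unicodedata module does the folding):

    import unicodedata

    # Visually identical, different code points: Latin "a" vs Cyrillic "а".
    print('a' == '\u0430')                                   # False

    # The "fi" ligature vs the two letters "fi"; NFKC folds them together.
    print('\ufb01' == 'fi')                                  # False
    print(unicodedata.normalize('NFKC', '\ufb01') == 'fi')   # True

    # Case mapping is not reversible: German ß upper-cases to "SS".
    print('straße'.upper())           # STRASSE
    print('straße'.upper().lower())   # strasse -- the ß is gone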

The longer I live, the bigger fan of simplification and standardization I become. Like with dates and times. Timezones and DST are a fucking nightmare, because politics. And then doubly so, because people enjoy themselves with their favourite regional writing formats. At this point I'm all for enforcing ISO 8601 on every communication involving dates and times.

Some people object that this is turning humans into machines, etc. So be it. Nature isn't perfect, and clear communication doesn't come to us naturally. Yet it is absolutely vital in a technological society.


That’s closed-mindedness bordering on the silly. ISO 8601, for instance, only represents Gregorian calendar dates and times. But what about Japanese, Hebrew and Buddhist calendars, for example?

Looking at the world as only “what will my Emacs store on the drive when I paste some text into it” is ridiculous. As computers become more advanced—even mobile phones are super computers now¹—we should use that computer power to make the technology work for us, not bend over to some 1980s concepts of computing and standards. It may be more difficult for you to build such a system, I understand, but as a used, I really don’t give a damn.

¹: https://www.macrumors.com/2017/09/13/a11-bionic-chip-geekben...


>I understand, but as a used, I really don’t give a damn.

Yeah, because we already changed " to 69 behind their back, let's double down on correcting the 69 so it processes like "!

Let's engineer our dumb, close-minded O(n) string search and CSV parser to do some AI image recognition shit in O(2^n) to figure out when the fuck the "used" had the ` involuntarily changed to ' because MacOS or Wordpress decided it looks better.

That's brilliant engineering right there.

Satire aside, you realize that over-engineering shit has a security and maintenance cost, and it's not just about convenience, right?


Your “used” “joke” is so … satirical.

Your specific bug with CSV is a developer bug. The place where you copied the incorrect CSV from should not have had such substitution enabled. On the Mac, a developer can specify for each text view and text field which substitutions are allowed by default. Likewise on the web, it is possible to specify which substitutions should be allowed for text areas.

So instead of blaming incompetent developers for their incorrect use of system features, thus ruining some very narrow cases, let's hold back any kind of text input and processing advancements, because you are unable to input some CSV properly.

That's brilliant engineering right there.

When my mother types on her computer, she just wants things to work. When she searches for something, she doesn't care if she typed the wrong Unicode character. Those are the cases that need to be solved for users.

And it is only an O(2ⁿ) algorithm if you naively look at text as an array of bytes. Time to, perhaps, broaden some horizons.


When you use your mom as the baseline, do you think she would give a single fuck about how the " should look? Does she zoom in on the text with a magnifying glass to complain: oh no, they didn't change my " to 69, this doesn't look right, my OS is shit?

Or does she care more when the shit she copy-pasted from some random website, or from your note telling her to do something, doesn't work because the site or the copy-paste process meddled with it?

PS: The joke is on the O(2^n) with AI Tensorflow, not on the "used." The "used" thing was 101% serious.


It would be fine if it just changed visually depending on the context, without actually changing the underlying character data.


The problem is that you can't really detect which single quotation mark to use unambiguously.

    'n'
can either be an abbreviation of "and", or a single-quoted letter "n".

as an abbreviation of "and", it should be rendered as [right single quote]n[right single quote]

’n’

and as a single-quoted letter "n", it should be rendered as [left single quote]n[right single quote]

‘n’
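
To make that concrete, here's a sketch (Python, with made-up names) of the kind of naive heuristic smart-quote features tend to use, and how it gets the abbreviation case wrong:

    import re

    def naive_smart_quotes(text):
        # Typical editor heuristic: a straight quote at the start or after
        # whitespace "opens"; any other straight quote "closes".
        text = re.sub(r"(^|(?<=\s))'", '\u2018', text)   # left single quote
        text = re.sub(r"'", '\u2019', text)              # right single quote
        return text

    # The heuristic can't know 'n' is an abbreviation, not a quotation:
    print(naive_smart_quotes("fish 'n' chips"))   # fish ‘n’ chips (wrong)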


If you can't determine it automatically, then why is that "feature" there in the first place?


And here it would create a very insidious problem: representation (external format) disconnected from underlying data (internal format).

What I mean is, for example, this: when I see " in a comment, I know the browser is actually storing the " character somewhere in memory. I know that when I send this comment form, HN will receive " character. If I copy and paste the comment into my Emacs, and save it, I know my hard drive now stores the " character.

When external format gets completely disconnected from internal format, understanding anything about what happens with the data gets much more difficult.


But we already have that with things such as style sheets, where you can specify font styles which transform the text to all upper case, all lower case, small caps, etc. And as another comment pointed out there's even ligatures, which interpret a sequence of characters and render them together with a different singular glyph.


What about ligatures and other font character-substitution features? What will your Emacs do then, and why does it matter? As we advance further, we should strive to disconnect from the technology—which should just work, despite how complex it is.


Then it would have to be a feature of the font, and I am not familiar with any font that does that automatically. Technically, it might be possible with the advanced scripting available in OpenType, but considering how even the most basic ligatures are unsupported by most Linux and Windows fonts, I would not get my hopes up for this arriving any time soon.


As far as I know, it's (very) hard to do correctly.

There are (small or big) differences among languages, and it is not always obvious to detect whether the quote should be converted at all, and if so, whether it's a left one or a right one.

As a reference, this is a Python script that tries to do the conversion in Scribus:

https://wiki.scribus.net/canvas/Convert_Typewriter_Quotes_to...

There is also an article by the author of the script, explaining his work and pointing to at least one common case it cannot handle:

https://opensource.com/article/17/3/python-scribus-smart-quo...

Putting all this logic in a font might or might not be something you really want...

But I 100% agree: this should be done at the font level... and hard-replacing characters is not a good solution.


The fact that ASCII does not have balanced quotes is one of the great catastrophes of computing. It makes everything more complicated than it needs to be, from embedding code in strings to parsing CSV files, to regexps. For example, if I want to embed a quoted string in another quoted string, I have to escape the inner quotes like so:

"This is string containing an embedded \"quoted\" string"

Then I have to think about whether or not the system I'm going to send that string to is going to "helpfully" remove the backslashes, in which case I need to write:

"This is a string containing an embedded \\"quoted\\" string"

God help you if you want to go two levels deep.
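
Here's the blow-up spelled out with plain Python 3 string literals (a sketch; which layer eats which backslash varies by system):

    # Level 0: the text we actually want.
    s0 = 'This is a "quoted" string'

    # Level 1: embedded in a double-quoted literal.
    s1 = '"This is a \\"quoted\\" string"'
    print(s1)   # "This is a \"quoted\" string"

    # Level 2: the level-1 form passed through another backslash-eating layer.
    s2 = '"This is a \\\\"quoted\\\\" string"'
    print(s2)   # "This is a \\"quoted\\" string"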

All this horrible complexity could have been avoided if we could just write:

«This is a string containing an «embedded» quoted string»

Alas.


The complexity might be minimized, but not avoided. You would still need an escape mechanism for something like «She said «The \» key on the server doesn't work.»»

ASCII did add <>, [], and {}, any of which could have been used for quoted strings, had the programming language designers chosen that option.

https://en.wikipedia.org/wiki/String_literal#Paired_delimite... points out that PostScript and Tcl have a string literal which allows matched quotes.

  PostScript: (The quick (brown fox))
  Tcl: {The quick {brown fox}}


Ruby lets you use arbitrary delimiters for string literals with %q{} (where the braces can be a bunch of other characters). I wish more languages would adopt this, tbh.


C++11 has this feature too [1], e.g.:

    const char * str = R"*^*(This is string containing an embedded "quoted" string)*^*";
[1] http://en.cppreference.com/w/cpp/language/string_literal


Apache Groovy also had that in its early 1.0 betas, but it was removed before the official 1.0 release party.


Ruby lifted that from Perl.

  say qq<I can do '" in here>;


C++ has something similar.


> You would still need an escape mechanism for something like...

Yes, but that's a pretty rare case, much rarer than embedded strings.

Even that case could be solved by having two different quotes, like Python which allows both 'string' and "string". So you could do:

«This is a string that mentions the ” character without escaping it»

“This is a string that mentions the « character without escaping it”

Yes, there are still some edge cases, like embedding both “ and « in the same string. But that's really rare.


You don't want any 'rare' cases at all. That's the point.

Stop using "punctuation" when you are attempting to "delimit" text. Use a character that is not punctuation, specifically designed for "field delimiter" purposes.

Trying to do two things at once is ridiculous.


If your text is always valid UTF-8, there are various illegal UTF-8 octets available for this purpose: 0xff, 0xfe, and so on. Unlike null terminators or record separator characters, these characters are guaranteed not to exist in your string by the UTF-8 validation code you're already running.
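
A minimal Python sketch of the idea (the function names are made up):

    def join_utf8(fields, sep=b'\xff'):
        # 0xFF can never occur in well-formed UTF-8, so no escaping is needed.
        return sep.join(f.encode('utf-8') for f in fields)

    def split_utf8(data, sep=b'\xff'):
        return [p.decode('utf-8') for p in data.split(sep)]

    rec = join_utf8(['Homer Simpson', '742 Evergreen Terrace,\nSpringfield'])
    print(split_utf8(rec))   # round-trips, commas and newlines included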


I've been trying to push TSV (tab-separated values) as a standard response/implementation when they ask for CSV. "Yes, but it's comma separated!" Sure is, but text can contain commas... I have seen issues with Google Spreadsheets not recognizing the tabs, however... Excel doesn't know what to do with a TSV either. But both have a complete wizard for parsing CSV...


> but text can contain commas

Erm.. text can contain tabs, too. This problem was solved so, so long ago when all the various ANSI/ASCII/whatever encodings were compiled by specifically reserving not one but two characters precisely to serve as field and record delimiters.

0x30 and 0x31 solved not only the problem of having commas or tabs in your text preventing you from treating them as field delimiters, but also allowing you to include new lines and carriage returns in your fields, too!

0x30 is the record separator and 0x31 is the unit separator (aka the field delimiter).

I _believe_ there was a record key on some standardized keyboard layout back in the day, too.

Edit: sorry, they are decimal, not hex. Thanks @jrochkind1
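
So, in decimal terms (30 = RS = 0x1E, 31 = US = 0x1F), a minimal Python sketch of the scheme:

    US, RS = '\x1f', '\x1e'   # unit (field) and record separators

    rows = [
        ['Homer Simpson', '742 Evergreen Terrace,\nSpringfield'],
        ['Bart "El Barto" Simpson', '742 Evergreen Terrace,\nSpringfield'],
    ]

    # No quoting or escaping: commas, quotes and newlines pass through as-is.
    blob = RS.join(US.join(fields) for fields in rows)
    assert [rec.split(US) for rec in blob.split(RS)] == rows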


I find that no matter what you do, you will _sometimes_ need escaping. There will, eventually, come a time when you want to embed an ASCII 30 (0x usually means hex; it's actually decimal code 30, i.e. hex 0x1E) RS Record Separator in some record delimited by 30 RS. So you'll need some method of escaping anyway. Or it'll be annoying.

I have spent some time working with the MARC 21 binary encoding (used for library cataloging records), which uses ASCII 0x1D, 0x1E, and 0x1F as delimiters. I would def not call it appreciably more _convenient_ than a more modern 'text' record format. If it has benefits, convenience isn't really one of them.


I think it's common to use ESC (0x1b) and then set the high bit on the next byte, so ESC itself would be sent as 0x1b, 0x9b.


Yes, but at that point the text file is basically binary: it contains exotic characters that confuse most text editors and can't be typed.

I know XML et al are frustrating, but I'd rather see them than a "creative" solution. It seems like 60% of the reason we still have to deal with archaic flat formats is support for Excel.


I have to suspect the fatal flaw is that these code points don’t look like anything and can’t be found on the keyboard.

Granted, that’s the whole point, but it also makes authoring and instruction harder. (And we all know how many programmers are really just competent copy-pasters.)


Right. And if you add 28 (file separator) and 29 (group separator) to the mix, then you have a whole set of nice options for concatenating multiple data files into a single stream, etc.


There are actually ASCII characters that don't appear in strings and are meant to be used as exactly such separators: the record separator (30) and the unit separator (31).


Sadly, eventually someone will want to enter one document as a field in another document and then you end up needing escaping anyways. Using a rare symbol for the delimiter would still be nice for typing documents by hand, but it would have to be available on modern keyboards to be convenient.


>Sadly, eventually someone will want to enter one document as a field in another document and then you end up needing escaping anyways.

Yes, but CSV files are record collections; in 99% of cases they are not recursive like that.

If a column contains escaped secondary documents, there's something wrong.


Excel will convert any tabulated text file into a spreadsheet regardless of the delimiters, or even the lack of them, as you can set which character(s) to delimit by, or even just go by column positions for tables of fixed widths. This is actually one of the few things Excel gets right with regard to CSV files; I've found it a horrid tool if you need to save any changes and preserve the original formatting of the CSV file (even the data itself gets altered!!)

Also, most CSV parsers support quotation marks and escaping to get around the comma and newline et al. problems, e.g.:

    "full name", "address"
    "Homer Simpson", "742 Evergreen Terrace,\nSpringfield"
    "Bart \"El Barto\" Simpson", "742 Evergreen Terrace,\nSpringfield"
Granted it's not the prettiest and some spreadsheets really love to break the formatting upon save (cough Microsoft Excel cough) but it does work.

As a side note, the best spreadsheet I've found for manipulating CSV data without breaking the formatting upon saving was OpenOffice Calc. This was a few years back before the LibreOffice fork was created as I've thankfully not needed to deal with CSV files large enough to warrant a full blown spreadsheet editor, but I would assume LibreOffice Calc would behave the same.


Your CSV would actually look like this instead:

    full name,address
    Homer Simpson,"742 Evergreen Terrace,
    Springfield"
    "Bart ""El Barto"" Simpson","742 Evergreen Terrace,
    Springfield"
(Omitted optional quotes for fields that don't need them.) Quotes are escaped with "", and line breaks don't need escaping; they just have to be in a quoted field. And there is no space after a comma, unless you want that space to be part of the field's value.
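
For what it's worth, this is exactly what Python's standard csv module (default "excel" dialect) produces and parses:

    import csv, io

    rows = [
        ['full name', 'address'],
        ['Homer Simpson', '742 Evergreen Terrace,\nSpringfield'],
        ['Bart "El Barto" Simpson', '742 Evergreen Terrace,\nSpringfield'],
    ]

    buf = io.StringIO()
    csv.writer(buf).writerows(rows)   # quotes only where needed, "" escaping
    assert list(csv.reader(io.StringIO(buf.getvalue()))) == rows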


Thanks for the correction regarding escaping, but I think you went a little overboard on the other alterations:

> Omitted optional quotes for fields that don't need them

I think it's good policy to always wrap your contents in quotes regardless of whether you have a delimiter that needs quoting. And in fact many CSV marshallers will do just this.

> And there is no space after a comma

That was added purely for readability on HN. I agree it's not how you'd normally marshal the contents.


If it's for my own programs, I use pipe (|) separated values. They're visually appropriate and even less likely to appear in the data than tabs.


ASCII does contain control characters set aside for record and unit separators (codes 30 and 31 respectively). Sometimes I wish they got more use than they do.


Except pipe is really easy to get as a typo. It's right next to the enter key. And then you're dealing with escaping characters, and before you know it you've rolled your own file format.

Been there. Use a lib that implements a documented standard, even a bad one. Only problem is Excel, which basically standardizes on CSV and occasionally mangles your data into malformed dates because reasons anyways.


What's the problem you're having with Excel reading TSV? Works fine here.


Most *SV importers will actually accept any character as the delimiter; it's just that people insist on believing CSV is utterly trivial and thus not worth using a real library for.


> You would still need an escape mechanism for ...

I think this is actually desirable, since in your case the escape denotes different semantics. The unescaped pairs act like quotation operators while the escaped version is a character literal.


Also Ruby:

  %q{This is a string with an %q{embedded quote}.}


Powershell: "This is a string with an 'embedded quote'."

It's helpful to remember that double quotes will interpolate the variables inside, while single quotes will not. Very useful for scripting the creation of scripts. Example:

"It is $time" > It is 15:22

'It is $time' > It is $time

"'$time' is $time" > '$time' is 15:22


That's not a string where the same thing used for quoting is used inside the string without escaping, nor is it an example of the distinct begin-vs-end quote pairs approach under discussion.

But, yes, having single and double quoted strings is another way to avoid escaping (which Ruby and a number of other languages discussed as supporting the approach being discussed also support.)


Not sure why you were downvoted. Your comment is relevant and the convention can be useful. (PHP works exactly the same way for all three examples, producing <'15:22' is 15:22> for the third.)


Actually, ASCII has a mechanism for solving the problem you describe, with the control codes FS, GS, RS and US.


I disagree. Sure it might regex better, but my typing speed and typo rate would be much worse if I had to type separate open and close quotes for all my strings.


>All this horrible complexity could have been avoided if we could just write

Only if there were no chance that unbalanced quotes would ever need to be in the string.


You'd still have escape sequences for those cases.


If ASCII had balanced quotes then they would be used by programming languages to delimit strings and we would be back to square one with regards to escaping them!


You don’t need escaping in «This is a string containing an «embedded» quoted string».


«Hi. How do I open a quote?»

«Oh, you just use the « character.»

Parse error. Unexpected EOF.


That wouldn’t be a good idea but you could adapt your parser to support that case. You can write `/* /* */` in C, for instance.

To clarify: we’d still need escaping but in fewer cases.


Debatable whether that would actually work in practice.

    /* nested /* comments */ don't work */


They don't work in C, but that's an arbitrary decision made by its designers. There are many languages that have balanced comments that can be nested. In OCaml, for example, this is legal:

    (* nested (* comments *) work *)


This is possible in D using /+ comments +/.

https://wiki.dlang.org/Commenting_out_code#Nested_comments

This allows commenting out code containing comments, which can be useful when debugging or giving usage examples in the code.


I think it would make the parsing a tiny bit more inconvenient though


Yup, suddenly your parser needs to keep a count of how many nested strings deep it is.


It's not hard. Obviously the parser has to do that for things that aren't a single token, like parenthesized expressions.

Even for things where the nesting does happen during lexical analysis, it's pretty trivial to keep a count in your lexer. Lots of languages support nesting comment syntax or string interpolation, which both have equivalent difficulty.


Don't forget that "parser" also includes human brains, which tend to not be that great at parsing nested things.

To use formal language theory, strings containing escape characters are regular, i.e. parseable with a finite-state machine. Allowing nesting means you need a stack to find the matching pair.
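
For a single string token that stack amounts to a depth counter; a Python sketch:

    def scan_string(src):
        # Scan one «...» token at the start of src; return its contents.
        assert src[0] == '«'
        depth = 0
        for i, ch in enumerate(src):
            if ch == '«':
                depth += 1
            elif ch == '»':
                depth -= 1
                if depth == 0:
                    return src[1:i]
        raise SyntaxError('unterminated string (unexpected EOF)')

    print(scan_string('«abc «def» ghi» trailing'))   # abc «def» ghi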


It’s trivial to do; parsers handle nested things quite easily.


ruby does that. computers are pretty fast now.


So, how would you encode this in a string:

  To end a string, use the » character.

?


You escape it. What I was saying was nested strings wouldn’t need escaping.


Yeah, I've seen a ton of tools that auto-format to left/right quotes but then output ASCII and mangle the conversion.


let's rewrite social idioms to use < > as quotes.


«These characters» are the usual way of quoting in several languages.

See https://en.wikipedia.org/wiki/Guillemet


Well then, it's a good thing C used their ASCII equivalents, which are accessible on anglophone keyboards, for bit shifting; if any programming language tried to use <<strings>> you'd have C grognards screaming about lshift and rshift.


> The fact that ASCII does not have balanced quotes is one of the great catastrophes of computing.

Okay.


I'm pretty sure text like:

  ``quoted''
is how you're supposed to write short quotes in the TeX/LaTeX typesetting system.

[edit: My point being that the author seems to think this type of quoting originated with X11... which is actually newer than TeX (the X Window System was first released in 1984, X11 itself in 1987), and that the prevalence of this type of quoting likely originated with TeX when it was released in 1978... which isn't mentioned at all in the article. In fact, since TeX/LaTeX is what all the CS, Physics, and Math types were using for journal articles, it is likely the X11 font bitmap glyphs were intentionally shaped like curly quotes to make editing your TeX source files prettier.

At least, that's how I remember it...]


Interesting historical note. Of course that's TeX input, and the author using it knows that it will be interpreted in TeX's special way and the correct characters used in the typeset output. Also, with the current Unicode-aware TeX engines, you can just input the normal Unicode quotation marks. That makes your source easier to read.


Yeah but that gets rendered as the proper quote glyph in the final document.


Another giveaway of a TeX-savvy writer out of water is when you see --- for em-dash, i.e., ‘—’.


Does HN markdown understand &mdash; or &lsquo;?

EDIT: Nope.


It doesn't need to. You can put the mdash directly in comments: —

As opposed to regular dash: -


Well, since my keyboard doesn't have an mdash key, if there's no support for something like &mdash; or --- then I can't use mdashes.


At least on the US keyboard, Mac OS lets you input all three dash types.

Hitting the button to the right of [0] outputs a hyphen ("-"). Holding alt/option when hitting it will output an en dash ("–"), holding both alt/option and shift will output an em dash ("—").

- = -

⌥- = –

⇧⌥- = —


If you're on Linux (maybe macOS supports this too?) you can open the keyboard settings and turn one of several keys into the Compose/Multi key (I picked CapsLock because I rarely hit it on accident and don't use it). Then you can type all kinds of weird combinations:

https://www.x.org/releases/X11R7.7/doc/libX11/i18n/compose/e...

For example, em dash is Compose+minus+minus+minus, en dash is Compose+minus+minus+period.

It's also useful for foreign glyphs like Compose+a+a for the Nordic å, Compose+s+s for German ß (with Shift it becomes ẞ, which the new German orthography rules officially recognise!), Compose+quote+<vowel> for the various umlauts and Compose+i+period for the dotless ı (with Shift it becomes the dotted İ).


The best way to enter arbitrary Unicode characters on Linux I have found is fcitx's https://fcitx-im.org/Unicode

Just press Ctrl-Alt-Shift-U, type (part of) the name of the character and select from the list of results.

— (em dash), ︱ (presentation form for vertical em dash), ⤐ (rightwards two-headed triple-dash arrow), ﷽ (arabic ligature bismillah ar-rahman ar-raheem) are all easy to enter.


I'm using AutoHotKey on Windows. Not only for this, but for other things, as well.

For example (<^>! means the AltGr key):

  <^>!4::Send „
  <^>!5::Send “
  <^>!2::Send ‚
  <^>!3::Send ‘
  <^>!+6::Send “
  <^>!+7::Send ”
  <^>!+8::Send ‘
  <^>!+9::Send ’
  <^>!-::Send –
  <^>!.::Send …
Also to use Caps Lock as another Control key:

  Capslock::Ctrl
Or make windows stay on top of others, even if they lose focus:

  <^>!t::Winset, Alwaysontop, TOGGLE, A


Fire up vim, use a digraph (^K followed by the two relevant characters -- I believe em dash is -M -- in insert mode), copy & paste.


Nobody's keyboard had an em-dash key (well, maybe some compositor keyboards...). It's well worth the time to figure out how to enter at least the most useful Unicode characters. See, here's an em-dash: —.


Linux users should enable the compose key, it's very useful.

Test it with:

  setxkbmap -option compose:menu
Then press the menu/compose key (next to right control), then C, then =. You get €.

Try compose, 1, 2 for ½.

Compose, ^, 3 for ³.

Compose, A, : for Ä.

It's pretty intuitive for the most useful characters, and easily the fastest way I have of typing the ö, ñ and å in various colleagues' names.


Actually, A: is not valid with the default compose bindings (it is however valid with vim digraphs). In general umlauts or trémas are inserted with the double quote (A").


Worth noting that international layouts (with dead keys) support many of those without the use of compose key (for obvious reasons not your first example – but the € symbol has its own key combination at least where it's commonly used). I admit it might be hard to adjust to caret being a dead key, though.


Dead keys support accents, and a few symbols printed on the keyboard are discoverable, but I find the Compose key is much more intuitive for occasional use.

I don't yet speak the language of my adopted country, so it's better for me to keep []{} etc where I like them in the British layout, and use three keypresses for typing the ø in a (place) name like København.

If I do end up typing lots of Danish, I'll probably map AltGr+A,E,O to Å, Æ, Ø. É is rare, so I'll still use the Compose key for that and German / Swedish names.


I use the default German keyboard layout. Even with dead keys I wouldn't have Nordic characters or the Turkish dotless i and dotted I. I love the Compose key. It's actually making me consider using the US layout because the lack of German glyphs was what was keeping me back.


I use the code behind this web app: http://latex2unicode.herokuapp.com/

It converts LaTeX to Unicode where possible. It's pretty impressive how much of LaTeX can be replaced with Unicode today.

This program translates LaTeX markup to human-readable Unicode when possible.

Here's the default text from that webapp:

Basic math notations: ∵ A͡B + B͡C ≠ A͡C ∴ ∬∜x̅ ξᶿ⁺¹ - ⅜ ≤ Σ ζᵢ ∴ ∃x∀y x ∈ Â

Easily type in hundreds of other symbols and special characters: , ℵ, Œ, ⇊, etc.

Font styles support: 𝔹𝕝𝕒𝕔𝕜 𝔹𝕠𝕒𝕣𝕕 𝔹𝕠𝕝𝕕, 𝔉𝔯𝔞𝔨𝔱𝔲𝔯, 𝐁𝐨𝐥𝐝 𝐅𝐚𝐜𝐞, 𝓒𝓪𝓵𝓵𝓲𝓰𝓻𝓪𝓹𝓱𝓲𝓬, 𝐼𝑡𝑎𝑙𝑖𝑐, 𝙼𝚘𝚗𝚘𝚜𝚙𝚊𝚌𝚎.

Now type in this box and try it yourself. ⌣̈


Careful — some of the Unicode pseudoalphabets won't render on mobile.


No problem on iOS; I see the same thing that I see on my Windows desktop.


If I need a special character, I find a web page or document that has it and use copy/paste.


I have an A4 sheet pinned next to me with all the Windows Alt codes. Old-school methods are the fastest, and I've subconsciously learned them by heart over the years.


Nice idea. I don't need them often enough to bother. Do you have a source for the chart?


Yeah. Can't count how many times I've just googled "yen" or "e with accent" because I can't remember the shortcut.


Even then, the character palette built into macOS is much more convenient for searching, saving, and inputting.


The character palette built into Windows uses too small a font; I find it almost useless. And the size is not adjustable. I've been tempted on more than one occasion to write my own.


Keyboards and typing aids can enter any character if you configure them properly.


not that HN needs more pedantry, but HN’s lightweight markup format is not in any sense a Markdown. I believe that literally the only thing they have in common is that a single set of asterisks yields italics. Bolding, headings, code, lists, quotes, links, etc. don’t transfer from one to the other.


That's because single quotes used to be rendered as a right single quote (as you might have in a contraction), and the backtick was angled much less aggressively. That is, it looked much more natural at the time.


Yeah, I think the motivation for `' is for markup too. I'm pretty sure they've been recommended in GNU info and groff for that reason.


It is, and denotes opening and closing quotes.


I always hated this horrible inconsistency.

    \left( \right)
    \left\{ \right\}
    `` ''
Why not

    \left" \right"
    \left' \right'
Better yet, make it completely DRY:

    \( \)
    \{ \}
    \`` \''
    \` \'


MS PGothic, a very common font in Japan, still uses this type of quote. "Quoting like this'' (double quote, then two single quotes) looks the most natural in this font. "Using two double quotes" looks quite odd (see screenshot) [1]

If you've ever seen an English-language page on a Japanese website that used weird quotes, this is probably why.

[1] https://i.imgur.com/zcuFZa1.png


Ah, the old dead-giveaway "this game was translated from Japanese and we CBA to handle localisation properly" fonts.


So I just saw parentheses like this "()" on Japanese Twitter.

It starts with a regular parenthesis "(" but ends with a fullwidth parenthesis ")" (U+FF09).

Is this a similar thing?


The usage of an accent as syntax in markup and programming languages annoys me to no end. And it is still being used to this day; the latest example is template strings in JavaScript.

• It is semantically idiotic because it's an accent, not a character.

• It is visually annoying because you almost can't see the thing.

• It is bad for usability, because on non-US keyboards the accents are implemented as dead keys. Yes, accent + space gives you the character but that's really unintuitive for people who grew up expecting accents only over letters.


Same, I’ve never cared for it. For these reasons I’ve decided to take a stand and avoid using the grave accent for anything in a programming language I’m working on. Same goes for the dollar sign, because it’s somewhat Americentric, and as a currency character it doesn’t have any great semantic or mnemonic value except for, well, currency units. I guess you could argue for $trings (BASIC) or $calars (Perl) if you have $igils, but I don’t.

Sacrificing these bits of ASCII is fine by me, because the language is small enough, and I also allow Unicode. For example, curved quotes are allowed and can be nested or contain ASCII quotes without escaping:

    // Character literals
    ‘'’
    =
    '\''

    // Text literals
    “Some "text" with “curved quotes”.”
    =
    "Some \"text\" with “curved quotes”."
For the sake of usability, of course, everything in the core language & standard library has an ASCII spelling, like in Perl6. I’d like for other languages to adopt this view as well. If new languages allow proper Unicode notation in some sensible places, then programming editors’ input methods will catch up, e.g., automatically replacing “->” with “→” or “\theta” with “θ” (like Emacs’ TeX input mode).

Also, does anyone know of a reference for keyboard layouts from around the world that includes estimates of the number of people using them? I’ve tried to keep things relatively easy to type on all the major layouts I know of, but I don’t want to alienate anyone if I can help it.


> Same goes for the dollar sign, because it’s somewhat Americentric, and as a currency character it doesn’t have any great semantic or mnemonic value except for, well, currency units.

By that metric, wouldn't & be too Anglo-centric, and # be too Euro-centric? There are layouts out there on which neither is readily available.


I suppose so. It’s just one of the many small judgement calls you make when designing a language, and definitely falls into the category of “design” more than engineering or science. At some point I decided that grave and dollar were out, while ampersand and octothorpe are in. And you can still define a dollar-sign operator if you want, it’s just not in the core language or standard library.

English is the lingua franca of programming, so it’s hard to avoid some Anglicisms (like ampersand meaning “and”, dot instead of comma for decimals, and English-language keywords) without going against strong precedents set by other languages. If I really wanted to be pedantic, I might use /\ and \/ for logical “and” and “or”—those spellings are the major reason that the backslash even exists in ASCII.


If anything, & is Latin rather than English :-)


It's very odd for me to see the grave accent (`) as a quoting mark in bash and other programming languages. I understand that the accent alone loses its function for human language, but it's still uncomfortable to see an accent as a delimiter for a string.


Well, we also use the dollar sign to signify a variable in bash and PHP even though we aren't talking about an amount of USD. Likewise we use single and doublequote to mean special things.

HTML tags have nothing to do with less than or greater than, yet here we are.

In conclusion, it's simply convenient to use the standard keys we have and to use the symbols that are on it to mean something different from their original meaning in order to be able to express ourselves succinctly so that we don't have to spend so much time typing as we'd otherwise have to.

Of course you could always buy yourself an APL keyboard and write your programs in APL and use an APL REPL as your command line instead of using bash ;)


I'm not even sure why ASCII has a grave accent. There are no combining marks so you could never write it over another letter.

Edit: I forgot HTAB was actually part of ASCII. Oh well!


On a teletype, ALL characters are combining marks because you can backspace (another ASCII character derived directly from teletype codes) and type another character overtop it.


Are you old enough to remember, from when you printed your code on a teletype machine, that blank spaces were represented by a "b" with a slash through it? I hated that.

Even worse, I remember one shop where the teletypes didn't have question marks, so people used capital P's instead.


In ASCII instead of combining marks what you have to do is write three characters:

* The unaccented letter

* The backspace character

* The accent character

If this makes no sense to you, try to imagine a literal, physical typewriter. Windows line terminators also work on a similar principle.


From what I recall from my childhood, physical typewriters worked slightly differently: the accent keys were non-advancing ("dead") keys. You pressed the "acute" key followed by the "e" key for an é, for instance. If you wanted a bare accent, you pressed the accent key followed by the space bar.

(The typewriters I recall also didn't have a 0 or 1 key, you used uppercase O or I for these numbers.)


Yes. I consider it a mistake of Unicode that combining characters follow rather than precede the base character. If they preceded, most dead keys could simply generate the appropriate combining character, rather than requiring complicated input method support. (And finding the end of a sequence of multiple combining character wouldn't require lookahead.)
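
You can see the ordering directly in Python:

    import unicodedata

    # NFC: one precomposed code point. NFD: base letter first, with the
    # combining accent *after* it, as described above.
    for form in ('NFC', 'NFD'):
        s = unicodedata.normalize(form, 'é')
        print(form, [unicodedata.name(c) for c in s])

    # NFC ['LATIN SMALL LETTER E WITH ACUTE']
    # NFD ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']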


The Unicode way makes sorting easier. Your way would require special knowledge about the characters to know that ä should sort directly after a, rather than directly before ë.


In fact this requires special language-specific knowledge anyway (which Unicode actually provides in some tables and algorithms). In some languages ä should sort exactly as if it were 'a': "aa", "äb", "ac". In others it should sort as a distinct letter (but not necessarily between 'a' and 'b'). Different Latin-alphabet languages sort differently; I'm not sure if exact UTF-8 (or UTF-16 or UTF-32) byte ordering is an appropriate collation for any Latin-alphabet language.

But I do suspect it had something to do with ASCII compatibility; I don't recall what. Very little of Unicode is accidental; there's usually some reason for whatever is in it.
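
The difference is easy to see in Python, assuming the relevant locales are installed (the outputs are what glibc's collation data typically gives):

    import locale

    words = ['ab', 'zz', 'äa', 'åx']
    print(sorted(words))   # ['ab', 'zz', 'äa', 'åx'] -- raw code-point order
                           # dumps every accented letter after 'z'

    # Swedish: å and ä are distinct letters *after* z.
    locale.setlocale(locale.LC_COLLATE, 'sv_SE.UTF-8')
    print(sorted(words, key=locale.strxfrm))   # ['ab', 'zz', 'åx', 'äa']

    # German: ä sorts together with a.
    locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')
    print(sorted(words, key=locale.strxfrm))   # ['äa', 'ab', 'åx', 'zz']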


some languages even sort

aa ah az ba bh bz ca cz ch

treating "ch" as a single letter that comes between c and d.

or

ab ah az b c .... z aa

treating aa the same as a separate letter at the end of the alphabet.

Then there's other rules for sorting that aren't directly alphabetic, like that names beginning with "Mc" should be treated as "Mac" or "St " as "Saint ".

"10 cats" should sort after "2 cats", not before it.

Anyone who tries to sort by just code-point order is doing it wrong.


Except not really, since in Swedish, it's sorted xyzåäö, not aåäbc. Also, it used to be that w and v were equivalent sorting-wise and you'd mix them together.


I believe that depended on the manufacturer and country convention. Most US keyboards didn't have an accent character. For example, here's one from the 1950s:

http://www.typewriters101.com/uploads/1/7/6/6/17660651/s7662...

For acute or umlaut you could use a + backspace + ' or u + backspace + " (or the opposite order). For grave or circumflex, I don't think there was a solution. Write it in by hand?


When I was in school back in the 1990s, that was certainly the approach taken for the Vietnamese edition of the school newsletter.


to add to the confusion:

′ PRIME (U+2032)

″ DOUBLE PRIME, aka the inch mark (U+2033)

have their own codepoints

http://practicaltypography.com/foot-and-inch-marks.html

which describes implications for typesetting coordinates and other things:

118° 19′ 43.5″

118° 19’ 43.5” wrong (curly quotes, although they render identically in some fonts)

118° 19' 43.5" right


And further confusion:

An ʻokina (U+02BB, as found in "Hawaiʻi") is neither an apostrophe nor a left quotation mark.

https://en.wikipedia.org/wiki/%CA%BBOkina


Those should be added to the document, with a note that they are NOT quotes!


I doubt this will ever be updated, since it's a reference for a very specific, 20-year-old, code-interpretation-related proposal and not meant for typesetting.

But for anything related to contemporary typesetting on the web I recommend Practical Typography and especially the Type Composition chapter:

http://practicaltypography.com/type-composition.html#links

Including notes on quotes and apostrophes:

http://practicaltypography.com/straight-and-curly-quotes.htm...

http://practicaltypography.com/apostrophes.html


Hmm. I was going to link the classic https://alistapart.com/article/emen (and I guess I just did), but I noticed a disclaimer at the top pointing out that it “is now obsolete.” And how! It proclaims that not enough text editors support UTF-8 yet, which thankfully hasn’t been true in ages.


It just occurred to me how much easier certain text operations (syntax highlighting, regular expressions, and other parsing) would be if we consistently used the right Unicode symbols for quotes and apostrophes.
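
For instance, with distinct open/close characters a plain regex can pull out a quoted span (one nesting level here; arbitrary depth would still need a counter), which straight quotes can't express at all:

    import re

    text = 'He said “this is a “nested” example” and left.'
    print(re.findall(r'“[^“”]*”', text))   # ['“nested”'] -- the innermost span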


The only languages I know off the top of my head that use balanced delimiters for strings are M4 and Perl 6.

Hey, imagine being able to nest strings without escaping! What a concept!


Perl does that as well, and you can even choose the delimiters you wanna use:

> For the constructs except here-docs, single characters are used as starting and ending delimiters. If the starting delimiter is an opening punctuation (that is (, [, {, or < ), the ending delimiter is the corresponding closing punctuation (that is ), ], }, or >). If the starting delimiter is an unpaired character like / or a closing punctuation, the ending delimiter is the same as the starting delimiter. Therefore a / terminates a qq// construct, while a ] terminates both qq[] and qq]] constructs.


PostScript! It uses (...) for strings.

Nesting string literals without escaping is a somewhat poor concept, though. Firstly, what does that even mean? Given `abc `def' ghi', what is the string here? Is it abc def ghi or is it abc `def' ghi? Secondly, what if I want to just have an unbalanced ` character in the string data?


Common Lisp doesn't have it built in, but the cl-interpol library adds this (and you can add your own custom delimiters too).

http://weitz.de/cl-interpol/#syntax


Not a language, but I adopted this concept as well: http://jstimpfle.de/projects/wsl/main.html

And I guess you could count HTML in, too.


And every time I get in an argument with a poorly-escaped CSV file, I wish we had just used ASCII 28-31 as delimiters. (File, Group, Record and Unit Separator)


FYI For a long time GNU coding standards prescribed using the grave accent, but this changed some years ago now

https://www.gnu.org/prep/standards/html_node/Quote-Character...


From the link:

> Although GNU programs traditionally used 0x60 (‘`’) for opening and 0x27 (‘'’) for closing quotes, nowadays quotes ‘`like this'’ are typically rendered asymmetrically, so quoting ‘"like this"’ or ‘'like this'’ typically looks better.

Is this link saying I can quit using `QUOTES' in my Emacs-documentation? That style always struck me as odd :)


Curved single quotes (‘...’) are recommended now:

https://www.gnu.org/software/emacs/manual/html_node/elisp/Do...


CSB: Years ago I was working on a team that developed a scripting language, and we had this recurring problem where someone would write up a code sample in a Word document and it would break when you cut and pasted it, because all of the single and double quotes would be Unicode. My boss was this tough guy who tried to snap the whole team to a standard of strictly disabling that behavior in all of our Office applications, but I piped up and said maybe we should just make the language treat all of those characters like apostrophes and quotes... I think around version 5 they finally made an API for doing proper anti-injection escaping, because you pretty much needed a PhD to get it right due to all of the variations introduced by the extended characters.
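
In miniature, that fix might look like this (a Python sketch of the idea; the real thing lived in the language's lexer):

    # Fold the common Unicode quote variants back to their ASCII forms
    # before the lexer ever sees them.
    QUOTE_MAP = str.maketrans({
        '\u2018': "'", '\u2019': "'",   # ‘ ’
        '\u201a': "'",                  # ‚
        '\u201c': '"', '\u201d': '"',   # “ ”
        '\u201e': '"',                  # „
    })

    def fold_quotes(source):
        return source.translate(QUOTE_MAP)

    print(fold_quotes('print(“hello”)'))   # print("hello")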


Or ... use a text editor?


You know a lot of PMs who write specs in notepad?


What bothers me about Unicode isn't that apostrophe (U+0027) is overloaded by having two semantic meanings ("apostrophe" or "single straight quote"), but that they exacerbate the confusion by recommending to overload "right single quote" (U+2019) to also mean apostrophe.

We now have two characters for apostrophe and extra ambiguity for processing correct right single quotes. Great job not breaking historical documents Unicode.


And now, imagine that your own name has an apostrophe in it. Like my family name. I can tell you, I have crashed many databases, and in 90% of the cases where people need to find my name again in a database, it ends up with them requesting my address, because each time the clerk doing the data entry enters a different character and they cannot match my name. Even state-level authorities are bad, really bad, at it.


> Please do not use the ASCII grave accent (0x60) as a left quotation mark together with the ASCII apostrophe (0x27) as the corresponding right quotation mark (as in `quote').

Tell that to GCC:

  /usr/lib/gcc/i686-linux-gnu/4.6/../../../i386-linux-gnu/crt1.o: In function `_start':
  (.text+0x18): undefined reference to `main'
Looks good to me, by the way.

> Where ``quoting like this'' comes from

I did it for a while out of a habit acquired from working with TeX. In TeX, it is the source code syntax for encoding quotes. Of course, it is lexically analyzed and converted to proper typesetting.

> If you can use only ASCII’s typewriter characters, then use the apostrophe character (0x27) as both the left and right quotation mark (as in 'quote').

It looks like shit in any font in which the apostrophe is a little nine, which is historically correct. What you want is a little "six" on one side and a "nine" on the other, or at least some approximation thereof. Even if the apostrophe is crappily rendered as a little vertical notch, it still pairs with a backwards-slanted `.

(The representation of apostrophe as a little vertical notch, I suspect, caters to literals in programming languages.)

> If you can use Unicode characters ...

then you should still stick to ASCII unless you have other good reasons to. ``Can'' is not the same thing as ``should'', let alone ``must''.

> For example, 0x60 and 0x27 look under Windows NT 4.0 with the TrueType font Lucida Console (size 14) like this:

The idea that people should change their behavior because of which font is default on the Windows cmd.exe console is laughable.


> then you should still stick to ASCII unless you have other good reasons to.

Why? Using non-ASCII Unicode characters acts as a nice canary for detecting character-encoding issues. Besides, why would I purposely limit my text to ASCII? It doesn't even suffice for English, let alone almost any other language I use — including my native Dutch, plus German and Japanese.


All sorts of reasons. Diagnostic printf message in some embedded firmware. Do you need to drag Unicode into it? Git log message. Ditto.


> Diagnostic printf message in some embedded firmware. Do you need to drag Unicode into it?

Why not? The firmware itself would usually have no reason to care about the details of a diagnostic message's encoding, whether that be ASCII or UTF-8 - it can mostly just treat strings as bags of bytes. There might be some byte values that are special (nul terminator, % for printf, etc.), but UTF-8 is a superset of ASCII and represents extended characters using only bytes with the highest bit set, so there will never be 'false positives' of the special byte values. Other than that, the bytes can stay uninterpreted as they go over whatever serial port or diagnostic protocol the device is using, until they eventually show up on - most likely - some sort of terminal application on a modern computer, which probably supports UTF-8 already. So in most cases it should 'just work'.
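
A small Python check of that property (the message text is made up):

    msg = 'température: 100%'
    data = msg.encode('utf-8')
    print(data)   # b'temp\xc3\xa9rature: 100%' -- é is two bytes, both >= 0x80

    for probe in (b'\x00', b'%', b'"'):
        # An ASCII byte occurs in the UTF-8 bytes only where that character
        # occurs in the text, never inside a multi-byte sequence.
        assert (probe in data) == (probe.decode('ascii') in msg)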

Of course, there are situations where it won't just work, such as if the firmware needs to display the diagnostic message on a screen (by itself), but from what I've seen those are the minority.

edit: As for Git, what's wrong with people writing log messages in their language of choice? (Other than the social issue of it making it harder for English speakers to use the codebase.)


Not specifically trying to weigh in on the overall conversation, but aren't git commands generally UTF-8?

> git commit and git commit-tree issues a warning if the commit log message given to it does not look like a valid UTF-8 string, unless you explicitly say your project uses a legacy encoding.

> git log, git show, git blame and friends look at the encoding header of a commit object, and try to re-code the log message into UTF-8 unless otherwise specified.

[from https://git-scm.com/docs/git-commit]


> It looks like shit in any font in which the apostrophe is a little nine, which is historically correct. What you want is a little "six" on one side and a "nine" on the other, or at least some approximation thereof. Even if the apostrophe is crappily rendered as a little vertical notch, it still pairs with a backwards-slanted `.

> (The representation of apostrophe as a little vertical notch, I suspect, caters to literals in programming languages.)

"Historically", U+0027 has been used as all of an opening quote, a closing quote, an apostrophe, a prime symbol, an ʻokina, a modifier, etc.

So the historically correct thing is to render it as a vertical notch so it looks non-horrible in all these uses, and to render U+2018 and U+2019 as the "little six" and "little nine" symbols.

You don't need to speculate what the representation caters to; the Unicode spec actually does explain this (see Unicode 9.0 Chapter 6 Section 2)...

> The idea that people should change their behavior because of which font is default on the Windows cmd.exe console is laughable.

So your alternative is to change behavior because of which font is default on a system from 1984 which no one uses anymore?


Open any random book in the English language printed in the last 200 years.

All the apostrophes look like a little nine: in contractions like it's, and the possessive 's.

That's the character that was included in the American Standard Code for Information Interchange.

Image: http://www.worldpowersystems.com/J/codes/X3.4-1963/page5.JPG

The glyph appearing in the standard looks like a little 9. It is denoted as "APOS" in parentheses. A reference to it is made in A6.8, calling it "apostrophe".

Wikipedia's (https://en.wikipedia.org/wiki/Apostrophe) page refers to a vertical notch glyph as a "typewriter apostrophe". The normal non-typewriter apostrophe looks like a comma.

Okina? That indicates a glottal stop in some languages, none of which is English, so it was understandably not represented in the American Standard Code.


> That's the character that was included in the American Standard Code for Information Interchange.

Yes, and that's the character that was immediately overloaded to mean a whole bunch of other things, because ASCII only included 95 printable characters and did not include a prime symbol, an 'okina, or a left single quotation mark.

For that reason, U+0027 is not an apostrophe anymore. As the only ASCII character that can be used for a long list of uses, it's been massively overloaded, which is why Unicode currently defines U+0027 as a typewriter apostrophe and U+2019 as a real apostrophe.


gcc will output fancy Unicode quotes if you set the locale. This is of course even more fun if LANG is set incorrectly and you still have an 8-bit xterm; then the entire quoted string just disappears!


Things become really fun when you're trying to figure out why that command fails when you've copy/pasted it from another application window.

Often it's the quotes which have been silently (automatically) converted to a visually similar (but functionally incompatible) character variant.


It seems that half of the people in this company use the grave accent ` as an apostrophe instead of ' or ’. Unfortunately, it's the half that creates presentations and talks to customers.

It looks terrible and to me it's a disgrace!

Example: it`s versus it's or it’s. (The first one is wrong.)


I've seen a café which had its name written in large, lit letters on the façade and it included the following gem: Cafe`. Yes, the wrong accent, and not even combined. Easy access to DTP tools (or even a word processor) for the typographically uneducated masses ends up with quite painful results sometimes.


In another life, I analyzed enterprise data. Variation in quotation marks was a common problem. I mean, is it "D'arcy" or "D’arcy"? Sometimes, I think, people would mangle data in spreadsheets, with auto-correct on.


While I can’t expect many to follow suit, I myself often type educated quotes and nice apostrophes. The macOS keyboard combinations (nearly-intuitive combinations of Option-(Shift)-[ and -] for “”‘’) have long been committed to muscle memory. And since nearly all (web) file formats seem to be UTF-8, the days of manually typing &ldquo; and friends are long, long gone.

Benefits of typing and using typographer’s quotes directly in your JS/JSON/HTML/source:

1. No backslashes or other escape sequences needed!

2. WYSIWYG

3. Retina screens and gorgeous modern fonts mean that your sloppy quotes will look extra bad if you just use ASCII quotes


I would fain use the curly quotes if only Darwin's groff(1) wouldn't barf on them. For the time being, man pages for one still need to quote like ``this''.


In troff you can escape “ ” ‘ ’ as \(lq, \(rq, \(oq, \(cq respectively.

If you’re writing manpages, though, you should be using the -mdoc macros (https://manpages.bsd.lv/mdoc.html), which have “Dq” and “Sq” macros that wrap the arguments in double and single quotes respectively.


brew install groff? That’ll get you 1.22.3 instead of the default 1.19.2.


I find it interesting that the article includes a German keyboard that doesn't include the proper ,,'' (or ,') quotation glyphs. However, it does include grave and acute accents, as well as the French primary quotation marks (<< and >>), though not the secondary guillemets (< and >), none of which are used in German text.

And of course I used ascii analogues to type these into HN :-(


>And of course I used ascii analogues to type these into HN :-(

But why, though? To the best of my knowledge, HN supports unicode quite well, including the following quotes: »«›‹„“‚‘ (available with the help of AltGr and sometimes shift from keys y, x, v, b when selecting the German keyboard layout on my computer).


I'm using a travel laptop on a plane, and it came with a US keyboard.


> The Unix m4 macro processor is probably the only widely used tool that uses the `quote' combination as part of its input syntax; however, even that could be modified via changequote.

I remember staring at the file for a long time when I first saw an m4 macro. My brain was telling me: surely this has got to be a typo. But then everything worked as expected. Then I learned that's the proper way of quoting there.


It's a little bit off-topic since the article was primarily about quotation marks and coding, but it would have been good if it mentioned that an ʻokina (as found in "Hawaiʻi") is neither an apostrophe nor a left quotation mark.

https://en.wikipedia.org/wiki/%CA%BBOkina


It's worse for other languages. Russian quotation marks are « and ». Thanks to early computers being predominantly from/designed in the US, they have now been hijacked by American quotes.

Same probably goes for French and other languages with their own sets of quotation marks.


«Russian» quotation marks are actually the « French » ones with different spacing. There's another, less-used set of quotes in Russian, the so-called „German“ ones (used as inner quotes and in handwriting). English quotes are widely accepted though.


Modern Chinese usage includes all of 《》〈〉「」『』【】“” and probably others, roughly in that order. Modern typographic convention is perhaps 《title》「quote」, but that's surely a matter of opinion and debatable. Hong Kong and Taiwan have their own typesetting conventions, distinct from mainland China, and in the latter case no doubt influenced by Japanese occupation and cultural inflow (manga, etc.). For most of Chinese history, the written language had no punctuation at all, and sentence endings were merely inferred from context, which was historically clearer 也. See https://en.wikipedia.org/wiki/Chinese_punctuation and https://en.wiktionary.org/wiki/%E4%B9%9F#Definitions (definition #4)


So why is there no separate straight single quotation mark, when there is a straight double quotation mark? I get that it probably arose for compatibility reasons, but nowadays shouldn't Unicode be able to offer something?

P.S. Major coincidence: I was googling this very question yesterday.


For reference, the BIOS text-mode font included with some IBM PCs (I've observed this on NetVistas and ThinkPads myself, at least) renders ` as a nice-looking opening quote, and ' looks like a nice closing quote.


Honestly I've been seeing `quote' in bash and other CLIs for my entire career and always thought they were just funny or strange, but carried no meaning.


MRI ruby still does this in some error messages. I hate it. Always messing up my copy-and-paste into `` markdown too.


Could do with a (2007) suffix.



