Learning about writing by seeing only the punctuation (medium.com/creators-hub)
74 points by NaOH 48 days ago | 61 comments

Punctuation is also an interesting way of spotting non-native English speakers.

Do they quote with « guillemets » or leave spaces before punctuation marks ? French.

Decimal comma instead of a period, spaces for thousands separators, or currency sign after the number instead of before it? I'll bet you 5 000,00€ they're European.

„First quote mark aligned to the bottom“? East European.

Group numbers larger than 1000 by twos instead of threes? 1,00,000 rupees they're Indian.

Full width letters? East Asian, likely Japanese.
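The grouping conventions above differ only in where the separators fall, which a short sketch can make concrete. The function name `group_digits` and its parameters are made up for illustration, and it assumes a non-negative integer:

```python
def group_digits(n: int, first: int, rest: int, sep: str = ",") -> str:
    """Group the digits of a non-negative integer from the right.

    `first` is the size of the rightmost group; `rest` is the size of
    every group to its left.
    """
    digits = str(n)
    head, tail = digits[:-first], digits[-first:]
    groups = [tail]
    while head:
        groups.append(head[-rest:])
        head = head[:-rest]
    return sep.join(reversed(groups))


print(group_digits(100000, first=3, rest=3))         # 100,000
print(group_digits(100000, first=3, rest=2))         # 1,00,000
print(group_digits(5000, first=3, rest=3, sep=" "))  # 5 000
```

The Indian style is the only one of the three that needs two different group sizes: three digits on the right, then twos.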

Over the years, the best way to identify someone from the US is if they use obscure acronyms (or brands) and expect others to understand them. It happens all the time on HN.

“I went to my RN”

“I always try to do things via Colosia” (made up brand)

“Our HSA allows us to go to my bank”

It's interesting that German also has widespread use of acronyms and they're totally different from the English ones (both in terms of what the acronym is, and which concepts are denoted with an acronym). E.g. WG = shared apartment (Wohngemeinschaft), LKW = truck (Lastkraftwagen), FKK = nudism (Freikörperkultur).

Also German-speaking countries seem to have standardized acronyms for pieces of legislation (XyzG), but, unlike the case in the U.S., these acronyms are never pronounceable as words or meaningful. In the U.S. there is a huge sport of making legislative titles spell out something clever. Like CLOUD Act, USA PATRIOT Act, CARES Act, CAN SPAM Act, USA FREEDOM Act, REAL ID Act, and so on.

> It's interesting that German also has widespread use of acronyms and they're totally different from the English ones (both in terms of what the acronym is, and which concepts are denoted with an acronym). E.g. WG = shared apartment (Wohngemeinschaft), LKW = truck (Lastkraftwagen), FKK = nudism (Freikörperkultur).

I would say, German has abbreviations (not acronyms) due to extremely long compound words. Hence, they are meaningful.

The American equivalent identifier is including punctuation (full stops, periods, sometimes even question marks) inside quotations, even when that punctuation was never part of the quote.

Although that seems to be much less common online than in print.

As an American, I refuse to do this. The purpose of written language is to share information. When using a quote, you're trying to share what was said/written. Ergo, adding arbitrary punctuation when it wasn't in the original quote defeats both the purpose of the quote and the writing.

It’s literally AP style.

And never made any sense to me.

AP exists for English teachers to be able to quantify their largely unquantifiable profession in some way. In order to keep cashing in on the sham, they continue to rewrite their rules yearly to be more bizarre and unintuitive every time.

You might be thinking of the APA style guide. I think the parent was talking about the AP Stylebook, which is the style guide used by the Associated Press.

Yes, Associated Press. And to clarify, that style (at least back when I was doing this) was typically followed by print publications, as the “standard.”

A notable exception being the New York Times, but now we are at the edge of my knowledge about this.

I’m grateful it’s falling out of favor. Even when I was taught this rule and eager to follow rules, I found it absurd. Even if everyone follows the rule in good faith, it’s still easy to misconstrue or mischaracterize a quote by accident.

Dutch people are taught to do this as well.

I'm really weirded out by the lack of spaces following ".", ",", and ")", as well as preceding "(", in a lot of Indians' online writing.

You might see a parenthetical(with no space adjacent to the parentheses at all).Or a full stop,and then a sentence immediately following it with no interruption.This looks incredibly strange to me.Essentially I imagine it's as though the punctuation is felt to contain its own built-in spacing or something.

Is it taught this way in school in India? Does it appear this way in professionally published books in English?

Note: I'm from India.

Indeed, a lot of Indian people do not know the punctuation rules. I frequently educate them about it. Some however refuse to correct themselves.

It is not taught that way in India. However, I am seeing even English teachers developing these habits.

Books and print media written by Indian authors are usually fine as of now. However, the way it is going, I would not be surprised to see this style of writing becoming a standard as a local dialect over the years. :-(

> would not be surprised to see this style of writing becoming a standard as a local dialect over the years. :-(

I want to address the ":-(" in your comment.

Languages evolve as people decide. English today is not what it was 100 years ago. It won't be the same 100 years from now. I, for one, look forward to the rich diversity.

I agree overall. Diversity over time leads to innovation too.

Weirdly this bothers me in code but in prose I just breeze past it without a concern.

If you see relative clauses, that have extraneous commas, they may be German. They're also likely, to put them before infinitives.

And in Commonwealth English you might find:—

- the 'dog's bollocks', a colon followed by a dash or hyphen.

Group numbers by the myriad? 1,0000,0000 times more likely to be East Asian.
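Grouping by the myriad just means a uniform group size of four instead of three. A minimal sketch, assuming a non-negative integer (the helper name `group_from_right` is hypothetical):

```python
def group_from_right(n: int, size: int, sep: str = ",") -> str:
    """Split the digits of a non-negative integer into equal groups, counted from the right."""
    digits = str(n)
    # The leftmost group takes whatever is left over (or a full group if nothing is).
    first = len(digits) % size or size
    parts = [digits[:first]]
    parts += [digits[i:i + size] for i in range(first, len(digits), size)]
    return sep.join(parts)


print(group_from_right(100_000_000, size=4))  # 1,0000,0000
print(group_from_right(100_000_000, size=3))  # 100,000,000
```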

While it is true that many East European countries use „First quote mark aligned to the bottom“, those are originally the German quotation marks, like the « guillemets » are the French quotation marks and the quotation marks normally used in English are the British quotation marks.

These 3 kinds of quotation marks have spread from their countries of origin to many other countries.

That parallels the typewriter keyboards. The American/British is QWERTY, the French is AZERTY and the German is QWERTZ.

Many East European countries had also used the German QWERTZ typewriter keyboard, before the transition to modern computers, which eventually made QWERTY more popular.

If the percent sign comes before the number as in %10, they are Turkish or Persian.

Are numbers read right-to-left in Arabic script, as in ones first, then tens, etc.?

Edit: this led me down a Wikipedia rabbit hole trying to find the history of alphabets and writing direction for Turkish and Farsi, but I still don't know the answer to the number order question.

Numbers in Arabic are still left-to-right: 19 is still ١٩, not ٩١. It makes designing forms fairly annoying, because for text fields you want RTL input, but for fields like phone numbers, you need to override the locale's RTL and ask for LTR input.

This is a nightmare when copying numbers that have spaces between every 3 digits on a system set to Arabic: a mobile number, for example, gets its digits reversed when copied.

I'd assumed they were right-to-left and little-endian, as opposed to left-to-right and big-endian.

So if you were reading out digits in amongst text, you'd skip to the left end of the digits and read them out left-to-right?

No, numbers read left to right, even when using Arabic numerals and when the rest of the text is RTL. Even more oddly, negative signs are typically placed on the right side.
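For anyone who wants to check the digit order themselves, here is a small sketch using Python's standard `unicodedata` module. The string is written with escapes so the source itself contains no right-to-left characters; it holds the Arabic-Indic digits one and nine, in that storage order:

```python
import unicodedata

# U+0661 is ARABIC-INDIC DIGIT ONE, U+0669 is ARABIC-INDIC DIGIT NINE.
# The most significant digit is stored first, exactly as with ASCII "19".
nineteen = "\u0661\u0669"

# int() accepts any Unicode decimal digits and reads them in storage order.
print(int(nineteen))  # 19

for ch in nineteen:
    # Bidi class "AN" (Arabic Number): digit runs keep left-to-right
    # visual order even inside right-to-left text.
    print(unicodedata.name(ch), unicodedata.bidirectional(ch))

# An Arabic letter, by contrast, has class "AL" and is laid out right to left.
print(unicodedata.bidirectional("\u0627"))  # AL
```

So the stored order is big-endian and the displayed order is left-to-right; only the surrounding letters flip direction.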

> or leave spaces before punctuation marks

I've picked up the habit of doing this (in chat messages at least) from working with a lot of Europeans for years (I'm American). I like how it gives extra emphasis to question & exclamation marks.

This was standard practice in English-language print through about 1800, as I've discovered reading numerous older texts (in original printings). The practice seems to have been preserved longer in the case of a subset of punctuation, typically semicolons and colons.

Another practice I find ... jarring ... is of not putting spaces around em dashes in text. I find "word --- word" reads far more sensibly than "word---word". The former clearly indicates a break in text, the latter might be mistaken for a compound word (though from context this is virtually always clearly not the case).

One more: two spaces instead of one after a . and you're most likely looking at someone old, like 60-70 years old. I am now confused as to whether this is just in Italy (I was born there), just in the US (lived there for ~9 years), or both.

It was commonly taught in typing classes through the late 1990s (at least). I do it myself and am a bit younger than the age range you mention. A likelier criterion is whether someone took a formal typing class or worked in a profession related to typing before the advent of WYSIWYG word processors.

The justification (no pun intended) for the double space after period is that it looks better, or closer to traditional typesetting, when text is rendered in a monospaced font. While this is still my experience with reading and writing e-mail (and other text) in a terminal, it's not the way most people now compose text on most devices, since most typing is done in software with proportional fonts.

> Decimal comma instead of a period, spaces for thousands separators, or currency sign after the number instead of before it? I'll bet you 5 000,00€ they're European.

Most of South America also uses commas for decimal separator.

Brazil also uses commas!

> currency sign after the number instead of before it

While they use Euros, the Republic of Ireland is an exception to this.

I notice you couldn't stop yourself from inserting a space after the fullwidth question mark and comma.

(As a followup, Chinese in China are more likely to type English in NORMAL WIDTH ALL CAPS than to use fullwidth characters. Fullwidth characters don't really have a use.)

Surely you mean 10 crore rupees!

I'm a professional writer and editor. This is extremely cool to me. The comparison between Blood Meridian and Absalom, Absalom! really says all you need to know about this tool. Personally I find it to be a really fascinating and original way of looking at prose.

Link to the actual tool the author made for creating your own punctuation visualization: https://just-the-punctuation.glitch.me/

I wonder if this could be used to fingerprint an author. I'm sure word choice and structure would be a much better signal but also much harder to develop a model for (I speculate, I'm by no means an expert here). Fewer input tokens (just the punctuation) would reduce the model's complexity, would it not?

Would be interesting to see a classifier based on this.

Researcher in authorship analysis here. Yes, punctuation specific patterns have been used in the literature for this purpose, quite a bit. Removing the words also reduces the impact of the content of an article on your results (i.e. accidentally guessing the author because the topic is baseball and you only have one writer who writes about baseball in your dataset).

Typically, this type of feature is generalised as "character n-grams", but there are lots of variants.
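A rough sketch of what punctuation n-gram features could look like. This is a toy illustration, not a method from the authorship-analysis literature: the function names are made up, only ASCII punctuation (`string.punctuation`) is counted, and cosine similarity is just one possible way to compare two profiles.

```python
import math
import string
from collections import Counter


def punctuation_only(text: str) -> str:
    """Keep only ASCII punctuation characters, in their original order."""
    return "".join(ch for ch in text if ch in string.punctuation)


def ngram_profile(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams in the punctuation stream."""
    stream = punctuation_only(text)
    return Counter(stream[i:i + n] for i in range(len(stream) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


sample_a = "Well, yes; but, no. Well, maybe; then, again."
sample_b = "Stop! Now? Stop! Now?"
print(ngram_profile(sample_a).most_common(3))
print(cosine_similarity(ngram_profile(sample_a), ngram_profile(sample_b)))
```

Curly quotes, em dashes and other non-ASCII marks are ignored here; widening the character set would be the first thing to change for real prose.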


i also wanted to answer the same question. so i pasted two essays from one of my favorite authors in the tool. the punctuation summaries looked very different.

the two essays were by Isaac Asimov -- he does have a somewhat recognizable writing style. one essay was on physics and one on demonology.

then again, nlp has "stop words". i guess maths and physics would have math symbols stripped in that line of research?

Very interesting, thanks for the answer - strangely (but pleasantly) relevant area of study!

If I wanted to obscure my identity via text, changing my word choice, slang, colloquialisms, etc., would comprise my first naive attempt. I find myself changing the punctuation I use when pretending to be someone else, or trying to convey some personality traits that aren't natural to me. For example, one of my friends always puts a space before question marks and exclamation points.

I once ran Satoshi et al.'s writings through some simple stylometry service. If I recall correctly, my search pointed to Nick Szabo. A quick Google search shows that others have come up with Szabo, Finney or Dai. I am now super curious as to what the results would be if punctuation was analyzed exclusively... Unfortunately, I don't have time at the moment to pursue this tangent.

My immediate thought on this is yes, but you would also want to plug in the length of the sample in characters and the word count (to get a sense of the average length of word for each author). This would actually be a useful tool in some industries I can already think of…

I know—both from personal frustration with concision, and from external critique—how much I depend (overly) on punctuation to express my thoughts; it’s gotten a lot better, but… I don’t think I’ll ever be fully satisfied.

Hey, that long hyphen with no spaces! Why is it so? It works like parentheses but I never got the nuance.

It’s an em-dash, and I think of it like parentheses but less of an aside—and implicitly closed when used at the end of a sentence.

Here's the punctuation of the comments section so far...


What does that regex match?

It is a valid regex aside from this bit [::]

That heavily depends on its flavor.

Neat, but I’m missing the part where the author learned anything beyond the obvious:

> … when you look at those writers’ punctuation, you can see, in a quick glance, how different they are.

Wild. But beyond this and pretty pictures, what have we learned?

Starting at

> So, what did I discover about my own writing?

There's multiple paragraphs on what they've learned about their own writing.

Ouch, you are absolutely right. I’d been reading the article on my phone and it cut off around the ellipses. I missed 3/4 of the article.

Would also be interesting to see the same thing, but with spaces where things other than punctuation were. Run-on sentences would stand out pretty well.
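One way to try the spaces idea, as a minimal sketch: keep each punctuation mark where it is and blank out everything else, so the gaps show how far apart the marks sit. The function name is made up, and only ASCII punctuation is kept.

```python
import string


def punctuation_skeleton(text: str) -> str:
    """Replace every non-punctuation character with a space, keeping newlines."""
    return "".join(
        ch if ch in string.punctuation or ch == "\n" else " "
        for ch in text
    )


print(repr(punctuation_skeleton("Wait, what? No!")))
```

The output has the same length as the input, so a long run of spaces before the next full stop corresponds directly to a long stretch of unpunctuated text.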

    echo "$thing" | tr -dc '[:punct:]'

Running it against a bunch of my writing does produce something that feels like a common pattern of symbols. Though there doesn't seem to be much commonality between my fiction and non-fiction writing, where I'd expected at least a little bit of an overlap.

I wonder if you could use this to tell programming languages apart.

I think you absolutely could. Here's a few files I had lying around, take a guess: https://imgur.com/a/iTNJ8zw

Fair enough, I'll play :)

The first one looks like some kind of configuration language, most likely JSON due to lots of matching { and }, colons and no semicolons.

The second one I really can't tell. It's most likely some sort of curly brace language, Typescript? Java?

The third one really looks like Python: no semicolons, no braces, [ and ] for list/dict access, # for comments, @ for decorators.

The fourth is also tricky. I want to say HTML-ish because of the </> sequences, but it also looks like a curly brace language! I'm going to guess it's a JSX React file.


2. Typescript (Angular)

3. Python

4. Javascript (React)

Spot on with the guesses!
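The cues used in the guesses above (braces, semicolons, colons, `#`, `@`, `</` sequences) can be turned into a toy classifier. This is only a sketch of those heuristics with made-up category names; it will misfire on plenty of real files, for example Python code that is mostly dict literals.

```python
import string
from collections import Counter


def guess_language(source: str) -> str:
    """Guess a language family from punctuation alone (toy heuristic)."""
    punct = "".join(ch for ch in source if ch in string.punctuation)
    counts = Counter(punct)
    has_braces = counts["{"] > 0 and counts["}"] > 0
    has_semicolons = counts[";"] > 0

    # Closing-tag sequences alongside braces suggest markup embedded in code.
    if "</" in punct and has_braces:
        return "JSX-like"
    if has_braces and has_semicolons:
        return "curly-brace language"
    # Braces and colons without semicolons look like a config format.
    if has_braces and counts[":"] > 0:
        return "JSON-like"
    if not has_braces and not has_semicolons and (
        counts["#"] or counts["@"] or counts[":"]
    ):
        return "Python-like"
    return "unknown"


print(guess_language('{"a": 1, "b": [2, 3]}'))
print(guess_language("def f(x):\n    return x[0]  # first\n"))
```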

I like the idea. When I'm about to start writing, I often prepare by reading an exemplary author. I focus above all on their form: how they begin, pace and punctuate sentences. But I think the tool, while interesting, isn't quite as helpful - just seeing the punctuation by itself is gobbledegook to me.

One element of a feature set that may be used as part of a broader forensic analysis of writing samples...or for obfuscating the origin of one's writing.
