
English Syntax Highlighting - azdle
http://evanhahn.github.io/English-text-highlighting/
======
foobarbecue
It would be interesting to see the major parts of speech (nouns, verbs,
adjectives) colored. Instead this is a coloring of fairly random words. A
bunch of short words are grey, but they don't belong to any particular part of
speech. They include some articles, prepositions, conjunctions and a few
verbs...

~~~
cyphar
I think we'd need to solve computational linguistics for a completely accurate
parser to be able to tag the words properly. Although, the current state of
the science works for something like 80% of cases (which are the "easy" ones).

~~~
verroq
Sorry, but POS tagging is pretty much solved already. Per-word accuracy is
already above 97% [1]. Current papers are now mostly improving it by less than
one percent.

1\. [http://nlp.stanford.edu/pubs/CICLing2011-manning-
tagging.pdf](http://nlp.stanford.edu/pubs/CICLing2011-manning-tagging.pdf)

~~~
Gorgor
It's true that POS tagging works fairly well. But consider that a sentence
involves more than one word. Even at 97 % accuracy for one word, the
probability of correctly tagging every word in a short sentence of only ten
words, is still as low as 0.97^10 = 0.74. And sentences are generally longer
than ten words.
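
A quick sketch of that arithmetic, assuming tagging errors are independent from word to word (a simplification; the function name is just illustrative):

```python
def sentence_accuracy(per_word_accuracy: float, sentence_length: int) -> float:
    """Chance that every word in a sentence is tagged correctly,
    assuming each word's tag is right or wrong independently."""
    return per_word_accuracy ** sentence_length


# Sentence-level accuracy drops quickly as sentences get longer.
for length in (10, 20, 30):
    print(length, round(sentence_accuracy(0.97, length), 2))
```

For ten words this gives the 0.74 above; at twenty words it is already down to roughly one sentence in two.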

And as POS tagging is usually only done as preprocessing for some other task
like syntactically parsing a text (which itself is usually preprocessing for
yet another task), 97 % accuracy per word is not as good as it sounds. Parsers
need to work with wrong data for every second or third sentence.

~~~
philh
Indeed: the first paragraph of the linked paper says "Current good taggers
have sentence accuracies around 55–57%".

(This surprises me. I would expect accuracy for different words in a sentence
to be correlated, you either make no errors or several.)

~~~
Gorgor
Whoops, I didn't even look into the paper. Kind of makes my comment
superfluous …

------
susi22
In German (and other languages) you capitalize every noun. When I was younger
I found it confusing that English didn't do that. It seems that this kind of
syntax highlighting allows your brain to read text a little bit faster since
it instantly knows that this will be a noun. It's also a little annoying if
people write German and don't properly capitalize: your brain just doesn't
expect the word to be a noun if it's not capitalized.

* [http://www.ruediger-weingarten.de/Texte/Capitalization.pdf](http://www.ruediger-weingarten.de/Texte/Capitalization.pdf) (pg 4ff, last paragraph)

* [https://mindmodeling.org/cogsci2013/papers/0462/paper0462.pd...](https://mindmodeling.org/cogsci2013/papers/0462/paper0462.pdf)

* [http://linguistics.stackexchange.com/questions/699/does-capi...](http://linguistics.stackexchange.com/questions/699/does-capitalization-of-nouns-aid-reading-comprehension/700)

~~~
ubernostrum
English used to do this. You can still see evidence of it in things like the
US Constitution, where the capitalization seems random until you remember
English is a Germanic-family language and did this as recently as a couple
hundred years ago:

 _We the People of the United States, in Order to form a more perfect Union,
establish Justice, insure domestic Tranquility, provide for the common
defence, promote the general Welfare, and secure the Blessings of Liberty to
ourselves and our Posterity, do ordain and establish this Constitution for the
United States of America_

 _I._

 _All legislative Powers herein granted shall be vested in a Congress of the
United States, which shall consist of a Senate and House of Representatives._

 _The House of Representatives shall be composed of Members chosen every
second Year by the People of the several States, and the Electors in each
State shall have the Qualifications requisite for Electors of the most
numerous Branch of the State Legislature._

etc.

~~~
Al-Khwarizmi
Why is "defence" not capitalized then? A typo?

~~~
FoeNyx
That was bothering me too.

At least, it is not a typo from ubernostrum, it's spelled "defence" in the
original [1].

Moreover that's the British spelling of defense (is there a link with the lack
of capital?).

Also note that, instead of "Blessings", it is "Bleſsings" in the original. But
they are roughly equivalent if you apply a compatibility decomposition in your
unicode normalization (NFKD/NFKC).
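
For the curious, a small sketch with Python's standard `unicodedata` module shows the difference (the variable name is just for illustration). The long s, U+017F, has a compatibility mapping to a plain "s", so only the K forms fold it:

```python
import unicodedata

# "Bleſsings" spelled with U+017F LATIN SMALL LETTER LONG S
original = "Ble\u017fsings"

# Canonical forms (NFC/NFD) keep the long s; compatibility forms
# (NFKC/NFKD) replace it with an ordinary "s".
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, unicodedata.normalize(form, original))
```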

So where is the edit button for that constitution? Or do they only accept pull
requests?

[1]
[http://www.archives.gov/exhibits/charters/charters_downloads...](http://www.archives.gov/exhibits/charters/charters_downloads.html)

~~~
Al-Khwarizmi
Someone asked the same question here

[https://www.quora.com/Why-is-the-word-defence-the-only-
uncap...](https://www.quora.com/Why-is-the-word-defence-the-only-
uncapitalized-noun-in-the-Preamble-to-the-U-S-Constitution)

but at least in my browser, I can't see the actual answer (it says "2 Answers"
but I can only see one, which just confirms that the word is not capitalized).

------
wellpast
Syntax highlighting traditionally chooses a different color for each token in
this Java statement:

    
    
        final String id = leader(NAMES_AND_SCORES);
    

If I try to translate this statement into English:

    
    
        Given a global list of names and scores, determine the leader's 
        id. (Ensure that id is a string of characters.) I'll use "id" to
        refer to that leader throughout this paragraph.
    

If our traditional highlighting approach is generally correct for the code,
shouldn't I be highlighting each sentence and/or phrase wholly with one color
and not highlight per parts-of-speech?

Or in other words, does the analogy being proposed really hold?

Or another take -- speed readers take in whole sentences at a time. Colorizing
parts-of-speech this way would only seem to slow them down whereas syntax
highlighting code speeds my reading. I'm sure there's an analysis here; final,
String, leader() are not parts-of-speech; each is a separate semantic
statement.

~~~
cjhveal
We don't highlight each line of code separately, which would be the analogue
to highlighting natural language at the sentence level. We highlight tokens
based on their syntactic type. Strings are all colored the same. Operators are
colored the same. That's pretty much the same idea as coloring all common
parts of speech the same.
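
As a rough illustration of coloring by token type, here's a minimal sketch of a regex-based classifier for that Java statement. The token classes, the tiny keyword and type lists, and the names `TOKEN_SPEC` and `classify` are my own assumptions, not how any real highlighter is implemented:

```python
import re

# Ordered (token type, pattern) pairs; earlier entries win on ties.
TOKEN_SPEC = [
    ("keyword", r"\b(?:final|static|return|new)\b"),
    ("type", r"\b(?:String|int|boolean)\b"),
    ("constant", r"\b[A-Z][A-Z0-9_]+\b"),
    ("identifier", r"\b[A-Za-z_]\w*\b"),
    ("operator", r"[=+\-*/]"),
    ("punctuation", r"[();,.]"),
    ("space", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def classify(source):
    """Return (token type, text) pairs, skipping whitespace."""
    return [
        (match.lastgroup, match.group())
        for match in TOKEN_RE.finditer(source)
        if match.lastgroup != "space"
    ]


for kind, text in classify("final String id = leader(NAMES_AND_SCORES);"):
    print(f"{kind:12} {text}")
```

A highlighter would then map each token type to one color, so every identifier looks alike no matter which line it sits on.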

Parsing is the act of taking a linear string of tokens and building a tree out
of them. That means reading in a string of tokens and applying the parsing
rules (which may be encoded as a set of fuzzy correlations when humans learn
those rules). When the rules are not solidly codified or slow to apply due to
unfamiliarity, it helps to have hints to orient/validate yourself.

You do this parsing routine with your own natural language, too. You're just
much more comfortable doing so and do not need hinting on what each word's
role is. Just like a lot of old-school unix guys of lore are more comfortable
reading code without the spectra of colors we commonly apply today. I could
see natural language syntax highlighting being very useful for language
learners, though. Color is used in Chinese language learning to indicate
tonalities for learners, since most have no native/intuitive way to transcribe
the pitch contours. I'm not convinced that the syntax highlighting presented
in the article is really what you'd want, but I'm interested in the direction
it's headed.

As an aside, speed readers don't take in a whole sentence at a time. An entire
sentence simply doesn't fit within your fovea, but they have optimized their
eye tracking to boost their speed. I do imagine that having lots of colors
would disrupt and distract from the text and harm their speed / comprehension,
but it may be possible that a different highlighting scheme could work for
them.

------
efaref
Syntax highlighters for natural language would help against garden path
sentences:

For example:

    
    
        The old man the boat.
    

... is not ambiguous if written as:

    
    
        {SUBJECT}[The old] {VERB}man {OBJECT}[the boat]
    

I think the reason syntax highlighting is important in programming is because
garden-path style sentences are more common with the pedantically strict
grammars that programming languages require.
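
To make the ambiguity concrete, here's a toy sketch. The four-word lexicon and the one-rule grammar are assumptions made up for this example, not a real tagger. It tries every combination of possible tags and keeps only those that form "noun phrase, verb, noun phrase":

```python
import re
from itertools import product

# Toy lexicon: each word maps to its possible tags.
# D = determiner, A = adjective, N = noun, V = verb
LEXICON = {
    "the": "D",
    "old": "AN",   # "old" can be an adjective or a noun ("the old")
    "man": "NV",   # "man" can be a noun or a verb ("to man a boat")
    "boat": "N",
}

# Toy grammar: noun phrase (D A* N), then a verb, then another noun phrase.
SENTENCE = re.compile(r"DA*NVDA*N")


def readings(sentence):
    """Return every tag sequence the toy grammar accepts for the sentence."""
    words = sentence.lower().rstrip(".").split()
    candidates = ("".join(tags) for tags in product(*(LEXICON[w] for w in words)))
    return [tags for tags in candidates if SENTENCE.fullmatch(tags)]


print(readings("The old man the boat."))  # → ['DNVDN']
```

Of the four possible tag combinations only one survives, with "old" as the noun and "man" as the verb, which is exactly the reading a highlighter would have to show.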

~~~
yorwba
The equivalent of garden-path sentences in formal languages would be
shift-reduce conflicts, and most _pedantically strict grammars_ are specifically
designed to avoid that kind of problem, so that they can be parsed
efficiently.

~~~
efaref
Of course the compiler has no problem understanding them, but humans aren't
that good at parsing.

The purpose of a syntax highlighter is so that the human knows that the
computer agrees on what the sentence/code structure is.

------
japaget
Let's have different colors for dialog spoken by different characters, such as
Holmes and Watson. That way I won't have to reread the text to figure out who
said what.

~~~
foobarbecue
That would be nice. It's often really hard to tell with long conversations in
novels who is speaking. There's a tendency to have the speaker implied rather
than explicit for long conversations, and color would be a good solution. I
wonder if this is ever used in cinema and / or theater.

~~~
Mindless2112
It's not common, but some anime fansub groups make the subtitle text match the
hair color of the character that is speaking.

------
baby
We generally add periods and commas to emphasize parts of the text, slow down,
stop the reading... Try reading the text out loud without the syntax
highlighting, then try with. See? It doesn't work, you're going to emphasize
and stop on highlighted words. What would have worked would have been to
highlight everything and split the highlight according to punctuation, not
words.

As someone else pointed out, different colors for different protagonists
talking would be neat as well. But there is not much you can do without
deforming the original intent of the writer.

Another thing that could be interesting would be to give these tools to the
writer instead of highlighting the text automatically, but then as someone
else pointed out, these tools already exist and are rarely used because of
the noise they add (bold, italic, underlined). There are also quotes, quads,
uppercase, ... There are many ways to help the reader follow the text.

------
Pamar
Sorry, but when we speak of "syntax highlighting" for natural language what
is the actual goal?

Are we trying to color differently verbs, nouns, adverbs and prepositions? So
that the "goal" is to properly decide if "lie" is a verb or a noun?

Or are we trying to colour subject, verb, object and other elements of the
sentence? So that "Rome destroyed Carthage" will have a different color for
Rome than the sentence "Hannibal tried to destroy Rome"?

In general, code has "reserved words" and the rest is either a "name"
(variables, constants, literals) or a ... Well... "Verb". Like functions,
procedures, methods.

In some rare cases (function pointers, closures) you have "verbs" that can be
used as "nouns" but you completely lack concepts like dative, accusative and
so on.

I think that this really breaks down as an analogy when you try to adapt
syntax parsing to natural language.

~~~
sound_of_basker
I can tell you right away that it would be of huge help when reading a new
language especially in a new script. And who are the biggest set of users that
might benefit from this? (clue: they haven't said their first words yet).

~~~
Pamar
Which one? Sorting out nouns, verbs, adjectives, etc.? Or finding out what
"role" each word is playing (sorry, English is not my language so I do not
know the proper technical terms for this... In my culture the first is called
"Grammatical Analysis" while the latter is called "Logical Analysis")

------
sdegutis
Syntax highlighting works great for code, because code itself is highly
structured, mostly normalized data (hence BNF being a thing). Even though we
call them "languages", they're much closer to spreadsheets than natural
languages. That's why it's so useful. Code isn't meant to be read from left to
right, top to bottom. There's a lot of back-and-forth skimming to understand
what the code is doing. Natural languages are meant to be read from beginning
to end. We do skim, but it's almost entirely contextual, and only slightly
syntactic. They're just not alike enough for things like this to make sense.

~~~
jordigh
As an armchair linguist, I get frustrated by how computer nerd types keep
comparing programming languages with natural languages. The two have very
little in common other than some superficial similarities. The way each is
acquired, used, and evolved is very different from the other. One of the worst
examples of this bad analogy is sigils in Perl, which Larry insists are
supposed to be good because they mirror plurals in English.

------
lalaithion
This is interesting. I wonder if any publishing <font
color="blue">house</font> would print a book with this idea?

~~~
harlanlewis
Putting “house” in blue reminds me of House of Leaves, a book that plays with
formatting & typography in unusual ways…

 _very minor spoilers_

…as it descends into madness (blank pages, words in spirals, backwards
characters, single-character pages, overlaid paragraphs…). The very first
unusual formatting in the book, and spit-take surprising to me as I wasn't
expecting anything unusual at all, was simply printing the word “house” in
blue. A fun read, thanks for reminding me of it!

[https://en.wikipedia.org/wiki/House_of_Leaves#Colors](https://en.wikipedia.org/wiki/House_of_Leaves#Colors)

~~~
lalaithion
Yes, that was the intent.

------
slackstation
It's a cute idea, but it feels distracting. Maybe a nicer color scheme would work
better.

~~~
ori_b
To be fair, I've started to feel the same way about syntax highlighting in
code. I've slowly moved to only highlighting 2 things: String literals and
comments.

~~~
gkya
I'm nowadays using a theme in Emacs where everything is black on white and
keywords and comments are bold, whereas code is regular. I'm happier with this
than syntax colouring nowadays. I've also removed some colour from Org, namely
headlines are bold and black. Again, I guess the less the colours the better
here. I use colours with parens, they're pale by default, then highlighted
with highlight-parenthesis mode, denoting nesting via tones of red.

I'm even doing some html-css-js-m4 work for a relative's business website
nowadays, and I did not miss highlighting even with such a complex mess;
instead, I'm happier.

------
scott_s
The first book I read was "The Neverending Story", which used green for text
that took place in the "real" world, and red for text that took place in the
fantasy world in the book the protagonist was reading. That's not _syntax_
highlighting, but _structural_ story highlighting.

I would actually be interested to read a novel which did something similar:
one color for narration (probably black, as it will be most common), and then
a different color for each person speaking. That would be useful in a similar
way that I find syntax highlighting useful: I could instantly look at text,
and without even reading it, know who said it.

Narration text could also take on different colors, similar to in "The
Neverending Story". How it is done, and how obvious its meaning, could even be
a part of the art. That would be far more interesting to me than English
syntax highlighting, to the point that if anyone knows of a book that does
this, please tell me, because I would read it just to experience it. The point
here is that in fiction, it's not the parts of speech that matter to readers,
that's just a means to tell the story. What matters are the elements of the
story, and communicating those elements visually could be interesting and
useful.

------
chrismcb
Wow that was hard to read. We don't need to highlight prose. We don't read
prose the same way we read code.

------
jacobsenscott
I found that difficult to read. Most editors go way over the top doing syntax
highlighting and to me it makes the code harder to read.

I switched to a gray scale theme (emacs tao theme) two weeks ago and it is so
much better.

------
istib
Interesting, this follows the approach of sentence blocks, rather than syntax.
I've written an Emacs plugin for lisp blocks
([https://github.com/istib/rainbow-blocks](https://github.com/istib/rainbow-
blocks)) and one for English syntax ([https://github.com/istib/wordsmith-
mode](https://github.com/istib/wordsmith-mode)) using NLP tools.

~~~
SloopJon
I'm getting this error when trying to install wordsmith-mode using package-
install: `[http://melpa.org/packages/wordsmith-
mode-20140203.427.el](http://melpa.org/packages/wordsmith-
mode-20140203.427.el): Not found`

------
cosmicexplorer
I actually made a little thing for emacs that does something similar, actually
tagging on the parsed parts of speech instead of just tokens. It requires you
to select the text first instead of automatically highlighting, though. It
uses coreNLP which was pretty lovely to work with.

[https://github.com/cosmicexplorer/speech-
tagger](https://github.com/cosmicexplorer/speech-tagger)

------
joepvd
Highlighting the quotes reminds me of a gripe I have with quoting styles in
print: When a quote consists of two paragraphs, the first paragraph does not
get an ending quote:

    
    
        He said, "The first sentence.
    
        "The second sentence," he continued. 
    

Somehow my mind gets triggered pretty intensely by these unbalanced quotes.

Does anyone have some background?

~~~
skykooler
I believe the reasoning for this is that if you had two people talking, and
there were end quotes after the first sentence, it would be parsed as the
second sentence being said by the second person.

~~~
joepvd
It might be. Most of the time, however, it is just a single person getting a
longer quote.

------
azdle
This isn't actually mine, just something I stumbled on. Before I go and pull
this apart to do it myself, does anyone know of an extension that will do this
to arbitrary text in Firefox?

I found this because someone was using it as an argument against syntax
highlighting for code, but I actually find that it lets me read significantly
faster.

~~~
bjterry
You may be interested in BeeLine Reader [1]. It puts a color gradient on
alternating lines of text to hopefully let you read faster. I think it works a
little bit. When I really want to read an entire article but don't want to
invest too much time, especially if it's somewhat fluffy, I'll use Spritzlet
set to 700 wpm [2].
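
To give an idea of what a line-based gradient could look like, here's a minimal sketch. It is only my guess at the general technique, not BeeLine Reader's actual implementation; the function names and the default colors are made up for illustration:

```python
def lerp_color(start, end, t):
    """Linearly interpolate between two RGB tuples, with t from 0.0 to 1.0."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


def line_gradients(lines, color_a=(0, 0, 0), color_b=(0, 0, 255)):
    """Give each character an RGB color, flipping the gradient direction on
    every other line so one line ends in the color the next one starts with."""
    colored = []
    for index, line in enumerate(lines):
        start, end = (color_a, color_b) if index % 2 == 0 else (color_b, color_a)
        steps = max(len(line) - 1, 1)  # avoid dividing by zero on short lines
        colored.append(
            [(ch, lerp_color(start, end, i / steps)) for i, ch in enumerate(line)]
        )
    return colored


for row in line_gradients(["first line", "second line"]):
    print(row[0], row[-1])
```

Note that it only needs to know where lines break, not where sentences do.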

1: [http://www.beelinereader.com/](http://www.beelinereader.com/)

2: [http://www.spritzlet.com/](http://www.spritzlet.com/)

~~~
gnicholas
Interestingly, when people see BeeLine Reader for the first time, many of them
think (incorrectly) that it's a sentence-based algorithm instead of a line-
based algorithm. Their belief often persists even after being told (by me, the
creator) that it is in fact line-based. We've thought about doing something
that's sentence-based, or syntactically or semantically aware, but as others
have pointed out those tasks are much more complex.

------
ntoshev
I've experimented with this for a while, the goal being easier comprehension
of new text. I found highlighting parts of speech and syntactic groups didn't
work for me. The only thing that did work was highlighting keywords (that are
specific to the text) and maybe named entities that refer to the same thing.
Interestingly, there is some research indicating highlighting keywords may
help people with dyslexia.

I've written a chrome extension for highlighting keywords, too bad I don't
currently have time to give it the love it deserves:

[https://chrome.google.com/webstore/detail/highlit/cooahmcpma...](https://chrome.google.com/webstore/detail/highlit/cooahmcpmajfpnanjkdfiejkffphknnm)

------
xemoka
It'd be interesting to integrate
[http://www.beelinereader.com/](http://www.beelinereader.com/) style
gradients. I had a hard time with the quick change from gray to white.

------
smsm42
I see a lot of negative feedback here but I must say I tried reading it and it
looks like a noticeable improvement. I can't put my finger on _what_ exactly
improved but somehow it reads better I think than just text.

------
jaakl
For me the point would be to make reading the main point of the text faster and
more precise. Why not render based on meaning of the words, e.g. "love" and
"sex" should be red, strong words in bold, weak in gray, "emotion" in tone
based on particular emotion etc? Grammatical terms are not even that
interesting to me, and any syntactic sugar can be made softer.

------
jolux
If you want a Mac app that does this, iA Writer has a feature called Syntax
Control: [https://ia.net/writer/updates/ia-writer-3-1-comes-in-
colors](https://ia.net/writer/updates/ia-writer-3-1-comes-in-colors)

I use it sometimes, it works pretty well but occasionally gets confused.

------
rekshaw
Interesting concept although IMHO poorly executed: I found it harder to read
than non-highlighted text as critical linking words were faded (like "and").
These should not be faded but emphasized (a bit like ampersands in programming
languages).

------
shahzeb
On a somewhat related note, has anyone on here successfully mastered the art
of speed reading? I'm a CS student at Uni right now, and compared to
non-Engineer students, I've noticed I tend to read much slower. Any tips on up-ing
my read speed?

~~~
hgh
If you try to read all the words, just faster, you won't get much farther. The
trick is to take a more active approach and optimize how you spend your time
towards the goal of comprehension.

One thing that worked well for me was shifting from "reading" to
"interrogating". Don't just try to read through a text, think carefully (and
jot down) what questions you need to answer from the text, and jump around the
text as necessary to answer the questions. If you don't have a sense of what
questions to ask, do a quick skim of the text and any other relevant material
to get the questions first, then dive in. Iterate and refine your questions
and answers.

------
andersonmvd
In the presented example, gray highlights within white phrases negatively
affected my reading, but the white highlights within the yellow text helped I
guess. Nonetheless, congratulations for experimenting.

------
return0
Language is already highlighted. It's called Bold, Italics, Underline.

~~~
gwern
The problem is, if you use italics/bold/underline remotely as often as
necessary to convey intonation or emphasis, you read like you're a loon or
crank. Bolding is the green ink of the Internet.

------
stevebmark
This is a very interesting idea! How about syntax highlighting for proper
nouns, using a preset cycling color table to give proper nouns consistent
colors?
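
Something like this minimal sketch, assuming the proper nouns have already been picked out somehow (the palette and the function name are hypothetical):

```python
from itertools import cycle

# Hypothetical preset color table; any list of colors would do.
PALETTE = ["red", "green", "blue", "orange"]


def assign_colors(proper_nouns, palette=PALETTE):
    """Give each distinct proper noun the next palette color, in order of
    first appearance, wrapping around when the palette runs out."""
    colors = cycle(palette)
    assigned = {}
    for noun in proper_nouns:
        if noun not in assigned:
            assigned[noun] = next(colors)
    return assigned


print(assign_colors(["Holmes", "Watson", "Holmes", "Lestrade"]))
```

Every later mention of the same name looks up the same color, so it stays consistent through the text.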

------
so4pmaker
It's an interesting idea to highlight English. However, I believe it needs to
be a little more intelligent than highlight.js.

------
Animats
In some early books, the open class words are capitalized. This style lives on
in titles.

------
clux
I would totally install a sublime plugin or something like that for this.

------
chris_wot
Similar to Red Letter Bibles, I think this gets in the way of the prose.

------
pesnk
This is amazing. Thanks so much, author.

------
stephenitis
Please make this a Chrome extension that wraps all p tags in my browser.

I'm very curious.

------
Pxtl
Red periods were a bad idea. Also ew.

~~~
peterburkimsher
In British English, periods are always red. Full stops are usually black when
printed.

~~~
Pxtl
I'm wondering if the visual pun was deliberate.

------
languagehacker
This isn't particularly interesting as art, and it's beyond incorrect as far
as any formal theory of natural language goes, but you do you, playa

