> It would be interesting to see the major parts of speech (nouns, verbs, adjectives) colored. Instead this is a coloring of fairly random words. A bunch of short words are grey, but they don't belong to any particular part of speech. They include some articles, prepositions, conjunctions and a few verbs...
Because of this, unfortunately "English syntax highlighter" has two possible interpretations to me: [English [syntax highlighter]], where syntax highlighter is a noun having the idiosyncratic meaning it has in a programming context, i.e. a tool that highlights different tokens according to their category, and [[English syntax] highlighter], a tool that would presumably highlight parts of a sentence according to their syntactic function. The latter would be more exciting, but also more difficult.
† There are some exceptions to this, but it does seem to be true for most of the syntax highlighters I've used.
In the natural language context, a part of speech is actually determined syntactically. The "type" of a token in a formal language is, by comparison, also syntactic. Any highlighter must recognize the syntactic structure of its target language and work like a parser, that is, not like a tokenizer. You have to "push" something to the stack, if you know what I mean.
To give a bad, but intuitive example, if a highlighter assigned "gray" to every "=" character it comes across, it would not be able to highlight the same characters inside a quotation mark.
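A minimal Python sketch of that example, just to make it concrete. The function names are made up for illustration, and the "string-aware" version assumes double-quoted strings with no escape sequences. A single flag is enough here; anything with real nesting would need a stack instead.

```python
def naive_gray_equals(text):
    """Return the index of every '=' character, ignoring context."""
    return [i for i, ch in enumerate(text) if ch == "="]


def string_aware_gray_equals(text):
    """Return the index of every '=' that is outside a double-quoted string.

    Assumption: strings are delimited by '"' and contain no escapes.
    """
    positions = []
    in_string = False  # the one bit of context the naive version lacks
    for i, ch in enumerate(text):
        if ch == '"':
            in_string = not in_string
        elif ch == "=" and not in_string:
            positions.append(i)
    return positions


line = 'x = "a=b"'
print(naive_gray_equals(line))         # [2, 6]
print(string_aware_gray_equals(line))  # [2]
```

The naive version grays the "=" at index 6 even though it sits inside the string; the second version leaves it for whatever rule colors strings.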
How often does syntax matter in "syntax highlighting"? Most highlighters I've used don't even try to recognise syntactic structure.
> To give a bad, but intuitive example, if a highlighter assigned "gray" to every "=" character it comes across, it would not be able to highlight the same characters inside a quotation mark.
The problem with this example is that strings are a single token.
However, you are slightly wrong, depending on the coloring engine and language.
The syntax coloring algorithm in Vim handles nested syntax. (To what extent is up to the developer of the syntax file.)
You can see some of that at work here: http://www.kylheku.com/cgit/txr/tree/tests/011/txr-case.txr
(This is being colorized by invoking Vim out of Apache on-the-fly.)
Here we can see that a Lisp dialect is embedded in a special-purpose language. The functions and operators of the Lisp dialect are colored green, but those in the special-purpose language are purple. This is worked out even in situations when they are the same symbol. E.g. both languages have a "do" operator.
Vim doesn't handle nesting with a formal context-free grammar. Rather, it allows named regions to be defined, and highlighting matches can be declared as occurring only in specific regions. Regions can also be in regions. You can get very accurate coloring; but it can also be slow at times.
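To illustrate the idea of region-scoped matches, here is a toy Python model. It is not how Vim is implemented and not the actual TXR syntax file; the region names, the use of parentheses as region delimiters, and the color table are all assumptions made up for the sketch. The same word gets a different color depending on the innermost region it appears in.

```python
import re

# Hypothetical rules: keyword -> color, declared per region.
KEYWORDS = {
    "top": {"do": "purple"},   # outer, special-purpose language
    "lisp": {"do": "green"},   # embedded Lisp-like region
}

# Tokens: an opening paren, a closing paren, or a word.
TOKEN = re.compile(r"\(|\)|[A-Za-z_]+")


def highlight(text):
    """Return (word, color) pairs for the keywords found in text.

    Assumption: '(' opens a 'lisp' region and ')' closes it; regions nest.
    """
    stack = ["top"]  # innermost region is stack[-1]
    colored = []
    for match in TOKEN.finditer(text):
        tok = match.group()
        if tok == "(":
            stack.append("lisp")
        elif tok == ")":
            if len(stack) > 1:  # ignore an unbalanced ')'
                stack.pop()
        else:
            color = KEYWORDS[stack[-1]].get(tok)
            if color is not None:
                colored.append((tok, color))
    return colored


print(highlight("do (do x) do"))
# [('do', 'purple'), ('do', 'green'), ('do', 'purple')]
```

The stack of open regions is what lets the two "do" operators be told apart, which a flat list of keyword patterns could not do.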
It does at least these correctly:
It takes many tries to succeed.
It takes only one try.
It takes a try.
I want to try and succeed.
I'm gonna try it anyway.
Man the boats!
Man and the boats!
Never hide hides.
The sound sounds sound.
It's the right right, right?
The key key is this one.
It's an objective objective.
In May, May may make out with me.
Compared with the last one, this is a fine fine.
The man we saw saw a saw.
The first second was alright, but the second second was tough.
Google "Google beats go champion".
Noun Noun Noun Verb Noun
But it would be awesome to be able to hover my mouse over a "palette" and see the matching parts of speech highlight. Or even just turn the parts on/off to see what I want to see.
Also, is there a way to run this script on a large collection of text files? Specifically, the Bible. I'd be curious to split out all the names/places/etc.
Math is an example of where this should be used so very, very much more. Why is my first grader turning in tests covering addition when a computer can literally spit back whether it is right or wrong in a fraction of a second? The learning rate is immensely better at that speed.
And this is a fantastic example of how that sort of thing could be applied to parts-of-speech learning, in real sentences rather than just an artificial "show word -> demand part of speech" test style. It's the part-of-speech equivalent of the red wavy line under misspelled words, which is also a far better way to teach spelling than any number of "spelling test Tuesday returned to student Thursday" ever could be.
It's why I learn programming languages way faster than real languages.
I'm tempted to say that it's because generally when reading in English, I'm not "jumping around" as much - I'm just reading left to right and processing everything into a thought. It could easily be, though, that it's just that English is my native language and I already do a better job immediately understanding the syntax of sentences than an automated tool would anyway.
EDIT: Ah, this was it: https://news.ycombinator.com/item?id=6966528. An update on the blog says the company it was about tweeted "We will drop our patents pending. Thank you @dhh for clearing our minds."
The old man the boat.
The complex houses married and single soldiers and their families.
The man whistling tunes pianos.
Time flies like an arrow; fruit flies like a banana.
The cotton clothing is made of grows in Mississippi.
I convinced her children are noisy.
However, I get a feeling this is a sort of hype-train at the moment, because I swear I've seen three English syntax highlighters in the past week alone (being posted to HN).
Could this be the latest, albeit welcome and intelligent, project exercise obsession à la Flappy Bird / 2048? :)
The red car
This red car
Your red car