Show HN: English syntax highlighter using part-of-speech tagging

ipsum2 · on March 18, 2016

Hi HN! This was inspired by the top comment from this thread: https://news.ycombinator.com/item?id=11294026

> It would be interesting to see the major parts of speech (nouns, verbs, adjectives) colored. Instead this is a coloring of fairly random words. A bunch of short words are grey, but they don't belong to any particular part of speech. They include some articles, prepositions, conjunctions and a few verbs...

photon_off · on March 18, 2016

Could you please add a legend to indicate what the colors mean?

ipsum2 · on March 21, 2016

Sorry for the slow response, I've implemented this already.

gyhchang · on March 18, 2016

It would be nice if you put the color of the words in a separate key, rather than in the example box.

jjp · on March 18, 2016

Actually put all the instructions outside of the example box. First pass I assumed everything in the box was just sample so I dived straight into typing and then didn't get any colours. Then I had to reload to find out I had to tab out.

TazeTSchnitzel · on March 18, 2016

It's strange to me that we call these things "syntax highlighters", given that they don't really seem to colour things differently according to syntax,† but rather by the category of token. Perhaps "lexical highlighter" would be a better term for these?

Because of this, unfortunately "English syntax highlighter" has two possible interpretations to me: [English [syntax highlighter]], where syntax highlighter is a noun having the idiosyncratic meaning it has in a programming context, i.e. a tool that highlights different tokens according to their category, and [[English syntax] highlighter], a tool that would presumably highlight parts of a sentence according to their syntactic function. The latter would be more exciting, but also more difficult.

† There are some exceptions to this, but it does seem to be true for most of the syntax highlighters I've used.

raldu · on March 18, 2016

Your observation seems plausible at first. However, when I gave it a second thought, I concluded otherwise.

In the natural language context, a part of speech is actually determined syntactically. "Type" of a token in formal languages by comparison is also syntactical. Any highlighter must recognize the syntactic structure of its target language and work like a parser, that is, not like a tokenizer. You have to "push" something to the stack, if you know what I mean.

To give a bad, but intuitive example, if a highlighter assigned "gray" to every "=" character it comes across, it would not be able to highlight the same characters inside a quotation mark.

TazeTSchnitzel · on March 20, 2016

> Any highlighter must recognize the syntactic structure of its target language and work like a parser, that is, not like a tokenizer. You have to "push" something to the stack, if you know what I mean.

How often does syntax matter in "syntax highlighting"? Most highlighters I've used don't even try to recognise syntactic structure.

> To give a bad, but intuitive example, if a highlighter assigned "gray" to every "=" character it comes across, it would not be able to highlight the same characters inside a quotation mark.

The problem with this example is that strings are a single token.
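A minimal Python sketch of that point, assuming a made-up token set (the category names and the sample input are hypothetical, not taken from any real highlighter): a purely lexical pass that matches a quoted string as one token never sees the "=" inside it, so no stack or parser is needed for this case.

```python
import re

# Hypothetical token categories, for illustration only.
TOKEN_SPEC = [
    ("string", r'"(?:\\.|[^"\\])*"'),  # a whole quoted string is one token
    ("operator", r"="),
    ("name", r"[A-Za-z_]\w*"),
    ("space", r"\s+"),
    ("other", r"."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(source):
    """Return (category, text) pairs, skipping whitespace."""
    return [
        (m.lastgroup, m.group())
        for m in TOKEN_RE.finditer(source)
        if m.lastgroup != "space"
    ]

tokens = tokenize('x = "a = b"')
# The "=" inside the quotes is part of the single "string" token,
# so only the first "=" is categorised as an operator.
```

A highlighter built on this would colour by `category` alone, which is the "lexical highlighter" reading discussed above.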

kazinator · on March 18, 2016

"Lexical coloring" or "Lexical color-coding" would be even more appropriate.

However, you are slightly wrong, depending on the coloring engine and language.

The syntax coloring algorithm in Vim handles nested syntax. (To what extent is up to the developer of the syntax file.)

You can see some of that at work here: http://www.kylheku.com/cgit/txr/tree/tests/011/txr-case.txr

(This is being colorized by invoking Vim out of Apache on-the-fly.)

Here we can see that a Lisp dialect is embedded in a special-purpose language. The functions and operators of the Lisp dialect are colored green, but those in the special-purpose language are purple. This is worked out even in situations when they are the same symbol. E.g. both languages have a "do" operator.

Vim doesn't handle nesting with a formal context-free grammar. Rather, it allows named regions to be defined, and highlighting matches can be declared as occurring only in specific regions. Regions can also be in regions. You can get very accurate coloring; but it can also be slow at times.
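A rough Python model of the region idea, not Vim's actual implementation or syntax-file format: the region names, the colours, and the rule that "(" opens a nested Lisp region are all assumptions made for illustration. The same keyword gets a different colour depending on the innermost open region.

```python
import re

# Hypothetical regions and colours; not Vim's real highlight groups.
KEYWORD_COLOURS = {
    "template": {"do": "purple"},
    "lisp": {"do": "green"},
}

def highlight(source):
    """Return (word, colour) for keywords, scoped by the innermost region."""
    regions = ["template"]  # the outermost region is always open
    result = []
    for token in re.findall(r"[()]|[^\s()]+", source):
        if token == "(":
            regions.append("lisp")  # assumed: "(" opens a nested lisp region
        elif token == ")":
            if len(regions) > 1:
                regions.pop()  # close the innermost lisp region
        else:
            colour = KEYWORD_COLOURS[regions[-1]].get(token)
            if colour:
                result.append((token, colour))
    return result

marks = highlight("do (do (x) y) do")
# The outer "do" words fall in the template region, the middle one in lisp.
```

This tracks nesting with a simple stack of region names rather than a grammar, which is roughly the trade-off described above.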

GolDDranks · on March 18, 2016

I totally expected it to be "just" a part-of-speech-tagger.

It does at least these correctly:

    It takes many tries to succeed.
    It takes only one try.
    It takes a try.
    I want to try and succeed.
    I'm gonna try it anyway.
    
    Man the boats!
    Man and the boats!

What would you want it to do? Show the parse tree? If I understand correctly, even non-computational linguists have had (and still have) trouble coming up with a system that parses all natural language correctly – or even a system that could represent all the parses of natural sentences correctly.

GolDDranks · on March 18, 2016

Btw, to elaborate: there seem to be some troublesome sentences where a parse tree just isn't enough – you've got to have a graph. Yet most linguistic phenomena can be parsed into tree structures. That's troublesome. It's like... there's a certain class of parser in the brain, but then some curious sentences show the ability to overcome the "limitations" of that parser, and still they are considered grammatical.

legulere · on March 18, 2016

"Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo." is just orange :/

jychang · on March 18, 2016

It does ok at other phrases, though.

    Never hide hides.
    The sound sounds sound.
    Flies fly.
    It's the right right, right?
    The key key is this one.
    It's an objective objective.
    In May, May may make out with me.
    Compared with the last one, this is a fine fine.
    The man we saw saw a saw.
    The first second was alright, but the second second was tough.

stared · on March 18, 2016

My standard test is "A machine learning to do machine learning." And it works here.

taneq · on March 18, 2016

Sentences like these lead you down the garden path sentence.

stared · on March 18, 2016

This was just "verb vs a part of a compound noun". For a garden path sentence I would go with "A machine learning machine learning.".

taneq · on March 18, 2016

But that can be validly parsed two ways: "A machine-learning machine, which is learning" or "A machine which is learning machine-learning".

brudgers · on March 18, 2016

But not the imperative:

  Google "Google beats go champion".

Which it interprets as:

  Noun Noun Noun Verb Noun

smilekzs · on March 18, 2016

This is impressive! I wonder where the limits of the algorithm are...

TazeTSchnitzel · on March 18, 2016

Well, Buffalo, the proper noun, and buffalo, the noun, are both going to be highlighted as nouns. That leaves buffalo, the verb, which is slang and so is unlikely to be recognised as not being the noun, ignoring the difficulty of parsing the sentence anyway.

taneq · on March 18, 2016

To be fair, relatively few humans would parse that in any meaningful way either.

trebor · on March 18, 2016

My only request would be to make the highlighted parts of speech configurable. For me it's too active with 11 different colors/things demanding attention.

But it would be awesome to be able to hover my mouse over a "palette" and see the matching parts of speech highlight. Or even just turn the parts on/off to see what I want to see.

peterburkimsher · on March 18, 2016

That's beautiful! I wish that someone could make that for Chinese. I'm trying to learn Chinese now, and I think it would be easier in colour.

Also, is there a way to run this script on a large collection of text files? Specifically, the Bible. I'd be curious to split out all the names/places/etc.

jerf · on March 18, 2016

One of the things that machine learning teaches us is that the cycle between running a test on the environment and getting feedback is critical. One of the major problems with the school system is how often the feedback loop is "take test Tuesday, teacher grades it Thursday night, student gets it back on Monday". This is useful for assessment but useless for learning.

Math is an example of where this should be used so very, very much more. Why is my first grader turning in tests covering addition when a computer can literally spit back whether it is right or wrong in a fraction of a second? The learning rate is immensely better at that speed.

And this is a fantastic example of how that sort of thing could be applied to parts-of-speech learning, in real sentences rather than just an artificial "show word -> demand part of speech" test style. It's the part-of-speech equivalent of the red wavy line under misspelled words, which is also a far better way to teach spelling than any number of "spelling test Tuesday returned to student Thursday" ever could be.

stared · on March 18, 2016

It's why I like computer games.

It's why I learn programming languages way faster than real languages.

wodenokoto · on March 18, 2016

There are actually a lot of people who color code Chinese according to tones.

gbraad · on March 18, 2016

This is possible and is done using pinyin. The tone of the character can be highlighted in a different colour, as CCEDICT and/or Pleco do. I have worked on this before, but tone sandhi and polyphonic characters kinda complicate it a little.
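A minimal sketch of tone colouring, assuming the input is already numbered pinyin (e.g. "ni3") and using a made-up tone-to-colour table; the colours here are hypothetical and not the scheme of any particular tool. It colours each syllable by its written tone only, so it deliberately ignores sandhi and does not pick a reading for polyphonic characters.

```python
import re

# Hypothetical colour scheme; tone 5 stands for the neutral tone.
TONE_COLOURS = {"1": "red", "2": "orange", "3": "green", "4": "blue", "5": "grey"}

def colour_pinyin(text):
    """Map numbered pinyin syllables such as 'ni3' to (syllable, colour)."""
    return [
        (syllable + tone, TONE_COLOURS[tone])
        for syllable, tone in re.findall(r"([a-z]+)([1-5])", text.lower())
    ]

coloured = colour_pinyin("ni3 hao3 ma5")
# Both third-tone syllables get the same colour, even though the first
# is spoken with a changed tone before another third tone; handling
# that would need a sandhi pass on top of this lookup.
```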

cabalamat · on March 18, 2016

I imagine people learning Latin might find it useful as well. And have a mouseover for conjugation / declension.

milesokeefe · on March 18, 2016

I made a tool that does a form of that:

http://latin.milesokeefe.com/?s=disce+quasi+semper+victurus+...

cgm616 · on March 18, 2016

I can really see this being useful for any language.

no1youknowz · on March 18, 2016

Are there plans to open source this, or does anyone know of OS projects that do this? Thanks!

x1798DE · on March 18, 2016

It's very odd - I can barely work without syntax highlighting when programming, but this does absolutely nothing for me in English. I wonder why this is.

I'm tempted to say that it's because generally when reading in English, I'm not "jumping around" as much - I'm just reading left to right and processing everything into a thought. It could easily be, though, that it's just that English is my native language and I already do a better job immediately understanding the syntax of sentences than an automated tool would anyway.

laurieg · on March 18, 2016

Very interesting. I've toyed around with this idea myself and come to the conclusion that coloring at the part-of-speech level is not that useful for the educational aspects I'm interested in. If you could take this to the next level and visually display the grammatical sentence structure then I think this would be a really useful tool for people studying English as a second language.

wodenokoto · on March 18, 2016

There are plenty of tools that can draw a variety of tree structures over English text.

https://displacy.spacy.io/displacy/index.html?full=Click+the....

sethjgore · on March 18, 2016

Green-bridge.org

melloclello · on March 18, 2016

I think this is kind of neat for structured English as opposed to prose. Prose tends to look like vegetable soup, but I pasted a very regularly phrased and structured todo list in here and it looks great. Things like the standard Agile "as a user, I want X so I can Y" formalised story descriptions just make sense with this kind of syntax highlighting.

mrspeaker · on March 18, 2016

I made a proof-of-concept of this a while ago (http://www.mrspeaker.net/2012/03/24/syntax-highlighting-for-...) - my "writer-friends" weren't into it (what do they know!)... but then there was a post on HN a few years later where someone had patented the idea! I can't dig it up - does anyone else remember this?

EDIT: Ah, this was it: https://news.ycombinator.com/item?id=6966528. An update on the blog says the company it was about tweeted "We will drop our patents pending. Thank you @dhh for clearing our minds."

ipsum2 · on March 21, 2016

I'm not sure why you were downvoted, your demo is really cool.

ipsum2 · on March 18, 2016

Server load too high! I'm going to restart the server. Apologies for the downtime. In the meantime, here's a screenshot: http://i.imgur.com/493smt7.png

sethjgore · on March 18, 2016

Hello! Are you open sourcing this? If so, can I use your library?

JulianMorrison · on March 18, 2016

I actually really like this, and I wonder, if it were made an extension and made to do more subtle color shifts of the underlying font, whether it might be a very useful tool for people with some types of learning difficulties, and aid effective reading in general.

chronial · on March 18, 2016

I think highlighting that helps parse the sentence structure would be more useful.

a3_nm · on March 18, 2016

On Iceweasel (Firefox) 44.0.2 I only see "undefined" in the text area.

ipsum2 · on March 18, 2016

Thanks for the report. It seems Firefox only implemented the innerText property in version 45 (https://developer.mozilla.org/en-US/docs/Web/API/Node/innerT...). I'll see if I can polyfill it.

dvh · on March 18, 2016

use textContent instead of innerText, works both on chrome and ff

ipsum2 · on March 18, 2016

textContent doesn't preserve newlines correctly. I found a polyfill, so Firefox <45 should work now.

a3_nm · on March 18, 2016

I confirm it does. Thanks!

skykooler · on March 18, 2016

It does seem to have issues with garden path sentences - in "The old man the boat", it colors "man" as a noun rather than a verb.

thylacine222 · on March 18, 2016

Yup.

  The old man the boat.

  The complex houses married and single soldiers and their families.

  The man whistling tunes pianos.

  Time flies like an arrow; fruit flies like a banana.

  The cotton clothing is made of grows in Mississippi.

  I convinced her children are noisy.

Gets all of 'em wrong.

sciencerobot · on March 18, 2016

You should post some examples of highlighted literature to see if different writing styles look different because of the syntax highlighting.

mchahn · on March 18, 2016

This seems like it would be useful for education. I don't see how it helps reading, which is the purpose of highlighting source code.

mkrecny · on March 18, 2016

Source?

sarreph · on March 18, 2016

It's really great to see advances in English syntax highlighting (an oft-neglected 'language' when it comes to code editors).

However, I get a feeling this is a sort of hype-train at the moment, because I swear I've seen three English syntax highlighters in the past week alone (being posted to HN).

Could this be the latest, albeit welcomed and intelligent, project exercise obsession à la Flappy Bird / 2048? :)

jychang · on March 18, 2016

Well, OP said in a comment that it was inspired by the top comment from the post from 2 days ago...

im3w1l · on March 18, 2016

Very cool! Could you change the colors so as to make adverb vs noun easier to tell apart?

cabalamat · on March 18, 2016

I tried it with:

    The red car
    This red car
    Your red car

And it marked "Your" like "red" and not like "The" or "This". Surely "your" is a determiner and not an adjective? I mean, you can't say *"The your car".

amelius · on March 18, 2016

A case against syntax highlighting: [1]

[1] http://www.linusakesson.net/programming/syntaxhighlighting/