Show HN: English syntax highlighter using part-of-speech tagging (edward.io)
147 points by ipsum2 on March 18, 2016 | 57 comments

Hi HN! This was inspired by the top comment from this thread: https://news.ycombinator.com/item?id=11294026

> It would be interesting to see the major parts of speech (nouns, verbs, adjectives) colored. Instead this is a coloring of fairly random words. A bunch of short words are grey, but they don't belong to any particular part of speech. They include some articles, prepositions, conjunctions and a few verbs...

Could you please add a legend to indicate what the colors mean?

Sorry for the slow response, I've implemented this already.

It would be nice if you put the color of the words in a separate key, rather than in the example box.

Actually put all the instructions outside of the example box. First pass I assumed everything in the box was just sample so I dived straight into typing and then didn't get any colours. Then I had to reload to find out I had to tab out.

It's strange to me that we call these things "syntax highlighters", given that they don't really seem to colour things differently according to syntax,† but rather by the category of token. Perhaps "lexical highlighter" would be a better term for these?

Because of this, unfortunately "English syntax highlighter" has two possible interpretations to me: [English [syntax highlighter]], where syntax highlighter is a noun having the idiosyncratic meaning it has in a programming context, i.e. a tool that highlights different tokens according to their category, and [[English syntax] highlighter], a tool that would presumably highlight parts of a sentence according to their syntactic function. The latter would be more exciting, but also more difficult.

† There are some exceptions to this, but it does seem to be true for most of the syntax highlighters I've used.

Your observation seems plausible at first. However, when I gave it a second thought, I concluded otherwise.

In the natural language context, a part of speech is actually determined syntactically. By comparison, the "type" of a token in a formal language is also syntactic. Any highlighter must recognize the syntactic structure of its target language and work like a parser, that is, not like a tokenizer. You have to "push" something to the stack, if you know what I mean.

To give a bad, but intuitive example, if a highlighter assigned "gray" to every "=" character it comes across, it would not be able to highlight the same characters inside a quotation mark.

> Any highlighter must recognize the syntactic structure of its target language and work like a parser, that is, not like a tokenizer. You have to "push" something to the stack, if you know what I mean.

How often does syntax matter in "syntax highlighting"? Most highlighters I've used don't even try to recognise syntactic structure.

> To give a bad, but intuitive example, if a highlighter assigned "gray" to every "=" character it comes across, it would not be able to highlight the same characters inside a quotation mark.

The problem with this example is that strings are a single token.
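A minimal sketch of that point, assuming a toy language whose token categories are made up for illustration: a purely lexical pass already keeps an "=" inside a string from being coloured as an operator, because the whole string literal is matched as one token before the operator pattern is tried.

```python
import re

# Hypothetical token categories for a toy language. Order matters: the
# string pattern is tried before the operator pattern at each position.
TOKEN_SPEC = [
    ("string", r'"(?:\\.|[^"\\])*"'),
    ("number", r"\d+"),
    ("name", r"[A-Za-z_]\w*"),
    ("operator", r"[=+\-*/]"),
    ("space", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (category, text) pairs, skipping whitespace.

    Characters that match no pattern are silently skipped by finditer.
    """
    tokens = []
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup != "space":
            tokens.append((match.lastgroup, match.group()))
    return tokens


print(tokenize('x = "a = b"'))
# [('name', 'x'), ('operator', '='), ('string', '"a = b"')]
```

No stack is involved here; the highlighter only needs to map each category to a colour.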

"Lexical coloring" or "Lexical color-coding" would be even more appropriate.

However, you are slightly wrong, depending on the coloring engine and language.

The syntax coloring algorithm in Vim handles nested syntax. (To what extent is up to the developer of the syntax file.)

You can see some of that at work here: http://www.kylheku.com/cgit/txr/tree/tests/011/txr-case.txr

(This is being colorized by invoking Vim out of Apache on-the-fly.)

Here we can see that a Lisp dialect is embedded in a special-purpose language. The functions and operators of the Lisp dialect are colored green, but those in the special-purpose language are purple. This is worked out even in situations when they are the same symbol. E.g. both languages have a "do" operator.

Vim doesn't handle nesting with a formal context-free grammar. Rather, it allows named regions to be defined, and highlighting matches can be declared as occurring only in specific regions. Regions can also be in regions. You can get very accurate coloring; but it can also be slow at times.
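A rough sketch of the region idea, not Vim's actual implementation: the "@(" delimiter, the region names and the colour table below are all assumptions made for illustration. The same word gets a different colour depending on which region currently encloses it.

```python
import re

# Hypothetical setup: "@(" opens an embedded-Lisp region that runs to the
# matching ")"; everything outside it belongs to the outer language.
KEYWORD_COLORS = {
    "outer": {"do": "purple"},
    "lisp": {"do": "green"},
}
TOKEN_RE = re.compile(r"@\(|\(|\)|[A-Za-z]+")


def colorize(source):
    """Return (word, color) pairs; the color depends on the enclosing region."""
    regions = ["outer"]  # stack of currently open regions
    colored = []
    for token in TOKEN_RE.findall(source):
        if token == "@(":
            regions.append("lisp")
        elif token == "(":
            # A plain paren nests inside whatever region is already open.
            regions.append(regions[-1])
        elif token == ")":
            if len(regions) > 1:
                regions.pop()
        else:
            colored.append((token, KEYWORD_COLORS[regions[-1]].get(token)))
    return colored


print(colorize("do @(do (do) x) do"))
# [('do', 'purple'), ('do', 'green'), ('do', 'green'), ('x', None), ('do', 'purple')]
```

Words with no entry for the current region come back with `None`, i.e. uncoloured.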

I totally expected it to be "just" a part-of-speech-tagger.

It does at least these correctly:

    It takes many tries to succeed.
    It takes only one try.
    It takes a try.
    I want to try and succeed.
    I'm gonna try it anyway.
    Man the boats!
    Man and the boats!

What would you want it to do? To show the parse tree? If I understand correctly, even the non-computational linguists have had (and still have) trouble coming up with a system that parses all natural language correctly – or even a system that could represent all the parses of natural sentences correctly.

Btw. to elaborate: there seem to be some troublesome sentences where a parse tree just isn't enough – you've got to have a graph. Yet, most of the linguistic phenomena can be parsed into tree structures. That's troublesome. It's like... there's a certain class of parser in the brain, but then some curious sentences show the ability to overcome the "limitations" of that parser, and still they are considered grammatical.

"Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo." is just orange :/

It does ok at other phrases, though.

    Never hide hides.
    The sound sounds sound.
    Flies fly.
    It's the right right, right?
    The key key is this one.
    It's an objective objective.
    In May, May may make out with me.
    Compared with the last one, this is a fine fine.
    The man we saw saw a saw.
    The first second was alright, but the second second was tough.

My standard test is "A machine learning to do machine learning." And it works here.

Sentences like these lead you down the garden path sentence.

This was just "verb vs a part of a compound noun". For a garden path sentence I would go with "A machine learning machine learning.".

But that can be validly parsed two ways: "A machine-learning machine, which is learning" or "A machine which is learning machine-learning".

But not the imperative:

  Google "Google beats go champion".

Which it interprets as:

  Noun Noun Noun Verb Noun

This is impressive! I wonder where the limits of the algorithm are...

Well, Buffalo, the proper noun, and buffalo, the noun, are both going to be highlighted as nouns. That leaves buffalo, the verb, which is slang and so is unlikely to be recognised as not being the noun, ignoring the difficulty of parsing the sentence anyway.
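To illustrate why that sentence can come out a single colour, here is a toy tagger that looks each word up in a lexicon and ignores its neighbours. This is only an assumption for illustration; the lexicon is made up, and nothing here claims to be how the posted tool works.

```python
# Hypothetical lexicon mapping each lowercased word to its most common tag.
MOST_COMMON_TAG = {
    "buffalo": "NOUN",
    "the": "DET",
    "boats": "NOUN",
}


def tag(sentence):
    """Tag each word by lexicon lookup alone, ignoring its neighbours."""
    words = sentence.rstrip(".!?").split()
    return [(word, MOST_COMMON_TAG.get(word.lower(), "UNKNOWN")) for word in words]


tags = tag("Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.")
print({label for _, label in tags})
# {'NOUN'}
```

Because the lookup never consults context, the rare verb reading of "buffalo" can never win; a tagger would need surrounding-word evidence to recover it.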

To be fair, relatively few humans would parse that in any meaningful way either.

My only request would be to make the highlighted parts of speech configurable. For me it's too active with 11 different colors/things demanding attention.

But it would be awesome to be able to hover my mouse over a "palette" and see the matching parts of speech highlight. Or even just turn the parts on/off to see what I want to see.

That's beautiful! I wish that someone could make that for Chinese. I'm trying to learn Chinese now, and I think it would be easier in colour.

Also, is there a way to run this script on a large collection of text files? Specifically, the Bible. I'd be curious to split out all the names/places/etc.

One of the things that machine learning teaches us is that the cycle between running a test on the environment and getting feedback is critical. One of the major problems with the school system is how often the feedback loop is "take test Tuesday, teacher grades it Thursday night, student gets it back on Monday". This is useful for assessment but useless for learning.

Math is an example of where this should be used so very, very much more. Why is my first grader turning in tests covering addition when a computer can literally spit back whether it is right or wrong in a fraction of a second? The learning rate is immensely better at that speed.

And this is a fantastic example of how that sort of thing could be applied to parts-of-speech learning, in real sentences rather than just an artificial "show word -> demand part of speech" test style. It's the part-of-speech equivalent of the red wavy line under misspelled words, which is also a far better way to teach spelling than any number of "spelling test Tuesday returned to student Thursday" ever could be.

It's why I like computer games.

It's why I learn programming languages way faster than real languages.

There are actually a lot of people who color code Chinese according to tones.

This is possible and is done using pinyin. The tone of the character can be highlighted in a different colour, as CCEDICT and/or Pleco do. I have worked on this before, but sandhi and polyphonic characters complicate it a little.
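A minimal sketch of tone colouring, assuming the input is already numbered pinyin such as "ni3 hao3" (5 standing for the neutral tone). The palette and function names are hypothetical, and the sandhi step covers only the simplest rule: a third tone directly followed by another third tone is pronounced as a second tone. Longer runs of third tones depend on phrasing, and polyphonic characters need a dictionary lookup before this step, neither of which is attempted here.

```python
import re

# Hypothetical palette; real tools pick their own colours.
TONE_COLORS = {1: "red", 2: "orange", 3: "green", 4: "blue", 5: "grey"}
SYLLABLE_RE = re.compile(r"^([a-zü:]+)([1-5])$")


def parse_tones(numbered_pinyin):
    """Parse 'ni3 hao3' into [('ni', 3), ('hao', 3)]."""
    parsed = []
    for syllable in numbered_pinyin.lower().split():
        match = SYLLABLE_RE.match(syllable)
        if match is None:
            raise ValueError(f"not numbered pinyin: {syllable!r}")
        parsed.append((match.group(1), int(match.group(2))))
    return parsed


def apply_third_tone_sandhi(parsed):
    """Turn a third tone into a second tone when a third tone follows it."""
    result = list(parsed)
    for i in range(len(result) - 1):
        if result[i][1] == 3 and result[i + 1][1] == 3:
            result[i] = (result[i][0], 2)
    return result


def colorize_tones(numbered_pinyin, sandhi=True):
    """Return (syllable, color) pairs for a numbered-pinyin string."""
    parsed = parse_tones(numbered_pinyin)
    if sandhi:
        parsed = apply_third_tone_sandhi(parsed)
    return [(syllable, TONE_COLORS[tone]) for syllable, tone in parsed]


print(colorize_tones("ni3 hao3"))
# [('ni', 'orange'), ('hao', 'green')]
```

Passing `sandhi=False` colours by the dictionary tone instead of the spoken one, which is the choice a learner-facing tool would have to make.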

I imagine people learning Latin might find it useful as well. And have a mouseover for conjugation / declension.

I can really see this being useful for any language.

Are there plans to open source this, or does anyone know of OS projects that do this? Thanks!

It's very odd - I can barely work without syntax highlighting when programming, but this does absolutely nothing for me in English. I wonder why this is.

I'm tempted to say that it's because generally when reading in English, I'm not "jumping around" as much - I'm just reading left to right and processing everything into a thought. It could easily be, though, that it's just that English is my native language and I already do a better job immediately understanding the syntax of sentences than an automated tool would anyway.

Very interesting. I've toyed around with this idea myself and come to the conclusion that coloring at the part-of-speech level is not that useful for the educational aspects I'm interested in. If you could take this to the next level and visually display the grammatical sentence structure then I think this would be a really useful tool for people studying English as a second language.

There are plenty of tools that can draw a variety of tree structures over English text.



I think this is kind of neat for structured English as opposed to prose. Prose tends to look like vegetable soup, but I pasted a very regularly phrased and structured todo list in here and it looks great. Things like the standard Agile "as a user, I want X so I can Y" formalised story descriptions just make sense with this kind of syntax highlighting.

I made a proof-of-concept of this a while ago (http://www.mrspeaker.net/2012/03/24/syntax-highlighting-for-...) - my "writer-friends" weren't into it (what do they know!)... but then there was a post on HN a few years later where someone had patented the idea! I can't dig it up - does anyone else remember this?

EDIT: Ah, this was it: https://news.ycombinator.com/item?id=6966528. An update on the blog says the company it was about tweeted "We will drop our patents pending. Thank you @dhh for clearing our minds."

I'm not sure why you were downvoted, your demo is really cool.

Server load too high! I'm going to restart the server. Apologies for the downtime. In the meantime, here's a screenshot: http://i.imgur.com/493smt7.png

Hello! Are you open sourcing this? If so- can I use your library?

I actually really like this, and I wonder, if it were made an extension and made to do more subtle color shifts of the underlying font, whether it might be a very useful tool for people with some types of learning difficulties, and aid effective reading in general.

I think highlighting that helps parse the sentence structure would be more useful.

On Iceweasel (Firefox) 44.0.2 I only see "undefined" in the text area.

Thanks for the report. It seems like Firefox 45 has implemented the innerText function (https://developer.mozilla.org/en-US/docs/Web/API/Node/innerT...). I'll see if I can polyfill it.

use textContent instead of innerText, works both on chrome and ff

textContent doesn't preserve newlines correctly. I found a polyfill, so Firefox <45 should work now.

I confirm it does. Thanks!

It does seem to have issues with garden path sentences - in "The old man the boat", it colors "man" as a noun rather than a verb.


  The old man the boat.

  The complex houses married and single soldiers and their families.

  The man whistling tunes pianos.

  Time flies like an arrow; fruit flies like a banana.

  The cotton clothing is made of grows in Mississippi.

  I convinced her children are noisy.

Gets all of 'em wrong.

You should post some examples of highlighted literature to see if different writing styles look different because of the syntax highlighting.

This seems like it would be useful for education. I don't see how it helps reading, which is the purpose of highlighting source code.


It's really great to see advances in English syntax highlighting (an oft-neglected 'language' when it comes to code editors).

However, I get a feeling this is a sort of hype-train at the moment, because I swear I've seen three English syntax highlighters in the past week alone (being posted to HN).

Could this be the latest, albeit welcomed and intelligent, project exercise obsession à la Flappy Bird / 2048? :)

Well, OP said in a comment that it was inspired by the top comment from the post from 2 days ago...

Very cool! Could you change the colors so as to make adverb vs noun easier to tell apart?

I tried it with:

    The red car
    This red car
    Your red car

And it marked "Your" like "red" and not like "The" or "This". Surely "your" is a determiner and not an adjective? I mean, you can't say *"The your car".

A case against syntax highlighting: [1]

[1] http://www.linusakesson.net/programming/syntaxhighlighting/
