> It would be interesting to see the major parts of speech (nouns, verbs, adjectives) colored. Instead this is a coloring of fairly random words. A bunch of short words are grey, but they don't belong to any particular part of speech. They include some articles, prepositions, conjunctions and a few verbs...
Actually put all the instructions outside of the example box. On first pass I assumed everything in the box was just sample text, so I dived straight into typing and then didn't get any colours. Then I had to reload to find out I had to tab out.
It's strange to me that we call these things "syntax highlighters", given that they don't really seem to colour things differently according to syntax,† but rather by the category of token. Perhaps "lexical highlighter" would be a better term for these?
Because of this, unfortunately "English syntax highlighter" has two possible interpretations to me: [English [syntax highlighter]], where syntax highlighter is a noun having the idiosyncratic meaning it has in a programming context, i.e. a tool that highlights different tokens according to their category, and [[English syntax] highlighter], a tool that would presumably highlight parts of a sentence according to their syntactic function. The latter would be more exciting, but also more difficult.
† There are some exceptions to this, but it does seem to be true for most of the syntax highlighters I've used.
Your observation seems plausible at first. However, on second thought I concluded otherwise.
In the natural language context, a part of speech is actually determined syntactically. The "type" of a token in a formal language is, by comparison, also syntactic. Any highlighter must recognize the syntactic structure of its target language and work like a parser, that is, not like a tokenizer. You have to "push" something to the stack, if you know what I mean.
To give a bad, but intuitive example, if a highlighter assigned "gray" to every "=" character it comes across, it would not be able to highlight the same characters inside a quotation mark.
> Any highlighter must recognize the syntactic structure of its target language and work like a parser, that is, not like a tokenizer. You have to "push" something to the stack, if you know what I mean.
How often does syntax matter in "syntax highlighting"? Most highlighters I've used don't even try to recognise syntactic structure.
> To give a bad, but intuitive example, if a highlighter assigned "gray" to every "=" character it comes across, it would not be able to highlight the same characters inside a quotation mark.
The problem with this example is that strings are a single token.
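To make the disagreement concrete, here is a minimal sketch for a made-up toy language (the token categories and function names are my own, not from any real highlighter). A per-character rule flags every "=", including the one inside the quotes, while a plain regex tokenizer swallows the whole string literal as one token and never sees the inner "=" at all. No stack is involved.

```python
import re

# Hypothetical token categories for a toy language. Order matters: the
# first alternative that matches at a given position wins.
TOKEN_RE = re.compile(r'''
    (?P<string>"(?:\\.|[^"\\])*")   # a whole double-quoted string literal
  | (?P<operator>=)                 # a bare equals sign
  | (?P<word>\w+)                   # identifiers and numbers
  | (?P<space>\s+)                  # whitespace
  | (?P<other>.)                    # anything else, one character at a time
''', re.VERBOSE)


def tokenize(source):
    """Return (category, text) pairs, skipping whitespace."""
    return [(m.lastgroup, m.group())
            for m in TOKEN_RE.finditer(source)
            if m.lastgroup != "space"]


def naive_equals_positions(source):
    """Per-character approach: flag every '=' regardless of context."""
    return [i for i, ch in enumerate(source) if ch == "="]


if __name__ == "__main__":
    line = 'x = "a=b"'
    print(naive_equals_positions(line))  # both '=' characters are flagged
    print(tokenize(line))                # only one operator token appears
```

This only covers the string case from the example above. It says nothing about constructs that really do need nesting.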
(This is being colorized by invoking Vim out of Apache on-the-fly.)
Here we can see that a Lisp dialect is embedded in a special-purpose language. The functions and operators of the Lisp dialect are colored green, but those in the special-purpose language are purple. This is worked out even in situations when they are the same symbol. E.g. both languages have a "do" operator.
Vim doesn't handle nesting with a formal context-free grammar. Rather, it allows named regions to be defined, and highlighting matches can be declared as occurring only in specific regions. Regions can also be in regions. You can get very accurate coloring; but it can also be slow at times.
I totally expected it to be "just" a part-of-speech-tagger.
It does at least these correctly:
It takes many tries to succeed.
It takes only one try.
It takes a try.
I want to try and succeed.
I'm gonna try it anyway.
Man the boats!
Man and the boats!
What would you want it to do? To show the parse tree? If I understand correctly, even non-computational linguists have had (and still have) trouble coming up with a system that parses all natural language correctly - or even a system that could represent all the parses of natural sentences correctly.
Btw. to elaborate: there seem to be some troublesome sentences where a parse tree just isn't enough - you've got to have a graph. Yet most linguistic phenomena can be parsed into tree structures. That's troublesome. It's like... there's a certain class of parser in the brain, but then some curious sentences show the ability to overcome the "limitations" of that parser, and still they are considered grammatical.
Never hide hides.
The sound sounds sound.
Flies fly.
It's the right right, right?
The key key is this one.
It's an objective objective.
In May, May may make out with me.
Compared with the last one, this is a fine fine.
The man we saw saw a saw.
The first second was alright, but the second second was tough.
Well, Buffalo, the proper noun, and buffalo, the noun, are both going to be highlighted as nouns. That leaves buffalo, the verb, which is slang and so is unlikely to be recognised as not being the noun, ignoring the difficulty of parsing the sentence anyway.
My only request would be to make the highlighted parts of speech configurable. For me it's too active with 11 different colors/things demanding attention.
But it would be awesome to be able to hover my mouse over a "palette" and see the matching parts of speech highlight. Or even just turn the parts on/off to see what I want to see.
That's beautiful! I wish that someone could make that for Chinese.
I'm trying to learn Chinese now, and I think it would be easier in colour.
Also, is there a way to run this script on a large collection of text files? Specifically, the Bible. I'd be curious to split out all the names/places/etc.
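I don't know whether the tool itself can be run in batch, so the following is only a stand-in sketch using the Python standard library. It walks a directory of .txt files and counts capitalised words that do not start a sentence, as a very crude guess at names and places. It is not a part-of-speech tagger, and the function names and the heuristic are my own assumptions.

```python
import re
from collections import Counter
from pathlib import Path

# Words (with apostrophes) or sentence-ending punctuation.
WORD_RE = re.compile(r"[A-Za-z']+|[.!?]")


def candidate_names(text):
    """Count capitalised words that are not sentence-initial (crude heuristic)."""
    counts = Counter()
    sentence_start = True
    for token in WORD_RE.findall(text):
        if token in ".!?":
            sentence_start = True
            continue
        is_pronoun_i = token.split("'")[0] == "I"  # "I", "I'm", "I'd", ...
        if token[0].isupper() and not sentence_start and not is_pronoun_i:
            counts[token] += 1
        sentence_start = False
    return counts


def scan_directory(root):
    """Sum the candidate-name counts over every .txt file under root."""
    total = Counter()
    for path in sorted(Path(root).rglob("*.txt")):
        total += candidate_names(path.read_text(encoding="utf-8", errors="replace"))
    return total


if __name__ == "__main__":
    sample = "In the beginning Moses went to Egypt. Then Aaron spoke."
    print(candidate_names(sample).most_common())
```

It will miss names that open a sentence and will pick up any other capitalised word mid-sentence, so treat the output as a list of candidates to review.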
One of the things that machine learning teaches us is that the cycle between running a test on the environment and getting feedback is critical. One of the major problems with the school system is how often the feedback loop is "take test Tuesday, teacher grades it Thursday night, student gets it back on Monday". This is useful for assessment but useless for learning.
Math is an example of where this should be used so very, very much more. Why is my first grader turning in tests covering addition when a computer can literally spit back whether it is right or wrong in a fraction of a second? The learning rate is immensely better at that speed.
And this is a fantastic example of how that sort of thing could be applied to parts-of-speech learning, in real sentences rather than just an artificial "show word -> demand part of speech" test style. It's the part-of-speech equivalent of the red wavy line under misspelled words, which is also a far better way to teach spelling than any number of "spelling test Tuesday returned to student Thursday" ever could be.
This is possible and is done using pinyin. The tone of the character can be highlighted in a different colour, as CCEDICT and/or Pleco do. I have worked on this before, but sandhi and polyphonic characters kinda complicate it a little.
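A minimal sketch of the tone-colouring idea, assuming the text is already in pinyin (so polyphonic characters are someone else's problem). The palette is invented, and the sandhi handling covers only the simplest third-tone rule: a third tone directly followed by another third tone is shown as a second tone. Longer runs depend on phrasing and are not modelled properly here.

```python
import unicodedata

# Combining marks that carry the four tones in accented pinyin.
TONE_MARKS = {
    "\u0304": 1,  # macron
    "\u0301": 2,  # acute accent
    "\u030c": 3,  # caron
    "\u0300": 4,  # grave accent
}
# Hypothetical palette; 5 stands for the neutral tone.
TONE_COLOURS = {1: "red", 2: "orange", 3: "green", 4: "blue", 5: "grey"}


def tone_of(syllable):
    """Tone number of one syllable, given as 'hao3' or 'hǎo'."""
    if syllable and syllable[-1] in "12345":
        return int(syllable[-1])
    # Decompose accented vowels into base letter + combining mark.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return 5  # no mark and no digit: treat as neutral


def apply_third_tone_sandhi(tones):
    """Simplified rule: a 3 directly before another 3 is spoken as a 2."""
    spoken = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            spoken[i] = 2
    return spoken


def colour_syllables(syllables):
    """Pair each syllable with the colour of its spoken tone."""
    spoken = apply_third_tone_sandhi([tone_of(s) for s in syllables])
    return [(s, TONE_COLOURS[t]) for s, t in zip(syllables, spoken)]


if __name__ == "__main__":
    print(colour_syllables(["nǐ", "hǎo", "ma"]))
```

Whether to colour by written tone or by spoken tone after sandhi is a design choice; this sketch colours the spoken tone.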
It's very odd - I can barely work without syntax highlighting when programming, but this does absolutely nothing for me in English. I wonder why this is.
I'm tempted to say that it's because generally when reading in English, I'm not "jumping around" as much - I'm just reading left to right and processing everything into a thought. It could easily be, though, that it's just that English is my native language and I already do a better job immediately understanding the syntax of sentences than an automated tool would anyway.
Very interesting. I've toyed around with this idea myself and come to the conclusion that coloring at the part-of-speech level is not that useful for the educational aspects I'm interested in. If you could take this to the next level and visually display the grammatical sentence structure then I think this would be a really useful tool for people studying English as a second language.
I think this is kind of neat for structured English as opposed to prose. Prose tends to look like vegetable soup, but I pasted a very regularly phrased and structured todo list in here and it looks great. Things like the standard Agile "as a user, I want X so I can Y" formalised story descriptions just make sense with this kind of syntax highlighting.
I made a proof-of-concept of this a while ago (http://www.mrspeaker.net/2012/03/24/syntax-highlighting-for-...) - my "writer-friends" weren't into it (what do they know!)... but then there was a post on HN a few years later where someone had patented the idea! I can't dig it up - does anyone else remember this?
EDIT: Ah, this was it: https://news.ycombinator.com/item?id=6966528. An update on the blog says the company it was about tweeted "We will drop our patents pending. Thank you @dhh for clearing our minds."
Server load too high! I'm going to restart the server. Apologies for the downtime. In the meantime, here's a screenshot: http://i.imgur.com/493smt7.png
I actually really like this, and I wonder, if it were made an extension and made to do more subtle color shifts of the underlying font, whether it might be a very useful tool for people with some types of learning difficulties, and aid effective reading in general.
The old man the boat.
The complex houses married and single soldiers and their families.
The man whistling tunes pianos.
Time flies like an arrow; fruit flies like a banana.
The cotton clothing is made of grows in Mississippi.
I convinced her children are noisy.
It's really great to see advances in English syntax highlighting (an oft-neglected 'language' when it comes to code editors).
However, I get a feeling this is a sort of hype-train at the moment, because I swear I've seen three English syntax highlighters in the past week alone (being posted to HN).
Could this be the latest, albeit welcomed and intelligent, project exercise obsession à la Flappy Bird / 2048? :)
And it marked "Your" like "red" and not like "The" or "This". Surely "your" is a determiner and not an adjective? I mean, you can't say *"The your car".