Hacker News
Proselint (proselint.com)
398 points by g1n016399 on Mar 7, 2016 | 137 comments

This sounds interesting. As a bit of constructive criticism, please put some examples high up.

You tell me it does cool things. Great, show me. I've looked about on the various pages and can see only one example and I don't understand it:

    text.md:0:10: wallace.uncomparables Comparison of an uncomparable: 'unique' can not be compared.
What's the context here, and what error would it have caught in my writing?

The tool is perfectly placed to show this off, since its subject matter is text.

Good idea. If you run `proselint` without specifying a document, it'll run on the demo text, which you can also access here: https://gist.github.com/suchow/c7856f21128aee89ad55. Also, there's a live demo available at: http://proselint.com/write. It's been tested only on the latest version of Chrome, and I doubt it will handle the load here, but give it a try.

This would catch something like "even more unique". In fact, looking at the code (https://github.com/amperser/proselint/blob/master/proselint/...) it would even catch something like "extremely unique", which I've been guilty of using.
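Looking at that rule, the core of such a check is just pattern matching. A toy sketch of the idea (not the actual proselint implementation; the word lists here are made up for illustration) might look like:

```python
import re

# Toy "uncomparables" check: flag intensifiers applied to absolute
# adjectives. Word lists are illustrative, not proselint's actual lists.
UNCOMPARABLE = re.compile(
    r"\b(?:very|more|most|extremely|even more)\s+(?:unique|infinite|perfect)\b",
    re.IGNORECASE)

def find_uncomparables(text):
    """Return the offending phrases found in `text`."""
    return [m.group(0) for m in UNCOMPARABLE.finditer(text)]
```

On "This design is extremely unique.", this would flag "extremely unique".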

But yes, there should be examples on the front page.

So this program is like having some insufferable pedants arguing over your language? Great!

Does it accept 'nearly unique' ?

It accepts "kind of unique" and "hardly unique" but also "greatly unique" which in my mind is very close to "very unique."

This tool is a blunt instrument. Writing is an art.

What exactly is the argument or implication in the last two sentences? There are many works of art that use "blunt" instruments. Say the Venus of Willendorf or a Serra sculpture, though that depends on what you mean by blunt. Even in literature, blunt tools are used, such as a pencil.

So, I think English is an "analytic language". Although I wouldn't know exactly what that means, it inspires me to assume that by analyzing the sentence you can make out that "extremely" could refer to something other than the uniqueness, couldn't it? E.g., the phrase could mean that something was unique because of one of any number of extremes: unique by extremity. Sure, that should be "uniquely extreme", but what I said shows something else. If there are different qualities that could be unique, wouldn't it make sense to quantify that? Of course, if a logic is so weak it cannot support the Peano axioms, you cannot advance beyond uniqueness. (What about the missing universal quantifier in propositional logic and uniqueness in predicate logic? I'm just stabbing in the dark, really.)


One cannot have gradations of uniqueness


Imagine an object that is unique in exactly one respect, and another unique in two. Obviously, the second has more uniqueness. Now, if the first's only characteristic is precisely that of having no real characteristics at all, then that's a totally different degree of uniqueness. So either could be more unique in a specific respect. But arguably, the nothingness is the most unique, if not the only really unique thing. So if we can ignore that, because no one in his right mind would talk about nothing, we can almost always readily conclude that the first type from a collection of unique-ish things is meant.

Is this reductio ad absurdum or argumentum ex silentio?

I like your thinking.

The link I posted really concerns the insufferability of someone who corrects technicalities of language rather than a discussion on whether uniqueness is a countable property.

Some people feel you should never ever say things like "more unique", "most unique" etc

Which I think is just as misguided as trying to force "data" to be plural, or insisting that "less than 3" is wrong

> Some people feel you should never ever say things like "more unique", "most unique" etc

I am among them. Here's why:

(1) There are already other words that express related concepts that are subject to gradation: "rare", "special", "unusual", and "extraordinary" come to mind.

(2) The original meaning of "unique", namely "one of a kind", is an important concept. If we let the word's meaning get lost, we will not be able to express that meaning as easily.

In a mathematical context, something is either "unique" or it is not. There is no in-between state.

But you can easily define it to mean something else. And you can even make "uniqueness" comparable.

I think it's obvious that what people mean is that "more unique" = "unique in more dimensions" or "the degree to which this differs from the norm is greater".

E.g. (2, 4, 7) (2, 4, 8) (2, 8, 4) (2, 4, 7) (1, 4, 8) (0, 9, 3) (987, 4, 7)

When asked what are the "most unique" sets in that list, you'd probably be acting deliberately obtuse if you chose anything but one of the last two.
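Under that informal reading, "most unique" can even be computed. A quick sketch (the distance metric here is my own assumption, chosen purely to illustrate):

```python
# Score each tuple by its total elementwise distance to the rest of the
# list; a larger score means it "differs more from the norm". The metric
# (sum of absolute differences) is an arbitrary illustrative choice.
tuples = [(2, 4, 7), (2, 4, 8), (2, 8, 4), (2, 4, 7),
          (1, 4, 8), (0, 9, 3), (987, 4, 7)]

def distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def score(i):
    return sum(distance(tuples[i], tuples[j])
               for j in range(len(tuples)) if j != i)

ranked = sorted(range(len(tuples)), key=score, reverse=True)
```

The two highest-scoring entries come out as the last two tuples in the list, matching the intuition above.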

I guess it's understandable but why not choose an appropriate word like "different" and keep "unique" a very strong word?

Well, you don't really get a choice. I'm just describing how the word is used, not suggesting an alternate meaning for it.

Math has a long tradition of generalizing definitions to make sense for larger domains when it's useful. See fractional powers, complex numbers, quaternions, and many more.

I can imagine defining uniqueness as a function returning a real number in [0, 1] instead of a boolean value. For example:

    let U(p, x, X) be the uniqueness of property (function) p for element x of set X

    U(p, x, X) = 1 - |X'| / |X \ {x}|, where
    X' = { x' in X : p(x') = p(x) and x' != x }
Property p of element x of set X is strictly unique if and only if U(p, x, X) = 1.

When is it useful? For example, for speaking about minimizing collisions of hashes for given data.

Another way of thinking about it: uniqueness is 1 minus the probability of uniformly randomly finding an element of X with the same value of p as x, after removing x from X.
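That definition translates directly into code. A minimal sketch, treating X as a list of distinct elements:

```python
def uniqueness(p, x, X):
    """U(p, x, X) = 1 - |X'| / |X \\ {x}|, where X' is the set of other
    elements sharing x's value of p. Returns a float in [0, 1]."""
    others = [y for y in X if y != x]
    if not others:          # x is the only element: trivially unique
        return 1.0
    same = [y for y in others if p(y) == p(x)]
    return 1.0 - len(same) / len(others)
```

With p as parity on [1, 2, 3], the value 2 is strictly unique (the only even element, U = 1.0), while 1 shares its parity with 3 (U = 0.5).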

In the mathematical world you operate on abstract objects. In the real world, you need to abstract things; before that, everything is unique; after that, well, it depends on your abstraction. So unless you're talking about mathematics, things can be more or less unique.

This is an excellent and subtle comment. You seem like someone with a tolerance for philosophical nit-picking. Please forgive me if I'm mistaken.

Instead of saying everything is unique, we could simply say that there is nothing. A thing is itself an abstraction. The concrete world is without inherently distinct things. We must abstract things for "unique" to describe something at all. As you implied, this process is arbitrary. Every way in which you could abstract things implies a distinct notion of "uniqueness". To simply select one "uniqueness" (like mathematics does) is arbitrary. But to consider every possible "uniqueness" equally is also arbitrary. Without prioritizing forms of "uniqueness" we can only construct a partially ordered set. So when you avoid a fixation on mathematics, things can be more, less, or "incomparably" unique.

I suspect most pairs of things are incomparably unique. Further, I suspect most binary qualities are predominantly incomparable. I don't know that you should never say things like "more unique", but it might be fair to issue a warning in a prose linter. Any binary quality used as a continuum requires an arbitrary combination of its distinct forms. If this isn't specified, then it only has meaning for those who already know what it is.

Some philosophers, thinking especially of Graham Harman, have started reacting against the now sort of commonplace idea that "there are no things (or objects) in reality."

From a common sense perspective, it's obvious that there are things. Sure, you can point out the flux and decay of all entities, but still, this table here is a coherent thing even if it's made from parts in a temporary arrangement.

In some sense, philosophy itself is destroyed when you go down the path of denying objects, since philosophy crucially deals with concepts, and concepts are "thought objects."

Harman describes two modes of denying objects: undermining and overmining. Undermining is the tendency to say "really, this object is just a composition of these other particles," while overmining is the tendency to say "this object is just a modulation in a grand monistic entity."

Instead of that, he recommends an ontology of objects that's pretty interesting and fun to read about. He would, I think, agree that objects are unique in that they are (in programmer jargon) "pointer equal" to only themselves... and each real object, for that reason, has an infinity of potential that's never exhausted by any "arbitrary" perception of it... yet still, we perceive other objects not directly, but through aesthetic caricatures, and on that level you might have different degrees of uniqueness.

Thank you very much for this comment. I'm an armchair philosopher and I hadn't heard of Graham Harman. His notion of objects is beautiful. In one motion nihilism both compels me to accept my sins and deprives me of any path to salvation. Harman's objects capture the essential impetus of nihilism without ultimately voiding conception. In fact, they even capture the paradox of nihilism. The denial of objects necessarily implies an objective system: dualism. First there is an object contriving infinitely varied "caricature objects". Then there must be another object that is (infinitely) not any of those. This expression of our relationship to The Great Unknowable Reality is much saner. It doesn't overmine. It doesn't undermine. It doesn't leave me oscillating between affirmation and denial. Also, most importantly, I'm given a clue to further knowledge. I am that contriving object. This is just a caricature of reality. My participation in its consideration is entirely arbitrary. I'm haunted by the concern that knowledge exists which cannot be captured by this freedom. But for now these objects certainly get us further than nothing. ;)

If the world is a set, everything is unique by definition.

You need to define some relation on that set to get classes of abstraction. And that's exactly what abstraction means :)

Even math has an infinity of infinities - Cantor's findings, etc.

It's a good thing, then, that most people don't use it in its technical sense.

An easy fix for this non-problem is to use "distinct" instead of "unique".

Curiously (but perhaps only to me) all things are comparable - able to be compared.

If you mean similar, may I commend the word similar? ;-)

It's a linter, it's going to have some kind of "false positives." Maybe you could put an annotation that tells the linter you're sure that you mean it.

Semi-off-topic, but the notion of "more unique" reminds me of Sapolsky's TED talk about humans as the "uniquiest" animal.


I'm a writer and editor, and I dislike the idea of this tool quite a bit.

1. Writing isn't coding. In coding, you can do various types of "cargo cult programming" and "copypasta" and what-have-you -- in other words, as long as the code runs you don't necessarily have to know why or how a programming idiom or convention works, or how/why expressing it one way in code is better than expressing it another way. This is definitionally untrue of writing. If you don't know the why/how of something, then it's better for you to botch it and let the reader attempt to parse it, so at least they know what they're dealing with and how to interpret it ("oh, this guy's a non-native speaker, so I'll adjust my reception accordingly" or "ah, this person is kind of clueless about the whole sexist language thing, which is good info for me").

2. 90% of writing style advice falls into one of two categories: a) hotly debated, and b) totally wrong. Most of it is in the latter category, and this includes Strunk & White (just use google for numerous takedowns of that text). I looked through the PR queue and saw that it consists of eager coders finding style advice from various sources and trying to work that into the tool. That is terrible, terrible, terrible... This will guarantee that the tool will represent a collection of awful writing advice gleaned from dubious sources and wielded with unforgiving ignorance.

This tool may be a terrible idea, but the idea of automated prose linting is not terrible. Most beginner to intermediate writers have tics, and as an editor I often have a couple of writer-specific find/replace things I do when I get a new piece from a particular writer (e.g. "this person uses 'however' when she means 'but'," or "this person overuses these four business jargon terms," etc.). If editors were able to easily compose and execute writer-specific linters from within something like Wordpress, that would probably be pretty great.

But this particular command line tool is destined to be either totally unused or massively abused.

I'm sorry, I hate to be mean... or, actually, there is a small part of me that enjoys playing Mr. Party Pooper when I see a mob of enthusiastic programmers trying to tie down some great cultural Gulliver with a thousand tiny little automated, black-and-white rules.

Thanks for the feedback. These are issues we've thought about, and we came to different conclusions:

re 2, you'll see at http://proselint.com/approach/ that one of the guiding principles of Proselint is that we defer to experts. In practice, that's meant almost all the advice comes from Bryan Garner's usage guide, Garner's Modern American Usage. He is a careful compiler of advice and you'll find that he is almost never "totally wrong", and when his advice is debated, he knows it, notes it, and provides a thoughtful discussion.

re 1, we think of Proselint as eventually being useful as a training tool, a way to learn the conventions. Note that natural languages are large, with so many low-frequency terms that nobody can learn the whole language. Why err if an automated tool can help? Consider for example demonyms, what you call people from a certain place. How many people know, for example, that people from Manchester are Mancunians, not Manchesterians? Rather than call someone by the wrong name, with Proselint the voice of an expert gently corrects you, and you learn a cool new word.

We aren't a mob of programmers; we are three people who love language, respect it, and think we're 2% of the way to making a great tool, one that The New Yorker could run over its stories to flag issues that its own editors would flag anyway. (In fact, we've done this, running Proselint over a corpus of highly vetted text, and have found numerous issues.)

Calling someone from Manchester a "Manchesterian" instead of "Mancunian" is not wrong, or even necessarily bad. Rather, it communicates something to the reader. Depending on the context, it could mean this person doesn't know that the correct term is "Mancunian", and did not look it up or even know that it should be looked up, all of which gives me useful info and context about the writer and their education level and the amount of effort they put into the piece and the amount of editing it underwent and so on. At the very least I can surmise that the writer is not a Mancunian. Or, it could mean that the writer is attempting to be clever.

Widespread use of proselint to correct this type of thing wouldn't improve writing. Rather, it would just add another interpretive option to the above range of scenarios, i.e. "ah, I can tell that this writer did or did not run that proselint tool before submission, because their text is or is not littered with boilerplate proselintisms."

The way to improve genuinely bad writing is not with rules and tools -- it's with lots of reading, a little mentorship, and lots and lots and lots of practice.

> Calling someone from Manchester a "Manchesterian" instead of "Mancunian" is not wrong, or even necessarily bad. Rather, it communicates something to the reader. Depending on the context, it could mean this person doesn't know that the correct term is "Mancunian", and did not look it up or even know that it should be looked up, all of which gives me useful info and context about the writer and their education level and the amount of effort they put into the piece and the amount of editing it underwent and so on. At the very least I can surmise that the writer is not a Mancunian. Or, it could mean that the writer is attempting to be clever.

If the only goal of writing were to allow accurate assessment of the writer, then I would agree. But there are other reasons for writing — informing, persuading, clarifying, &c. — where writing clear, consistent, and idiomatic prose can help. Yours is a condemnation of all attempts to improve writing beyond the first-draft capabilities of the author.

> The way to improve genuinely bad writing is not with rules and tools -- it's with lots of reading, a little mentorship, and lots and lots and lots of practice.

Agreed, Proselint is not the right tool to improve genuinely bad writing. Reading great authors and sweating through drafts is what we'd recommend to get better at the craft, too.

> all of which gives me useful info and context about the writer and their education level and the amount of effort they put into the piece and the amount of editing it underwent and so on.

From a reader-centric point of view, I can understand lamenting the loss of this information channel. From the author's stance, I can imagine wanting to tighten up alternate channels of information and present a clearer message. The author always has this ability, through natural circumstance, effort, or research, so this tool would do nothing but make it easier. As a reader, it may change the assessment to whether they ran a proselint-like tool or not, but in the end those are just assumptions. The writer could be making specific choices to disregard the linting tool on purpose. In the end, reading is still an interpretive experience; this just allows authors more options.

> The way to improve genuinely bad writing is not with rules and tools -- it's with lots of reading, a little mentorship, and lots and lots and lots of practice.

Generally good advice for anything, but I think it's worth noting that different people learn in different ways, and providing more methods for learning is generally an improvement, and opens the field to more people. Tools that look to circumvent historical methods for achieving skill often face an uphill battle from those that used those historical methods. It's easy to see why, as it looks like it has devalued much of the hard work they put into their skills. This may be true to an extent, but the gains often far outweigh this, as making a skill accessible to more people has wide-ranging benefits for society in general.

In more concrete terms, I see no reason why a tool like this can't be a multiplier for mentorship and practice. At the very least it enables exposure to ideas that might not have been encountered before.

Felt like someone should say this in this thread, but calling someone a "Manchesterian" offers no insight into anyone's education level, and I honestly don't even think it's something that we should be focusing corrections on. If anything, it would probably be nice if everyone started using "Manchesterian" instead of "Mancunian" because that seems a hell of a lot more clear to me ;)

To the library authors, Proselint looks very cool!

Do you have any linguists consulting / on staff?

Bryan Garner might be a careful compiler but doesn't seem to be a linguist and seems to be a traditionalist who makes simple errors.

e.g. http://itre.cis.upenn.edu/~myl/languagelog/archives/001869.h...

"His chapter is unfortunately full of repetitions of stupidities of the past tradition in English grammar — more of them than you could shake a stick at."


"So why did Bryan Garner, a highly intelligent and insightful person, make this elementary error?"


"A good editor should know that Bryan Garner’s take on the subject is misleading and incorrect. It’s become apparent to me that many of the self-appointed guardians of the language don’t even know what it is they’re guarding."


You're implying that there is some kind of well-accepted notion of Bryan Garner being a poor guide to usage, but you link to some articles that are just nitpicking small terminology differences.

The second link in particular is tendentious. It claims Garner gives "a savage indictment of the behavior and character of those who use Stage 1 words [new usages]" in his book MAU.

But if you follow to the linked page from MAU, you read that Garner is, in an appendix, giving a series of wry analogies for the process of acceptance of new terms -- not a savage indictment at all. In other words, Garner is not himself saying all new usages have "a grade of F", etc., he's saying that is how some new usages will be perceived, in a very gross and qualitative sense, by a strict static conception of the language.

Since Garner comes right out and explicitly says all of the above, the link you cite comes off as picking a fight. There's nothing there.

Having read MAU (back in its first edition), I have to say that Garner strikes me as a very good guide to usage. I still enjoy perusing the book.

Taken as a whole, do you really have significant issues with MAU as a usage guide?

> some kind of well-accepted notion of Bryan Garner being a poor guide to usage

Wasn't my intention - merely pointing out that he's not a linguist, and that his making simple errors should give anyone using him as an "authority" considerable pause.

> do you really have significant issues with MAU as a usage guide?

I am neither an American nor a linguist - which makes me doubly unqualified to comment. That I leave to experts.

^^ This right here is exactly what I'm talking about.

Again, the idea of prose linting is not terrible, and in fact I do a hacked-up version of it with a set of standard find/replace operations for specific writers who have specific issues. But a giant, general-purpose ball of rules of dubious provenance applied to a generic abstraction called "prose" is what I take issue with.

Garner's focus is on usage, not grammar, so for a usage linter, this doesn't seem like a big problem.

Is there an accessible, comprehensive, easy-to-read guide like Garner's Modern American Usage that's considered more accurate? There don't seem to be many options.

(I have a copy of GMAU and enjoy it, but mostly for discussion of usage, not the details of grammar)

> I dislike the idea of this tool quite a bit.

> This tool may be a terrible idea, but the idea of automated prose linting is not terrible.

So which is it? The idea of the tool is prose linting, and you've now stated both that you dislike it quite a bit, and that it's not terrible.

Part of what I think you may be missing is that it doesn't need to be an all-inclusive set of generally terrible, conflicting suggestions. With code style checkers we've already mostly solved this problem, by storing metadata about the source of the rules and allowing this metadata to be referenced when making custom rulesets. Perl::Critic[1] is a good example of this. It allows you to use the default ruleset and select a severity of criticism, or it allows an organisation (or individual) to create their own custom ruleset to enforce how they want their code to look.

Keeping this in mind, what if the default ruleset was curated to have select rules from multiple sources, but allowed you to easily take a source and use its rules? For example, if I want to write using Strunk & White today, that might be as easy as a command line flag, or downloading a specifically compiled ruleset. If I want to use something else, the same. If I want to make my own custom ruleset based on rules from multiple rulesets and a few of my own thrown in, that should be possible too.

1: https://en.wikipedia.org/wiki/Perl::Critic
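To make the idea concrete, here's a minimal sketch of source-tagged rules with provenance-based filtering. Everything in it (rule ids, sources, patterns) is hypothetical, not proselint's actual rule set or API:

```python
import re

# Each rule carries provenance metadata, so rulesets can be composed by
# filtering on source. All rule ids, sources, and patterns are made up.
RULES = [
    {"id": "garner.uncomparables", "source": "Garner",
     "pattern": r"\b(?:very|most)\s+unique\b",
     "message": "'unique' cannot be compared."},
    {"id": "strunk.needless_words", "source": "Strunk & White",
     "pattern": r"\bthe fact that\b",
     "message": "Omit needless words."},
]

def lint(text, sources=None):
    """Run only the rules whose source is in `sources` (None means all)."""
    hits = []
    for rule in RULES:
        if sources is not None and rule["source"] not in sources:
            continue
        for m in re.finditer(rule["pattern"], text, re.IGNORECASE):
            hits.append((m.start(), rule["id"], rule["message"]))
    return hits
```

Calling lint(text, sources={"Garner"}) then behaves like the command-line flag described above, applying only one guide's rules.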

It may not be universally applicable, and it may not be helpful to you in your work, but there is a spectrum of writing output requirements, and the tool (if well done) could be helpful in many situations.

1) Editing fiction by Terry Pratchett -- not too useful. 2) Editing a newspaper article -- maybe it would catch a few typo-level issues that crept in under deadline pressure, but a professional writer wouldn't lean on it. 3) A non-native speaker of English running meeting minutes through it before blasting out the e-mail -- that has a lot of utility. (Actually, the "I went to engineering school because I dislike writing." native speaker of English would benefit from linting that e-mail, too.)

>I'm a writer and editor, and I dislike the idea of this tool quite a bit.

You dislike this tool the same way welders dislike automated welding, or the same way truck drivers will dislike automated driving.

Everyone wants to believe their job is so complex that a computer will never be able to perform the same task adequately. Is critiquing a sentence really as complex as driving a car in heavy traffic? Or playing chess? Or finding faces in photographs? Or winning on Jeopardy?

I don't believe that at all. That is total nonsense, in fact. My criticism is of this linting tool, not of artificial intelligence. I take it for granted that anything I can do, an AI will eventually be able to do much better. A linting tool is not an AI.

According to some definitions, AI is just "the theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages." So by some definitions, a linter is absolutely an AI.

Semantics aside, it's not important to slm_HN's point. We can call it an AI, an algorithm, or just a computer, and in any case it's still possible for it to find errors beyond spelling ones.

"... there is a small part of me that enjoys playing Mr. Party Pooper when I see a mob of enthusiastic programmers trying to tie down some great cultural Gulliver with a thousand tiny little automated, black-and-white rules."

I'd reexamine that part, if I were you. I suspect it may be bigger than you think it is, especially since you've already pigeonholed the creators.

I'm a foreigner, who speaks English as a 3rd language, and I like the idea of this tool quite a bit.

You sound like my wife complaining about GPS devices because they sometimes err or take us to dangerous places. It is just a tool; you can simply ignore its recommendations.

On the other hand, it could be used quite effectively as a "sanity check" occasionally. Just because it flags certain things doesn't mean you have to take its advice.

How would you suggest budding writers improve their skills then? This tool seems useful for that purpose to me.

I can see a lot of value for this sort of tool, and might even play with it myself, for sake of evaluating whether or not to incorporate its suggestions into my writing. At the same time, however, I have some wariness that its widespread use could actually have a shaping, and, specifically homogenizing, effect on language. For me, a large part of the beauty of language is how facile it is, how judiciously breaking its rules can create a more artful and compelling means of expression than linted — if you will, "prosaic" — prose seems likely to offer.

> I have some wariness that its widespread use could actually have a shaping, and, specifically homogenizing, effect on language.

This could be a benefit in industries where the goal is to have homogenous writing that meets a given set of specifications/standards. Some ideas:

1. Peer-reviewed scientific writing and/or abstracts

2. Manuals

3. Materials written for a subset of language (EFL, pidgin, children's books)

4. Documentation

I agree!

But still, it catches mistakes that my spell checker doesn't see, like inconsistent spacing and 'goofy approximations' like (R) for ®. (It depends on your definition of incorrect, but I personally would not mind at all if these things were homogenized for everyone; it would not take any richness out of the English language.)

What I'd like (--help doesn't list such an option) would be the ability to enable some checks with a flag while disabling others (the ones whose suggestions you can elect to break).

That's cool but it sounds like this tool is way oversold. It namedrops DFW and other great authors then shows examples of it correcting spacing and "brb." This isn't stylistic revising that takes you closer to those writers, it's just simple corrections.

This is a fair concern about style recommenders in general. Yes, we want to shape text. And what follows is merely a partial response, but it should address some of your concerns.

First, much of the advice is that certain word sequences are problematic, without suggesting any particular replacement text. There are a few reasons for this (including the differing computational natures of the error-detection and solution-recommendation problems). The reason most relevant to your concern is that solution recommendations are more likely to produce a homogenizing effect, because they have a driving effect, deeming one particular set of words superior to another. Much as the diversity of life-forms has arisen through selective pressures, by eliminating only the least fit combinations of words, the native variation in writing can flourish all the more readily.

The goal is not to homogenize text for the sake of uniformity, but rather to flag those cases that respected authors and usage guides have identified as specifically problematic. Any text that is sufficiently artful and compelling not to have been specifically addressed by these sources should not be caught by the linter. Novelty will continue to introduce new usages, and some of them will be poor. Authors identified as trustworthy may point these out, but only in retrospect. If you do not trust a guide's point of view, our strongest recommendation is to turn off the modules associated with that guide. You can see some of the module names and a high-level description here: http://proselint.com/checks/.

Finally, I will modify a quote from the Foreword[^fn2] by Robert Bringhurst in The Elements of Typographic Style (version 3.2, 2004):

> [Language usage] thrives as a shared concern — and there are no paths at all where there are no shared desires and directions. A [language user] determined to forge new routes must move, like other solitary travelers, through uninhabited country and against the grain of the land, crossing common thoroughfares in the silence before dawn. The subject [of proselint] is not [stylistic] solitude, but the old, well-traveled roads at the core of the tradition: paths that each of us is free to follow or not, and to enter and leave when we choose — if only we know the paths are there and have a sense of where they lead. That freedom is denied us if the tradition is concealed or left for dead. Originality is everywhere, but much originality is blocked if the way back to earlier discoveries is cut or overgrown.

[^fn2]: Only because we are on the topic of historical traditions and stylistic guides, it should be mentioned that a foreword – according to book design tradition – would be written by an individual other than the author about the author, the book, and usually the relation between them. In this case, the section in Bringhurst's masterpiece labeled "Foreword" would likely be better described as "Preface" or "Introduction". Given his knowledge of book design, I shall assume that this was a conscious departure from the road of tradition, even if I cannot appreciate the new view that it offers.

This sounds promising, but I think a lot of potential users would be deterred by the lack of examples.

This positively screams for an online interface to test-drive.

Are you claiming you can paste in your own copy, and it will run against it? I see no text area in Chrome or Safari, what am I missing?

You're missing a stupid CSS trick; apparently everything should be flat flat flat nowadays, even if that means throwing UX out of the window. The sample text is editable, even though it looks as if it's not.

Thanks. I feel dumb, but I also think it's dumb design. I should've known, and did figure it out eventually, but that won't pass the grandma test. Yes, it's a CLI tool, but so what? If a wannabe developer can't figure it out, the CLI tool may well diverge just as much from other CLI tools' conventions.

Doesn't work at all in Firefox... it's just an editable page with no underlines or annotations. I was really confused until I tried with Chrome.

The entire page is editable for me in Chrome and Safari

Probably a stupid nitpick, but this bothers me:

> detecting grammatical errors is AI-complete, requiring human-level intelligence to get things right.

(emphasis mine)

First, there's a problem of usage. When in CS we say that a problem is class-complete (like NP-complete), we mean that the problem belongs to the class (which in this case is true, because human-level intelligence can check grammar), but also that it is class-hard, which informally means "at least as hard as the hardest problems in the class", and more formally means that any other problem in the class can be cheaply reduced to it, so that solving the problem amounts to solving every other problem in the class. Not only is checking grammar not known to be "AI-complete", then; we don't even know that human-level intelligence is necessary to solve it.

But the reason this bothers me, even though I fully understand the statement was made informally, is a little deeper: we don't even know what "human-level intelligence" (or intelligence in general) is, let alone what AI means. That people refer to AI as if it's a thing rather than a very vague notion clouds how people think about AI research, as well as about intelligence. I would have simply said "we don't know of good algorithms to dependably check grammar, and this appears to be a very hard problem that may require intelligence".

If you're on Ubuntu, you want to run 'pip3 install proselint' rather than 'pip install proselint'.

I ran it on a couple of 800-word emails and it didn't catch anything except my using 2 spaces instead of 1 in one place. I also ran it on my city's sidewalk maintenance ordinance, and it didn't report anything.

One of the goals of proselint is to minimize the number of false positives that traditionally clutter the results of style checkers, resulting in users ignoring the changes when they see them. We want to be reasonably certain before raising an alarm. You can read more about the precise metric[^fn1] we use here: http://proselint.com/lintscore/.

And yes, `python3` for the win. :)

[^fn1]: If you wanted to be truly precise, it's a parametric family of metrics.

Does anyone know of a similar tool for scientific papers? Specifically, to help non-native English speakers write high-quality scientific papers?

While the idea is interesting, I do worry about the proliferation of linting into prose, especially the hint about authoritativeness near the end of the article. In programming, linters turn guidelines into hard rules, removing any room for judgement if you want your PR merged. I personally want less of that, not more.

How is standardization a bad thing in programming? In prose I can see the argument, but in programming you should always aim for standardization for the sake of code maintenance.

For example, the Python best practices document recommends 1 blank line after functions and 2 after classes. Linters enforce this. However, this can be a detriment to readability in some cases, such as closures or classes that have no body, only superclasses.

Some might say you can mark lines as exempt from linting, but that then makes the change vulnerable to bikeshedding. For some people, being able to prevent the conversation from happening because the linter is authoritative might be good; personally, I prefer to follow the guidelines while staying aware that they exist to aid future coders' understanding, not to enforce a standard for its own sake.

Ah, another part of my brain I can offload to an external source. It will be interesting when we get to "social-lint", so those of us who are no good at social interactions (through lack of ability, or lack of willingness to spend the effort to compensate), or who feel they spend far too much brainpower on social interactions to make up for a lack of natural ability, can benefit.

Can someone explain in layman's terms how this is any better than an app like the Hemingway Editor [0]? Both analyze the text and make suggestions to improve it.

[0]- http://www.hemingwayapp.com/

See our discussion of this at http://proselint.com/approach/. I'll note that we do not consider Proselint a complete product — it's in its earliest stages, perhaps at 2% of its final capacity. That number has steadily decreased as we learn more, which we take to be a good sign.

Hemingway is an editor, while proselint is a tool; the latter can be integrated into any editor. That's the main reason I ditched Hemingway (the editor): I couldn't just copy/paste text into it to get some suggestions.

In what way were you not able to copy/paste into Hemingway to get suggestions?

I was; it was just tedious.

I question how useful a tool like this is for a skilled writer.

Prose isn't code.

Many key elements of good writing are based around the idea of knowing the rules, and then carefully breaking them.

A linter doesn't prevent breaking its rules, it just notifies the writer of which rules are being broken.

I was writing some C earlier and my linter warned me about "incrementing a void pointer". However, I understood the context better than my linter, knew that I'd be compiling with gcc (which allows void pointer arithmetic), so I ignored the warning and carried on. My code compiled and ran nicely.

When it comes to static analysis, I think (creative) writers, like programmers, wouldn't care about warnings. This is already true of spell-checkers (e.g. my letter-writing character is English, but my text-editor's yelling about "colour").

Sounds like your system is in US English rather than the variant of English you are used to?


Sorry, I guess that example was too terse.

I was referring to a hypothetical American creative writer, writing a scene in which a British character writes a letter. In this hypothetical work, written in US English, there would then be a section of text that used UK English spellings. The naive spell-checker would not understand the context, and would flag these as misspellings.

This was meant to be analogous to my "incrementing a void pointer" example; the static analysis tool produces warnings which the author knows to ignore. In the C programming case, my function was passed the size of the objects comprising the array pointed to by the void pointer, so the linter was wrong to tell me I was making a mistake. Similarly, the spell-checker was wrong to say "change this instance of 'colour' to 'color'".

Similar considerations apply to prose linters.

Polonius would be a lesser character if shed of cliches, and a good writer would know to ignore the linter's opinions on the matter.

> I question how useful a tool like this is for a skilled writer.

For a skilled writer who takes time to write "proper" prose, probably not very useful.

But for me, as a non-native English speaker who writes a lot of short English texts (emails, documentation, HN comments and so on), it could probably help.

For example, since I write both US and British English every day, a consistency warning is certainly helpful. I would also like a linter to help the flow of text, for example by pointing out when you aren't mixing up your sentence lengths in a good way. Oh, in that last sentence I accidentally missed that I used first person in the sentence before that! A linter as a chrome plugin would have pointed that out.

That really depends on the kind of writing. For things like journalism and technical writing there are rules that need to be followed and you're not allowed to color outside the lines very much. The really, really good writers learn to be creative within these more restrictive styles of writing. It's no coincidence that many great creative writers had copywriting jobs earlier in their careers.

I can imagine a tool like this making it much easier for journalists to follow a newspaper's style guide or something similar.

You're right there.

My impression of this might have been different if the list of rules included CMOS instead of something that tells me not to use the term "jump the gun" because it's a cliche.

I'd love to have something like this for day to day work emails. They aren't beautiful prose, and they shouldn't be.

That said, I think there's a better way to approach this. Rather than linting based on a list of rules, I'd prefer a more technical approach that highlighted actual issues, such as garden path sentences, ambiguous pronouns, doubled words, etc.

Can someone who has tried this share their experience?

It sounds really awesome but it's very hard to tell if it's going to be more annoying or more useful. Maybe it would be useful to have some example linting errors on the homepage.

Either way, I really love the idea!

Hmm, I tried it out. Doesn't seem too useful yet and there is some polishing to be done so hopefully this continues to go through further development!

One needed improvement: display the offending line on errors. Then you don't have to toggle between file and console to contextualize the errors.

I ran some of my recent emails through it. It picked up my overuse of exclamation marks and my use of "all of the time" instead of "all the time." It definitely doesn't seem too sensitive - I would lint all of my emails with it if it were easy to do so.

Is it already in Atom or Sublime Text?

EDIT: I must be blind - they mention a Sublime Text plugin (although they don't link to it). https://packagecontrol.io/packages/SublimeLinter-contrib-pro...

"There’s a plugin for Sublime Text." Didn't see anything about Atom though.

Here's a suggestion...

Have copy on web site be intentionally incorrect, red-underlined with (small modals? tooltips?) that show what's been corrected/suggested by the tool.

Like http://proselint.com/write/ ? ... which is also editable

Gitbook has open sourced their proofreader at https://github.com/GitbookIO/rousseau

Looks really interesting. I'd done some preliminary investigation into whether this kind of concept might work for the style guide at my company, but I never got time to take it further.

Is there any word on business model / the intentions of the developers? Is it something that's being open sourced and then integration assistance would be commercialised?

This is very cool and needed, thank you.

Could you include a sample .proselintrc? rc files tend to have very different opinions on how to be formatted: dictionaries, JSON, bash-argument syntax, and so on. (EDIT: Ah, found one: https://github.com/amperser/proselint/blob/cd428bb0ecc5530c1.... Can’t quite get it to ignore butterick, though.)
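
For anyone else hunting for the format: the `.proselintrc` appears to be plain JSON with a "checks" map for toggling modules. A sketch of what disabling a guide's checks might look like (the key names here are guesses on my part; see http://proselint.com/checks/ for the real module names):

```json
{
  "max_errors": 1000,
  "checks": {
    "butterick.symbols": false,
    "typography.symbols.ellipsis": false
  }
}
```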

I find it a little curious that you use a Markdown example and lint for curly quotes and unicode ellipses by default (butterick), since Markdown discourages such pre-formatting in its syntax, but that’s just hairsplitting, of which I can tell by your swelling Issues count that you have plenty of as it is. :)

Looking forward to some formatting/syntax highlighting in the CLI output, but I know you have your hands full as it is.

Tried it with "I'm better then you" and it didn't complain.

Nice idea, but you need to catch homophone errors.

Are there any plans to support rules for texts written in other languages (e.g., German)? Would a set of such rules fit within the scope of this project or is proselint purposely or inherently limited to English prose? (@suchow)

It's out of scope for now, but only because we don't have any native speakers of other languages helping us out with the project, and this stuff is hard enough to get write in your native tongue; otherwise it's on the table. Interested?

I'd certainly contribute a few rules for German prose. Actually, I'm even more interested in using proselint with custom rules for theater plays (e.g., checks for unnecessary repetitions, or word combinations that are (acoustically) hard to understand).

As czechdeveloper has pointed out in this thread, it would also be nice to have a set of rules specifically for academic writing and/or for non-native speakers (e.g., Asian scientists seem prone to overuse "the").

I guess a first step would be to have an extensible set of tags for the rules - both language-specifying ones (i.e., any_language, american_english, british_english, german, ...) and genre-specifying ones (any_genre, prose, poetry, academic, technical, ...). Furthermore, an easy way to select a subset of rules by tag (e.g., british_english and academic) would be necessary.

Would that fit within your goals for proselint?

> this stuff is hard enough to get write in your native tongue

Was that deliberate?

"get write" is an error proselint could easily catch.

The main problem with a tool like this is that it needs to understand sentence structure in order to find a lot of common anti-patterns. Without some natural language processing, it's just going to be able to scan for word usage and simple things that you can catch with a regex. You could probably build something a lot more sophisticated on top of something like Apple's NSLinguistic​Tagger and related APIs.

After testing this against a dozen of my blog posts, I'm not terribly impressed with the output. I get more immediate value out of MarkedApp's keyword drawer and word repetition visualization.

You're right, but the problem is much worse than that. Examining 200 entries from Garner's Modern American Usage at random reveals that half of them are easy to implement, the kind of thing that could be assigned as a homework problem (e.g., recognizing that “$10 USD” is redundant, that “very unique” is comparing an uncomparable adjective, or that people from Michigan are called “Michiganders”, not “Michiganites”). Thirty percent are moderately challenging, requiring a week’s effort. Fifteen percent are hard — they are entire projects, requiring advances in AI. And the remaining advice (around five percent), the best kind, is AI-complete. Consider, e.g., "John hit Peter only in the nose". Does this mean that, of all Peter's body parts that could have been hit, John hit only Peter's nose? Or is it a grammatical error that was supposed to convey that, of all the people John could have hit, it was only Peter whom he hit?
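
As an illustration of the "easy" category, a check like the uncomparables one can be little more than a regex scan. A minimal sketch, assuming abbreviated word lists and a made-up function name (this is not proselint's actual implementation):

```python
import re

# Illustrative word lists; real style guides enumerate many more entries.
INTENSIFIERS = r"(?:very|extremely|quite|more|most|somewhat|greatly)"
UNCOMPARABLES = r"(?:unique|infinite|impossible|inevitable)"

# Match an intensifier immediately modifying an uncomparable adjective.
PATTERN = re.compile(rf"\b{INTENSIFIERS}\s+{UNCOMPARABLES}\b", re.IGNORECASE)

def find_uncomparables(text):
    """Return (start, end, matched_text) for each suspect phrase."""
    return [(m.start(), m.end(), m.group(0)) for m in PATTERN.finditer(text)]
```

Of course, this is exactly the kind of blunt matching that accepts "kind of unique" while rejecting "greatly unique", as discussed upthread; the hard 15 percent starts where context matters.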

We're interested in incorporating deeper NLP. In particular, we've been eyeing https://github.com/spacy-io/spaCy.

Furthering the complexity of this topic...

While "$10 USD" may be redundant in a newspaper published in the USA, it's immensely useful, and arguably preferable, when writing blog posts, emails and other text destined for the "World Wide" Web. While USD is commonly used as, and many are comfortable with, a "common denominator" when pricing something on the Internet, it's still very important to be clear about which dollars you mean in this context.

If you are going to specify a currency, write USD 10 (though spoken, it's 10 USD).

If the context is explicitly local (such as a local newspaper, menu), then $10 is sufficient in the United States.

I used to do "10 USD" or "USD 10" until I got sick of hearing responses like

"USD 10 looks weird, why did you do that?" or "that on the pricing page looks funny, can you fix it up a bit?"

It seems $ (or the equivalent symbol for other currencies) has a place in many people's minds as signalling that the number next to it is currency, and they find it weird when currency amounts are 'written correctly' without that symbol.

> recognizing that “$10 USD” is redundant

People in Australia might disagree. As might people in Bermuda, Colombia, Canada, Hong Kong, Argentina, ...


Either "$10" or "10 [USD|AUD|etc]" is correct. It is unequivocally incorrect to use both symbols. Use the first when it's clear in context what kind of dollar is being referred to; otherwise use the second.

> It is unequivocally incorrect

That's going to need a citation. To be sure there's plenty of style guides which say "don't do that, do [this other variant instead]" but where's the standard that makes this unequivocal?

Also really cool that a library like NSLinguistic​Tagger is included in OSX.

Interesting that there's a NBSP in NSLinguistic​Tagger

Will this be used by automated content creators? For example, lots of articles on some news websites (including Wikipedia) are written by bots. So a bot would write an article, invoke proselint, and correct if required?

Related: artbollocks-mode https://github.com/sachac/artbollocks-mode

I was skeptical that it would only detect obvious issues, but the sheer number of built-in checks is surprising. I'll try this on the next large text I write.

I've been interested in linters and style checkers for English prose for a while, and I'm excited to try this out!

To the author(s): Your website, as far as I could tell, doesn't tell me how to install it; I had to go to GitHub to realize it was pip-installable. You should consider adding that to the main page.

The authors probably aren't reading HN, best submit a PR.

We are. Even so, opening issues on GitHub and submitting PRs is appreciated.

Nice idea.

Bug report — it told me I had too many exclamation marks in a Markdown file with a number of images in it.

Sounds like a feature request ("recognize and support markdown"). Open an issue at https://github.com/amperser/proselint/issues/new

Going through the example, it comes up with:

> Get that off of me before I catch on fire!
>
> Needless variant. 'catch fire' is the preferred form.

I don't think I've ever heard anyone say "catch fire" rather than "catch on fire".

From the UK if that changes anything.

"To catch fire" is a relatively common term, at least in the USA. "To catch on fire" probably equally so.

Ha ha, slightly related fun snippet I wrote:


Very interesting, and I'm looking into integrating it into http://WritingOutliner.com (or as a separate Word add-in) :)

Thank you for working on this project and sharing it.

One of the more challenging sections in the GMAT entails sentence correction. A proselint-enabled GMAT prep for sentence correction would be very valuable.

What kinds of NLP technique does this system use?

Is it possible to specify new rules in a high-level way?

Can it learn from examples?

Does it work on a sentence-by-sentence basis only, or does it "grasp" complete paragraphs?

Rules are defined in Python scripts, which can have arbitrary complexity. However, it seems like most rules are just string or regex matching:


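For the curious, here is a rough sketch of the shape such a rule module takes: a helper that scans for patterns and returns error tuples, plus a check built on it. This is illustrative only; proselint's actual helper names, signatures, and error format may differ:

```python
import re

def existence_check(text, patterns, check_name, msg):
    """Hypothetical proselint-style helper: report an error tuple
    (start, end, check_name, message) for each pattern found in text."""
    errors = []
    for pattern in patterns:
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            errors.append((m.start(), m.end(), check_name, msg.format(m.group(0))))
    return errors

def check_cliches(text):
    """Example rule: flag a few well-worn cliches."""
    cliches = [r"\bjump the gun\b", r"\bat the end of the day\b"]
    return existence_check(text, cliches, "cliches.example",
                           "Cliche: '{}' is overused.")
```

Most of the intelligence lives in the curated pattern lists, which is consistent with the maintainers' description elsewhere in this thread of the NLP being done by the humans interpreting the style guides.
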

> What kinds of NLP technique does this system use?

It depends on your interpretation of NLP. In a sense, all of the rules are hard coded, and so it does string token processing that happens to be informed by contributed interpretations of style guides' rules for usage. Thus, most of the NLP has been performed by the human programmers interpreting those rules.

Though we are interested in extensions in the direction of robust machine NLP approaches able to meet the other goals of proselint, that presents many challenges (including some I mention in response to your third question). Nonetheless, this is an active area of research.

> Is it possible to specify new rules in a high-level way?

In short, no, but it is an area of active research on our part to develop a rule-templating engine for exactly this purpose. "High-level" is subjective though, so there may always be someone who intends to ask about a level higher than the interface that we provide at the time that this question is asked.

> Can it learn from examples?

In a sense, yes, all of the rules have been learned by people from the example text in guides and translated to linting rules. But I do not think that was your intended question.

If instead you mean: you would provide it a set of examples of your writing and it would induce a rule, no it does not do that currently, and may not for quite some time.

Stylistic rule induction is a difficult – though interesting – problem (as is rule induction more generally). It is not something we are intrinsically opposed to, but the simplest version of learning from examples would violate two core principles of the design of proselint.

First, our rules are taken from and organised around the advice provided by respected authors in their writing on linguistic style.

Second, any inductive method will be intrinsically uncertain about the rules that it induces. This uncertainty will always be opposed to our aim of having a low false alarm rate, making inductive methods possible but subject to extensive tuning and testing. This suggests that further development of a test set outside of the examples provided would be needed, to ensure coverage of any of the rules that the examples would suggest inducing.

Additionally, almost all state-of-the-art machine learning systems would require a set of relevant labeled examples of usage errors and non-errors that would somehow generalise to the examples that you would like to provide it. Even specifying the data format would be difficult; if you have any insights as to how this could be done, please share them below; it can only help and aid progress in this direction.

> Does it work on a sentence-by-sentence basis only, or does it "grasp" complete paragraphs?

I think the easiest way for you to answer this question is for you to see it in action at this website: http://proselint.com/write/

I should mention that longer-range dependencies require greater computational power, which brushes up against another aim of proselint: to be fast enough to run on reasonably large files as a real-time linter. This may not always be the case in all instantiations of proselint, but for now it is.

If you have paragraph level rules that you might want to suggest (like the issue I just created when writing this response: https://github.com/amperser/proselint/issues/310), please do! It is even more helpful if you can find an authoritative reference to include as part of your issue, because that will be needed to incorporate the rule into proselint.

It would be interesting to run this against campaign speeches as an unbiased way of judging the quality of prose. Surely content is more important, but still, it would be fun.

It's a Python module? I'm looking forward to making a Pelican plugin so my mate can start checking his blog for glaring errors before he posts! :)

I'm curious: is this just a grammar checker? Or does it do spell checking too, like aspell?

Most important question - How many linguists are on the team developing this?

Can I use this with LaTeX?

Just tried it; you can. It seems to strip markup characters, so it should work well with most markup languages.

FYI, it seems to work perfectly fine in Safari on Mac OS X desktop.

What is wrong with "very smart"? (line 86)

"avoid using the word ‘very’ because it’s lazy. A man is not very tired, he is exhausted. Don’t use very sad, use morose. Language was invented for one reason, boys - to woo women - and, in that endeavor, laziness will not do." - Dead Poets Society

I worked w/ a guy who was good at editing my manuscripts. His opinion (which I agree with) was that the word "very" was almost always superfluous. You can delete it without affecting your message.

Took me a while to see that "very" comes from veritas and doesn't mean much. At first I wrongly thought I knew what the word meant. Now I do know, verily.

Microsoft Word had something like this round about 1999

Yeah, there's the squiggly line; same thing, right?

Similarly, where Tesla Model S is concerned: Ford Motor Company had something like this round about 1908. (Where "something like this" is "has four wheels and no horses")
