Linguistics and Programming Languages (250bpm.com)
77 points by rumcajz 9 months ago | 50 comments

"We have again the popularity of "Wouldn't it be nice if our machines were smart enough to allow programming in natural language?". Well, natural languages are most suitable for their original purposes, viz. to be ambiguous in, to tell jokes in and to make love in, but most unsuitable for any form of even mildly sophisticated precision. And if you don't believe that, either try to read a modern legal document and you will immediately see how the need for precision has created a most unnatural language, called "legalese", or try to read one of Euclid's original verbal proofs (preferably in Greek). That should cure you, and should make you realize that formalisms have not been introduced to make things difficult, but to make things possible. And if, after that, you still believe that we express ourselves most easily in our native tongues, you will be sentenced to the reading of five student essays."

E. W. Dijkstra, EWD952 [1]

[1] https://www.cs.utexas.edu/users/EWD/ewd09xx/EWD952.PDF

I think Perl is interesting here.

When we're talking about computer code, we humans say things like "Read in a line of text. If it ends in a newline, remove it." But we can't program computers that way. The compiler says, in effect, "Read in a line of text from where? And put it where? If it ends in a newline? If what ends in a newline?" And so on.

But Perl actually lets you program that way. Perl says, "Read in a line of text? You didn't say from where, so I'll assume the default place, which is the files that were named as arguments in the program invocation. You didn't say where to put it, so I'll put it in the default variable. If it ends in a newline? You didn't say if what ends in a newline, so I'll assume that you're talking about the default variable, which happens to contain the line we just read in." And so on.

Effectively, you can use "it" in your conversation with Perl, and it will do reasonable things. This is one of the places that it shows that Perl was designed by a linguist.
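To make the "default place" and "default variable" idea concrete, here is a minimal Python sketch. It only imitates the behaviour described above and is not how Perl implements it; the `Topic` class and its `read_line`, `chomp`, and `it` names are hypothetical, made up for illustration.

```python
import io


class Topic:
    """Hypothetical helper that mimics Perl-style defaults (not Perl's implementation)."""

    def __init__(self, default_source):
        self.source = default_source  # where "read a line" reads from when unspecified
        self.it = None                # the default variable, playing the role of $_

    def read_line(self, source=None):
        # No source given: fall back to the default one, then store the line in `it`.
        chosen = self.source if source is None else source
        self.it = chosen.readline()
        return self.it

    def chomp(self, text=None):
        # No text given: operate on `it` and update it in place.
        target = self.it if text is None else text
        result = target[:-1] if target.endswith("\n") else target
        if text is None:
            self.it = result
        return result


topic = Topic(io.StringIO("first line\nsecond line\n"))
topic.read_line()   # "Read in a line of text."
topic.chomp()       # "If it ends in a newline, remove it."
print(topic.it)
```

Neither call names a source or a target, yet both do the reasonable thing because the object carries the context.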

By the way, an important problem in natural language processing, which is related to this at a high level, is anaphora resolution. When we use pronouns to refer to people and things we've mentioned before, there are often ambiguities about what the referent of a particular pronoun is, but native speakers almost never have to consciously think about this question. But in practice, resolving these references correctly may require sophisticated reasoning about our knowledge of the world in order to determine which interpretation is plausible.

An example adapted from the Winograd Schema Challenge is:

The file couldn't fit on the hard drive because it was too big.

The file couldn't fit on the hard drive because it was too small.

In the first sentence, "it" refers to the file; in the second sentence, "it" refers to the hard drive. Native speakers who know what files and hard drives are (or even those who don't) should have no trouble understanding the references and might not even have noticed that there was any ambiguity (!), even though resolving the ambiguity requires bringing to bear specific knowledge about the world.

There is a whole family of AI language understanding problems based around this.

Looking over some examples shows just how challenging this can be, because of the way the sentences can require people to know arbitrary things about the world. ("The atom emitted a photon because it was entering a lower energy state.")

> even though resolving the ambiguity requires bringing to bear specific knowledge about the world

Actually, in this example knowledge of the world (I'm interpreting this as referring to knowing the details of how "files" and "hard drives" relate to each other) is not necessary. The ambiguity disappears as soon as you know the meaning of "fit" -- fitting requires a large thing to contain a small thing. Therefore if something is too large, it must be the contained object, and if something is too small, it must be the container, and those roles are marked directly within the syntax of the sentence.

You can easily see this experimentally by asking people about sentences with nonsense words:

1. The glirp couldn't fit on the vell because it was too small.

2. The glirp couldn't fit on the vell because it was too big.

Then again, I see you've already noted that speakers who don't know what files or hard drives are should have no trouble with these sentences. Is the lexical meaning of "fit" "knowledge about the world" to you?
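The rule above (too big means the contained thing, too small means the container) can be sketched in a few lines of Python. This assumes one fixed sentence template and nothing else; `PATTERN` and `resolve_it` are hypothetical names, and the sketch says nothing about sentences outside that template.

```python
import re

# Assumed template:
# "The X couldn't fit on the Y because it was too big/small."
PATTERN = re.compile(
    r"The (?P<contained>[\w ]+?) (?:couldn't|could not) fit (?:on|in) "
    r"the (?P<container>[\w ]+?) because it was too (?P<size>big|large|small)\."
)


def resolve_it(sentence):
    """Return the noun that "it" refers to, using only the meaning of "fit"."""
    match = PATTERN.fullmatch(sentence)
    if match is None:
        raise ValueError("sentence does not match the assumed template")
    # Fitting needs a small thing on or in a large one, so "too small"
    # must describe the container and "too big" the contained object.
    if match["size"] == "small":
        return match["container"]
    return match["contained"]


print(resolve_it("The glirp couldn't fit on the vell because it was too small."))
print(resolve_it("The glirp couldn't fit on the vell because it was too big."))
```

The function never looks up what a glirp or a vell is; the roles come from word order and the adjective alone.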

When you say "glirp couldn't fit on the vell", do you mean to say that the glirp was placed onto the vell, or that the glirp placed the vell on the glirp itself?

The melon could not fit on the hat because it was too small.

The melon could not fit on the hat because it was too big.


The first. If the glirp was trying to wear the vell, I'd say "the glirp couldn't put on the vell" or something else idiomatic, like "the glirp couldn't get the vell on". I cannot use "fit on" in the sense you're going for.

You can investigate this yourself at http://corpus.byu.edu/coca/ ; the first hundred results for "fit on" contain, by my eyeball estimate, more than 90 of the sense I describe, zero of the sense you insinuate, and a few spurious hits (such as "to spend as it sees fit on government services", "kept himself fit on a rowing machine", and my favorite, "I've worked with schools such as the Pratt Institute and FIT on developing eco-friendly vegan design programs").

There are six results, out of over 500 million words, for "fit it on", of which one matches your pattern. ("She passes it under the running tap and hikes her tank up to fit it [a strip of nylon] on around her rib cage")

That's a good point, and this might be an example of the selectional restrictions issue, or a similar kind of inference that could be considered easier than expected.

I'm not sure exactly what I would consider "knowledge about the world" in this setting. :-)

> 1. The glirp couldn't fit on the vell because it was too small.

Situation 1: vell=table, glirp=dish. "It" refers to the vell.

Situation 2: vell=bolt, glirp=nut. "It" refers to the glirp.

Replace "on" with "in", and maybe there's less ambiguity.

Yep, default variables and context are important to Perl. I like a lot of the design, but there is a lot of cruft that has piled up in 30 years. Perl6 addresses that while adding support for FP, OO, array programming, grammar based programming...etc. I find it quite pleasant to read and write. The problem is the implementation is immature and I'm not sure it will ever be as fast as we need for a production language.

The topic variable came from awk; most of the whipitupitude variables are direct ports of awk's.

I completely forgot about that, good point!

The investigation here goes in the opposite direction. It's not "wouldn't it be nice if computers understood natural language" but rather "wouldn't it be nice if people understood programming languages". The former is just silly. The latter is a call to create, in programming languages, constructs that can be parsed by the language machinery in our brains.

If the goal of programming languages were to instruct the computer, after all, we'd all be writing machine code.

Humble submission to the cause - using "the" and "it" in a scripting language to avoid having to come up with variable names all the time.

Way back, I'd designed a scheme-based scripting language called "muSE" [1] in which we created specifications for video editing styles. I also toyed with the language. One such experiment was having a way to refer to values already computed using "the" and "it". [2]

[1] https://github.com/srikumarks/muSE [2] https://github.com/srikumarks/muSE/wiki/TheAndIt

edit: This project isn't maintained, but the latest version 711 still works on macosx [3]

[3] https://code.google.com/archive/p/muvee-symbolic-expressions...

There are already constructs like this. In all programming languages, you see variable names like data or result all over the place. Additionally, when using the python interpreter, there is a special variable _ for the last expression evaluated.
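For anyone who hasn't seen the `_` behaviour: in CPython the interactive prompt hands each expression's value to the display hook, which prints it and stores it in `builtins._`. A small sketch that calls the hook by hand, so it also runs as a plain script:

```python
import builtins
import sys

# The interactive interpreter passes each expression's value to the display
# hook, which prints its repr and saves the value as builtins._.
sys.__displayhook__(6 * 7)   # roughly what happens after typing `6 * 7` at the prompt
print(builtins._ + 1)        # `_` now names that last value
```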

"it" is common in that sense, but muse additionally models "the" as a function that turns verbs into nouns - like we do. For example,

    (open-file "blah" 'for-writing)
    (write (the open-file) "hello world")
    (close it)

    (length (list "milk" "pudding"))
    (print "number of items is " (the length))
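A rough Python imitation of the same idea, for readers who don't have muSE installed. The `tracked` decorator, `the`, and `it` below are hypothetical names, and the rule that `it` means "whatever `the` last named" is an assumption read off the example above, not muSE's documented semantics.

```python
import functools

_results = {}   # most recent result of each tracked function, keyed by name
_it = [None]    # the value most recently named through the()


def tracked(func):
    """Remember each call's result so it can be referred to later."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        value = func(*args, **kwargs)
        _results[func.__name__] = value
        return value
    return wrapper


def the(func):
    """Turn a verb into a noun: the value its last call produced."""
    value = _results[func.__name__]
    _it[0] = value   # assumption: naming a value makes it the referent of it()
    return value


def it():
    """The value most recently named through the()."""
    return _it[0]


@tracked
def length(items):
    return len(items)


length(["milk", "pudding"])
print("number of items is", the(length))
```

After `the(length)`, a later `it()` returns the same value without any variable having been named.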

Either Kotlin or Xtend already has "it" as the default variable name for single-parameter lambdas.

Perl has this, in $_, the default/implicit argument.

This has already been done. It's known as Non-Aristotelian General Semantics. It's even possible to teach yourself how to speak this way. Intelligence agencies sometimes train operatives in it, because then the reports they make are more useful for intelligence analysis. http://esgs.free.fr/uk/art/sands.htm

Another related EWD is 667[1]

The paragraph that stuck out to me was: "Instead of regarding the obligation to use formal symbols as a burden, we should regard the convenience of using them as a privilege: thanks to them, school children can learn to do what in earlier days only genius could achieve."

[1] https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/E...

Specificity is always on a scale. In most cases, you don't want to tell the computer exactly how to lay out the memory (common in imperative programming). In other cases, you don't want to tell the computer exactly what sequence to perform the operations in (common in functional programming). In other cases, you allow for contradictory information to be provided to the computer (common in constraint based programming).

There's plenty of room for ambiguity or contradiction in programming, and good things happen on either end of the spectrum.

Love this, and agree. Though I would love to see a computer system which can resolve ambiguity in natural languages and formalize it for us. It would get it wrong often, but if it is good at it, no more so than a programmer gets the requirements wrong.

We may not be able to do this automatically, but you can use a kind of refinement calculus to accomplish this.

Start off with vague requirements from your client:

  We need a machine which can wash clothes at various
  temperatures and speeds. The user should have options
  to control the temperature and speed since they'll have
  different types of materials. Maybe we can give them
  some preset options?

  What are the temperature options?
  Hot, warm and cold; but it's in two cycles and each can
  have a different temperature.
  How long should it wash the clothes for?
  That depends on the material and how dirty it is.
  What if it's really dirty?
  They should be able to do an extra rinse cycle.
  How fast should it spin?
  That depends on the material and how dirty it is.

Refined (control panel only):

  Dial presenting options for:
  Buttons allowing further refinement:
    Water: Hot/Cold, Warm/Warm, Warm/Cold, Cold/Cold
    Extra Rinse: (yes or no)
    "Dirtiness": light, normal, heavy (impacts wash duration)

If you use a proper requirements management tool, you can trace every one of those final system requirements back to some (slightly editorialized) conversation or initial (ambiguous, plain language) requirement description. Of course, this will, in reality, be an iterative process as you discover ambiguities in the specification and code and go back to the client for feedback and clarification.

You could potentially have a compiler that would infer meaning & only fail compilation if the fuzzy natural language terms were ambiguous in context — just like languages like Scala currently infer types when unambiguous.
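A toy version of that "fail only when ambiguous" rule, as a sketch. The substring matching, the `resolve` function, and the `AmbiguousReference` exception are all assumptions made for illustration; a real front end would need far more than this.

```python
class AmbiguousReference(Exception):
    """Raised when a fuzzy term matches more than one name in scope."""


def resolve(term, scope):
    """Map a loose term to the single name in scope that contains it."""
    matches = [name for name in scope if term.lower() in name.lower()]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise NameError(f"nothing in scope matches {term!r}")
    raise AmbiguousReference(f"{term!r} could mean any of {sorted(matches)}")


scope = ["wash_temperature", "rinse_temperature", "spin_speed"]
print(resolve("speed", scope))        # only one candidate in this scope
try:
    resolve("temperature", scope)     # two candidates, so "compilation" fails
except AmbiguousReference as err:
    print("error:", err)
```

The same word is accepted or rejected depending on what else is in scope, which is the parallel with type inference.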

Also, feels like Inform7 should be mentioned in any discussion on NLP and programming: http://inform7.com

It's the first thing that comes to mind given the article's premise, and it already hits a lot of the article's desired points.

Interesting. This feels like a great way to prototype functional requirements. Express the full process as an interactive fiction novel, and let people play it. Get feedback, iterate on the novel, until they are satisfied, then implement.

One thing that would be interesting is if you could write a compiler front end that was interactive so that it could give you better error messages.

Note that Martin Sústrik is a programmer, not a linguist. Larry Wall (creator of perl) is a linguist and a programmer, and has written on the topic[0]

[0]: http://world.std.com/~swmcd/steven/perl/linguistics.html

Curiously I find myself disagreeing quite a bit with Larry.

> If a language is designed so that you can "learn as you go", then the expectation is that everyone is learning, and that's okay.

It's okay if we never reach understanding or agreement on what Faulkner intended by a particular sentence (we can still grasp most of the whole). For a programming language this is explicitly not okay! This goes for ambiguity as well.

> Multiple ways to say the same thing

> This one is more of an anthropological feature. People not only learn as they go, but come from different backgrounds, and will learn a different subset of the language first.

This increases cognitive load with no particular benefit.

For the former, I think Larry does not imply a post-modernist “death of the author” position: there is still an objectively correct interpretation of the code as the author intended, but it may be understood differently by people of different experience levels. For example, a map with reference to a sub can be thought of as a loop that calls the sub.

With perl, the objective truth is opcodes, which are well-understood by a small group. Everyone else bases their understanding on heuristics and analogies, and the goal is to write your code to trigger the same heuristics/etc. in the reader.

For the latter, you are declaring your opinion as fact. Every language allows redundancy and variation of expression; if it truly provided no benefit, why have we not seen a popular language that only allowed a single expressive style?

> For the latter, you are declaring your opinion as fact. Every language allows redundancy and variation of expression; if it truly provided no benefit, why have we not seen a popular language that only allowed a single expressive style?

It is indeed my opinion (as prefaced by 'I find myself disagreeing..') but the premise that a programming language benefits from resembling or mimicking features of natural languages is also an opinion.

Also, I should note that your phrasing of the question isn't quite correct ("if it truly provided no benefit"). Something that provides no benefit is unlikely to be excluded from a language (e.g. double negatives: "there ain't nothing here to see!") unless there is a clear benefit to excluding it, and in fact there are times when there is.

In particular, I would draw your attention to more limited lexicon sets (such as those used by dispatchers, rescue workers, climbers, EMT professionals, etc.). Those explicitly exclude variation of expression, since it leads to an increased risk of misinterpretation in often critical situations. My inclination is that while interpretation of code isn't time critical, reducing variance (a common coding style does this as well) reduces cognitive load and makes comprehension more efficient.

Short-order fry cooks, too. That's a different kettle of fish though.

What do you think regarding my position on heuristics/forming an idea in the mind of the reader?

Learn Lojban today: https://mw.lojban.org/papri/la_karda

Lojban is based on predicate and relational logic, and parses unambiguously for both humans and computers, meaning that we can get straight to semantics instead of faffing about with syntax.

Anytime Lojban comes up I recommend taking a look at linguist Arika Okrent's book "Into the Land of Invented Languages". She learns several conlangs (constructed languages) during the book and has several chapters devoted to the history and her learning of Lojban and some of its pitfalls (some things take a lot of words to say). There's a reason why natural languages have so much ambiguity, so you don't have to write a novel to have a basic conversation. If you go to wikitongues on YouTube you'll see a prominent community member who can only read it and can't speak it...it isn't easy.

Edit: Lojban is ambiguous with how something is said, so when a speaker says something you should never misinterpret the words, but the meaning could be misinterpreted as I understand it. I'm no expert, but would love to see something like this take off.

"""I recommend taking a look at linguist Arika Okrent's book "Into the Land of Invented Languages"""

I used to have that book... I didn't think it was very good. I wanted to learn more about the various conlangs, but instead the author mostly focused on their creators, and then especially their idiosyncrasies and quirks.

For those who want to know more about conlangs, I would recommend "The Language Construction Kit" by Mark Rosenfelder.

The Language Construction Kit is great from everything I've read. Yes her book is more on the history & pro/cons of various conlangs mixed with examples...etc. It's not supposed to teach you Klingon, Lojban, Esperanto, and several others all in one book.

The Language Construction Kit is phenomenal and played a huge part in developing my love of conlangs as well as linguistics in general.

An aside - the brilliant Star Trek episode Darmok [1] shows the importance of the shared cultural context in which communication happens. [2] is a snippet of the episode where the crew is perplexed by the "language" of the "Tamarians".

[1] https://en.wikipedia.org/wiki/Darmok [2] https://www.youtube.com/watch?v=3-wzr74d7TI

When it comes to book recommendations I would recommend Umberto Eco's The Search for the Perfect Language.

Speaking of Larry Wall (creator of Perl) above, "The Search for the Perfect Language" is his favorite book, I believe.

But it is only the syntax that is unambiguous in Lojban -- the semantics (which in the real world are the cause of most misunderstandings) are pretty much open to as much ambiguity as any other language. The Lojban word for love, "prami" covers love of one's spouse, love of one's children and love of one's pets, much as "love" does in English. These are obviously very different feelings.

> parses unambiguously for both humans and computers

I suspect this is only because there are so few speakers of Lojban. As soon as a language is in common use, it will be extended naturally by the users. I doubt any level of consistency can be enforced over time in that case. Popular slang and idioms generally make it into a language if they persist long enough. Good luck keeping unambiguous parsing in that case.

Lojban has a central planning committee, the BPFK, so there is no danger of "Proper Lojban" falling prey to this, although certainly there have been dialects of Lojban in the past which deliberately were not parseable. They don't last long; none of the Lojban tools work with their experiments, so network effects tend to keep unparseable dialects from becoming popular.

From the beginning, Lojban's design has included this feature. It would be pointless to learn as a language if you intend to extend it in that way.

There are, of course, ways to experimentally extend the language grammatically, as well as a process for changing the grammar, which has happened before via the BPFK. It evolves, but with planning and consideration.

I think the value of Lojban/Loglan in this space is having a language that a human can use to have a more natural interaction between themselves and computers. And also as a layer for translating formulaic text between natural languages.

Exactly. When you have a language no one actually realistically speaks, you can stuff whatever highfalutin ideas you want into it.

Just look at JavaScript

JavaScript cannot be ambiguous because there's a constant acid test: Does the machine find it ambiguous? If it does, it fails to parse. (If it parses, it wasn't ambiguous, but it still might mean something the programmer didn't intend.) That property will hold until we do something like embedding AI in the compiler.

Humans can deal with ambiguity. There are various reasons for humans to want to be ambiguous, and any natural language has to support that to be useful for day-to-day conversation. Loglan and Lojban might well escape that, if they're only ever used in contexts where ambiguity is not desired and will be repaired if it is found.

Think formal specifications, not love letters.

Isn't the automated insertion of semicolons to your detriment a result of ambiguity in the language, with the compiler forced to decide?

The rules for automatic semicolon insertion are defined by the language spec.

Semantics is never fully unambiguous, because of its interface with pragmatics, which Lojban doesn't deal with at all. There is no systematic way to unambiguously convey tone, register, etc.

English ≠ Computerish


English = C || English = Pascal || English = SQL .......
