Hacker News new | past | comments | ask | show | jobs | submit login
Arthur Whitney's One-page Interpreter (1992) (jsoftware.com)
59 points by chrisdew on Oct 30, 2014 | hide | past | favorite | 50 comments



I am at a vacation and bored, so I decided to try to actually understand this. I do not know much APL/K/J, but for a start, here is a version with more traditional line breaking:

https://gist.github.com/anonymous/f72e5c4a432492abce59

Some basic observations:

- He uses all the obscure C features, like the ability to not declare the type of an argument or return value and let the compiler fill in, and also the old school style of declaring functions:

  foo(x,y) int x, y { return 0; }
Instead of:

  int foo(int x, int y) { return 0; }
Since the compiler will infer the int, and since he uses R for return this example eventually becomes:

  foo(x,y){R(0);}
- Variables, or really registers, are only accessible as letters from a to z, and the "st" array stores the values of all registers. Numbers entered in the REPL have to be between 0 and 9, and hence he avoids the dirty work of making a proper lexer. It's also super easy to trigger a segfault since any error handling is non-existent.

- DO(n,x) is a C macro that evaluates the given expression "x" for all numbers between 0 and "n"

- V1 is a C macro that defines unary operators for the interpreted language, and V2 defines binary operators. In V1 definitions the operand is called "w", in V2 the operands are called "a" and "w".

- For example ",", which calls the cat function, is a binary operator that creates vectors:

  1,2,3,4
  4 
  1 2 3 4
- The vt, vd and vm arrays map ascii symbols to the functions defined with V1 and V2. { is the second symbol in vt, so when used as a unary operator it calls "size" (second non-null element of vm):

  {5,6,7,8
  
  4
and when used as a binary operator it calls "from" (second non-null element of vd).

- wd is a parser that goes from the original input string to a weird intermediate form that is an array of longs. Each input character gets mapped one-to-one to an item in this intermediate form.

  If the input character was a number between 0 and 9:
    Value type instance gets allocated
    Intermediate form for this input character consists of the address of the allocated instance

  If the character is a letter between "a" and "z":
    Intermediate form consists simply of this character

  If the character represents an operator
    Intermediate form consists of the index of the operator in the vt array
In other words, the intermediate form is an array where some elements are ascii characters, others are memory addresses and yet other indices into some array. This part is really something.

- The ex function executes the intermediate form. Since everything in the input is fixed length, and there is no syntax checking, it just indexes into the intermediate form assuming everything is well formed, while the parser did not check that so it's not really guaranteed - again a source of easy segfaults. The execution goes from left to right and consists of looking at the first position in the intermediate form and then making recurrent calls if necessary (let X be the current item in the intermediate form):

  If X is a character
    Lookahead one item
    If it is a '=' char
      Assign the result of executing everything after the '=' to the register indicated by X
    Assign to X the value of the register named by X
  If X is not a character and is a small integer
    We are applying a binary operator
    X is the index into the "vm" array
    Fetch the function from "vm", apply it to the result of executing the rest of the intermediate form
  Otherwise:
    If there is any more input remaining other than the current item, we are applying a binary operator
    Lookup the function in "vt", apply it to the result of executing the intermediate form to the left and to the right of the operator
- I have the biggest problem with understanding that "a" struct, that represents all values in the interpreted language, which are arrays. ga is clearly the basic allocation function for it, "plus" obviously adds two arrays, so it's clear the "p" field holds the actual contents, but that's where things get very shady.


"He uses all the obscure C features, like the ability to not declare the type of an argument or return value and let the compiler fill in"

This isn't an obscure feature, the default C89 type is "int". Leaving out "int" has long been considered bad form, and since C99 omitting return type has not been allowed (but generally supported in nonstrict modes).


Didn't look at this piece, but more recent J interpreter is here - https://github.com/openj/core . I think the main idea for "a" struct (it is here - https://github.com/openj/core/blob/master/jt.h) is that it has "rank" - the single integer which is the number of dimensions, then array of dimensions - "shape" of length equal to "rank", then "data" part which has the size equal to multiplication of all elements in "shape".

Edit: the structure is at https://github.com/openj/core/blob/master/jtype.h .


Thank you, I was wondering about the purpose of the tr function, which calculates the product of all elements in the vector, now it's clear that it goes from the array of lengths in each dimension to the total "flat" number of elements.

So the "r" field in the "a" struct is the number of dimensions of the array that "a" holds, and the "d" field is the array that holds the length of the array "a" in each dimension.


Uuh. This is ancient K&R C. Here's a version that is somewhat closer to todays C. https://gist.github.com/jpfr/560c861cd3eb76700a54

Can we replace some of the one-letter codes with better namings?


It still doesn't compile, though, does it? When I try, the expression pr(w->p[i]) doesn't work because the argument is a long and pr is declared to take an A *.


Sure it compiles.


This is a perfect illustration why normal people (like me, unlike Arthur) have problems working with his apparently brilliant creations, such as the k language. The amount of information required to efficiently read (let alone write) such texts just does not fit into normal people's working memory. One-letter names and the absence of comments don't help either.

(WRT weird names: IIRC, some time ago kdb featured a number of functions named with digits, like '2' being the function to connect to a socket or something.)


As a guy with a terrible memory, I think of it the opposite way. For Q (and more so K), the actual number of distinct items to remember is very small, as basically anything complex is created by composition of the simple functions/variables. This way you don't need to remember nearly as many libraries/variables/function names, and also the function itself becomes its own description.

That being said, for more production-heavy applications and interfacing with other languages, variable/function naming could be more descriptive :)


Agreed. The small(ish) vocabulary of operators makes K easier for me to use.

Would you rather write/read:

DIVIDE X BY 5 GIVING Y

or

y=x/5;

The first is COBOL (designed to make code easier for "normal" people to read. The second is C (which looks more like the math that we learn in grade school).

k/APL/J simple moves further in the direction of math.

As a far as descriptive names go, in k, names can always be aliased with longer names (assuming no conflicts). I do this when exposing a library k code as a web interface.


I want to offer a bounty for a version that compiles with a contemporary C compiler. There's some magic going on with longs being treated as pointers here.


That's a rather normal situation - when your long is 32 bits long and the pointer is also 32 bits long, to satisfy a strict C compiler you just need a cast - which will be a no-op in machine code. This approach was used quite a bit...


Still segfaults with gcc -m32. There's some seriously nonportable, noncompliant type abuse going on in there.


Try 3+5. Worked for me. I think it's very rudimentary and doesn't handle very much.

I agree, I would very much like a ansified very that doesn't crash on unsupported features, but still in this style and in a single file.


3+5?


My point was that very few things work and that was an example. Multi digit numbers will fail. Here's a few examples that doesn't segfault:

3+7

a=3+7

a,a,a

b=a,a,a

{b

b+9

p=2,3,5,7

2{p

{p

~8

<~8

#~8

4#3

c=b#2

c,c


This transcript of Roger Hui's talk casts more light on the general idea: http://archive.vector.org.uk/trad/v094/hui094_85.pdf

It's interesting that what he calls "Parsing" is more typically known as term reduction by pattern matching in the functional world.


He "cheats" by putting multiple statements on one line:

noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);z->p=c-'0';R z;}

would be

  noun(c){
   A z;
   if(c<'0'||c>'9')
     R 0;
   z=ga(0,0,0);
   *z->p=c-'0';
   R z;
  }
In this forms, it's still cryptic, but not quiet as inscrutable.


Except this isn't him "cheating", this is his preferred coding style, for readability!


To avoid scrolling, which he says he hates.


What you refer to as "cheating" is simply writing C like k/APL/J (i.e. functional versus imperative). Language implementations have this tendency. I've seen the same effect with Lisp implementations.

N.B. - k/APL/J functions are often written on one line because that are often very short.


> What you refer to as "cheating" is simply writing C like k/APL/J (i.e. functional versus imperative).

Mashing things up without line breaks isn't a functional vs. imperative thing [1]. Its a density vs. negative space thing. There's a quite a bit of research in many contexts indicating that effective use if negative space generally improves readability of most things for most readers, so while I know Whitney and his disciples like to say that J is "more readable" because of its density, I don't think that's generally the case for most readers.

[1] idiomatic Haskell is definitely functional, but does not eschew line breaks; mashing C programs into cryptic single-line monstrosities doesn't make it any less imperative, it just makes it unreadable.


Heck, short awk programs are often written as one-liners that are invoked from the command line. I sometimes do this, but use a program file ("-f hithere.awk") for anything beyond semi-trivial.

And I always use "--lint" with "-f".


"noun" in C: noun(c){A z;if(c<'0'||c>'9')R 0;z=ga(0,0,0);z->p=c-'0';R z;}

"noun" in k: noun:{[c]$[(c<"0")|c>"9";0;c-"0"]} / $[boolean; true; false]

The C version is effectively the k version expanded with declarations, memory allocation and initialization. I would not be surprised if the interpreter was written in APL and transliterated to C.


I'm not saying the C isn't in functional style.

I'm saying that it being in functional style has nothing to do with the absence of whitespace/newlines. Both the use of the functional style and the absence of whitespace no doubt reflect Whitney's language-independent programming preferences, but they are distinctly different things.


Given his obvious facility with both languages I think you are right but that both those steps occurred simultaneously in real time.

Arthur had learned how to think a problem in APL and was a wizard at transliteration.


It's often considered much more understandable to define the distance between two points as a short mathematical expression - with sigma, x_n - y_n, second powers, square root - that actually explain what it is in non-mathematical terms.

Similarly, J programmers prefer to write

a =: a + b

instead of

(for i=0; i<a.size(); ++i) a[i] += b[i];

which obscures the idea quite a bit. Or saying

(# %: */) a

to calculate geometric mean of vector a and use that in next expression.


> Similarly, J programmers prefer to write

> a =: a + b

> instead of

> (for i=0; i<a.size(); ++i) a[i] += b[i];

Well, sure, most people prefer not to write code with syntax errors.

But, nitpicks aside, your post is mostly frequently-made commentary in favor of functional programming (both FP languages and functional style -- which, again, has nothing to do with avoiding line breaks or whitespace -- in other languages), but really has nothing to do with the characterization uphtread of the mass of minimally-whitespace, non-line-broken C code presented upthread as being a matter of "functional vs. imperative", which remains completely wrong. So its a complete nonsequitur to the post it "responds" to.

Not using whitespace, including line breaks, has nothing to do with functional style, and using it has nothing to do with imperative style. (If it did, your C++ example, minus the syntax error, would be just as "functional" as the J example.)


The more relevant part is comparison with math notation. No whitespaces and line breaks there.


Actually, the usual presentation of math notation in textbooks, papers, etc., while dense, does use whitespace -- and often large whitespace or line breaks for factoring out definitions used in the main equation for readability and ready comprehension. [1] Let-bindings and similar constructs in other functional languages are, in fact, largely driven by this kind of factoring-out in mathematical notation (not just in their use of whitespace, but in general.)

[1] For an example of the large whitespace for factoring-out, see the factoring out of J-sub-k on the top of p. 460 of http://titan.fsb.hr/~venovako/dist/tensor.pdf


When code is bulky, you can excuse your ego because your visibility is limited in a literal sense. But when it's dense like this you actively know, as you try and fail to skim it, that you can only usefully focus on a small segment at a time. So you blame the code instead of yourself.


I think it's fair to blame the code in this case. It's not written at all for readability (by anyone other than the author while he was writing it). Of course he knew what those 1 or 2 letter variable names meant when he came up with them, so it's a lot easier to load into memory. But I doubt Whitney could easily coming back to this code after any length of time.


He is famous or notorious for rewriting from scratch every few years. So, your impression sounds plausible.


You think Arthur doesn't look to his code back for those few years between rewritings?


    BC: Do you ever look at your own code and think, "What the hell was I doing here?"

    AW: No, I guess I don't.
Of course, maybe he was answering the first 8 words.

http://queue.acm.org/detail.cfm?id=1531242


Logically, he could be answering the entire question truthfully, except that he doesn't look at old code at all. He'a clearly somebody very focused on the current "working set", with his rewrites and one-page files. As opposed to the kind of programmer who curates a sprawling historic code base.


Well, he does have to maintain it (apparently it needs a lot of bugfixes, according to discussion on Reddit) but, unlike the notorious perfection and immutability of the syntax of APL, he does have a habit of throwing old code away.


Isn't it wonderful that many wall st firms use k/q to write some of the most complex analytics ?

Of course, it is understandable. NOT.


http://kaleidic.com/2009/composing-j-in-my-mind/

"J is conducive to thinking about computation because a lot is expressed by combining even a few components. With J it is natural to think in code, not just about code."


Interpreter of what? "J", apparently, from context and inference, but that should have been in the post title.

What's an "interpreter fragment"? Most of the Google results for that term point to the same story.

I know this is Hacker News and everyone is supposed to know everything about everything already, but on obscure topics it's nice to take a moment and explain what you're talking about.


Complete shit. C was a mature language in 1989. Even for banging out a quick hack, and forgiving the gets into a 99 byte stack array, this is crap by any standard.

I can compile and run pretty much any K&R code from the C Programming Language in 1978 with no problem. Finding the appropriate compiler, I can do the same with BCPL, the Language and its Compiler (1981). And the code is actually understandable.

This, on the other hand, is an obfuscated mess that segfaults on any input, J or otherwise. I'll be damned if I spend 5 minutes debugging it, ex() segfaults with infinite recursion. It's not worth the trouble.

Not saying this isn't worth sharing as a curiousity or an artifact, but this is not a work of genius and it is not defensible or to be emulated. Do your colleagues a favor and don't write like this.


It's different. But it is definitely not shit. The majority of use shouldn't write code like this. But those who can create amazing things.

Here's a working adaptation that is quite close to the original. https://github.com/tangentstorm/j-incunabulum/blob/master/mj...

A current version of the k language still has traces of it inside. And there are still people on Wall Street running mission critical systems with it. http://kx.com/q/c/c/k.h

The kona project attempts an open source implementation. And they give some rationale on the paricular coding style. https://github.com/kevinlawler/kona/wiki/Coding-Guidelines


> Do your colleagues a favor and don't write like this.

I'd second this - in any normal situation.

If you're really aspiring to form the future, strive to be extremely productive, demands that from your colleagues - that's another thing.

Sigh - industry as a whole isn't nearly on the level to use those advanced tools.


>Sigh - industry as a whole isn't nearly on the level to use those advanced tools.

Is this a parody?

Is J a language for writing personal code with personal meaning only that you can only run in your head?

There is nothing professional about this. Sounds more like a mental illness.


I once needed to develop an algorithm for a underspecified task. Something like "generate a real-looking map of the floor - with walls, windows, doors, corridors..." . Took me 40 minutes to try several variants in J. Then some 3 hours more to complete C# code. I don't know how long it would took if I'd experiment directly in C#. Probably not too long - but iterations would surely be more painful; too much boilerplate code.

In my experience, J is closer to thought process than many other languages. Really. You just have to try - and get used a little bit.


There's a reason that APL never gained a foothold and J has not been adopted by software engineers. You can program at the speed of thought in TECO as well, but the reasons for its limitations and peculiar history vanished long ago.

There are far better tools available these days for rapid prototyping; interpreted languages with REPLs and libraries for just about everything--without resorting to a write-only line-noise language and actually ending up with a usable product.

For that matter, I suppose it's possible to write understandable, maintainable TECO, APL and J with some effort if you don't value obscurity and cleverness over simplicity, correctness, maintainability and understandability.

If your business logic or product is written in line noise written by some lone J genius, your company is fucked when your guy inevitably gets hit by a bus, goes nuts, or leaves for greener pastures.


The same could be said for writing a document in Chinese, Arabic or Hebrew. They all look like "line noise" until you make an effort to learn them.

"For that matter, I suppose it's possible to write understandable, maintainable TECO, APL and J with some effort if you don't value obscurity and cleverness over simplicity, correctness, maintainability and understandability."

Yes. I do separate long functions in k onto separate lines. However, long functions in k are rare (and usually) an indication that I haven't arrived at a simple solution.

Writing in J/k/APL allows to think in chunks where the chunk is an algorithm - not all of the tiny steps in an algorithm. It is a higher level of abstraction - not unlike programming in C versus assembly language or machine code.

E.g.

The sum of a list (of any type) in k (APLis similar):

sum:+/x / read "plus over x" or "reduce x with plus"

versus (C code fragment) which computes the sum of integers only:

int sum=0: for(int i=0; i<n; ++i) { sum = sum + x[i]; }

Btw, John Von Neumann once chastised a grad student for writing an assembler. Perhaps real software engineers only write in machine code.

"If your business logic or product is written in line noise written by some lone J genius, your company is fucked when your guy inevitably gets hit by a bus, goes nuts, or leaves for greener pastures."

I think the larger issue is documentation. Undocumented code (of any kind) is a liability. When I write in k, Every line of k is documented (as text past column 80). In addition, there is header documentation that described the purpose and approach of the code in the file/module. The number of lines of code in a file is typically no more than 10 lines.


> There's a reason that APL never gained a foothold

APL was widespread in operations research in the 80s, and since the early 90s is usually found in and around finance, even though it isn't really common anymore.

Part of it is a mindset thing; People used to consider longer employment periods, both from the company's perspective and from the employee's perspective. Although APL is extremely useful when you use it properly, it is definitely not a "programmer-is-a-replaceable-cog" language that Java strives to be and the most firms now assume.


Perhaps "Notation as a Tool of Thought" - http://www.jsoftware.com/papers/tot.htm - could help you, if you want to understand why people keep writing APL and its successors? Like Mathematica, for example?..

Do you think mathematical articles are understandable and maintainable?

What if there is not a lone J genius, but a whole - small - department of people, who spend time actually thinking and talking about computations, not the ways to express it?


I've been screwing around with it in my spare time. J is quite logical and understandable. Anyone can understand it, but they have to make an effort to do so, just like they would if they were learning some other "weird" language like Lisp or Forth.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: