* He joined Princeton’s CS department in 2000 but taught at least one class there as early as 1993 while still at Bell Labs Research (on sabbatical?)
* One of his students regularly brought a 386sx laptop (running pre-1.0 Linux) to class and when Brian was asked more obscure questions about what awk did which he couldn’t remember, the student would run commands in awk and feed Brian the definitive implementation answer. So Brian had some exposure to Linux moderately early on.
I'm not sure what you're thinking of, but there are at least 3 prominent and still-relevant awk implementations:
nawk (One True Awk / AT&T awk / BWK awk): the original awk, still maintained; it is the awk used by macOS, FreeBSD, OpenBSD, and NetBSD https://github.com/onetrueawk/awk
There were some differences even in the mid 90s between AT&T-style awk, which was then considered the original/definitive/most-backward-compatible, and gawk, which was new/shiny (although not /that/ new).
I believe both were available in Linux back then. I believe in the above anecdote both were consulted with the emphasis on AT&T-style awk for obvious reasons.
Lex seems to love prominent tech bros like Zuckerberg and Musk, and has also done a softball interview of Joe Rogan, in which Rogan was allowed to spew unsubstantiated alt-right statements. If you don't like any of those people, Lex might be a turn-off.
His attitude seems to be that he will interview anyone without pre-judgement, and try to find common ground with them or something. "Woke" people like myself view that as enabling bad actors.
I have listened to some great interviews from him, so it is a mixed bag.
1) I don't need to hear anything from Ivanka Trump ever, thank you.
2) Platforming dishonest hacks like Tucker Carlson does the nation a disservice, unless he is going to call out all the lies as they happen. While I have not listened to the Carlson interview, based on what I heard from Joe Rogan, I do not believe Fridman will significantly push back on right wing misinformation, for whatever reason.
Why are we still using "alt-right" in 2024 to describe people who do not self-describe as such, just because they espouse beliefs outside the mainstream liberal orthodoxy? 2016 was a long time ago, and Richard Spencer has faded into irrelevant obscurity, so why do people still insist upon using this ill-defined and incendiary term?
How else do you want to define the modern GOP? Their policies and societal views are markedly different from Bush-era. It’s bewildering to think of Romney and Trump being in the same party.
The only one I've listened to was him interviewing Donald Knuth which was worth listening to, tho from time to time I got the impression of an interviewer that didn't quite know enough to do the interview.
If you like podcasts of a very intelligent person interviewing very accomplished guest thinkers, Sean Carroll's Mindscape is slightly higher level: https://www.preposterousuniverse.com/podcast/ (tho I don't think Donald Knuth has been on that one, lots of other computer folks have).
i've been reading the unix programming environment from 01983 this week (the old testament of kernighan and pike), about 35 years later than i really should have. awk is really one of the stars of the book, the closest thing in it to currently popular languages like js, lua, python, perl, or tcl. (what awk calls 'associative arrays' js just calls 'objects'.)
the 39.7-kilobyte turbo pascal executable doesn't include the (modest) pascal runtime library, because it is itself written in assembly, not in pascal. awk, being an interpreter, does have to include the runtime library in the same executable as the parser. awk is also written in c rather than assembly, and moreover using lex and yacc; lex compiles the lexer into a deterministic automaton written in c for speed, and all of this is compiled with pcc, whose optimizations were relatively weak by today's standards
i haven't compiled version 7 myself so i don't have a clear idea of how big it is
a slight correction to my above comment! on running `file *` in bin/, i see, among other things
at: PDP-11 pure executable
awk: PDP-11 separate I&D executable
bas: PDP-11 pure executable
basename: PDP-11 pure executable
bc: PDP-11 pure executable
if i understand correctly, this means that this awk executable is built with what we used to call the 'small memory model' on the 8088, using separate instruction and data segments. that way, awk could have data stored at the same address as executable code, so you had 64 kibibytes available for just the data. you didn't have to squinch your awk program and all its data into the 65-46 = 19 kilobytes left over after the awk executable was loaded
this, in effect, allows you 128 kilobytes for the functionality of your program: when you call an awk builtin function (usr/src/cmd/awk/proc.c lists seven named functions: substr, sindex, sprintf, printf, print, split, and instat) you're invoking code in the instruction segment, while the logic of your interpreted code shares the data segment with the data you're manipulating (strings or arrays or whatever)
so it's at least misleading that i said it 'ran in the pdp-11's 64k address space'
not all pdp-11s supported separate instruction and data segments (i think the more expensive pdp-11/45 did, but the pdp-11/40 and pdp-11/20 didn't) and most of the 7th edition executables didn't require that. but awk, adb, dcheck, dd, fgrep, icheck, lex, ncheck, refer, restor, sort, tar, tbl, uucp, uulog, uux, and yacc did. i'm not sure how you were supposed to run a unix system without dd, icheck, tar, uucp, restor, or adb (sdb and dbx didn't exist), so i think maybe at this point they had given up on supporting the smaller models
> the closest thing [...] to currently popular languages like js
awk is more than that: having introduced the "function" keyword, for e in a, semicolon-less syntax, stringly-typing, delete x[e], etc, etc, it's quite obviously the base for JavaScript, and Brendan Eich said as much [1].
I read the same book and came to the exact same conclusion. Then I read The UNIX-Haters Handbook and thought that Unix tools (like awk) are sadly not good enough anymore.
making everything a string, for example. representing string length with null termination. splitting shell variables into words after variable substitution (rather than having a list type). a zillion shell-script bugs when filenames, directory names, search strings, etc., contain spaces. not having user-defined subroutines in sh or awk. no command-line editing. command names and options that read the same as you type them (thus forcing you to choose between unreadable commands and commands that are too unwieldy to type comfortably). no tab completion. more generally, no interactive feedback on what is going to happen when you submit your command. no job control. having to retype command output as new input all the time. table layout that only works with fixed-width fonts. filenames limited to 14 characters. no text search through previous output. hardlinks rather than reflinks. an inflexible security policy defined in the kernel. the reification of . and .. in the filesystem. exposing the layout of filesystem directories to every program that enumerated directory entries. no way to undo rm or mv. no screen editors. arbitrary line-length limits all over the place. . in the path. textual macro expansion as the primary form of higher-order programming. backslashed backquotes. backslashed backslashes. total precedence orderings on infix operators. slightly different regular expression syntaxes for each program. self-perpetuating at jobs. etc.
a lot of these have been fixed, but some are hard to fix
I think Wikipedia has it wrong actually after reading more closely. It’s not a way to solve the year 10,000 problem, as in a date formatting issue akin to Y2K, as Wikipedia suggests.
Instead it’s to get you thinking of the long term future.
The Wikipedia article does say that is the goal of the Long Now Foundation, but not how the leading 0 is related.
That's also idiotic because we don't put a leading zero in front of 1, 2 or 3 digit years, and neither will future historians put one in front of 4-digit ones - it's just gratuitously annoying.
Charlemagne was crowned in 800 AD, not in 0800 or 00800.
Pompeii was destroyed by Vesuvius in 79 AD, not 079 or 0079 or 00079.
The battle of the Teutoburg Forest was in 9 AD, not 09, 009, 0009 or 00009.
Also why stop at 5 digits while you're being ridiculous?
I'm happy to expand for others; regardless of my feelings on the matter, the Long Now use of a leading zero creates questions and engagement, thus spreading their message under the mantra of "there's no such thing as bad press" | "get the public talking".
It's an Advertising 101 strategy and not at all idiotic save perhaps at the surface of first exposure.
I've been reminded before that comparing hn commentary to reddit or other communities is itself a violation of hn guidelines. You can just upvote/downvote or "tap the sign" by posting the hn guidelines. No need to insinuate the redditization or twitterization of hn commentary.
what distracts from my technical content is vacuous, aggressive replies to it by people who ask me to take responsibility for their behavior. no, thank you
to you too i ask: which of these two kinds of discourse are you engaging in right now?
if you're capable of the other one, i suggest you demonstrate that
your comment purports to be concerned with the objective qualities of mine: 'pointless', 'distracting', 'provocative'. but actually all of these terms are attempts to recast your subjective preferences as objective realities; all you're really writing about is how you feel when you read my comments. the implicit premise here is that other people ought to be writing their comments in order to satisfy your preferences
why? what makes you so admirable? what reason does anyone in the world have to value your opinion or consider your preferences important? if you want comments written to satisfy your preferences, write them yourself
as far as i can tell, like 38, gruturo, and spencerflem, you're just a person who harasses strangers on internet fora if you think they talk funny, then blames your own behavior on them
you want to know what i feel?
because it sure isn't admiration and eagerness to satisfy you
Dude, they've got a point, your style is annoying and not even right in a "technically correct" sorta way. You don't write things that happened 5 AD as 00005 AD.
Personally, I can take the 0abcd style of years without going crazy, but it definitely derails my reading flow a bit. I would compare it to writing personal names without capitals, like donald trump.
But I also know that it is kragen commenting without looking at the header :)
Years ago, somebody had as his Usenet .sig line "perl is margarine, awk is butter" (unless the order was reversed). Whatever Perl's many faults, I used awk a good deal less after I discovered Perl.
Two things to keep in mind: Pre-internet, it was quite easy to be skilled and knowledgeable and yet never have been exposed to something another community somewhere would consider basic, or for it to take many many years to percolate across. I remember getting into programming even in the 1990s and I can tell you a lot of the C world would still consider associative data structures like a hash table something to be a bit exotic and only to be pulled out if you had no other choice. It was very easy to not be very up to date on these things.
Second, while the programming world fancies itself a very fast mover, it really isn't, and it especially wasn't back then. 20 years isn't necessarily as long as you would think in 2024. We're still trying to convince people in 2024 that you really don't need a language that will happily index outside of arrays, and how long has that been a staple in existing languages? (You may think that's silly, but we've had that fight on HN, yes, yea verily, this very year. Granted, this is winding down, but it's still a live argument.)
There's a lot of things that showed up in some language in the 1960s or 1970s and wasn't widely adopted for another few decades.
(Though I suspect one problem with Lisp is that back in the day, its performance was pretty awful relative to the other things of the day. It was not always the best advertisement for its features. A bit too ahead of its time sometimes.)
>I remember getting into programming even in the 1990s and I can tell you a lot of the C world would still consider associative data structures like a hash table something to be a bit exotic and only to be pulled out if you had no other choice. It was very easy to not be very up to date on these things.
That's a little surprising, considering that the classic first book about C, K&R, was published in 1978 and again in 1988, as the first and second editions respectively.
One of the editions, maybe the first, had a simple hash table implementation that just summed up the ASCII values of the characters in the given string, modulo some value, and used that as the table index for the string being hashed.
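If I remember right, it was something in the spirit of this Python paraphrase (not the book's actual C, and the table size here is made up):

HASHSIZE = 101  # made-up bucket count, not the book's

def kr_hash(s):
    # sum the character codes, then take the remainder modulo the table size
    return sum(ord(c) for c in s) % HASHSIZE

print(kr_hash("kernighan"))  # the bucket this string would land in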
I remember being taught about associative caches in CPU architecture, and later on programming the same in C. So I'm not sure about the 90's and hash tables being "exotic".
My first real use of hashes was in perl4; with perl5 released in 1996 (https://dev.perl.org/perl5/news/), surely it meant people were also well aware of the usefulness of hash arrays even back then.
the picture you're painting here is absurdly far from reality
— kernighan definitely did not think hash tables were 'a bit exotic' in 01978 —
hash tables (or sometimes other associative retrieval structures) were standard fare in computer science degree programs and were a standard feature of compilers, assemblers, and linkers, of which kernighan had written several at that point; associative retrieval structures are also fundamental to databases and filesystems. in 01979, 7th edition unix (in which kernighan published awk) barely had date formatting but included an on-disk hash table library called 'dbm'
hash tables are the exclusive topic of pp. 506–549 of knuth's taocp volume 3, a standard textbook which came out in 01973. after recommending you some hash functions, on p. 512 (525/739) knuth recommends you check out a survey of hash functions published in cacm by lum, yuen, and dodd in 01971, and he cites papers on hashing in cacm as early as 01959 and in an ibm journal as early as 01957. in the 'history' section (pp. 540–542 (553–555/739)) he traces it back to an internal ibm report by luhn in 01953, and its first report in the open literature to a paper by dumey in 01956. most of the rest of the 739-page book is about how to implement associative arrays of various kinds, although some parts instead concern the string search problem, and the first half is about sorting, and a sorted table can be used for purposes other than associative retrieval
knuth also mentions a 'very influential survey' of hashing published in cacm in 01968 by robert morris, kernighan's coworker at bell labs, and one of the very first users of unix and a major contributor to it
but perhaps kernighan hadn't read cacm? on the contrary, he and his colleagues regularly published in it; he published in cacm once in 01972, once in 01973, and once in 01975, and his colleagues such as stephen johnson, mike lesk, al aho (his coauthor on awk), and, as i said above, robert morris, published there regularly as well
in the practice of programming, kernighan and pike claim that almost all of the time, the only data structures you need are arrays, linked lists, trees, and hash tables (though they wrote this in 01999)
what kernighan is talking about here is including them as a basic language construct, a design choice his 01978 tech report on awk likens to that of snobol4, which was a well-known language but not a mainstream one. in the context of the above cacm articles, you can perhaps see that the daring choice was to offer a single 'good enough' hash table implementation rather than one optimized for a given purpose—and to use it even in place of conventional arrays
— kernighan was not isolated like you were —
kernighan was at bell labs, which was probably the most connected place in the software world at the time
when awk was released in 01979 the internet was already ten years old, and at&t (the parent company of bell labs) had installed all its long-distance links
in 01976 they'd also developed their own 'internet' called uucp, which at the time of awk's release was interconnecting 80 different unix sites, interchanging software updates on a daily basis, over 300-baud and 1200-baud modems. they were also in contact with ken thompson's alma mater berkeley; thompson took a sabbatical year to teach there in 01975. berkeley got a darpa contract to add tcp/ip support to the bell labs operating system 'unix', which would become the majority of the internet during the 01980s; berkeley participated actively in the uucp network
bell labs was also working on the so-called 'advanced communication system', its own misconceived version of the internet, where data would be routed using phone numbers, because data communications had become the fastest-growing source of call traffic. (colin berkshire has written a fantastic overview of this forgotten history.)
"Your comments are ridiculous! One of the smartest, most connected people in the world was smart and connected! And papers were being published all over the place!"
The computer world of the time was not clones of Kernighan a thousand times over, nor was everyone reading every single paper any more than people do today. You can't prove the normal world of computing was all aware of all these things by pointing at the very pinnacle of it. I'm talking about the real world, not the best of the best of the best.
But, like, as a teen in the 80s by the time I learned C from Kernighan & Ritchie (not an exotic path back then), where a hashmap was one of the examples and the subject of a couple of the problems, it wasn't remotely new to me. The real world had plenty of others learning these things from Byte magazine and such even with no CS degree.
The big difference was that in C you had to implement it yourself. Choosing a hash function and managing memory with malloc() was a hang-up for a lot of people. It was quite common to say hashes and the various tree data structures weren't worth the complication if you didn't absolutely need to have a hash table or tree.
In later languages like perl, the syntax for hashes was built into the language.
That was a huge step forward.
Perl itself was adopted only slowly, for other reasons (its illegible, write-once syntax). Other popular scripting languages like PHP adopted hashes from perl. This greatly helped make associative arrays common knowledge among the majority of programmers.
also dr. dobbs. https://archive.org/details/dr_dobbs_journal_vol_03/page/n29... is a presentation of 'sam76', an 8080 reimplementation of mcilroy & morris's m6, probably built around an efficient associative array of defined macros. and assemblers built around symbol tables were commonplace at the time, too
in the interview, kernighan said, 'The main idea in Awk was associative arrays, which were newish at the time, but which now show up in most languages either as library functions (hashmaps in Java or C++) or directly in the language (dictionaries in Perl and Python). Associative arrays are a very powerful construct, and can be used to simulate lots of other data structures.'
zambyte expressed surprise that kernighan referred to 'associative arrays' as 'newish' and said that lisp had had them 20 years before. what lisp had 20 years before (presumably they meant in https://www-formal.stanford.edu/jmc/recursive.pdf in 01959, not 01957, when mccarthy was trying to figure out how to add linked lists to fortran) was this line of code:
assoc = lambda x, y: y[0][1] if y[0][0] == x else assoc(x, y[1:])
and which allows you to use a list of 2-tuples such as [('red', '#f00'), ('green', '#0f0'), ('white', '#fff')] as an associative array
you replied, saying, 'it was (...) easy to be (...) knowledgeable and yet never have been exposed (...) in the 1990s (...) I can tell you a lot of the C world would still consider associative data structures like a hash table something to be a bit exotic (...). It was very easy to not be very up to date on these things.'
that comment is only relevant to zambyte's comment—to which you posted it as a reply—if you had the implicit assumption that what kernighan meant by 'associative arrays' was 'associative data structures like a hash table'. but associative data structures like hash tables were not newish; hash tables in particular were 22 years old in 01979, and were already the most commonplace of all data structures except for strings and arrays. i listed 13 articles in cacm in 01975 and 01976 about associative data structures. cacm only published about 6 articles per issue (excluding the news and opinion departments), so that's roughly 10% of everything they published
you say 'you can't prove the normal world of computing was all aware of all these things', but the everyday world of computing was, if anything, even more preoccupied with associative data structures than the academic world. ibm's most lucrative product was ims, an online database, which is to say, a large on-disk associative data structure. filesystems (including dectapes), variable namespaces in interpreters and compilers, symbol tables (in assemblers, linkers, and compilers), and databases are associative arrays. even in 01979 it was already hard to find programmers who did not use any of filesystems, assemblers, interpreters, or databases (though they did exist)
so it is completely implausible that what kernighan was describing as 'newish' (in 01979) was associative data structures like a hash table. i'm not doubting you when you say you knew c programmers in the 90s who considered them 'exotic', perhaps the province of the wizards who wrote compilers and filesystems, but those c programmers were not 'the normal world of computing' or 'experienced and knowledgeable'; they were beginners
the 01978 edition of k&r presents binary search (using a sorted array as an efficient associative array) in chapter 3 on p. 54 (62/236), before presenting such advanced language features as the 'break' statement, '#define', structs, non-integer function return values, and pointers. in chapter 6 on p. 125 (133/235), they use such an associative array to count the number of times each c keyword occurs in the input. §6.5 (pp. 130–134 (138–142/235)) demonstrates a more scalable associative array using a binary search tree. §6.6 (pp. 134–136 (142–144/235)) gives another program using a hash table
so for k&r in 01978, hash tables were more basic than features such as unions, typedef, string formatting (%.10s, etc.), opening files, and listing directories, all of which are relegated to later in the book
please don't tell me 'a lot of the c world' in the 01990s hadn't heard of k&r. next you're going to be telling me a lot of the c world hadn't heard of c
getting back to the interview, what kernighan was evidently referring to was not associative array data structures in general, but having a general-purpose associative data structure library always at hand. lisp's assoc (or assq) isn't general-purpose because it's too slow for large associative data structures. lisp plists aren't general-purpose because they leak memory (you can't remove a property from all atoms, and each atom you put a property on continues to occupy memory even if you remove the property from it) and because they get too slow if you have more than a few properties. ims isn't general-purpose because you have to go talk to the database administrator to create a new file; you can't create a temporary one in ram. the example hash table in k&r isn't general-purpose because it uses a static array, a fixed size of 100 hash buckets, has no deletion, associates specifically a string value with each (string) key, and is quite inefficient; it's a reasonable starting point, but you have to tweak it for your application
java hashmaps, awk arrays, and python dicts are general-purpose as long as your data fits in ram. (you could gripe that awk arrays aren't first-class objects and can only have string and numeric keys, but that's just because awk doesn't have first-class objects at all in the lisp sense.)
precisely as a result of this, hashing (like sorting) has ceased to be a beginner-programmer topic; if you need an associative array or a sorted list, you can get quite far just by invoking the built-in facilities of your programming language. sooner or later you will probably want one optimized for some specific application, but it's no longer something you have to do all the time. but to any working programmer in the 01970s, it would have seemed absurd to propose linear searches or hash tables were 'newish'
I totally get the first point, but this comment from Brian was made when the internet had been available for many, many years. It seems more like it felt newish at the time, and that is how he has seen it since then. I guess that's more why I find it to be an interesting comment.
Regarding the second point, when I hear the term "associative arrays" or "associative lists" I think of them in a pretty high level of abstraction. Like really: a C struct is an association abstraction. A simple array of structs is not often considered an "associative array", but it can easily be used for the same things.
Maybe it would be more accurate to say that the way they were using associative arrays was newish, rather than the construct itself.
This is from the first edition free PDF online of The AWK Programming Language:
"Associative arrays were inspired by SNOBOL4 tables, although they are not
as general. Awk was born on a slow machine with a small memory, and the
properties of arrays were a result of that environment. Restricting subscripts to
be strings is one manifestation, as is the restriction to a single dimension (even
with syntactic sugar). A more general implementation would
allow multi-dimensional arrays, or at least allow arrays to be array elements."
Common Lisp, of course, has built in hash tables. MacLisp doesn't seem to - I didn't see anything about them in the Pitmanual. ZetaLisp/LML and InterLisp have them. So it seems like built in lisp associative arrays (I'm not counting alists, or property lists, though the latter goes back to the early days and maybe should be) are roughly contemporary to the birth of awk. Maybe a bit newer.
in a sense lisp alists are 'associative arrays', sure. but, although they were used in the metacircular interpreter that bootstrapped lisp and still inspires us today, they aren't a language feature, and the lisp ones aren't efficient enough for the things you use them for in awk
you can of course build an associative array in any language, especially if you're satisfied with the linear search used by lisp alists. you can hack one together in a few minutes almost without thinking:
.intel_syntax noprefix
lookup: push rbx
push rbp
mov rbx, rdi # pointer to alist node pointer
mov rbp, rsi # key
2: mov rdi, [rbx] # load pointer to alist node
test rdi, rdi # is it null?
jnz 3f # if not null, skip not-found case
mov rax, rbx # pointer to null pointer is return value
xor rdx, rdx # but associated value is null (sets zero flag)
jmp 4f # jump to shared epilogue
3: mov rdi, [rdi] # load pointer from key field of alist node
mov rsi, rbp # reload key from callee-saved register
call comkey # sets zero flag if rsi and rdi are equal keys
jne 1f # skip following return-value case code if !=
mov rax, rbx # pointer to node pointer is return value
mov rdx, [rbx] # also let’s follow that pointer to the node
mov rdx, [rdx + 16] # and return its value field too in rdx
test rax, rax # clear zero flag (valid pointer is not null)
4: pop rbp
pop rbx
ret
1: mov rbx, [rbx] # load pointer to alist node again
add rbx, 8 # load pointer to next-node pointer field
jmp 2b
(untested, let me know if you find bugs)
but that's very different from having them built in as a language feature, like snobol4, mumps, awk, perl, python, tcl, lua, and js do. the built-in language feature lisp had for structuring data 20 years earlier was cons, car, and cdr, which is not at all the same thing
other programs on unix that contained associative arrays prior to awk included the linker (for symbols), the c compiler, the assembler, the kernel (in the form of directories in the filesystem), and the shell, for shell variables. none of these had associative arrays as a programming language feature, though. the bourne shell came closest in that you could concatenate variable names and say things like
eval myarray_$key=$val
here's the implementation of binary tree traversal for setting shell variables in the v7 bourne shell from 01979
NAMPTR lookup(nam)
REG STRING nam;
{
REG NAMPTR nscan=namep;
REG NAMPTR *prev;
INT LR;
IF !chkid(nam)
THEN failed(nam,notid);
FI
WHILE nscan
DO IF (LR=cf(nam,nscan->namid))==0
THEN return(nscan);
ELIF LR<0
THEN prev = &(nscan->namlft);
ELSE prev = &(nscan->namrgt);
FI
nscan = *prev;
OD
/* add name node */
nscan=alloc(sizeof *nscan);
nscan->namlft=nscan->namrgt=NIL;
nscan->namid=make(nam);
nscan->namval=0; nscan->namflg=N_DEFAULT; nscan->namenv=0;
return(*prev = nscan);
}
if you're not sure, that's c (now you know where the ioccc came from)
The "killer feature" of alists is that they are persistent data structures; when we extend a shorter alist to make a longer one, the shorter one is unaffected. Moreover, if the program has only a reference to the tail portion of a longer alist, the head portion can be garbage-collected.
With this, we can correctly implement lexical closures: we simply take the current environment (the head of the alist), associate it with the lambda body and arguments, and we are done.
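For what it's worth, here is a tiny Python sketch of that idea, with tuples standing in for cons cells (the names are mine, not from any particular Lisp):

# environment as nested pairs ((key, value), rest); extending never mutates
def extend(key, value, env):
    return ((key, value), env)

def assoc(key, env):
    while env is not None:
        if env[0][0] == key:
            return env[0]
        env = env[1]
    return None

outer = extend('x', 1, None)            # the environment at closure-creation time

def make_adder(env):
    # the closure just keeps a reference to the head of the alist it saw
    return lambda n: assoc('x', env)[1] + n

add_x = make_adder(outer)
outer = extend('x', 99, outer)          # a later binding extends a *new* head
print(add_x(10))                        # 11 -- the closure still sees x = 1
print(assoc('x', outer)[1])             # 99 -- newer code sees the newer binding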
yes! also when you only have two or three things in them, they are faster than hashing, especially on machines without a data cache. clojure uses hamts for this, thus getting this fp-persistence† as well as efficient lookup in large dictionaries, and chris wellons posted a concise, fast c hamt implementation last year at https://nullprogram.com/blog/2023/09/30/
______
† 'persistent' unfortunately also means 'stored in nonvolatile memory', so i feel the need to disambiguate
I’d say plists have a stronger claim on being “real” associative arrays. Lisp never made much distinction between “built in” and “library” to begin with, you're using the same syntax to use the feature either way.
i think it's true that, much of the time that you'd use a dict in python, you'd use a property in 01960s lisp, because it let you take advantage of the read-time implicit hashing of symbols; you probably only had one or two properties on most of your symbols, so getting from the interned symbol to the property was a matter of a handful of instructions---probably faster than a hash function. but conceptually i feel like they're pretty different ideas
I'm very taken with the regex to lex to yacc to awk sequence of developments. There's a very convincing sense of building on prior work to achieve more general results.
People will hate me for saying this, but I bet there are many many repos on github written by programmers that are better than all these coding legends. Back then, it was a much smaller pool.
Right, the headline gave the appropriate context for the interview in a short sentence. Not everyone knows Kernighan created awk, as opposed to a respected person that happens to have opinions on awk.
Aho, one of the authors of the dragon books. When I studied CS 40 years ago it was the standard literature on compiler construction, a mandatory course.
No idea whether it is still used today. Well, no idea whether there is anything fundamentally new in compilers such an old book would not cover.
There are plenty of advanced compiler techniques, but you can't understand them until you understand the fundamentals. There will always be a place for the purple dragon book.
Most implementations of that second group of languages are written in C, and run on operating systems written in C. It's really no contest that C has had far more of an influence on computing. So much software would not exist, or not in the forms we recognize, without it.
or at least in c++, c#, or java, sure, smalltalk being the main exception
but the layers are somewhat orthogonal. at the verilog level you can't really tell if your hardware is built out of luts, cmos standard cells, hand-laid-out nmos, ttl, or vacuum tubes; at the operating system level you can't really tell if your cpu was designed in vhdl, verilog, chisel, or hand-drawn schematics; at the interpreter level you can't really tell if your operating system was written in c, pl/x, assembly, or lisp; and at the python level you can't really tell if you're running on cpython (which is written in c) or pypy (which is written in python)
i mean there are small clues that make it hard to completely hide, but they're subtle
so, to my way of thinking, saying 'c has had far more of an influence on computing than awk', on the basis that awk was written in c, is similar to saying 'transistors have had far more of an influence on computing than c'. it's sort of true, but it's also sort of false
Not sure, but from the context of the conversation, it seems he complained that people keep saying lisp was an inspiration for Python, when in fact he was not even thinking about it by the time Python was created. Then it appears someone asked him what he actually based it on, and he answered abc, shell, awk. But given the lost message in the thread, perhaps you can ask him for clarification, to set the record straight.
yeah, i think it's well-known that lambda, map, filter, and reduce (the lispy bits of python) weren't in the earliest versions of python; they were contributed later on by a lisp fan
at a larger remove, of course, lisp was an inspiration for abc (via lcf and setl), shell (via algol), and awk (in a smaller way)
but it was also several years ago, which is when I first started playing around with python, before starting to do production work with it a few years later.
so in one sense, i.e. compared to the present, 1.5 is an early version.
> The main idea in Awk was associative arrays, which were newish at the time, but which now show up in most languages either as library functions (hashmaps in Java or C++) or directly in the language (dictionaries in Perl and Python). Associative arrays are a very powerful construct, and can be used to simulate lots of other data structures.
awk was released in 01979, and the authors published this paper in sp&e that year: https://plan9.io/sources/contrib/steve/other-docs/awk.pdf but you see this report version is dated september 01978. but i don't think the report was widely circulated until the next year, when it was included in 7th edition unix as /usr/doc/awk (sudo apt install groff; groff -ms -Tutf8 v7/usr/doc/awk | less -r). it explains:
> Array elements may be named by non-numeric values, which gives awk a capability rather like the associative memory of Snobol tables. (...) There is an alternate form of the for statement which is suited for accessing the elements of an associative array:
this pdf has evidently been retypeset from the troff sources from the open-source 7th edition release, but without the correct bibliographic database, so the references are missing. a comment in the troff source says:
> ....It supersedes TM-77-1271-5, dated September 8, 1977.
but possibly that reference is inaccurate
python goes beyond merely having dicts 'directly in the language'. python's primary data structure is the dict; among other things, it uses dicts for modules, (most) class instances, associating methods with classes, the locals() user interface to the local-variable namespace, and passing keyword arguments to functions and methods. that is, it uses associative arrays to simulate lots of other data structures, as you are obliged to do in awk, lua, and js. so where did python get dicts?
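as a quick, self-contained illustration of that 'everything is a dict' point before getting to the history (the class and little functions below are invented for the demo):

import sys

class C:
    def method(self): pass

obj = C()
obj.attr = 42
print(obj.__dict__)           # {'attr': 42}: instance attributes live in a plain dict
print('method' in C.__dict__) # True: methods are looked up in the class's dict
print(type(sys.modules))      # <class 'dict'>: loaded modules, keyed by name
print(type(vars(sys)))        # <class 'dict'>: a module's namespace is its dict

def f(**kwargs):
    return kwargs             # keyword arguments arrive as a dict
print(f(a=1, b=2))            # {'a': 1, 'b': 2}

def g():
    x = 3
    return locals()           # a dict view of the local variables
print(g())                    # {'x': 3}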
python got dicts (and tuples) from abc, a teaching language which wikipedia claims was started in 01987, 8 years after awk's release, and added conventional arrays (lists) back in. the five data types in abc https://homepages.cwi.nl/~steven/abc/language.html are numbers, strings, compounds (called tuples in ml), lists (really multisets because they're implicitly sorted), and tables (dictionaries), which are awk's 'associative arrays'—and, as in awk, js, lua, and tcl, they're used to provide the functionality of conventional arrays as well
however, lambert meertens credits the use of tables in abc to jack schwartz's setl https://inference-review.com/article/the-origins-of-python rather than to awk. he says of the addition of tables to b (the early name for abc, not to be confused with the b that was an earlier version of c)
> Having coded a few algorithms in SETL, I had experienced its power firsthand—a power that stemmed entirely from its high-level inbuilt data types. Particularly powerful were sets and maps, also known as “associative arrays,” containing data that can be indexed not only by consecutive integers but by arbitrary values. A programmer could introduce a simple database of quotations named whosaid, in which the value ”Descartes” could be stored in the location whosaid[”cogito ergo sum”]. These high-level types made it possible to express algorithms that required many steps in B1 using just a few steps in SETL. In a clear violation of the Fair-Expectation Rule, B1 allowed only integers as array indices. This design decision had been driven by fear: we had been concerned that aiming too high would make our language unimplementable on the small personal computers that were starting to appear on the market. But Dewar, in particular, convinced me that this meant we were designing for the past, not the future. This led us to redesign the system of data types for our beginners’ language. This time we used only the criteria of ease of learning and ease of use to select candidate systems. The winner turned out to be remarkably similar to the data type system of SETL. The set of possible data type systems to choose from was very large, and to make the process more manageable I had written a program to select the competitive (Pareto-optimal) candidate systems. Interestingly, but quite incidentally, that selection program itself was written in SETL. The winning type system became that of B2, and remained unchanged in the final iteration, released in 1985 under the name “ABC.”
'associative arrays', of course, is the term used by awk
this story of adding associative arrays to abc only for b2 is somewhat complicated by the fact that the version of b (b1?) in meertens's 01981 'draft proposal for the b programming language' https://ir.cwi.nl/pub/16732 already includes tables, three years after the release of awk as part of 7th edition; p. 6 (9/91) says,
> Tables are somewhat like dictionaries. A short English-Dutch dictionary (not sufficient to maintain a conversation) might be (...) Table entries, like entries in a dictionary, consist of two parts. The first part is called the key, and the second part is called the associate. All keys must be the same type of value, and similarly for associates. A table may be written thus: {[’I’]: 1; [’V’]: 5; [’X’]: 10}.
> If this table has been put in a target roman, then roman[’X’] = 10.
note that this is also awk's syntax for indexing an associative array, though it doesn't have a syntax for writing one down.
(to be continued; hn says, 'that comment included too many facts')
> The set of possible data type systems to choose from was very large, and to make the process more manageable I had written a program to select the competitive (Pareto-optimal) candidate systems. Interestingly, but quite incidentally, that selection program itself was written in SETL.
then you just need some way to calculate various merits of the different designs. one merit is simplicity, which you could maybe try to operationalize as something like this:
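(the measure below is only an invented placeholder, not anything meertens actually used: say, penalize each extra primitive type and composition rule a design needs)

>>> simplicity = lambda design: 10 - len(design['types']) - len(design['rules'])
>>> simplicity({'types': ['number', 'string', 'table'], 'rules': ['nesting']})
6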
comparing two merit-maps such as {'simplicity': 8} and {'simplicity': 6} for pareto optimality is simple enough with some thought, especially if we assume the keys are the same:
>>> some_way_better = lambda a, b: any(a[k] > b[k] for k in a)
>>> defeats = lambda a, b: some_way_better(a, b) and not some_way_better(b, a)
>>> defeats({'simplicity': 9}, {'simplicity': 8})
True
>>> defeats({'simplicity': 8}, {'simplicity': 9})
False
>>> defeats({'simplicity': 9, 'turing-complete': 0}, {'simplicity': 8, 'turing-complete': 1})
False
>>> defeats({'simplicity': 9, 'turing-complete': 1}, {'simplicity': 8, 'turing-complete': 1})
True
then it's easy enough to calculate which of a set of candidate designs can be eliminated because some other design defeats them
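for instance, reusing the defeats predicate above on some invented candidates:

>>> pareto = lambda designs: [d for d in designs
...                           if not any(defeats(e, d) for e in designs)]
>>> pareto([{'simplicity': 9, 'turing-complete': 0},
...         {'simplicity': 8, 'turing-complete': 1},
...         {'simplicity': 7, 'turing-complete': 1}])
[{'simplicity': 9, 'turing-complete': 0}, {'simplicity': 8, 'turing-complete': 1}]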
the bogus part is how you automatically calculate the various merits of a hypothetical collection of language features
a more recent set of slides on the relation between abc and python is https://www.cwi.nl/documents/195216/Meertens-20191121.pdf which describes again how abc was started in 01975. this helpfully clarifies the timeline: b0 was 01975; b1 was 01978; b2 was 01979; and b∞ = abc was 01985. so specifically the point at which setl inspired the replacement of conventional arrays in b1 with associative arrays in b2 was 01979, which was the year 7th edition unix was released and the aho, weinberger, and kernighan paper was published in sp&e
a question of some interest to me here is what platform they were developing abc on in 01979. clearly it couldn't have been the ibm pc, which wouldn't come out until 01981 (and as far as i know abc on the ibm pc only runs under cygwin or 32-bit microsoft windows), or macos (which came out in 01984) or atari tos, which wouldn't come out until 01985. and so far i haven't seen any mention in the history of abc of other operating systems of the time like cp/m, vm/cms, dg rdos, tenex, or tops-20. the most likely platform would seem to have been unix, on which awk was one of the relatively few programming languages available. perhaps at some point i'll run across an answer to that question in the abc papers
python adopted awk's syntax for putting 10 into roman['x'], which was `put 10 in roman['x']` in abc, but `roman['x'] = 10` in awk and python. abc's syntax is uppercase, presumably case-insensitive, separates words with apostrophes, and departs widely from conventional infix syntax. python's syntax is case-sensitive, mostly lowercase, and conventionally infix, features that have become common through the influence of unix. python's control structures are for, while, and if/elif/else, as in algol and in abc, and indentation-sensitive as in abc, but uses a conventional ascii syntax rather than abc's scratch-like syntax-directed editor
abc was statically typed with a hindley-milner type system ('the type system is similar to that of lcf', p. 15 (18/91) of the draft proposal), while python is dynamically typed, like smalltalk, lisp, and awk
if meertens got his daring notion of storing everything in associative arrays from awk, he certainly doesn't mention it. instead he mentions setl a lot! the draft proposal doesn't cite awk but it also doesn't cite setl; it cites the algol-68 report, milner's lcf typing paper, a cleaveland and uzgalis paper about grammars, gehani, and three of his own papers, from 01976, 01978, and 01981. unfortunately i can't find any of those earlier meertens papers online
the wikipedia page about setl says
> SETL provides two basic aggregate data types: (unordered) sets, and tuples.[1][2][5] The elements of sets and tuples can be of any arbitrary type, including sets and tuples themselves, except the undefined value om[1] (sometimes capitalized: OM).[6] Maps are provided as sets of pairs (i.e., tuples of length 2) and can have arbitrary domain and range types.[1][5]
but it's citing papers about setl from 01985 there, well after awk had supposedly popularized the notion of associative arrays
> Jacob T. Schwartz. ON PROGRAMMING: An Interim Report on the SETL Project. Part I: Generalities; Part II: The SETL Language and Examples of Its Use. Computer Science Department, Courant Institute of Mathematical Sciences, New York University, revised June 1975.
this discusses how setl represented data in memory starting on p. 57 (57/689). it used hash tables to represent sets, including sets of tuples, rather than the ill-advised balanced-tree approach used by abc. (python, like awk and setl, uses hash tables.) on pp. 62–63 (62–63/689) it explains:
> The hash code of a tuple is taken to be the hash code of its first component, for reasons that will become clear in the next section. The hash code of a set is the exclusive or of the hash codes of all its members. (...)
> — Tuples in Sets —
> Though expressible in terms of the membership test, "with", and "less" operations, functional evaluation plays so important a role in SETL algorithms that we treat it as a primitive.
> SETL makes three types of set-related functional evaluation operators available:
> - f(x)
> - f{x}
> - f[s]
> The most fundamental of these is f{x}, which invokes a search over f for all n-tuples that begin with x (n ≥ 2), and which yields as result the set of all tails of these n-tuples. More precisely, in SETL:
> f{x} = {if #y eq 2 then y(2) else tℓ y, y ∈ f | type y eq tupl and #y ge 2 and hd y eq x}
> The operation f(x) has a similar definition but includes a single valuedness check:
> f(x) = if #f{x} eq 1 then ∋f{x} else Ω
> The operation f[s] is adequately defined in terms of f{x}:
> f[s] = [+: x ∈ s] f{x}
i am fairly confident that the f{x} definition translates into current vernacular python as the set-comprehension {y[1] if len(y) == 2 else y[1:] for y in f if type(y) == tuple and len(y) >= 2 and y[0] == x} (setl's y(2), the second component of a 1-indexed tuple, is y[1] in python's zero-based indexing).
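to make that concrete, here is a runnable python paraphrase of the three operators, reusing meertens's whosaid example; the function names are mine, not setl's:

def f_brace(f, x):    # f{x}: the set of tails of tuples in f that begin with x
    return {y[1] if len(y) == 2 else y[1:]
            for y in f if isinstance(y, tuple) and len(y) >= 2 and y[0] == x}

def f_paren(f, x):    # f(x): single-valued lookup; None stands in for setl's omega
    ys = f_brace(f, x)
    return next(iter(ys)) if len(ys) == 1 else None

def f_bracket(f, s):  # f[s]: the image of a whole set of keys
    out = set()
    for x in s:
        out |= f_brace(f, x)
    return out

whosaid = {('cogito ergo sum', 'Descartes'), ('eureka', 'Archimedes')}
print(f_paren(whosaid, 'cogito ergo sum'))                # Descartes
print(f_bracket(whosaid, {'cogito ergo sum', 'eureka'}))  # {'Descartes', 'Archimedes'}, in some order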
so, it becomes clear that already in 01975 setl treated sets of tuples as maps, which is to say associative arrays, but it didn't use the 'associative array' terminology used by meertens in 01981, or for that matter 'maps'. to look up an element in the map, it didn't use the f[x] notation used by python, awk, and abc; instead it used f(x). further explanation on pp. 64–65 (64–65/689) clarifies that really it is more accurate to think of 'sets of tuples' as trees; each item in the tuple entails following an additional level of pointers to a further hash table
(in a number of other notational details, python and presumably abc follows setl: start: or start:end for array index ranges, + for string concatenation, * for string repetition, boolean operators spelled out as and, or, and not. but indexing maps is far from the only difference)
abc (including b as described in the 01981 report) also seems to lack the f{x} operation and its possibility of associating an arbitrary-sized set of values with each key. this is a nontrivial semantic divergence
so if abc got its idea of tables from setl, but used awk's terminology, notation, and semantics for them (and its own ill-conceived balanced-tree implementation, used by neither), and decided to adopt the table idea in the year when awk was released, probably on the platform that awk was released on, i think it's reasonable to assign some share of the credit for abc's tables to awk? even if not all of it
but if that's so, then why didn't meertens credit aho, weinberger, and kernighan? i don't know. maybe awk's loosey-goosey nature was repugnant to him. maybe weinberger is jewish and meertens is secretly anti-semitic. maybe meertens thought that awk's loosey-goosey nature would be repugnant to the dijkstra-honoring dutch computer science establishment. maybe aho insulted meertens's favorite beer one time when he visited the netherlands. or maybe he thought it would be unfair for aho, weinberger, and kernighan to steal the thunder of schwartz, who did after all precede them in developing a general-purpose language whose memory was based entirely on hash tables. from a certain point of view that would be like crediting carl sagan with the theory of relativity because he explained it on nova
I read this comment in the spirit of "you might not have known that K did much more than write awk, but he's a genius." Which I didn't know, so I appreciate the context.
Just downvote or flag. It looks like you only have 17 karma so I don't think you can do either (I know you can't downvote, I'm not sure what, if any, threshold there is for flagging), so the best thing to do is to ignore it. Contribute to the site until you get enough karma, and then downvote/flag this kind of nonsense.
* Here’s a writeup from him on putting AT&T’s toll free phone directory on the internet back in fall 1994: https://www.cs.princeton.edu/~bwk/800.html