Parsing Awk Is Tricky (raygard.net)
110 points by oliverkwebb 16 days ago | 90 comments



Brian Kernighan sent Gawk maintainer Arnold Robbins an email linking to this blog post with the comment "Hindsight has a lot of benefits, it would appear."

Peter Weinberger (quoted with permission) responded:

> That's interesting, Here's some thoughts/recollections. (remember that human memory is fallible.)

> 1. Using whitespace for string concatenation, in retrospect, was probably not the ideal choice (but '+' would not have worked).

> 2. Syntax choices were in part driven by the desire for our local C programmers to find it familiar.

> 3. As creatures of a specific time and place awk shared with C the (then endearing, now irritating) property of being underspecified.

> I think that collectively we understood YACC reasonably well. We tortured the grammar until the parser came close to doing what we wanted, and then we stopped. The tools then were more primitive, but they did fit in 64K of memory.

Al Aho also replied (quoted with permission):

> Peter's observation about torturing the grammar is apt! As awk grew in its early years, the grammar evolved with it and I remember agonizing to make changes to the grammar to keep it under control (understanding and minimizing the number of yacc-generated parsing-action conflicts) as awk evolved. I found yacc's ability to point out parsing-action conflicts very helpful during awk's development. Good grammar design was very much an art in those days (maybe even today).

It's fun to hear the perspectives of the original AWK creators. I've had some correspondence with Kernighan and Weinberger before, but I think that's the first time I've been on an email thread with all three of A, W, and K.


"The tools then were more primitive, but they did fit in 64K of memory."

I will take "primitive" over present-day bloat and complexity every time, quirks and all.

That programs fitting in 64K of memory have remained in continuous use and the subject of imitation for so long must be a point of pride for the authors. From what I have seen, contemporary software authors are unlikely to ever achieve such longevity.


Thanks for posting this.

I think it casts a pretty harsh light on criticisms of awk.

Ultimately awk is one of the all time great languages. Small. Good at what it does.

There’s something satisfying about using it which languages like Python just don’t give you. It’s a little bit of Unix wizardry.


Awk is something that I think every programmer and especially every sysadmin should learn. I like the comparison at the end and had never heard of nnawk or bbawk before.

I recently made a dashboard to compare the output of four versions of awk side by side, since not all awk scripts will run the same on each version: https://megamansec.github.io/awk-compare/ I'll have to add those :)


awk is also not hard to understand; scroll through the Wikipedia page for a few minutes: https://en.wikipedia.org/wiki/AWK#Structure_of_AWK_programs

It runs an action for each line in the input (optionally filtered by regex). You get automatic variables $1, $2, ... for the fields of the line, split on whitespace by default.

The syntax is almost like a simple subset of JavaScript. Built-in functions are similar to the C standard library.

If you have text input that is separated into columns by a delimiter, and you want to do simple operations on it (filter, map, aggregate), it can be done quickly with awk.

That's all you need to know about awk.
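As a quick illustration of that pattern-action model (the data and field values here are made up):

```shell
# Sample whitespace-delimited input: name and score.
data() { printf 'alice 90\nbob 55\ncarol 72\n'; }

# Filter + map: print the name wherever the second field exceeds 60.
data | awk '$2 > 60 { print $1 }'

# Aggregate: sum the second field, print the total after the last line.
data | awk '{ sum += $2 } END { print sum }'
```

The first command prints `alice` and `carol`; the second prints `217`.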


I often find the missing support for slicing (like fields 2-6, as `cut -f` can do) a handicap. I tend to reach for jq instead of awk these days.
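For what it's worth, the slice can be emulated in awk with a field loop, though it's noticeably clunkier than `cut` (a sketch, with made-up input):

```shell
# cut slices a field range directly:
printf 'a b c d e f g\n' | cut -d' ' -f2-6

# awk needs an explicit loop for the same slice of fields 2-6:
printf 'a b c d e f g\n' |
  awk '{ for (i = 2; i <= 6; i++) printf "%s%s", $i, (i < 6 ? OFS : ORS) }'
```

Both print `b c d e f`.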


> Awk is something that I think every programmer and especially every sysadmin should learn

I'd argue that it should be every programmer who doesn't already know a scripting language like Ruby or Python. If you already know a scripting language, chances are the time saved between writing an Awk one-liner and quickly banging out a script in your preferred language is negligible. And if you ever need to revisit it and maybe expand it, it'll be much easier to do in your preferred scripting language than in Awk, especially the more complex it gets.

I'm speaking from experience on this last point... At my work I wrote a very simple file transformer (move this column to here, only show lines where this other column is greater than X, etc etc) in Awk many years ago. It was quick and easy and did what it needed to. It was a little too big to be reasonable as a one-liner, though not by very much at all. But as we changed and expanded what it needed to do, it ended up getting to be a few thousand lines of Awk, and that was a nightmare. One day I got so fed up with it that I rewrote it all in Ruby in my free time and that's how it's been ever since, and it's soooo much better that way. Could have saved myself a lot of trouble if it were that way from the beginning, but I had no idea at that time it would grow beyond the practically-a-one-liner size, so I thought Awk would be a great choice.


> every programmer and especially every sysadmin should learn

There are lots of things "every <tech position> should learn", usually by people who already did so. I still have a bunch of AI/ML items on that list too.

What's the advantage of learning AWK over Perl?


> What's the advantage of learning AWK over Perl?

Getting awk in your head (fully) takes about an afternoon: reading the (small and exhaustive) man page, going through a few examples, trying to build a few toys with it. Perl requires much, much more effort.

Great gain/investment ratio.


Another commenter said something similar. But nothing says you have to learn everything: you could learn a subset of perl that does everything you would want to do (with awk). Would that take as long?


Yup, but defining that subset isn't free! Perhaps some people have done the work already, but I'd still be cautious about how much Perl one actually needs to know to use such a subset comfortably.


Both will get you where you want to go, but I don't think the use cases for perl and awk are the same.

I reach for awk when my bash scripts get a bit messy; perl is/was for when I want to build a small application (or nowadays python).

But both perl and python require cpan/pip to get the most out of them, whereas with awk, I just need awk.


Is there any particular functionality which does exist in awk, but doesn't exist in Perl or Python without third-party libraries? I've always found "Python + built-in modules" more than sufficient for my text-manipulation needs. (Also, it lets me handle binary data and character data in the same program, which is very useful for certain tasks.)


It’s just that awk has a concise syntax that can make for some really quick one-liners in your terminal prompt. Why spend a minute or two in Python if you can get an answer in 15 seconds instead?


> Why spend a minute or two in Python if you can get an answer in 15 seconds instead?

Because you (or someone else) can run your Python later if needed, and have confidence the output will be the same.

Sure, there are times when a one-liner is needed, and you can always put that one line in a document for others to run. I can think of many times when I was on-call and needed to grep some data out of logs that wasn't already in a graph/dashboard somewhere; that's fine when time is of the essence, or if you're really sure you won't need to run the same or similar thing ever again, even if the data changes. I even changed my shell to make up-arrow go through commands with the same prefix instead of linearly traversing command history, because I had so many useful one-liners I re-ran later.

But as I've gotten more experienced, I've come to appreciate the value of committing those one liners to code as early as possible, and having that code reviewed. Sometimes a really useful tool will even emerge from that.


I put off learning awk for literal decades because I knew perl, but then I picked it up and wish I had done so earlier. I still prefer perl for a lot of use cases, but in one-liners, awk's syntax makes working with specific fields a lot more convenient than perl's autosplit mode. `$1` instead of `$F[0]`, basically.


But then couldn't you use "cut" for even simpler syntax?


`cut` doesn’t work natively on data that’s been aligned with multiple spaces, you need a `tr -s` pass first.

It also doesn’t let you reorder or splice together fields.

I used it for years but now that I have a working understanding of `awk` I have never looked back.
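A sketch of the difference on space-aligned input (made-up data):

```shell
# cut treats every space as a delimiter, so a run of spaces
# produces empty fields; field 2 here is "":
printf 'alpha   42\n' | cut -d' ' -f2

# Squeezing repeated spaces with tr -s first makes cut usable:
printf 'alpha   42\n' | tr -s ' ' | cut -d' ' -f2

# awk splits on any whitespace run by default:
printf 'alpha   42\n' | awk '{ print $2 }'
```

The last two commands both print `42`; the first prints an empty line.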


FreeBSD cut has -w for that ("split by any amount of whitespace"), but that never made it into GNU cut. Sad, because it's mega useful.

Of course awk can do much more, but if all you want is "| awk '{print $2}'" then "cut -wf2" is so much more convenient.


Reordering and splicing are common enough that it’s easier just to always use awk, since the cost of rewriting one to the other is significantly higher.


Maybe if all you want to do is unconditionally extract certain columns from your data. But even in that case cut doesn't let you use a regular expression as the field delimiter.
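For instance, a regex field separator handles mixed delimiters in one pass (made-up input):

```shell
# Split on a comma or semicolon with optional surrounding spaces,
# something cut's single-character -d cannot express:
printf 'a, b;c ,d\n' | awk -F' *[,;] *' '{ print $3 }'
```

This prints `c`.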


- Awk is defined in POSIX

- Awk is on more systems than Perl

- Awk has more implementations than Perl


POSIX not really relevant, more systems? Debatable. More implementations could be seen as a negative.

Perl is more regular than Awk for the simple cases and is more usable for anything that isn't merely iterating over input.

Of course, you shouldn't use any of awk/perl/shell for tasks that aren't being run by you or are over, say, 20 lines long.


awk is also a much smaller language than perl, so it's generally less effort to teach, learn, and read.


Is it not possible to learn a subset of perl?


Learning any language more or less starts with learning a subset of it.

Asking a new hire to "learn awk" vs "learn perl" involves two very different time investments.

Tasking someone with "learning a subset of perl" raises the question "what subset?", and a very exhausting conversation follows, with someone(s) routinely asking "so?" and a large amount of time spent re-litigating which subset of perl features we want that awk already supplies.


Which subset, and how do you ensure that every example you come across and everyone you work with sticks to that subset?


> Awk is defined in POSIX

so?

> Awk is on more systems than Perl

By what metric?

> Awk has more implementations than Perl

so?


Whatever you think my opinion of Perl is, you're probably wrong, and the tone of your advocacy is kind of odd.

Awk is older, and as part of POSIX the version found in unix-like environments will be (outside of extensions) compatible with others. If one isn't present, or the one present lacks the extensions you want, you can choose an implementation, even one written in Go, and it'll work.

Perl, and I've been writing Perl since Perl4, doesn't have those characteristics. It's a much more powerful language that has changed over the years and it is not always present by default on a unix-like system. Because the maintainers value backward compatibility, even scripts written on Perl5.005 have a fair chance of working on a modern version but it's not assumed (and you shouldn't assume anything about modules). Because Awk is fossilized, you can assume that.


The first and last items in your list provide no reason why they are relevant, there is no "tone", nor "advocacy" - it's not "odd" to ask for that context, as given here.


Awk is found in small-ish embedded systems that have no reason to waste space on Perl or anything like it.

One reason for this is that the popular BusyBox project includes an Awk implementation: BusyBox Awk.

Pretty much everywhere there is BusyBox, there is an Awk, unless someone went out of their way to compile it out of the BusyBox binary.


Every linux system comes with awk already on it. Perl has to be installed, and might not be available on a system you don't control.


I think this is a good illustration of why parser-generator middleware like yacc is fundamentally misguided; they create totally unnecessary gaps between design intent and the action of the parser. In a hand-rolled recursive descent parser, or even a set of PEG productions, ambiguities and complex lookahead or backtracking leap out at the programmer immediately.


Hard disagree. Yacc has unnecessary footguns, in particular the fallout from using LALR(1), but more modern parser generators like bison provide LR(1) and IELR(1). Hand-rolled recursive descent parsers as well as parser combinators can easily obscure implicit resolution of grammar ambiguities. A good LR(1) parser generator enables a level of grammar consistency that is very difficult to achieve otherwise.


> Hand-rolled recursive descent parsers as well as parser combinators can easily obscure implicit resolution of grammar ambiguities.

Could you give a concrete, real-life example of this? I have written many recursive-descent parsers and never ran into this problem (Apache Jackrabbit Oak SQL and XPath parser, H2 database engine, PointBase Micro database engine, HypersonicSQL, NewSQL, Regex parsers, GraphQL parsers, and currently the Bau programming language).

I have often heard that Bison / Yacc / ANTLR etc are "superior", but mostly from people that didn't actually have to write and maintain production-quality parsers. I do have experience with the above parser generators, eg. for university projects, and Apache Jackrabbit (2.x). I remember that in each case, the parser generators had some "limitations" that caused problems down the line. Then I had to spend more time trying to work around the parser generator limitations than actually doing productive work.

This may sound harsh, but well that's my experience... I would love to hear from people that had a different experience for non-trivial projects...


If you start with an unambiguous grammar then you aren't going to introduce ambiguities by implementing it with a recursive descent parser.

If you are developing a new grammar it is quite easy to accidentally create ambiguities and a recursive descent parser won't highlight them. This becomes painful when you try to evolve the grammar.


The original comment says that using yacc/bison is "fundamentally misguided." But parser generators make it easy to add a correct parser to your project. It's obviously not the only way. Hand-rolling has a bunch of pitfalls, and easily leads to apparently correct behavior that does weird things on untested input. Your comment then is a bit like: I've never had memory corruption in C, so Rust/Java/etc. is for toy projects only.


> Hand-rolling has a bunch of pitfalls

I'm arguing that this is not the case in reality, and asked for concrete examples... So again I ask for a concrete example... For memory corruption, there are plenty of examples.

For parsing, I know one example that led to problems. Interestingly, it was about using a state machine that was then modified (manually), and the result was broken. Here I argue that using a handwritten parser, instead of a state machine that is then manually modified, would not have resulted in this problem. Also, there was no randomized testing / fuzz testing, which is also a problem. This issue is still open: https://issues.apache.org/jira/browse/OAK-5367


There's no reason for concrete examples, because the point was about the fundamental misguidedness of parser generators, not about problems with individual parser generators or the nice things you can do in a hand-rolled one, but to accommodate you, ANTLR gives one on its home page: "... At Twitter, we use it exclusively for query parsing in Twitter search... Samuel Luckenbill, Senior Manager of Search Infrastructure, Twitter, inc."

Also, regexps are used very often in production, and that's definitely a parser-generator of sorts.

The memory corruption example was an analog, but to spell it out: it's easier and faster to write a correct parser using flex/bison than by hand, especially for more complex languages. Parser-generators have their use, and are not fundamentally misguided. That you might want to write your own parser in some cases does not diminish that (nor vice versa).


Same. LR(k) and LL(k) are readable and completely unambiguous, in contrast to PEG, where ambiguity is resolved ad hoc: PEG doesn't have a single definition, so implementations may differ, and the original PEG uses the order of the rules and backtracking to resolve ambiguity, which may lead to different resolutions in different contexts. Ambiguity does not leap out to the programmer.

OTOH, an LL(1) grammar can be used to generate a top-down/recursive descent parser, and will always be correct.


A large portion of this consistency is not making executive decisions about parsing ambiguities. The difference between "the language is implicitly defined by what the parser does" and "the grammar for the language has been refined one failed test at a time" is large and practically important.


I think it would be interesting and adequate to hear about and link to the reflections of the original awk authors (Aho, Kernighan, Weinberger et al.), considering they were also experts on yacc and other compiler-compiler tools of the 1977–1985 era and authors of the dragon book. After all, awk syntax was the starting point for JavaScript, including warts such as regexp literals, optional semicolons, for (e in a), delete a[e], introducing the function keyword to a C-like language, etc. I recall at least Kernighan talked about optional semicolons as something he'd reconsider given the chance.


And GNU is notorious for their use of yacc. Even gnulib functions like parse_datetime (primarily used to power the date command) rely on a yacc generated parser.


That's mostly for historical reasons. Nobody felt the need to switch and do all the work needed to avoid breaking edge cases.

GCC used to have Bison grammars but it switched to recursive descent about 20 years ago. The C++ grammar was especially horrible.


If you think AWK is hard to parse, then try C++. The latter is so hard to parse, and as a result so slow to compile, that it most probably inspired one of the most popular XKCDs of all time [1].

Then along come modern fast-compiling languages like Go and D. The latter is such a breath of fresh air: even though it's a complex language like C++ and Rust, it manages to compile very fast. Heck, it even has the RDMD facility, which gives you a compiled REPL you can interact with at a prompt, similar to interpreted languages like Python and Matlab.

According to its author, the main reason D has very fast compile times (as long as you avoid CTFE) is that its design decisions avoid the notorious constructs that complicate the symbol table in C++, such as the popular << and >> overloading for both I/O and shifting. But the fact that Rust came much later than C++ and D and is still slow to compile is bewildering, to say the least.

[1] Compiling:

https://xkcd.com/303/


Pretty sure Rust's compile times are a function of the complex type system and generic instantiation. Everything's a trade-off.


Except in some rare edge cases, it’s mostly the latter, indirectly: in the average crate the vast majority of the time is spent in LLVM optimization passes and linking. Sometimes IR generation gets a pretty high score, but that’s somewhat inconsistent.


`cargo check` that does all the parsing, type system checks, and lifetime analysis is pretty fast compared to builds.

Rust compilation time spends most time in LLVM, due to verbosity of the IR it outputs, and during linking, due to absurd amount of debug info and objects to link.

When cargo check isn't fast, it's usually due to build scripts and procedural macros, which are slow because they are compiled binaries, so LLVM, linking, and running a ton of unoptimized code all block type checking.


Which are a damn sight more important (to me) than the compile-time metric.


IIRC, rust's long compile times are because it is basically doing static analysis, looking for potential errors


> According to its author, the main reason D has very fast compile times (as long as you avoid CTFE) is that its design decisions avoid the notorious constructs that complicate the symbol table in C++, such as the popular << and >> overloading for both I/O and shifting. But the fact that Rust came much later than C++ and D and is still slow to compile is bewildering, to say the least.

The reasons why Rust (rustc) is slow to compile are well-known. Not bewildering.


Rust isn't particularly slow to compile as long as you keep opt-level at 1 and the number of external libraries minimal. But even then it isn't as slow as C++ (though i write shit C++ code; i've heard that modern C++ is way better, i learned with C++98 and never really improved my style despite using C++11).


http://canonical.org/~kragen/sw/dev3/gcd.rs, which uses no external libraries, takes 400–450ms to compile with rustc -C opt-level=1 gcd.rs (buggy program, i know). gcc 12, which is not anyone's idea of a fast c compiler, compiles the c equivalent http://canonical.org/~kragen/sw/dev3/gcd.c in 70–90ms, so the rust compiler is 300–500% slower

tcc, which is most people's idea of a fast c compiler, compiles gcd.c in 8–9ms, so the rust compiler is 4300–5500% slower

so from my point of view 'rust isn't particularly slow to compile' is off by about an order of magnitude

is it as slow as c++? well, g++ compiles the c++ version of the same code http://canonical.org/~kragen/sw/dev3/gcd.cc in 460–490ms. so in this case compiling rust is, yeah, on the order of 10% faster than compiling c++? i feel like that's basically the same

of course you can make compiling c++ arbitrarily slow with templates


I couldn't quite replicate those numbers with recent versions (rustc 1.78, gcc 14, g++ 14). On my machine (Ryzen 9 7900X, LVM on NVMe) it's rustc 60-80ms, gcc 20-30ms and tcc at 2ms. Interestingly, g++ is still 200ms on that machine. Activating time and the builtin time-passes in rustc, here's also an interesting observation: rustc spends 47ms of its time in sys and 23ms in user, compared to <3ms for both C variants. It counts its own time as 50ms instead for some reason; not sure what it is subtracting here. Also, looking at individual passes of the compiler (rustc +nightly -C opt-level=1 -Z time-passes gcd.rs) reveals it spends 33ms linking, 16ms in LLVM, and only a negligible time in what you'd consider compiling.

I think the test is ultimately nonsensical for the question being posed here. It doesn't reveal anything insightful about scaling to real-world program sizes, either. The time of rustc is dominated by the platform linker anyway. Sure, one might argue that this points out Rust as relying too much on the linker and creating too many unused symbols. But the question of whether this is caused by the language, and in particular its syntactical choices, should at that point be answered with: probably not. It's not a benchmark you want to compare by percentage speedups anyway, since it's probably dominated by constant time costs for any of the batteries-included standard library languages compared to C.


thank you very much for the failed replication!

it's interesting, my machine is fairly similar—ryzen 5 3500u, rustc 1.63.0, luks on nvme. is it possible that rustc has gotten much faster since 1.63?

while i agree that it's not the most important test for day-to-day use, i don't agree that it falls to the level of nonsensical. how fast things are determines how you can use them. tcc and old versions of gcc are fast enough that you could very reasonably generate a c file, compile it into a new shared object, dlopen it, and call it, every screen frame. there are some languages, like gforth, that actually implement their ffi in such a way, and sitkack and i have both done some experiments with inline c and jit compilation by this mechanism

i do agree that the syntactical choices of the language have relatively little to do with it, and your rustc measurements provide strong evidence of that—though perhaps it is somewhat unfavorable for c++ that it commonly has to tokenize many megabytes of header files and do the moral equivalent of text replacement to implement parametric polymorphism


Thank you for re-validating the numbers on your end, it's indeed very possible. There's been quite a few improvements in those versions. Though the effect size does not quite fit with most of the optimizations I can recall, maybe it's much more related to optimizations to the standard library's size and linking behavior.

With regards to standard use, for many users the scenario is definitely not common. I'd rather rustc be an effective screwdriver and a separate hammer be built than try to mangle both into the same tool. By that I mean, it's very clear which portion of the compiler must be repurposed here. The hard question is whether the architecture is amenable to alternative linker backends that serve your use case. I'm afraid I can't answer that conclusively. Only so much: the conceptual conflict of Rust is that linking is a very memory-safety-critical part of the process. And with its compilation model it relinks everything into the resulting binary / library, which includes a large std and dependency tree even if much of that is removed by the step. Maybe that can be changed; and relying on a tool whose interface was ultimately designed with C in mind is also far from optimal for computing those outputs and inputs. It's hard to say how much of it stems from compatibility concerns and overheads and how much is fundamental to the language's design and could be shed in a pure build process.

With regards to C++, I suspect it's rooted in the fact that parsing it requires, in principle, the implementation of a complete consteval engine. The language has a dependency loop between parsing and codegen. This, of course, is not how data should be laid out for executing fast programs on it. It's quite concerning given the specification still contains the bold-faced lie that "The disambiguation is purely syntactic" (6.8; 1) for typenames vs non-typenames to distinguish constructors from declarations, which at present can require arbitrary template specialization. It might be interesting to see if the two headers in your example already execute some of these dependency loops, but it's hard for me to think of an experiment to validate any of this. Maybe you have ideas; is there something like time-passes?


dunno. with respect to c++, you could probably hack together a c++ compiler setup that was more aggressive about using precompiled-header-like things. and if you're trying to abuse g++ as a jit, you could maybe write a small, self-contained header that the compiler can handle quickly, and not call any standard library functions from the generated code


I think you've struck on the actual reason: Rust programmers don't perceive compile times as slow, and don't really view it as a problem. Thus, nobody works on making them faster.

Every language has tradeoffs, and every language community has priorities. In general, the Rust community doesn't care about compilation speed. For now, the community has basically decided that incremental cached compilations are good enough.

Which is fair, because there's only so many engineering hours, and the language has a lot of other priorities that fast to compile languages like Go ignore.

I'm biased towards C and Go's way of thinking about language design, which I know a lot of other people hate. But there's also the universal problem that once you introduce a feature into a language, people will have a field day using it in contexts where it's not needed. Just like Perl programmers have never met a regex they didn't like, and C++ programmers have never heard of a bad operator overload, Rust programmers have never seen a bad procedural macro or external crate dependency. Showing just a little bit of restraint with complex or slow-to-compile language features goes a long way, but it seems like most devs (in all languages) can't resist. Go is partially fast to compile because it just tells devs they aren't allowed to do 90% of the slow-to-compile things that they want to do.

Powerful languages like Rust and C++ give devs the choice, and they almost always choose the slow-to-compile but elegant option. At least, until the build hits an hour, then they wish compile times were faster. For the record, I'm not bashing C++ or Rust, I'm a C++ developer by trade.


haha, yes, exactly

probably nobody but distribution packagers and bsd committers would care about the compile time if it happened while you were editing the code


> of course you can make compiling c++ arbitrarily slow with templates

This might be my problem :/ (templates are the closest to metaprogramming I can find outside of Lisps)

Tbf I was mostly comparing my experience with Rust, SBCL and C++; to me it was a given that C was an order of magnitude faster (3 orders of magnitude seems a bit much). I found opt-level=1 quite early and managed to feel way better about rust and let C++ go (i was toying with polynomial regressions) (I rolled my own matrix library :D never do that!)

Thank you for the information.


yeah! you can get an enormous amount of metaprogramming mileage out of c++ templates. i think the pattern-matching paradigm embodied by sfinae is maybe a better fit for, effectively, user-defined language extensions, than the more straightforward imperative approach lisp uses by default. but c++ templates are unnecessarily hard to debug i think, for reasons that aren't inherent to the pattern-matching paradigm

i didn't get c to compile three orders of magnitude faster, just 44×–56× faster (4400% to 5500%). sorry to be confusing!

i've certainly experienced the temptation to roll my own matrix library more than once, and i'll definitely have to do it at least once for the zorzpad. i may do something this week in order to understand the simplex method better; my operations research class was years ago, and i've forgotten how it works, probably because i never implemented it in software, just worked through examples by hand and on the chalkboard


Honestly, it was a school project, i had time and my final internship was a month away, so i took the time to do it. Barely finished in time and it was quite lousy, but i was proud of it. It was my peak "Dunning-Kruger", because i was probably the most mathematically inclined of all my classmates, and thought i was really clever.

Funny stuff: during my final internship i made heavy use of scikit-image, learned about OpenBLAS and understood how much better low-level libraries were for matrix computation, and how far away my own library was. And at my next job i was setting up our PaaS VMs with a lot of stuff, including TitanX with Cuda and pytorch, informed myself on the tools i was installing (i did set up tutorials in notebooks for an easy launch), and then understood i was years behind and way less informed than i thought i was. I think i learned about HN around that time.


How much time does the parsing step take when compiling c++, relatively speaking? Is it actually significant compared to everything else that happens?


If you are parsing awk, you must treat any run of whitespace that contains a newline as a visible token, which you have to reference in various places in the grammar. Your implementation will likely benefit from a switch in the lexical analyzer that sometimes turns off the visible newline.
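A small illustration of where that switch matters: a newline normally terminates a statement, but after tokens like `&&`, `||`, `,`, or `{` the lexer suppresses the "visible" newline (a sketch):

```shell
# The newlines after { and && do not end the statement,
# so this parses and prints "ok":
awk 'BEGIN {
  if (1 &&
      2)
    print "ok"
}'

# By contrast, breaking the line at an arbitrary point in an
# expression (say, right after "x =") would be a syntax error.
```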


Another tricky bit is deciding whether "/" is the division operator or the start of a regular expression.

IIRC, awk does this in a context-sensitive manner, by looking at the previous token.
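A small demonstration of the two readings (a sketch):

```shell
# After an operand, "/" is the division operator:
awk 'BEGIN { x = 10; print x / 2 }'

# Where an expression is expected, "/" opens a regex literal:
printf 'hello\n' | awk '/ell/ { print "matched" }'
```

The first prints `5`, the second `matched`.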


Surely it is AWKward?


just use raku


Reading awk as a human is hard too. And performance of awk is crap. A lot slower than most interpreted languages out there. I had replaced all the awk scripts in python and everything is a lot faster.


> And performance of awk is crap. [...] I had replaced all the awk scripts in python and everything is a lot faster.

My experience points exactly the other way: for data-processing tasks, especially streaming ones, even Gawk is a lot faster than Python (pre-3.11), and apparently I’m not the only one[1]. If you’re not satisfied with Gawk’s performance, though, try Nawk[2] or, even better, Mawk[3]. (And stick to POSIX to ensure your code works in all of them.)

[1] https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-...

[2] https://github.com/onetrueawk/awk

[3] https://invisible-island.net/mawk/


Do you know of any performance comparisons vs. PyPy? I find it works extremely well as a drop-in replacement for CPython when only the built-in modules are needed, which should generally hold for awk-like use cases. Yet some brief searching doesn't seem to yield any numbers.


You gotta share the code for how you are doing it. If you are using an awk alternative, you would be comparing against pandas or pypy. I will do a comparison as soon as I am free.


Discussing performance only makes sense in the context of a particular awk implementation, like TFA is doing as well. If you're (stuck) on gawk, try setting LANG=C to prevent Unicode/multi-byte regexp execution, or switch to mawk (which according to [1] is much faster than cpython).

[1]: https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-...
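A minimal sketch of the locale trick (using `LC_ALL`, which overrides `LANG`, and inlined sample data for illustration):

```shell
# Run in the C locale so gawk can skip multi-byte character handling;
# on large ASCII inputs this alone can speed gawk up noticeably.
# Total byte count of all lines: prints 5 for this sample.
printf 'ab\ncde\n' |
LC_ALL=C awk '{ n += length($0) } END { print n }'
```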


Honestly only makes sense in the context of a Python library and implementation as well, since so many libraries use C extensions in order to speed up processing. Also, Python has gotten a lot faster over time.


We gotta compare against pypy, or cpython plus pandas then


Awk is blazingly fast for some operations. I remember using it to solve Project Euler problem 67 [0] in a couple of milliseconds, which is more comparable to C/Rust than Python. Weirdly the forum posts from between 2013 and 2023 are missing so I can't see what I wrote there.

[0] https://projecteuler.net/problem=67
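For reference, that problem is a maximum path sum over a triangle, and the classic bottom-up DP fits in a few lines of awk. Sketched here on the small example triangle from problem 18 (whose answer is 23), rather than the full problem 67 input:

```shell
# For each row, keep the best path sum reachable from the bottom,
# folding the triangle upward until only the apex remains.
printf '3\n7 4\n2 4 6\n8 5 9 3\n' |
awk '{ for (i = 1; i <= NF; i++) t[NR, i] = $i; n = NR }
     END {
       for (r = n - 1; r >= 1; r--)
         for (i = 1; i <= r; i++)
           t[r, i] += (t[r+1, i] > t[r+1, i+1] ? t[r+1, i] : t[r+1, i+1])
       print t[1, 1]   # prints 23
     }'
```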


skill issue


Sure. I do not live in the terminal. But I work with Linux enough to comfortably navigate around and read various shell scripts with relative ease. With the exception of awk. Which to me signals that, at least in my case, awk has a higher barrier to entry compared to most other things in the same environment.

So with alternatives around I can more easily parse myself, I happily concede that I have a skill issue with awk.


Well, using awk because you are familiar with it could be due to a skill issue with other languages too. Can't use python for parsing? Skill issue I guess, going by your logic.


Even the eminent Mr. A., W., and K. had sKiLl isSueS when designing this language, apparently. You can only ask so much from regular programmers.


once there are more productive alternatives that require less specialized "skill", your condescending "skill issue" becomes a devex issue, and basically a productivity gap which will doom your language or tool.


You just need to have the skill to overcome whatever non-technical, legacy, lack of education, or poor judgement issues that are steamrolling you into choosing to use awk instead of a sane rational decent modern efficient maintainable language.


To be fair, sometimes awk is just faster to call. In all other cases, as my sibling says, use perl :D


Perl, then?


The rule of thumb back at Netcraft was to prototype in awk/sed for brevity/expressiveness and then port to perl for production use for performance reasons.

Been a couple decades since I was wrangling the survey systems there though, no idea what it looks like now.


i very much appreciate the server surveys; for a time i read the report every month!


As a dare from a friend I compared my Perl solution to an AWK solution:

  $ time perl -MData::Dumper -ne '$n{length($_)}++; END {print Dumper(%n)}' bigfile.txt

  $VAR1 = '1088';
  $VAR2 = 349647;

  real    0m1.326s
  user    0m0.814s
  sys     0m0.371s

  $ time awk 'length($0) > max { max=length($0) } END { print max }' bigfile.txt

  1087

  real    0m21.400s
  user    0m18.596s
  sys     0m0.455s
I prefer Perl, but I have no issue with AWK and I actually use it frequently.


Well. I don't know. Those two programs don't really do the same thing. There are an awful lot of comparisons in the second one. After making the awk program more similar to the Perl program, and using mawk instead of gawk (gawk being quite a bit slower), the numbers look a bit different:

  $ seq 100000000 > /tmp/numbers 
  $ time perl -MData::Dumper -ne '$n{length($_)}++; END {print Dumper(%n)}'  /tmp/numbers 
  $VAR1 = '7';
  $VAR2 = 900000;
  $VAR3 = '8';
  $VAR4 = 9000000;
  $VAR5 = '5';
  $VAR6 = 9000;
  $VAR7 = '4';
  $VAR8 = 900;
  $VAR9 = '6';
  $VAR10 = 90000;
  $VAR11 = '10';
  $VAR12 = 1;
  $VAR13 = '2';
  $VAR14 = 9;
  $VAR15 = '3';
  $VAR16 = 90;
  $VAR17 = '9';
  $VAR18 = 90000000;
  
  real 0m16.483s
  user 0m16.071s
  sys 0m0.352s
  $ time mawk '{ lengths[length($0)]++ } END { max = 0; for(l in lengths) if (int(l) > max) max = int(l); print max; }' /tmp/numbers 
  9

  real 0m5.980s
  user 0m5.493s
  sys 0m0.457s
[edit]: Actually had a bug in the initial implementation. Of course.


I used them both to find the longest line in a file. The Perl option just spits out the number of times each line length occurs. It will get messy if you have many different line lengths (which was not my case).

You also have to take into account that awk does not count the line terminator.

Let's try the opposite: make the Perl script more like the AWK one.

  $ time perl -ne 'if(length($_)>$n) {$n=length($_)}; END {print $n}'  rockyou.txt 
  286

  real 0m2,569s
  user 0m2,506s
  sys 0m0,056s

  $ time awk 'length($0) > max { max=length($0) } END { print max }' rockyou.txt 
  285

  real 0m3,768s
  user 0m3,714s
  sys 0m0,048s


`perl -lne ...` to have perl strip the trailing newlines like awk does. Should give the same result with it.


You're right. It even makes the times converge.



