Show HN: Xzgrep is twice as fast with dash versus bash
2 points by chasil on Oct 17, 2022 | 5 comments
In the CentOS world, a shell script written by "Charles Levert" has been extended to provide compressed-file grep for gzip, bzip2, and xz as /bin/zgrep, /bin/bzgrep, and /bin/xzgrep; variants of this script are bundled in their respective packages.

The dash shell, available in EPEL for CentOS, appears to speed up xzgrep considerably when the script's first line is changed from #!/bin/sh to #!/bin/dash.

In my testing on a 32-bit platform under xargs, a search of 14k files of 20M total size runs in half the time.

Do others see this performance increase?
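
For anyone who wants to try this without editing the shebang, one option is to pass the script to each interpreter explicitly. This is only a rough sketch: `bench` is a hypothetical helper, it assumes dash and bash are both installed and that `find -print0` / `xargs -0` are available, and it reports whole seconds only.

```shell
#!/bin/sh
# Hypothetical helper: run a script under a given interpreter over every
# .xz file in a directory (via xargs) and report elapsed wall-clock seconds.
# Usage: bench <shell> <script> <pattern> <dir>
bench() {
    shell=$1 script=$2 pattern=$3 dir=$4
    start=$(date +%s)
    # -print0 / -0 keep filenames containing spaces or newlines intact.
    find "$dir" -type f -name '*.xz' -print0 |
        xargs -0 "$shell" "$script" "$pattern" >/dev/null 2>&1
    end=$(date +%s)
    echo "$shell: $((end - start))s"
}
```

Something like the following would then compare the two, with the pattern and directory as placeholders:

  bench dash /usr/bin/xzgrep 1234567 .
  bench bash /usr/bin/xzgrep 1234567 .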

Is this another tangible benefit of Debian/Ubuntu moving the system shell from bash to their Almquist shell derivative?




If the script was running under bash, a noticeable speedup with dash is not surprising. Having used NetBSD's sh for many years as an interactive shell, I find bash on Linux far too bloated and slow. I modified dash (a derivative of NetBSD sh) for use as an interactive shell as well as a non-interactive one by adding tab completion. I like to write quick scripts without having to include #!/bin/sh at the top. I get no benefit from having two shells: one interactive and the other non-interactive.


Have you traced what system calls xzgrep is making in one shell vs the other? I ask because there are a few odd behaviors in each shell that can slow down some operations. A funny one I ran into was that leaving the TZ variable unset slowed down some system calls. This was also on CentOS, which I have not used in a while.
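
One way to make that comparison, assuming strace is installed, is to run the script under each interpreter with `strace -f -c`, which follows child processes and writes a per-syscall summary of counts and time. A minimal sketch; `trace_summary` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: write a per-syscall summary for one run of a
# script under the given interpreter.
# Usage: trace_summary <shell> <outfile> <script> [args...]
trace_summary() {
    shell=$1 outfile=$2
    shift 2
    # -f follows child processes (xz, grep, ...); -c prints call counts
    # and time per syscall; -o sends the summary to a file.
    strace -f -c -o "$outfile" "$shell" "$@" >/dev/null 2>&1
}
```

Running it once with dash and once with bash, then diffing the two summary files, should show which calls account for the difference:

  trace_summary dash dash.txt /usr/bin/xzgrep 1234567 *.xz
  trace_summary bash bash.txt /usr/bin/xzgrep 1234567 *.xz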


I have not done an strace.

I did try a much larger dataset, and the speed advantage does decrease.

Test files - number and total size:

  $ ls *.xz | wc -l
  59908
  $ du -b *.xz | awk '{s += $1}; END {print s}'
  444456636
  $ echo $((444456636 /1024 /1024))
  423
Dash trial:

  # head -1 /usr/bin/xzgrep
  #!/bin/dash

  # time xzgrep 1234567 *.xz 2> /dev/null
  ...
  real 16m41.275s
  user 8m50.386s
  sys 11m0.066s

  # time xzgrep 1234567 *.xz 2> /dev/null
  ...
  real 16m1.004s
  user 8m41.904s
  sys 10m45.139s
Bash trial (/bin/sh is bash on CentOS):

  # vi /usr/bin/xzgrep
  # head -1 /usr/bin/xzgrep
  #!/bin/sh

  # time xzgrep 1234567 *.xz 2> /dev/null
  ...
  real 21m53.975s
  user 7m23.522s
  sys 22m10.284s
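
For reference, the wall-clock ratio between these runs can be worked out from the `real` lines. A small sketch; `to_seconds` and `ratio` are hypothetical helpers:

```shell
#!/bin/sh
# Convert a `time`-style value such as 21m53.975s to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of two timings, to two decimal places.
ratio() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", a / b }'
}

ratio 21m53.975s 16m1.004s    # /bin/sh run vs second dash run: prints 1.37
ratio 21m53.975s 16m41.275s   # /bin/sh run vs first dash run:  prints 1.31
```

So on this dataset the /bin/sh run took roughly 31% to 37% longer than the dash runs in wall-clock time.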
P.S. The wrapper script fails for filenames that contain spaces. It could definitely use some developer attention.
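
A common cause of this kind of failure is an unquoted variable expansion, which the shell splits on whitespace. Whether that is the exact cause in xzgrep would need checking against the script itself; the following is only a generic illustration, and `count_args` is a hypothetical helper:

```shell
#!/bin/sh
# Print how many arguments were received.
count_args() { echo "$#"; }

file='my log.xz'

count_args $file      # unquoted: split into "my" and "log.xz", prints 2
count_args "$file"    # quoted: passed as a single argument, prints 1
```

Quoting every expansion of a filename ("$file", "$@") is the usual fix.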


Dash was developed to be more POSIX-compatible.


The Almquist shell first appeared in 1989, which predates the POSIX.2 standard shell from 1992. The new POSIX features were likely retrofitted onto the code.

https://en.wikipedia.org/wiki/Almquist_shell

https://en.wikipedia.org/wiki/POSIX#POSIX.2

Bash was introduced in the same year as the Almquist shell.

https://en.wikipedia.org/wiki/Bash_(Unix_shell)



