Hacker News new | comments | ask | show | jobs | submit login

I'm surprised that neither GNU yes nor GNU cat uses splice(2). Here's what I get. Note that ./rust is the final Rust program from the article modified to print stats and exit after a set byte count[1], ./splice is a simple C program that uses splice to copy the input string from a generated anonymous temporary file to stdout, and ./consume uses splice to move data from stdin to /dev/null. In all tests the default string of "y\n" is used.

  $ ./rust | cat >/dev/null
  5.220 GB/s (8589934592 bytes in 1.645 seconds)

  $ ./splice | cat >/dev/null
  14.514 GB/s (8589934592 bytes in 0.551 seconds)

  $ ./rust | ./consume 
  9.312 GB/s (8589934592 bytes in 0.922 seconds)

  $ ./splice | ./consume 
  25.118 GB/s (8589934592 bytes in 0.318 seconds)
As you can see, cat(1) is the real bottleneck.

splice is Linux-only. Most systems have sendfile(2), including Linux, but I didn't test it. The implementation and semantics of sendfile vary across platforms.

[1] I did this before I wrote ./consume (which takes an optional byte count limit), having assumed that GNU cat was using splice. As cat doesn't have a way to limit the number of bytes read/written, and other tools like head or tail definitely don't use splice, the simplest way to limit the benchmark without introducing a bottleneck was to have the producers themselves print stats and exit. The relevant diff is

    + let mut count = 0u64 as usize;
    + let bytes = (1u64 << 33) as usize;
    -while locked.write_all(filled).is_ok() {}
    +while locked.write_all(filled).is_ok() {
    +  count += filled.len();
    +  if count >= bytes {
    +      break;
    +  }

> I'm surprised that neither GNU yes nor GNU cat uses splice(2). ... splice is Linux-only.

The fact that splice is Linux only is most likely why. GNU programs are portable to many different OSes, many of which no longer exist in any real form. They try to not use any 'single OS' specific feature if at all possible.

splice(2) and tee(2) were basically tailor made for cat(1) and tee(1), and for some reason I was under the impression that GNU tee was using splice(2) and/or tee(2). Using these could be trivial--just a few extra lines of code that could fall-through to the existing methods, for a huge speed-up in performance. (Performance matters because the consumers could be CPU bound, and an inefficient cat or tee might be taking away resources that could be used by the consumer.)

Regarding portability, GNU tail uses the Linux-specific inotify(2) to respond faster to writes. Like alot of OSS, coreutils uses the BSD .d_type member extension[1] of struct dirent to avoid unnecessary stat() calls. There are many other more intrusive OS-specific details baked into coreutils, but often it's the nature of the problem--in many situations you're dependent on platform-specific details or features. For the most part, these nitty-gritty platform-specific details are far more intrusive in terms of code complexity than the performance optimizations.

[1] Missing from Solaris, and probably most other SysV derivatives.

Most importantly, GNU programs ought to work on Hurd.

Forgive me if I'm missing something subtle, but aren't the first two examples Useless Use Of Cat?

No, the use of cat is intentional.

The output of yes(1) is normally consumed by another program, not redirected to a file, so this is what the examples emulate.

Yes they are. Why not just send it straight to /dev/null?

1) Because splice(2) requires either the source or sink to be a pipe. The source is a regular file so the sink has to be a pipe.

2) Because the example Rust program(s), emulating yes(1), had no [simple] way to measure throughput except by piping to another program. We can't fairly compare program A that writes directly to /dev/null with program B that writes to a pipe even if program A can measure its throughput.

2a) What jwilk said.

3) For some reason I thought that glibc had optimizations to elide fwrites to /dev/null, and some of my code was using stdio (e.g. for the final trailing bytes less than the pagesize). I could've sworn either glibc or bash did this, but I can't find any mention of it, now. I realize it would be crazy difficult and ugly for glibc to do this (because of dup2, etc), but glibc does alot of crazy things, and in any event I didn't bother checking beforehand.

Mostly it comes down to fairly comparing benchmarks and kernel facilities. Otherwise, yes, those would be classic examples of Useless Use of Cat.

Show us the source of your programs. :)

Here's a cleanup version:

  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE 1
  #include <errno.h>    /* ERANGE errno */
  #include <inttypes.h> /* strtoumax(3) */
  #include <limits.h>   /* LONG_MAX */
  #include <stdint.h>   /* uintmax_t */
  #include <stdio.h>    /* fflush(3) fprintf(3) fread(3) fwrite(3) tmpfile(3) */
  #include <stdlib.h>   /* EXIT_FAILURE */
  #include <string.h>   /* strlen(3) */
  #include <err.h>      /* err(3) errx(3) */
  #include <fcntl.h>    /* loff_t splice(2) */
  #include <sys/time.h> /* struct timeval gettimeofday(2) */
  #include <unistd.h>   /* _SC_PAGESIZE pread(2) sysconf(3) write(2) */
  #ifndef HAVE_SPLICE
  #define HAVE_SPLICE __linux__
  #define MIN(a, b) (((a) < (b))? (a) : (b))
  #define MAX(a, b) (((a) > (b))? (a) : (b))
  #ifndef howmany
  #define howmany(x, y)  (((x) + ((y) - 1)) / (y))
  #define UMAX_PREC (sizeof (uintmax_t) * CHAR_BIT) /* NB: assumes no padding */
  #define UMAX_HALF ((UMAX_PREC + 1) / 2)
  #define UMAX_LO(n) ((n) & ((UINTMAX_C(1) << UMAX_HALF) - 1))
  #define UMAX_HI(n) ((n) >> UMAX_HALF)
  static inline _Bool
  add_overflow(uintmax_t *r, const uintmax_t a, const uintmax_t b)
          if (~a < b)
                  return 1;
          *r = a + b;
          return 0;
   * Implement multiplication using a polynomial with four multiplications and
   * three additions, except we can optimize out some operations.
  static inline _Bool
  mul_overflow(uintmax_t *_r, const uintmax_t _a, const uintmax_t _b)
    uintmax_t a[2] = { UMAX_LO(_a), UMAX_HI(_a) };
    uintmax_t b[2] = { UMAX_LO(_b), UMAX_HI(_b) };
    uintmax_t r[2];
    /* if both are non-0, we'd always overflow */
    if (a[1] && b[1])
      return 1;
    /* either a[1] or b[1] must be 0 here, so no intermediate overflow */
    r[1] = (a[1] * b[0]) + (a[0] * b[1]);
    /* if the result has MSW bits set, we'd overflow */
    if (UMAX_HI(r[1]))
      return 1;
    r[0] = a[0] * b[0];
    return add_overflow(_r, r[0], r[1] << UMAX_HALF);
  #if 0
  static uintmax_t
  add(uintmax_t a, uintmax_t b)
    uintmax_t r;
    if (add_overflow(&r, a, b))
      errx(1, "arithmetic overflow (%ju * %ju)", a, b);
    return r;
  static uintmax_t
  mul(uintmax_t a, uintmax_t b)
    uintmax_t r;
    if (mul_overflow(&r, a, b))
      errx(1, "arithmetic overflow (%ju * %ju)", a, b);
    return r;
  static uintmax_t
  toumax(const char *p)
    char *pe;
    uintmax_t n;
    errno = 0;
    if (UINTMAX_MAX == (n = strtoumax(p, &pe, 10)) && errno != 0)
      err(1, "%s", p);
    if (*pe != '\0' || p == pe)
      errx(1, "%s: invalid number", p);
    return n;
  static uintmax_t
  gcd(uintmax_t a, uintmax_t b)
    uintmax_t c;
    while (a != 0) {
      c = a;
      a = b % a;
      b = c;
    return b;
  static uintmax_t
  lcm(uintmax_t a, uintmax_t b)
    return mul(a, b) / gcd(a, b);
  static size_t
    long n = sysconf(_SC_PAGESIZE);
    if (n <= 0)
      err(1, "sysconf");
    return n;
  static double
    struct timeval now;
    gettimeofday(&now, NULL);
    return now.tv_sec + (now.tv_usec / 1000000.0);
  main(int argc, char **argv)
    _Bool cflag = 0, nflag = 0;
    uintmax_t count = UINTMAX_MAX, lines = 0, total = 0;
    int optc;
    while (-1 != (optc = getopt(argc, argv, "c:n:"))) {
      switch (optc) {
      case 'c':
        if (nflag)
          errx(1, "-c and -n mutually exclusive");
        count = toumax(optarg);
        cflag = 1;
                  case 'n':
                    if (cflag)
          errx(1, "-c and -n mutually exclusive");
        lines = toumax(optarg);
        nflag = 1;
        return EXIT_FAILURE;
    argc -= optind;
    argv += optind;
    const char *const line = (argc > 0)? argv[0] : "y";
    const size_t linesize = strlen(line) + 1;
    const size_t pagesize = xgetpagesize();
    const size_t blocksize = MAX(mul(pagesize, 16), lcm(linesize, pagesize));
    const uintmax_t filesize = mul(blocksize, howmany((1UL << 20), blocksize));
    if (nflag)
      count = mul(linesize, lines);
  //  fprintf(stderr, "linesize:  %zu\n", linesize);
  //  fprintf(stderr, "pagesize:  %zu\n", xgetpagesize());
  //  fprintf(stderr, "blocksize: %ju\n", blocksize);
  //  fprintf(stderr, "filesize:  %ju\n", filesize);
    FILE *fh;
    if (!(fh = tmpfile()))
      err(1, "tmpfile");
    for (size_t i = 0; i < filesize; i += linesize) {
      int n = fprintf(fh, "%s\n", line);
      if (n == -1)
        err(1, "fprintf");
      if ((size_t)n != linesize)
        errx(1, "wrote %d bytes, expected %zu", n, linesize);
    if (0 != fflush(fh))
      err(1, "fflush");
    double begin = gtod();
    int fd = fileno(fh);
    _Bool eof = 0;
    do {
      loff_t p = 0;
      size_t r;
      ssize_t n;
      /* NB: no LOFF_MAX available */
      _Static_assert(sizeof p <= sizeof (long), "unexpected type for loff_t");
      if (filesize > LONG_MAX)
        errx(1, "filesize too large (%ju > %jd)", filesize, (intmax_t)LONG_MAX);
      while ((size_t)p < filesize && count >= blocksize && !eof) {
        r = MIN(blocksize, count);
        r -= r % blocksize;
        n = splice(fd, &p, STDOUT_FILENO, NULL, r, SPLICE_F_MOVE);
        if (n == -1)
          err(1, "splice");
        p += n;
        count -= n;
        total += n;
        eof = n == 0;
    } while (!eof && (count || (!cflag && !nflag)));
      char buf[BUFSIZ], *p, *pe;
      size_t n;
      while (count) {
        if (!(n = fread(buf, 1, MIN(sizeof buf, count), fh))) {
          if (ferror(fh))
            err(1, "fread");
        p = buf;
        pe = &buf[n];
        while (p < pe) {
          if (!(n = fwrite(p, 1, (size_t)(pe - p), stdout)))
            err(1, "fwrite");
          p += n;
          count -= n;
          total += n;
    if (0 != fflush(stdout))
      err(1, "fflush");
    double elapsed = gtod() - begin;
    fprintf(stderr, "%.3f GB/s (%zu bytes in %.3f seconds)\n", ((double)total / elapsed) / (1UL<<30), total, elapsed);
    return 0;

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact