Hacker News new | past | comments | ask | show | jobs | submit login
How I wound up finding a bug in GNU Tar (utoronto.ca)
189 points by fanf2 on Dec 26, 2018 | hide | past | favorite | 46 comments

FTR: This gdb script will print full stack traces for every read syscall made:

  catch syscall read
Put that into a file `trace-read.gdb` and attach to a running process like so:

  gdb -x trace-read.gdb -p $(pgrep -n tar)
This assumes you are running an executable with built-in debugging symbols (gcc -g). It should be possible to side-load debugging symbols provided in an external package, though I don't have a command at hand. (Anyone?)

This works well enough on Linux, to quickly debug situations like the one above. However, it can make the attached process painfully slow, and occasionally bring it down all-together.

The right tool for this kind of in situations where slowing down or crashing the process is not acceptable, is DTrace:

  dtrace -p $(pgrep -n tar) -n 'syscall::read:entry { ustack(); }'
- Tutorial: https://wiki.freebsd.org/DTrace/Tutorial

- ustack: http://dtrace.org/guide/chp-user.html#chp-user-4

Ironically, DTrace is one of the main selling points for Solaris/OmniOS (or FreeBSD) over Linux. The situation has gotten better recently, with bpftrace becoming available:

- http://www.brendangregg.com/blog/2018-10-08/dtrace-for-linux...

- https://github.com/iovisor/bpftrace

Until you have a 4.x Kernel with the right configuration options running, I am afraid the above gdb scripts is your best option on Linux.

You can do this on Linux using perf, if you want less overhead and less impact on the application-under-test:

  $ perf record -e syscalls:sys_enter_read -g -- application arg1 arg2 ...

  [ application runs while perf writes out a log, recording every read() syscall, and keeping track of the backtrace each time]

  $ perf report -g --stdio
  [ perf reads the log, writing out the backtraces ]
This is the basic usage. Lots more is available, obviously. This has been available for a LONG time.

Good point. I did not know perf could print backtraces.

fun trivia: gnu tar predates the GNU project. What?

It started out as an implementation of tar written by John Gilmore whose uname is 'gnu'. gnu has written a lot of free software as well, but the fact that his name and the GNU project's name are the same is a complete coincidence.

Are you sure you're not confusing GNU tar with gnuplot? http://www.gnuplot.info/faq/faq.html#x1-70001.2

Yep, see another comment by me down thread.

Fascinating but is that actually true? As far as I have been able to search their implementation seems to have been called pdtar.

The FSF references this in a news article [1]. Stating that pdtar is what GNU tar was based on.

[1] https://www.fsf.org/news/2009-free-software-awards

Yes, that is definitely true. I was simply referring to the claim that pdtar was known as gnu tar.

pdtar was another name for gnu tar as can be seen on john's home page or wikipedia page if you can't access it.

I see references to John Gilmore being the original author of pdtar, and John Gilmore using the gnu handle, but I don't see any references to his usage of the name predating the GNU project or pdtar being referred to as gnutar before it became part of the GNU project.

I'm not saying that I disbelieve you, I just haven't been able to find any references for your claim.

ChangeLog.1 and NEWS only go back to after it was migrated to the GNU project (http://git.savannah.gnu.org/gitweb/?p=tar.git;a=tree), his personal site, http://www.toad.com/gnu/ seems to be down but an archive (https://web.archive.org/web/20180623140434/http://www.toad.c...) just says 'In 1985 I wrote the "pdtar" program, which eventually became GNU Tar.'

He clearly uses the name "gnu", but did that usage predate the GNU project? I haven't found a reference for that.

It sure did. We have been friends for longer then that.

While the bug is in tar, the API for the read(2) system call doesn't help. Its return value can be positive, which indicates the number of bytes read; negative, which indicates an error, unless errno is EINTR, in which case you should retry; or zero, which indicates either end of file, or the number of bytes read when the buffer length is zero (and in this case you can't distinguish end of file from a normal read). It's easy to forget to handle one of these five cases.

Issues like this are almost entirely solved with rust-esque enum pattern matching. The compiler with fail if you don’t have every match case handled. There is ‘_’ to abuse the system but doing so should be fairly obvious.


Rust in general is pretty exciting for systems work. I just wish there were more shops using it professionally.

Unfortunately, Rust kept nearly the same design for its `std::io::Read` trait. It only separates the two "error" cases from the three "success" cases, so you still have to check for `ErrorKind::Interrupted` in the error case, and for `Ok(0)` in the success case (and when the passed buffer has zero length, you still can't distinguish between end-of-file and success, though that's not Rust's fault). A better design could have used a return type with more alternatives than just `Ok(n)` or `Err(e)`.

I'm guessing that's probably for cross-platform behavior. Maybe a linux specific std replacement crate could probably solve the problem.

" It's easy to forget to handle one of these five cases."

/feels cold shiver wondering if i've every used read()/

The issue with EINTR is pretty standard between most syscall; if you get one you should retry (or, depending on what flags you have set, this can be done for you automatically in some cases). A return value of zero is an EOF, there is no such thing as an “empty read”. Why would you pass in an empty buffer?

You can get an empty read if you have setup the FD to be nonblocking.

That would be a sixth case I had forgotten above (negative return value with errno EAGAIN and/or EWOULDBLOCK - these two might or might not be the same number).

Apparently the use of an empty read is to probe to see if you get various errors or an immediate 0-size return. This is merely a 'may' requirement in the Single Unix Specification, so I'm not sure how portable it is in practice.

> or zero, which indicates either end of file,

EOF is not zero.

It cannot be, because it must be outside the range of `char`.

On my (linux) system, EOF happens to be -1.

This is explained in in the first chapter of K&R.

Do not confuse getchar(3) with read(2). From the manual page for read(2) (http://man7.org/linux/man-pages/man2/read.2.html):

       On success, the number of bytes read is returned (zero indicates end
       of file), and the file position is advanced by this number. [...]

       On error, -1 is returned, and errno is set appropriately.  In this
       case, it is left unspecified whether the file position (if any)

Ah yes, I did in fact mistake the function in question.

I wonder why they chose not to use out of range values for EOF there, given that `getchar` surely predates `read`?

I was going to assert that read significantly predates getchar, but it turns out that both may have existed from V1 Unix onward. However, getchar in V1 through V6 had a different EOF value from our familiar modern one of (-1); they all instead returned a 0 byte to indicate EOF.

(The V7 getc/getchar/etc manpage notes this as a BUG, although it doesn't specifically document what the V1-V6 EOF was. Presumably everyone who this was relevant for already knew. All of this is based on the historical Unix trees available through www.tuhs.org.)

With read(), the return value is separate from the data read. It is arguably elegant to read as many characters as possible and to not consider hitting the end of the file as an error—just a point that can't be read beyond, like a stop on a piece of machinery.

With getchar(), the decision was made to return the char directly (to have used a pointer to a single char would be daft, from the point of view of economy) and to use an in-band code to indicate end of file. This "in-band signalling" is considered a bit of a kludge by some people. So I'd say read() is a better/cleaner design.

See https://en.wikipedia.org/wiki/Semipredicate_problem for some related thoughts.

Ah, yes, that Semipredicate problem article is very informative, thanks for sharing.

> if you run GNU Tar with --sparse and a file shrinks while tar is reading it, tar fails to properly handle the resulting earlier than expected end of file.

Is that a bug? Yes. It's different than changing the data of file but still.

Clearly it’s a bug because tar goes into a mad loop reading nothing and outputting empty blocks forever, or until the file changes size again. It needs to check for eof to avoid this mistake, even if the eof is unexpected because the file size was different.

I think most gnu utils are written so that they treat input files as a stream of data so you can also use pipes instead of regular files. This also means they can somewhat handle size changes since they will keep reading until they encounter an EOF without relying on an earlier snapshot of the file size or sparse chunks. Whether the results are useful depends on the file format, if it's just temporary padding as in this case it works.

Yes, it's expected for a general system utility like tar to handle such simultaneous file operations gracefully.

Agreed. OTOH, it is not a bug if the fact that tar is not instantaneous causes the tarball to contain combinations of files unlikely to occur during normal operation of the system. e.g., if a file is renamed while tar is running it is not a bug for the tarball to contain zero or two copies of that file.

It is a bug for tar to enter an infinite loop writing an ever expanding file of trailing zeroes because the current file was truncated to a size smaller than its stat returned.

It's basically a classic TOCTOU race, tar needs to honor the EOF returned from read() as authoritative.

This is actually a very common trap for junior *nix programmers; treating stat() and a subsequent mmap() or read() loop as if they were atomic with regards to stat.st_size. The size returned by stat() can be used as an estimate, but otherwise can't be applied to subsequent operations.

A relevant question about tar and backups:


Finding a bug in GNU tar is not hard... just ask Jörg Schilling.

One of the first things I do when I download source code is decompress it and then de-archive it with GNU tar, then promptly re-archive it wit a real, AT&T UNIX System V 4.0 tar, because it's a known fact among us old UNIX folks that GNU tar is buggy as hell and only GNU tar can correctly read GNU tar (that wasn't the point of a tape archiver, but GNU people didn't get that memo).

Just to provide some context, Jörg Schilling is author of star[1], which is listed among key implementations of tar on wikipedia[2].

One can check the problems he noticed in gnu tar implementation in the readme file of his start project:


But note that the last version of star was released in 2008 and README.otherbugs file is actually from 2001.

[1] https://sourceforge.net/projects/s-tar [2] https://en.wikipedia.org/wiki/Tar_(computing)#Key_implementa...

Kind of curious as to where you get that version of tar from.

Either from Heirloom tools, or by running any freeware, open source operating system which is built from the illumos source code, like SmartOS. illumos code base is the most direct, canonical descendant of AT&T UNIX System V 4.0. If you run anything based on illumos not only are you running a true UNIX, but a reference implementation of it.

Additionally, any true System V UNIX like IRIX and HP-UX will have it.

that’s cool! thanks for the write up. i learned a bit

I think he should have reported this but writing it up as a fairly detailed blog entry.

I rather enjoyed reading the adventure of hitting a problem in a large system, and then working towards narrowing down the specific bug or set of bugs.

Such posts have in the past been super helpful for me personally (and to others I imagine), in going from plain weeping that stuff just randomly breaks, to learning to enjoy examining the possible causes and understanding how to explain the problem concisely and with enough detail to make it useful to a developer.

He’ll be able to reuse much of his blog post in an great bug report with easy steps to reproduce the problem, excepted and actual outcomes of following those steps, and a scenario where it will happen in real deployments as well as a providing a workaround.

I think the guy just wants to enjoy the break. Talking and responding to an email/bug thread is bit more involved than just shouting text to your blog.

Give him a couple days! You're responding to the PPS so take the entire PPS into account.

The blog post author is one who routinely submits bug reports upstream, and even code fixes where he can.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact