
Road to OCIv2 Images: What's Wrong with Tar? - dankohn1
https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar
======
klodolph
I think articles like this are important... we need a good way of collecting
knowledge (and especially frustrations) like this in a compact format.
Everyone who thinks, "I'll use tar," in response to some problem involving
archives, backups, package distribution, container distribution, etc. will at
some point run head-first into one of the things that makes tar _weird._
Without good collections of complaints, somebody comes in to solve the problem
and ends up only solving the corner of it that they personally care about.

Something I personally find frustrating is that I'd like to be able to create
a tar archive without recording UIDs, usernames, or timestamps. This seems a
pretty reasonable thing to want, yet the command-line tools aren't really
built for it, and libraries which create tar files record all of that
metadata by default. (This is covered under the reproducible builds section.)
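
For what it's worth, most tar libraries will let you blank those fields
yourself if you write the headers by hand. A minimal sketch with Go's
archive/tar (the file name and contents here are made up):

    package main

    import (
        "archive/tar"
        "os"
        "time"
    )

    func main() {
        out, err := os.Create("stripped.tar")
        if err != nil {
            panic(err)
        }
        defer out.Close()

        tw := tar.NewWriter(out)
        defer tw.Close()

        data := []byte("hello\n")
        hdr := &tar.Header{
            Name: "hello.txt",
            Mode: 0644,
            Size: int64(len(data)),
            // Zero out everything that varies between builds:
            // ownership, user/group names, and timestamps.
            Uid:     0,
            Gid:     0,
            Uname:   "",
            Gname:   "",
            ModTime: time.Unix(0, 0),
        }
        if err := tw.WriteHeader(hdr); err != nil {
            panic(err)
        }
        if _, err := tw.Write(data); err != nil {
            panic(err)
        }
    }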

The next thing that you're likely to want is some kind of random-access
archive, for which you'll use either zip (which sticks the directory at the
end) or squashfs, maybe.
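
Zip's directory-at-the-end is what makes the cheap random access work; for
example, Go's archive/zip can pull out a single member without streaming the
rest (a quick sketch, with made-up file names):

    package main

    import (
        "archive/zip"
        "fmt"
        "io"
    )

    func main() {
        // OpenReader parses the central directory stored at the end
        // of the file, so no full scan of the archive is needed.
        r, err := zip.OpenReader("layer.zip")
        if err != nil {
            panic(err)
        }
        defer r.Close()

        for _, f := range r.File {
            if f.Name != "etc/os-release" {
                continue
            }
            rc, err := f.Open()
            if err != nil {
                panic(err)
            }
            data, err := io.ReadAll(rc)
            rc.Close()
            if err != nil {
                panic(err)
            }
            fmt.Printf("%s: %d bytes\n", f.Name, len(data))
        }
    }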

Finally, the last thing that I often want is just a simple binary key-value
format which I can distribute and randomly access with reasonable efficiency.

Looking at things from the opposite perspective, I'm personally frustrated by
the cross-platform personal backup options. Currently, I'm using Duplicity or
some variant, but I find it a bit of a hassle, and as a result my backup
schedule is rather sporadic.

~~~
bruce_one
I recently learned about [SQLite
archives](https://www.sqlite.org/sqlar.html) which feel like an interesting
option to consider, especially with your comment about sometimes wanting a
"binary key-value format" in mind.

Because it's "just sqlite" it's possible to use the archive functionality in
archive-like ways, while also using the file as a sqlite db (because that's
all it is).
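
For example (a rough sketch, assuming the third-party
github.com/mattn/go-sqlite3 driver and some existing archive.sqlar file),
listing the contents is just SQL:

    package main

    import (
        "database/sql"
        "fmt"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", "archive.sqlar")
        if err != nil {
            panic(err)
        }
        defer db.Close()

        // The whole format is a single table:
        //   CREATE TABLE sqlar(name TEXT PRIMARY KEY, mode INT,
        //                      mtime INT, sz INT, data BLOB);
        // so ordinary SQL gives random access to individual entries.
        rows, err := db.Query("SELECT name, sz FROM sqlar ORDER BY name")
        if err != nil {
            panic(err)
        }
        defer rows.Close()

        for rows.Next() {
            var name string
            var sz int64
            if err := rows.Scan(&name, &sz); err != nil {
                panic(err)
            }
            fmt.Printf("%s (%d bytes uncompressed)\n", name, sz)
        }
    }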

It's not a perfect tool for all scenarios (and I'm thinking it wouldn't be
good for backups), but I feel there are definitely some very useful scenarios
for it.

------
jacques_chester
It may sound esoteric but this stuff is actually Super Damn Important, for at
least two big reasons:

1\. Security. Right now it is hard to trust a container image. Determining
provenance and/or reproducibility are hampered by the image format and
_especially_ hampered by the charming little quirks of tar.

2\. Performance. There are lots of optimisations that would be available if we
didn't need to stream out an entire layer before we could do anything useful
with it. The tar family of formats is, as the name suggests, all about creating
linear files intended to be saved _to a tape_. On a tape random access is
bonkers. For a container, random access is a frequent operation. On top of
that there's the whole hassle of shipping almost-but-not-quite-identical
layers over and over and over. There must be petabytes of wasted bandwidth
worldwide by now.

------
anarcat
I'm curious to hear what solution they are proposing for this, especially
since there are at least _two_ existing proposals in that field, in the form
of OStree and casync:
[https://github.com/ostreedev/ostree](https://github.com/ostreedev/ostree)
[https://github.com/systemd/casync/](https://github.com/systemd/casync/)

~~~
cyphar
There are also a few others, and I've spoken to Lennart about using casync.
It's a bit of a mixed bag because while casync does give us mostly what we
want (and I've had a very long thread on Twitter with Lennart about it), there
are some other concerns I've heard in the past few days from OCI users that
make me quite cautious about using a format that cannot be easily made into a
runtime format.

For instance, some folks want to have their runtime format be identical to the
image format (so that signatures of the image can be used to verify the
running containers). This is something that you cannot currently do with stock
OCI (though they have worked around it), but should be a consideration for a
future format. I only became aware of this concern after writing my blog post,
so I will have to include it in the next one. :P

~~~
jacques_chester
My last skim of casync suggested that it's heavily block-oriented. That's fine
for backups and FS duplication, but as a distribution format it doesn't really
fly.

A lot of folks want two levels of insight into the image's provenance or
trustworthiness:

1\. Has the image been tampered with? This is mostly solved by TUF / Notary.

2\. What are the files and where did they come from? This means you need a
file-level abstraction that is simple and fast.

Insofar as casync is block-oriented instead of file-oriented, it's a poor fit
for the second problem. It doesn't matter how efficient the streaming and
storage are if you make people download an entire layer each time they want to
check a single file.
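
To make point 2 concrete, here's a hypothetical sketch of the shape such a
file-level abstraction could take. These types aren't from any existing spec;
the point is just that verifying one file should only need that file's entry
and blob:

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // FileEntry and Manifest are hypothetical, not part of any
    // existing OCI structure.
    type FileEntry struct {
        Path   string   // path inside the image rootfs
        Mode   uint32   // permission bits
        Digest [32]byte // SHA-256 of the file contents
    }

    type Manifest struct {
        Entries []FileEntry
    }

    // VerifyFile checks one file against its manifest entry without
    // touching any other blob in the image.
    func VerifyFile(e FileEntry, contents []byte) bool {
        return sha256.Sum256(contents) == e.Digest
    }

    func main() {
        data := []byte("#!/bin/sh\necho hi\n")
        e := FileEntry{
            Path:   "/usr/local/bin/hi",
            Mode:   0755,
            Digest: sha256.Sum256(data),
        }
        fmt.Println(VerifyFile(e, data)) // true
    }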

------
jzwinck
Using Zip archives on Unix doesn't feel right, but it can solve a bunch of
these problems straight away, while remaining highly portable.

~~~
mwcampbell
How about 7z? How do 7z, zip, and tar compare in terms of support for Unixy
things like permissions, uid/gid, xattrs, and so on?

~~~
chungy
7z only supports DOS-style attributes, while Zip supports the full range of
Unix metadata like tar does.

~~~
jabl
IIRC the unix metadata is not part of the canonical zip format, but rather an
extension introduced by (?) infozip, an implementation of zip for Unix systems
(if you use the zip/unzip cli tools on a Unix style system, it's most likely
infozip).
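
Right, and that extension is well-enough established that libraries handle it
for you; e.g. Go's archive/zip stores Unix permission bits in the entry's
host-specific "external attributes" field (a minimal sketch, file name made
up):

    package main

    import (
        "archive/zip"
        "os"
    )

    func main() {
        out, err := os.Create("perms.zip")
        if err != nil {
            panic(err)
        }
        defer out.Close()

        zw := zip.NewWriter(out)
        defer zw.Close()

        hdr := &zip.FileHeader{Name: "script.sh", Method: zip.Deflate}
        // SetMode records the Unix mode bits in the header's external
        // attributes, the host-specific extension described above.
        hdr.SetMode(0755)
        w, err := zw.CreateHeader(hdr)
        if err != nil {
            panic(err)
        }
        if _, err := w.Write([]byte("#!/bin/sh\necho hello\n")); err != nil {
            panic(err)
        }
    }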

------
mjevans
A better solution might be to just emulate a block device with a sparse file
and mount it somewhere with a loopback device.

~~~
cyphar
This is what Singularity (and to a lesser extent, LXD) do. The main problem
with this is that you don't get de-duplication of transfer (or really of
storage) -- any small change in your published rootfs and you'd have to re-
download the whole thing. In addition, it requires that the system you're
mounting it on supports the filesystem you use (and that the admins are happy
using that filesystem).

There is also a potential security risk: filesystem drivers are generally not
hardened against malicious input, and plenty of attacks have been found
against the big Linux filesystems when you attack them with untrusted
filesystem data. This is one of the reasons auto-mounting USB drives is
generally seen as a bad security practice.

Don't get me wrong, there is a _huge_ benefit to using your runtime format as
your image distribution format. But there are downsides that are non-trivial
to work around. I am thinking about how to bridge that gap though.

~~~
ofrzeta
What about some kind of binary delta diffs such as bsdiff?

~~~
cyphar
Yes, and this is what LXD does. I think I mentioned it in the article, but
basically the issue is that it requires one of:

1\. A clever server, which asks which version you have so it can generate a
diff for you. This has quite a few drawbacks (storage and processing costs, as
well as making it harder to verify that the image you end up with is what was
signed by the original developers). But it does guarantee that you always get
de-duplication.

2\. Or you could pre-generate diffs for a specific set of versions, which
means it's a lottery whether or not users actually get transfer de-
duplication. If you generate a diff for _every_ version you're back to storage
costs (and processing costs on the developer side that grow with each version
of the container image). You could make the diffs only step forward one
version rather than jump straight to the latest, but then clients end up
pulling many binary diffs in sequence.

This kind of system has existed for a long time, in the BSDs as well as in
distributions that ship delta-RPMs (or the equivalent for debs). It works
_okay_ but it's far from ideal, and the other negatives of using loopback
filesystems only make it less ideal.

In my view, dumb formats are best.
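
To sketch what I mean by "dumb": with content-defined chunking you cut blobs
wherever a rolling hash of the bytes hits a fixed pattern, so identical
content produces identical chunks no matter who builds the image or where the
bytes sit. A toy buzhash-style version (not casync's or any other tool's
actual algorithm):

    package main

    import (
        "fmt"
        "math/rand"
    )

    const (
        window = 48            // rolling-hash window in bytes
        mask   = (1 << 13) - 1 // cut when low 13 bits are zero (~8 KiB chunks)
    )

    var table [256]uint64

    func init() {
        // Fixed seed: every producer must derive identical boundaries.
        r := rand.New(rand.NewSource(1))
        for i := range table {
            table[i] = r.Uint64()
        }
    }

    func rotl(x uint64, k uint) uint64 { return x<<(k%64) | x>>((64-k)%64) }

    // cutpoints returns chunk boundaries chosen purely from local
    // content, so an insertion early in a blob only reshapes nearby
    // chunks instead of shifting every later one.
    func cutpoints(data []byte) []int {
        var h uint64
        var cuts []int
        start := 0
        for i := 0; i < len(data); i++ {
            h = rotl(h, 1) ^ table[data[i]]
            if i-start >= window {
                // Slide the oldest byte out of the hash window.
                h ^= rotl(table[data[i-window]], window)
                if h&mask == 0 {
                    cuts = append(cuts, i+1)
                    start = i + 1
                    h = 0
                }
            }
        }
        return append(cuts, len(data))
    }

    func main() {
        blob := make([]byte, 1<<20) // stand-in for an image layer
        rand.Read(blob)
        fmt.Println(len(cutpoints(blob)), "chunks")
    }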

~~~
bruce_one
Using [zsync](http://zsync.moria.org.uk/) is an alternative option, from my
understanding?

I could be technically inaccurate, but my understanding is that it's rsync but
with the server serving a metadata file which allows the rsync-diffing to
happen from the client side rather than the server side - hence no clever
server required.

It also doesn't require diffing particular revisions; only the blocks that
differ will be fetched. It does require serving the metadata file, but those
aren't very large afaik.
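
As a toy sketch of the idea (fixed-size blocks for simplicity; real zsync
uses rolling checksums so it can match data at arbitrary offsets):

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    const blockSize = 4096

    // blockHashes hashes fixed-size blocks; this stands in for the
    // metadata file the server would publish alongside the image.
    func blockHashes(data []byte) []string {
        var hs []string
        for off := 0; off < len(data); off += blockSize {
            end := off + blockSize
            if end > len(data) {
                end = len(data)
            }
            hs = append(hs, fmt.Sprintf("%x", sha256.Sum256(data[off:end])))
        }
        return hs
    }

    // blocksToFetch is the client-side diff: compare the published
    // hashes against blocks we already have and return what's missing.
    func blocksToFetch(remote, local []string) []int {
        have := make(map[string]bool, len(local))
        for _, h := range local {
            have[h] = true
        }
        var need []int
        for i, h := range remote {
            if !have[h] {
                need = append(need, i)
            }
        }
        return need
    }

    func main() {
        current := make([]byte, 64*blockSize) // what the client has
        updated := append([]byte(nil), current...)
        copy(updated[10*blockSize:], []byte("one changed block"))
        fmt.Println(blocksToFetch(blockHashes(updated), blockHashes(current)))
    }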

------
zbentley
As someone who was not programming when many of the customizations described
in the article were introduced: this seems like a cautionary tale that goes
against the wisdom about "competing standards":
[https://xkcd.com/927/](https://xkcd.com/927/)

The desire to unify formats' competing back-compatibility needs created
something that was (because of the standards conflicts/schisms/reunifications)
extremely sub-par for most use cases, but (because of the time spent baking
common interfaces) just usable enough that it remained the primary basis for
storage formats for, if the author is to believed, far longer than it should
have.

I wonder how many other tools that are venerated because of their age and
ubiquity are similarly decrepit and broken when you peel back the curtain?

------
JdeBP
Coincidentally, I recently submitted an article by Thomas E. Dickey on the
subject of tar and portability.

* [https://news.ycombinator.com/item?id=18969290](https://news.ycombinator.com/item?id=18969290)

The hoo-ha over tar and cpio in the 1980s when the subject of standardization
came up, and the whole raison d'être of pax, is largely forgotten nowadays.
But the problems that were brought to light back then live on.

The thing to bear in mind is that we already have attempts to improve upon
this. The problematic formats in the 1980s were _themselves_ products of the
second-system effect, adding on various things to the original tape archive
format. We are now, in the second decade of the 21st century, well into
_umpteenth_ -system effect.

This brings up a point where this article is very wrong. The pax utility did
not appear in 2001. pax was in the POSIX draft back in the early 1990s, and
had become widespread enough to be in reference books by 1991. The article
conflates the PAX extensions with the pax utility. This, and the incorrect
dates, are errors promulgated by Wikipedia, which I challenged back in 2018 at
[https://en.wikipedia.org/wiki/Talk:Pax_(Unix)#Incomplete_Inf...](https://en.wikipedia.org/wiki/Talk:Pax_\(Unix\)#Incomplete_Information)
and which was probably the source of these errors in the article. Always
double-check what Wikipedia says on computing topics with a proper reference.

Before being tempted to reinvent the wheel here, I recommend a look at all of
the times during the so-called "archiver wars" where this wheel _already was_
reinvented, and learning from them. Of particular note, given this article, is
Rahul Dhesi's ZOO file format from 1986. It allowed for multiple generations
of a given file, and archive headers marked deleted files with a flag
(allowing them to be undeleted), which could be used for "whiteouts". It
suffers far less from an extension mess because support for filesystem
features from Unix, MS-DOS, and VMS (at least as they were in 1986) was
provided in the base data structures, including 8.3 and long filenames.

But really the basic error here is in using an _off-line archive_ format for
an _on-line live filesystem_ mechanism. It's the wrong data structure for the
job. Whereas there _are_ right data structures, and have been for years. The
history of filesystem formats development includes several cases of addressing
the very things mentioned in the article, from deduplication (cf. ZFS)
through generations for deleted files (ODS for Files-11 in VMS, where ZOO got
the idea from) to reproducible directory scan orders (cf. the side-effects of
using B-trees in HPFS).

~~~
cyphar
> This, and the incorrect dates, are errors promulgated by Wikipedia [...]
> which was probably the source of these errors in the article. Always double-
> check what Wikipedia says on computing topics with a proper reference.

I didn't use Wikipedia, actually. The primary sources were the libarchive
documentation, star, bits of GNU tar's docs, and POSIX. The main issue is that
it's hard to get a copy of POSIX.1-1988, let alone POSIX _drafts_ from the
1990s.

EDIT: Also, I didn't actually notice there was a rationale section in
POSIX.1-2001 which references PAX as existing in earlier standards. I will
read through it and update my article accordingly. Thank you!

> Of particular note, given this article, is Rahul Dhesi's ZOO file format
> from 1986.

Funnily enough, I have heard of ZOO (not sure where) and looked into it. While
it does support file versions (and deletion) and has many improvements over
tar, there are many other properties it doesn't have that we'd need (and last
I checked there's no real support for it in modern Linux and Unix-likes -- so
it makes no difference from a ubiquity perspective).

> But really the basic error here is in using an off-line archive format for
> an on-line live filesystem mechanism.

We don't use tar archives as the live filesystem for containers; tar is only
used as a distribution mechanism.

> from deduplication (c.f. ZFS)

While ZFS has de-duplication, it's not really the kind we need and (from
memory) zfs-send doesn't include the de-dup tables so they're all generated on
the receiving end. Ideally we'd want content-defined de-duplication because
that way you can reproducibly generate the blobs.

~~~
JdeBP
I suggest books. (-:

Fred Zlotnick's 1991 book mentions pax and the (contemporary) POSIX.2 draft.

Mark Horton's _Portable C Software_ from 1990 has a command options listing
for pax.

That's just two of the books.

------
bydo
Reminds me a bit of “PSD is not my favorite file format”[0]

0: [https://stackoverflow.com/questions/5355708/psd-file-format#5355949](https://stackoverflow.com/questions/5355708/psd-file-format#5355949)

