Yeah, I used to do this with a little script. The strategy I used, which worked well when I was compressing and archiving workspaces (which might often contain checkouts of different branches of the same project), was essentially this:

    find * -print | rev | sort | rev |
    tar --create --no-recursion --files-from - |
    gzip
This clusters files of the same type together, and within each type it keeps files with the same base name close together.

This worked surprisingly well for my use cases, though you can imagine that packing and unpacking times were impacted by the additional head seeks caused by the rather arbitrary order in which this accesses files.




A small experiment with a 143M directory.

  $ tar -zcvf directory.tar.gz directory/
  $ du -sh directory.tar.gz
  57M directory.tar.gz

  $ find directory/* -print | rev | sort | rev | \
      tar --create --no-recursion --files-from - | \
      gzip -c > directory.tar.gz
  $ du -sh directory.tar.gz
  55M directory.tar.gz
A 3.51% (2 MB) reduction makes sense here.


Small nitpick: * will miss hidden files.

-print is also unnecessary.

Simply use

  find |
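For example, a rough sketch with both fixes applied (the archive name is just a placeholder):

  find | rev | sort | rev |
  tar --create --no-recursion --files-from - |
  gzip -c > archive.tar.gz

Plain find starts from the current directory and lists hidden entries too, and it prints by default, so the * and -print are no longer needed.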


Awesome! I will use this. I would like it even more if it sorted by file name only (ignoring paths) and, when names are equal, by file size.


This will already sort equal file names together. If I wanted to combine that with file sizes, I'd probably do some kind of

    decorate | sort | undecorate
dance on each line produced by find, where decorate would add to the start of each line the things you want to sort by, and undecorate would remove them again.
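A rough sketch of that (assumes GNU find and file names without whitespace; the directory name is just an example):

    find directory/ -type f -printf '%f %s %p\n' |  # decorate: base name, size, path
    sort -k1,1 -k2,2n |                             # sort by base name, then numerically by size
    cut -d' ' -f3- |                                # undecorate: keep only the path
    tar --create --no-recursion --files-from - |
    gzip -c > directory.tar.gz

The sort keys only ever look at the decoration, and cut strips it off again before the file list reaches tar.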


On the other hand, SSDs have cut at least 20% off head seek times. ;)



