Archiving JPEGs for long-term storage (github.com/danielgtaylor)
73 points by turrini on Feb 17, 2018 | 24 comments

The 'jpeg-archive' tool, if anyone was wondering, appears from the source code to not produce an archive file of any kind but just a directory of lossily compressed files, does no deduplication on the file or image level, stores no hash sums, creates no forward error correction information, does not shard or store across multiple storage services under the LOCKSS principle, and otherwise doesn't do anything relevant for long-term archival storage. The title of the submission and tool are more aspirational than descriptive.
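For anyone wondering what the missing hash sums and file-level deduplication might look like, here is a minimal sketch using only the Python standard library. It is not part of jpeg-archive; the function names (`sha256_of`, `build_manifest`, `find_duplicates`) are made up for illustration, and it only catches byte-identical files, not visually similar images.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_duplicates(manifest):
    """Group paths that share a digest, i.e. byte-identical files."""
    by_digest = defaultdict(list)
    for name, digest in manifest.items():
        by_digest[digest].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]
```

Storing the manifest next to the images and recomputing it later would reveal silent corruption, since any changed digest means the bytes changed.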

Yup, it just re-compresses the photos to save some space, potentially sacrificing the quality of the originals. It has nothing to do with backing up the files securely. It's also not something any photographer would ever do to photos, but I guess it can be useful when you have tons of holiday snapshots where maximum quality or the ability to edit raw photos in the future is not a concern.

I imagine this tool served a need to store lots of images online, and that there are wider use cases for image libraries and ecommerce stores where lots of product images need to be served efficiently.

Sadly it is not 'jpeg-and-png', and you do need to keep some images in PNG because you cannot have transparency in JPEG.

Much can be done with 4:2:0 colour and the optimiser tools Google recommend. In the PNG world I like pngquant for some images but not all. It is also possible to prepare images in ImageMagick to suit an optimiser.

However, storage is cheap and there are better ways. Google's PageSpeed module for NGINX and Apache works wonders at serving optimised images. This abstracts the problem away from code, so there is no need to worry about image sizes over the web, whether in pixels or bytes.

I don't think this tool is about providing a professional archiving solution, just efficiently processing an online image library so that you have an extremely practical archive regardless of what compression ratio images were originally uploaded at.

The use of all cores to quickly process all images was instructive. However, in real life it is more useful to have a cron job that just updates all images to correct standards slowly in the background, not using all cores.

I would like to see some comparisons against more commonly used jpeg optimisers.

Agreed with most of these points, but FEC / sharding / storing should be handled at a different layer.

What layer of the filesystem it should be handled at can be debated (a tar with separate PAR2s? a single archive split into multiple ZFEC shares?) but that sort of thing is precisely what a long-term archiving tool should be handling for you. I don't want to mess with par2create by hand any more than I want to try ImageMagick on multiple settings to get an acceptable tradeoff.
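To make the forward error correction idea concrete, here is a toy sketch of single-parity recovery. It is deliberately much simpler than what PAR2 or ZFEC do (those use Reed-Solomon style codes and can survive several losses); plain XOR parity can rebuild only one missing shard, and it assumes all shards have the same length. The names `xor_parity` and `recover_missing` are hypothetical.

```python
def xor_parity(shards):
    """Return the byte-wise XOR of equal-length shards."""
    length = len(shards[0])
    if any(len(shard) != length for shard in shards):
        raise ValueError("all shards must have the same length")
    parity = bytearray(length)
    for shard in shards:
        for index, byte in enumerate(shard):
            parity[index] ^= byte
    return bytes(parity)


def recover_missing(surviving_shards, parity):
    """Rebuild the single missing shard from the survivors plus the parity shard."""
    return xor_parity(list(surviving_shards) + [parity])


# Three data shards plus one parity shard; pretend the second shard was lost.
shards = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(shards)
rebuilt = recover_missing([shards[0], shards[2]], parity)
```

This works because XOR-ing the parity with every surviving shard cancels them out and leaves only the lost one.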

Heh. As a photographer I'm not sure what utility this would have for someone like me. The last thing I want to do is smash my glorious high-bit-depth RAW files with LR adjustments into 8-bit low-quality JPEGs. And if I did, ol' Photoshop can do that with image macros, and Lightroom can do it with an export. Storage costs are so cheap now and continue to fall that even a guy like me with 5TB+ of images isn't panicking about how to archive this stuff. A 9TB RAID5 is cheap as dirt compared to the rest of the costs of photography, a few rotating externals to back up is also cheap, and cloud backup is mostly only limited by network speeds. Glacier will archive stuff for $4 per month per terabyte (and falling), so there is zero need in most circumstances to archive at such a terribly lossy quality level, at least in my industry. Maybe there are other applications out there, I dunno.

Interesting stuff, although I'm not a fan of the lossy recompression. There are good open source lossless JPEG compressors like packjpg and lepton, giving ~22% compression, and a yet-to-be-released one from the guy who made webp/brotli.

So, this is a low-res facsimile of Flickr but without the original's battle-tested infrastructure or a human-friendly UI.

Very interesting, especially the "jpeg-hash" tool, useful for finding quasi-duplicates (different files, but quasi-identical at the visual level).
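For a rough feel of how quasi-duplicate detection can work, here is a sketch of a simple "average hash" compared by Hamming distance. This is an assumption for illustration, not a description of what jpeg-hash actually computes. It takes an already decoded and downscaled grayscale grid as nested lists; decoding and resizing a real JPEG would need an image library and is left out.

```python
def average_hash(gray):
    """One bit per pixel: 1 if the pixel is brighter than the grid mean."""
    flat = [value for row in gray for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


# Tiny 2x2 grids standing in for downscaled images (made-up values).
original = [[10, 200], [220, 30]]
brighter = [[14, 205], [224, 33]]   # same picture, slightly brighter
inverted = [[200, 10], [30, 220]]   # a visually different picture

near = hamming_distance(average_hash(original), average_hash(brighter))
far = hamming_distance(average_hash(original), average_hash(inverted))
```

A small distance suggests the same picture saved differently; the threshold to treat two files as duplicates is a judgment call.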

I'd love something like this but for video formats that come off consumer cameras and cell phones. Especially with old ones that used Motion JPEG, you ought to be able to compress them quite a lot without losing much perceptual quality, but I've no idea what the best way to do that is.

How to actually archive images:

1. Convert to png

2. Save image on multiple storage mediums

I like that method better than:

1. Re-encode lossily.

2. ???

3. Profit.

3. Print it in a photo album

Slightly off-topic but does anybody know any services that you can pre-pay to store images long term? Say 10+ years.

Flickr only allow you to buy an account for the next 2 years.

I’m surprised no one has started a “perpetual storage for upfront cost” service.

With hardware costs going lower and lower, and rate limits enforceable to keep this for archiving, you could model a price with a suitable profit margin.

Maybe tarsnap? What you need is a service to take object buckets and make a tape backup that is replicated, re-read and re-archived over time.

Glacier is $0.004/GB per month.
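A back-of-the-envelope sketch of what that rate means over an archival time span, assuming the quoted $0.004 is per GB per month (which matches the $4 per terabyte per month figure elsewhere in the thread), that the price stays flat, and that a terabyte is 1000 GB. Retrieval and request fees are not modelled, and `storage_cost` is a made-up helper.

```python
def storage_cost(gigabytes, years, rate_per_gb_month=0.004):
    """Storage-only cost in dollars at a flat monthly rate (no retrieval fees)."""
    return gigabytes * rate_per_gb_month * 12 * years


# 5 TB kept for 10 years at the assumed flat rate.
ten_year_cost = storage_cost(5000, 10)
```

For 5 TB over 10 years this comes to about $2,400 in storage charges alone, before any cost of getting the data back out.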


Just looked into the pricing, and the lion's share comes when you need the data again (especially if you need it quickly).

Why not instead buy one or even multiple cheap but large USB hard disks and put them on your shelf or leave them with friends? It gives you immediate access for free. I don't get why one rents a service instead of staying independent.

Data storage devices have a finite lifetime, so data storage is necessarily a recurring expense.

Copies that can't be read aren't copies any more, which means regular access is a requirement, which means another recurring expense.

Different requirements can reasonably result in different solutions, but a USB hard drive on its own is simply not comparable to a storage service. A redundant set of USB hard drives with a specified replacement schedule and testing procedure would be comparable -- but if you amortize all that out, we're talking about $X + Y hours per month for Z GB of storage, which compares directly to $X/GB-month if time has a dollar value.
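The amortization above can be sketched as a small calculation. Every number in the example is hypothetical (drive price, lifetime, hours spent, hourly rate); the point is only the shape of the formula: hardware spread over its lifetime, plus the time spent on testing and replacement valued in dollars, divided by usable capacity.

```python
def diy_cost_per_gb_month(drive_price, drives, lifetime_months,
                          hours_per_month, hourly_rate, usable_gb):
    """Amortized dollars per GB-month for a self-managed set of drives."""
    hardware_per_month = drive_price * drives / lifetime_months
    labor_per_month = hours_per_month * hourly_rate
    return (hardware_per_month + labor_per_month) / usable_gb


# Hypothetical setup: three $100 drives holding redundant copies of 4000 GB,
# replaced every 48 months, with half an hour of checks per month at $30/hour.
example = diy_cost_per_gb_month(
    drive_price=100, drives=3, lifetime_months=48,
    hours_per_month=0.5, hourly_rate=30, usable_gb=4000,
)
```

With these made-up inputs the labor term dominates the hardware term, which is why the result depends so heavily on what an hour of your time is worth.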

From this perspective, I argue that storage services like S3, Glacier, B2, and DigitalOcean Spaces are priced fairly.

Glacier is billed monthly. I'm looking to buy 5-20 years of service in advance.

Amazon S3 long-term use contract?

Why choose a lossy compression scheme?

You could make the comparison of compressions a bit more interesting by adding a 100% crop of, say, the lips or any area with intricate detail. Also, if you do a recompression, wouldn't JPEG 2000 be an option?

It's a good encoder. It will give you more consistent quality and better compression than your bash script launching ImageMagick.
