thanks for the link, but i think i've found my favored solution: extracting the archive into its raw files (plus header, hashtable and blocktable) and then reassembling them on demand into a byte-for-byte equal archive (via script or a virtual file system).
this will block-align everything, give people access to the raw assets, and is flexible and performant on the filesystem thanks to hardlinking.
appreciate your help though :)
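the explode step starts at the archive header. a rough sketch of parsing the classic v0 MPQ header, assuming the widely documented 32-byte layout (real clients also use user-data headers and later format versions, which this ignores):

```python
import struct

MPQ_MAGIC = b"MPQ\x1a"
HEADER_FMT = "<4sIIHHIIII"  # little-endian v0 header, 32 bytes total

def parse_mpq_header(data: bytes) -> dict:
    """Parse the 32-byte MPQ v0 header at the start of `data`."""
    (magic, header_size, archive_size, fmt_version, block_shift,
     hash_off, block_off, hash_entries,
     block_entries) = struct.unpack_from(HEADER_FMT, data, 0)
    if magic != MPQ_MAGIC:
        raise ValueError("not an MPQ archive")
    return {
        "header_size": header_size,
        "archive_size": archive_size,
        "format_version": fmt_version,
        "sector_size": 512 << block_shift,   # block size is stored as a shift
        "hash_table_offset": hash_off,
        "block_table_offset": block_off,
        "hash_table_entries": hash_entries,
        "block_table_entries": block_entries,
    }
```

from these offsets and entry counts you can slice the hash table, block table and file sectors out into separate files, which is the whole explode step in miniature.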
A virtual filesystem to "simulate" the MPQ files on demand based on the raw deduplicated assets was exactly what came to mind when I read the OP. Happy to help with this. Email in profile.
jdupes is quite powerful and lets you create different types of links and even more.
i used it for creating hardlinks, since that was the most efficient option for my use case.
from what i read, a rewrite of jdupes is also currently in progress, which will bring significant performance improvements... you can see a post about it here:
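for illustration, the core of what hardlink-based dedup does can be sketched like this (a naive version - jdupes avoids hashing whole files up front and handles many edge cases this skips):

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path

def hardlink_duplicates(root: str) -> int:
    """Replace byte-identical files under `root` with hardlinks to one copy.
    Returns the number of files that were linked."""
    by_content = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file() and not path.is_symlink():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_content[(path.stat().st_size, digest)].append(path)
    linked = 0
    for group in by_content.values():
        keep, *dupes = group
        for dupe in dupes:
            if dupe.stat().st_ino == keep.stat().st_ino:
                continue  # already the same inode
            tmp = dupe.parent / (dupe.name + ".lnktmp")
            os.link(keep, tmp)     # create the link under a temp name...
            os.replace(tmp, dupe)  # ...then atomically replace the duplicate
            linked += 1
    return linked
```

after this, duplicate files share a single inode, which is exactly what makes the later block-level dedup pass cheaper.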
a BTRFS subvolume can be exported as a snapshot, and this can be passed around rather easily.
i know the concern about mounting BTRFS on windows, but there are windows drivers for that as well, and nowadays you can also mount it via WSL to get proper linux tooling.
having a custom storage blob with pointer references is something i'm considering as well - will play around with that during the holidays and do some experimenting.
thanks for your input.
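a minimal sketch of the pointer-reference idea - a content-addressed chunk store where duplicate chunks are written once and each file becomes a list of chunk hashes (the chunk size and on-disk layout here are arbitrary choices, not a proposed format):

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1 << 20  # 1 MiB chunks, arbitrary for this sketch

def store_file(src: Path, blob_dir: Path, index: dict) -> None:
    """Split `src` into chunks; write each unique chunk once, keyed by hash."""
    hashes = []
    with src.open("rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            h = hashlib.sha256(chunk).hexdigest()
            blob = blob_dir / h
            if not blob.exists():      # duplicate chunks are stored only once
                blob.write_bytes(chunk)
            hashes.append(h)
    index[src.name] = hashes

def restore_file(name: str, blob_dir: Path, index: dict) -> bytes:
    """Reassemble a stored file by following its chunk-hash pointers."""
    return b"".join((blob_dir / h).read_bytes() for h in index[name])
```

the index is the "pointer references" part; swapping the dict for a small manifest file per archive would make this persistent.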
World of Warcraft - from the earliest (publicly leaked) Alpha up to and including the very last Wrath of the Lich King version.
Other people are already working on getting backups of the next 2 expansions working again to add those client versions to the archive as well.
will probably write the MPQ blobs down to disk and deduplicate via hardlinks, and additionally on block level.
i don't know restic (or borg, which was also recommended), but i will read up on them and do some tests regardless, since they seem to be very nice tools for a lot of problem scenarios.
thanks for the input!
thanks for the link - i will write my own extraction code though, since the format is very simplistic and it gives me fine-grained control over how things are done.
appreciate your help though!
prying apart the MPQ file into its parts and writing them down to disk is probably the strongest contender for my solution at the moment.
it will cause the parts to be block aligned, make them hardlinkable, cut down on metadata and improve performance (same inode when hardlinked).
the only thing i need in this case is a script to reassemble those extracted parts into the original MPQ archives, which have to match the original content byte-for-byte ofc.
extracting them into their distinct parts also allows accessing the contents directly, if so desired, without needing to extract them on demand (some people want to look up assets in specific versions).
these distinct parts can then additionally be deduplicated on block level as well.
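the reassembly-plus-verification step could be as simple as concatenating the parts in their recorded order and comparing hashes (a sketch; the part naming and manifest are made up here):

```python
import hashlib
from pathlib import Path

def reassemble(part_paths: list[Path], out_path: Path) -> None:
    """Concatenate extracted parts back into one archive, in manifest order."""
    with out_path.open("wb") as out:
        for part in part_paths:
            out.write(part.read_bytes())

def verify(original_sha256: str, rebuilt: Path) -> bool:
    """Byte-for-byte check: the rebuilt archive must hash to the recorded value."""
    return hashlib.sha256(rebuilt.read_bytes()).hexdigest() == original_sha256
```

recording the sha256 of every original archive at explode time gives you a permanent byte-for-byte correctness check, independent of any MPQ-format knowledge.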
Maybe you can use it to decompress some files and assess how much disk space you can save using deduplication at filesystem level.
If it's worth the effort (I mean, going from 1 TiB to 100-200 GiB) I would consider coding the reassembly part. It can be done by a "script" first, then "promoted" to a FUSE filesystem if needed.
even though i do not like it too much, i think i will have to pry apart the MPQ files into their distinct parts and write them down to the filesystem individually (then delete the original file) - basically what i wanted to do with the extents, but as distinct files instead.
this can then be reversed via script to reassemble the original archive file on demand and get a byte-by-byte equal file again.
writing the parts down to the filesystem will cause them to be properly block aligned and hardlinkable if they exist multiple times on the filesystem - this cuts down on metadata even more and also boosts performance when doing block/extent deduplication, since most proper deduplication programs process a single inode only once.
the MPQ files range from a few MiB to around 2.5 GiB.
since access should be rather fast, packing them into an archive file is not an option for me.
thanks for the hint about merkle trees, i will read up on what that is... always good to know about different approaches to a problem :)
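for reference, the core idea of a merkle tree is small: hash blocks as leaves, then hash pairs upward until one root remains; identical subtrees get identical hashes, which is what makes cheap diffing and dedup possible. a minimal sketch (odd levels are handled by duplicating the last node, one of several common conventions):

```python
import hashlib

def merkle_root(blocks: list[bytes]) -> str:
    """Hash each block, then hash pairs level by level up to a single root."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [hashlib.sha256(b).digest() for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the odd node out
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()
```

two trees over the same data have the same root, and comparing two trees top-down locates differing blocks without scanning everything.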
I'm not sure about how it works in WoW but I can give you a warning about what happens in StarCraft. To prevent other people from editing StarCraft maps (which are MPQs), users would intentionally mangle the MPQ format in just the right way so that the game could still play it but other tools could not open it for editing. So, if there is anything like that going on in WoW world then it might be very hard to reassemble the original MPQs and get a byte for byte match.
shouldn't be the case, and i would implement a proper verify when exploding the MPQ file by running a reassembly and hash comparison right afterwards.
you are absolutely on point - i would prefer having a real filesystem with deduplication (not compression), which offers the data in a compact form with good read speed for further processing.
i was already brainstorming about writing a custom, purpose-built archive format, which would give me more fine-grained control over how i lay out data and reference it.
the thing is that this archive is most likely not absolutely final (additional versions will be added) - having a plain filesystem makes adding new entries easier.
an archive file might have to be rewritten.
if i go the route of a custom archive, i can in theory write a virtual filesystem to access it read-only as if it were a real filesystem... and if i design it properly, maybe even write to it.
still would prefer to use a btrfs filesystem tbh ^^
will brainstorm a bit more over the next days - thanks for your input!
duperemove has "--dedupe-options=partial", which enables deduplicating partial extents as well, not just full ones.
the issue is still that the data within the archive is not block aligned, which prevents me from deduplicating it properly.
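a toy illustration of that alignment problem: identical content only dedupes at fixed-block granularity when both copies land on the same block boundaries (4 KiB block size assumed here, payload is synthetic):

```python
import hashlib

def block_hashes(data: bytes, block_size: int = 4096) -> list[str]:
    """Hash fixed-size blocks, the way block-level dedup sees the data."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def duplicate_blocks(data: bytes) -> int:
    """Number of blocks that are byte-identical to an earlier block."""
    hashes = block_hashes(data)
    return len(hashes) - len(set(hashes))

# 16 KiB of non-repeating pseudo-random payload (4 blocks of 4 KiB)
payload = b"".join(hashlib.sha256(i.to_bytes(4, "little")).digest()
                   for i in range(512))

aligned = payload + payload                    # copy starts on a block boundary
unaligned = payload + b"\x01" * 100 + payload  # copy shifted by 100 bytes
```

the aligned copy shares all four of its blocks with the first one; shift it by even 100 bytes and every block hash changes, so block-level dedup finds nothing - which is exactly why exploding the archive into block-aligned part files helps.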