
Archive that can be reconstructed with total loss of file system structure - see
https://github.com/MarcoPon/SeqBox
======
alcari
This is a cool idea. Please don't take the below as gratuitous negativity,
just a reminder that these are hard problems for which there are no general
solutions.

The README says it was tested on ZFS, but I doubt its utility in real-world
deployments. I don't know of anyone who has significant data in a ZFS pool
that isn't one or more of: raidz, compressed, encrypted, or embedded_data.

raidz implies that logical blocks aren't allocated as single physical blocks,
but instead striped across multiple drives. Finding the SBX magic isn't enough
to get you the rest of the block, but the checksum might (though, given that
it's CRC16, it probably won't) let you try appending blocks from other disks to
find the remainder of the block.
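For anyone curious, scanning for bare (uncompressed, unencrypted) SBX blocks is straightforward. A minimal sketch, assuming the v1 header layout as I recall it from the README (3-byte "SBx" magic, 1-byte version, 2-byte CRC-16, 6-byte file UID, 4-byte block sequence number) and the default 512-byte block size:

```python
import struct

SECTOR = 512  # SBX v1 default block size

def crc16_ccitt(data, crc=0xFFFF):
    # Generic CRC-16/CCITT (poly 0x1021, init 0xFFFF). SeqBox's exact
    # initial value and covered byte range may differ; illustrative only.
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def scan_for_sbx(image):
    """Yield (offset, uid, seqnum) for sectors starting with the SBX magic."""
    for offset in range(0, len(image) - SECTOR + 1, SECTOR):
        sector = image[offset:offset + SECTOR]
        if sector[:3] == b"SBx":
            ver, crc, uid, seq = struct.unpack_from(">BH6sI", sector, 3)
            yield offset, uid, seq
```

Checking the CRC against the right byte range (which would need to be confirmed against the spec) is what would let you try pairing half-blocks across raidz members, weak as a 16-bit check is for that.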

Transparent compression prevents you from identifying the magic header on each
block, unless you decompress every disk sector that could have data (which is
certainly feasible, but complicates recovery if you don't know which
compression was in use, and zfs supports at least 3 kinds, and pools will
generally have at least 1 in use whether compression is on or not).
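To make the "decompress every disk sector" idea concrete, here is a sketch covering only the deflate/gzip family (which the standard zlib module can handle); lzjb, lz4 and zle, which ZFS also uses, would each need their own decoder:

```python
import zlib

def inflate_candidate(block):
    """Try to inflate one raw on-disk block as zlib-wrapped or headerless
    deflate, and report the plaintext only if it starts with the SBX magic.
    A recovery pass would run this over every aligned block of the disk."""
    for wbits in (15, -15):  # zlib header / raw deflate
        try:
            plain = zlib.decompressobj(wbits).decompress(block)
        except zlib.error:
            continue
        if plain[:3] == b"SBx":
            return plain
    return None
```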

Encryption (present in Oracle ZFS) means there's no plaintext data to recover.

embedded_data is a feature flag (and on by default in supporting versions of
zfs) that packs blocks into block pointer structs when the amount of data is
small. I can easily imagine the final block of an SBX, which may be mostly
padding, getting compressed into one of those block pointers, which itself may
be embedded in a larger structure which is part of an array that's compressed
by default. That array is also probably long enough the compressed stream
takes multiple blocks, and you may have lost some of the early ones, making
the rest of it unrecoverable.

------
toolslive
Alba (
[https://github.com/openvstorage/alba](https://github.com/openvstorage/alba) )
implements the same idea, but for a distributed object store: The objects
(files) are split into chunks and erasure coded into fragments, which are
stored on storage devices (disks). The structure of the object and the
metadata are stored in a distributed key-value store. _BUT_ the metadata is
also attached to the fragments, so that if the distributed key-value store is
lost, everything can be restored from the storage devices alone. Obviously,
this is a very heavy operation and is the very last line of defense against
data loss, only needed in case of a severe calamity.
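The "metadata attached to the fragments" trick can be sketched in a few lines; the layout below (length-prefixed JSON glued in front of the payload) is purely illustrative, not Alba's actual on-disk format:

```python
import json
import struct

def make_fragment(payload, meta):
    """Self-describing fragment: a copy of the object metadata rides along
    with every fragment, so the object layout survives losing the KV store."""
    blob = json.dumps(meta).encode()
    return struct.pack("<I", len(blob)) + blob + payload

def read_fragment(frag):
    """Recover (metadata, payload) from a fragment alone."""
    (n,) = struct.unpack_from("<I", frag, 0)
    return json.loads(frag[4:4 + n]), frag[4 + n:]
```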

------
trevyn
Sounds like it would have been useful in 1998, when filesystems and drives
were less reliable and encryption was rare.

------
ahazred8ta
The author Mark0 writes: "each SBX file can contain only 1 file and there are
no error correcting informations (at least at the moment) ... But it's
possible to create an SBX file out of a RAR archive with recovery records, for
example."

IMHO that would be the win-win combination for data recovery. How common is
the failure scenario? Common enough that there are people who make a living
piecing lost files back together. See
[http://forensicswiki.org/wiki/File_Carving](http://forensicswiki.org/wiki/File_Carving)

More discussion at
[https://www.reddit.com/r/datarecovery/duplicates/66hgnt/i_ma...](https://www.reddit.com/r/datarecovery/duplicates/66hgnt/i_made_a_thing_that_may_be_of_some_interest/)

------
HappyKasper
This looks great! So I will ask the obvious question - how common is the
failure scenario this protects against compared to others?

~~~
jaclaz
I would say rather common.

The whole issue is that file carving (once filesystem structures are
destroyed) works fine with contiguous data but it doesn't work (or it works
only very partially, and depending on a number of factors) with any fragmented
data.

But I will give you a common enough example that I have seen happen many
times.

The user wants to format a USB stick, but by mistake he/she selects a data
volume instead.

Two possibilities (under Windows later than XP):

1) The user did NOT choose the "/q" (CLI) or "quick" (GUI) format: all data
is lost forever, since starting from Vista a "full" format overwrites the
whole volume with 00's.

2) The user chose to "quick" format: only the filesystem structures are
recreated (blank) and - given that the volume was originally formatted on the
same OS - they will be exactly where the previous filesystem structures were,
thus cleanly replacing them.

In this latter situation all files are still there; what is lost is where
they are. The address info for a contiguous file is just two pieces of data:
where it begins and how big it is. There are a number of programs that can
"recognize" a file (or its beginning) by its "signature" (usually a header,
but sometimes a combination of header and footer), among them the excellent
TrID tool by the same author, Marco Pontello; a large number of file formats
include some info on the file size, and there are slightly more sophisticated
programs that can usually recognize whether data belongs to a given filetype.
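For illustration, here is a toy version of that signature ("header/footer") carving for JPEGs: find the SOI marker (FF D8 FF) and take everything up to the next EOI marker (FF D9). Real carvers do far more validation, since FF D9 can also occur inside the compressed data:

```python
def carve_jpegs(image):
    """Naive header/footer carving for contiguous (non-fragmented) JPEGs.
    Returns each candidate file as a bytes object."""
    found, pos = [], 0
    while True:
        start = image.find(b"\xff\xd8\xff", pos)  # SOI + next marker byte
        if start < 0:
            break
        end = image.find(b"\xff\xd9", start + 3)  # EOI
        if end < 0:
            break
        found.append(image[start:end + 2])
        pos = end + 2
    return found
```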

The address info of fragmented data is as many starting points and as many
extents as there are fragments, and since a fragment - bar the first one -
has no header, it is extremely difficult (and often impossible) to find all
the fragments of a file, know to which file each fragment belongs, and
rebuild the file by reassembling the fragments in the right order.

In a nutshell, if the volume was only made of contiguous files, they can be
recovered 100% or very near 100%, only losing their filenames and their path
in the "previous" filesystem structure.

BUT any file that was fragmented will either be lost completely or will need
manual reconstruction (and only in some cases is even that possible),
something that may take days or weeks of work, very often with only scarce or
partial success.

The idea of a "self-referencing" archive with sector-level granularity is
simply great, you won't ever _need_ it, but if you do, it could be a
lifesaver.

~~~
dr_zoidberg
When I was younger I worked for a while on file carving problems and
algorithms, and while your take on it is accurate, it's not complete. There
are algorithms that can do what you call "manual reconstruction" in a more or
less automated way. Of course, as with everything automated, there are edge
cases where it doesn't work correctly, but it'll save you a lot of time.

What has been discussed here and in the OP is mostly header-footer carving,
but file-structure-based carving has been around for a long time already (I
think Foremost included it in 2005/2006, and PhotoRec supports it for _some_
formats). Using the file command (*NIX) or TrID to find the headers is a bit
of a hack, and useful only if you want to sit for a long time with dd (or an
equivalent tool) and direct the carving process yourself.

As for the fragmentation issue, there used to be a great program called
"Adroit Photo Forensics" which implemented the amazing SmartCarving(TM)
approach (academically known as graph-theoretic carving), in which the disk
is scanned, pools of segments are populated per file format, and then a graph
is constructed and analyzed with a Dijkstra's shortest-path variant that
works on multiple paths from multiple starts-to-ends. Very interesting,
though a bit computationally expensive. Plus you need an extremely precise
similarity metric between the segments, which was actually the weak spot of
the approach. However, it seems the Adroit product has died (I can only find
mirrors now hosting old versions of it), and as it was a try-and-buy program
I don't think you'd be able to use it now.
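A rough sketch of the graph-theoretic idea (not Adroit's actual algorithm): treat fragments as nodes, let an edge weight say how implausible it is that one fragment follows another, and run Dijkstra from a known header fragment toward a footer fragment. The `cost` callable is a stand-in for the similarity metric that, as noted, is the weak spot; note also that a shortest path may legitimately skip fragments, which real reassembly has to handle:

```python
import heapq

def reassemble(fragments, start, end, cost):
    """Cheapest ordering of fragments from index `start` to index `end`,
    where cost(a, b) scores how implausibly fragment b follows a."""
    dist, prev, seen = {start: 0.0}, {}, set()
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if u == end:
            break
        for v in range(len(fragments)):
            if v in seen or v == u:
                continue
            nd = d + cost(fragments[u], fragments[v])
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    # Walk the predecessor chain back from the footer to the header.
    path, node = [], end
    while node != start:
        path.append(node)
        node = prev[node]
    path.append(start)
    return [fragments[i] for i in reversed(path)]
```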

Thing is, there are easier ways to handle fragmentation. For example,
PhotoRec itself lets you apply an aggressive validation step to discard most
of the "probably wrong" recovered files, and then dump the remainder to an
alternative disk image, on which you can try carving again. Of course, this
iterative process takes time, but it can work, and of course it can be
automated (thus saving you from manual intervention).

While my work has taken me away from it, I never quite understood why the
digital forensics community "forgot" about file carving, or disregarded it as
an important problem -- it seems the same issues are still around as when I
worked on it, while fewer tools are available.

As for the SeqBox format, more than an alternative to these tools, I find it
an interesting complement to them.

~~~
jaclaz
Sure, I tested Adroit a couple of times; anyway it is (was) only about some
specific file formats (photos, i.e. JPEGs), and even the authors' "pitch" was
about (only) 20% more photos recovered. I remember not being that impressed
by the results of the actual tests (but maybe the devices on which I tested
it at the time also suffered from other forms of corruption):

[https://web-beta.archive.org/web/20120313073659/http://digit...](https://web-beta.archive.org/web/20120313073659/http://digital-assembly.com/products/adroit-photo-forensics/)

[https://web-beta.archive.org/web/20120806031446/http://photo...](https://web-beta.archive.org/web/20120806031446/http://photo-recovery.info:80/)

Some info about Smart Carving (for those interested):

[http://www.forensicswiki.org/wiki/File_Carving:SmartCarving](http://www.forensicswiki.org/wiki/File_Carving:SmartCarving)

and GuidedCarving:

[http://www.forensicswiki.org/wiki/File_Carving:GuidedCarving](http://www.forensicswiki.org/wiki/File_Carving:GuidedCarving)

I am pretty sure that a similar tool (specific for photos/jpeg's) would be
very useful, but extending the same principles to different file formats is -
as I see it - extremely hard and in a large number of cases simply impossible
(due to the actual structure of the file format itself).

~~~
dr_zoidberg
It depends on the format. For example, PNG has checksums per segment, which
make it extremely easy to tell whether a segment has been correctly rebuilt.
The same goes for ZIP and RAR, and JPEGs can be matched against their Huffman
tables for invalid byte sequences.

Also for JPEG there was a (sort of experimental) carver that validated the
image against its thumbnail and then decided whether it was correctly
reconstructed or not (and maybe it also helped to choose the correct block
when there was an error). Still, most of these tools never left the alpha
stage, and ended up as curiosities presented at a conference and then left to
rot.

~~~
jaclaz
Yes :), it is queer how some programs/tools that _should_ exist, and for
which we do have the basic algorithms, were never properly written/developed,
while we have tens or hundreds of tools similar to one another that all miss
these functionalities. Another kind of tool that was never properly developed
is interactive JPEG repair: besides the good JPEGsnoop
([http://www.impulseadventure.com/photo/jpeg-snoop.html](http://www.impulseadventure.com/photo/jpeg-snoop.html)),
which has only some of these functionalities, there are a handful of tools
that are of little or "very narrow" use, but nothing much effective at
rebuilding a damaged JPEG.

~~~
dr_zoidberg
I've seen this in the steganography/steganalysis "community": almost every
time a script or program is written, its purpose is to quickly cover the
needs of some highly specialized expert, who deems that development too
simple (that is, for himself) to develop further into a full-fledged tool.

------
pronoiac
This brought parchive/par2 to mind; as I recall, given an index file, which
you could keep separate, you could do a rolling scan for the recovery blocks
and file blocks on the disk, even without the file system structure.
Unfortunately, it sorta looks like that software hasn't been worked on in a
while.
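A rolling scan for PAR2 packets is indeed simple, since every packet starts with a fixed 8-byte magic; a sketch, with the header layout (magic, then a little-endian 64-bit whole-packet length) per my reading of the PAR2 spec:

```python
import struct

PAR2_MAGIC = b"PAR2\x00PKT"

def scan_par2(image):
    """Rolling scan of a raw image for PAR2 packet headers.
    Returns (offset, packet_length) pairs. A real scanner would also
    verify the packet's MD5 (further along the header) before trusting it."""
    hits, pos = [], image.find(PAR2_MAGIC)
    while pos >= 0:
        if pos + 16 <= len(image):
            (length,) = struct.unpack_from("<Q", image, pos + 8)
            hits.append((pos, length))
        pos = image.find(PAR2_MAGIC, pos + 1)
    return hits
```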

