
For Science: Does ZFS deduplication work on intros of TV shows? - matrixagent
http://manuelgrabowski.de/2014/09/29/zfs-deduplication-tvshow-intros/
======
gilgoomesh
I challenge you to record 30 seconds of a fixed test pattern from TV on two
separate occasions and have it encode to identical bits. I don't think it can
be done.

Many media formats (including all digital TV formats) encode a rolling
hardware timestamp of 33 bits (or more) which won't be the same for two
separate segments by random chance. Audio, video, subtitles and other metadata
will be ordered differently in the stream because they all come from sources
that have their own separate clocks. Synchronization between clocks at
different stages in the media pipeline will cause different frames in the
sequence to be dropped, padded, made into keyframes, etc, which then affects
every bit in subsequent frames. TV stations use time-based watermarks. TV
stations use digital compositing software that may retain bits from previous
frames on an ongoing basis. Many studio media pipelines still involve analog
steps.

The list of complications goes on.

Unless your pipeline is lossless, uses only digital sources and uses totally
synchronized clocks for all stages, you're about as likely to get an MD5 hash
collision by accident as you are to get any two non-trivial sequences of
compressed video to be bit-identical.
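
To illustrate the propagation: here's a toy model in Python where each encoded block depends on the previous one, the way inter-frame prediction chains P-frames together. (The hash chaining is purely a stand-in for illustration - nothing like a real codec.)

```python
import hashlib

def toy_encode(frames):
    """Toy stand-in for inter-frame coding: each encoded block depends on
    the previous encoded block (like P-frames referencing earlier frames),
    so a single changed input frame propagates into every later block."""
    prev = b"\x00" * 16
    out = []
    for frame in frames:
        block = hashlib.md5(prev + frame).digest()
        out.append(block)
        prev = block
    return out

frames_a = [b"frame-%03d" % i for i in range(100)]
frames_b = list(frames_a)
frames_b[0] = b"frame-000+1tick"  # one "timestamp" difference in frame 0

enc_a, enc_b = toy_encode(frames_a), toy_encode(frames_b)
diverged = sum(x != y for x, y in zip(enc_a, enc_b))
print(f"{diverged}/100 encoded blocks differ")  # → 100/100
```

One differing frame at the start and every subsequent block comes out different - which is why two separately captured copies of the "same" intro never dedup.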

~~~
a-dub
Could make for an interesting codec idea though! A codebook based compressor
that works on corpora of audio/video seems doable. More interesting would be a
codec that aims not only to exploit models of human perception to throw away
bits, but also exploits those models to produce "canonical" compressed forms
that would be suitable for byte/block level deduplication. Of course,
interesting things are hard, and errors would be hilarious. (See: Xerox and
their copiers that change letters and numbers in documents...)

~~~
Kayou
Also, I don't know if this is already implemented in some codec, but it would
be interesting to re-use key frames from previous shots. For instance when you
have a dialog and the camera switches from one person to another, a
conventional codec probably creates a new i-frame for each shot whereas an
intelligent codec would recognize that this is the continuation of the
previous shot.

------
laumars
The reason I assumed deduplication wouldn't be effective on TV shows is that
the intro isn't always at the same point, as most shows have a prelude these
days. Since the preludes are of different lengths, things like key frames (if
used by that particular codec) would land on different frames of the intro
sequence. That makes me wonder how effective deduping would be on older TV
shows where the intro is right at the start of the programme.
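
You can see the alignment problem with nothing more than fixed-size block hashes. A toy sketch, where the "intro" and "preludes" are synthetic byte strings rather than real episode data:

```python
import hashlib
import random

BLOCK = 4096  # arbitrary stand-in for the filesystem's fixed block size

def block_hashes(data: bytes) -> set:
    """Hash every fixed-size block, the way block-level dedup sees a file."""
    return {hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

rng = random.Random(0)
intro = rng.randbytes(16 * 1024)     # stand-in for a shared intro sequence

ep1 = b"\x00" * 100 + intro          # 100-byte "prelude" before the intro
ep2 = b"\x00" * 137 + intro          # slightly longer prelude: blocks misalign
ep3 = b"\xff" * 100 + intro          # same-length prelude: blocks line up

print(len(block_hashes(ep1) & block_hashes(ep2)))  # 0 shared blocks
print(len(block_hashes(ep1) & block_hashes(ep3)))  # all but the first block shared
```

A prelude that differs in length by even a few bytes shifts every block boundary, so nothing matches; only the same-length prelude leaves the intro blocks aligned.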

As for compression being largely ineffective: the shows are already compressed
anyway. In fact, file system compression seems less relevant these days, as
most modern file formats have compression built in (even Office documents are
just ZIP files). But as the author said, the overhead of compression isn't
damaging to the performance* of ZFS - unlike with deduplication.

* ZFS compression is particularly performant† when using the newer lz4 algorithm available in OpenZFS - which I'd highly recommend people use if they're not already.

† I know "performant" isn't technically a word. But it should be.
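
For anyone wanting to try it, enabling lz4 on an existing OpenZFS dataset is a one-liner. (The dataset name below is a placeholder, and note the setting only applies to blocks written after the property is set - existing data stays as it was.)

```
# enable lz4 compression (dataset name is a placeholder)
zfs set compression=lz4 tank/media

# check the property and the observed ratio
zfs get compression,compressratio tank/media
```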

~~~
zimpenfish
If people use and understand the word "performant", it's a word. That's how
language works.

~~~
laumars
In the evolutionary sense of language, I agree. But there are also a lot of
people who strongly disagree with slang, "verbing" and other colloquialisms -
instead believing language should be applied strictly (people otherwise known
as "language Nazis").

My "performant" footnote was just an attempt to pre-empt such attacks; though
ironically including the footnote has now sparked the tangent itself.

~~~
zimpenfish
It's a fun thing to point out to those people how many of the words they're
using to lambast you have actually changed meaning or are new words. They get
-very- flustered and sulky.

------
pixelcort
For shows with identical intros, mkv format supports linking to a separate
file to store it. The separate file just has to be in the same directory.
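
A rough sketch of the mkvtoolnix side of that (filenames and the UID below are illustrative): the intro gets its own segment UID, and each episode's ordered-chapters XML references that UID via a ChapterSegmentUID element, so players with ordered-chapters support splice the intro in at playback.

```
# give the shared intro its own segment with a known UID
mkvmerge -o intro.mkv --segment-uid 41c24d9633fd4a1e8cfa55d13ad5a9e0 intro-src.mkv

# mux each episode with ordered chapters whose XML points at that UID
mkvmerge -o episode01.mkv --chapters chapters-ep01.xml episode01-src.mkv
```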

~~~
Filligree
A feature which is frequently used by anime fansubbing groups, and almost
never elsewhere.

Those groups often seem strangely bleeding-edge. They always adopt new
features years before western groups will... I wonder why.

~~~
masklinn
I'm guessing it's because they have to do much more work: western rip groups
just rip and upload, while fansub groups have to rip, translate (or find a
translation), integrate the subtitles, recompress and upload. And they'll go
through each episode multiple times to QC timings and subtitle readability,
and to check that the subs don't obscure anything of interest - at least for
quality groups.

As a result they're always on the hunt for options which could help them do
their stuff and:

* they'll find out about other possibly interesting features at the same time as they're poring over release notes

* because they might have to re-export their rips multiple times during QC, they can toy around extensively with features & settings (especially on early exports, when they'll most likely still have typos and the like and will have to redo the export anyway)

tl;dr: they have plenty of opportunities to find and try out bleeding-edge
features.

~~~
rplnt
I only wish "western" releases were already packed with subtitles. Especially
now that opensubtitles tries to sell ad space directly in the subtitles.

------
oldmanjay
Since videos from the iTunes Store are encrypted, it seems like the individual
files should look like random noise to the filesystem anyway. Unless there is
something I don't understand happening.

~~~
dewey
There are ways to remove the DRM. In case you don't want to rely on Apple
keeping their activation servers online forever.

~~~
rakoo
I don't think OP did that for the post, so I'm inclined to believe that's
part of why he didn't see any deduplication.

~~~
matrixagent
dewey is correct, DRM/encryption is not the issue.

------
josteink
If you were going to do this for science, a good start would be to analyze the
intros of the shows in question, using rudimentary tools such as md5sum on a
per-block basis.

You would then quickly discover that the chances of huge gains would be small,
without ever even looking into the ZFS aspects.
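
A minimal version of that check in Python, per 128 KiB block (ZFS's default recordsize). The two "episodes" here are synthetic stand-ins that share only an intro:

```python
import hashlib
import random

BLOCK = 128 * 1024  # ZFS's default recordsize

def block_md5s(data: bytes) -> set:
    """Per-block MD5s - the scripted equivalent of md5sum over split chunks."""
    return {hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

rng = random.Random(1)
intro = rng.randbytes(2 * BLOCK)         # two blocks of shared "intro"
ep1 = intro + rng.randbytes(10 * BLOCK)  # episode bodies differ
ep2 = intro + rng.randbytes(10 * BLOCK)

shared = block_md5s(ep1) & block_md5s(ep2)
print(len(shared), "blocks would dedup")  # 2 here; ~0 for real compressed rips
```

Run the same thing over two real iTunes episodes and the intersection comes out empty, which is the point.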

~~~
ccrush
ZFS supports byte-level deduplication where it can compute anchor points so
that it works on unaligned data. However, quite a few assumptions about the
entropy of the data set could have been inferred from the "iTunes TV Show"
file choice, which clearly implies the files would be compressed and
encrypted.

~~~
TheCondor
Is that new? ZFS has supported block deduplication for a few years, but I'm
unaware of it supporting byte-level deduplication.

~~~
ccrush
[https://blogs.oracle.com/bonwick/entry/zfs_dedup](https://blogs.oracle.com/bonwick/entry/zfs_dedup)

~~~
lkateley
Not new - dated 2009.

Oracle seems to be doing a lot of this lately, posting really old collateral.

------
ccrush
Yes, ZFS deduplication is not very useful when trying to deduplicate encrypted
and compressed video files. Encryption is supposed to make the contents of the
file look random. Compression is itself a form of in-file deduplication. Where
ZFS deduplication will be extremely useful: on a SAN in a video editing studio
where originals are kept in raw format. Even then, file-level and block-level
dedups will probably be the least effective means of deduplication. What we
need is byte-level deduplication which computes anchor points from which data
ought to be deduplicated. Even then, it sounds like the dedup feature would be
scripted so that it would be part of the tooling, and updated with the
team's projects. I would go as far as to say the author chose a deliberately
eye-catching headline and coupled up a few Linux commands with some
"hilarious" GIFs to address the wannabe geeks out there while displaying a
staggering pile of shamelessness to even the remotely competent of his blog's
visitors.
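
For reference, the anchor-point idea is usually called content-defined chunking. A toy sketch - the rolling hash and parameters here are illustrative, not what any real dedup system uses:

```python
import hashlib
import random

MASK = (1 << 12) - 1  # boundary condition: average chunk of ~4 KiB

def chunks(data: bytes):
    """Toy content-defined chunking: boundaries fall where a rolling hash of
    the most recent bytes hits a pattern, so they track content, not offsets."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # crude rolling hash of recent bytes
        if i - start >= 16 and (h & MASK) == 0:
            yield data[start:i]
            start, h = i, 0
    if start < len(data):
        yield data[start:]

rng = random.Random(2)
intro = rng.randbytes(64 * 1024)
ep1 = rng.randbytes(100) + intro    # the shared intro sits at different
ep2 = rng.randbytes(1337) + intro   # byte offsets in the two files

h1 = {hashlib.md5(c).hexdigest() for c in chunks(ep1)}
h2 = {hashlib.md5(c).hexdigest() for c in chunks(ep2)}
print(len(h1 & h2), "chunks shared despite the misalignment")
```

Because the boundaries track content rather than offsets, the chunks inside the shared region come out identical even though it sits at different positions in the two files - which fixed-size blocks can never do.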

~~~
matrixagent
The files are not encrypted. I mention the aspect of doing it with files
that are already compressed, and I say that I never expected it to work for
that very reason. I just wanted to, for once, actually do it, and I don't see
why you need to get so mad about that.

~~~
ccrush
Unless you stripped the DRM from the files, yes, they are encrypted. At least
according to the Wikipedia page on FairPlay
[[http://en.wikipedia.org/wiki/FairPlay](http://en.wikipedia.org/wiki/FairPlay)],
all the iTunes files are encrypted.

Also, wouldn't proving it doesn't work by setting up a deliberately flawed
scenario amount to a bit of blogsturbation? Should I write an article about
"finding out whether my computer will turn on even if it's unplugged"? Answer:
no. But read my article about me TRYING it just to be sure.

I think you should write a follow-up article where you splice together raw
video files and include a similar segment in all the different files. Then,
put that through the dedup test. I would actually be interested to see the
ability of ZFS implementations to find worthwhile anchor points within files
and do smart byte-level deduplication.

------
jamesbrownuhh
There's basically no way to make this work because, as has been mentioned, in
compressed formats there's every chance that title sequences will occur at
different times in the show, and so at different points in the group-of-
pictures sequence of the compressed stream.

And especially for rips from an already-compressed signal source (like TV),
the odds are well against a visually identical sequence encoding into exactly
the same sequence of bits, as there are so many factors that can vary the
datastream you receive, even before it's ripped and re-encoded.

To stand any chance of this working, you'd really need to be storing and
comparing uncompressed frames direct from the master, before ANY kind of
variable encoding or compression. But if you have enough storage to be working
with files on that scale, deduplication is probably not your biggest concern.
:)

------
PaulHoule
Reminds me of the DVD sets of Sailor Moon I made years ago where I spliced the
same video sequences in for the intro and outro. I never got as far as
deduplicating the henshin sequences, because there are often different voice-
overs during them.

------
PythonicAlpha
I am curious and did not read the docs: does ZFS deduplication work on parts
of files? I would have expected filesystem deduplication to just deduplicate
complete files (for convenience). Deduplication of file parts could of course
be good for some types of files, with a fixed structure and with no "noise"
-- but with videos, I'd guess there's no chance without manually linking
parts.

~~~
matrixagent
Yep, it does. [http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/](http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/) has a pretty nice overview.

------
anilgulecha
A few bits of change in uncompressed video can change the following stream in
the compressed video completely. Deduplication isn't intelligent enough to
understand "video" and compress for that -- that's what video codecs are for.
So it's not surprising that deduplication would not prove useful for
compressed video. You could see the advantage in uncompressed streams.

------
fulafel
A codec that could search for reference frames from large collections of media
might be pretty interesting for some applications, especially ones with large
storage demands and repetitive material. The indexing system would be an
interesting problem since it would have to be very low overhead. Wonder if any
systems like this exist?

~~~
AlyssaRowan
The player would have to be able to reference a huge number of sparsely-
positioned reference frames.

The seeking might make it impractical.

------
VeejayRampay
I've always dreamed that subbers would just provide one version of the credits
in their season packs and provide a bash/mencoder/whatever script to glue the
bits together after download. We're wasting so much bandwidth on opening and
closing credits that people end up skipping anyway, it's crazy...

------
bio4m
TV shows may have the same intro sequence, but the overlaid credits are always
a bit different.

Hence any compression applied will always produce different results between
episodes of a show. This would make any de-duplication extremely difficult.

~~~
laumars
I think it's a bit of a stretch to say the opening credits are "always a bit
different". I can think of quite a few shows that display the episode's
director and any guest stars after the intro (i.e. during the first scene of
the main episode). I presume they do this to save the editing time and cost
of changing the intro - though there's also the convenient side effect of the
credited names still appearing on screen even when viewers fast-forward
through the intro.

------
kapsel
I think this was a pretty dumb experiment, and the outcome was to be expected.

There will always be some sort of noise - pixels aligned differently and so
on - in a production like a TV series, and expecting the encoded output to be
identical or to match at the block level is pretty naive, to say the least.

I did some experimenting with ZFS deduplication on MPEG2 files: I encoded
hundreds of DVD-sized MPEG2 videos where 90% of the material was identical
(the last 10% varied because different watermarking techniques were applied
to the footage), and got a decent deduplication ratio (1.2:1 or so). But ZFS
deduplication is expensive in memory/SSD, and it was definitely not worth it.

~~~
ZoFreX
It may be a "dumb experiment" if you already understand how the underlying
systems (both the file system, and video encoding) work - but the author did
not, and now having done the experiment, they understand more about them.

------
bshimmin
Perhaps I'm just in a bad mood today, but really, why must technical articles
be festooned with meme GIFs featuring abusive language and children injuring
themselves? What does it add? If your content is so boring to you, or to your
prospective audience, that you think it needs livening up with that sort of
garbage, maybe just don't write it in the first place.

~~~
matrixagent
I tend to agree with you, but the article wasn't supposed to be very technical
in the first place - and both of those GIFs are from Dexter, the show I used a
season of in my tests. It's not completely unrelated.

~~~
bshimmin
I didn't realise they were from Dexter, though I stand by what I said. Clearly
the masses on HN agree with your choices, though, since I'm getting down-
voted!

I'm writing some technical documentation for a project right now - perhaps
I'll put in a few tasteful images from Baywatch to make sure the project
managers who have to read it will make it to the end.

