
PySceneDetect – A tool for detecting scenes in movies - josefslerka
https://pyscenedetect.readthedocs.io/en/latest/reference/command-line-params/
======
spapas82
My PhD (10+ years ago) was in video search, and one of the proposed methods
for video comparison used the shot durations of the video. Note that by
_shots_ I mean cuts in the camera flow; IIRC Hollywood movies have such a shot
cut every 4 seconds on average (for example, when two people talk the camera
will move from one person to the other).

I remember that I used two techniques for extracting scene cuts:

* Difference in the brightness (Y) histogram of the YUV video between frames; when that difference is more than a threshold there's a scene cut

* Counting the number of Intra macroblocks per frame on an H.264 encoded video; if that number was more than a threshold then there's a scene cut
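As a rough sketch of the first technique, the luma-histogram comparison might look like this in Python with NumPy; the bin count and threshold are made-up values, and the "frames" here are synthetic arrays standing in for decoded video:

```python
import numpy as np

def luma_hist(frame_yuv, bins=64):
    """Histogram of the Y (luma) plane, normalized so frame size doesn't matter."""
    y = frame_yuv[..., 0]
    hist, _ = np.histogram(y, bins=bins, range=(0, 256))
    return hist / hist.sum()

def find_cuts(frames_yuv, threshold=0.4):
    """Flag a cut wherever the L1 distance between consecutive
    luma histograms exceeds the (made-up) threshold."""
    cuts = []
    prev = luma_hist(frames_yuv[0])
    for i, frame in enumerate(frames_yuv[1:], start=1):
        cur = luma_hist(frame)
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Synthetic demo: 10 dark frames followed by 10 bright frames -> one cut at frame 10.
dark = np.full((10, 8, 8, 3), 20, dtype=np.uint8)
bright = np.full((10, 8, 8, 3), 200, dtype=np.uint8)
print(find_cuts(np.concatenate([dark, bright])))  # [10]
```

A real implementation would of course read the Y plane from a decoder rather than build synthetic arrays.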

~~~
Breakthrough
Author of PySceneDetect here. The current implementation does exactly what you
hint at, except that instead of YUV, it considers deltas in the HSV domain
(specifically, differences in hue, saturation, and value).

Other techniques being considered for future work include use of optical flow,
background subtraction, and analyzing histograms.

~~~
spapas82
From what I remember, the Y (luma) component of a YUV video carries more
information than the other two components, and it could also be extracted
without fully decompressing the video (in MPEG-compressed videos). Of course
this info is more than 10 years old (I don't really do any video research any
more), so I guess there has been progress in that area since.

~~~
Breakthrough
This is indeed correct; I'm just using HSV instead of YUV, but the primary
source of information is still the luma/brightness component (although
currently all 3 of the HSV components are averaged equally, so a better
weighting may improve precision).
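A minimal sketch of that kind of equally-weighted HSV delta score; this is an illustration of the idea described above, not PySceneDetect's actual code, and the two frames are synthetic:

```python
import numpy as np

def content_score(prev_hsv, cur_hsv):
    """Average absolute per-pixel delta, averaged over the H, S and V
    channels with equal weight (mirroring the averaging described above)."""
    delta = np.abs(cur_hsv.astype(np.int16) - prev_hsv.astype(np.int16))
    return delta.reshape(-1, 3).mean(axis=0).mean()

# Two synthetic HSV frames differing only in hue.
a = np.zeros((4, 4, 3), dtype=np.uint8)
b = a.copy()
b[..., 0] = 90  # hue channel changes; S and V are unchanged
print(content_score(a, b))  # 30.0 (the 90 delta averaged with two zero-delta channels)
```

A per-channel weighting would simply replace the final `.mean()` with a weighted sum over the three channel means.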

------
speps
Can the link be changed to the home page of the docs instead of the CLI params
one?
[https://pyscenedetect.readthedocs.io/en/latest/](https://pyscenedetect.readthedocs.io/en/latest/)

~~~
spookyuser
Am I stupid, or is it really hard to get from a Read the Docs page to the
GitHub repo it is generated from?

It seems like the only way to do it is to click edit, get an error message,
and then hit backspace in the URL to get to the root of the repository.

~~~
Breakthrough
Author of PySceneDetect here. Sorry about that; this is one thing I haven't
figured out how to fix in the generated documentation yet.

Now that you mention it, I might ask the Read the Docs folks if they have any
idea why this is happening.

~~~
brachi
Even if that is fixed, a link to the repo in the left column[1] would be nice.
Maybe you don't want to edit that page and just encountered the docs first,
e.g. from HN :D. This project sounds pretty interesting, I'll take a look!

[1]See e.g. docs of requests:
[https://requests.readthedocs.io/en/master/](https://requests.readthedocs.io/en/master/)

~~~
Breakthrough
Great idea, will give that a shot - thanks for the suggestion! :)

------
Breakthrough
Author of PySceneDetect here: Thank you all for the thought-provoking
discussion, and for the attention you've given my side project. There are some
specific cases where PySceneDetect achieves great accuracy (like fast cuts or
fades), and some it currently isn't so good at (like sudden flashes or large
obstructions). That being said, I do want to track these things and come up
with solutions to improve the robustness of the content detection algorithm
over time.

I'm open to any feedback or feature requests/ideas/suggestions; feel free to
check out the issue tracker on GitHub, or create a new entry:

[https://github.com/Breakthrough/PySceneDetect/issues](https://github.com/Breakthrough/PySceneDetect/issues)

Some ideas being considered/researched for future releases:

* looking at changes to image histograms

* using edge detection to improve robustness

* camera flash / foreground object suppression

* automatic threshold detection using statistical methods (currently just a
heuristic)

~~~
bb88
So this is likely beyond the scope of your project, but I've always thought a
really good project would be a website to host scene indexes for movies and
TV.

Eg. Let's say that you wanted to watch a prerecorded football game or baseball
game without all the commercials, timeouts, commentators talking about the
fans, etc.

Or... Let's say that you wanted to re-cut a movie in a certain way, by re-
ordering the scenes, you could just generate a new scene data file and let the
encoder/player use that.

~~~
Breakthrough
I think this is still relevant :) What you mention is very similar to an edit
decision list (EDL [1]), which I only learned about recently. I had a feature
request [2] to support EDL as an output format, and upon further
investigation, the format seems very similar to what you're talking about. The
Wikipedia page also indicates that VLC supports XSPF ("XML Shareable Playlist
Format") files, but I haven't done much research into that yet.

[1]
[https://en.wikipedia.org/wiki/Edit_decision_list](https://en.wikipedia.org/wiki/Edit_decision_list)

[2]
[https://github.com/Breakthrough/PySceneDetect/issues/101](https://github.com/Breakthrough/PySceneDetect/issues/101)

------
Quarrel
So can I look forward to a "Next scene" shortcut in my video player of choice
soon (without needing embedded "DVD" chapters), while also getting nice
thumbnails for them?

Pornhub et al often show scene markers, but I have assumed they're manual or
extracted in the same way as DVD chapters.

------
Hamuko
Tested it out on an anime episode to see if it could accurately detect when
the opening and closing credits play. It was able to detect the end of both
even with a high threshold, but it didn't really detect the start. And with a
low threshold, it gave me something like 350 chapters for a 22-minute video,
which seemed a bit excessive.

~~~
Breakthrough
Hi Hamuko;

Would you be able to share a small subset of the episode, in particular the
area where the start isn't being detected? (If not, no worries!)

There are a few issues with PySceneDetect currently that may lead to false or
missed detections, but these are things I would like to solve in the long
run:

* the threshold is a fixed heuristic right now, but I would like to change it
to an adaptive/statistical method that can adjust dynamically

* single-frame events can trigger false scene changes
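As a hedged sketch, an adaptive/statistical threshold along these lines could flag a frame whose score stands out from a rolling mean and standard deviation of the preceding frames; the window size, multiplier, and scores below are invented for illustration, not taken from PySceneDetect:

```python
import numpy as np

def adaptive_cuts(scores, window=5, k=3.0):
    """Flag frame i as a cut when its score exceeds the mean of the previous
    `window` scores by more than k standard deviations. (Hypothetical sketch;
    the constants are not from PySceneDetect.)"""
    cuts = []
    for i in range(window, len(scores)):
        prev = np.asarray(scores[i - window:i])
        # Small epsilon so a perfectly flat history doesn't trigger on equality.
        if scores[i] > prev.mean() + k * prev.std() + 1e-6:
            cuts.append(i)
    return cuts

# Noisy low per-frame scores with one large spike at index 8.
scores = [1.0, 1.2, 0.9, 1.1, 1.0, 1.1, 0.9, 1.0, 25.0, 1.0]
print(adaptive_cuts(scores))  # [8]
```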

Thanks for your feedback, and feel free to share any other suggestions you
might have.

------
achow
A little off topic...

I'm trying to help a friend in the media industry. His requirement is the
ability to identify different voices in a movie and have the output
timestamped - e.g. Voice A: 00:00:00-00:05:30, Voice B: 00:05:31-00:06:30,
etc.

It would be very helpful if anyone can point to any existing tools that can do
that (open source or otherwise).

~~~
lozenge
AWS offers this for $1.44 per hour.

~~~
achow
Thanks. Are you referring to Transcribe?
[https://aws.amazon.com/transcribe/](https://aws.amazon.com/transcribe/)

Transcribe seems to be more for speech-to-text, irrespective of who is
speaking.

Here the requirement is to identify the unique voices. E.g. if "Mary had a
little lamb" is voiced by two different voices, then the engine should
identify that Voice A said "Mary" at 00:00:00-00:00:01, then Voice B said "had
a little" at 00:00:02-00:00:03, then Voice A again said "lamb", etc.

~~~
yorwba
From your link:

 _Recognize Multiple Speakers

Amazon Transcribe is able to recognize when the speaker changes and attribute
the transcribed text appropriately. This can significantly reduce the amount
of work needed to transcribe audio with multiple speakers like telephone
calls, meetings, and television shows._

So they claim to be able to do it. However, having to do speaker diarization
doesn't exactly make speech recognition easier, so you should adjust your
expectations regarding the error rate accordingly.

~~~
achow
Duh. It was right there.

But thanks; after a bit of searching on 'speaker diarization' I saw that
Google Cloud has that service.
[https://cloud.google.com/speech-to-text/docs/multiple-voices](https://cloud.google.com/speech-to-text/docs/multiple-voices)

And then I came across this comparison of various cloud transcription
services, which should be helpful for evaluating the diarization aspect as
well.

AI-POWERED TRANSCRIPTION SERVICES SHOWDOWN: AWS VS. GOOGLE VS. IBM WATSON VS.
NUANCE [https://www.armedia.com/blog/transcription-services-aws-
goog...](https://www.armedia.com/blog/transcription-services-aws-google-ibm-
nuance/)

------
genp
I've used this tool for breaking up movie trailers and old Vines. I really
didn't like the output quality.

I ended up using optical flow and key frame information instead.

~~~
Breakthrough
Hi genp;

Just curious, what didn't you like about the output quality? What version of
PySceneDetect were you using?

The latest version (v0.5.x) uses ffmpeg instead of mkvmerge for output by
default now, which produces significantly more accurate and higher quality
output than before.

That being said, you are correct that optical flow and keyframe information
are currently not used during the detection phase. There are several proposals
to incorporate them into a future release, however, along with several other
techniques:

* histogram analysis

* edge detection

* background subtraction

* automatic or dynamic threshold detection

Thanks for your feedback!

------
dariosalvi78
This was my assignment in the "artificial intelligence" course I took at my
university in Naples (Italy) in 2003. My NN could recognise scene cuts with
about 50% probability :)

------
Daub
A film is a collection of shots, scenes, and sequences. Shots are fairly easy
to detect; most editing software comes with built-in shot detection.

This approach promises to detect scenes. I'm not enough of a coder to know if
it is valid, but I can certainly imagine that it is a solvable problem. Most
scenes are defined by location, and movie makers like each location to be
distinct from the preceding one.

To my mind, sequences are where the action is. Sequences are defined by
narrative development... They can usually be identified by a trained human,
but are probably tough to define computationally.

There is also a huge difference between film genres. The scenes and sequences
of Finding Nemo are pretty easy to define. But try the same approach on an art
house film like The Scent of Green Papaya and see how far you get.

[https://www.helpingwritersbecomeauthors.com/movie-
storystruc...](https://www.helpingwritersbecomeauthors.com/movie-
storystructure/finding-nemo/)

~~~
onion2k
Detecting scenes seems incredibly difficult to me. Even in a basic edit you
could have back-to-back shots looking at very different parts of a single set.
Take the "Jack Rabbit Slims" scene from Pulp Fiction - Vincent Vega walking
through the diner, the car-table head-to-head scene about the milkshake with
linking cuts around the diner when they discuss the two Marilyns, and the
beginning of the Twist competition are three scenes that use the same
location, actors, background, etc. Finding the boundaries would be very hard.

~~~
joshvm
One approach might be to estimate the intrinsic parameters of the camera based
on keypoint tracking. Assuming the camera was adjusted between shots, you
might be able to get that to work.

Or you could build some kind of generic feature vector from each frame that
includes contrast, etc. The interiors are the same, but the camera focus/FOV
will likely be different.

Nowadays, you might just take an annotated movie and throw it into an
encoder/decoder network to assess whether a frame is from the same scene.
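The feature-vector idea could be sketched as below; the particular features chosen (mean brightness, contrast as standard deviation, and gradient energy as a crude focus/sharpness proxy) and the synthetic frames are illustrative assumptions, not a tested recipe:

```python
import numpy as np

def frame_features(gray):
    """A crude per-frame feature vector: mean brightness, contrast (std),
    and mean gradient magnitude as a rough sharpness/focus proxy."""
    gy, gx = np.gradient(gray.astype(float))
    return np.array([gray.mean(), gray.std(), np.hypot(gx, gy).mean()])

def feature_distance(a, b):
    """Euclidean distance between the two frames' feature vectors."""
    return np.linalg.norm(frame_features(a) - frame_features(b))

# Same average "interior" brightness but different contrast/sharpness
# still yields a nonzero distance between the two frames.
soft = np.full((16, 16), 128.0)
sharp = soft.copy()
sharp[::2] = 120.0  # alternating rows add contrast and edges
print(feature_distance(soft, sharp) > 0)  # True
```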

------
jimmaswell
Odd - scene divisions often seemed like arbitrary slices just meant for
getting to a certain point in the movie, back when I'd look at them on DVDs.
Not something where you could objectively say "this is where one scene ends
and another begins" and have others come to the same conclusion on their own.

------
natmaka
Can it detect subliminal frames? Would an adequate threshold also lead to many
false positives (triggering on any abrupt massive change, for example the
blast of an explosion)?

~~~
tripzilch
You could differentiate single-frame inserted "subliminal" images (like the
ones in Fight Club) from other flashy things like explosions by the fact that
the frames right before _and_ after the flash should be nearly identical or
very similar, while after an explosion, things have usually changed a bit.

~~~
Breakthrough
This is definitely a good idea, and something I'm open to considering for a
future release of PySceneDetect. Admittedly the current version does not
handle single-frame "upsets" like this, but this seems like a logical and
reasonable first approach to filtering them out.

~~~
doubleunplussed
I would apply exponential smoothing to the pixel values over some timescale,
say 0.2 seconds, before doing further scene-change detection. That should do
the trick.
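A sketch of that smoothing pass; the frame rate, timescale, and synthetic frames below are assumptions for illustration:

```python
import numpy as np

def smooth_frames(frames, fps=25.0, timescale=0.2):
    """Exponential moving average of pixel values over roughly `timescale`
    seconds, applied before frame-difference scoring (a sketch of the idea
    above, not part of PySceneDetect)."""
    alpha = 1.0 / (fps * timescale)  # ~0.2 s worth of frames -> alpha = 0.2
    out = [frames[0].astype(float)]
    for frame in frames[1:]:
        out.append(alpha * frame + (1 - alpha) * out[-1])
    return out

# A one-frame flash barely moves the smoothed signal, so a downstream
# frame-difference detector is far less likely to flag it as a cut.
dark = np.zeros((4, 4))
flash = np.full((4, 4), 255.0)
frames = [dark] * 5 + [flash] + [dark] * 5
smoothed = smooth_frames(frames)
print(smoothed[5].mean())  # 51.0 -- the flash is damped from 255 to ~51
print(smoothed[6].mean())  # ~40.8 -- and decays away afterwards
```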

------
nabeards
Testing with some older black-and-white shows, the results aren't accurate
enough to catch fades-to-black. Any options I should be playing with?

~~~
Breakthrough
Thanks for giving PySceneDetect a try. Can you share what command line
arguments you're using?

For detecting fades-to-black, you want to make sure that you're using the
detect-threshold command (not detect-content). For example:

    scenedetect -i somevideo.mp4 detect-threshold -t 12 list-scenes

where -t specifies the threshold to use (the default being 12). Full
documentation for the detect-threshold command:
Full documentation for the detect-threshold command:

[https://pyscenedetect-
manual.readthedocs.io/en/latest/cli/de...](https://pyscenedetect-
manual.readthedocs.io/en/latest/cli/detectors.html#detect-threshold)

