Automatic Video Editing (tratt.net)
104 points by todsacerdoti on March 23, 2021 | 23 comments



A lot of the issues raised in this post are well solved by commercial video editing software.

For instance, markers and scene detection / edit lists in DaVinci, or word by word editing with recognition in Descript.

I feel like the OP should give them a try before trying to apply the universal script hammer to this problem. The ffmpeg line alone is frightening.

Use of Unix time instead of timecode is another problem here.

It’s cool that he wrote this, but it doesn’t conform to any video standards in use.


My guess is that it was written because the author much prefers writing code and automating tasks to video editing, and it was worth the extra time for them to do that.

Out of curiosity, do any commercial video editing programs offer automation/scripting? That seems like a natural place for scripting to help with common tedious tasks, if it's not already there. I think that could lead to the best of both worlds.


Yes, there are many existing automation options for video production applications. For example, the concept of a portable EDL (edit decision list) file has been well established in the industry for decades: https://en.wikipedia.org/wiki/Edit_decision_list

Scripting languages are also available for specific post production applications such as Final Cut: https://en.wikipedia.org/wiki/FXScript

Then a whole crop of cloud-based scripted content generators grew up, like stupeflix, sundaysky, and idomoo.com.

I think the author did a great job creating a fun project and explaining the ffmpeg workflow, but a video professional has many off-the-shelf options already.


Descript is the only thing I’ve seen that turns video editing into something resembling word processing - it gets very jump-cut-y and jerky, but most people are used to that in the TikTok/YT era.

It’s terrifically buggy right now but as a PoC it’s amazing.


Oh, and scripting-wise, you can fully automate DaVinci with Python.
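
Roughly like this (an illustrative sketch only - the exact calls and media paths depend on your Resolve version and setup):

    # Minimal sketch of Resolve's Python scripting API; run from Resolve's console
    # or with the bundled DaVinciResolveScript module on your Python path.
    import DaVinciResolveScript as dvr

    resolve = dvr.scriptapp("Resolve")
    project = resolve.GetProjectManager().GetCurrentProject()
    media_pool = project.GetMediaPool()

    # Import a couple of takes and drop them onto a fresh timeline.
    clips = media_pool.ImportMedia(["/path/to/take1.mov", "/path/to/take2.mov"])
    timeline = media_pool.CreateTimelineFromClips("Auto assembly", clips)
    print(timeline.GetName())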


The author seems to be using OpenBSD, so I imagine that rather limits their options.


I've been thinking a lot about this problem myself! Making screencasts, I realized most of my editing time was spent cutting out silence. If I recorded with that in mind (staying silent between re-takes), the editing became pretty straightforward but really tedious.

I made my own automatic editor tool [0] recently, and the approach I started with is simply cutting out the silent parts.
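
The gist of "cutting out the silent parts" is simpler than it sounds - here's a rough sketch of the idea (not my app's exact implementation; the threshold and window sizes are just illustrative defaults): scan the audio in small windows and collect spans whose RMS level stays below a threshold for long enough to be worth cutting.

    import numpy as np

    def silent_spans(samples, sample_rate, threshold_db=-40.0,
                     min_silence_s=0.5, window_s=0.02):
        # samples: a NumPy array of floats normalized to [-1, 1]
        win = int(sample_rate * window_s)
        spans, start = [], None
        for i in range(0, len(samples) - win, win):
            chunk = samples[i:i + win].astype(np.float64)
            db = 20 * np.log10(np.sqrt(np.mean(chunk ** 2)) + 1e-12)
            if db < threshold_db:
                if start is None:
                    start = i / sample_rate
            else:
                if start is not None and i / sample_rate - start >= min_silence_s:
                    spans.append((start, i / sample_rate))
                start = None
        return spans  # list of (start_s, end_s) ranges to cut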

I looked at other options like jumpcutter.py and auto-editor.py before building it, and they helped prove out the idea, but I really wanted an interactive UI, so I built an app in Swift.

I figured the automated approach is probably not gonna be perfect, so I built it to export XML/EDL/ScreenFlow files that can be imported to other editors for fine-tuning. It works pretty well and people seem happy with it so far.

Someone else mentioned timestamps vs. timecode. Maaan, timing has been a real thorn in my side with this project. The "even" framerates aren't too bad (30fps/60fps/etc.) but the uneven ones are a huge pain (29.97, 59.94). One fun problem recently was figuring out how to bin audio samples at 48kHz into frames at 29.97fps. Because each frame holds a non-integer number of samples (1601.6 on average), I had to alternate between assigning 1602 and 1601 samples per frame, or else my idea of time would slowly but steadily skew out of sync. As someone who'd never worked with video before, this has been a fun adventure.
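
In case anyone hits the same thing, here's a tiny sketch of one drift-free way to do that binning (illustrative, not literally my code): track the ideal running total of samples in exact integer math and hand each frame the difference.

    def samples_per_frame(num_frames, sample_rate=48000, fps_num=30000, fps_den=1001):
        # 48000 / (30000/1001) = 1601.6 samples per frame on average, so the
        # per-frame counts alternate between 1602 and 1601 to avoid drift.
        counts, consumed = [], 0
        for i in range(1, num_frames + 1):
            target = (i * sample_rate * fps_den + fps_num // 2) // fps_num
            counts.append(target - consumed)
            consumed = target
        return counts

    print(samples_per_frame(5))        # [1602, 1601, 1602, 1601, 1602]
    print(sum(samples_per_frame(10)))  # 16016 == round(10 * 1601.6)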

Right now I'm working on adding more manual control over the cuts, and this kind of stuff is what I'd like to tackle next! Automatic scene detection, ability to leave markers during recording, more control over transitions, stuff like that would be really cool. Feels to me like automatic editing might get more popular as more people realize it's even possible.

0: https://getrecut.com


Ugh, those messy frame rates. I like my Sony α6100, but they’ve made such a mess of the frame rates. They’ve tied the 25/30 decision to a PAL/NTSC switch (and this fact is completely undocumented, of course), and so if I want to record at 30 or 60fps instead of 25 or 50, I have to go to NTSC, which makes it pop up an annoying “Running on NTSC” message every time I start the camera in video mode, and then to add insult to injury it’s not even 30fps but 29.97fps, which is a real nuisance in editing. (This is all when recording on the camera; I don’t know if it would be any different via HDMI capture.)

But then I’ve been adding still more fun by recording audio on my laptop with a good USB microphone, and replacing the audio recorded on the camera. But the laptop’s clock runs a bit slower than the camera’s¹ so I have to stretch the computer-recorded audio by 0.003% (Audacity: Effect → Change Speed) to counter what would otherwise be about 110ms of drift per hour (3.3 frames at 30fps). Good times.
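
(The arithmetic behind those figures, for the curious:)

    # A 0.003% clock-rate mismatch accumulated over an hour:
    drift_fraction = 0.003 / 100
    drift_ms = drift_fraction * 3600 * 1000
    print(drift_ms)               # ~108 ms of drift per hour
    print(drift_ms / 1000 * 30)   # ~3.2 frames at 30fps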

¹ To measure rates absolutely rather than just one device against another, you need a canonical time source; internet time servers will do, so you compare what the audio device is sending with what internet time servers report over the course of a few hours to eliminate jitter. As for why you’d care, I cared so I could tune my piano more accurately. I measured the alleged 22050Hz as being about 22048.862Hz, which means it’s running at about 99.99484% of real speed. This difference amounts to about −0.09 cents, which is utterly unimportant for a solo instrument, but hey, it’s the principle of the thing.
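
(The cents figure is just the standard ratio-to-cents conversion:)

    import math

    measured, nominal = 22048.862, 22050.0
    cents = 1200 * math.log2(measured / nominal)
    print(round(cents, 3))   # -0.089, i.e. about a tenth of a cent flat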


Your software is such a great idea. The UI could use some work to be more attractive, but the core functionality is top notch.

Here's a suggestion: using Speech.framework[0] you could probably quite easily transcribe the audio, identify filler words ("umm", "hmm", etc.), and add an option to automatically exclude those as well.

[0] https://developer.apple.com/documentation/speech
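
The filtering step itself is simple once you have word-level timestamps from whatever recognizer you end up using; a rough sketch (the Word type and padding value here are made up for illustration):

    from dataclasses import dataclass

    FILLERS = {"um", "umm", "uh", "uhh", "hmm", "er", "ah"}

    @dataclass
    class Word:
        text: str
        start: float  # seconds
        end: float

    def filler_cut_ranges(words, padding=0.05):
        # Return (start, end) ranges to drop from the timeline.
        return [(max(w.start - padding, 0.0), w.end + padding)
                for w in words
                if w.text.lower().strip(".,!?") in FILLERS]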


Thanks! I like the Speech framework idea. I've toyed with it a bit and the results were not so great, especially offline, and the online version has some limits. I think if I want to properly add transcription I'll need to integrate with some SaaS solution, but I need to do a little more experimenting first.

Do you have any specific suggestions or critiques for the UI? I definitely agree it could be more attractive but I've had trouble figuring out what to do besides "make it look like Final Cut" or whatever. (or maybe I should actually do just that!)


By "Automatic scene detection" do you mean somehow detecting in the video if interesting things are going on and avoiding cuts around those spots? Because a lot of videos I record have silence in the audio when important things might be happening in the video, and having to manually go find those places is a bit of a pain.

Of course, automatically detecting which parts of the video are interesting or not is probably impossible, but it sure feels like an interesting problem to try to tackle.


Something like that. I could base it on frame-to-frame changes, or if my app was doing the screen recording, I could look at keyboard/mouse input as another signal of "non-silence".

I'm pretty much looking at silence removal as a good starting point. My overall goal is to cut down on the manual editing required, so I'm just looking for repetitive processes that I could add automation for.


"Aeschylus is the worst written bit of software I’ve put my name to since I was about 15 years old and it will probably never be usable for anyone other than me."

Hilarious and honest.


This reminds me of a cool tech demo of an "enhanced tool" for video editing I saw in January - https://www.youtube.com/watch?v=Bl9wqNe5J8U from descript.com (no affiliation).


Love stuff like this. It feels like we're close to some really interesting things here but I haven't quite seen it yet. Facebook/Apple have their "auto movies" but they're largely just montages over music. Any interesting/useful audio captured in those clips just seems ignored.

My brain keeps cycling on what GPT-3-like things could possibly enable here. Could there be some interesting algorithm trained on which clip should come next - use these 10 seconds, or skip and check again?

Self-promotion, but I did fool around with a way to try to automate making stop-motion movies: https://www.trylocomotion.com

It's a rudimentary process of letting people take a video and running a reverse motion-detection algorithm: "when there's nothing moving in frame, use that frame for the stop-motion movie." But it was a fun dive into this world.
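
The core of that "nothing moving in frame" check can be sketched with plain frame differencing; something along these lines (simplified, with a placeholder threshold and file name, not the real pipeline):

    import cv2
    import numpy as np

    cap = cv2.VideoCapture("input.mp4")
    still_frames, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (9, 9), 0)
        if prev is not None:
            # Mean absolute difference between consecutive frames ~ amount of motion.
            if np.mean(cv2.absdiff(gray, prev)) < 2.0:
                still_frames.append(idx)   # candidate frame for the stop-motion cut
        prev = gray
        idx += 1
    cap.release()
    print(still_frames)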


Makes me think of https://www.descript.com/



It composites his face from a camera input into the screencast, and automatically performs the editing for re-takes.


After having recorded close to 600 screencast videos, I automated a number of setup and teardown processes too.

For example, I use the Sizer[0] tool to move windows to specific 1920x1080 coordinates of the screen where I've configured OBS to record from. This way my desktop resolution never needs to change. Using Sizer just requires right-clicking a title bar and choosing a pre-created menu item; it then auto-resizes and positions the window correctly. Very painless.

But I also have these little shell scripts that are responsible for setting up font sizes and making sure my history is clear. Not showing history is important if you're using CTRL+r and FZF frequently, because having to blur stuff later is time-consuming and error-prone (i.e., missing 1 frame of blur by accident). The stop-record script reverts everything back to normal.

    # Hide shell history and bump the terminal font before recording.
    record-start () {
        mv ~/.bash_history ~/.bash_history.bak && history -c
        rm /tmp/%*
        change_terminal_font 9 18

        # Optionally launch OBS (running the Windows build from WSL).
        if [[ "${1}" = "--obs" ]]; then
            cd "/c/Program Files/obs-studio/bin/64bit"
            wslview obs64.exe
            cd -
        fi
    }

    # Restore the shell history and the normal font size afterwards.
    record-stop () {
        mv ~/.bash_history.bak ~/.bash_history && history -r
        change_terminal_font 18 9
    }

    # Swap the Windows Terminal font size by editing its settings.json in place.
    change_terminal_font () {
        [[ -z "${1}" || -z "${2}" ]] && echo "Usage: change_terminal_font FROM_SIZE TO_SIZE" && return 1

        local from="${1}"
        local to="${2}"
        local windows_user="$(powershell.exe '$env:UserName' | sed -e 's/\r//g')"
        local terminal_config="/c/Users/${windows_user}/AppData/Local/Packages/Microsoft.WindowsTerminal_8wekyb3d8bbwe/LocalState/settings.json"
        perl -i -pe "s/\"fontSize\": ${from}/\"fontSize\": ${to}/g" "${terminal_config}"
    }

I don't think I'll ever automate the editing process, because editing is where you can throw in a lot of nice human touches, like zooming into a specific area of the screen for emphasis or adding an overlay picture for context.

But I do try to make things as live as possible, such as using OBS scenes to cut down on post-processing. That, and automating your audio processing so you don't need to edit your audio afterwards, has given me the biggest bang for my buck in terms of how fast I can go from an idea in my head to a video ready for YouTube.

A complete list of tools that I use for dev + recording + editing can be found here: https://nickjanetakis.com/blog/the-tools-i-use

[0]: http://www.brianapps.net/sizer4/


I dunno - tools like this can take care of the 80% drudgery, freeing you to really focus on that 20% that provides the human touch :)

Unless you are also recording all your streams before you make your on-the-fly director's cut with OBS, if you make an "oops" you're done. With his approach, if you don't like the automatic edits, the underlying source files are still there and you can override the automation.

It would be easier if he did the automation within a traditional NLE workflow; overriding the automatic editing would be a lot easier. Since, ya know, editors were designed to make and keep track of changes (ha!)


I think it comes down to recording styles too.

My workflow is to start recording with OBS. Since I use a webcam in the corner while recording my screen, I'm aiming for as few cuts as possible, because with a webcam, unless your face is positioned exactly how it was before, watchers will see the jump cut (even if it's subtle). Editing where to cut manually to produce the least visible cut is an art form and takes a human touch.

But I'll press record and do my best. If I get, let's say, 5 minutes of a 20-minute video down solid but then screw up, I'll stop the recording. Then I'll start recording another file with OBS and resume where I left off, trying to place my mouse cursor exactly where it was and leading off by saying what I was saying before so it flows.

In the end I might have 2-5 relatively good videos that I then edit together using an NLE tool. Knowing where the cuts are is easy because each file runs pretty much from the beginning to the end.

This also helps reduce massive file sizes, where you end up with a single 45-minute source video that gets edited down to 20 minutes because you made a ton of little mistakes. At a decent scale this matters because disk space, while cheap, isn't free for someone who is just a solo developer and does everything from 1 dev box. Usually by the end of a recording session I'll have something like 15 source videos that I delete because I know they came out bad (oftentimes getting through the first few minutes is the hardest for me); here's a screenshot of what I mean, haha: https://twitter.com/nickjanetakis/status/1347574482714685441

Normally I edit my stuff at 2x speed. If I'm not adding a lot of extra effects (zooming, tooltips, highlights, blurs, overlays, etc.) it goes by pretty fast. With workflows that you're used to, editing really isn't that bad. It's creating the content/material and executing the human part (delivering the video - execution, basically) that takes most of the time.

Then there's also editing things like an audio-only group podcast. There's no way an automated editing process is going to be able to intelligently remove ums and ahs while leaving in a few at key places to make things sound more natural, or remove a bit of stuttering from someone's line in a way that no one would ever notice. Or perhaps cut out 2 minutes altogether because it doesn't add much to the conversation and isn't referenced later, so it's safe to cut and no one would ever know.


Checked your tools page for "zoom" with Ctrl-F - wondering what you are using for zooming in.

Some years ago I had a Microsoft mouse that included a third button and a driver add-on (I think) that gave a great magnified box you could move around the screen until you released the third button.

I would love to find a way to do this magnified box again - since zoom only comes up in your comments, should I assume that you are using Camtasia and doing it in post?

I'd love to have this zoom ability for making videos but also when screen sharing live.


Live zooming is something I tried but ultimately stopped doing, because it's too difficult to live code + narrate my thought process + zoom in on demand due to a lack of hands. Maybe if I had a foot pedal or something to control it, heh.

I zoom in post-production during the editing process. Once you get used to your tools it's fast. It takes about a minute to zoom into a specific area of the screen, position it in the exact spot I want, and then eventually zoom back out to normal. I like this process because it lets you adjust the zoom transition speed as needed, and sometimes I also offset the X/Y coords to center it, etc.



