Learn FFmpeg the hard way (github.com)
586 points by dreampeppers99 on Jan 20, 2018 | 79 comments

>In summary this is the very basic idea behind a video: a series of pictures / frames running at a given rate.

>Therefore we need to introduce some logic to play each frame smoothly. For that matter, each frame has a presentation timestamp (PTS) which is an increasing number factored in a timebase that is a rational number (where the denominator is known as timescale) divisible by the frame rate (fps).

Constant framerate is a very dangerous assumption to make. It's true for video recorded by professional cameras, but if you check video recorded by most mobile phones you notice that it's wrong.
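To make the quoted PTS/timebase relationship concrete, here is a minimal sketch. The 1/90000 timebase and the PTS values are made up for illustration; the only point is that a frame's presentation time is pts × timebase, and nothing forces the gaps between frames to be equal.

```python
from fractions import Fraction

def presentation_times(pts_values, time_base):
    """Convert integer PTS values to seconds: seconds = pts * time_base."""
    return [pts * time_base for pts in pts_values]

# Hypothetical values: a 1/90000 timebase and five frames whose spacing
# is not uniform, as you might see in a phone recording.
time_base = Fraction(1, 90000)
pts_values = [0, 3000, 6000, 10500, 13500]

times = presentation_times(pts_values, time_base)
gaps = [later - earlier for earlier, later in zip(times, times[1:])]

print([float(t) for t in times])
print([float(g) for g in gaps])  # unequal gaps mean the frame rate is not constant
```

Code that derives the frame time from a frame index and an assumed fps would get the fourth frame wrong here; code that reads the PTS would not.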

Any tutorial will not get very far if every corner case is spelled out every time. The point of “learning” documents is to give enough concepts and useful information for the reader to have a basis to start building additional knowledge on top of. Getting into the weeds quickly derails this process.

what's wrong with presenting everything in the correct abstraction to start with?

Whether you take a public school math class or follow a git tutorial online, you always get the same thing: a list of procedures for using the tools in the most common cases you are likely to encounter. This seems like a good idea in the move fast and break things ideology, but what do we notice about people's skills in the real world? No one knows how git works, and no one knows any math by the time they leave high school.

Give people a set of tools and prove to them that it solves every problem in some domain. If they can't solve the problems using the most primitive, complete toolset you can give them, then the case where they solve the problem is the edge case.

As an example, often, we will have students write fractions in "lowest terms", then only present students with fractions in which the numerators and denominators only share a few, small common prime factors. But checking prime factors is absolutely the hardest way of solving this problem, and unless you're exhaustive in your search, you can't have any confidence that you actually have solved it.

Students are intrinsically aware that this would be tedious in general, and it gives them anxiety to know that they could have missed something, or not tried sufficiently large divisors. They have no confidence in their tools, and rightly so. The only reason they can solve these problems at all is because the instructor gives them problems which can be solved this way.
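For the lowest-terms example, a tool that is complete in this sense would be something like Euclid's algorithm. That is my assumption about what is meant, since the comment does not name it. A minimal sketch, with a fraction whose only common factor is a prime too large to find by trying 2, 3, 5, 7:

```python
def gcd(a, b):
    """Euclid's algorithm: replace (a, b) with (b, a mod b) until b is zero."""
    while b:
        a, b = b, a % b
    return abs(a)

def lowest_terms(numerator, denominator):
    """Reduce a fraction by dividing both parts by their greatest common divisor."""
    if denominator == 0:
        raise ZeroDivisionError("denominator must be non-zero")
    g = gcd(numerator, denominator)
    return numerator // g, denominator // g

# 2021 = 43 * 47 and 2491 = 47 * 53, so the only common factor is 47.
print(lowest_terms(2021, 2491))  # (43, 53)
```

No factoring is involved, and the procedure always terminates with the answer, so there is nothing to "miss".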

The origin of the question "when will I use this in the real world?" is the instinct that the tool you have been given is only for edge cases, and cannot be relied upon.

I think you are exactly wrong in your assessment. Getting into the weeds is the only way to learn, because almost everything in the world is weeds, some of those weeds just happen to be called crops.

> what's wrong with presenting everything in the correct abstraction to start with?

Nobody can retain the material that way. Each thing you learn has to be attached to other concepts you already have, and your skills build on one another.

Nobody teaches Peano's axioms of arithmetic to kindergarteners; they're still learning how to identify shapes, to compare quantities. When they eventually learn even the simplest proofs, they'll build on the basis of careful attention to detail that they learned while mastering basic arithmetic.

Nobody teaches general relativity on the first day of college-level physics class. Students are still learning calculus at that point; even if they already had a calculus course, this is their first opportunity to apply it.

And nobody teaches the fundamental proofs of calculus on the first day of calculus class; you can use sloppy language like "infinitesimals" to establish a good intuition for how to use derivatives and integrals.

If you tried introducing this material from the bottom up, it wouldn't take. But if I'm wrong, sure, go try it and see how it works.

You're right, I shouldn't have said the correct abstraction, but you should at least provide a complete one. You don't have to build from the most fundamental possible assumptions to build from a set of tools that is provably complete for some domain of problems.

Doing otherwise is like giving someone a screwdriver that only turns clockwise.

Also, I challenge the notion that "no one can retain material that way". Have you ever met anyone who has retained the material any other way who didn't do it despite the system?

Essentially, this is impossible for anything that’s not easily explainable all at once- so it’s useless for any advanced domain.

You can’t just give people a firehose of information. When you learn acoustic engineering, you have to first learn the prerequisite math needed to understand later concepts such as room architecture. It simply does not make sense to jam pack it all in at once, because 1) it won’t make sense and 2) nobody has that kind of memory.

Next, if you want to cover all possible use cases, that could take forever in certain domains.

Last, some of this stuff is subjective as to what’s necessary, or what cases need to be covered.

No, what’s better is to give general knowledge as needed, and the student can seek out knowledge for various corner cases. It would be ridiculous otherwise.

>> not true for videos shot with mobile phones

> Any tutorial will not get very far if every corner case is spelled out every time.

Is mobile phone video a corner case?

I wouldn't consider options other than constant bit rate (variable bit rate) as corner cases, but your other points stand.

Parent said frame rate, not bitrate. They are orthogonal to each other.

[1] is a useful list of such assumptions. Found it very handy when programming AV handling.

[1]: https://haasn.xyz/posts/2016-12-25-falsehoods-programmers-be...

See also: this talk from a Vimeo engineer about the video encoding horrors they've encountered in the wild.

https://www.youtube.com/watch?v=cRSO3RtUOOk / https://speakerdeck.com/demuxed/things-developers-believe-ab...

Honestly I don't quite know how or why engineers keep working in a "all of this is fundamentally broken always has been and never will be fixed" domain, on a motivational basis.

"all frame timestamps are monotonically increasing"


> It's true for video recorded by professional cameras, but if you check video recorded by most mobile phones you notice that it's wrong.

It's unfortunate that so many tools assume constant framerate, because VFR is a useful compression technique in itself --- for sequences where there's basically no change between frames (few-second black scenes, for example), the encoder can stop continually encoding "no change" and just wait until there is.

In my experience even cameras with constant framerates aren't really constant: if you look at the timestamps, the delay between frames typically varies by a few (dozen) ms. So on average you might have 30fps, but if you sample a random second you might end up with 28-32 frames.
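If you want to check this on your own footage, a rough sketch follows. It assumes you have already extracted the frame timestamps in milliseconds by some other means; the numbers below are invented.

```python
def frame_gaps(timestamps_ms):
    """Delay between consecutive frames, in milliseconds."""
    return [later - earlier for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]

def frames_per_window(timestamps_ms, window_ms=1000):
    """Count how many frames fall into each consecutive window of window_ms."""
    counts = {}
    for t in timestamps_ms:
        index = t // window_ms
        counts[index] = counts.get(index, 0) + 1
    if not counts:
        return []
    return [counts.get(i, 0) for i in range(max(counts) + 1)]

# Made-up timestamps from a nominally ~30 fps source with some jitter.
timestamps_ms = [0, 31, 70, 98, 135, 166]
print(frame_gaps(timestamps_ms))  # [31, 39, 28, 37, 31]
```

Running `frames_per_window` over a longer recording shows directly whether every second really contains the nominal number of frames.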


That comes from the old NTSC standard and the history of adding colour in a backwards compatible way:


In the case that the above posters are mentioning, this probably has less to do with NTSC compatibility and more to do with bad timing hardware or cameras that attempt to do too much in PIO mode instead of using real encoding hardware.

I wouldn't say bad timing means bad hardware though, even expensive scientific cameras suffer from this. You need rather complex electronics to capture at a high and consistent rate. I think normally the encoding is done by software.

Glad to see that video linked; it's a terrific explanation.

This really hits if you try to merge audio which has been recorded separately from video (e.g. Raspberry Pi where at least a few years ago software did not support recording audio synchronously with video).

And the worst thing is that it can vary from one device to another, because the crystal oscillators used for the image sensor clock have slightly different resonant frequencies (caused by manufacturing tolerances when cutting the crystals).

This was something I had to find out the hard way in the process of writing this


Without interpolative resampling, it didn't work well for me at all.

To be fair, it is possible to do this from the command line with ffmpeg when using concat[1][2]. I did this when creating real-time video from a screencast in the Chrome remote debugger.

[1] https://ffmpeg.org/ffmpeg-formats.html#concat

[2] https://trac.ffmpeg.org/wiki/Slideshow
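A sketch of how such a list could be generated for the concat demuxer, which accepts `file` and `duration` lines. The file names and timestamps are hypothetical, and the names are assumed to contain no quotes or other characters that need escaping.

```python
def concat_list(frames):
    """Build concat demuxer input from (filename, timestamp_seconds) pairs.

    Each frame is shown until the next frame's timestamp. The last frame
    has no successor, so it is listed without a duration.
    """
    lines = []
    for (name, start), (_, next_start) in zip(frames, frames[1:]):
        lines.append(f"file '{name}'")
        lines.append(f"duration {next_start - start:.3f}")
    if frames:
        lines.append(f"file '{frames[-1][0]}'")
    return "".join(line + "\n" for line in lines)

# Hypothetical screencast frames captured at irregular intervals.
frames = [("frame0.png", 0.0), ("frame1.png", 0.5), ("frame2.png", 0.75)]
print(concat_list(frames), end="")
```

The result would be written to a text file and passed to ffmpeg with `-f concat -i list.txt`; see the linked docs for the remaining options.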

Having built programs that used the ffmpeg libraries (as well as x264) I have to say that a tutorial that is up to date and recent like this would have been very helpful at the time. Glad to see somebody is undertaking this effort.

I agree. ffmpeg can be inscrutable sometimes. Understanding the internal concepts is a big boost when using it. For example I've used "-avcodec" hundreds of times in the command line but I just now understood where that fit in.

All in all, a good intro tutorial that gets into some of the common professional use cases. On "constant bitrate" assumptions and some of the subsequent discussion here: ANY transform-and-entropy codec like VP9 or H264 will ALWAYS be variable bitrate internally. In the pro broadcast world, where you can't go over your FCC-allocated radio frequency bandwidth allocation by even one iota (or nastygrams and fines ensue), this is "handled" by having the encoder undershoot the actual target (say it's 5mbit), and then the stream is padded with null packets to get a nice even 5mbit/s. This also happens with LTE broadcast. The encoders that do this reliably are fabulously expensive and of a rackmount flavor.

Reading this reminds me of all of the days spent using Manzanita MPEG software to get streams to "work". While it's true that the bit rate may/will fluctuate, it was the muxing software that saved the stream. The muxer would introduce those null packets. The studio I used to write automated workflows for would get work specifically because we could make streams work that other facilities could not. Manzanita software was always the difference. Rarely did the output of an encoder (software or hardware) work directly. However, remuxing the same elementary streams with Manz software would pretty much always solve the problem. If that didn't do it, it was probably a VBV buffer under/overflow issue around a scene change somewhere in the stream.

Oh gawd how I don't miss those days.

I'd love to hear more details about those things. I'm guessing it's not as simple as wiggling the quality around to keep the output within a certain size.

That's basically it. All of the digital broadcast streams (over the air, cable, satellite) use a fixed bitrate per physical channel, and send an mpeg transport stream. A transport stream is built of fixed length packets. Within the transport stream you can multiplex different programs. OTA gets 20Mbps per channel, if you use that for one program, it's likely you may not use the whole thing, so you include null packets as needed to fill the stream. If you send multiple programs, you probably will fill the stream, so you have to reduce quality and/or play games with timing of i-frames and possibly adjusting program start times or commercial break lengths to avoid having multiple programs needing high bitrate at the same time.
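The padding arithmetic can be sketched as follows. Transport stream packets are 188 bytes and null packets use PID 0x1FFF; the bitrate and packet counts are example numbers, and a real muxer also has to schedule packets in time, which this ignores.

```python
TS_PACKET_BYTES = 188  # MPEG transport stream packets have a fixed size
NULL_PID = 0x1FFF      # PID reserved for null (stuffing) packets

def null_packets_needed(payload_packets, mux_bitrate_bps, interval_seconds):
    """Number of null packets that pad the payload up to the constant mux rate."""
    total_packets = int(mux_bitrate_bps * interval_seconds) // (TS_PACKET_BYTES * 8)
    if payload_packets > total_packets:
        raise ValueError("payload exceeds the channel budget for this interval")
    return total_packets - payload_packets

def make_null_packet():
    """One null packet: sync byte 0x47, PID 0x1FFF, payload-only flag, 0xFF filler."""
    header = bytes([0x47, (NULL_PID >> 8) & 0x1F, NULL_PID & 0xFF, 0x10])
    return header + b"\xff" * (TS_PACKET_BYTES - len(header))

# Example: the encoder undershoots a 5 Mbit/s target and produces 3000 packets
# in one second; the remainder of the budget is filled with null packets.
print(null_packets_needed(3000, 5_000_000, 1))  # 324
```

When several programs share the stream, the same budget applies to their sum, which is why the quality and timing games described above become necessary.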

I wrote about creating adaptive streaming over HTTP (MPEG-DASH) using ffmpeg and mp4box before [1]. It is nice to see that ffmpeg is incorporating more and more of the same features.

Has anybody got ffmpeg to create HLS out of mp4 fragments as well? It would save some conversion time and space, since the same MP4 fragments could be used with the two different manifest files, DASH and HLS. Mp4box is not quite there yet, sadly.

[1] https://rybakov.com/blog/mpeg-dash/

I think they're almost there in two ways. 1) is https://github.com/gpac/gpac/issues/790. 2) is to extend FFmpeg to include the MP4Box muxer: http://www.gpac-licensing.com/signals/

Thanks for the links. Yes, the gpac dev version is close to having a working solution - https://github.com/gpac/gpac/issues/772

Not quite sure what the signals project strives to be. Is it a framework for commercial applications built on top of gpac?

Why exactly do people call good explanatory manuals "the hard way"? The hard way to learn something is using nothing but the official reference manual or the man pages. What I have found by the link is what I would rather call "for dummies" :-)

Probably as a reference to Zed Shaw and the website https://learncodethehardway.org/

I feel like everything I do with ffmpeg is the hard way.

It is really an amazing piece of technology, but whoah does it have some gotchas.

Consider GUI wrappers the "for dummies" version.

I actually find those I tried rather unfriendly. Every time I have to convert something I start by looking for a quick-and-easy GUI solution but end up using command line because it's more intuitive (!), more flexible and it actually works (GUI wrappers don't always do, it often happens that you click "start" or whatever and nothing happens or a nonsensical error pops up).

By the way, it's always easier to teach a not particularly bright person to use a textual dialogue command line interface than a complex GUI: you just tell them (and let them write down) what they have to type, what response they can expect and how they are to respond to it if it's this or that. I can remember how easy it was to teach my granny to use the UUPC e-mail system under DOS and how much harder it was to teach GUIs.

Some things are not offered by the CLI where GUIs can provide value-add, e.g. audio + video bitrate calculations for 2pass encoding targeting particular file sizes, stream selection dropdowns when you want to burn in one of several subtitle tracks etc.
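The bitrate calculation such GUIs do is simple arithmetic. A sketch, assuming sizes in MiB and bitrates in kbit/s (1000 bits), and ignoring container overhead, so a real tool would leave a small safety margin:

```python
def video_bitrate_kbps(target_size_mib, duration_seconds, audio_bitrate_kbps):
    """Video bitrate that makes video + audio fit the target file size."""
    total_kbits = target_size_mib * 1024 * 1024 * 8 / 1000
    video_kbps = total_kbits / duration_seconds - audio_bitrate_kbps
    if video_kbps <= 0:
        raise ValueError("audio alone does not fit in the target size")
    return video_kbps

# Example numbers: a 90-minute video, 128 kbit/s audio, 700 MiB target.
print(round(video_bitrate_kbps(700, 90 * 60, 128), 2))  # 959.41
```

The resulting number is what you would then pass as the video bitrate for both passes of a 2-pass encode.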

Consider WebM4Retards[0]. It provides a simple UI for some avisynth filters + ffmpeg while also allowing you to export the avs scripts and ffmpeg command line if you want to run them manually. It fills a particular niche between simple transcoding/muxing and non-linear video editors plus encoding pipelines.

Additionally GUIs can help with tasks that you execute infrequently enough that you don't manage to memorize the myriad of command line options. MeGUI is convenient when adding chapter information to a video which I do once in a blue moon. In principle I could do this with text files and mkvtools, but it's just faster to use the GUI than reading the manual (again).

[0] https://gitgud.io/nixx/WebMConverter

Handbrake is a pretty solid and easy to use GUI wrapper of ffmpeg.

It handles a few (of the most common) use cases but it is not a generic interface for all of ffmpeg's functionality.

And strictly speaking, it is built on top of Libav, although they are considering switching over after 1.2

This looks pretty awesome..

I was hoping this was about the command-line tool, which is quite daunting to learn once you start getting into the filters and combining/merging multiple audio and video sources.

In a recent project I eventually had to write a script to do 2 passes of ffmpeg after spending a good day trying to figure my way through all the filter documentation to do a 5-step process in one pass..

This isn't a knock against ffmpeg, which is an amazing open source package, but it's complicated, so I'm sure this repo will be of great help to many.

The ffmpeg commandline tool is ultimately just a frontend to all the internals.

So understanding how everything works internally will translate directly to a better grasp on how to use ffmpeg at the commandline.

Even though this document is very new I now understand why fiddling with the PTS value in the filtergraph would slow down/speed up some video I was tinkering with a while back, for example.
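A toy model of why that happens: an expression such as `setpts=0.5*PTS` rescales every timestamp, so the same frames get presented over a shorter or longer span. The timebase and PTS values here are invented.

```python
from fractions import Fraction

def scale_pts(pts_values, factor):
    """Apply the same multiplier to every timestamp, like FACTOR*PTS would."""
    return [pts * factor for pts in pts_values]

def duration_seconds(pts_values, time_base):
    """Time between the first and the last frame's presentation."""
    return (pts_values[-1] - pts_values[0]) * time_base

# Hypothetical clip: five frames, 40 ticks apart, in a 1/1000 timebase.
time_base = Fraction(1, 1000)
original = [0, 40, 80, 120, 160]
faster = scale_pts(original, Fraction(1, 2))  # halved PTS: plays twice as fast
slower = scale_pts(original, 2)               # doubled PTS: plays at half speed

for label, pts in (("original", original), ("faster", faster), ("slower", slower)):
    print(label, float(duration_seconds(pts, time_base)))
```

The frame count is unchanged in all three cases; only the times at which the frames are shown move.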

I’ve spent a large portion of the past month working with ffmpeg. We are using it to repair, concat, and mux an arbitrary number of overlapping audio and video streams recorded via WebRTC. While this sounds straight forward on the face of it, the interplay of filters and controls is impossible to predict without extensive experience.

To my knowledge ffmpeg is the only tool that could possibly do this. It took a whole lot of reading man pages, bug trackers, stack exchange and various online sources to stitch together the necessary info. This seems like a great resource to learn what the tool is really doing and skip scrambling for information so much.

FWIW I found a python library moviepy a lot easier to work with (although sometimes you still have to drop down to ffmpeg as it won't do everything ffmpeg can do).

Moviepy is basically an easy to use wrapper around ffmpeg and some other utilities like ImageMagick. Not as full featured, but one can get quite a long way with it.

I'm sure you did, but just in case - did you check out SoX?:


I've used both FFMpeg and SoX extensively for audio and have found SoX very useful on more than 1 occasion.

I'm also working with ffmpeg and am wondering what I'm doing wrong. I have a HTML5 canvas animation editor, and I use node-canvas to convert the canvas to images at 30 fps and then attach audio at the correct times. ffmpeg takes too long...

How does a site like Playbuzz render things so fast in their "Video" tool: https://editor.playbuzz.com/

Their renderer takes 5 seconds. Anyone know how they do it?

Sidenote: over the years I haven't seen a Linux utility crash so often as ffmpeg; not by a wide margin. So when calling this library, it is probably wise to assume that it may not return.

I use ffmpeg to transcode video, perhaps 10,000 times in the past year (part of an automated pipeline). It has failed about 50 times or so, almost every failure down to corrupted input and not a crash per se. It has deadlocked about 200 times, often enough that my tools monitor the log file I redirect output to and restart the job if too long passes without progress.

I'm amazed it's as stable as it is, given the complexity of video formats. Of course crashes are also potential security problems - I sometimes wonder how far you could get with a malicious video spread virally (in the social sense).

Probably good to document those use cases, collect sample clips and contribute back to the project as test cases.

Help improve the project for everyone.

I've made efforts in the past, but often I can't legally redistribute the problematic data. If the problem is obvious enough, I try and clean-room recreate it.

> often enough that my tools monitor the log file I redirect output to and restart the job if too long passes without progress.

Do you mind sharing an example of how you accomplish this?
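Not the parent's actual tooling, but one way this could look: a minimal sketch that redirects the job's output to a log file, treats "the log stopped growing" as a stall, kills the process and retries. The function name, the stall threshold and the retry count are all assumptions.

```python
import os
import subprocess
import time

def run_with_watchdog(cmd, log_path, stall_seconds=60, max_attempts=3, poll_seconds=1.0):
    """Run cmd with output going to log_path; restart it if the log stops growing.

    Returns (exit_code, attempt_number) for the first attempt that finishes.
    """
    for attempt in range(1, max_attempts + 1):
        with open(log_path, "wb") as log:
            proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
            last_size = 0
            last_progress = time.monotonic()
            while proc.poll() is None:
                time.sleep(poll_seconds)
                size = os.path.getsize(log_path)
                if size != last_size:
                    # The log grew, so the job is still making progress.
                    last_size, last_progress = size, time.monotonic()
                elif time.monotonic() - last_progress > stall_seconds:
                    # No new output for too long: assume a deadlock.
                    proc.kill()
                    proc.wait()
                    break
            else:
                return proc.returncode, attempt
    raise RuntimeError(f"no attempt finished after {max_attempts} tries")

# Hypothetical usage:
# run_with_watchdog(["ffmpeg", "-i", "input.mp4", "output.mkv"], "job.log")
```

This relies on the job writing to its output regularly while it is healthy; a job that is legitimately silent for longer than `stall_seconds` would be killed by mistake.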

By utility, I assume you mean the CLI binary. There have been issues with filtergraph buffer management in the past year or so, but other than that, ffmpeg rarely crashes. What sort of commands crash for you?

Same here, I've rarely ever had anything that links ffmpeg crash, and I have a lot of ancient videos in old formats laying around.

Crash? Or notice and signal an error? I use it all the time and cannot remember it crashing in a bad way (because otherwise I would remember creating a bug report).

About 10 years ago I wrote a video encoding app with libavcodec/libavformat. Beyond very simple examples with everything set to default, there was almost zero material on how to use it. There was maybe one trivial example of decoding a file where all the parameters are guessed automatically. But I needed to encode a live video stream from a webcam. I spent a couple of weeks reading the code and hacking away until I got it right. I wish there was stuff like that back then.

When ffmpeg came along nearly 20 years ago, it seemed to be the absolute video toolbox I was looking for.

It was, but you needed what seemed to be 'dark arts' to use it. And that took a long long time.

I owe years of my career to working out how to use this.

I would gladly donate to anyone that wants to statically bind this for use with Go. I really would love to see the power of FFmpeg open to more apps.

I've used FFmpeg libs dynamically linked, but it requires FFmpeg be installed on the system.

Note any such code would be covered by the GPL. This is why so many people use the CLI: by using pipes or sockets, your code (as far as I can determine; you should get real legal advice) can be released under a different license.

I would gladly donate to anyone that wants to statically bind this for use with c# with mono multiplatform support. I really would love to see the power of FFmpeg open to more apps. I've used FFmpeg libs dynamically linked, but it requires FFmpeg be installed on the system.

Indeed, this would be great.

This actually came up for me a couple days ago, and my ham-fisted solution was to bundle ffmpeg.exe and ffplay.exe in the DLL and extract when running :)

It solved the problem but is so gross!

IMO better to fork/exec ffmpeg to avoid memory leaks and security issues. The fork/exec takes a few microseconds and works well (you can even pipe it data). Of course, if the command line options don’t do what you want, that’s a different issue.

How to deal with efficiency issues from the huge amount of data being piped? Shared memory area and send pointers over the pipe?

Is this really a problem with pipes? I found that I can easily push 8-9 GiB/s over a standard pipe on my 4 year old desktop:

    yes | pv > /dev/null

I looked at some FFmpeg-Go bindings recently. Concluded that it was less effort to write exactly the functionality that's needed in C and expose that to Go, rather than fix up an all-encompassing API. More here: https://github.com/livepeer/lpms/issues/24

This is great! My experience with libav was very frustrating. There was no documentation or tutorial. The most useful thing was the example programs, but if a problem was not covered by the examples, I was doomed.

The worst part is that when the api doesn't work, you receive meaningless error messages. You don't know what's wrong.

I wrote a small wrapper library a decade ago that wraps the decoding capabilities of libavcodec/libavformat in a way that makes it relatively easy to use from other programming languages (Pascal in this particular case)


Note that this was one of the first C programs I ever wrote and the API is suboptimal (relies on structs being passed around instead of providing access via getter/setter functions). I don't really recommend that people use it, yet looking at the code might help people to get started with ffmpeg.

Also note that the libavcodec/libavformat libraries have come a long way in terms of ease of use. If you have a look at the first versions of my wrapper library, it required really weird hacks (registering a protocol) to get a VIO interface (i.e. have callbacks for read, write, seek).

All that being said, today I usually just spawn subprocesses for ffmpeg/ffprobe if I need to read multimedia-files, and I think that for most server-side applications this is the best method (it also allows to sandbox ffmpeg/ffprobe).
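For the subprocess approach, a minimal sketch of calling ffprobe and reading its JSON output. The helper names are made up; the ffprobe options used (`-v error`, `-print_format json`, `-show_format`, `-show_streams`) are standard, and the timeout is an arbitrary guard against a hung process.

```python
import json
import subprocess

def ffprobe_command(path):
    """Argument list asking ffprobe for container and stream info as JSON."""
    return ["ffprobe", "-v", "error", "-print_format", "json",
            "-show_format", "-show_streams", path]

def probe(path, timeout=30):
    """Run ffprobe as a child process and parse what it prints to stdout."""
    result = subprocess.run(ffprobe_command(path), capture_output=True,
                            text=True, timeout=timeout, check=True)
    return json.loads(result.stdout)

def video_codecs(info):
    """Codec names of all video streams in a parsed ffprobe result."""
    return [stream.get("codec_name") for stream in info.get("streams", [])
            if stream.get("codec_type") == "video"]

# Hypothetical usage, requires ffprobe on PATH:
# print(video_codecs(probe("input.mp4")))
```

Passing the arguments as a list avoids shell quoting problems with odd file names, and a crash or hang in the child cannot take the calling process down with it.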

Based on the code in this tutorial, it seems like it would already be easy to use from other languages. What did you need to change?

By "being easy to use from other languages" I refer to the size of the header that needs to be translated.

About a year ago I wrote a zsh script to analyse files in my collection and format/selectively re-encode them based on a set of rules. Basically, I wanted .mp4 files that the Roku Plex app could play back without having to do any kind of re-encoding (stream/container or otherwise).

That started me on a 2-week mission of understanding the command-line. Good God is it a nightmare! All my script was doing was reading the existing file, deciding on the "correct" video and audio track, creating an AC-3 if only a DTS existed, and adding a two-channel AAC track. It would also re-encode the video track if the bitrate was too high or the profile was greater than the Roku could handle.

Here's the thing that I discovered, as much as I swore at the command-line interface, I couldn't come up with a more elegant solution that would still allow the application to be controlled entirely from the CLI. Ffmpeg is capable of so much that figuring out a way to just "tell it what to do" from a CLI ends up with an interface that's ... that difficult to use. The program, very nearly, handles everything[0] as it relates to media files and simplifying the application to improve the CLI would necessarily involve eliminating features. It's one of those cases where a light-weight DSL would provide a superior interface (and to a certain extent, the command-line of ffmpeg nearly qualifies as a DSL).

[0] The "very nearly" was necessary because of the one feature I found that it didn't handle. It's not possible in the current version to convert subtitle tracks from DVDs to a format that is "legal" in an mp4 file because the DVD subtitles are images rather than text, whereas the only formats that are legal for the mp4 container are text. I found some utilities that can handle this, but didn't try any, opting instead to simply download them when I encountered a file with subtitles in this format.

Would you be willing to share this script?

I'm not entirely sure what state it's in (and the code is brutally hacked together -- I spent far more time reading up on how to mangle the ffmpeg CLI to get what I wanted than I did bothering to write anything resembling quality shell code)

I don't use it any longer, but I'll see if it's still sitting on my media server (but probably not until the weekend) - if you're on keybase, you can find me @mdip; I'll toss it in my public keybase filesystem folder (keybase.pub/mdip, I think) if I can locate it! :)

Always wondered why there were never any stellar guides for the underlying libraries of FFmpeg. Had to do a couple of projects with hardware encoding and decoding and the best example I found of how to use it was ripping apart mpv. At least it has samples of how to use the hardware encode features...

My project ran into a limitation that required only the use of FFmpeg libav without the command line tool. This is exactly what I needed. I'm eager to read the rest.

This is awesome -- I read the first few pages and I'm really looking forward to reading the rest. Very clearly explained; thanks for creating/sharing!

I have used FFmpeg for more than 10 years and I am amazed at how stable it is. Even though we tell our users "drop anything into our apps and we will do our best to handle it", it's exceptional to have an issue with it. Because the API is always evolving, the best advice is to read the headers; the best documentation is there and it's accurate.

Question related to FFMpeg: In the first few examples, there are input options defining the video and audio encoding -- why is this needed? Shouldn't a video container be self evident of things like that?

The video container of course tells you what codec the video is, but it can lie or be otherwise incorrect. FFMpeg will use what it says as the default, but you can override it if you'd like (or if you have multiple decoders for a given codec).

But yes, you're right. 99.9% of the time there's no need to specify input decoder library

FFmpeg can have any number of decoders/encoders. Sometimes you want to use a hardware decoder to read and a hardware encoder to write (like the Intel one), so you can override the default decoders/encoders for each codec.

I have used it on the command line; it really gives you fine-grained control. WinFF is an easier way and almost as good, within a front end. Nice article on it.

Thank you!
