- without dedicated processors, VP9 encoding is roughly 4.5x slower than H.264, while with the VPUs the two formats perform the same; this is a big win for the open format(s)
- sad extract: "Google has aggressively fought to keep the site's cost down, often reinventing Internet infrastructure and copyright in order to make it happen" (emphasis mine)
H.264 is an open format, just not royalty-free. The baseline profile of H.264 will soon be "free" once those patents expire in 2023. (Or basically MPEG-5 EVC.)
The hardware encoding for VP9 being as fast as H.264 is mostly due to hardware specifically optimised for VP9 rather than H.264. The complexity difference is still there.
So it's definitely not any more open than H.264.
because the wholesale destruction and minimization of knowledge, education, and information to appease (often arbitrary) intellectual protectionism laws is sad, regardless of who perpetrates it.
Non-Google example: What.cd was a site centered around music piracy, but that potentially illegal market created a huge number of original labels and music that still exist now in the legal sphere.
No one would defend the legal right of what.cd to continue operating; it was obviously illegal. But it would be sad to destroy the unique, novel, and creative works that came from the existence of this illegal enterprise.
Swinging back to the Google example: YouTube systematically destroys creations that it feels (often wrongly) infringe upon IP. Often that's not even the case; Google routinely makes wrong decisions, erring on the side of the legal team.
This destruction of creative work is sad; in my opinion it's sadder than the un-permitted use of work.
Of course, Google as a corporation should act that way, but it's sad in certain human aspects.
Have your own site, in your own individual name with no corporate entity nor search for profit, offering to host people's videos for free, and I guarantee you that within 24h you are dealing with things ranging from pedophilia to copyright violations and the like. And if you don't clear them out, you're the one responsible.
Google is acting the way society has decided it should act, through the laws it has voted for. Could they act in another, more expensive way in order to save a bit more of the content that gets caught by mistake? Definitely, but why would they as a company when the law says any mistake or delay is their fault?
Source: like many people, I once made a free image hosting thingy. It was overrun by pedos within a week, to my absolute horror and shock. Copyright infringement is obviously not the same at all, BUT the way the law acts toward the host is not that different: "ensure there is none and be proactive in cleaning, or else...".
If you vouch for someone who is dodgy now you are also seen as a little dodgier than you were before. This doesn't necessarily mean you lose your account because you happened to vouch for someone, but it might mean that your vouching means less in future.
Even if not supported by the article, here are two examples from the last couple of days of how YouTube is de facto defining copyright regulation.
Given their entire copyright takedown system is (in)famously entirely automated, I would have thought it would be trivial for it to always follow copyright law to the letter... if they wanted it to.
None of those things are trivial, and that's before rights assignment.
YouTube's system is built primarily to placate rightsholders and avoid human labor paid for by Google.
> this is a big win for the open format(s)
How is this a big win if you need dedicated processors for it to be as fast?
I honestly can't parse this sentence.
"Google creates dedicated custom proprietary processors which can process VP9 at roughly the same speed as a 20-year-old codec". How is this a win for opensource codecs?
For example, OSX doesn't support it at all, and iOS only supports VP9 in 4K/HDR mode and only for certain versions.
Heck, even the M1 supports VP9 in hardware up to 8K60.
I'm not sure where you got that macOS has no VP9 support, it works quite well.
YouTube uses so much bandwidth that this is still measured in millions of dollars but it’s really worth remembering that “at scale” to them is beyond what almost anyone else hits.
> After pushing out and upgrading to VP8 and VP9, Google is moving on to its next codec, called "AV1," which it hopes will someday see a wide rollout.
I still can't see how this is a win.
VP8 seems to be the best supported on all platforms without melting your Intel CPU. At least that was the case when I was deploying a WebRTC solution to a larger customer base last year.
> In May 2010, after the purchase of On2 Technologies, Google provided an irrevocable patent promise on its patents for implementing the VP8 format
This is significantly better than even h264 in terms of patents/royalties.
Would you mind elaborating on your hate? There's nothing for Google to abandon here. It's already out in the wild.
How did you even come up with this question?
Please go and re-read my original question: https://news.ycombinator.com/item?id=27035059 and the follow-up from me: https://news.ycombinator.com/item?id=27035112 and from another person: https://news.ycombinator.com/item?id=27036150
But yeah, sure, I hate hate hate VP9 smh
VP9 running on custom circuits being equivalent speed to h264 running on custom circuits seems like a win for VP9? Since VP9 isn't royalty encumbered the way h264 is, that could well be a win for the rest of us too.
I can only repeat myself: "Google creates dedicated custom proprietary processors which can process VP9 at roughly the same speed as a 20-year-old codec". How is this a win for anyone but Google (who is already eyeing to replace VP9 with AV1)?
"The rest of us" are very unlikely to run Google's custom chips. "The rest of us" are much more likely to run this in software, for which, to quote the comment I was originally replying to, "without dedicated processors, VP9's encoding is roughly 4.5x as slow as H.264".
Note: I'm not questioning the codec itself. I'm questioning the reasoning declaring this "a big win for the open format(s)".
I suppose you could sample random YouTube urls to find out how many of them link to public videos. Given the total number of possible URLs, it would give you an idea of what percentage of them have been used and therefore how many videos YouTube has in total. It would not tell you how many private videos or Google Drive / Photos videos exist of course.
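To make the sampling idea concrete, here's a rough sketch in Python. The check_exists function is a stand-in for however you'd actually probe YouTube (e.g. checking the HTTP status of a watch page); it's an assumption, not a real API:

    import secrets
    import string

    # YouTube-style IDs: 11 chars of URL-safe base64 (A-Z, a-z, 0-9, '-', '_')
    ALPHABET = string.ascii_letters + string.digits + "-_"
    ID_LENGTH = 11
    ID_SPACE = len(ALPHABET) ** ID_LENGTH  # 64^11 = 2^66 possible strings

    def random_id():
        return "".join(secrets.choice(ALPHABET) for _ in range(ID_LENGTH))

    def estimate_total_videos(check_exists, samples):
        # scale the observed hit fraction up by the size of the ID space
        hits = sum(check_exists(random_id()) for _ in range(samples))
        return hits / samples * ID_SPACE

The catch, as discussed below, is the hit rate: with ~1e10 videos in a ~7.4e19 space you'd expect roughly one hit per 7 billion samples.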
YouTube isn't the only platform where Google does video transcoding. I don't know them all, but here are a few other places where video plays a part:
Meet - I'm guessing participants on different devices (desktop, Android, iOS) and with different bandwidth will get different video feed quality. Though the real-time nature of this may not work as well? Still, Meet has a live stream feature for when your meeting is over 250 people, which gives you a YouTube-like player, so that is likely transcoded.
Duo - more video chat.
Photos - when you watch a photos video stored at google (or share it with someone), it will likely be transcoded.
Video ads. I'd guess these are all pre-processed for every platform type for efficient delivery. While these are mainly on YouTube, they show up on other platforms as well.
Nest cameras. This is a 24/7 stream of data to the cloud that some people pay to have X days of video saved from.
Another: YouTube TV
EDIT: Stadia: I did some searching around the interwebs and found a year-old interview that hints at something different.
> It's basically you have the GPU. We've worked with AMD to build custom GPUs for Stadia. Our hardware--our specialized hardware--goes after the GPU. Think of it as it's a specialized ASIC.
I can totally see it being non-trivial to count your videos, which is a funny problem to have. But I doubt it's unknowable. More like they don't care/want us to know that.
But knowing the exact number can indeed be hard. It would take stopping the entire uploading and deletion activity. Of course they may have counters of uploads and deletions on every node which handles them, but the notion of 'the same instant' is tricky in distributed systems, so the exact number still remains elusive.
But the question of the precise number of videos before that moment is indeed ill-defined.
It's like knowing the distance to the Sun is 93 million miles. The difficulty there isn't that measuring the distance from the Sun to the Earth exactly is hard (although it is), or that the distance is constantly changing (although it is), or that the question is ill-defined because the Earth is an object 8,000 miles across and the Sun is 100 times bigger, so which points are you measuring between?
The distance is "unknowable" because while we know what "93 million miles" means, it's much harder to say we really grasp what it means. Even when we try to rephrase it in smaller numbers, like "it's the distance you could walk in 90 human lifetimes", it's still hard to feel anything beyond "it's really really far."
Likewise, does it matter if YouTube has 100, 1000, or 10,000 millennia of video content? Does that number have any real meaning beyond back-of-the-envelope calculations of how much storage they're using? Or is "500 years per minute" the most comprehensible number they can give?
So, yeah, sampling would be a mostly futile effort, since you're looking to estimate about 8 to 10 decimal digits of precision. Though it's technically still possible, since you'd expect about 1 in every 50 million to 5 billion IDs to work (assuming somewhere between 10 billion and a trillion videos).
My statistics knowledge is rusty, but I guess if you could sample, say, 50 billion URLs you could actually make a very coarse estimate with a reasonable confidence level. That's a lot, but ignoring rate limits it's well within the range of usual web-scale stuff.
Just generate a random valid link and then check if it gives a video or not.
I found exactly 0 videos.
11 chars encode 66 bits, but actually 2 bits are likely not used and it's simply an int64 encoded to base64.
Given everyone and their grandma is pushing 128-bit UUIDs for distributed entity PKs, it's interesting to see YouTube keep it short and sweet.
Int64 is my go-to PK as well; if I have to distribute it, I make it hierarchical, but I don't do UUIDs.
The trade-off you make when using short IDs is that you can't generate them at random. With 128-bit IDs you can't realistically have collisions, but with 64-bit ones, because of the birthday paradox, as soon as you have more than 2^32 elements you're really likely to have collisions.
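A quick back-of-the-envelope for that claim, using the standard birthday-problem approximation P ≈ 1 − exp(−n²/2N):

    import math

    def collision_probability(n, bits):
        # P(at least one collision) when drawing n IDs uniformly from 2^bits;
        # expm1 keeps the result accurate when the probability is tiny
        return -math.expm1(-(n * n) / 2.0 ** (bits + 1))

    print(collision_probability(2**32, 64))   # ~0.39 at 4 billion 64-bit IDs
    print(collision_probability(2**34, 64))   # ~0.9997 at 17 billion
    print(collision_probability(2**32, 128))  # ~2.7e-20 with 128-bit IDs

So by 2^32 64-bit IDs a collision has quite likely already happened somewhere, while 128-bit IDs stay effectively collision-free.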
However, theft of the encryption key is a concern, since you can't rotate it and it just sat there in the code. Nowadays they do something a bit smarter to ensure ex-employees can't enumerate all unlisted videos.
Random 64-bit primary keys in MySQL for newer videos. These may sometimes collide, but then I suppose you could have the database reject the insert and retry with a different ID.
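That insert-and-retry pattern is only a few lines. A toy sketch with sqlite standing in for MySQL (hypothetical schema; the duplicate-key handling is the point):

    import secrets
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE videos (id INTEGER PRIMARY KEY, title TEXT)")

    def insert_with_random_id(title, max_attempts=5):
        for _ in range(max_attempts):
            vid = secrets.randbits(63)  # sqlite INTEGERs are signed 64-bit
            try:
                conn.execute("INSERT INTO videos VALUES (?, ?)", (vid, title))
                return vid
            except sqlite3.IntegrityError:
                continue  # ID already taken; roll again
        raise RuntimeError("too many collisions - check your RNG")

Given the birthday math above, the retry branch should essentially never fire until you're into billions of rows.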
Surely if it was a great chore for YouTube to have random-looking int64 ids, they would switch to int128. But they haven't.
I'm a big fan of the "works 99.99999999% of the time" mentality, but if anything happens to your PRNGs, you risk countless collisions slipping by you in production before you realize what happened. It's good to design your identity system in a way that would catch that, regardless of how unlikely it seems in the abstract.
The concept of hierarchical IDs is undervalued. You can have one machine hand out "namespaces" to others, and they can then generate IDs locally and check for collisions locally in a very basic way.
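A minimal sketch of that idea, assuming a made-up split of 16 high bits of namespace (handed out by a coordinator) and 48 low bits generated locally:

    import itertools

    class Coordinator:
        def __init__(self):
            self._ns = itertools.count()

        def new_namespace(self):
            ns = next(self._ns)
            assert ns < 2**16, "out of namespaces"
            return ns

    class Node:
        def __init__(self, namespace):
            self.namespace = namespace
            self._counter = itertools.count()

        def next_id(self):
            local = next(self._counter)
            assert local < 2**48, "namespace exhausted; ask for another"
            return (self.namespace << 48) | local

Each node talks to the coordinator once per namespace; after that, uniqueness is a purely local question.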
UUID generation basically has to use a CSPRNG to avoid collisions (or at least a very large-state insecure PRNG).
Because of the low volume simply using /dev/urandom on each node makes the most sense. If /dev/urandom is broken so is your TLS stack and a host of other security-critical things; at that point worrying about video ID collisions seems silly.
It's an int64, encoded as URL-friendly base64 (i.e. alphanumeric with _ and -).
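Assuming that's right, the round trip is tiny (the big-endian byte order here is a guess): 8 bytes base64-encode to 12 chars, the last of which is always '=' padding, leaving exactly 11.

    import base64
    import struct

    def encode_video_id(n):
        # unsigned 64-bit int -> 11-char URL-safe string
        return base64.urlsafe_b64encode(struct.pack(">Q", n)).rstrip(b"=").decode()

    def decode_video_id(s):
        (n,) = struct.unpack(">Q", base64.urlsafe_b64decode(s + "="))
        return n

That also explains the 2 unused bits mentioned above: 11 chars carry 66 bits, of which only 64 are meaningful.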
If M ≈ 1e9 valid IDs exist in a space of N ≈ 1e18, then each sampled URL has about an M/N = 1e-9 chance of hitting a used ID, so sampling K = 1000 URLs gives you only about a 1e-6 chance of seeing even one video.
IDs are 64-bit integers. The number of tries before an event with probability P occurs follows a geometric distribution. If V is the number of valid IDs (those that have a video), the expected number of tries is 2^64÷V. Assuming 1 megatry per second, since we can parallelize, we would find the first video in about 20 seconds on average, with a generous estimate of V = 10^12 (a trillion videos).
To have a sample of ~100 videos, it’d take about half an hour.
The YouTube ID pool is closer to 7.4e19, not 1.8e19. I'm not a math expert and my probability is quite weak. If you assume generously that 1 trillion IDs have been taken, the fraction of IDs in use is 1e12/7.4e19 ≈ 1.35e-8, i.e. about 0.00000135% of available IDs have been taken.
Getting a single video at 1 megatry per second would then take something more like 75 seconds on average, not 20.
(The other thread that assumes a 62-character set is wrong because they forgot about '-' and '_'. I'm fairly certain a video ID is a URL-safe base64-encoded 64-bit int. 64^11 == 2^66.)
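For what it's worth, here's the arithmetic under both pool-size assumptions (uniform random guessing; expected tries = pool ÷ valid IDs):

    RATE = 1e6    # tries per second
    VALID = 1e12  # the generous one-trillion-videos assumption

    for name, pool in [("2^64 pool", 2.0**64), ("2^66 pool", 2.0**66)]:
        tries = pool / VALID
        print(name, round(tries / RATE), "s per hit,",
              round(100 * tries / RATE / 60), "min per 100 hits")

That prints roughly 18 s / 31 min for the 2^64 pool and 74 s / 123 min for the 2^66 one: a factor of 4, which is exactly 2^66 ÷ 2^64.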
The big issue here is quality - most hardware encoders hugely lag behind software encoders. By quality I mean visual quality per byte/s. Which means that using a quality software encoder like x264 will save you a massive amount of money in bandwidth costs, because you can simply go significantly lower in bitrate than you can with a hardware encoding block.
At the time, our comparisons showed that you could get away with as low as 1.2 Mbps for a 720p stream, where with an enterprise HW encoder you'd have to do about 2-3 Mbps to get the same picture quality.
That's one consideration. The other consideration is density - at the time, most hardware encoders could do up to about 4 streams per 1U rack unit. Those rack units cost about half the price of a fully loaded 24+ core server. Even GPUs like NVIDIA's could at the time do at most 2 encoding sessions with any kind of performance. On CPU, we could encode 720p on about 2 Xeon cores, which means that a fully loaded server box with 36+ cores could easily do 15-20 sessions of SD and HD, and we could scale the load as necessary.
And the last consideration was price - all HW encoders were significantly more expensive than buying large-core-count rack-mount servers. Funnily enough, many of those "HW specialised encoding" boxes were running x86 cores internally too, so they weren't even more power efficient.
So in the end the calculation was simple - software encoding saved a ton of money on bandwidth, it allowed a better quality product because we could deliver high quality video to people with poor internet connectivity, it made procuring hardware simple, and it made the solution more scalable, all at the cost of some power consumption. Easy trade. Of course the computation is a bit different with modern formats like VP9 and H.265/HEVC - the software encoders are still very CPU intensive, so it might make sense to buy cards these days.
Of course, we weren't Google and couldn't design and manufacture our own hardware. But seeing the list of codecs YouTube uses, there's also one more consideration: flexibility. HW encoding blocks are usually very limited at what they can do - most of them will do H.264, some of them will stretch to H.265 and maaaaaybe VP9. CPUs will encode into everything. Even when a new format is needed, you just deploy new software, not a whole chip.
I guess what I didn't expect is that Google could design their own encoder IP that beats the current offerings by a big factor at the task of general video coding. I guessed that Google had actually built an ASIC with customised IP blocks from some other vendor.
But maybe Google did do just that?
The most important performance/quality-related process in encoding is having the encoder take each block (piece) of the previous frame and scan the current frame to see whether it still exists and where it moved. The larger the area the codec scans, the more likely it is to find where that piece of the image moved to. This allows it to write just a motion vector instead of actually encoding image data.
This process is hugely memory-bandwidth intensive, and most HW encoders severely limit the area each thread can access to keep memory bandwidth costs down and performance up. This is also a fundamental limitation for CUDA/gpGPU encoders, where you also face a huge performance loss if each thread accesses too much memory.
Most "realtime" encoders severely limit the macroblock scan area because of how expensive it is - which also makes them significantly less efficient. I don't see FPGAs really solving this issue - I'd bet more on Intel/nVidia encoding blocks paired with copious amount of onboard memory. I heard Ampere nVidia encoding blocks are good (although they can only handle a few streams).
> "each encoder core can encode 2160p in realtime, up to 60 FPS (frames per second) using three reference frames."
Apparently reference frames are the frames that a codec scans for similarity in the next frame to be encoded. If it really is that expensive to reference a single frame then it puts into perspective how effective this VPU hardware must be to be able to do 3 reference frames of 4K at 60 fps.
Would that also depend on the content?
Aren’t panning shots more difficult to encode?
Actually not quite - "reference frames" means how far back (or forward!) the encoded frame can reference other frames. In plain words, "max reference frames 3" means that frame 5 in a stream can say "here goes block 3 of frame 2" but isn't allowed to say "here goes block 3 of frame 1" because that's out of range.
This has obvious consequences for decoders: they need to have enough memory to keep that many decoded, uncompressed frames around on the chance that a future frame will reference them. It also has consequences for encoders: while they don't have to reference frames far back, it increases efficiency if they can reuse the same stored block of image data across as many frames as possible. This of course means that they need to scan more frames for each processed input frame to try to find as much reusable data as possible.
You can easily get away with 1 reference frame (MPEG-2 has this limit, for example), but it'll encode the same data multiple times, lowering overall efficiency and leaving less space to store detail.
> Would that also depend on the content?
It does depend on the content - in my testing it works best for animated content, because the visuals are static for a long time, so referencing data from half a second ago makes a lot of sense. It doesn't add a lot for content with lots of scene cuts and action, like a Michael Bay movie combat scene.
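The decoder-memory consequence is easy to picture as a bounded buffer. A toy sketch, where the "copy" ops are hypothetical stand-ins for motion-compensated blocks:

    from collections import deque

    MAX_REF_FRAMES = 3  # e.g. the three reference frames the VPU handles

    class ToyDecoder:
        def __init__(self):
            self.refs = deque(maxlen=MAX_REF_FRAMES)  # decoded frames only

        def decode(self, ops):
            frame = []
            for op in ops:
                if op[0] == "raw":        # literal image data
                    frame.append(op[1])
                else:                     # ("copy", frames_back, block_no)
                    _, back, block_no = op
                    assert back <= len(self.refs), "reference out of range"
                    frame.append(self.refs[-back][block_no])
            self.refs.append(frame)
            return frame

Anything further back than MAX_REF_FRAMES has already been evicted, which is exactly why the bitstream isn't allowed to point at it.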
Of course, if you do enough transcoding that you're buying servers for the job, then these start to save money. So I guess someone finally decided that the R&D would likely pay off given the current combination of cyclical traffic, adjustable load, and the cost savings of the accelerator.
It was always an interesting and weird product; it even runs its own OS.
I thought they'd already been using custom chips for transcoding for decades.
 - https://cloud.google.com/blog/products/ai-machine-learning/g...
Read all that on some random government document portal, but can't seem to find it now...
I feel like the Google results were better 20 years ago, what did they use back then before TPUs?
In my past experience working with FPGA designers, I was always told that any C-to-H(ardware) tooling was quicker to develop with but often had significant performance implications for the resulting design, in that it would consume many more gates and run significantly slower. But if you have a huge project to undertake and your video codec is only likely to be useful for a few years, you need to get an improvement (any improvement!) as quickly as possible, so the tradeoff was likely worth it for Google.
Or possibly the C-to-H tooling has gotten significantly better recently? Anyone aware of the current state of the art who can shed some light on it?
Are you basically asking why they don't take a performance-sensitive, specialised, and parallel task and run it on a low-performance, unspecialised, and sequential system?
Would take hours and be super inefficient.
4G is perfectly fine for uploading videos. It can hit up to 50 Mbps. LTE-Advanced can do 150 Mbps.
The CPU and bandwidth costs of transcoding to 40+ different audio and video formats would be massive though. I could imagine a 5 minute video taking more than 24 hours to transcode on a phone.
Uploading corrupt files could allow the uploader to execute code on future client machines. You must check every frame and the full encoding of the video.
So yes, for the bigger names like Google this is an unacceptable risk. They will generally avoid serving any user-generated complex format like video, images or audio to users directly. Everything is transcoded to reduce the likelihood that an exploit was included.
Given your average DSL uplink of 5 MBit/s, that's 2 hours uploading for the master version... and if I had to upload a dozen smaller versions myself, that could easily add five times the data and upload time.
This is in no way a "simple skill", as maximum video bitrate is only one of a number of factors in encoding video. For streaming to end users there are questions of codecs, codec profiles, entropy coding options, GOP sizes, frame rates, and frame sizes. The same applies to your audio, replacing frame rates and sizes with sample rate and number of channels.
Streaming to ten random devices will require different combinations of any or all of those settings. There's no one single optimum setting. YouTube encodes dozens of combinations of audio and video streams from a single source file.
Video it turns out is pretty complicated.
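The combinatorics are easy to underestimate. A toy ladder with just four knobs (made-up values, nothing like YouTube's real matrix) already produces dozens of outputs:

    from itertools import product

    codecs      = ["h264", "vp9"]
    resolutions = [(256, 144), (640, 360), (1280, 720), (1920, 1080)]
    framerates  = [24, 30, 60]
    audio       = [("aac", 44100, 2), ("opus", 48000, 2)]

    ladder = [{"codec": c, "size": s, "fps": f, "audio": a}
              for c, s, f, a in product(codecs, resolutions, framerates, audio)]
    print(len(ladder))  # 48 output variants from four knobs

And every cell of that matrix still needs its own tuned bitrate, GOP size, and profile settings.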
I’m not an expert in this but I know that “optimally encoding a video” is an actual job. That’s because there’s no global definition of optimal (it varies depending on the source material and target devices, not to mention the costs of your compute, bandwidth, and time); you’re doing it multiple times using different codecs, resolutions, bandwidth targets, etc.; and those change regularly so you need to periodically reprocess without asking people to come back years later to upload the iPhone 13 optimized version.
This brings us to a second important concept: YouTube is a business which pays for bandwidth. Their definition of optimal is not the same as yours (every pixel of my masterpiece must be shown exactly as I see it!) and they have a keen interest in managing that over time even if you don’t care very much because an old video isn’t bringing you much (or any) revenue. They have the resources to heavily optimize that process but very few of their content creators do.
A competitor would need either a different model to keep costs low (limit video length/quality, the Vimeo model of making creators pay, or the Netflix-like model of having a very limited library), or very deep pockets to run at a loss until they reach YouTube scale.
I'm still mystified how TikTok apparently manages to turn a profit. I have a feeling they're using the "deep pockets" approach, although the short video format might also bring in more ad revenue per hour of video stored/transcoded/served.
It probably means that, unless you have a groundbreaking algorithm on something that is available as hardware, you simply do software on something that is not "perfected".
It trims marginal improvements.
I've been surprised this wasn't already the case, but assumed it was just an encoding-overhead issue versus just serving pre-encoded videos for both the content and the ads, with necessarily well-defined stream boundaries separating them.
I am not, however, suggesting that encoding ads into the final stream is appropriate or scalable, though!
Ah, smart people and ads ...
Sure, it would break people who want to watch at 2x realtime, but they seem like small fry compared to those with adblockers.
For example each chunk URL could be signed with a "donotdeliverbefore" timestamp.
Now the edge server has zero state.
Similar things are done to prevent signed URLs from being shared with other users.
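A sketch of how that stateless check could work (hypothetical key and parameter names; the pattern is just an HMAC over path plus timestamp):

    import hashlib, hmac, time

    SECRET = b"shared-by-app-and-edge-servers"  # made up, obviously

    def sign_chunk_url(path, not_before):
        # app server: bind a "do not deliver before" time into the URL
        tag = hmac.new(SECRET, f"{path}|{not_before}".encode(),
                       hashlib.sha256).hexdigest()
        return f"{path}?nbf={not_before}&sig={tag}"

    def edge_should_serve(path, nbf, sig):
        # edge server: verify the signature, then the clock - no state needed
        want = hmac.new(SECRET, f"{path}|{nbf}".encode(),
                        hashlib.sha256).hexdigest()
        return hmac.compare_digest(want, sig) and time.time() >= int(nbf)

The edge stores nothing per client; all it needs is the key and a reasonably accurate clock, which matches the clock point made below.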
For streaming you actually want the client to have a buffer past the play head. If the client can buffer the whole stream, it makes sense to let them in many cases: the client buffers the whole stream and then leaves your infrastructure alone, even if they skip around or pause the content for a long time. The only limits that really make sense are individual connection bandwidth limits and overall connection limits.
The whole point of HTTP-based streaming is to minimize the amount of work required on the server and push more capability to the client. It's meant to allow servers to be dumb and stateless. The more state you add, even if it's negligible per client, the more it adds up in aggregate. If a scheme meant edge servers could handle 1% less traffic, that means server costs increase by 1%. Unless the ad impressions skipped by youtube-dl users come anywhere close to 1% of ad revenue, it's pointless for Google to bother.
It's also uBlock and Adblock Plus users. Estimated at about 25% of YouTube viewership.
Also, the shared clock only needs to be between edge servers and application servers. And only to an accuracy of a couple of seconds. I bet they have that in place already.
Saying that though, the day they finally succeed in making ads unskippable will be the time for a competitor to move in.
At that point everyone will start talking about it, and I've got to imagine a bunch of new people will become adblock users.
But programming fixed electronics in parallel is also way harder than programming flexible CPUs.
"Contemporary core counts coupled with very wide simd makes CPUs functionally similar to ASIC/fpga in many cases."
I don't think so. For things that have a way to be solved in parallel, you can get at least a 100x advantage easily.
There are lots of problems that you can solve on the CPU (serially) that you just can't solve in parallel (because they have interdependencies).
Today CPUs delegate the video load to video coprocessors of one type or another.
Multiple cores work like multiple machines, but parallel units work choreographed in sync at lower speeds (with quadratic energy consumption). They can share everything and have only the electronics needed to do the job.
> I don't think so. For things that have a way to be solved in parallel, you can get at least a 100x advantage easily.
That’s kind of my point. CPUs are incredibly parallel now in their interface. Let’s say you have 32 cores and use 256-bit SIMD for 4 64-bit ops. That would give you a ~128x improvement compared to doing all those ops serially. It’s just a matter of writing your program to exploit the available parallelism.
There’s also implicit ILP going on as well, but I think explicitly using SIMD usually keeps the execution ports filled.
In any case, wouldn't you run out of memory bandwidth long before you can fill all those cores? It doesn't really matter how many cores you have in that case.
I've never heard about how much power YouTube's transcoding is consuming, but transcoding has always been one of those very CPU-intensive tasks (hence it was one of the first tasks to be moved over to the GPU).
You can handle larger volumes of incoming video by spinning up more encoder machines, but the only solution for lowering latency is faster encodes, and with the way the CPU and GPU markets are these days a dedicated encoder chip is probably your best bet.
For pre-x264 codecs they probably could, but between the relatively small sizes required for the low resolutions those codecs would be supporting, and the cost difference between compute and storage, I'd bet everything is encoded beforehand.
That is quite the understatement. Google's computing system is dozens of connected "warehouses" around the world.