YouTube is now building its own video-transcoding chips (arstechnica.com)
259 points by beambot 14 days ago | 192 comments

A couple of interesting bits:

- without dedicated processors, VP9's encoding is roughly 4.5x as slow as H.264, while with the VPUs, the two formats perform the same; this is a big win for the open format(s)

- sad extract: "Google has aggressively fought to keep the site's cost down, often reinventing Internet infrastructure and copyright in order to make it happen" (emphasis mine)

>- without dedicated processors, VP9's encoding is roughly 4.5x as slow as H.264, while with the VPUs, the two formats perform the same; this is a big win for the open format(s)

H.264 is an open format, just not royalty-free. The baseline of H.264 will soon be "free" once those patents expire in 2023. (Or basically MPEG-5 EVC.)

The hardware encoding for VP9 being the same speed as H.264 is mostly due to the hardware being specifically optimised for VP9 and not H.264. The complexity difference is still there.

And VP9 is patent encumbered but they have a license from MPEG-LA.

So it's definitely not any more open than H.264.

Source for this? https://en.wikipedia.org/wiki/VP9 says some companies claimed patents on it, but google basically ignored them.

While Google might ignore them, can a small company ignore them? I don't think that Google will fight for some guy using VP9 and getting sued.

I was under the impression AOM members were paying into a fund for exactly that purpose, but I'm not sure if there is precedent for its use.

Why is that sad? No company in Google's position could've done better, probably. Youtube was about to be sued into oblivion till Google purchased it.

>Why is that sad?

because the wholesale destruction and minimization of knowledge, education, and information to appease (often arbitrary) intellectual protectionism laws is sad, regardless of who perpetrates it.

non-Google example : What.cd was a site centered around music piracy, but that potentially illegal market created a huge amount of original labels and music that still exists now in the legal sphere.

No one would defend the legal right for what.cd to continue operating, it was obviously illegal; but the unique, novel, and creative works that came from the existence of this illegal enterprise would be sad to destroy.

Swinging back to the Google example : YouTube systematically destroys creations that they feel (often wrongly) infringe upon IP. This is often not even the case, Google routinely makes wrong decisions erring on the side of the legal team.

This destruction of creative work is sad, in my opinion it's more sad than the un-permitted use of work.

Of course, Google as a corporation should act that way, but it's sad in certain human aspects.

It's not just google as a corporation, it's google as a legal entity.

Have your own site in your own individual name with no corporate entity nor search for profit offering to host people's videos for free, and I guarantee you that within 24h you are dealing with things ranging from pedophilia to copyright violations and the like. And if you don't clear them out, you're the one responsible.

Google is acting the way society has decided they should act through the laws it voted for. Could they act in another, more expensive way in order to save a bit more of the content that gets caught by mistake? Definitely, but why would they, as a company, when the law says any mistake or delay is their fault?

Source: like many people, I once made a free image hosting thingy. It was overrun by pedos within a week to my absolute horror and shock. Copyright infringement is obviously not the same at all, BUT the way the law acts toward the host is not that different: "ensure there is none and be proactive in cleaning, or else ...".

Your free image hosting thingy is an example of low barrier to entry both in cost and anonymity. If you had made the cost trivial but traceable I wonder what the outcome would have been. I wonder if a site like lobste.rs but for video would work better. A graph of who is posting what and a graph of how they got onto the site in the first place.

If you vouch for someone who is dodgy now you are also seen as a little dodgier than you were before. This doesn't necessarily mean you lose your account because you happened to vouch for someone, but it might mean that your vouching means less in future.

They aren't destroying anything. They are just not allowing the material on their site. Are you saying that anyone who creates a video hosting site must allow ANY content on their site? I don't see any practical basis for that contention.

It's not destroyed, it just isn't published. Or is the idea that they should be the canonical archive of all uploads?

I don't see any justification in the linked article for the claim that YouTube has in any way reinvented copyright. It seems like a throw-away line that is unsupported by any facts.



Even if not supported in the article here are two examples in the last couple of days of how YouTube is de facto defining copyright regulation.

These are examples of YouTube following copyright laws imperfectly, which is basically guaranteed to happen on a regular basis at their scale. Definitely not what I would consider YouTube redefining copyright.

> These are examples of YouTube following copyright laws imperfectly, which is basically guaranteed to happen on a regular basis at their scale

Given their entire copyright takedown system is (in)famously entirely automated, I would have thought it would be trivial for it to always follow copyright laws to the letter.. if they wanted it to.

It would be trivial to follow copyright laws to the letter if authorship and user identity were trivial and fair use exceptions were trivial to determine.

None of those things are trivial, and that's before rights assignment.

YouTube's system is built primarily to placate rightsholders and avoid human labor paid for by Google.

If channel A uploads a video copied from channel B, then makes a copyright claim against channel B, how does an automated system determine which owns the rights? Certainly it would seem in most cases that we should presume channel B has the copyright, since they uploaded first. But there is a very economically important class of videos where infringers will tend to be the first to upload (movies, TV shows, etc.). I don't really see how an automated system solves this problem without making any mistakes. Especially because the law (DMCA) puts the onus on the service provider to take down or face liability.

How would that work? Infringement and even ownership are sometimes subjective or disputed. Automating it doesn't make those issues any easier.

> without dedicated processors, VP9's encoding is roughly 4.5x as slow as H.264

> this is a big win for the open format(s)

How is this a big win if you need dedicated processors for it to be as fast?

It increases adoption of the open standard on the supply side.

What impact does this really have, though? Are they making better VP9 tools available to other people? Browsers already have highly-tuned playback engines and YouTube actively combats efforts to make downloaders or other things which use their videos, is there a path I’m missing where this has much of an impact on the rest of the Internet?


I honestly can't parse this sentence.

"Google creates dedicated custom proprietary processors which can process VP9 at roughly the same speed as a 20-year-old codec". How is this a win for opensource codecs?

because the largest video site in the world will be encoding as VP9.

They have always been encoding in VP9 though. But it doesn't mean they will be serving it.

For example OSX doesn't support it at all and iOS only supports VP9 in 4K/HDR mode and only for certain versions.

I'm watching a YouTube video in Safari right now being served via VP9 (Verified under "Stats for nerds").

Heck, even the M1 supports VP9 in hardware up to 8K60.

I'm not sure where you got that macOS has no VP9 support, it works quite well.

There are a pile of transcodes of every video, served appropriately for the device. Serving VP9 to non-OSX devices is still a big win, at scale.

It’s a relatively modest win versus H.265 unless you’re willing to substantially sacrifice quality and that has to be balanced against the extra CPU time and storage.

YouTube uses so much bandwidth that this is still measured in millions of dollars but it’s really worth remembering that “at scale” to them is beyond what almost anyone else hits.

It's a codec developed by Google, for Google, and Google will happily abandon it. From the article:

> After pushing out and upgrading to VP8 and VP9, Google is moving on to its next codec, called "AV1," which it hopes will someday see a wide rollout.

I still can't see how this is a win.

Aren't Vp8/9/av1 all open codecs tho? I don't really see what the issue is.

VP8 seems to be the most supported on all platforms without melting your Intel CPU. At least from when I was deploying a WebRTC solution to a larger customer base last year.

> In May 2010, after the purchase of On2 Technologies, Google provided an irrevocable patent promise on its patents for implementing the VP8 format

This is significantly better than even h264 in terms of patents/royalties.

Would you mind elaborating on your hate? There's nothing for google to abandon here? It's already out in the wild.

> Would you mind elaborating on your hate?

How did you even come up with this question?

Please go and re-read my original question: https://news.ycombinator.com/item?id=27035059 and the follow-up from me: https://news.ycombinator.com/item?id=27035112 and from another person: https://news.ycombinator.com/item?id=27036150

But yeah, sure, I hate hate hate VP9 smh

VP9 is meant to be a parallel to h264, and AV1 to h265?

VP9 running on custom circuits being equivalent speed to h264 running on custom circuits seems like a win for VP9? Since VP9 isn't royalty encumbered the way h264 is, that could well be a win for the rest of us too.

> Since VP9 isn't royalty encumbered the way h264 is, that could well be a win for the rest of us too.

I can only repeat myself: "Google creates dedicated custom proprietary processors which can process VP9 at roughly the same speed as a 20-year-old codec". How is this a win for anyone but Google (who is already eyeing to replace VP9 with AV1)?

"The rest of us" are very unlikely to run Google's custom chips. "The rest of us" are much more likely to run this in software, for which, to quote the comment I was originally replying to, "without dedicated processors, VP9's encoding is roughly 4.5x as slow as H.264".

Note: I'm not questioning the codec itself. I'm questioning the reasoning declaring this "a big win for the open format(s)".

You won't get adoption until the word gets around that Big Company X is using Format Y, and they supply content prominently in Format Y. That's when Chinese SoC manufacturers start taking things seriously, add hardware decode blocks to their designs, and adoption just spirals out from there.

Because VP9 achieves better compression than H.264.

> Google probably only provides stats about growth (like "500 hours of video are uploaded to YouTube every minute") because the total number of videos is so large, it's an unknowable amount.

I suppose you could sample random YouTube urls to find out how many of them link to public videos. Given the total number of possible URLs, it would give you an idea of what percentage of them have been used and therefore how many videos YouTube has in total. It would not tell you how many private videos or Google Drive / Photos videos exist of course.
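
A toy sketch of that sampling idea, assuming IDs are assigned uniformly at random and that you can cheaply test whether one is in use. The ID space size, the `is_used` callback, and the numbers are all made up for illustration; nothing here talks to YouTube.

```python
import random

def estimate_total(id_space_size, samples, is_used):
    """Estimate how many IDs are in use from a uniform random sample."""
    hits = sum(1 for _ in range(samples) if is_used(random.randrange(id_space_size)))
    # Scale the observed hit rate back up to the whole ID space.
    return id_space_size * hits / samples

# Toy stand-in for "does this URL resolve to a public video?"
random.seed(0)
ID_SPACE = 1_000_000
used = set(random.sample(range(ID_SPACE), 50_000))

estimate = estimate_total(ID_SPACE, 20_000, used.__contains__)
print(f"true: {len(used)}, estimated: {estimate:.0f}")
```

The estimate only gets tight once the sample contains a decent number of hits, which is where the size of the real ID space becomes the problem.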

Googler, opinions are my own.

Youtube isn't the only platform where Google does video transcoding. I don't know them all, but here are a few other places where video plays a part:

Meet - I'm guessing that participants on different devices (desktop, Android, iOS) will get different video feed quality depending on their bandwidth. Though, the real-time nature of this may not work as well? Though, Meet has a live stream[0] feature for when your meeting is over 250 people, which gives you a YouTube-like player, so this is likely transcoded.

Duo - more video chat.

Photos - when you watch a photos video stored at google (or share it with someone), it will likely be transcoded.

Video Ads. I'd guess these are all pre-processed for every platform type for efficient delivery. While these are mainly on YouTube, they show up on other platforms as well.

Play Movies.

Nest cameras. This is a 24/7 stream of data to the cloud, and some people pay to have X days of video saved.

[0] https://support.google.com/meet/answer/9308630?co=GENIE.Plat...

Also Googler, you can add Stadia here as well. Needs fast and low-latency transcoding.

You're right. I thought stadia did something special, but I guess not. So yes, Stadia.

Another: Youtube TV

EDIT: Stadia: I did some searching around the interwebs and found a year-old interview[0] that hints at something different.

> It's basically you have the GPU. We've worked with AMD to build custom GPUs for Stadia. Our hardware--our specialized hardware--goes after the GPU. Think of it as it's a specialized ASIC.

[0] https://www.techrepublic.com/article/how-google-stadia-encod...

> Google probably only provides stats about growth (like "500 hours of video are uploaded to YouTube every minute") because the total number of videos is so large, it's an unknowable amount.

I can totally see it being non-trivial to count your videos, which is a funny problem to have. But I doubt it's unknowable. More like they don't care/want us to know that.

Quite likely they have a good approximate number.

But knowing the exact number can indeed be hard. It would take stopping the entire uploading and deletion activity. Of course they may have counters of uploads and deletions on every node which handles them, but the notion of 'the same instant' is tricky in distributed systems, so the exact number still remains elusive.

It’s not just tricky and elusive, I think it’s literally unknowable -- not a well-defined question. Like asking about the simultaneity of disconnected events in special relativity.

You can modify the question to be well-defined and not suffer those measurement problems, eg "the total number of videos uploaded before midnight UTC on 2021-04-29"
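
A minimal sketch of answering the question in that well-defined form, assuming every node keeps a timestamped log of uploads (+1) and deletions (-1) and that the clocks are synchronised closely enough to trust the timestamps. The log layout and names are hypothetical.

```python
from datetime import datetime, timezone

def count_before(node_logs, cutoff):
    """Sum upload (+1) and deletion (-1) events stamped strictly before cutoff."""
    return sum(delta for log in node_logs for ts, delta in log if ts < cutoff)

utc = timezone.utc
cutoff = datetime(2021, 4, 29, tzinfo=utc)  # midnight UTC on 2021-04-29

# One event log per node: (timestamp, +1 for upload / -1 for deletion).
node_logs = [
    [(datetime(2021, 4, 28, 23, 59, tzinfo=utc), +1),
     (datetime(2021, 4, 29, 0, 1, tzinfo=utc), +1)],   # after the cutoff, ignored
    [(datetime(2021, 4, 28, 12, 0, tzinfo=utc), +1),
     (datetime(2021, 4, 28, 13, 0, tzinfo=utc), -1)],
]

total = count_before(node_logs, cutoff)
print(total)
```

Because the cutoff is in the past, each node can answer independently and the partial sums can be added in any order.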

Interesting, I wonder if a distributed database could be developed to consistently answer queries phrased in this way.

Google developed Spanner, a globally distributed database that uses atomic clocks to keep things synchronized: https://static.googleusercontent.com/media/research.google.c...

I think the Chandy-Lamport snapshot algorithm tries to do something like this for all distributed systems (in their model, and it tries to get any consistent snapshot, not allowing you to specify the "time"); not sure if it's actually useful IRL though

Seems like it would either need to be append only or have some kind of snapshot isolation

All these nodes are in the same light cone, so we theoretically can stop all mutation and wait for the exact final state to converge to a precise number of uploaded videos.

But the question of the precise number of videos before that moment is indeed ill-defined.

You can theoretically stop all mutation, but the users might start to complain!

I read "an unknowable amount" as "a meaninglessly large number to our monkey brains".

It's like knowing the distance to the Sun is 93 million miles. The difficulty there isn't that measuring the distance from the Sun to the Earth exactly is hard, although it is, or that the distance is constantly changing, although it is, or that the question is ill-defined, because the Earth is an object 8000 miles across and the Sun is 100 times bigger, and what points are you measuring between?

The distance is "unknowable" because while we know what "93 million miles" means, it's much harder to say we know what it "means". Even when we try to rephrase it in smaller numbers, like "it's the distance you could walk in 90 human lifetimes", it is still hard to really feel anything beyond "it's really really far."

Likewise, does it matter if YouTube has 100, 1000, or 10,000 millennia of video content? Does that number have any real meaning beyond back-of-the-envelope calculations of how much storage they're using? Or is "500 hours per minute" the most comprehensible number they can give?

It doesn't seem like this would work. I think you could sample trillions of YouTube IDs with a high likelihood of all of them being unused. They're supposed to be unique after all

Clicked into a couple random videos, looks like all of their video IDs are 11 characters, alphanumeric with cases. So 26+26+10 = 62 choices for each char, 62^11 = 5.2e+19 = 52 quintillion unique IDs (52 million trillions).

So, yeah, sampling would be a mostly futile effort since you're looking to estimate about 8 to 10 decimal digits of precision. Though it's technically still possible since you'd expect about 1 in every 50 million - 5 billion IDs to work (assuming somewhere between a trillion and 10 billion videos).

My statistics knowledge is rusty, but I guess if you could sample, say, 50 billion urls you could actually make a very coarse estimate with a reasonable confidence level. That's a lot but, ignoring rate limits, well within the range of usual web-scale stuff.
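
To put rough numbers on "very coarse", here is a sketch assuming the 62^11 ID space and the 10-billion-to-1-trillion range from the comments above. It treats the hit count as roughly Poisson, which is a simplification.

```python
import math

ID_SPACE = 62 ** 11          # 11 characters, 62 symbols each (assumption from above)
SAMPLES = 50_000_000_000     # 50 billion random probes

def expected_hits(total_videos, samples=SAMPLES, id_space=ID_SPACE):
    """Expected number of probes that land on a used ID."""
    return samples * total_videos / id_space

def relative_error(hits):
    """Rough relative standard error: rare hits are ~Poisson, so sd ~ sqrt(hits)."""
    return 1 / math.sqrt(hits)

for videos in (10**10, 10**12):
    hits = expected_hits(videos)
    print(f"{videos:.0e} videos: ~{hits:.0f} hits, ~{relative_error(hits):.0%} relative error")
```

At the low end that is only about ten hits, so the estimate would be order-of-magnitude at best; at the high end it is closer to a thousand hits and a few percent of error.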

I tried this for some time, I was looking for unlisted videos.

Just generate a random valid link and then check if it gives a video or not.

I found exactly 0 videos.

They also use "_" and "-" according to Tom Scott.


Which would bring it up to a nice 64 choices, making it exactly 6 bits per character.

It's a URL-friendly form of base64.

11 chars encode 66 bits, but actually 2 bits are likely not used and it's simply an int64 encoded to base64.
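
A small sketch of that encoding, assuming the ID really is an unsigned 64-bit integer packed big-endian and written as URL-safe base64 with the padding stripped. The byte order and the function names are guesses for illustration, not anything confirmed about YouTube.

```python
import base64
import struct

def encode_id(n: int) -> str:
    """Pack an unsigned 64-bit int and encode it as unpadded URL-safe base64."""
    raw = struct.pack(">Q", n)                        # 8 bytes, big-endian
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def decode_id(s: str) -> int:
    """Inverse of encode_id: restore the single padding char, then unpack."""
    raw = base64.urlsafe_b64decode(s + "=")
    return struct.unpack(">Q", raw)[0]

sample = encode_id(2**64 - 1)
print(sample, len(sample), decode_id(sample) == 2**64 - 1)
```

Eight bytes always come out as 11 characters plus one `=`, which is why the last character only ever carries 4 useful bits and 2 of the 66 bits go unused.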

Given everyone and their grandma is pushing 128-bit UUID for distributed entity PK, it's interesting to see YouTube keep it short and sweet.

Int64 is my go to PK as well, if I have to, I make it hierarchical to distribute it, but I don't do UUID.

> Given everyone and their grandma is pushing 128-bit UUID for distributed entity PK, it's interesting to see YouTube keep it short and sweet.

The trade-off you make when using short IDs is that you can't generate them at random. With 128-bit IDs, you can't realistically have collisions, but with 64-bit ones, because of the birthday paradox, as soon as you have more than 2^32 elements, you're really likely to have collisions.
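
For a feel of the numbers, here is the usual birthday approximation, P ≈ 1 - exp(-n(n-1)/2N), evaluated for 2^32 purely random IDs. This is a sketch of the general formula, not of any particular system.

```python
import math

def collision_probability(n_items: int, id_bits: int) -> float:
    """Approximate P(at least one collision) among n random IDs of id_bits bits."""
    space = 2 ** id_bits
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))

for bits in (64, 128):
    p = collision_probability(2 ** 32, bits)
    print(f"{bits}-bit IDs, 2^32 items: P(collision) ~ {p:.3g}")
```

With 64-bit IDs the chance is already around 40% at 2^32 items and climbs quickly from there, while with 128-bit IDs it is still negligible.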

Youtube video ids used to be just base64 of a 3DES-encrypted mysql's primary key, a sequential 64-bit int - collisions are of zero concern there. By birthday paradox it's about as good as 128-bit UUID generated without using a centralized component like database's row counter, when you have to care about collisions.

However theft of the encryption key is a concern, since you can't rotate it and it just sat there in the code. Nowadays they do something a bit smarter to ensure ex-employees can't enumerate all unlisted videos.

> You seem to know about their architecture. What do they do now?

Random 64-bit primary keys in mysql for newer videos. These may sometimes collide but then I suppose you could have the database reject insert and retry with a different id.
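
A sketch of that reject-and-retry pattern, using an in-memory SQLite database as a stand-in for MySQL. The table, column names, and retry limit are made up, and the key is 63 bits here only because SQLite integers are signed.

```python
import secrets
import sqlite3

def insert_video(conn, title, max_attempts=5):
    """Insert a row under a random key, retrying if the key is already taken."""
    for _ in range(max_attempts):
        video_id = secrets.randbits(63)   # fits SQLite's signed 64-bit INTEGER
        try:
            with conn:                    # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO videos (id, title) VALUES (?, ?)",
                    (video_id, title),
                )
            return video_id
        except sqlite3.IntegrityError:
            continue                      # primary-key collision: pick a new id
    raise RuntimeError("could not find a free id")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (id INTEGER PRIMARY KEY, title TEXT)")
first = insert_video(conn, "cat video")
second = insert_video(conn, "dog video")
print(first, second)
```

The primary-key constraint is what makes this safe: the generator can be sloppy because the database is the final arbiter of uniqueness.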

So a single cluster produces those keys? I thought it’s more decentralized.

With random database keys I would think they can just be generated at random by any frontend server running anywhere. Ultimately, a request to insert that key would come to the database - which is the centralized gatekeeper in this design and can accept or reject it. But with replication, sharding, caching even SQL databases scale extremely well. Just avoid expensive operations like joins.

You seem to know about their architecture. What do they do now?

The reason why we want ids to be purely random is so we don't have to do the work of coordinating distributed id generation. But if you don't mind coordinating, then none of this matters.

Surely if it was a great chore for YouTube to have random-looking int64 ids, they would switch to int128. But they haven't.

I'm a big fan of the "works 99.99999999% of the time" mentality, but if anything happens to your PRNGs, you risk countless collisions slipping past you in production before you realize what happened. It's good to design your identity system in a way that'd catch that, regardless of how unlikely it seems in the abstract.

The concept of hierarchical ids is undervalued. You can have a machine give "namespaces" to others, and they can generate locally and check for collisions locally in a very basic way.
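
One way to picture hierarchical ids, as a sketch only: a coordinator hands out a namespace prefix once, and each node then fills in the low bits locally. The 16/48 bit split and the class names are arbitrary choices for illustration.

```python
import itertools

class Coordinator:
    """Hands out unique namespace prefixes; the only step needing coordination."""
    def __init__(self):
        self._next = itertools.count()

    def allocate_namespace(self) -> int:
        return next(self._next)

class Node:
    """Generates 64-bit ids locally: 16-bit namespace + 48-bit local counter."""
    LOCAL_BITS = 48

    def __init__(self, namespace: int):
        if not 0 <= namespace < 2 ** 16:
            raise ValueError("namespace must fit in 16 bits")
        self.namespace = namespace
        self._counter = itertools.count()

    def new_id(self) -> int:
        local = next(self._counter)
        if local >= 2 ** self.LOCAL_BITS:
            raise OverflowError("namespace exhausted; request a new one")
        return (self.namespace << self.LOCAL_BITS) | local

coordinator = Coordinator()
a = Node(coordinator.allocate_namespace())
b = Node(coordinator.allocate_namespace())
ids = [a.new_id(), a.new_id(), b.new_id()]
print(ids)
```

A local counter cannot collide with itself, so the only uniqueness check that matters is the one-time namespace allocation. A node could also draw the low bits at random and check for collisions against its own small table.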

> but if anything happens to your PRNGs, you risk countless collisions to slip up by you in production before you realize what happened.

UUID generation basically has to use a CSPRNG to avoid collisions (or at least a very large-state insecure PRNG).

Because of the low volume simply using /dev/urandom on each node makes the most sense. If /dev/urandom is broken so is your TLS stack and a host of other security-critical things; at that point worrying about video ID collisions seems silly.

I worry about state corrupting problems, because they tend to linger long after you have a fix.

Is the extra 64 bits simply used to lower the risk of collision?

> all of their video IDs are 11 characters, alphanumeric with cases

It's an int64, encoded as URL-friendly base64 (i.e. alphanumeric with _ and -).

Thanks for doing the maths - it does seem the sampling method would not be feasible. Taking the statistic of "500 hours uploaded per minute" and assuming the average video length is 10 minutes, we can say about 1.5bn videos are uploaded to YouTube every year or 15bn every 10 years. So it seems likely that YouTube has less than 1tn videos in total.
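
The back-of-the-envelope arithmetic, spelled out. The 10-minute average length is the assumption from the comment above, not a published figure.

```python
HOURS_UPLOADED_PER_MINUTE = 500     # figure quoted in the article
AVG_VIDEO_MINUTES = 10              # assumed average video length

# 500 hours = 30,000 minutes of video arriving every minute.
videos_per_minute = HOURS_UPLOADED_PER_MINUTE * 60 / AVG_VIDEO_MINUTES
videos_per_year = videos_per_minute * 60 * 24 * 365

print(f"{videos_per_year:.2e} videos/year, {videos_per_year * 10:.2e} per decade")
```

Even if the average video were ten times shorter, a decade at this rate would still land well under a trillion.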

If there are N IDs to draw from and M videos on YouTube, then P(ID used) = M/N if the ID is drawn from a uniform distribution, and P(At least one of K IDs used) = 1 - (1 - M/N)^K (not accounting for replacement).

If M ≈ 1e9 and N ≈ 1e18, and you sample K = 1000 URLs, then the chance that at least one of them hits a used ID is about 1e-6 (each individual draw has a 1e-9 chance).
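
The formula 1 - (1 - M/N)^K is easy to get wrong numerically when M/N is tiny, so here is a sketch that evaluates it with `log1p`/`expm1`. The values plugged in are the example ones from the comment.

```python
import math

def p_at_least_one_hit(m_videos, n_ids, k_samples):
    """P(at least one of k uniformly drawn IDs is in use) = 1 - (1 - M/N)^K."""
    p_single = m_videos / n_ids
    # log1p/expm1 avoid the rounding loss of computing (1 - tiny)**k directly.
    return -math.expm1(k_samples * math.log1p(-p_single))

p = p_at_least_one_hit(1e9, 1e18, 1000)
print(f"{p:.3g}")
```

For small hit rates the result is very close to K * M/N, so doubling the number of samples roughly doubles the chance of a hit.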

Let’s do the math.

IDs are 64-bit integers. The number of tries before an event with probability P occurs follows a geometric distribution. If V is the number of valid IDs (that have a video), the expected number of tries is 2^64÷V. Assuming 1 megatry per second, since we can parallelize it, we would find the first video in about 20 seconds on average, with an estimate of V = 10^12 (a trillion videos).

To have a sample of ~100 videos, it’d take about half an hour.
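
The same estimate as a few lines of Python, under the comment's assumptions: a 2^64 ID space, uniformly assigned IDs, 10^12 valid IDs, and a million probes per second with no rate limiting.

```python
ID_SPACE = 2 ** 64
TRIES_PER_SECOND = 1_000_000

def seconds_to_find(valid_ids, wanted=1):
    """Expected time to hit `wanted` valid IDs (mean of a geometric, times wanted)."""
    expected_tries = wanted * ID_SPACE / valid_ids
    return expected_tries / TRIES_PER_SECOND

one = seconds_to_find(10**12)
hundred = seconds_to_find(10**12, wanted=100)
print(f"first hit: ~{one:.0f} s, 100 hits: ~{hundred / 60:.0f} min")
```

Every factor of ten fewer videos, or a larger ID space, scales these times up proportionally.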

Formal math & probability is something I'm trained in. that being said, intuitively this sounds off to me...

The YouTube ID pool is closer to 7.4e19, not 1.8e19. I'm not a math expert and my probability is quite weak. If you assume generously that 1 trillion IDs have been taken, the fraction of IDs in use is 1e12/7.4e19, so about 0.00000135% of available IDs have been taken.

Getting a single video at 1 megatry per second would take something more like 74 seconds on average, not 20 seconds.

typo: I meant to say it's not something I'm trained in

What do you mean 'supposed to be unique'? How can an ID not be unique?

Maybe when they reach 62^11 + 1?

Yeah that should work. If I do 1 request per second with 64 IP addresses, I'd expect to find ~110 videos after 1 year of random sampling if there are 1T videos on YouTube.


(The other thread that assumes a 62-character set is wrong because they forgot about '-' and '_'. I'm fairly certain a video ID is a urlsafe base64 encoded 64-bit int. 64^11==2^66)

Surprised they don't also mention Nest, which I assume also has an interesting & significant video encoding operation.

If anything it's just strange that such a part didn't exist before (if it really didn't). Accelerated encoding is hugely more power efficient than software encoding.

I've worked on an IPTV broadcasting system and this isn't as obvious as you'd think.

The big issue here is quality - most hardware encoders hugely lag behind software encoders. By quality I mean visual quality per byte/s. That means a quality software encoder like x264 will save you a massive amount of money in bandwidth costs, because you can simply go significantly lower in bitrate than you can with a hardware encoding block.

At the time, our comparisons showed that you could get away with as low as 1.2MBps for a 720p stream, whereas with an enterprise HW encoder you'd need about 2-3MBps for the same picture quality.

That's one consideration. The other consideration is density - at the time most hardware encoders could do up to about 4 streams per 1U rack unit. Those rack units cost about half the price of a fully loaded 24+ core server. Even GPUs like nVidia at the time could do at most 2 encoding sessions with any kind of performance. On CPU, we could encode 720p on about 2 Xeon cores which means that a fully loaded server box with 36+ cores could easily do 15-20 sessions of SD and HD and we could scale the load as necessary.

And the last consideration was price - all HW encoders were significantly more expensive than buying large core count rack-mount servers. Funny enough, many of those "HW specialised encoding" boxes were running x86 cores internally too so they weren't even more power efficient.

So in the end the calculation was simple - software encodes saved a ton of money on bandwidth, it allowed better quality product because we could deliver high quality video to people with poor internet connectivity, it made procuring hardware simple, it made the solution more scalable and all that at the cost of some power consumption. Easy trade. Of course the computation is a bit different with modern formats like VP9 and H.265/HEVC - the software encoders are still very CPU intensive so it might make sense to buy cards these days.

Of course, we weren't Google and couldn't design and manufacture our own hardware. But seeing the list of codecs YouTube uses, there's also one more consideration: flexibility. HW encoding blocks are usually very limited at what they can do - most of them will do H.264, some of them will stretch to H.265 and maaaaaybe VP9. CPUs will encode into everything. Even when a new format is needed, you just deploy new software, not a whole chip.

My work is in IP cameras so I'm aware of these tradeoffs.

I guess what I didn't expect is that Google could design their own encoder IP to beat the current offerings by a big factor at the task of general video coding. I guessed that Google actually built an ASIC with customised IPs from some other vendor.

But maybe Google did do just that?

Very interesting description. Are you familiar at all with the details of FPGAs for these very same tasks, especially the EV family of Xilinx Zynq Ultrascale+ MPSoC? They include hardened video codec units, but I don't know how they compare quality/performance-wise. Thanks!

I'm afraid I don't have any experience with those devices. Most HW encoders however struggle with one thing - the fact that encoding is very costly when it comes to memory bandwidth.

The most important performance/quality related process in encoding is having the encoder take each block (piece) of previous frame and scan the current frame to see whether it still exists and where it moved. The larger area the codec scans, the more likely it'll find the area where the piece of image moved to. This allows it to write just a motion vector instead of actually encoding image data.

This process is hugely memory bandwidth intensive and most HW encoders severely limit the area each thread can access to keep memory bandwidth costs down and performance up. This is also a fundamental limitation for CUDA/gpGPU encoders, where you're also facing a huge performance loss if there's too much memory accessed by each thread.

Most "realtime" encoders severely limit the macroblock scan area because of how expensive it is - which also makes them significantly less efficient. I don't see FPGAs really solving this issue - I'd bet more on Intel/nVidia encoding blocks paired with copious amount of onboard memory. I heard Ampere nVidia encoding blocks are good (although they can only handle a few streams).
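
To make the scan-area trade-off concrete, here is a toy exhaustive block-matching search using the sum of absolute differences (SAD). Real encoders use much smarter search patterns and sub-pixel refinement; the frame contents, block size, and function names are made up for illustration.

```python
def sad(prev, cur, py, px, cy, cx, size):
    """Sum of absolute differences between a block in prev and one in cur."""
    return sum(
        abs(prev[py + y][px + x] - cur[cy + y][cx + x])
        for y in range(size) for x in range(size)
    )

def find_motion_vector(prev, cur, by, bx, size, search_range):
    """Search cur within +/- search_range pixels for the block at (by, bx) in prev."""
    height, width = len(cur), len(cur[0])
    best = (float("inf"), (0, 0))          # (cost, (dy, dx))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            cy, cx = by + dy, bx + dx
            if 0 <= cy <= height - size and 0 <= cx <= width - size:
                cost = sad(prev, cur, by, bx, cy, cx, size)
                best = min(best, (cost, (dy, dx)))
    return best

# Toy 8x8 frames: a 2x2 bright block moves 3 pixels to the right.
prev = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
for y in (2, 3):
    for x in (1, 2):
        prev[y][x] = 255
        cur[y][x + 3] = 255

narrow = find_motion_vector(prev, cur, 2, 1, 2, search_range=1)
wide = find_motion_vector(prev, cur, 2, 1, 2, search_range=4)
print(narrow, wide)
```

The narrow search never reaches the block's new position, so it would have to encode the pixels again, while the wide search finds a perfect match at the cost of reading far more of the frame. The work grows with the square of the search range, which is where the memory bandwidth goes.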

Only relatively recently has NVIDIA implemented B-Frames into NVENC. If I am not mistaken, AMD still does not have this capability. I am not deeply well versed in this space though, but if memory bandwidth is such a huge bottleneck, how does the CPU do it so efficiently comparatively? GPU surely wins in this area? Is it just designed that way so that consumer cards can offer realtime speeds? I'm not sure why this couldn't be configurable in some way.

That is interesting context for this quote from the article:

> "each encoder core can encode 2160p in realtime, up to 60 FPS (frames per second) using three reference frames."

Apparently reference frames are the frames that a codec scans for similarity in the next frame to be encoded. If it really is that expensive to reference a single frame then it puts into perspective how effective this VPU hardware must be to be able to do 3 reference frames of 4K at 60 fps.

I always thought of reference frames as like the sampling rate, so in that sense, is it how few reference frames can it get away with, without being noticeable?

Would that also depend on the content?

Aren’t panning shots more difficult to encode?

> I always thought of reference frames as like the sampling rate, so in that sense, is it how few reference frames can it get away with, without being noticeable?

Actually not quite - "reference frames" means how far back (or forward!) the encoded frame can reference other frames. In plain words, "max reference frames 3" means that frame 5 in a stream can say "here goes block 3 of frame 2" but isn't allowed to say "here goes block 3 of frame 1" because that's out of range.

This has obvious consequences for decoders: they need enough memory to keep that many decoded, uncompressed frames around in case a future frame references them. It also has consequences for encoders: while they don't have to reference frames far back, it'll increase efficiency if they can reuse the same stored block of image data across as many frames as possible. This of course means that they need to scan more frames for each processed input frame to try to find as much reusable data as possible.

You can easily get away with "1" reference frame (MPEG-2 has this limit for example), but it'll encode same data multiple times, lowering overall efficiency and leaving less space to store detail.
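
The "max reference frames" rule from the explanation above, reduced to a sliding window. This sketch only models backward references in display order; real codecs also allow forward references and manage explicit reference lists, so treat it as a simplification.

```python
def referenceable_frames(frame_index, max_reference_frames):
    """Earlier frames that frame_index may reference (backward references only)."""
    first = max(0, frame_index - max_reference_frames)
    return list(range(first, frame_index))

allowed = referenceable_frames(5, 3)
print(allowed)
```

With a limit of 3, frame 5 can point at frame 2 but not at frame 1, and a decoder never has to hold more than three decoded frames for this purpose.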

> Would that also depend on the content?

It does depend on the content - in my testing it works best for animated content because the visuals are static for a long time so referencing data from half a second ago makes a lot of sense. It doesn't add a lot for content where there's a lot of scenecuts and actions like a Michael Bay movie combat scene.

I am by no means an expert, and this is by no means indicative of a video compression FPGA, but I've been looking at .GZIP and .PNG accelerators and it seems that while they deliver incredible speed, they do so at the worst compression ratio you can fit in the compression spec. Equivalent to .GZIP setting 2, or maybe equivalent to a "super fast" or "ultra fast" video preset. It is important to note that these are lossless algorithms though. Still, it may not make sense to use them if your application is bandwidth sensitive. If 4k Netflix doubled its bitrate by switching to FPGA solutions, that would probably be too high a cost, even for a 20x speedup. At least until high quality internet speeds become more universal of course.

At least in Google's case, YouTube videos are usually transcoded in idle datacenters (for example, locations where the locals are sleeping). This means that the cost of CPU is much lower than a naive estimate. These new accelerators can only be used for transcoding video; the rest of the time they will sit idle (or you will keep them loaded but the regular servers will be idle). This means that the economics are not necessarily an obvious win.

Of course if you do enough transcoding that you are buying servers for the job then these start to save money. So I guess someone finally decided that the R&D would likely pay off due to the current combination of cyclical traffic, adjustable load and the cost savings of the accelerator.

The complement of "building its own video-transcoding chips" isn't just software encoding though. Google/Youtube could have already been using hardware encodings, just with generic GPUs or whatever existing hardware.

Intel has had PCIe cards targeted at this market, reusing their own HW encoder, e.g. the VCA2 could do up to 14 real-time 4K transcodes at under 240W, and the upcoming Xe cards would support VP9 encode. (XG310 is similar albeit more targeted at cloud gaming servers)

These PCIe cards just run a low power Xeon CPU with the iGPU doing the majority of the heavy lifting.

It was always an interesting and weird product; it even runs its own OS.

There is one such chip in your phone and in your GPU.

well I do believe video transcoding chips have been there forever. But I think those chips should be tailored to their exact application making them more efficient

The real news here is that they still use GPUs to transcode their videos, whilst other services such as the search engine have already been using TPUs for almost a decade now.

I thought they'd already been using custom chips for transcoding for decades.

GPUs have specialized hardware for video transcoding, no? So this actually makes sense. The product was already made (although, perhaps not up to Youtube's standard) by GPU manufacturers.

The specialized hardware in GPUs is targeted at encoding content on the fly. While you could use this to encode a video for later playback it has a couple of drawbacks when it comes to size and quality, namely h264, keyframes, static frame allocations, no multipass encoding, etc. ... This is why video production software that supports GPU encoding usually marks this option as "create a preview, fast!". It's fast but that's it. If you want a good quality/size ratio you would use something like VP9 for example. Because of missing specialized hardware and internals of the codec itself currently this is very slow. Add multipass encoding, something like 4k at 60 frames, adaptive codec bitrates and suddenly encoding a second takes over two minutes ... the result is the need for specialized hardware.

That is interesting, I didn't know that. I thought that doing computation on the GPU sped up encoding for even the high quality presets.

Is there any solid information about Google using TPU for the search engine, or is this an assumption you're making?

This[0] Google blog from 2017 states they were using TPU for RankBrain which is what powers the search engine

[0] - https://cloud.google.com/blog/products/ai-machine-learning/g...

They had to get special permission from the US government to export TPU's abroad to use in their datacenters. The TPU's fell under ITAR regulations (like many machine learning chips). The US government granted permission, but put some restriction like 'they must always be supervised by an american citizen', which I imagine leads to some very well paid foreign security guard positions for someone with the correct passport...

Read all that on some random government document portal, but can't seem to find it now...

All that power to show me recipes that hemhaw around and spend ten paragraphs to use all the right SEO words.

I feel like the Google results were better 20 years ago, what did they use back then before TPUs?

The web just got worse in a lot of ways, because everything needs to generate money.

I think search results 20 years ago were laughably worse than today.

They were actually transcoding on CPUs before, not GPUs

Are TPUs particularly useful for this kind of workload, compared to specialized encoders/decoders available on GPUs?

Yeah I was surprised that it's taken them this long to build custom hardware for encoding videos.

Right, at a competing video site we had vendors trying to sell us specialized encoding hardware most of a decade ago.

i think they do more general purpose things like downsampling, copyright detection etc which doesn't have globally available custom asics. i think gpus don't do encoding/decoding themselves, they have separate asics built in which do the standardised encodings

Impressive. I wonder if Google will sell servers with these cards via Google Cloud. Seems like it could be pretty competitive in the transcoding space and also help them push AV1 adoption.

You can transcode as a service on Google Cloud: https://cloud.google.com/transcoder/docs

The paper linked in the ARS article (https://dl.acm.org/doi/abs/10.1145/3445814.3446723) seems to be how they developed it. I find it interesting that they went from C++ to hardware in order to optimize the development and verification time.

In my past experience working with FPGA designers, I was always told that any C-to-H(ardware) tooling was always quicker to develop but often had significant performance implications for the resulting design in that it would consume many more gates and run significantly slower. But, if you have a huge project to undertake and your video codec is only likely to be useful for a few years, you need to get an improvement (any improvement!) as quick as possible and so the tradeoff was likely worth it for Google.

Or possibly the C-to-H tooling has gotten significantly better recently? Anyone aware of what the state of the art is now with this to shed some light on it?

It has not, and the type of design they show in the paper has a lot of room to improve (FIFOs everywhere, inefficient blocks, etc). However, video transcoding is suited to that approach since the operations you do are so wide that you can't avoid a speedup compared to software.

Why don’t they encode on the uploader machine?

> Why don’t they encode on the uploader machine?

Are you basically asking why they don't take a performance-sensitive, specialised, and parallel task and run it on a low-performance, unspecialised, and sequential system?

Would take hours and be super inefficient.

Imagine someone using a 10 year old computer to upload a 1 hour video. Not only do they need to transcode to multiple different resolutions, but also codecs. This would not be practical from a business / client relationship standpoint. They want their client (the uploader) to spend as little time as possible and get their videos as quickly as possible.

I wouldn't even say 10 year old computer. Think phones or tablets. As well as the battery drain. Or imagine trying to upload something over 4G.

> Or imagine trying to upload something over 4G.

4G is perfectly fine for uploading videos. It can hit up to 50 Mbps. LTE-Advanced can do 150 Mbps.

Though that being said, it would be great to be like: hey Google, I'll do the conversions for you! But then they would have to trust that the bitrate isn't too high, that it's not going to crash their servers, etc.

Can't trust user input, you'd have to spend quite a bit of energy just checking to see if it's good. You also want to transcode multiple resolutions, it'd end up being quite slow if it's done using JS.

Checking the result is good shouldn't be too hard - a simple spot check of a few frames should be sufficient, and it isn't like the uploader gets a massive advantage for uploading corrupt files.

The CPU and bandwidth costs of transcoding to 40+ different audio and video formats would be massive though. I could imagine a 5 minute video taking more than 24 hours to transcode on a phone.

> Checking the result is good shouldn't be too hard - a simple spot check of a few frames should be sufficient, and it isn't like the uploader gets a massive advantage for uploading corrupt files.

Uploading corrupt files could allow the uploader to execute code on future client machines. You must check every frame and the full encoding of the video.

Must is a strong word. In theory browsers and other clients treat all video streams as untrusted and it is safe to watch an arbitrary video. However complex formats like videos are a huge attack surface.

So yes, for the bigger names like Google this is an unacceptable risk. They will generally avoid serving any user-generated complex format like video, images or audio to users directly. Everything is transcoded to reduce the likelihood that an exploit was included.

Verification is simpler than encoding, I suppose.

YouTube needs to re-encode occasionally (new codecs/settings/platforms), it would be easy to abuse and send too high a bitrate or otherwise wrong content, and a lot of end-user devices simply aren't powerful enough to complete the task in a reasonable amount of time.

Because of the massive bandwidth and data requirements. Assuming I as the source have 20 MBit/s content that is 30 min long - that's about 4.5 GB of data.

Given your average DSL uplink of 5 MBit/s, that's 2 hours uploading for the master version... and if I had to upload a dozen smaller versions myself, that could easily add five times the data and upload time.
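For anyone who wants to plug in their own numbers, the arithmetic can be sketched in a few lines of Python. It assumes a constant bitrate and decimal gigabytes; the function names are just for illustration.

```python
def stream_size_gb(bitrate_mbit_s, duration_s):
    """Size in decimal gigabytes of a constant-bitrate stream."""
    megabits = bitrate_mbit_s * duration_s
    return megabits / 8 / 1000      # Mbit -> MB -> GB


def upload_time_h(size_gb, uplink_mbit_s):
    """Hours needed to push `size_gb` through an uplink of the given speed."""
    megabits = size_gb * 1000 * 8   # GB -> MB -> Mbit
    return megabits / uplink_mbit_s / 3600


size = stream_size_gb(20, 30 * 60)  # 20 Mbit/s for 30 minutes
print(size)                         # 4.5
print(upload_time_h(size, 5))       # 2.0
```

So 20 Mbit/s for 30 minutes is 36,000 Mbit, which is 4.5 GB and two hours on a 5 Mbit/s uplink, before any additional renditions.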

Because I make one output file and they optimize for like 7 different resolutions. If they make it longer for me to upload I'd wager that would lower the video upload rate.

Sounds like something for the next version of recaptcha

Because society results in companies being incentivized to babysit users rather than cutting off those who are unable to learn simple technical skills like optimally encoding a video respecting a maximum bitrate requirement.

> simple technical skills like optimally encoding a video respecting a maximum bitrate requirement.

This is in no way a "simple skill" as maximum video bitrate is only one of a number of factors for encoding video. For streaming to end users there's questions of codecs, codec profiles, entropy coding options, GOP sizes, frame rates, and frame sizes. This also applies for your audio but replacing frame rates and sizes with sample rate and number of channels.

Streaming to ten random devices will require different combinations of any or all of those settings. There's no one single optimum setting. YouTube encodes dozens of combinations of audio and video streams from a single source file.

Video it turns out is pretty complicated.

What about cutting off those who condescend others without recognizing the limits of their own understanding?

I’m not an expert in this but I know that “optimally encoding a video” is an actual job. That’s because there’s no global definition of optimal (it varies depending on the source material and target devices, not to mention the costs of your compute, bandwidth, and time); you’re doing it multiple times using different codecs, resolutions, bandwidth targets, etc.; and those change regularly so you need to periodically reprocess without asking people to come back years later to upload the iPhone 13 optimized version.

This brings us to a second important concept: YouTube is a business which pays for bandwidth. Their definition of optimal is not the same as yours (every pixel of my masterpiece must be shown exactly as I see it!) and they have a keen interest in managing that over time even if you don’t care very much because an old video isn’t bringing you much (or any) revenue. They have the resources to heavily optimize that process but very few of their content creators do.

The dangerous case of custom hardware making a software business significantly more efficient: This makes disruption and competition even harder.

This will keep costs down but I am not sure cost of transcoding is the major barrier to entry? I think the network effect (everyone is on YouTube) had already made disruption pretty difficult!

Things like youtube run on super-thin margins. Bandwidth and storage costs are massive, compute costs quite big, and ad revenue really quite low.

A competitor would need either a different model to keep costs low (limit video length/quality, the vimeo model of forcing creators to pay, or go for the netflix-like model of having a very limited library), or very deep pockets to run at a loss until they reach youtube-scale.

I'm still mystified how tiktok apparently manage to turn a profit. I have a feeling they are using the 'deep pockets' approach, although the short video format might also bring in more ad revenue per hour of video stored/transcoded/served.

Depends how you look at it. There could be someone making these chips and then a competitor with lower startup costs than before.

To be honest I suspect it isn't actually a differentiator. It's good for Google that they can produce this chip and trim their hardware costs by some percentage, but it's not going to give them a competitive advantage in the market of video sharing. Especially in a business like youtube with network effects, getting the audience is the difficult bit, the technical solutions are interesting but you're not going to beat google by having 5% cheaper encoding costs.

It's the seemingly infinite bandwidth that Google throws at YouTube that make competition hard. Then there's the inability to monetize. Transcoding is probably about 20th on the list of issues.

What is there to compete for? Video hosting is a money-losing business unless you have exclusives, like Floatplane.

What is floatplane, never heard of it? Seemingly an yt competitor by a somewhat popular youtuber. App on Android has "10k+" installs. Isn't it _way_ too early to say it wouldn't be a money losing business?

My guess is that the commenter is either a Floatplane insider or possibly just optimistic :-)

Think of Floatplane as more of a Patreon competitor with a video component, than a YouTube competitor.

Linus' goal for Floatplane is "If it doesn't fly, it'll at least float." There's only 20 creators on it and it's intended to complement YouTube, not replace it.

What's floatplane? Hadn't heard of it. The website doesn't say much.

Floatplane is a video service built by the people behind the popular Youtube channel LinusTechTips. It is not a direct competitor to Youtube though. The platform makes it easier to let paying fans get videos earlier but it is not meant to build an audience.

It's like crystallisation of the software. When you decide that this is the best version of an algorithm, you make a hardware that is extremely efficient in running that algorithm.

It probably means that, unless you have a groundbreaking algorithm on something that is available as hardware, you simply do software on something that is not "perfected".

It trims marginal improvements.

Perhaps. But the big issues for YouTube right now aren't efficiency per se, but copyright, monetization, AI tagging, social clout. If a YouTube competitor can get the content creators and offer them viewers, competition could perhaps work. This fight is probably not fought at the margins of hardware optimization.

Usually competition for a general platform like YouTube comes in the form of unbundling and in that case these last mile optimizations will matter little.

The main competitors to YouTube are the sites that have non-illegal content that YouTube won't host. e.g: Porn and controversial political stuff.

That might be true but I think sites like odyssey are more popular than controversial political video sites.

It's inevitable, and this applies to other kinds of optimizations as well. This place is too mature, disruption might be easier elsewhere.

as long as you can buy the parts or have the HDL to deploy it on an FPGA you should be fine

How long before the ads are realtime encoded into the video streams such that even youtube-dl can't bypass them without a premium login?

I've been surprised this wasn't already the case, but assumed it was just an encoding overhead issue vs. just serving pre-encoded videos for both the content and ads with necessarily well-defined stream boundaries separating them.

You wouldn't even need to do real-time encoding for that, you can simply mux them in at any GOP boundary (other services already do real-time ad insertion in MPEG-DASH manifests)

Example: https://www.youtube.com/watch?v=LFHEko3vC98

Right, using DAI means you don’t have to actually touch the original video (good!) but doesn’t stop a smart enough client (youtube-dl) from pattern matching and ignoring those segments when stitching the final video together.

I am not, however, suggesting that encoding ads into the final stream is appropriate or scalable, though!

The client doesn't even have to know that there is an ad playing if they really want to thwart ad blockers. If you are talking about pattern-matching the actual video stream ad-blockers could do that today and just seek forwards but none do yet.

I suspect doing personalised ads obliterates any caching method on cheaper hardware than transcoding servers. Interesting problem to solve though.

> Interesting problem to solve though.

Ah, smart people and ads ...

The ads are not part of the encoded video AFAICT, they are probably served as a separate stream which the client requests alongside the regular video stream, this means that videos and ads can be cached using traditional techniques.

Then you could just skip the ad in the video, unless the player has some meta-data around when the ad is; in which case youtube-dl can chop it out.

Not if you tightly control the streaming rate to not get far ahead of a realtime playback, just mete out the video stream at a rate appropriate for watching, not as fast as the pipe can suck it down.

I'm kinda surprised Google doesn't do this... They would need to keep track of user seeks and stuff, but it still seems do-able. One simple model is for the server to know when ad-breaks should happen, and prevent any more downloading for the duration of the ad.

Sure, it would break people who want to watch at 2x realtime, but they seem small-fry compared to those with adblockers.
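A rough sketch of that "simple model" in Python, assuming the server knows the ad breaks as `(content_position_s, ad_duration_s)` pairs and ignoring seeks, pauses and client buffering entirely. `content_position` and `may_send_segment` are hypothetical names, not anything YouTube is known to run.

```python
def content_position(elapsed_s, ad_breaks):
    """Content timestamp a real-time viewer has reached after `elapsed_s`
    seconds of wall-clock time, given (position_s, duration_s) ad breaks."""
    pos = 0.0
    remaining = elapsed_s
    for break_pos, ad_len in sorted(ad_breaks):
        to_break = break_pos - pos
        if remaining <= to_break:
            return pos + remaining
        remaining -= to_break
        pos = break_pos
        if remaining <= ad_len:
            return pos              # inside the ad: content is frozen
        remaining -= ad_len
    return pos + remaining


def may_send_segment(segment_start_s, elapsed_s, ad_breaks):
    """Only release segments the viewer could have reached in real time."""
    return segment_start_s <= content_position(elapsed_s, ad_breaks)


breaks = [(60, 30)]                     # one 30 s ad after 60 s of content
print(content_position(30, breaks))     # 30.0
print(content_position(75, breaks))     # 60 (still in the ad break)
print(content_position(100, breaks))    # 70
```

Even in this toy form the server has to know `elapsed_s` per viewer, which is exactly the per-session state the approach needs.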

The issue there is scale, MPEG-DASH/HLS let the edge servers for video to be simple. The servers don't need to do much more than serve up bytes via HTTP. This ends up being better for clients, especially mobile clients, since they can choose streams based on their local conditions the server couldn't know about like downgrading from LTE to UMTS.

Google would end up having to maintain a lot of extra client state on their edge servers if they wanted to do that all in-band. Right now it's done out of band with their JavaScript player. Chasing down youtube-dl users isn't likely worth that extra cost.

The edge server could implement this without much extra complexity.

For example each chunk URL could be signed with a "donotdeliverbefore" timestamp.

Now the edge server has zero state.

Similar things are done to prevent signed-in URLs being shared with other users.
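A minimal sketch of what such a signed chunk URL could look like, in Python. It assumes an HMAC key shared between the application servers and the edge; the key, the `sig` parameter and both function names are made up for illustration, and a real scheme would also bind the URL to a user or session.

```python
import hashlib
import hmac

SECRET = b"shared-between-app-and-edge"   # hypothetical shared key


def sign_chunk_url(path, not_before):
    """Application server: issue a URL that is only valid from `not_before`."""
    msg = f"{path}?donotdeliverbefore={not_before}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?donotdeliverbefore={not_before}&sig={sig}"


def edge_may_serve(path, not_before, sig, now):
    """Edge server: recompute the signature and compare against its clock.
    Nothing about the client is stored between requests."""
    msg = f"{path}?donotdeliverbefore={not_before}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and now >= not_before


url = sign_chunk_url("/v/abc/seg42.m4s", 1000)
sig = url.rsplit("sig=", 1)[1]
print(edge_may_serve("/v/abc/seg42.m4s", 1000, sig, now=1005))  # True
print(edge_may_serve("/v/abc/seg42.m4s", 1000, sig, now=990))   # False
print(edge_may_serve("/v/abc/seg42.m4s", 0, sig, now=1005))     # False, tampered
```

The edge only needs the key and a roughly correct clock; deciding what timestamp to put on each chunk is left to whoever generates the URLs.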

There's no shared wall clock between the server and client with HTTP-based streaming. There's also no guarantee the client's stream will play continuously or even hit the same edge server for two individual segments. That's state an edge server needs to maintain and even share between nodes. It would be different for every client and every stream served from that node.

For streaming you actually want the client to have a buffer past the play head. If the client can buffer the whole stream it makes sense to let them in many cases. The client buffers the whole stream and then leaves your infrastructure alone even if they skip around or pause the content for a long time. The only limits that really make sense are individual connection bandwidth limits and overall connection limits.

The whole point of HTTP-based streaming is to minimize the amount of work required on the server and push more capability to the client. It's meant to allow servers to be dumb and stateless. The more state you add, even if it's negligible per client, ends up being a lot of state in aggregate. If a system meant edge servers could handle 1% less traffic that means server costs increase by 1%. Unless those ones of ad impressions skipped by youtube-dl users come anywhere close to 1% of ad revenue it's pointless for Google to bother.

> skipped by youtube-dl users

It's also ublock and adblock plus users. Estimated at about 25% of youtube viewership.

Also, the shared clock only needs to be between edge servers and application servers. And only to an accuracy of a couple of seconds. I bet they have that in place already.

That sounds like an enormous pain in the arse just to piss off a vocal minority of users.

A vocal minority who are not bringing in any revenue for the site.

Saying that though the day they finally succeed in making ads unskippable will be the time for a competitor to move in.

When it's possible to skip baked in ads (SponsorBlock[1]) -- the whack-a-mole will continue no matter what. Even if it means you can't watch videos in realtime but have to wait for them to fully download to rip the ad out, someone will figure it out.

At that time everyone starts talking about it and I gotta imagine a bunch of new people become adblocking users.

[1] https://news.ycombinator.com/item?id=26886275

SponsorBlock only works because the sponsored segments are at the same location for every viewer. If Youtube spliced in their own ads they could easily do it at variable intervals preventing any crowd sourced database of ad segment timestamps. To be honest, nothing really stops Youtube from just turning on Widevine encryption for all videos (not just purchased/rented TV & movies) besides breaking compatibility with old devices. Sure widevine can be circumvented but most of the best/working cracks are not public.

Which only works long enough for someone to make some sort of video diffing system and removes anything others didn't get at that timestamp or something

Yeah, if YT wanted to insert unskippable ads on the backend, they would have years ago. The tech is not the hard part. They know it'd be a PR disaster for them.

"now" being 2015. They are talking about the new 2nd generation chip here, which is a bit faster.

Wow 33x throughput improvement for vp9 for the same hardware cost. That seems excessive but their benchmark is using ffmpeg. Is ffmpeg known to have the theoretically highest throughput possible state of the art vp9 encoder algorithms? Or is there any way of knowing if their hardware IP block is structured equivalently to the ffmpeg software algorithm? I know that custom hardware will always beat general hardware but 33x is a very large improvement. Contemporary core counts coupled with very wide simd makes CPUs functionally similar to ASIC/fpga in many cases.

The only OSS VP9 encoder is Google’s own libvpx, which is what ffmpeg uses.

By now Intel has released an open source encoder (BSD-2 + patent grants), tuned for their Xeons:


That doesn't look so excessive to me. We get hundred or thousand times more efficiency and performance regularly using custom electronics for things like 3d or audio recognition.

But programming fixed electronics in parallel is also way harder than flexible CPUs.

"Contemporary core counts coupled with very wide simd makes CPUs functionally similar to ASIC/fpga in many cases."

I don't think so. For things that have a way to be solved in parallel, you can get at least a 100x advantage easily.

There are lots of problems that you could solve in the CPU(serially) that you just can't solve in parallel(because they have inter dependencies).

Today CPUs delegate the video load to video coprocessors of one type or another.

BTW: Multiple CPU cores are not parallel programming in the sense FPGAs or ASICs (or even GPUs) are.

Multiple cores work like multiple machines, but parallel units work choreographically in sync at lower speeds(with quadratic energy consumption). They could share everything and have only the needed electronics that do the job.

Well transistors are cheap and synchronization is not a bottleneck for embarrassingly parallel video encoding jobs like these. Contemporary CPUs already downclock when they can to save power and conserve heat.

>> Contemporary core counts coupled with very wide simd makes CPUs functionally similar to ASIC/fpga in many cases.

> I don't think so. For things that have a way to be solved in parallel, you can get at least a 100x advantage easily.

That’s kind of my point. CPUs are incredibly parallel now in their interface. Let’s say you have 32 cores and use 256 bit simd for 4 64-bit ops. That would give you ~128x improvement compared to doing all those ops serially. It’s just a matter of writing your program to exploit the available parallelism.

There’s also implicit ILP going on as well but I think explicitly using simd usually keeps execution ports filled.

TBH 32 or even 64 cores does not sound all that impressive compared to the thousands of cores available on modern GPUs and presumably even more that could be squeezed into a dedicated ASIC.

In any case, wouldn't you run out of memory bandwidth long before you can fill all those cores? It doesn't really matter how many cores you have in that case.

Those thousands of cores are all much more simple and do not have simd and have a huge penalty for branching. There are problems for which GPUs and CPUs are roughly equally well suited. GPUs have their cons.

I wonder how YouTube's power consumption in transcoding the most useless / harmful videos relates to Bitcoin's power consumption. Maybe even every video should be included in the calculation, since Bitcoin also has its positive aspects.

I've never heard about how much power YouTube's transcoding is consuming, but transcoding has always been one of those very CPU-intensive tasks (hence it was one of the first tasks to be moved over to the GPU).

For YouTube's scale it makes sense, since a small saving or efficiency boost would accumulate at their scale.

Not just cost reduction or efficiency, the faster encodes you can get through dedicated hardware mean they can potentially reduce the delay between a video being uploaded and a video being available to the public (right now even if you don't spend time waiting in the processing queue, it takes a bit for your videos to get encoded)

You can handle larger volumes of incoming video by spinning up more encoder machines, but the only solution for lowering latency is faster encodes, and with the way the CPU and GPU markets are these days a dedicated encoder chip is probably your best bet.

You can split a video up across cores or even across servers. Encoding speed does not need to have a significant impact on publishing latency.
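As a sketch of the idea in Python: cut the input into GOP-sized chunks, encode them independently, and concatenate in order. `encode_chunk` is a placeholder for a real encoder call, the thread pool merely stands in for separate cores or servers, and the sketch assumes every chunk can start on a keyframe.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(n_frames, gop_size):
    """Return (start, end) frame ranges, one per closed GOP."""
    return [(start, min(start + gop_size, n_frames))
            for start in range(0, n_frames, gop_size)]


def encode_chunk(frame_range):
    # Placeholder: a real pipeline would run an encoder on these frames.
    start, end = frame_range
    return f"encoded[{start}:{end}]"


def parallel_encode(n_frames, gop_size, workers=4):
    chunks = split_into_chunks(n_frames, gop_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so segments line up again.
        return list(pool.map(encode_chunk, chunks))


print(parallel_encode(10, 4))
# ['encoded[0:4]', 'encoded[4:8]', 'encoded[8:10]']
```

With enough workers the wall-clock time approaches that of the slowest chunk, which is why per-chunk encode speed matters less for publishing latency than total throughput does.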

I'm wondering if this is related to the recent Roku argument; perhaps YouTube is trying to force Roku to incorporate a hardware decoding chip (maybe with an increased cost) in future products as a condition to stay on the platform.

I don’t think YouTube cares if you use hardware or software decoding. I also don’t think they care if you use their hardware decoder or someone else’s. The issue with Roku is they don’t want to include any extra hardware to support VP9, and they use such cheap/low spec hardware they can’t reliably decode in software.

I'm just always blown away that Google transcodes into as many formats as they do upfront. I wonder if they do a mix of just in time transcoding on top of queue-based.

For VP9/x264, almost certainly not. If you jump on a newly-uploaded video, you'll see that higher resolution comes later. It's common to see 720p nearly immediately, then 1080p, then 4K.

For pre-x264, they probably could, but between the relatively small sizes required for the low resolution those codecs would be supporting, and the cost difference between compute and storage, I'd bet everything is encoded beforehand.

> Google's warehouse-scale computing system.

That is quite the understatement. Google's computing system is dozens of connected "warehouses" around the world.

This is intense. ASICs making a comeback again. It’s weird how the computer market is so cyclical with regard to trends.

I wonder how much storage YouTube currently uses?

Honestly I’m surprised they didn’t do this earlier
