> High-Density Video Decoder\*\*: 96 channels of 1920x1080p > [...] > \*\*: @10 ...

novaRom · on Jan 10, 2023

10 fps should be fast enough to provide input tensors for real time inference with small scale transformers / convolutional nets.

scottlamb · on Jan 10, 2023

Maybe. Running inference at 10 fps is probably plenty. But that doesn't mean you only have to do 10 fps of H.264/H.265 decoding. I think the most common scenario is for the input video to be e.g. 30 fps with mostly P frames that each depend on the prior frame in a chain. In that case, you need to decode almost [1] 30 fps to get 10 fps of evenly spaced frames to process.

[1] You could skip the last P frame before an IDR frame, but that doesn't buy you much.

zamadatix · on Jan 10, 2023

If your source is 96 YouTube videos sure, if it's 96 CCTV cameras it's different.

scottlamb · on Jan 10, 2023

Still depends. As it happens, I'm developing my own open source NVR software, [1] so I know a bit about this. Some cameras are fairly good about this, supporting the following features:

* "Temporal SVC", in which the frame dependencies are structured so you can discard down to 1/2 or 1/4th of the nominal frame rate and still decode the remainder.

* Three output streams, which you could configure for say forensics (high-bandwidth/high-resolution/high-fps), inference (mid-bandwidth/mid-resolution/low-fps), and viewing multiple streams / over mobile networks (low-bandwidth/low-resolution/mid-fps).

* On-camera ML tasks too. (Although I haven't seen one that lets you upload your own model.)

But other cameras are less good. E.g. some Reolinks [2] only support two streams, and the "sub" stream is fixed at 640x352, which is uncomfortably low. Your inference network may not take more resolution than that, but even if not, you might want to crop down to the area of interest (where there's motion and/or where the user has configured an alert) to improve quality. (You probably wouldn't pair that cheap Reolink camera with this expensive inference card, but the point stands in general.)

Even the "better" cameras' timestamp handling is awful, so it's hard to reliably match up the main stream, sub stream, analytics output, and wall clock time. Given that limitation it'd be desirable to just use the main stream for everything but the on-NVR transcoding's likely unaffordable.

[1] https://github.com/scottlamb/moonfire-nvr

[2] https://github.com/scottlamb/moonfire-nvr/wiki/Cameras:-Reol...

zamadatix · on Jan 11, 2023

Price wise if this card is $5,000 that's $52 per where you don't need any onboard smarts handled by the camera in a space where commercial cameras are hundreds of dollars to buy or replace to have the particular smarts you're looking for that day. I've done a few PoCs in the smart city/smart retail space they are advertising here and they pretty much end up falling into the "everything must be pre-processed as much as possible and sent to the cloud" or "everything must be dumb and sent to the central recorder" buckets as anything in the middle creates a bad cost balance where you're neither optimising hardware+simplicity costs or data+cloud costs. I'll admit though I don't normally go out to sell cameras all day it's just something we've added as clients in part of a larger connectivity rework (CBRS/LTE/Wi-Fi/GPON/traditional wired) and we typically partner up with some specialized company on the video processing use case. The onboard camera processing is usually about justifying a cloud pitch ("we use data to send video when something interesting happens" or "we send only the best picture of the face in HD to save bandwidth but still be able to ID them later") not so much letting you go in and solve your own problem. One exception I ran into was license plates at a car wash outfit where they were able to send the plate numbers back to their main app but that probably came from being a pre-baked solution for road tolls.

I also have a sneaking suspicion using lower channel counts let you raise the FPS but the max of 96 channels is the hard limit, tuned to allow up to use cases like recognition from unprocessed feeds but the documentation access seems to be a manual approval process so I can't verify for sure.

scottlamb · on Jan 11, 2023

> Price wise if this card is $5,000 that's $52 per where you don't need any onboard smarts handled by the camera in a space where commercial cameras are hundreds of dollars to buy or replace to have the particular smarts you're looking for that day.

Good point. At that scale, the price might make sense. (I'd still hesitate to buy this card, though. Based on experience with Amazon VT1 instances, I don't have any faith in Xilinx's software quality.)

There are much lower-cost solutions if you don't need that many cameras, e.g.:

* The Coral TPU is nice and cheap. I keep hoping to see a new version and/or someone making M.2/PCIe cards with several of these chips on it. It doesn't do the video decoding, though, so you need other hardware for that.

* There was an Axelera card just announced. [1] I'm curious to read the reviews when it actually ships to folks.

* The newer Rockchip SoCs advertise decent video decoding and some ML acceleration. I have one and will be trying it out sooner or later.

> The onboard camera processing is usually about justifying a cloud pitch ("we use data to send video when something interesting happens" or "we send only the best picture of the face in HD to save bandwidth but still be able to ID them later") not so much letting you go in and solve your own problem.

My software's more aimed at the home/hobbyist side of things. There some folks go with the canned/cloud stuff (Ring/Nest/whatever) similar to what you're saying. Some do everything at home with e.g. BlueIris and use the on-camera ML stuff as it is. The lack of flexibility (mostly due to closed-source, low-quality software IMHO) is a real problem though. Some folks use something like Frigate that does on-NVR analytics, and I'll eventually add that feature to my own software.

> I also have a sneaking suspicion using lower channel counts let you raise the FPS but the max of 96 channels is the hard limit, tuned to allow up to use cases like recognition from unprocessed feeds but the documentation access seems to be a manual approval process so I can't verify for sure.

I bet you're right.

[1] https://www.cnx-software.com/2023/01/02/150-axelera-m2-ai-ac...