Object Detection at 1840 FPS with TorchScript, TensorRT and DeepStream (paulbridger.com)
162 points by briggers on Oct 18, 2020 | 55 comments



I see this and I immediately think of "trash sorting" at ultra-high speed. If you could combine this with a bunch of accurate (laser-precision) air guns to shoot and move individual pieces of trash, you could sort through a truckload of trash in a matter of seconds, perhaps in the air while it is being dumped! Compare this approach with how we are currently doing it [0]. Somebody should get Elon Musk on this project right away!

[0] - https://www.youtube.com/watch?v=QbKA9uNgzYQ


Sorting by optical recognition and air guns to separate a falling curtain of product into two output streams is already a product. The development of these machines is the reason that 10 or 15 years ago you stopped seeing bad beans in bulk bean bags. I am involved in the tea industry, where they are used to sort tea by grade: stems, bad leaves, broken leaves, full leaves.

Here is a diagram: https://www.satake-usa.com/what-is-optical-sorting.html


Yes, but when I went to the recycling (sorting) facility in San Mateo they remarked that their plastic sorting systems work by infrared reflection and so cannot see black plastic. They said that because of this they are unable to process black plastic at all. I got the sense that there’s room for improvement.

Bonus slow motion footage of their processing machines: https://imgur.com/gallery/IK5zKkO


Reminds me of the library modernization drive in 'Rainbows End'. The book digitizer is basically a wood chipper with lights and high speed cameras in the debris chute.


There was a company in Japan that would digitize your books and destroy the original (to prevent infringement claims), iirc.


Not related to Elon, but there is a company called Pellenc ST in the south of France that works on exactly this kind of problem. You can see a video of one of their machines here [0].

I work at an AI consultancy [1] that helps them use deep neural nets under these high-throughput, low-latency conditions. It's an interesting challenge, and the performance that can be squeezed from modern hardware is indeed impressive.

0: https://youtu.be/XLciSGE82DY?t=280

1: https://neovision.fr


This is a thing already. In my understanding, it's a staple in several kinds of recycling processes. Random sample of related links (there's a seemingly infinite amount of these though):

https://youtu.be/mLya2NuY4Yk

https://youtu.be/GJeOfHxMWQo?t=87

https://youtu.be/bWUuBz2hWc0?t=83


Trash sorting is probably a better fit than self-driving cars. I only see talk about speed on this page and nothing about accuracy.

Musk needs something like 99.9999% accuracy at near-zero latency over several hours of operation. I think Tesla is currently at maybe 99.995%, judging from driving my car. The last 0.005% results in phantom braking etc. It's actually a very hard nut to crack, and I don't expect them to achieve full self-driving in all conditions for another 10-15 years maybe. The edge cases are just too many.

I like the trash idea though (or a QA robot at a factory, etc.).


Out of curiosity, what are the possible use cases for object detection at >100 fps? I assume it would have to be objects that move very fast, i.e. nothing ordinary that I can think of.

[edit] actually stupid question. I assume it's more about throughput than fps, i.e. be able to process lots of streams on the same machine, for instance for doing mass analysis of CCTV streams.


While I'm not into object detection such as this, I can easily imagine this being part of a system where you want the rest of the system to have time to act on the information.

As such, the point isn't that you can detect objects at >N fps, but rather that object detection shouldn't take more than X% of the time per cycle, so that the overall cycle can run at a given rate.


If your pipeline depends on running inference on a single frame at a time, for example some kind of control loop, then you need to be a bit careful about how you measure speed; you have to use the effective time per batch (ie batch size 1), not the amortised frames per second using as big a batch as will fit. You can still interleave processing though.
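
A minimal sketch of the difference, assuming a TorchScript detector on a CUDA GPU (the model file and input size here are illustrative, not from the article):

    import time
    import torch

    model = torch.jit.load("detector.pt").cuda().eval()   # hypothetical model file
    frame = torch.rand(1, 3, 416, 416, device="cuda")      # batch size 1
    batch = frame.repeat(32, 1, 1, 1)                      # as large a batch as fits

    with torch.no_grad():
        model(frame)                      # warm-up so timings exclude JIT/alloc cost

        # Effective per-frame latency: time one frame end to end.
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        model(frame)
        torch.cuda.synchronize()
        print(f"batch-1 latency: {(time.perf_counter() - t0) * 1000:.1f} ms")

        # Amortised throughput: time a big batch and divide by its size.
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        model(batch)
        torch.cuda.synchronize()
        print(f"throughput: {batch.shape[0] / (time.perf_counter() - t0):.0f} fps")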


Very practical question :) Exactly as you say, multi-stream throughput. Also for faster than realtime offline processing of video. Check the caveats section at the end of the post - DeepStream is probably not well suited to high throughput single-stream inference.


For me it's only exciting because it lowers the barrier for how much I can do with a much smaller system than dual 2080 Tis.


Yeah. A 2080Ti doesn't fit in your pocket or in your AR glasses but the same techniques and tools scale down.


Self-driving. Ideally you want something around 1000fps and low latency, so it has time to react.

I'm sure military and sports applications are obvious too.


I don't think you'll find a 1000 fps camera on a "standard" AV platform. And if you did, I imagine it would be too noisy to be useful without a ton of illumination.


Smartphones have offered 960fps video capture for a while now.


Doubt

Human reaction times are much slower than that. In fact, for some things it can take a whole second: https://www.visualexpert.com/Resources/reactiontime.html

Maybe racing sports have shorter reaction times, but I'd be frankly surprised if it were anything < 100ms.

10fps for your average drive should be more than enough


But machines aren't (yet) as capable as humans at driving-situation-recognition and driving-decision-making. One way they can compensate for those shortcomings is to be superior in other ways: 100% vigilance and super-fast reaction times/decision-making.


Answering my own question. Possibly industrial applications like detecting objects on a fast conveyor belt. A recycling facility for instance.


100ms is generally considered to be the lower bound for a valid start, i.e. 99ms would be considered a jump start in F1, iirc.


No, that's only if you want the human to be able to make the reaction. If the application was self-driving, you'd prefer the car to react faster than a human. For a military application like projectile detection to avoid or destroy the object, you'd want something even faster.


I'm not sure what your argument is. Any self-driving system should strive to be much better than human drivers. It makes complete sense to have reaction time much better than human.


For reference, I am making a farming robot that goes at 1 meter per second and I run the main control loop at 10Hz, i.e. it travels about 10cm per control cycle. I would absolutely run a car that goes at freeway speeds at at least 100Hz.


A cloud service that needs to serve multiple requests or process many video streams in parallel. (Faster performance = less hardware required, bigger scale, and potentially a better end-user experience - aside from their data being on the cloud, of course.)

On-device (e.g. mobile phone) processing with battery usage that respects the user, and support for older hardware/models as well.

Of course the above aren't cases where the stream itself is 100+fps; they're more general benefits. For a 100+fps stream... well, there are many things that move fast. Imagine you wanted a robot that tracks or catches a fly before it takes off. Flies have a reaction time of 5ms (200fps); that's why they're hard for us to catch! Now expand and apply the same concept to other things that are fast, or that happen very quickly.


Another aspect is you might get better accuracy with larger input dimensions, but the number of pixels scales quadratically with width/height.


Food processing, recycling separation? I can imagine lots of small parts moving fast


Plenty of robotics tasks could benefit from high FPS tracking of single streams. Generally process tasks where faster=better. But yes, tracking many streams at once is useful too!


Tomato/potato sorting while they are on a conveyor. The higher the fps, the more objects you can dump on said conveyor.


This would be useful for sorting cells at high speed. https://www.sinobiological.com/category/fcm-facs-facs


Smart missiles and weapons, I guess.


Shooting down drone swarms.


Roulette spin predictor



I guess, but that's generally illegal, and if it became common & easy then casinos would simply require bets to be placed before the ball is dropped.


Hypersonic missiles


How portable are these techniques to other architectures? Could >100 FPS be realistically achieved today using only CPUs or mobile phones?


> Could >100 FPS be realistically achieved today using only CPUs or mobile phones?

Not yet.

Google's MediaPipe object detector (which is one of the most optimised mobile solutions around) can do "26fps on an Adreno 650 mobile GPU"[1].

The Adreno 650 is the GPU in the Snapdragon 865 [2], i.e. the current high-end SoC used by most non-Apple phones. This gives roughly the same performance as an iPhone 11 [3].

[1] https://google.github.io/mediapipe/solutions/objectron.html

[2] https://www.tweaktown.com/news/69097/qualcomm-adreno-650-gpu...

[3] https://www.tomsguide.com/news/snapdragon-865-benchmarks


Thanks for the links. I think there also isn't an API for accessing high FPS cameras on Android devices that support slow motion video capture.


Converting to ONNX gives you an advantage on Intel CPUs too, if you then convert from ONNX to OpenVINO.
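
Roughly, that path looks like this for a PyTorch model; the model name and input shape are made up, and the Model Optimizer command in the comment is how I understand OpenVINO consumes the ONNX file, so treat it as an assumption:

    import torch

    model = MyDetector().eval()            # hypothetical PyTorch model
    dummy = torch.rand(1, 3, 416, 416)     # example input shape

    # Step 1: export to ONNX.
    torch.onnx.export(model, dummy, "detector.onnx", opset_version=11,
                      input_names=["input"], output_names=["boxes"])

    # Step 2 (outside Python): convert the ONNX file with OpenVINO's Model
    # Optimizer, e.g. `mo.py --input_model detector.onnx --data_type FP16`,
    # then run the resulting IR with the Inference Engine on an Intel CPU.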


Mobile phones definitely since these days most of them have pretty powerful GPUs.


A weird question, but since there's another article on HN right now about programming language energy efficiency https://news.ycombinator.com/item?id=24816733 any idea whether going from 9fps to 1840fps consumes the same power, 200x the power, or somewhere in between?


Great question; now I wish I'd recorded power consumption for all these experiments. Judging from cumulative hours of watching the output of nvidia-smi, I've definitely seen a roughly linear relationship between utilization and power draw (with a non-zero floor of 30-40W).
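
For anyone who wants to capture that next time, a quick sketch that samples nvidia-smi's power and utilization counters once a second during a run (standard query flags, first GPU only):

    import subprocess
    import time

    def gpu_power_and_util():
        """Return (power draw in W, GPU utilization in %) for the first GPU."""
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=power.draw,utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True)
        watts, util = out.strip().splitlines()[0].split(", ")
        return float(watts), float(util)

    samples = []
    for _ in range(60):                    # one sample per second for a minute
        samples.append(gpu_power_and_util())
        time.sleep(1)

    avg_watts = sum(w for w, _ in samples) / len(samples)
    print(f"average draw over the run: {avg_watts:.1f} W")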


I see Rust is almost equal to C, if not better, in the graph. However, I think equally skilled programmers in either language would show the Rust programmer spending more 'energy' programming and iterating than the C programmer, though you could then argue that the C program will use more 'energy' downstream if bugs slip in. In any case, it's an eye-opening metric on something that I, and I am sure many others, take for granted. Cool.


EDIT: I think Zig would come out pretty well here too:

https://twitter.com/andy_kelley/status/1317586767260774400


Good work getting TensorRT running. We had a real pain in the butt with it recently and just opted to go with ONNX Runtime, their graph optimizer and their TensorRT backend. It may not be as fast as straight TensorRT from the comparisons I've seen, but it got us to competitive inference and latency, so we're happy with it.
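
For reference, the ONNX Runtime side of that is roughly this (provider names are ONNX Runtime's standard ones; the model file and input shape are made up):

    import numpy as np
    import onnxruntime as ort

    # Prefer TensorRT, fall back to CUDA then CPU for ops TensorRT can't handle.
    session = ort.InferenceSession(
        "detector.onnx",
        providers=["TensorrtExecutionProvider",
                   "CUDAExecutionProvider",
                   "CPUExecutionProvider"])

    frame = np.random.rand(1, 3, 416, 416).astype(np.float32)
    outputs = session.run(None, {session.get_inputs()[0].name: frame})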


Nice one! I've long been interested in the ONNX serving path.


Any word on latency? I didn't see anything in the article. I guess, since this is a synthetic test just pumping a single image file through repeatedly instead of an actual video stream, then it wouldn't realistically be measurable. But if latency is particularly low, this would be a boon for AR systems.


BTW, this is pumping the same video file through the network - not just a single file. I don't measure latency, but this is not a deep pipeline so it's easy to calculate.


Ok, I guess I misread that part.


Latency is fundamentally limited by the model processing a single frame; all in all, probably somewhere around 10 to 15ms depending on your input size (assuming VGA-type input). This is a great article about system engineering for the vision pipeline, but to solve the latency issue you need either a beefier (or more specialized) processor or a better-tuned algorithm.


> There is evidence (measured using gil_load) that we were throttled by a fundamental Python limitation with multiple threads fighting over the Global Interpreter Lock (GIL).

Can anyone comment on how often this is a problem and if this problem is truly fundamental to Python? Could it be solved in a Python 3.x release?


Yes, this is a fundamental part of Python. By default, a single Python process can only execute Python bytecode on one thread at a time. Threads created with the threading module are real OS threads, but the GIL means only one of them runs Python code at any given moment, so for CPU-bound work they behave more like fibers in some other language. So, if you're not waiting on I/O, then yes, the threads will fight over the GIL and performance will suffer. This is inherent to CPython's design and is unlikely to change.

But there are a few more things that can be said about this. Python "threads" are really just a mental construct for designing programs. The selling point is that you can share variables and data between "threads" without having to worry about locks or data corruption or anything like that. It just works. But even with that advantage, you're relying on Python to switch between "threads" on its own, and that can easily slow things down. If you're willing to drop the mental construct and go for better performance while still using a single process and sharing variables, the asyncio module lets you control when the main Python process moves between points in the code flow.

However, if you really want to use traditional multiple processes, just use the multiprocessing module. It actually launches multiple Python processes and links them together. It's called in a similar fashion to threading, so there isn't much code change for that part. But because it's no longer a single process - and no longer bound by one GIL - you can't share data between the processes as easily. With multiprocessing, you'll need slightly more complex data structures (like a multiprocessing manager namespace) to share that data. It's not that hard, but it requires a bit of planning ahead of time.
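
A minimal sketch of that pattern, using a manager dict (a sibling of the manager namespace) with illustrative names:

    from multiprocessing import Manager, Process

    def worker(results, n):
        # Each process is its own interpreter, so there is no GIL contention here.
        results[n] = sum(i * i for i in range(1_000_000))

    if __name__ == "__main__":
        with Manager() as manager:
            results = manager.dict()       # proxy object shared across processes
            procs = [Process(target=worker, args=(results, n)) for n in range(4)]
            for p in procs:
                p.start()
            for p in procs:
                p.join()
            print(dict(results))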


The subject has been debated to death. Google it.


Name clash again… I thought about https://deepstream.io/



