How to Easily Detect Objects with Deep Learning on Raspberry Pi (medium.com)
308 points by sarthakjain 10 months ago | 50 comments



I am Prathamesh, co-founder of https://nanonets.com

We have been working on a project to detect objects using deep learning on the Raspberry Pi, and we have benchmarked various deep learning architectures on the Pi. With ~100-200 images, you can create a detector of your own with this method.

In this post, we detect vehicles in Indian traffic using the Pi. We have also added GitHub links to the code to train the model on your own dataset and a script to run inference on the Pi. Hope this helps!


Hi, I have a question about the performance benchmarks section. The best performing model from your benchmarks, ssd_mobilenet_v1, has a prediction time of 0.72 seconds -- is that the total runtime of the script? I'm wondering if I could achieve ~1 FPS running in real time (basically looping as fast as possible against the camera input), or is there more overhead? (edit -- made question more specific)


Yeah, it's the total runtime of the script. However, you can get up to 3-4 FPS with more optimizations. We are going to try more quantization options soon with the release of TensorFlow 1.7 and will report our findings (we will post updates in the blog). The Pi camera code also needs to be optimized, which will further increase FPS. One big advantage of this method is that just by collecting ~200 images for any use case, you can have a detector ready in a couple of hours.

One advantage of the API-based approach is that you can get much higher FPS without compromising accuracy, and it is also independent of the Pi's CPU power, heating, etc.
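
For anyone checking the FPS numbers in this exchange, here is a minimal sketch of the conversion between per-frame time and frame rate. It assumes each frame is processed sequentially with no pipelining, and the function names are made up for illustration.

```python
def max_fps(seconds_per_frame: float) -> float:
    """Upper bound on frames per second when each frame takes this long end to end."""
    if seconds_per_frame <= 0:
        raise ValueError("seconds_per_frame must be positive")
    return 1.0 / seconds_per_frame


def frame_budget(target_fps: float) -> float:
    """Seconds available per frame to sustain the target frame rate."""
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    return 1.0 / target_fps


# 0.72 s is the ssd_mobilenet_v1 prediction time quoted in the question.
print(f"{max_fps(0.72):.2f} FPS at 0.72 s per frame")
# 3-4 FPS is the optimized range mentioned in the reply.
print(f"{frame_budget(3):.3f} s per frame needed for 3 FPS")
print(f"{frame_budget(4):.3f} s per frame needed for 4 FPS")
```

So 0.72 s per frame works out to a bit under 1.4 FPS, provided camera capture and any other per-frame work are already included in that figure.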


Curious, have you looked at any other embedded platforms besides the Pi? The Tegra might be an interesting comparison point, I wonder what kind of FPS the onboard GPU would buy you.


Hey, good suggestion to benchmark against other SoCs as well. I heard the Raspberry Pi recently added support for external graphics cards. Haven't tried it yet.



Thanks for the feedback. What could we possibly do to make it easier to follow the intermediate steps?


Say I have a bunch of photos of my cat. I want to be able to use a Raspberry Pi to recognize my cat from the others. How can I create the dataset to feed to your ML engine? Would love to see an end-to-end how-to, seriously.


actually, you might try this: https://azure.microsoft.com/en-us/services/cognitive-service...

lets you customize a neural net with a small number of specific images (using a technique called 'transfer learning')


We also use transfer learning a lot. Using transfer learning is a bit tricky, though, because sometimes you might upset the generalised weights of the pretrained network with bad hyperparameters. We wrote a blog post earlier on using transfer learning as well:

https://medium.com/nanonets/nanonets-how-to-use-deep-learnin...


Another great tutorial I have found is by the guy who runs an AI on Twitch that plays GTA V. Here is his object detection tutorial playlist.

https://www.youtube.com/watch?v=COlbP62-B-U&list=PLQVvvaa0Qu...


Added a section in the blog to explain end-to-end usage.


Maybe I’m missing something but does this blog post conclude with a service to do inference off device? Why explain all of the steps to inference on device if you’re offering an API to do cloud inference?


Off-device and on-device are alternative ways to do deep learning inference on the Pi, both with pros and cons. For example, with on-device inference, you will need to run a smaller architecture to get decent FPS and will also be dependent on the hardware. Using a cloud API removes those restrictions. There will be some latency in the web request, but you can use a much more accurate model and will be independent of the Pi hardware. Just trying to paint a complete picture. Any suggestions for the blog post are welcome.


> Maybe I’m missing something but does this blog post conclude with a service to do inference off device?

You didn't answer this question. Your "sorta-answer" suggests "yes", but the title "How to easily Detect Objects with Deep Learning on Raspberry Pi" suggests that your answer should be "no".

The title wasn't "How to easily Detect Objects with Deep Learning on Raspberry Pi with cloud services".


Hey, the blog shows how to implement the entire algorithm yourself in Python, or run it using a Docker image on your own machine, or look at the source code for the Docker image (which uses TensorFlow) so you can play around with it. To answer your question: yes, the last part of the blog shows how to do the same thing with a cloud-based API. It's up to the user to pick their preferred method.


> Your "sorta-answer" suggests "yes", but the title "How to easily Detect Objects with Deep Learning on Raspberry Pi" suggests that your answer should be "no".

How am I suggesting "yes"? And how is the title suggesting the answer is "no"? There are pros and cons to both methods. If you are doing inference in a remote place with no internet access, off-device is out of the question. We are just trying to give a complete landscape so that if someone has a use case and is trying to come up with a solution, it might be helpful. Depending on the use case, you can pick on-device or off-device.


Exactly. I’m certainly interested in “How to easily Detect Objects with Deep Learning on Raspberry Pi”. Because I’m interested in that, I am most definitely not interested in “How to easily Detect Objects with Deep Learning on Raspberry Pi with cloud services”.


Then I hope you found the first 90% of the blog interesting. If there is something we can improve, we're happy to hear about it.


Because the blog post is advertising for the service. The goal is to make it look like enough of a pain in the butt that it's worth paying someone else to do it, but not such a big pain in the butt that a cloud service would be expensive or impractical.


I agree that there is potential for bias. The post starts off with a disclaimer explaining the conflict of interest. Anything else we can do to make the post more objective and less prone to bias?


Honestly I didn’t mean that as negatively as it sounds. It’s good advertising.

Squeezing a full blown ML tutorial into a blog post is a tall order. Of course it’s too thin to really do it yourself without a lot of further research, you can’t really expect anything else. But I think the title leads people to expect more detail, hence posts like this and the owl cartoon above.

Maybe add a breakdown of performance for doing this locally on a Pi vs using the API? Would make it easier for people to weigh pros and cons.


I suspect it'd be fun to play with a hybrid approach there - use the local on-device capability to detect "scenes of interest", then ship those out to the cloud service (with significantly more horsepower) to get more accurate results. Possibly, if it works for your use case, you could detect and store "interesting looking stuff" and ship it to the cloud later for analysis if your device only has intermittent internet connection.
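
A rough sketch of what that gating could look like, to make the idea concrete. Everything here is hypothetical: `local_score` stands in for whatever small on-device model you run, `cloud_detect` stands in for the remote API call, and the 0.5 threshold is an arbitrary placeholder.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Frame = TypeVar("Frame")
Result = TypeVar("Result")


def hybrid_detect(
    frames: Iterable[Frame],
    local_score: Callable[[Frame], float],
    cloud_detect: Callable[[Frame], Result],
    threshold: float = 0.5,
) -> List[Tuple[int, Result]]:
    """Score every frame with the cheap local model; escalate only likely hits."""
    results = []
    for index, frame in enumerate(frames):
        # Only frames the local model finds interesting cost a cloud call.
        if local_score(frame) >= threshold:
            results.append((index, cloud_detect(frame)))
    return results


# Stand-ins so the sketch runs without a camera, a model, or a network.
fake_frames = [0.1, 0.9, 0.4, 0.7]
hits = hybrid_detect(
    fake_frames,
    local_score=lambda frame: frame,
    cloud_detect=lambda frame: f"cloud-checked:{frame}",
)
print(hits)
```

For the intermittent-connection case, `cloud_detect` could instead append the frame to a local queue that gets uploaded whenever the device is next online.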


Good idea for a follow-up on how to use a hybrid approach.


The two pricing tiers for the hosted API don't seem practical for real usage.

$0 for 1,000 slow API calls

$79 for 10,000 fast API calls

To put that into perspective, 10k API calls is less than 10 minutes of 24 fps video. You should have a much higher plan or a pay-per-request overage price.
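
The arithmetic behind that, as a small sketch. It assumes one API call per frame, and the function name is made up for illustration.

```python
def minutes_of_video(api_calls: int, fps: float) -> float:
    """Minutes of footage covered when every frame costs one API call."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return api_calls / fps / 60.0


# 10,000 calls is the paid tier quoted above; 24 fps is the assumed frame rate.
print(round(minutes_of_video(10_000, 24), 1))  # 6.9
```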


To be fair - there's probably not a whole lot of real world use cases that aren't highly specialised where there's a requirement to run object detection/identification on every single frame of a 24fps video.

If you want to run hours or days worth of video through an object detector - you probably want to go out and buy a gpu and machine to stick it on of your own...

I'm curious what application you're thinking of where this counts as "real world usage"? (I can imagine applications like vision-controlled drones, but I'm pretty sure places like ATH Zurich have better solutions (as in "less generalised and more applicable to drone control") and in-house hardware to train and run it on.)


There are plenty of real applications for inferring every frame of video. Any real-time monitoring application would run all the frames. Even if you cut it to 1 FPS, with multiple video sources the monthly price doesn't make sense.

One application would be nudity detection for a family-friendly site, where lots of video would need to be checked.

The argument that you would want to run your own machine validates my point. However the same could have been said for video encoding or any other form of intense processing which all now have cloud alternatives.


OK. I don't think this is the solution you're after if your problem includes "crowd sourced video".

Nudity detection though - I'd probably at least try doing something like "Check every 50+rand(100) frames, and only examine more carefully if you get hits on that sampling". Sure - that's "game-able" - but subliminal nudity isn't something I'd expect trolls or griefers to expend too much effort to slide past your filters...
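
A minimal sketch of that sampling schedule, reading "50+rand(100)" as a gap of 50 plus a random integer from 0 to 99. The function name and parameters are made up for illustration, and the closer examination after a hit is left out.

```python
import random
from typing import Iterator, Optional


def sampled_frame_indices(
    total_frames: int,
    base_gap: int = 50,
    jitter: int = 100,
    rng: Optional[random.Random] = None,
) -> Iterator[int]:
    """Yield frame indices spaced base_gap + [0, jitter) frames apart."""
    rng = rng or random.Random()
    index = 0
    while index < total_frames:
        yield index
        # The random part makes the checked frames hard to predict.
        index += base_gap + rng.randrange(jitter)


# Seeded only so repeated runs of this example pick the same frames.
indices = list(sampled_frame_indices(10_000, rng=random.Random(0)))
print(len(indices), "of 10000 frames checked; first few:", indices[:5])
```

With these defaults the gap averages about 100 frames, so roughly 1% of frames get checked on the first pass.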


Hey, yes, there is a need to run detection at 20+ FPS, but you don’t need to run it “one frame at a time”; you're mostly processing the same information again and again redundantly.

I agree you need 24 FPS output, but you don’t need to process all 24 frames raw as images.


ETH Zürich :)

But yes, no disagreement here.


Hey our higher usage plans are:

$299 for 100k images

$499 for 1M images

We are adding plans for video, since 24 FPS doesn’t always make sense, especially from a compute and data perspective, because there is a huge amount of redundancy. You should see the price for video drop to 1/20 at the minimum.


Thanks, I didn't see those plans in the app. Video pricing is great, but may be hard to work with for real-time feeds. I would probably be reducing the framerate to 8 FPS and using those images for detection.


For comparison, a g2.2xlarge AWS instance is $468 per month. This is the bare minimum you’d need to run the kind of stuff you're talking about, and I’d still be skeptical about running it at 8 FPS unless it's properly quantised.


Like the direction you are headed. Considering that use of ASICs is going to rise, I think you should consider local installs through Docker (like machinebox.io) or another technique. Federated learning would also be the next thing to take on.


We do provide a Docker option as well. Federated learning looks like a good way to combine edge computing and deep learning to offer personalized models, as well as to use them to improve the general model.


The RPi is an asic based device, no? A web search yields no other applications of BCM2837B0.


No, it's still a general purpose ARM CPU. A good example of ASIC would be a bitcoin mining device.


Quoth WP: "Modern ASICs often include entire microprocessors, memory blocks including ROM, RAM, EEPROM, flash memory and other large building blocks. Such an ASIC is often termed a SoC (system-on-chip). "

The ARM core on the SoC is part of the asic. As is the VideoCore part.


I think what OP meant was ASICs specific to deep learning, like TPUs. However, as I see it, current frameworks are not mature enough to support GPUs and TPUs with the exact same code. Also there are no standards so every big org is going to build support for their own ASIC interfaces for the framework they manage. Is there an open source interface to ASICs for deep learning?


Yes. I was talking about AI Chips (FPGA and ASICs - NN Processors). https://github.com/basicmi/Deep-Learning-Processor-List https://www.nextbigfuture.com/2017/11/at-least-16-companies-....

Cambricon is going big in China, so it's not just Google and Apple. They claim to be 6 times faster than a GPU.

I am more interested in the potential of being able to run video processing and voice models effortlessly on tiny devices, and also to train models offline or locally.

I think there is a good scope of solutions (like vision recognition) that port well across AI chips.


>> Also there are no standards so every big org is going to build support for their own ASIC interfaces for the framework they manage.

This is true. Google has TF for tpu and Intel Nervana has neon. Each player will likely publish a software library.


I am very skeptical about pretraining, which seems to be the key point of Nanonets. Sure, it will work better than initialization from random weights, but you will always do better if you collect more data for your problem. This may be fine for problems which do not need optimal classification and fast performance, but I am struggling to see any use case for that.


There is a need for custom models in a lot of businesses, for example when you want to identify only a specific kind of product among similar-looking ones, or find only defective pieces, and where you cannot collect tens of thousands of images from the beginning. Also, pretraining is not only for initialization; it also improves generalization with less data.


Are we still worried about splitting the infinitive?


No. Most current grammarians and most style guides do not say that split infinitives are prohibited. Furthermore, in the past when some grammarians said they were not allowed, there were as many or more equally authoritative grammarians who said there was no such rule, and those who did support such a rule never were able to offer a good reason for it.

The only reason one may want to avoid them now is that there were still enough people who were taught from crappy elementary school textbooks that had this bogus rule, and they will think your grammar is bad if you use split infinitives (and they remember enough from elementary school to recognize them).

Just trust your ear. If splitting an infinitive makes a sentence sound clearer, do it.

If someone gives you crap, cite the Oxford Dictionary people [1].

PS: same goes for ending a sentence with a preposition. Sometimes it is clearer to do so. In that case, do it! You can cite Oxford for this, too [2].

[1] https://en.oxforddictionaries.com/grammar/split-infinitives

[2] https://en.oxforddictionaries.com/grammar/ending-sentences-w...


One less thing to worry about.

Haha, just kidding : one _fewer_ thing to worry about.


And the silly thing is that the rule was based on Latin, where the infinitive is marked by a (non-separable) suffix, as opposed to English where it's marked by a preposed preposition "to" (though that may be redundantly put).


"This is the kind of arrant pedantry up with which I will not put." ( -- Churchill, exact wording disputed...)


To boldly split where no man has split before!


to boldly split ;)



