
Tensorflow on edge, or – Building a “smart” security camera with a Raspberry Pi - ajsharp
https://chollinger.com/blog/2019/12/tensorflow-on-edge-or-building-a-smart-security-camera-with-a-raspberry-pi/
======
brutus1213
Nice writeup, but the Raspberry Pi isn't running TensorFlow. The article
mentions that the author is sending images to an edge machine.

The big question I had was about hardware video encoding/decoding, which the
article doesn't really cover. I've found sending single image frames over
ZeroMQ to be fairly limiting if you care about high-frame-rate/low-latency
processing.

The key issue I have run into is that while many chips support hardware video
encoding/decoding, the APIs to interface with them either aren't there or
aren't open source. Anyone who has ideas on this, I'd welcome your comment.

As an aside, another option is to run Intel's Movidius USB stick (aka the
Neural Compute Stick), and then you get a smart camera on the Raspberry Pi
itself. That raises other issues, though.

~~~
snowzach
Shameless plug, check out DOODS:
[https://github.com/snowzach/doods](https://github.com/snowzach/doods) It's a
simple REST/gRPC API for doing object detection with TensorFlow or TensorFlow
Lite. It will run on a Raspberry Pi. It actually did support the EdgeTPU
hardware accelerator, which made the Pi pretty quick for certain models. They
broke something, so I need to fix EdgeTPU support, but it's still usable on
the Pi with the MobileNet models, or Inception if you're not in a hurry.
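
If you want to see what a call looks like, here's a rough sketch of hitting
the REST endpoint with a base64-encoded frame. Treat the port and field names
as illustrative (they're from memory); check the README for the exact schema:

```python
# Hypothetical request against a local DOODS instance: base64-encode a
# JPEG, ask for "person" detections above 50% confidence, print results.
import base64
import requests

with open("frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

response = requests.post(
    "http://localhost:8080/detect",
    json={
        "detector_name": "default",
        "data": image_b64,
        "detect": {"person": 50},  # label -> minimum confidence (%)
    },
)
response.raise_for_status()
for detection in response.json().get("detections", []):
    print(detection)
```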

~~~
agibsonccc
A few questions:

1. Did you build this for your own use cases? Interesting side project?

2. How do you feel about the need for base64 being a requirement on the
endpoints? Isn't gRPC the wrong medium for this? Also, what do you see as the
main limitations right now? The models?

~~~
snowzach
1. I built it to integrate with Home Assistant and security systems. I was
trying to use TensorFlow on a Raspberry Pi and the dependencies were a
nightmare. TensorFlow in general is a nightmare to compile and run, IMO. I got
to thinking: what if I could put all the deps inside a Docker container? What
if I could run it remotely? It was born out of that.

2. As for base64, I'm not sure of a better way to support sending raw image
data over JSON (in REST mode). In some ways I think gRPC is a better medium
than JSON (it supports either), as gRPC supports sending the raw bytes. What
leads you to believe gRPC isn't the right transport? Plus, you can do it in a
stream format if you want to do a lot of video.

The only limitation I can think of is that TensorFlow supports a myriad of
CPU optimizations, so providing a single container image that has all the
right options is basically impossible. I created one that has what I think are
some of the better options (AVX, SSE4.x) and then an image that should run on
basically any 64-bit Intel-compatible CPU. To get optimized options you need
to build the Docker container yourself, which can take the better part of a
day on slower CPUs.

With that said, I also provide ARM32 and ARM64 containers that actually run
semi-okay on Raspberry Pis and other ARM SBCs. I can run the Inception model
on a Pi 4 against a 1080p image in about 5 seconds, which is pretty good IMO.

------
theblackcat2004
Since you're already using OpenCV, you can write a neat motion detector and
only start sending frames for detection when motion is detected.
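
A minimal sketch of that idea with OpenCV's background subtraction (the
subtractor settings and the pixel threshold are arbitrary and would need
tuning for a real scene):

```python
# Motion-gated frame sending: only pass frames to the detector when
# enough foreground pixels change.
import cv2

cap = cv2.VideoCapture(0)  # camera index or RTSP URL
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
MOTION_PIXELS = 5000  # arbitrary threshold; tune for your scene

def send_for_detection(frame):
    # Placeholder: ship the frame to the object-detection service here.
    print("motion detected, sending frame")

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    # MOG2 marks shadows as 127; thresholding keeps only real foreground.
    _, fg = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
    if cv2.countNonZero(fg) > MOTION_PIXELS:
        send_for_detection(frame)
```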

~~~
otter-in-a-suit
Just saw this thread (I'm the author) - great idea, thank you!

------
9nGQluzmnq3M
Coral's Edge TPU products are built specifically for this kind of thing:
[https://coral.ai/](https://coral.ai/)

Hands-on video (4 min):
[https://www.youtube.com/watch?v=-RpNI4ZrfIM](https://www.youtube.com/watch?v=-RpNI4ZrfIM)

~~~
monkeydust
Interesting. I have RPis lying around; I might get the USB Accelerator.

------
anp
This reminds me of a hypothetical project I would take up if I still had a dog
and a small yard: building a poop cleanup map from CV processing of camera
footage.

Stepping stone toward a Poopba, obviously.

------
thebruce87m
Is it really edge computing if the Pi isn't running TensorFlow? I know the
definition is kind of woolly.

I wonder what the performance would be on a $100 Jetson Nano.

~~~
hn_check
The Jetson Nano is fantastic, and I use Tensorflow on it with two home
security cameras. It is a wonderful device, and deserves far more attention
than it gets.

I'm going to upgrade in the next week to a Jetson Xavier NX. Not because I
need to, but because I like playing around and it's a silly powerful device.

I also run a NextDNS CLI client on it, various automation stuff, etc.

------
acidburnNSA
This is extraordinarily neat.

Home Assistant does have a TensorFlow integration [1] that allows you to run
other Home Assistant automations (including various alerts, alarms, and scare
sequences) based on person detection with basically any camera (since Home
Assistant acts as a hub-and-spoke model for all other possible IoT devices).

[1] [https://www.home-assistant.io/integrations/tensorflow](https://www.home-assistant.io/integrations/tensorflow)

I struggled recently to get it running on my actual GPU, since I run Home
Assistant on a home server. I ended up making a custom component using PyTorch
instead, on Pop!_OS 20.04, and it works gloriously. CPU usage is way down and
the GPU has something to do now.
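
The core of such a component can be quite small; here's a sketch (not the
actual custom component: it assumes torchvision's pretrained Faster R-CNN, a
CUDA-capable GPU, and an arbitrary score threshold):

```python
# Sketch: GPU person detection with a pretrained torchvision model.
# COCO class 1 is "person".
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval().to(device)

def detect_persons(image_path, threshold=0.7):
    image = to_tensor(Image.open(image_path).convert("RGB")).to(device)
    with torch.no_grad():
        out = model([image])[0]
    return [
        (box.tolist(), score.item())
        for box, score, label in zip(out["boxes"], out["scores"], out["labels"])
        if label.item() == 1 and score.item() >= threshold
    ]

print(detect_persons("front_door.jpg"))  # hypothetical camera snapshot
```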

My super awesome self-hosted alarm system is now extra-super awesome.

Of course burglars are going to all just start wearing AI adversarial
t-shirts.

------
joshu
I've just started down this path, using a plain RTSP-serving camera and a
low-end box with the Coral EdgeTPU to process the frames. It looks like there
are a variety of solutions available:
[https://github.com/blakeblackshear/frigate](https://github.com/blakeblackshear/frigate)
[https://docs.ambianic.ai/users/configure/](https://docs.ambianic.ai/users/configure/)
etc

------
fareesh
I am trying to achieve something similar but at a higher scale.

I have about 48 different cameras where I want to count people and get their
approximate location in the frame.

I want to run an object detection model on all of those video streams
simultaneously.

My AWS instance maxes out after 7 simultaneous streams, so I figured I don't
really need real-time monitoring. One frame every couple of seconds, or even
every minute, could potentially suffice, since I am dealing with larger time
frames. Since I don't want to run too many instances at the same time, what
are some viable strategies to achieve this?

My plan is to have 5-6 instances of the ML model loaded up and waiting to
accept a frame. When one of them is ready, it will instruct one of the RTSP
streams to send it a frame, which it will process and store / send the result
to an application server. I feel like I may not even be able to consume that
many RTSP streams at once (I've never tried, so I don't know), so I may have
to have some other method of priming the handshake etc. before the model asks
for a frame to process.

Is there a better / non-hacky way of achieving this (i.e., managing the
workload on a single GPU instance)?

I don't have any control of the camera hardware at all.
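
For what it's worth, here's a rough sketch of that pull model: a small worker
pool, each worker opening one RTSP URL with OpenCV just long enough to grab a
single frame. Whether 48 cameras tolerate per-grab handshakes is exactly the
open question above; the URLs and the processing step are placeholders:

```python
# Sketch: on-demand single-frame grabs from many RTSP cameras with a
# small worker pool, instead of decoding all 48 streams continuously.
from concurrent.futures import ThreadPoolExecutor
import cv2

RTSP_URLS = [f"rtsp://camera-{i}/stream" for i in range(48)]  # placeholders

def grab_and_process(url):
    cap = cv2.VideoCapture(url)   # RTSP handshake happens here
    ok, frame = cap.read()        # pull exactly one frame
    cap.release()
    if not ok:
        return url, None
    # Placeholder: run the people-counting model on `frame` and send the
    # count/locations to the application server.
    return url, frame.shape

# 6 workers, roughly matching the number of model instances on the GPU.
with ThreadPoolExecutor(max_workers=6) as pool:
    for url, result in pool.map(grab_and_process, RTSP_URLS):
        print(url, result)
```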

~~~
shiftpgdn
48 RTSP streams is a lot of bandwidth to consume at once. Why not use an edge
PC or Jetson system to do it in small blocks? A new Jetson Xavier NX can do
8-12 streams depending on FPS and model.

------
milofeynman
For people running Blue Iris: you can do something similar with
DeepStack [0] and an exe [1] someone wrote that sends the images to
DeepStack.

Video guide: [https://youtu.be/fwoonl5JKgo](https://youtu.be/fwoonl5JKgo)
(Links in the comments of video as well)

[0] [https://deepstack.cc/](https://deepstack.cc/)

[1] [https://ipcamtalk.com/threads/tool-tutorial-free-ai-person-d...](https://ipcamtalk.com/threads/tool-tutorial-free-ai-person-detection-for-blue-iris.37330/) [https://github.com/gentlepumpkin/bi-aidetection](https://github.com/gentlepumpkin/bi-aidetection)

------
8fingerlouie
I did something similar, but because I had no requirement to play back audio
"real time", I opted for a simpler solution.

I run a simple video capture from a Raspberry Pi Zero W running Motion,
meaning all motion events are captured, including leaves blowing in the wind.
The captured files are stored on an NFS share per camera.

On the server I then monitor the parent directory for every camera for new
files, and run my object detection there, which in turn generates push
notifications with a screengrab if certain objects are detected. It also
stores a bounding-box-annotated version of the file. That's not really needed,
except for figuring out why you got an alert without any clear reason.
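
The server-side watcher can be quite small; here's a sketch using the watchdog
library, with the detection and notification steps left as placeholders:

```python
# Sketch: watch each camera's capture directory for new files and hand
# them to an object detector.
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

CAPTURE_ROOT = "/mnt/cameras"  # NFS share, one subdirectory per camera

def run_object_detection(path):
    return []  # placeholder for the real detector

def notify(path, detections):
    print(f"alert: {detections} in {path}")  # placeholder push notification

class NewCaptureHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        detections = run_object_detection(event.src_path)
        if detections:
            notify(event.src_path, detections)

observer = Observer()
observer.schedule(NewCaptureHandler(), CAPTURE_ROOT, recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```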

Doing it this way, however, allows me to save a bit on each camera and use
dedicated hardware for object detection on the server. I currently use an
Intel Neural Compute Stick 2
([https://software.intel.com/content/www/us/en/develop/hardwar...](https://software.intel.com/content/www/us/en/develop/hardware/neural-compute-stick.html)),
and while it is far from dedicated-GPU performance, it is equally far from
dedicated-GPU power consumption.

------
josteink
> We’ll use a Raspberry Pi 4 with the camera module to detect video. ... Now,
> here’s an issue for you: My old RasPi runs a 32bit version of Raspbian.

So why not just use the 64-bit Ubuntu RPi image instead then?

[https://ubuntu.com/download/raspberry-pi](https://ubuntu.com/download/raspberry-pi)

------
alkonaut
Would object detection like this work out of the box for deer, as he
demonstrates for humans? I need this for deer.

~~~
agibsonccc
Hi, could you describe your use case a bit? Just an alarm trigger for deer in
the backyard?

~~~
alkonaut
Yes, I'd point a camera at my precious vegetables and if a deer walks into the
video feed, something that scares it off is triggered so it runs off before
eating the whole garden.

------
canada_dry
I'm hoping advances like YOLOv5 [i] will allow an RPi 4 to do this more
capably without piping the video to another processor.

[i]
[https://github.com/ultralytics/yolov5](https://github.com/ultralytics/yolov5)
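
Inference with it is pleasantly short; here's a sketch using torch.hub (the
image path is a placeholder, and on an RPi 4 you'd likely want the smallest
model variant):

```python
# Sketch: YOLOv5 inference on a single image via torch.hub.
import torch

# Pulls the small pretrained model from the ultralytics/yolov5 repo.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

results = model("backyard.jpg")  # path, URL, PIL image, or numpy array
results.print()                  # summary of detected classes
print(results.xyxy[0])           # boxes as [x1, y1, x2, y2, conf, class]
```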

------
noja
Almost off topic: does anyone know of any long-life, battery-powered WiFi
cameras (with IR) for a project like this? Off the shelf, with a battery life
of months, and nice looking (like Arlo), but not cloud?

------
szczepano
esp32-cam microcontroller costs around $6 and you have face detection build
in. It have bluetooth and wifi and most of it's drivers code if not everything
is on github. Only problem is you need to program it using arduino or other
microcontroller hardware.

~~~
ed25519FUUU
Face detection is nice but body/person detection is much more useful in these
setups.

~~~
szczepano
You mean ESP-WHO? It uses MobileNetV2, so it's quite possible to train it to
detect people instead of faces. Haven't tried it myself; I've just started
playing with it.

------
xchip
The Raspberry Pi isn't running TensorFlow.

------
simlevesque
TensorFlow on the Pi itself is hard, but I get great results for a similar
system with just an RPi 4 and OpenCV.

------
b34r
What about package delivery people lol

------
hathym
It's Google Edge, by the way.

------
staycoolboy
I've done this with a Jetson Xavier, 4 CCTV cameras, and a PoE hub. You really
want to use DeepStream and C/C++ for inference, not Python and TensorFlow.

I'm streaming ~20 fps (17 to 30) 720p directly from my home IPv4 address, and
when a person is in frame long enough and caught by the tracker, a stream goes
to an AWS endpoint for storage.

I've experimented with both SSD-MobileNet and YOLOv3, which are both pretty
error-prone, but they do a much better job of filtering out moving tree limbs
and passing clouds, unlike Arlo.

You need way more processing power than an RPi to do this at 30 fps, and
C/C++, not Python. (There are literally dozens of projects for the RPi and
TensorFlow online, but they all get something like 0.1 fps or less by using
Flask and browser reloads of a PNG... great for a POC, but not for real
video.)

I wrote very little of the code, honestly: only the capture pipe required a
new C element. I started with NVIDIA DeepStream, which is phenomenally well
written, and its built-in accelerated RTSP element, and added a custom
GStreamer element that outputs a downsampled MPEG capture to the cloud when
the upstream detector tracks an object. NVIDIA also wrote the tracker; you
just need to provide an object detector like SSD-MobileNet or YOLO. NVIDIA
gets it.

The main 4-camera pipe mux splits into the AI engine and then into a tee, with
the RTSP server on one side and my capture element on the other.
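
As a rough illustration of that topology (not the actual pipeline: the element
names are DeepStream's, but the URL, config paths, and the fakesink stand-ins
for the RTSP-server and capture branches are placeholders), one camera's worth
wired up from Python might look like:

```python
# Illustrative one-camera DeepStream-style pipeline via GStreamer's
# Python bindings. A real app needs proper nvinfer/nvtracker configs.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

pipeline = Gst.parse_launch(
    "rtspsrc location=rtsp://camera-1/stream ! rtph264depay ! h264parse ! "
    "nvv4l2decoder ! mux.sink_0 "
    "nvstreammux name=mux batch-size=1 width=1280 height=720 ! "
    "nvinfer config-file-path=detector.txt ! "     # object detector
    "nvtracker ll-lib-file=libnvds_mot_klt.so ! "  # NVIDIA's tracker
    "tee name=t "
    "t. ! queue ! fakesink "  # stand-in: RTSP server branch
    "t. ! queue ! fakesink"   # stand-in: custom capture element
)

pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()
```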

It was amazingly simple, and if I turn the CCD cameras down to 720p with H.265
and a low bitrate, I don't need to turn on the noisy Xavier fan. The onboard
ARM core does the detection-triggered downsampling (one camera only, a
limitation right now) and pushes the video to a REST endpoint on a Node server
in the AWS cloud.

I'm very pleased with it. I haven't tested scaling, but if I turned off the
GPU governors I could easily go to 8 cameras. I went with PoE because WiFi
can't handle the demand.

~~~
scottlamb
> You need way more processing power than an RPi to do this at 30 fps, and
> C/C++, not Python. (There are literally dozens of projects for the RPi and
> TensorFlow online, but they all get something like 0.1 fps or less by using
> Flask and browser reloads of a PNG... great for a POC, but not for real
> video.)

I think 8 streams at 15 fps (aka 120 fps total) is possible with a ($35)
Raspberry Pi 4 + ($75) Coral USB Accelerator. I say "I think" because I
haven't tested on this exact setup yet. My MacBook Pro and Intel NUC are a lot
more pleasant to experiment on (much faster compilation times). A few notes:

* I'm currently just using the coral.ai prebuilt 300x300 MobileNet SSD v2 models (a usage sketch follows this list). I haven't done much testing but can see notable false negatives and positives. It'd be wonderful to put together some shared training data [1] to use for transfer learning; I think the results could then be much better. Anyone interested in starting something? I'd be happy to contribute!

* IIRC, I got the Coral USB Accelerator to do about 180 fps with this model. [edit: but don't trust my memory; it could have been as low as 100 fps.] It's easy enough to run the detection at a lower frame rate than the input as well: do the H.264 decoding on every frame but only run inference at fixed PTS intervals.

* You can also attach multiple Coral USB Accelerators to one system and make use of all of them.

* Decoding the 8 streams is likely possible on the Pi 4 depending on your resolution. I haven't messed with this yet, but I think it might even be possible in software, and the Pi has hardware H.264 decoding that I haven't tried to use yet.

* I use my cameras' 704x480 "sub" streams for motion detection and downsample that full image to the model's expected 300x300 input. Apparently some people do things like multiple inference against tiles of the image or running a second round of inference against a zoomed-in object detection region to improve confidence. That obviously increases the demand on both the CPU and TPU.

* The Orange Pi AI Stick Lite is crazy cheap ($20) and supposedly comparable to the Coral USB Accelerator in speed. At that price if it works buying one per camera doesn't sound too crazy. But I'm not sure if drivers/toolchain support are any good. I have a PLAI Plug (basically the same thing but sold by the manufacturer). The PyTorch-based image classification on a prebuilt model works fine. I don't have the software to build models or do object detection so it's basically useless right now. They want to charge an unknown price for the missing software, but I think Orange Pi's rebrand might include it with the device?
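
Here's the usage sketch promised above: running one of those prebuilt Coral
models with the pycoral library. The model filename is a placeholder, and the
call pattern follows my recollection of coral.ai's published detection
example, so double-check against their docs:

```python
# Sketch: object detection on the Coral USB Accelerator with pycoral,
# using a prebuilt 300x300 MobileNet SSD v2 model from coral.ai.
from PIL import Image
from pycoral.adapters import common, detect
from pycoral.utils.edgetpu import make_interpreter

interpreter = make_interpreter("ssd_mobilenet_v2_edgetpu.tflite")
interpreter.allocate_tensors()

image = Image.open("frame.jpg")
# Downsample the full frame to the model's expected 300x300 input.
_, scale = common.set_resized_input(
    interpreter, image.size, lambda size: image.resize(size, Image.ANTIALIAS))
interpreter.invoke()

for obj in detect.get_objects(interpreter, score_threshold=0.5, image_scale=scale):
    print(obj.id, obj.score, obj.bbox)
```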

[1] [https://groups.google.com/g/moonfire-nvr-users/c/ZD1uS7kL7tc...](https://groups.google.com/g/moonfire-nvr-users/c/ZD1uS7kL7tc/m/rreU83FBBAAJ)

~~~
serf
> * I use my cameras' 704x480 "sub" streams for motion detection and
> downsample...

I've encountered cheap IP cameras where the main high-res stream was actually
being offered with a time shift compared to the sub-stream.

Weird shit happens when you have a camera that does that and you then act on
data from the sub-stream to work with data on the main stream. I played with a
'Chinesium' CCTV camera with generic firmware that had such a bad offset that
I could actually use a static offset to remediate it.

I assumed it was just a firmware bug, since the offsets didn't seem to move
around as if it were decode/encode lag or anything of that sort.

~~~
scottlamb
Yeah, that sucks.

Did the camera send SEI Picture Timing messages? RTCP Sender Reports with NTP
timestamps? Either could potentially help matters if they're trustworthy.

I haven't encountered that exact problem (large fixed offset between the
streams), but I agree in general these cameras' time support is poor and
synchronizing streams (either between main/sub of a single camera or across
cameras) is a pain point. Here's what my software is doing today:

[https://github.com/scottlamb/moonfire-nvr/blob/master/design...](https://github.com/scottlamb/moonfire-nvr/blob/master/design/time.md)

Any of several changes to the camera would improve matters a lot:

* using temporal/spatial/quality SVC (Scalable Video Coding) so you can get everything you need from a single video stream

* exposing timestamps relative to the camera's uptime (CLOCK_MONOTONIC) somehow (not sure where you'd cram this into a RTSP session) along with some random boot id

* allow fetching both the main and sub video streams in a single RTSP session

* reliably slewing the clock like a "real" NTP client rather than stepping with SNTP

but I'm not exactly in a position to make suggestions that the camera
manufacturers jump to implement...

------
dheera
Can we all _please_ stop using the term "edge computing"? It's nothing but a
hype term, and in reality it's what we already had for decades before the
internet.

~~~
hikarudo
I disagree. The term "edge computing" actually adds precision to a description
of a distributed system. Nowadays, with a lot of machine learning inference
happening on the cloud, when you see the term "edge inference" you immediately
know you don't have to send heavy, bandwidth-clogging video streams to the
cloud.

Inference on the edge is a clear trend in computer vision applications, now
that each year there are better low-power neural network accelerators.

~~~
dheera
> Nowadays, with a lot of machine learning inference happening on the cloud

Right, and if it's not on the cloud, it runs locally, as everything did before
"cloud" became popular. We don't need to call it "edge" just to raise VC money
or put out some PR. We can just say it runs locally, on-device, etc.

If (big if) and when Adobe realizes that their Creative Cloud was a bad idea,
are they going to call the next product "Adobe Edge Edition! Wow, you can
actually run Photoshop on your own desktop!"?

~~~
gspr
> Right, and if it's not on the cloud, it runs locally, as everything did
> before "cloud" became popular. We don't need to call it "edge" just to raise
> VC money or put out some PR. We can just say it runs locally, on-device,
> etc.

To me, "edge" means more than just "not cloud". It's appropriately used when
making the point that computations happen where the data is gathered and the
output is required (which seems actually _not_ to be the case in TFA, but
still). It's when computations are not offloaded elsewhere at all, not just
"not to the cloud".

~~~
dheera
> making the point that computations happen where the data is gathered and the
> output is required

This is how literally everything was done before the internet. It shouldn't be
thought of as a new fancy concept.

