
Video Architecture Search - theafh
https://ai.googleblog.com/2019/10/video-architecture-search.html
======
justinsaccount
Neat. A while ago I was trying to figure out how to run TensorFlow on videos
from the camera I have in front of my house.

I had worked out how to pre-process the video using OpenCV to mask out
whatever was moving and export that to static images, and then trained TF on
those images.
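
The masking step described above might look roughly like the sketch below. It uses plain frame differencing on grayscale frames stored as nested lists so that it runs without any dependencies; this is a stand-in technique, not the commenter's actual pipeline. The names `motion_mask`, `bounding_box`, and `crop`, and the threshold of 25, are illustrative assumptions. With OpenCV one would more likely use `cv2.absdiff` or one of its background subtractors.

```python
def motion_mask(prev_frame, curr_frame, threshold=25):
    """Return a 0/1 mask of pixels whose brightness changed by more than threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


def bounding_box(mask):
    """Return (top, left, bottom, right) of the changed region, or None if nothing moved."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)


def crop(frame, box):
    """Cut the (inclusive) bounding box out of a frame."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in frame[top:bottom + 1]]


# Two tiny 4x5 grayscale frames: a bright 2x2 "object" appears in the second.
background = [[10] * 5 for _ in range(4)]
with_object = [row[:] for row in background]
for r in (1, 2):
    for c in (2, 3):
        with_object[r][c] = 200

mask = motion_mask(background, with_object)
box = bounding_box(mask)          # region that changed between the frames
patch = crop(with_object, box)    # the static image that would be exported
```

Each exported `patch` would then be one training image for the classifier.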

The resulting system worked well enough to say that a USPS truck was outside,
but couldn't tell the difference between the truck driving by and stopping to
deliver something to the mailbox.

I had an idea to use OpenCV to track the objects, then export a graph of the
X,Y coordinates of the moving object over time, and then train TF on the
graphs, but I never got around to testing it.
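
To make the trajectory idea concrete, here is a minimal sketch that assumes the tracker already yields `(t, x, y)` samples for one object. It swaps the trained model for a hand-written heuristic (how long the object stays nearly still), so it only illustrates why the coordinates over time carry the drive-by vs. stop signal. The function names and the `max_speed` and `min_stop_seconds` thresholds are hypothetical.

```python
import math


def speeds(track):
    """Per-step speed (pixels per second) from a list of (t, x, y) samples."""
    out = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        out.append(math.hypot(x1 - x0, y1 - y0) / (t1 - t0))
    return out


def longest_stop(track, max_speed=2.0):
    """Longest continuous time (seconds) spent moving slower than max_speed."""
    longest = current = 0.0
    for (t0, _, _), (t1, _, _), v in zip(track, track[1:], speeds(track)):
        # Extend the current slow stretch, or reset it once the object moves again.
        current = current + (t1 - t0) if v < max_speed else 0.0
        longest = max(longest, current)
    return longest


def label(track, min_stop_seconds=5.0):
    """Classify a track by whether it contains a long enough stop."""
    return "stopped" if longest_stop(track) >= min_stop_seconds else "drove by"


# Synthetic tracks sampled once per second, moving along x at 20 px/s.
drive_by = [(t, 20.0 * t, 100.0) for t in range(10)]
# Same approach, but the object sits at x=100 from t=5 to t=15 before leaving.
delivery = (
    [(t, 20.0 * t, 100.0) for t in range(6)]
    + [(t, 100.0, 100.0) for t in range(6, 16)]
    + [(t, 100.0 + 20.0 * (t - 15), 100.0) for t in range(16, 21)]
)

print(label(drive_by), label(delivery))
```

Features like `longest_stop` or the raw speed sequence could also be fed to a small classifier instead of a fixed threshold.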

I wonder if this project would do a better job, or still have issues because
all the videos are almost identical.

~~~
MasterScrat
One way to solve this problem is to give your neural network a stack of
successive frames, instead of giving it frames one by one. If the framerate
is too high, you may need to skip some of the frames to get an idea of
speed/acceleration.

See e.g. the concepts of "frame stack" and "frame skip" in reinforcement
learning:
[https://danieltakeshi.github.io/2016/11/25/frame-skipping-an...](https://danieltakeshi.github.io/2016/11/25/frame-skipping-and-preprocessing-for-deep-q-networks-on-atari-2600-games/)
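
A minimal sketch of the two ideas combined, using only the standard library. The class name `FrameStacker` and its `stack_size` and `skip` parameters are illustrative assumptions, and "frame skip" here means plain subsampling of the input rather than the action-repeat variant used in some RL setups. Frames are stood in for by integers so the behaviour is easy to follow.

```python
from collections import deque


class FrameStacker:
    """Keep every `skip`-th frame and expose the most recent `stack_size` of them."""

    def __init__(self, stack_size=4, skip=4):
        self.stack_size = stack_size
        self.skip = skip
        self.frames = deque(maxlen=stack_size)  # oldest frame drops out automatically
        self.seen = 0

    def push(self, frame):
        """Feed one raw frame; return a full stack (oldest first) or None."""
        keep = self.seen % self.skip == 0
        self.seen += 1
        if not keep:
            return None  # skipped frame
        self.frames.append(frame)
        if len(self.frames) < self.stack_size:
            return None  # not enough history yet
        return list(self.frames)


# Ten raw frames, keeping every 2nd one and stacking 3 at a time.
stacker = FrameStacker(stack_size=3, skip=2)
stacks = [s for s in (stacker.push(i) for i in range(10)) if s is not None]
print(stacks)
```

Each returned stack would be the network's input for one prediction, so the model sees how the scene changes across the kept frames rather than a single still.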

------
skirmish
For detecting objects in video, take a look at:
[https://github.com/tensorflow/models/tree/master/research/ls...](https://github.com/tensorflow/models/tree/master/research/lstm_object_detection)

It uses the SSD object detector
([https://arxiv.org/abs/1512.02325](https://arxiv.org/abs/1512.02325) ,
[https://lambdalabs.com/blog/how-to-implement-ssd-object-dete...](https://lambdalabs.com/blog/how-to-implement-ssd-object-detection-in-tensorflow/))
on each frame, then runs an LSTM on top of it to collect information across
video frames.

The trained model can be run on mobile devices via TFLite.

------
ArtWomb
I'll also link another Google paper (sorry, no open access). It's only
tangentially related: meta-learning of knowledge graphs for recommending the
next video to watch on YouTube. It will possibly be extended to robots
learning by watching humans perform complex tasks ;)

Recommending what video to watch next: a multitask ranking system

[https://dl.acm.org/citation.cfm?id=3346997](https://dl.acm.org/citation.cfm?id=3346997)

------
ironfootnz
The interesting thing about TinyVideoNets is that they classify at the end,
after identification on each layer. Depending on the model's parameters, is
it much more efficient to classify on each layer? Does it give better sorting
of the results? I wonder why they didn't explain that in the paper.

------
ilaksh
The first performance above 34% on Moments-in-Time.

That MiT test looks like it begins to approximate general intelligence. Can
they get the other 65% (or whatever it actually is) with the existing
paradigm?

~~~
Mathnerd314
I looked for a bit at the MiT dataset paper
([https://arxiv.org/pdf/1801.03150.pdf](https://arxiv.org/pdf/1801.03150.pdf))
and I'm honestly not sure what human / general-intelligence performance would
be. There's only a single ground-truth category per video, but the categories
overlap somewhat and multiple categories can apply. And for some videos only
75-85% of humans agree with the ground-truth category. The dataset was not
constructed so that humans would reach 100%.

I guess to evaluate accuracy fairly you'd have to run another Amazon
Mechanical Turk project, where you run a classifier over the data, take the
top 5, replace the bottom one with the ground truth if it's not there, and
then quiz the workers as to the best category. But it's a million videos.
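
The candidate-building step of that proposal could be sketched as follows. The function name `candidate_labels`, the score dictionary, and the category names are made up for illustration and are not taken from the dataset or the paper.

```python
def candidate_labels(scores, ground_truth, k=5):
    """Top-k predicted labels, with the lowest-ranked one swapped for the
    ground truth when the ground truth is not already among them."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if ground_truth not in ranked:
        ranked[-1] = ground_truth
    return ranked


# Hypothetical classifier scores for one video.
scores = {
    "running": 0.40,
    "jogging": 0.25,
    "walking": 0.15,
    "hiking": 0.08,
    "racing": 0.06,
    "exercising": 0.04,
    "stretching": 0.02,
}

print(candidate_labels(scores, "exercising"))
print(candidate_labels(scores, "jogging"))
```

In practice the five options would probably be shuffled before being shown to workers, so that position does not hint at which one came from the classifier's ranking.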

