
Cloud Video Intelligence API - hurrycane
https://cloud.google.com/blog/big-data/2017/03/announcing-google-cloud-video-intelligence-api-and-more-cloud-machine-learning-updates
======
tyre
I think their model should take a second pass on the words and probabilities,
independent of the video.

Look at their example:

    
    
      Animal: 97.76%
      Tiger: 90.11%
      Terrestrial animal: 68.17%
    

So we are 90% sure it is a tiger but only 68% sure it is a land animal? I
don't think that makes sense.

It could be that this is a weakness of seeding AI data with human inputs. I
can believe that 90% of people who saw the video would agree that it is a
tiger, while fewer would agree it is a terrestrial animal, because they don't
know what terrestrial means.
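
A minimal sketch of what such a second pass could look like, assuming a hand-written is-a hierarchy (the `PARENTS` table and both function names are hypothetical; nothing in the announcement says the API exposes or uses one). It raises each ancestor's score to at least the score of its most confident descendant:

```python
# Hypothetical is-a hierarchy for illustration; not part of the API.
PARENTS = {
    "Tiger": ["Terrestrial animal"],
    "Terrestrial animal": ["Animal"],
}

def ancestors(label, parents=PARENTS):
    """Return every ancestor of `label` in the (assumed acyclic) hierarchy."""
    seen = set()
    stack = list(parents.get(label, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen

def enforce_hierarchy(scores, parents=PARENTS):
    """Raise each ancestor's score to at least its most confident descendant's."""
    fixed = dict(scores)
    for label, score in scores.items():
        for anc in ancestors(label, parents):
            if anc in fixed and fixed[anc] < score:
                fixed[anc] = score
    return fixed

# Scores taken from the example above, as fractions.
scores = {"Animal": 0.9776, "Tiger": 0.9011, "Terrestrial animal": 0.6817}
adjusted = enforce_hierarchy(scores)
print(adjusted)
```

With the numbers from the example, only "Terrestrial animal" changes: it is lifted to the tiger's score, since anything that is a tiger is also a land animal under this assumed hierarchy.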

~~~
cs2818
It seems like an interesting next step after producing these output labels
might be to use something like ConceptNet [0] to evaluate the relationship
between the labels and somehow incorporate this as feedback.

[0] [http://conceptnet.io](http://conceptnet.io)

~~~
ragebol
Indeed. I wonder if the networks could learn a more and more general
description of the concepts in the hierarchy when going up that hierarchy.
E.g. there's a bunch of tiger species [0] each with a specific underlying
model, but some traits are common to all tiger species. And some traits are
common to, say, carnivores. Could you share parts of the model/network via
that hierarchy, e.g. so that a tiger inherits some parts of the model for
carnivores?

For example, can the 'stripe model' or 'fang model' be shared among tigers and
other carnivores?

I tried something similar for my thesis, but this was before the advent of
DNNs etc.

[0]
[http://conceptnet.io/c/en/tiger?rel=/r/IsA&limit=1000](http://conceptnet.io/c/en/tiger?rel=/r/IsA&limit=1000)

------
sna1l
I wonder if Snapchat is/will become a large user of this service? Depending on
the average response time of this API, Snapchat could get much better ad
targeting by analyzing their Stories content.

I imagine that they have something similar in house that they run since it is
pretty vital to their core business, but you never know.

~~~
ganfortran
This API is not cheap

~~~
Jach
How long until someone runs an attack like
[https://arxiv.org/abs/1609.02943](https://arxiv.org/abs/1609.02943) and
provides the model for free?

~~~
nl
This field moves quickly.

 _Embedding Watermarks into Deep Neural Networks_

[https://arxiv.org/abs/1701.04082](https://arxiv.org/abs/1701.04082)

------
wyc
I think the most commercially successful application of computer vision has
been quality-control devices (citation needed). Agriculture is very interested
in CV for a return-optimization technique known as precision farming.
Manufacturers pay for inspection of production throughout the pipeline. To
predict where a mass-market CV could be successful, I think we should look for
industries that have similar problems but cannot currently afford a bespoke
modeling solution.

~~~
visarga
CV will be the technology that kills what jobs we still have left: in
agriculture (picking fruit, weeding), in logistics (picking items from
shelves), in custodial work (cleaning bots), security, driving, reading MRIs
and x-rays - practically all the jobs that depended on vision and could only
be done by people in the last 60 years are going to be automated. When CV is
fully deployed, the world will be totally different.

I'm pretty excited about precision agriculture, but for plant life on earth it
will mean that nothing really grows outside the system anymore. Bots are
going to monitor all plant life. It might be plant-utopia for some species,
with timely water and nutrient dispensing, but for other species it might mean
being automatically killed by agrobots.

~~~
ComodoHacker
> When CV is fully deployed, the world will be totally different.

It's really scary to think about the long-term consequences of this. You know
what evolution does to features not required for survival anymore.

------
tambourine_man
It amazes me how smart these guys at Google are, and yet they couldn't design
a mobile site if their lives depended on it:

[http://imgur.com/bXGuNfL](http://imgur.com/bXGuNfL)

~~~
cobookman
If you could share with me what mobile phone / web browser you used that
produced the styling issue, I'll be sure to pass it on to the relevant people
within google so that it gets resolved.

Also if you send me an email at bookman@google.com I'll be sure to update you
as to when the styling errors are resolved.

(Disclaimer, I work for google cloud)

~~~
tambourine_man
iPhone SE, iOS 10.2.1

Also, see my previous comment on a similar issue with Google Cloud Calculator:

[https://news.ycombinator.com/item?id=13729466](https://news.ycombinator.com/item?id=13729466)

~~~
cobookman
A code change was just pushed out to fix this CSS styling bug, and is live in
production.

Sorry for the inconvenience, and please don't hesitate to forward any other
bugs you find to me at bookman@google.com

------
skewart
I'm curious about how much use these general-purpose computer vision APIs are
actually getting. How many companies out there really want to sift through a
lot of photos to find ones that contain "sailboat"? I'm inclined to think a
lot more companies would want to find "one of these five different specific
kinds of sailboats performing this action", which is definitely not among the
tens of thousands of predefined labels that Google, and Amazon, offer with
their general purpose models.

High-quality custom model training as a service seems much more compelling.

~~~
djloche
one immediate need is NSFW flagging, esp. things that might indicate abuse.

~~~
danso
This is a feature that Microsoft's computer vision API offers (in contrast to
AWS Rekognition and other services): [https://www.microsoft.com/cognitive-
services/en-us/computer-...](https://www.microsoft.com/cognitive-services/en-
us/computer-vision-api)

~~~
discordance
Content moderator is what you're looking for:
[https://www.microsoft.com/cognitive-services/en-
us/content-m...](https://www.microsoft.com/cognitive-services/en-us/content-
moderator)

------
timc3
I have been on the beta program for this and generally the results in our
testing have been very good. I particularly like how granular the data can
get.

~~~
komali2
What do you guys plan to do with it? Another poster mentioned how it seems
hard to imagine a business that has a model around "finding the sailboats in
this batch of pics."

~~~
ASpring
The business prop is easy and already hinted at by another user here.

People post millions of hours of themselves in natural settings on Snapchat.
If you can recognize their settings (objects, environments) and
cluster/categorize users then you can target advertising even more intrusively
than Google et al already do.

------
bitmapbrother
It was really entertaining listening to Fei-Fei Li talk about AI and ML at
Google Cloud. If you get the chance check it out on YouTube. I especially
liked how she referred to video as once being the "dark matter" of vision AI.

~~~
davidcgl
Direct link to her talk:
[https://youtu.be/j_K1YoMHpbk?t=2h38m40s](https://youtu.be/j_K1YoMHpbk?t=2h38m40s)

------
imh
The demo picture they chose is interesting. It's obviously a tiger, and is
identified as such with only 90% probability. I appreciate the difficulty of
the problem and how big of a success it is to achieve even that level of
confidence, but that low level of confidence really shows how far we are from
being able to simply trust computer vision. Still useful from an information
retrieval perspective, I expect.

~~~
muzakthings
You realize that softmax scores aren't probabilities, right?

It's just a relative measure of confidence, scaled such that they all sum to
1.0.
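
A small sketch of that scaling, using made-up raw scores. For contrast it also shows independent per-label sigmoid scores, which need not sum to 1.0. Which of the two (if either) the API uses is an assumption here; the announcement doesn't say.

```python
import math

def softmax(logits):
    """Scale raw scores so they are positive and sum to 1.0."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Independent per-label score in (0, 1); no constraint across labels."""
    return 1.0 / (1.0 + math.exp(-x))

logits = [3.0, 1.0, 0.2]  # made-up raw scores for three labels
soft = softmax(logits)
indep = [sigmoid(x) for x in logits]

print(sum(soft))   # sums to 1.0 up to floating-point rounding
print(sum(indep))  # can be well above 1.0
```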

~~~
chipperyman573
You can't add any of the numbers in the picture to equal 1.0 (or 100)

~~~
visarga
What you need to do is to take the top prediction and see how accurate it is
compared to a test set. The scores on the picture represent confidence not
accuracy.

------
aub3bhat
I think there is a need for a comprehensive system for image and video data
analytics, much like how we today have relational databases (PostgreSQL, MySQL)
and full text search engines (Lucene/Solr). The approach Google or Amazon have
been taking, which involves providing a "tagging" API, is frankly unimaginative.

I am working on Deep Video Analytics, an open source visual search and
analytics platform for images and videos. The goal of Deep Video Analytics is
to become a quickly customizable platform for developing visual & video
analytics applications, while benefiting from seamless integration with
state-of-the-art models released by the vision research community. It's
currently in very active development but still well tested and usable without
having to write any code.

[https://github.com/AKSHAYUBHAT/DeepVideoAnalytics](https://github.com/AKSHAYUBHAT/DeepVideoAnalytics)

[https://deepvideoanalytics.com](https://deepvideoanalytics.com)

~~~
timc3
I would be interested to know more about this, particularly the database and
what you plan to do with it in the future (I am thinking the license on the
GitHub project is obviously restrictive for a purpose at the moment).

~~~
aub3bhat
Sorry about the license, I am trying to reach a beta version within a month
along with a system-description paper that outlines the long term vision
behind building such a system. At that point I plan on relaxing the license.
There are certain constraints such as making sure that all underlying models
are correctly licensed. Also FAISS which I use is licensed by Facebook under
an explicit non-commercial license.

------
ar15saveslives
Correct me if I'm wrong, but this is just a frame-by-frame labeling. You can
download whatever pre-trained CNN, pass individual frames through it and get
the same result.
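
Roughly this, as a sketch. `classify_frame` is a hypothetical stand-in for whatever pre-trained CNN you download; here each "frame" is already a dict of scores so the example runs without a model or a video decoding library:

```python
from collections import defaultdict

def classify_frame(frame):
    """Hypothetical stand-in for a pre-trained CNN: returns {label: score}."""
    return frame

def label_video(frames, every_n=1, threshold=0.5):
    """Classify every n-th frame and keep the best score seen per label."""
    best = defaultdict(float)
    for i, frame in enumerate(frames):
        if i % every_n:
            continue  # skip frames that are not sampled
        for label, score in classify_frame(frame).items():
            if score >= threshold:
                best[label] = max(best[label], score)
    return dict(best)

# Made-up per-frame scores.
frames = [
    {"grass": 0.80},
    {"grass": 0.70, "tiger": 0.90},
    {"tiger": 0.95, "water": 0.30},
]
video_labels = label_video(frames)
print(video_labels)
```

Note that nothing here uses the order of the frames, so it can't pick up motion or anything else that depends on time.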

~~~
skewart
True. But then you have to deploy and maintain that CNN yourself.

The value prop is similar to, say, Twilio. Though, arguably, it's easier to
run your own pre-trained CNN than it is to replicate the telephony, VoIP, and
video conferencing stuff Twilio provides.

Also, presumably Google is hoping that they can continue to train and improve
their CNN so that it's always just a little better than the best free-to-
download ones.

~~~
ar15saveslives
Yes, but I mean if you analyze video as individual frames, you can't declare
"we do video analysis", because video is a completely different domain.

There're papers like [1] where CNN output is used as input for RNN, which
performs deeper context analysis. Results aren't exciting, though.

[1]
[http://cs231n.stanford.edu/reports2016/221_Report.pdf](http://cs231n.stanford.edu/reports2016/221_Report.pdf)

------
vaiski
There's an alternative out there from a company called Valossa. More
comprehensive than what Google is now offering. https://val.ai

~~~
mikeflynn
I've seen the Valossa offering and it is indeed impressive. Insane amounts of
visual data on videos.

------
frakkingcylons
As a Cloud Prediction API user, it makes me a bit uneasy to see it left out of
the image of their product suite. Is it effectively in maintenance mode now? I
feel like TensorFlow is overkill for what I need and my use case doesn't fit
into image/speech/video detection.

------
soared
Sounds similar to a company I worked with that took security camera footage
from restaurants and identified employee theft and process inefficiencies.

------
jimmcslim
I wonder if you could use this to upload recordings from your DVR and have it
determine the likely timecode of commercial breaks...

------
zitterbewegung
Not the first: [https://clarifai.com](https://clarifai.com) has a similar
service.

------
CRUDmeariver
Is there any storage-related cost (i.e. retrieval or egress cost) when you
call this on a file stored on Google Cloud Storage?

------
hartator
It's awesome, but I can't really see any application besides content filtering
and superficial content classification.

~~~
timc3
It can help more than superficially; it can become a starting point for humans
to go and refine classification into a controlled taxonomy.

The Google implementation can also detect when different people are speaking
which is useful - though it takes someone to tag who is who.

As I mentioned in another post, it can also highlight where things are
happening in a video - for instance a 2hr video where a scene suddenly
appears, as in nature video rushes where an animal appears for only a few
seconds.
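
A sketch of how per-frame labels could be turned into time ranges for that kind of highlighting. The function name, the input format (one set of labels per sampled frame) and the `max_gap` tolerance are all assumptions for illustration, not how the Google implementation works:

```python
def label_segments(frame_labels, label, fps=1.0, max_gap=0):
    """Return (start_sec, end_sec) spans where `label` appears.

    frame_labels: list of sets, one set of detected labels per sampled frame.
    max_gap: number of consecutive missing frames tolerated inside a span.
    """
    segments = []
    start = last = None
    for i, labels in enumerate(frame_labels):
        if label not in labels:
            continue
        if start is None:
            start = last = i          # first sighting opens a span
        elif i - last - 1 <= max_gap:
            last = i                  # close enough: extend the current span
        else:
            segments.append((start / fps, (last + 1) / fps))
            start = last = i          # gap too long: start a new span
    if start is not None:
        segments.append((start / fps, (last + 1) / fps))
    return segments

# Made-up detections at one sampled frame per second.
frames = [set(), {"tiger"}, {"tiger"}, set(), {"tiger"}, set(), set(), {"tiger"}]
spans = label_segments(frames, "tiger", fps=1.0, max_gap=1)
print(spans)  # [(1.0, 5.0), (7.0, 8.0)]
```

Tolerating a short gap keeps a single missed detection from splitting one appearance into two.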

Other types of video analysis can also detect problems in the video/audio,
such as dropped frames, noise or colour gamut issues.

------
joaoaccarvalho
When you use these Google APIs, can Google keep/ use your data in any way?

~~~
BrandonY
Here's a link to the Google Cloud Terms of Service:
[https://cloud.google.com/terms/](https://cloud.google.com/terms/) (you may be
interested in section 5.2: Use of Customer Data).

Disclaimer: I work for Google but am definitely not a lawyer and can't
authoritatively speak for Google here.

------
chimtim
what is the "video" bit here? This is just running image recognition on a
bunch of frames.

~~~
catshirt
how do you know the implementation details?

it would be completely naive to implement it that way, considering there is an
entirely new attribute video applies over images which of course is "time".

I don't know shit about ML- talking out of my ass here- but I'd be surprised
if the algorithms didn't account for changes over time or canonical entity
recognition (is this the same boat that was in the last image)?

~~~
chimtim
The linked press release shows an animal is detected -- tiger etc. It does not
say tiger running or hunting, which is where the time component would have
been used.

~~~
catshirt
the press release says:

> nouns such as “dog,” “flower” or “human” or verbs such as “ _run_ ,” “swim”
> or “fly”

that out of the way... i suspect you wouldn't need video to detect those
things...

and the screenshot you're referring to is a specific application of the
API... not a kitchen sink:

> It can even provide contextual understanding of when those entities appear;
> for example, searching for “Tiger” would find all precise shots containing
> tigers across a video collection in Google Cloud Storage.

------
kneel
Cronenberg inception porn is coming

