Hacker News
Google Cloud Vision API enters Beta (googlecloudplatform.blogspot.com)
350 points by axelfontaine on Feb 18, 2016 | 105 comments

Disclosure: I am an evangelist for the Watson Developer Cloud suite of services at IBM.

The new wave of vision services is amazing. There are a lot of players in this field, including IBM Watson, which has a suite of vision APIs available with similar features.

One key differentiator of the Watson offering is that we have a trainable API called Visual Recognition [2]. The pre-trained APIs are excellent and have broad uses, but it's amazing to see the results from even basic training to identify image tags directly relevant to your use case. There is a demo [3] that allows you to try it out by creating a new classifier right in the web page.

You can find some demos at:

http://vision.alchemy.ai/#demo - example images that demonstrate facial detection and identification, label extraction, object identification, and so on.

Another demo at http://visual-insights-demo.mybluemix.net/ uses the Visual Insights [1] API to identify a set of relevant tags.

[1]: https://www.ibm.com/smarterplanet/us/en/ibmwatson/developerc...

[2]: https://www.ibm.com/smarterplanet/us/en/ibmwatson/developerc...

[3]: https://visual-recognition-demo.mybluemix.net/

Do you have pricing information available anywhere? All I can see is [1] but that's not really interesting. Compare with [2] which makes it quite easy to guess how much I'd pay.

[1] http://www.alchemyapi.com/products/contact-sales

[2] https://cloud.google.com/vision/pricing

Hi Teh, I'm an IBM Watson Developer Evangelist. The pricing is available here when you click on Standard plan: https://console.ng.bluemix.net/catalog/alchemy_api/

You make it too inconvenient to get started compared to the others. It's about convenience and simplicity for onboarding people.

Hi Philip, can you drop me a mail at yrezgui [at] uk.ibm.com, I would be happy to have a chat with you :)

We looked at Watson a few months back and it didn't seem as good as some of the other image tagging APIs. Maybe we should look into this again.

Once you're on to training and not using off-the-shelf models, there are several good open source packages out there, from digits to deepdetect. It is not that difficult anymore to get at least a first model with decent accuracy.

It's a big challenge creating a good training set for start-ups or companies without expertise.

I've been using this since the private beta to enrich my eCommerce crawlers with product identifiers not found in the content of an eCommerce product page, but found in the product image itself. Imagine a part number or UPC displayed on a product box, but nowhere in the HTML content of the product page. Using the Google CV OCR feature I can extract meaningful product data from an image to complement my existing crawl data. It works great.

Out of curiosity, why are there so many people crawling sites for prices like what you're talking about?

All of the freelance sites are... well, crawling with jobs involving web scrapers. What's the business here that I'm missing?

I'm not crawling for price. Knowing where our products are being sold online is important to understand the landscape for a large manufacturer. Understanding opportunities to develop sales relationships we may or may not have directly, and knowing category expansion opportunity is also important.

Economically speaking, it's taking advantage of imperfect markets, where even in this age of internet search information still doesn't flow perfectly from seller to buyer. Throw in affiliate and referral incentives, and you've created an information market.

If you can manage to slide -- even just a tiny bit -- between consumers and Ecom sites you can make a lot of money from referrals!

This is what slickdeals and camelcamelcamel do, though only one of those uses scrapers.

Affiliate marketing, price arbitrage, integrating with things that don't have APIs or have very poor APIs.

What's the hit rate for extracting meaningful information? Any false positives?

Since I compare the extracted OCR data against known part numbers, catalog numbers, UPCs, and SKUs, the outcome is boolean: a string either matches or it doesn't. For instance, if I OCR a picture of a tape roll and it has a part number printed on the seam, I compare that to a trusted data source of part numbers.
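
A minimal sketch of that kind of check, not the commenter's actual code. It assumes the OCR result arrives as one plain-text string and that the trusted identifiers fit in an in-memory set. The function names, the normalization rule, and the sample catalog are all hypothetical.

```python
import re


def normalize(token):
    """Uppercase a token and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", token.upper())


def find_known_identifiers(ocr_text, known_ids):
    """Return the trusted identifiers that appear as tokens in the OCR text."""
    # Map the normalized form back to the identifier as stored in the catalog.
    normalized_known = {normalize(k): k for k in known_ids}
    hits = set()
    for token in ocr_text.split():
        key = normalize(token)
        if key and key in normalized_known:
            hits.add(normalized_known[key])
    return hits


# Hypothetical catalog and OCR output, for illustration only.
KNOWN_PART_NUMBERS = {"TAPE-2090", "0001234567"}
ocr_text = "Masking tape tape-2090 48mm x 50m"
print(find_known_identifiers(ocr_text, KNOWN_PART_NUMBERS))
```

Because only exact matches against the trusted list count, a misread character shows up as a miss rather than as a wrong identifier.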

I have been struggling with getting the new "service keys" working with this API. This is the first place that Google is mandating its new security infrastructure.

Do you have any sample code that works?

I feel like this is some really compelling tech. It would be so amazing to build stuff with this in mind. I wouldn't be comfortable doing it, though. This sort of API is available only until Google decide that they don't want it to be available. There's not really anything close to equivalent that you could drop in to replace it if it were being shut down, the price were being hiked, or you had some sort of other issue with it.

I'm not trying to pick on Google for shutting things down; I would feel similarly if this API were from Microsoft or Facebook. It's not the first time there's been an API that I think is really cool, but was very apprehensive about actually using for anything serious.

You are creating a double bind for yourself and others by posting these types of comments. Your comment indicates you want the experience making things with it which would be amazing, but you are also speculating (based on historical evidence) that Google may "take away" the amazing thing from you later.

> I'm not trying to pick on Google for shutting things down

People usually say what they are thinking. In this case, I can certainly appreciate and respect questioning what Google's actions will be in the future, but the problem at hand is that none of us can tell the future. By attempting to do so, and creating a situation where we literally believe two conflicting things at once, we get mired down in illogical arguments that end up making zero sense. Worse, when we post those illogical arguments, others get dragged into the dissonance and end up making similar arguments that make no sense and the result of that is others get lost in the mess we made. For example:

> I think Google understands how big this could be (well, why else do they do anything I suppose?)

Do we really know Google understands how big this could be, or is that just us wishing they wouldn't "shut it down later"? Both thoughts are speculative, at best.

As for me, I have no way to know if I can trust Google will leave these APIs up for as long as I need them or will not change the methods in the APIs at some point breaking my code I've written to talk to it. The latter happens to me all the time!

I realize this comment may stir some emotional responses. All I would ask is that we consider alternative ways of thinking about these illogically binding "feelings" and expand our awareness to the fact that what we really want is our software, regardless of who wrote it, to be open, transparent, trustworthy and capable of running wherever and whenever we want it, regardless of when that is.

Obviously we have a long ways to go before that statement can be a reality. One can hope though!

TensorFlow + TensorFlow Serving + Google ReCeption model plus optionally an SVM on ReCeption features for your custom detection. All that code and the pretrained model is Open Source. There's some engineering to glue it together and some extra work for the easier, non-image classification parts.

There is also http://www.deepdetect.com

+this. I've implemented a subset of this kind of pipeline before on AWS (image tagging + face identification) using the building blocks that existed last year (it was AlexNet at the time, with a pre-release version of MXNet, because Google hadn't released the trained inception model). Implementing this basic functionality at a basic working level, given the tools Google has released, isn't impossible.

Now, making it production-quality, efficient, scalable, and the rest -- well, y'know. That's why people use cloud-based services in the first place.

But I think there's less fundamental lock-in than you think. Cloudinary, for example, will let you upload an image and get a tag out. ABBYY and OmniPage/Nuance and others offer cloud-based OCR.

I'm biased - I'm at Google this year - so take this with a grain of salt, but while I have the feeling that Google can do it better and more affordably than a small team could do it on their own, I don't think that Google pulling the API would leave people up a creek without a paddle.

> the pretrained model is Open Source.

Google's face/landmark/label/text/logo detection models are open source? Or there exist open source pretrained models?

The quality and size of the training set is (at least) as important as the machine learning tools. I imagine Google has access to a pretty big data set, along with the computing resources to process it.

> Google's face/landmark/label/text/logo detection models are open source? Or there exist open source pretrained models?

Google's Inception v3 pre-trained image recognition model is open source: https://www.tensorflow.org/versions/r0.7/tutorials/image_rec...

That's the hard part because, as you note, this is computationally intensive (the training data is actually open source as the ImageNet dataset).

There is existing code for the other parts that performs pretty adequately (with the possible exception of landmark detection).


Face detection: http://docs.opencv.org/master/d7/d8b/tutorial_py_face_detect...

Logo Detection: http://www.pyimagesearch.com/2015/01/26/multi-scale-template...

Can you provide some links?

TensorFlow: https://www.tensorflow.org/

TensorFlow Serving: https://github.com/tensorflow/serving

ReCeption (actually they call it Inception v3. Not sure where I got the ReCeption name - though I'm sure I read it somewhere?): https://www.tensorflow.org/versions/r0.7/tutorials/image_rec...

Using an SVM on neural network extracted features: http://blog.christianperone.com/2015/08/convolutional-neural...

If you want a quick and dirty version here's some Python to create a web service that calls a Caffe based Image recognizer: https://gist.github.com/nlothian/c3519adb81b3452c1938


The question is, is this your core business? If you want to roll your own ML / CV API and your investors / customers will value you based on it, great. If it's not your core business, then SaaS / API interfaces save you time / money / ability to get into the market. Composability is what we do nowadays and you're always going to be exposed to risk. Recognize that risk and move on or go out of business trying to do it all yourself.

If I were you I'd build an MVP with this API (or IBM Watson's API) and see how it goes. If your product/service starts to take off, you could start looking into implementing your own machine learning / computer vision algorithm, hoping one day it gets good enough to replace whatever API you bootstrapped the product/service with.

IBM's Watson service, provided via Bluemix, does this:


If you want to be more comfortable with it, one really cool API you should check out is Clarifai (someone else already mentioned it in this thread, but I'll bring it up again anyway). They're really highly regarded in terms of classification, and the API is pretty simple to use, too. Like if you just make a quick cURL request:

  curl -H "Authorization: Bearer {access_token}" --data-urlencode "url=YOUR_IMAGE_URL.jpg"  "https://api.clarifai.com/v1/tag/"
...you get the top 20 tags for that image and their confidence levels, too. Their homepage at Clarifai.com has a pretty good demo that lets you see it in a more visual way.

There actually are other APIs, though with a smaller scope. For text extraction for example there is the OnDemand Api, https://dev.havenondemand.com/apis/ocrdocument#overview, backed by HP. They also have logo detection. I'd be surprised if no replacement for the category detection exists.

Though I admit I also hesitate to replace that API with the google offering for the one app where I actually use it. The results would probably be better, but I just got burned again from stuff shutting down, and I remember Google Reader.

But the Vision Api looks cool.

There's also this API by Microsoft Research. https://www.projectoxford.ai/

I will throw ours into the ring as well.

If you need to understand emotional reaction on video sources, our API can fill in the gaps not currently filled by Google's Cloud Vision API: https://www.kairos.com/emotion-analysis-api

Disclosure: I'm CTO of Kairos.com

I feel like there's a lot of scope for big companies to make commitments to do the right thing by developers building on top of their services. Stripe, for example, have a data portability clause where (subject to some conditions) they'll move customer data to other payment processors at your request. Funnily enough, that's the type of commitment that will make me never want to leave Stripe.

Clarifai are quite a big name in the classification field and that's pretty much all their API does.

I understand your apprehension because I feel the same way, to some degree. But I'm excited about the new types of tech that I can already imagine using this API. There are so many fascinating ideas that come to mind, and I'm usually not that creative of a person.

I think Google understands how big this could be (well, why else do they do anything I suppose?). I guess you'd just have to put faith in them. I imagine those that do will build some amazing projects from this.

You're being paranoid for several reasons. First, you will presumably be keeping the classifications you get from the API (if for no other reason than $2/1000 can really add up and why would you do that to yourself?). Second, Google has historically always had very generous end-of-life announcements. Third, there are already a number of competitors, including the trained CNNs Google has already released. Fourth, even if there were no competitors nor pre-trained models, deep learning is increasingly accessible and you could learn to imitate a past ImageNet-winning CNN in Torch/Theano/TensorFlow/etc and train it within a few weeks. Fifth, paid Google services tend to hang around longer. Sixth, machine learning is a major part of Google and they are increasingly rolling it out to their services and may well be using this API themselves, making closing it not such a great idea.

So you're passing up what could be substantial benefits, Google isn't going to close it anytime soon, if they do you will have months of warning, and can easily replace it with a competitor or your own.

That makes sense. Cheers!

There's a nice opinion on this from CloudTP:


TL;DR: You don't get the girl if you don't ask for a dance :)

Well, we've actually been using AlchemyAPI for this exact function for some time now. They were recently acquired by the IBM Watson team, so it's kind of a pain to get set up through them now (you have to go in through the Bluemix control panel to create an account and set up billing). They're probably not as accurate as Google, which is why I'm going to be spending tomorrow implementing this and comparing the results, but they are pretty darn good.


What about Clarifai.com ?

Here is an analysis of Google projects that have been closed later: http://www.gwern.net/Google%20shutdowns

I have all kinds of product ideas from this API, but have the same fears as you.. can't rely on an API.

So for now, my best idea is to use it to build something fun with my kids.

If only Google would adopt some sort of policy where they would let you download the dataset+code when they shut something off...

Yeah, if they could make some sort of commitment to open it up in the event that they no longer want to run it, problem solved. (mostly, at least; pricing adjustments could still be disruptive for certain applications)

Want to share some of those ideas? :)

I like how the announcement post already prepares for the end of "our incredible journey" by saying "Google Cloud Vision API is our first step on the journey".

Yes, a journey has an end, but hopefully it leaves us at a desirable destination.

I was looking into label detection APIs (and Google's offerings as well) for a silly game/website I was thinking of writing, but $5 per 1000 images is way too steep, especially if each user is submitting 1-5 images per interaction with the website. The $2 per 1000 images price they mention on the blog post is only if you're doing 5+ million images a month.

I played with IBM Watson visual recognition API and it didn't look like it did what I needed it to (recognize a hand drawn image of a cat for example -- it just kept labeling it only as a 'cartoon').

Bummer. At least the first 1000 images are free so I can prototype it out of curiosity.

I'm almost 100% sure that the Cloud Vision API will also return "cartoon" or "artwork" or something similar. This is because of how the models are trained.

What you need is a custom model that is trained specifically on hand drawn pictures. You can use something like TensorFlow to build this, but getting the raw data might be challenging.

Disclosure: I work for Google Cloud

Agreed, this is an exciting api but the pricing puts it out of reach for a lot of applications

I'd like to preface this with the fact that I have next to zero experience with pricing/charging for such services, so what I say might sound naïve, but…

Is it really that expensive, though? I mean I can see it being expensive if you use it inside a product you offer for free, but if it's a commercial product, I imagine the pricing isn't that much especially if you consider that using your own infrastructure for this type of machine learning and image scanning would be much much more expensive.

Once the networks are trained it's probably very cheap to run the service.

But I guess that's my point, the pricing puts it out of reach of free or cheaper subscription products. Even the robot example they give in the video to achieve that functionality it would have to be taking a picture every couple of seconds or so and uploading them to the cloud - at that rate it would cost almost $5 for 15mins of use!
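
A quick back-of-the-envelope check of that figure. It assumes the flat $5 per 1000 images rate quoted upthread and ignores the free first 1000 images and any volume tiers. The function name and the capture intervals are illustrative, not taken from the pricing page.

```python
def streaming_cost(duration_s, interval_s, price_per_1000=5.0):
    """Cost in dollars of sending one image every interval_s seconds."""
    images = int(duration_s // interval_s)  # whole images sent in the window
    return images * price_per_1000 / 1000


fifteen_minutes = 15 * 60  # 900 seconds

# One image per second: 900 images.
print(streaming_cost(fifteen_minutes, 1))  # 4.5

# One image every two seconds: 450 images.
print(streaming_cost(fifteen_minutes, 2))  # 2.25
```

Under these assumptions the "almost $5" figure corresponds to roughly one image per second. At one image every two seconds the same 15 minutes would come to about $2.25.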

I once wrote a document detection library. It was supposed to recognize corners of a printout in a photo. Is that close?

EDIT: you could send me example images and what you need from them. I could check how much I would need to extend it to handle your case.

Detecting boundaries and straight lines is something that has been "easily" done for a while now. Categorizing complex images (such as differentiating between a cat and a dog) is still extremely difficult for computers. I'm not really surprised that Watson couldn't do it, I've only recently started hearing about preliminary breakthroughs in the feasibility of this sort of tech.

I couldn't find a specific legal SLA for this new service. Does anyone know if:

1) By using the service, do you grant Google use of the uploaded images? (e.g. can they use your image to increase their corpus, improve the service, use it for advertising, or use it to extract street numbers for their maps, or is it always private and never stored?)

2) What the resulting copyright is of the returned data. If you were to build a database based on the results, what license or copyright status this would be. Would all rights belong to me, or would Google claim rights over the results.

If there's no service specific T&Cs for this, it falls under their general cloud T&Cs: https://cloud.google.com/terms/. s.5:

> 5.1 Intellectual Property Rights. Except as expressly set forth in this Agreement, this Agreement does not grant either party any rights, implied or otherwise, to the other’s content or any of the other’s intellectual property. As between the parties, Customer owns all Intellectual Property Rights in Customer Data and the Application or Project (if applicable), and Google owns all Intellectual Property Rights in the Services and Software.

> 5.2 Use of Customer Data. Google will not access or use Customer Data, except as necessary to provide the Services to Customer.

Thanks! I also note that the API pages each state "This is a Beta release of Google Cloud Vision. This API is not covered by any SLA"

SLA is not the same as T&C. No SLA means there is no uptime guarantee.

Re (2): The answer to that sort of question doesn't even seem clear for other, more widely used Google products. I'm thinking Google Translate or the fonts in their office suite. Both things that most people probably don't think too much about the copyright implications of, but possibly should.

The copyright status of the vast majority of Google fonts is clear: they're open-source. https://www.google.com/fonts

Regarding translations: https://meta.wikimedia.org/wiki/Wikilegal/Copyright_for_Goog...

If the OCR is good then they're totally burying the lede: its pricing is 100x cheaper than commercial OCR APIs.

It's potentially a game changer, plenty of industries have piles of scanned documents. Cheap OCR means this data suddenly becomes accessible even if the value per individual document is low (i.e. for input into machine learning).

Paper from a few years ago comparing Google's OCR system to commercially available benchmarks:


A lot better for text in photographs. Comparison might be different on dense document text though.

I don't know for certain, but I suspect that Google utilized images from the web in training this system. Even if they didn't, suppose they had. I think this can raise an interesting question around copyright.

In training an AI system with hundreds/thousands of bits of data, no single piece of training data makes much of a difference. If one of my images on the web that I had captioned with the keyword 'dog' was used to train this system about what a dog looks like, is the model they end up with a derivative work of my captioned image? Yes, but my data would make up an infinitesimally small part of that model. Yet, in aggregate, the trained model might almost wholly rely on lots of copyrighted, rights-reserved images.

Would the resulting model be a copyright infringement? It would seem as though no rights owner would have a substantial enough claim. Yet, without all of the copyrighted works, perhaps the model would be ineffective.

I dunno. If I'm composing music, and I heard your music before that, is my music implicitly derived? Your music certainly had some infinitesimal effect on me.

Absolutely, but any music you compose probably also has an element of originality in it, or at least draws inspiration from something that is non-musical in nature. For a trained AI, the entire model is just a derivative work of all the training data. It's not as though the AI had a lover who left them and drew inspiration from that experience to become a more effective AI.

> It's not as though the AI had a lover who left them and drew inspiration from that experience to become a more effective AI.

Well then your ex-lover clearly deserves co-songwriting credits. As well as their parents. And anybody who has influenced them personally, or even anyone who made the food that they've eaten. There's gotta be a point at which the original is just too far removed from the end result for it to be infringing, otherwise you could just keep going.

Also, a model that classifies things isn't the same as those things themselves, or images of those things, and I certainly hope it would be considered transformative enough to not be infringing (I am not a lawyer though). I could give it an image of a dog, and it will tell me what it thinks it is. But there doesn't seem to be any way for me to say "show me a dog" and get back any sort of image, infringing or not.

Some thought experiments: Let's say you have a copyrighted photo, and I design an API that allows anyone to upload a photo and get a true/false of whether or not it's the same file as your photo. Is this copyright infringement if I never release the original photo?

So to avoid the copyright situation, I just mix in some random nonce data, infinitesimally small, to make it equivalent?
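
One way to make the thought experiment concrete is a sketch in which the service stores only a SHA-256 fingerprint of the photo and never the photo itself. This is just one possible reading of "same file". The function names and the placeholder bytes are hypothetical.

```python
import hashlib


def fingerprint(data):
    """Hex SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def is_same_file(upload, reference_fingerprint):
    """True only if the upload is byte-for-byte identical to the reference."""
    return fingerprint(upload) == reference_fingerprint


# Placeholder bytes standing in for the copyrighted photo.
original = b"placeholder bytes standing in for a photo"
reference = fingerprint(original)  # the only thing the service keeps

print(is_same_file(original, reference))            # True
print(is_same_file(original + b"\x00", reference))  # False
```

Under this reading, a single extra byte is enough to make the check return false, which says nothing about whether the two images look the same to a person. Whether any of this matters legally is a separate question.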

Maybe that fixes it. Maybe you don't even have to if the copyright question only ever remains theoretical. I have a feeling it might only be a matter of time until this question arises in a case, though.

>but any music you compose probably also has an element of originality in it

Is deep dream original in your opinion?

So people trained google images that Comcast = nazi flag. I guess we need to send in an OCR of a nazi flag and see if it comes back Comcast?

This would be like saying Google's Search Index is a copyright infringement. Or indeed, anything that utilizes any public copyrighted information to infer any details.

To me this is taking copyright protection way way too far and the kind of system that could severely gum up progress and innovation.

By using this API, we're effectively training Google's system to be more and more accurate. Shouldn't Google pay us for using it? Just saying :-)

Not quite. The terms and conditions say that you own the IP and that it isn't used beyond providing the immediate service, meaning they aren't using it to train future models.

Besides, it's a call without any feedback, so it's not that valuable as far as training goes.

They've probably already used Google customers as training data unknowingly, and now it's time to productise it and charge for it.

Good point!

I would definitely pay for a WordPress plugin that uses this as I manually tag photos on my site with a lot of standard things this could probably just knock out in a flash.

Hi, noticed your comment and just wanted to let you know that, during the alpha testing phase, I actually made this. It's somewhat rough and unreleased now, but I am wondering what you would expect from such a plugin. Could you answer some of these questions?

- Would you primarily use the plugin to auto-tag uploads in the admin area?
- Would you have problems with getting your own Google API key (and thus billing via Google), or would you expect the plugin to take care of this as well?
- Any other expectations?

As said, what I have now is not really release-worthy, but I can get it to that point in a matter of days.


- Aron

Just tried a couple of images with digits to test out the OCR w/ the TEXT_DETECTION setting, unfortunately it assumes what it reads is a defined language with words. I am looking into using this for digit-recognition and only digits, but that doesn't seem to be a use case (as it is now). Does anybody know of another service/API that can do reliable digits-only OCR on (not the finest clear quality) images?

Weird, their example at https://github.com/GoogleCloudPlatform/cloud-vision/tree/mas... shows digits. Your best bet is probably ABBYY if you want something else.

I hope to play with this also tomorrow, as I want to see if it can extract numbers from pictures of a street. The docs say

"TEXT_DETECTION 1024 x 768 OCR requires more resolution to detect characters"

How big were the images you used?


While this technology is fascinating, I can't help but feel a little unsettled reading that.

Interesting that this is released in closed-source, API-only form, rather than the open-code model taken by TensorFlow. I wonder how far you could approximate the model by training a learner on the API responses.

Is the best way of sending an image still base64 encoding it in JSON?

If 'best' means most compatible with varying types of systems large and small, then yes.

You can also provide the image by passing in a Google Cloud Storage location.
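
A rough sketch of the two options, building only the JSON body and making no network call. The field names (`requests`, `image.content`, `image.source.gcsImageUri`, `features`, `maxResults`) follow my recollection of the v1 annotate request and should be checked against the current docs. The helper name and the bucket path are made up for illustration.

```python
import base64
import json


def build_annotate_request(image_bytes=None, gcs_uri=None,
                           feature_type="TEXT_DETECTION", max_results=1):
    """Return a JSON request body for one image, inline or from Cloud Storage."""
    if (image_bytes is None) == (gcs_uri is None):
        raise ValueError("provide exactly one of image_bytes or gcs_uri")

    if image_bytes is not None:
        # Inline image: raw bytes are base64-encoded into a JSON string.
        image = {"content": base64.b64encode(image_bytes).decode("ascii")}
    else:
        # Reference an object already stored in a bucket (assumed field name).
        image = {"source": {"gcsImageUri": gcs_uri}}

    body = {
        "requests": [
            {
                "image": image,
                "features": [{"type": feature_type, "maxResults": max_results}],
            }
        ]
    }
    return json.dumps(body)


# Inline bytes (placeholder data) versus a hypothetical bucket path.
print(build_annotate_request(image_bytes=b"abc"))
print(build_annotate_request(gcs_uri="gs://my-bucket/photo.jpg"))
```

Base64 inflates the payload by about a third, so the Cloud Storage reference is the lighter request when the image is already in a bucket.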

Is there any website where I can just upload a pic and see how it works without trying to figure out how to access their API?

This is great news. I have been working on a Swift framework for using this API in OSX and iOS (https://github.com/mgcm/CloudVisionKit) and I was wondering when it (the API) would become available for public use.

With the Beta release, it is available for public use.

When I hit "go to api console" i get the following: https://www.dropbox.com/s/xsysabgywa4t5mm/Screenshot%202016-...

At Cortex (http://www.meetcortex.com/) we are using this and technology like it to help brands be smarter about marketing content on social media. Really cool stuff.

Affectiva offers SDKs for facial expression and emotion analysis from images that work in realtime and offline without having to send images to the cloud.


Disclaimer: I work for them.

There is mention of GCV being able to calculate various image properties (dominant colour, being the example) yet there is no reference to what it actually returns in the API docs.

Can someone who has this active shed some light?

Google is scary good at releasing scary technology in a friendly box.

Anyone tried using the API to solve captchas?

"Google Cloud Captcha API"

Would it be possible to do product recognition such as brand and model from images without label?

Does something like this exist for sound? Any open source projects worth looking at?

If there is, I'm totally starting my audiolytics business!

Can we find the dimensions of things in a photo using this api?

Am I the only one that signed up but can't get access to it?

High price. May be suitable for MVP.


Errata: I'll need a research team and a year and a half.
