Ask HN: Microsoft Computer Vision API or Google Cloud Vision API?
76 points by jharohit on June 19, 2016 | 50 comments
Hi HN community!

I am trying to decide between Microsoft's CV offering and Google's CV offering for my B2B startup. Any recommendations from people who have tried both?

Background: we take images of models uploaded by agencies and derive labels and image properties from them. Face detection would be an added bonus if possible.


Pricing is more friendly than the other two. The API is nice.

I did like Google's a lot, but the price just wasn't there for me. Especially if you want most processing options.

Microsoft have been in this game a lot longer, but surprisingly a lot of their cool stuff isn't in their APIs, e.g. the ability to spot similar images and seamlessly stitch them. This stuff was in their maps products a long time ago, and you can download tools of theirs: http://research.microsoft.com/en-us/um/redmond/projects/ice/ but there are no APIs. Their basic APIs are just basic... so why not save the dime and go with a smaller player that offers just the basics, but does them very well?

Thanks for the suggestion. I tried Clarifai's demo. Their API and pricing are nice, as you said, but they just don't have much to offer in terms of labels and feature detection compared to what Microsoft's or Google's version offers. Plus, they don't seem to offer confidence numbers for labels.

Disclosure: I work at Clarifai.

Thanks for checking us out! Our demo may not fully represent what we have to offer, but you can see a sample response here with confidence numbers included (we call them probs): https://developer.clarifai.com/guide/tag#tag

In terms of labels, our "general" model has over 11,000 labels, and we also have specialized models with labels tailored for other purposes, including NSFW, travel, and food, among others.

Hope this helps
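
For anyone wondering how per-label confidence numbers get used once you have them, here is a minimal sketch. The response shape (parallel lists of class names and probabilities), the function name `filter_labels`, the `min_prob` threshold, and the sample data are all assumptions made up for illustration; they are not taken from Clarifai's, Microsoft's, or Google's actual schema.

```python
from typing import List, Tuple


def filter_labels(classes: List[str], probs: List[float],
                  min_prob: float = 0.9) -> List[Tuple[str, float]]:
    """Keep labels whose confidence is at least min_prob, highest first."""
    if len(classes) != len(probs):
        raise ValueError("classes and probs must have the same length")
    kept = [(c, p) for c, p in zip(classes, probs) if p >= min_prob]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical tagging result, invented for this example.
sample_classes = ["model", "portrait", "fashion", "outdoors"]
sample_probs = [0.98, 0.95, 0.91, 0.42]

confident = filter_labels(sample_classes, sample_probs, min_prob=0.9)
print(confident)
```

The threshold is a product decision: a higher value gives fewer but more trustworthy labels, so it is probably worth tuning per vendor on your own images.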

Depends on the complexity of what you require. I know I might get down-voted for this, but if your task is relatively simple, then roll your own using deep learning. Message me if you want help with this.

I wouldn't rely on either for my own startup, because I don't think these APIs will have broad appeal; as a result they won't get traction and will be shut down with little warning.

I could help you if you want - my email is in my profile

I've had a surprising amount of success rolling my own deep learning vision system too since the community is so open, but I wanted to learn about the field and I've sunk a few weeks of spare time into it at this point so I'm not sure I'd really recommend it to a startup unless the API offerings just don't work for them for performance or cost reasons.

OpenCV includes face detection, and given a reasonably limited corpus of faces, it performs quite well and quite reliably.

(Whether you want to use that or use a service depends on how close to your core business this is.)

I would add to this comment. Until recently I thought face detection was a difficult service, but it's actually really easy to implement yourself.

I found these really useful: http://www.pyimagesearch.com/

A "limited corpus of faces"? That would be face recognition, not face detection. I see OpenCV has both.

Interesting looking at Google's Vision API overview, where they explicitly state that facial recognition is not available.

The technology to do this clearly exists, but I gather they are concerned about the potential for abuse. Which makes sense. You could build some very creepy apps with this.

I won't directly speculate but I doubt the Face Detection functionality is absent due to some ethical quandary.

Why do you say that? http://megaface.cs.washington.edu/results/ shows that Google has one of the best Face Detection models.

Sounds like a competitive advantage to me.

A request to those suggesting "why not X" or "consider X" : If you could mention a reason or two favoring X over Y, that'll help OP & future visitors.


Pay-as-you-go, many APIs, supportive community.

For your use case you might want to check the Computer Vision tag, specifically the "Illustration Tagger" algorithm.


I'm interested in good OCR, preferably local, but I'm close to giving up and using Google Cloud Vision API. It works well for text that's not perfectly aligned and laid out, unlike e.g. Tesseract or any other local OCR I've used.

As far as I can tell, clarifai.com doesn't have OCR, and neither does anyone else except MS and G.

Hey, I made this [1]. Based on a neural net trained with generated character sets with intentional distortions and/or several font variations.

I would not recommend it for use in production, but maybe you're interested in looking at the code and customizing it to your liking. Could perhaps be combined with OpenCV.

[1] https://github.com/mateogianolio/ocr


Have you had a look at the ABBYY offerings? They work very well, with automatic skew detection and correction. But of course, it's a very pricey solution, and I wouldn't recommend it if you don't have the budget for that sort of thing.

Develop your code so that the API is pluggable. Try both and decide which works best for you.
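
A rough sketch of what "pluggable" could look like in Python. Everything here is hypothetical: the `VisionBackend` interface, the `Label` type, and the two fake backends are stand-ins invented for illustration. A real adapter would call the vendor's API inside `label()` and map its response into `Label` objects.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class Label:
    name: str
    confidence: float


class VisionBackend(Protocol):
    """Anything with a matching label() method can be plugged in."""

    def label(self, image_bytes: bytes) -> List[Label]: ...


class FakeBackendA:
    """Stand-in for one vendor; returns canned labels instead of calling an API."""

    def label(self, image_bytes: bytes) -> List[Label]:
        return [Label("person", 0.97), Label("fashion", 0.88)]


class FakeBackendB:
    """Stand-in for a second vendor with a different (canned) answer."""

    def label(self, image_bytes: bytes) -> List[Label]:
        return [Label("person", 0.93)]


def compare_backends(backends: Dict[str, VisionBackend],
                     image_bytes: bytes) -> Dict[str, List[Label]]:
    """Run the same image through every backend and collect the results."""
    return {name: backend.label(image_bytes) for name, backend in backends.items()}


results = compare_backends(
    {"a": FakeBackendA(), "b": FakeBackendB()},
    b"fake image bytes",
)
for name, labels in results.items():
    print(name, [(l.name, l.confidence) for l in labels])
```

Because the rest of the application only sees `Label` objects, swapping vendors (or running two side by side on the same images) should not touch anything outside the adapters.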

It really depends what you are attempting to accomplish, and what you wish to detect in the images.

As you mentioned faces:

Are you looking for face detection or recognition? Face detection was robustly solved before the advent of DL with Haar cascades and face models. It is now being pushed a bit further with DL.


Current cutting edge face recognition systems rely on DL, and the top performing models are one out of Russia (NTechLAB, facenx_large) and one from Google (FaceNet v8). These were the top two performers in the MegaFace challenge - identification with 1M distractors. Truly remarkable results. http://megaface.cs.washington.edu/results/

As with most DL systems, you will need a massive corpus of labeled faces (e.g. the kind Google or VKontakte have; the NTechLab group used the latter).
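
To make "identification with distractors" a bit more concrete, here is a toy sketch of rank-1 identification: a probe embedding is compared against a gallery containing the true identity plus distractors, and the closest entry wins. The three-dimensional embeddings, the names, and the use of cosine similarity are assumptions for illustration only. This is not the MegaFace evaluation code, and real systems use learned embeddings with far more dimensions.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank1_identity(probe: Sequence[float],
                   gallery: Dict[str, Sequence[float]]) -> str:
    """Return the gallery entry most similar to the probe."""
    return max(gallery, key=lambda name: cosine(probe, gallery[name]))


# Made-up embeddings: one true match and two distractors.
gallery = {
    "alice": [0.9, 0.1, 0.0],
    "distractor_1": [0.0, 1.0, 0.0],
    "distractor_2": [0.1, 0.2, 0.9],
}
probe = [0.8, 0.2, 0.1]  # a second, slightly different photo of "alice"

best = rank1_identity(probe, gallery)
print(best)
```

The task gets harder as distractors are added, because each one is another chance for an unrelated face to land closer to the probe than the true match.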

Note from personal experience: Haar cascades only work really well on frontal faces in high quality and good light. It sounds like OP has these kinds of photos so it will work well, but if you want to detect faces in any other kind of image/video, you'll need something more powerful. I still haven't found anything that works well.

You may want to check out the face detection approach described in this paper: Joint Cascade Face Detection and Alignment, ECCV 2014 (http://www.jiansun.org/papers/ECCV14_JointCascade.pdf)

I'd also encourage you to try out the Face API from Microsoft (full disclosure, I work on it). One of the focuses is on improving detection when challenging lighting and occlusions are present: https://www.microsoft.com/cognitive-services/en-us/face-api

Very interesting paper. Thank you for sharing it.

I had a question after reading it. The parameter rho, which is used to decide whether to classify or regress depending on the cascade stage t in {1..T}, is said to be set empirically. I would assume this could change across data sets. How were you able to decide when classification switched to regression in your tree, and how was this adapted when testing on other data?

Also forgot to attach the paper I was referring to: http://arxiv.org/pdf/1502.02766v3.pdf

I tried both some time ago for an OCR task. In my brief experience, GCV performs better than Microsoft. Also, the last time I tried, I sometimes got random server errors from Microsoft, so I guess Google's infrastructure is more ready. The downside is that GCV is a bit pricier. Also, neither provides a parameter to set the language model, so that's a minus in my eyes.

I don't know about face recognition, but I've quantitatively analysed their speech recognition and it came dead last among the 5 or so that I tested.

Any chance you could share this?

I know others have asked, but I'd also be interested in that ASR analysis. I'm working on something for my masters involving crowdsourced transcribers, and we have access to Watson's ASR, which the (non-CS) administration likes, but I've suspected it's crap and it would be cool to have some data.

Yeah, this would be good to share with the team if you have it documented.

Facial recognition is possible using the IBM Watson Visual Recognition service.

Is there a publicly accessible API that can geocode photos to a reasonable degree of accuracy? I'd like to be able to decorate digital photos taken before geocoding was a thing with geo data. I figure photos I have taken of St. Mark's Square in Venice have probably been taken a million times by other people, some of whom have probably added GPS coordinates to theirs, so a smart CV offering should be able to figure it out to a sufficient degree of accuracy (for reasonably well-photographed and unchanging locations on earth).

EDIT: I see Google Cloud Vision has landmark detection, that might be useful if the API returns the GPS coordinates of the landmark.

Google Photos actually makes a reasonable estimation of your photo's location based on the content of the image (and the context of the photo, if it was available.)

For example, if I shoot with my non-GPS-enabled DSLR, those images are uploaded to Google Photos, which will reconcile my location history to apply a location to those shots. It'll also do that if it sees DSLR shots in between geotagged cameraphone shots.
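
The location-history reconciliation described above can be approximated with a nearest-timestamp lookup. This is only a sketch of the general idea, not how Google Photos does it: the `Fix` tuple layout, the `nearest_fix` name, the 30-minute `max_gap_s` cutoff, and the sample coordinates are all assumptions for illustration.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Fix = Tuple[float, float, float]  # (unix_time, latitude, longitude)


def nearest_fix(fixes: List[Fix], photo_time: float,
                max_gap_s: float = 1800.0) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) of the fix closest in time to the photo.

    fixes must be sorted by time. Returns None when the closest fix is
    further than max_gap_s away, since the guess would be unreliable.
    """
    if not fixes:
        return None
    times = [f[0] for f in fixes]
    i = bisect_left(times, photo_time)
    # Only the fix just before and the fix just after can be the closest.
    candidates = fixes[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda f: abs(f[0] - photo_time))
    if abs(best[0] - photo_time) > max_gap_s:
        return None
    return best[1], best[2]


# Hypothetical location history (e.g. from a phone) sorted by time.
history = [
    (1000.0, 45.4340, 12.3380),
    (2000.0, 45.4345, 12.3390),
    (9000.0, 45.4400, 12.3300),
]

print(nearest_fix(history, 1900.0))  # close to the second fix
print(nearest_fix(history, 5000.0))  # too far from any fix
```

The same lookup covers the "DSLR shots in between geotagged cameraphone shots" case if the geotagged phone photos are used as the history.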

But more to your use case, GPhotos will actually recognize landmarks and other information to tag photos, I believe, with a rough location (such that it'll match a location like "Paris" or "Eiffel Tower," but perhaps not lat/long... yet.)

Even more impressive, they're very nearly able to do exactly what you're describing, though my understanding is that it isn't in use in GPhotos yet: https://www.technologyreview.com/s/600889/google-unveils-neu...

We build customised image labelling solutions where you can label many more things, like the type of neckline on a garment or the pattern of a label on a mug, and many other such things which are not supported by Google or Microsoft.

We also offer finding similar images as well as image search capabilities apart from finding tags from images. Please connect at https://twitter.com/adityapatadia to discuss further.


Alternative solution for image moderation and nudity detection. Simple API and simple pricing.

When I last checked, the Google API does not allow identifying specific faces. It can detect faces, but that's it. Clarifai and Microsoft do. Pricing-wise, almost all are the same. In my view, Watson is a complete no-no.

Just curious, what makes you say that about Watson?

I think Microsoft's works better, try also IBM's if you can.

So I tried Microsoft's CV and IBM's AlchemyVision on the same image (since they both have an online demo sandbox). Microsoft's gave me back more labels and stronger sentiment figures for the same labels. Hence I narrowed it down to these two.

If you tried the Watson vision offering when it was "AlchemyVision" then you may have tried a now out-of-date version of the service. The AlchemyVision and Visual Recognition tiles on Bluemix have recently been combined in a way that utilizes their complementary strengths. Consider retrying the updated service if you'd like!

Disclosure: I work at IBM Watson.

OK, my bad. I just saw that AlchemyVision was merged into Visual Recognition starting May 20th. We will definitely check it out to see what extra features have been added.

Quick question: is the API stabilized, by which I mean, will there be further changes/merges?

The API has stabilized. There will be future changes, and I would expect to see those as the service gets new features added (like retraining).

On May 20, the AlchemyVision and Watson Visual Recognition Beta services merged into Watson Visual Recognition - http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercl...

We're continually improving them (as everyone in computer vision is!) but for now one key feature of Watson Vision is the ability to train a custom classifier for your own data by just giving it a bunch of example images. The service also provides general image tagging, text extraction (beta), face detection and a fixed list of people for celebrity facial recognition.

We see this as a green-field area where rapid progress is being made across the board... I wouldn't count anyone out yet!

Do you know of a service or library for human pose estimation from 2D images? I've seen papers about it. I would like to try it out.

Consider Imagga.


They explicitly didn't allow it for Google Glass apps so I wouldn't be surprised.

Can you explain this a bit better? I'm curious.

I think it depends on your dataset and application. It should be easy enough to try both.

What works well for face detection in low light images?

Microsoft's CV API worked better for me.

The last time I checked, WolframAlpha worked surprisingly well.
