Losing Confidence in Quality: Unspoken Evolution of Computer Vision Services (arxiv.org)
65 points by mr_tyzic 25 days ago | 13 comments

I have worked in industrial machine vision for the last 18 years, and this summary (I didn't read the paper) reflects my comments to management every time they bring up the topic of AI or ML. Our inspection systems operate at anywhere from 300 to 3000 parts per minute. AI/ML has far too high a false reject rate (or, even worse, false accept rate!). The worst-case scenario I explain to my managers goes like this: we implement an AI/ML system, and for some reason it starts rejecting 50% of the customer's product at 3am on a Sunday. There is then no practical way to analyze the results, determine a definitive cause for the rejects, and decide what the corrective action needs to be. The final gut punch is that we would have to tell the customer it could take several hours to retrain the model (and that's only after we figure out what needs to be represented in the good and bad image sets to account for the new failure mode).
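
To put those line speeds in perspective, here is a back-of-the-envelope sketch of how a per-part reject rate turns into rejected parts per hour. The function name and the 0.1% false reject rate are assumptions for illustration only, not figures from this comment or the paper; the 50% case is the 3am scenario described above.

```python
def false_rejects_per_hour(parts_per_minute: float, reject_rate: float) -> float:
    """Expected number of parts rejected per hour at a given line speed and reject rate."""
    return parts_per_minute * 60 * reject_rate


if __name__ == "__main__":
    # Line speeds quoted in the comment; 0.1% is an assumed, illustrative rate.
    for ppm in (300, 3000):
        print(f"{ppm} ppm at 0.1%: {false_rejects_per_hour(ppm, 0.001):.0f} parts/hour")

    # The 3am scenario: the model suddenly rejects half of all product.
    print(f"3000 ppm at 50%: {false_rejects_per_hour(3000, 0.5):.0f} parts/hour")
```

At 3000 parts per minute, a model stuck at 50% rejects throws out 90,000 parts an hour, so "several hours to retrain" means hundreds of thousands of parts.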

I recently spent months implementing an ML CV system for manufacturing, and I tried several of the services mentioned.

This paper compares AWS Rekognition, Google Cloud Vision, and Azure Computer Vision. I tried Google and Azure. I also tried Clarifai.

To get 100% accuracy I ended up building my own service and doing some really unconventional things. I used TensorFlow, but I may rewrite the whole thing in PyTorch.

I tried using neural architecture search but it was a dead end.

The key for me was training data distribution search.


Why not do a slow rollout, starting at maybe 5% of product. Then you can catch and fix deficiencies without compromising your entire production line, and ratchet up to 10, 15, ... 100% as you gain confidence in the model?
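
A minimal sketch of how such a staged rollout could decide which parts the model sees, assuming each part has a string ID. The names `routed_to_model` and `ROLLOUT_STEPS` and the hash-bucket scheme are hypothetical, not something the commenter described. Hashing the ID instead of drawing a random number keeps every routing decision reproducible.

```python
import hashlib


def routed_to_model(part_id: str, fraction: float) -> bool:
    """Return True if this part should be inspected by the ML model.

    The part ID is hashed into one of 10,000 buckets, so the same part
    always gets the same answer for a given rollout fraction.
    """
    digest = hashlib.sha256(part_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < fraction * 10_000


# Ratchet schedule from the comment: 5%, 10%, 15%, ... up to 100%.
ROLLOUT_STEPS = [round(0.05 * k, 2) for k in range(1, 21)]

if __name__ == "__main__":
    ids = [f"part-{i}" for i in range(20_000)]
    share = sum(routed_to_model(p, ROLLOUT_STEPS[0]) for p in ids) / len(ids)
    print(f"step 1: {share:.1%} of parts go to the model")
```

Because a part's bucket never changes, any part routed to the model at 5% stays in the model's share as the percentage ratchets up.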

Because no customer will pay for that scenario. That could potentially be a viable plan with an in-house implementation but not for a turn-key product with service agreements and performance guarantees.

So what are your inspection systems based on?

Vision systems in controlled environments like factories that need to be fast and stupidly reliable are often based on "classical" (non-ML) computer vision techniques.

The ML hype train has led to a frustrating amount of throwing the baby out with the bathwater, where people who should know better decide to use ML for more things than is reasonable (something something "end-to-end").

Ideally, ML should be used for a few very specific tasks, and then classical machine vision / geometric analysis / plain old logic should be doing the rest. If you don't do it this way, you eventually end up with the problems described in this paper, where performance is inconsistent and impossible to debug, and nobody can tell you what's going on or why.
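
A minimal sketch of that split, assuming the ML model's only job is to locate the part and return a pixel bounding box, with a known mm-per-pixel calibration. `inspect`, `Verdict`, `locate` and the tolerance values are hypothetical; the detector is replaced by a stub. Everything after the locate step is plain geometry, so every reject comes with a concrete reason.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# (x, y, width, height) in pixels, as returned by some part locator
BBox = Tuple[int, int, int, int]


@dataclass
class Verdict:
    accept: bool
    reason: str


def inspect(image: object, locate: Callable[[object], BBox], mm_per_px: float,
            nominal_mm: float, tol_mm: float) -> Verdict:
    """ML does one narrow job (finding the part); plain geometry makes the call."""
    _, _, w, h = locate(image)
    for name, px in (("width", w), ("height", h)):
        measured_mm = px * mm_per_px
        if abs(measured_mm - nominal_mm) > tol_mm:
            # The reject carries a human-readable, checkable cause.
            return Verdict(False, f"{name} {measured_mm:.3f} mm outside "
                                  f"{nominal_mm} +/- {tol_mm} mm")
    return Verdict(True, "all dimensions within tolerance")


if __name__ == "__main__":
    # Stand-in for a trained detector: always reports a 100 x 102 px box.
    fake_locate = lambda img: (10, 10, 100, 102)
    print(inspect(None, fake_locate, mm_per_px=0.1, nominal_mm=10.0, tol_mm=0.05))
```

If the detector misbehaves, the failure shows up as an implausible box, which is far easier to diagnose than an unexplained good/bad score.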

This was years ago and not my specialty, but I worked with a product line that manufactured what were essentially highly precise plastic washers. As they came off the line, a clever bit of mechanical dispersal slowly dropped them onto a conveyor with a flashing light underneath it, synced with a digital camera system that took the pictures.

Because all of the factors were known, you could almost just do pixel counts and compare them to "true" circles (no need for AI models). Any washers that didn't meet the criteria were pneumatically puffed off the line by an air hose.
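
A minimal sketch of the pixel-count idea using NumPy, assuming the camera image has already been thresholded into a binary mask and the nominal outer and inner radii (in pixels) are known from calibration. The synthetic washers, the function names and the 2% tolerance are all illustrative assumptions, not details of the line described above.

```python
import numpy as np


def make_washer(size: int, outer_r: float, inner_r: float) -> np.ndarray:
    """Synthetic binary image of an ideal washer (True = plastic), centred in the frame."""
    yy, xx = np.mgrid[:size, :size]
    c = (size - 1) / 2
    d2 = (xx - c) ** 2 + (yy - c) ** 2
    return (d2 <= outer_r ** 2) & (d2 >= inner_r ** 2)


def washer_ok(mask: np.ndarray, outer_r: float, inner_r: float, rel_tol: float = 0.02) -> bool:
    """Compare the foreground pixel count with the area of a 'true' annulus."""
    expected = np.pi * (outer_r ** 2 - inner_r ** 2)
    measured = int(np.count_nonzero(mask))
    return bool(abs(measured - expected) / expected <= rel_tol)


if __name__ == "__main__":
    good = make_washer(201, outer_r=80, inner_r=30)
    chipped = good.copy()
    chipped[:100, :100] = False  # simulate a missing chunk
    oversized = make_washer(201, outer_r=85, inner_r=30)
    for name, m in (("good", good), ("chipped", chipped), ("oversized", oversized)):
        print(name, "accept" if washer_ok(m, 80, 30) else "reject (puff it off)")
```

An area check alone won't catch every defect; a real system would likely add checks such as centre offset or roundness, but each is equally explicit and debuggable.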

Thanks for sharing this!

Any tech that you think shows the most promise?

Yes. AI/ML.

As mentioned in another comment, targeted application of AI/ML can provide excellent results when the expectation is carefully defined and the objective is clearly stated.

Many AI/ML proponents jump directly to a 'black box' attitude, where a product enters one side, leaves the other, and a smiley or frowny face tells you whether it is good or bad. There are no definitions of dimensional tolerances or geometric features. It's just good or bad based on the trained model. Unfortunately, GIGO (garbage in, garbage out) applies very strongly to the training set.

This paper shows measured results of using "popular image recognition services", including Azure, AWS, Google, IBM and other current commercial offerings (the implication right away is that the tested services use some DeepLearning system on the server side). The paper specifically says "from the point of view of a software developer", and spends quite a bit of effort questioning the assumptions of a user of these services and identifying potential pitfalls from a mismatch in those assumptions: consistency over time, consistency between the services themselves, and whether the machine produces deterministic or probabilistic outcomes. It looks at Vision-as-a-Service usage from a Software Quality Assurance (SQA) point of view: are the results of commercial services on the web reliable over time? Liability in safety-critical environments is also questioned.

The comments here (so far) address "does DeepLearning image analysis work?", which is a broader question than the one the paper addresses. Importantly, other kinds of image analysis methods, including other ML approaches, are not being compared.

The authors seem to be raising a bit of an alarm about services like these, reflected in the paper title (weakly):

[RH1] Computer vision services do not respond with consistent outputs between services, given the same input image.

[RH2] The responses from computer vision services are non-deterministic and evolving, and the same service can change its top-most response over time given the same input image.

[RH3] Computer vision services do not effectively communicate this evolution and instability, introducing risk into engineering these systems
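
Given RH2 and RH3, a client of such a service can't assume its top-most label is stable. A minimal sketch of a regression check: pin a set of images, store the top label each returned, and periodically diff. It assumes the service returns a label-to-confidence mapping; `classify`, `find_drift` and the stored baseline are hypothetical, and the example uses canned responses instead of a real API call.

```python
from typing import Callable, Dict, List

Labels = Dict[str, float]  # label -> confidence, as returned by a vision service


def top_label(labels: Labels) -> str:
    return max(labels, key=labels.get)


def find_drift(baseline: Dict[str, str], classify: Callable[[str], Labels]) -> List[str]:
    """Re-run a pinned set of images and report any whose top-most label changed."""
    changes = []
    for image_id, expected in sorted(baseline.items()):
        current = top_label(classify(image_id))
        if current != expected:
            changes.append(f"{image_id}: {expected!r} -> {current!r}")
    return changes


if __name__ == "__main__":
    # Canned responses; in practice classify() would call the vision API.
    today = {"img1.jpg": {"dog": 0.91, "cat": 0.05},
             "img2.jpg": {"sneaker": 0.62, "footwear": 0.64}}
    baseline = {"img1.jpg": "dog", "img2.jpg": "sneaker"}
    for line in find_drift(baseline, today.__getitem__):
        print("DRIFT", line)
```

The same loop could compare two services against each other on one image set, which is the RH1 question.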

To a non-specialist, this reads like a detailed description of a useful real-world investigation, like a lab write-up. The authors' skepticism is healthy, and the paper overall looks good. On the negative side, the discussion of labels in computer vision does not seem to distinguish clearly enough between fundamental problems of taxonomy and classification, problems with grouping data in general, and problems specific to this kind of DeepLearning image identification.

One thing to keep in mind is that any neural-net-based ML system is essentially just a mathematical function imitator. The observations in this paper are spot on in the sense that many different mathematical functions can produce the same subset of results (success on your training data set), yet behave wildly differently in the general case.

This is known as overfitting, and it's one of the main things that should cast doubt on any ML system's ability to reliably produce results outside its training set.
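
A toy illustration of "same results on the training data, different behavior elsewhere", using two hand-written functions rather than trained networks (the functions and the three training points are assumptions chosen for clarity). The extra term in `f2` is exactly zero at x = 0, 1 and 2, so nothing measured on those points can tell the two apart.

```python
def f1(x: float) -> float:
    return x


def f2(x: float) -> float:
    # Adds a term that is exactly zero at x = 0, 1 and 2.
    return x + x * (x - 1) * (x - 2)


TRAIN_X = [0, 1, 2]

if __name__ == "__main__":
    # Identical on the "training set"...
    print([(x, f1(x), f2(x)) for x in TRAIN_X])
    # ...but very different away from it.
    print([(x, f1(x), f2(x)) for x in (-1, 3, 10)])
```

At x = 10 the two "models" answer 10 and 730; a trained network has vastly more freedom than this, so the same ambiguity is only harder to see.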

In a way, these outcomes are predictable up to a point (in the sense of "the possibility exists", as opposed to "this set of weights yields these generalizability results") if you take the whole "neural network" analogy a bit more literally than many academics are comfortable with. The human brain, or any collection of biological neurons, is in a state of constant flux, forming different networks in order to react to stimuli in the environment and carry out actions that bring us closer to achieving $goals.

Who hasn't had an off day where the gears of your mind just aren't producing what you darn well know they should? It's a fundamental change in the primary set of neural tools you've got to work with that day. When the weights change, so does the output, and so does the function being modeled. The cerebellum, if I recall correctly, actually acts as a kind of QA function built into our own minds.

There's nothing magic about simulating neural networks in silicon that suddenly gets you to a more "mistake-free" state, besides being able to condense far more dimensions of data into the network's "sensory space", as it were. Even then, though, the possibility of imitating a suboptimal function is inescapable.

Try explaining that to someone who wants to save millions on workforce, or to automate safety-critical tasks without concern for the consequences, though. It's amazing what cognitive barriers we can build.

I feel like you can infer these services are stochastic and not deterministic based on the documentation.


"The quality of your classifier depends on the amount, quality, and variety of the labeled data you provide it and how balanced the overall dataset is. "

Good analysis for a real-world problem that software developers face. Customer expectations definitely do not align with the realistic capabilities of the technology.
