- Some non-technical stakeholder comes to me and says "can we solve this problem with Machine Learning?" Usually it's something like "there need to be two supervisors on the factory floor at all times, and I want an email alert every time there are fewer than 2 supervisors for more than 20 minutes".
- I ask for some sample footage to build a prototype and get a few very poor quality videos, at a very different standard from what I see in most of these tutorials.
- I find some pre-trained model that is able to do people detection or face detection and return bounding rectangles and download it in whatever form
- After about 30 minutes of fiddling and googling errors, I run it against the sample footage
- I get about 60% accuracy - this is no good. Where do I go from here? Keep trying different models? There are all sorts of models like YOLO and SSD and RetinaNet and YOLO2 and YOLO3.
- At some point I try a bunch of models and all of them are at best 75% good. At this point I figure I should train it with my own dataset, and so I guess I need to arrange to have this stuff labelled. In my experience stakeholders are usually willing to appoint someone to do it but they want to know how much footage they need to label and whether their team will need special training to do the labelling and after it's all done is this even going to work?
What are some effective / opinionated workflows for this part of the overall process that have worked well for you? What's a labelling tool that non-technical users can use intuitively? How good are tools/services like Mechanical Turk and Ground Truth?
This part of the process costs time and money - stakeholders, particularly managers who are non-technical tend to want an answer beforehand - "If we spend all this time and money labelling footage, how well is this going to work? How much footage do we need to label?". How do you handle these kinds of conversations?
I find this space fairly well-populated with ML tutorials and resources but haven't been able to find content that is focused on this part of the process.
I believe your issue can be easily solved - have supervisors wear a distinctive color from a non-supervisor. For example let's say it's yellow.
OK so now you have yellow wearing supervisors and everyone else. To resolve the issue you have described acquire a month or so of footage, with labels per minute describing how many yellow wearing supervisors and how many people (in total) there are.
So the data you have is:
1. Yellow wearing supervisors
2. Total amount of workers on the floor
Then with this data you can train a network to do what you're describing pretty easily. Assuming there are a lot of workers on the floor, trying to do person detection or face detection would require too much data. Just have a uniform enforced and train on the colors/presence.
Imagine, you told a 10 YO child to do this task. Even the child would ask the same question - how do I know who is a supervisor and who is not.
Not only is face recognition hard, it is almost impossible to accomplish in a factory floor like setting. Not totally impossible but it is really really hard. Face detection is still possible but face recognition is far more computationally expensive. You'll need a shit ton of data and you'll need access to the employee database. You'll need a whole new engineering pipeline to make this happen and of course a team.
Compared to that expense and time, you are way better off getting the company to approve special vests for supes.
In your example, take a film of the factory floor when it is empty, then once work begins use an approximately human-sized/shaped rectangular sliding window and look for areas that exceed a threshold of difference from the image of the empty floor.
You can then use that window as input to a classifier which will be easier due to the considerable dimension reduction or perhaps you can get sufficient performance using further deterministic techniques.
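A minimal sketch of the sliding-window difference idea, in plain Python so it runs anywhere. Everything here is an assumption for illustration: frames are small grayscale grids (lists of lists), the "person" is a bright rectangle, and the window size, stride and threshold are made-up numbers you would have to tune on real footage.

```python
def window_diff(frame, background, top, left, height, width):
    """Mean absolute pixel difference inside one window."""
    total = 0
    for r in range(top, top + height):
        for c in range(left, left + width):
            total += abs(frame[r][c] - background[r][c])
    return total / (height * width)


def find_changed_windows(frame, background, win_h, win_w, stride, threshold):
    """Slide a win_h x win_w window over the frame and return the (top, left)
    corners whose mean difference from the empty-floor image exceeds threshold."""
    rows, cols = len(frame), len(frame[0])
    hits = []
    for top in range(0, rows - win_h + 1, stride):
        for left in range(0, cols - win_w + 1, stride):
            if window_diff(frame, background, top, left, win_h, win_w) > threshold:
                hits.append((top, left))
    return hits


# Toy 6x6 grayscale images: an empty floor and one frame with a bright 4x2 "person".
background = [[10] * 6 for _ in range(6)]
frame = [row[:] for row in background]
for r in range(1, 5):
    for c in range(2, 4):
        frame[r][c] = 200

hits = find_changed_windows(frame, background, win_h=4, win_w=2, stride=1, threshold=100)
print(hits)
```

The windows that come back are the crops you would hand to the classifier. On real video you would probably want to blur first and refresh the background image as lighting changes, neither of which this sketch does.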
Remember, the problem is "I need to know when I don't have two managers on the floor," not "how do I use machine learning to know when I don't have two managers on the floor."
If we can make up arbitrary rules and assumptions then just have them jot down on a piece of paper when they come and go, and if they are the last to leave then they have to send an email.
The general point is to capitalize on preexisting information rather than attempt the "true" solution, which is error prone and at which even a human might not reach 100% accuracy, because in certain settings (such as this hypothetical) the perfected solution cannot be accomplished without constraints.
Imagine where you worked suddenly introduced this: "Yes, previously everyone could wear whatever they wanted - but from today, just the senior programmers must code while wearing a high-vis jacket around the office so we can track when they are at their desks".
The supervisors have now changed their relationship with coworkers - signaling their superiority, while simultaneously feeling stalked by their bosses, and looking "unfashionable"/un-cool - all because someone couldn't figure out how to do deep learning properly... which was what the OP was actually asking about!
1. Supervisors are already by definition "superior" to their subordinates.
2. Supervisors on factories already wear distinctive clothing - especially in fully automated factories.
Finally, you have yet to propose a solution to the problem yourself that would be highly accurate and easy to train. You vastly underestimate the difficulty of creating a bespoke solution from scratch with no data.
In any case since the supervisor thing was just an example - the original poster's only real choice is to manually label everything, but AI is really problem centric so it's hard to recommend anything without knowing the actual problem. Assuming it really is just [someone in an area for a period of time] kind of problem, and the difficulty is picking apart the 'someone' and you cannot influence their behavior, you just need massive amounts of data. Even then there's no guarantee you'll have high accuracy.
If high accuracy is required the problem itself needs to be examined on a higher level.
A nerd analogy would be making a programmer change OSs (or even text editors) against their will: They could do it, but they won't be very happy about it.
In my experience related to the type of arrangement you're describing - in reality (at least anecdotally speaking) the helmets are often not worn, or the colors are not enforced, or the colors don't get picked up due to poor quality video.
I deal mostly with third-world countries so safety standards are not always the best.
- YOLOv3 is state of the art for speed. I think RetinaNet does better if you have the horsepower.
- I can't recommend FastAI enough for learning things to try.
- 60% on a frame-by-frame basis might be enough; as long as you have a low false positive rate, you can tell. Combine with OpenCV mean shift if you need real time.
- Start small. Show success with pre-trained models, then move on to transfer learning. Start with a small dataset. Agree on a metric beforehand.
- Use a notebook. Play around, don't let it run for days then look at the result.
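To put a rough number on why 60% per frame might be enough: if you take a majority vote over many frames, the vote is right far more often than any single frame. The sketch below is plain binomial arithmetic and leans on one big assumption, namely that per-frame errors are independent. Neighbouring video frames are strongly correlated, so treat the result as an optimistic ceiling, not a prediction.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent per-frame calls are
    correct, when each is correct with probability p. n should be odd."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Per-frame accuracy of 0.6, voted over windows of 1, 11 and 101 frames.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

In practice you would vote over frames sampled far enough apart (say one every few seconds) that their errors are at least somewhat independent.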
As a practical example, figuring out where a given pixel moves from one video frame to the next one, when working on real-world videos, the best known algorithms get about 50% of the pixels correct. With clever filtering, you can maybe bump that to 60 or 70%, but in any case you will be left with a 30%+ error rate.
NVIDIA / Google / Microsoft / Amazon will tell you that you need to buy or rent more GPUs or Cloud GPU servers and do more training with more data. And there's plenty of companies in cheap labor countries offering to do your data annotation at a very reasonable rate. But both of them are just trying to sell to you. They don't care if it will solve your problem, as long as you're feeling hopeful enough to buy their stuff.
Judging from the bad results that even Google / Facebook / NVIDIA show at benchmarks, having a near-unlimited budget is still not enough to make ML work nicely.
Oh and for these image classification networks like YOLO, they have their own flavor of problems: https://www.inverse.com/article/56914-a-google-algorithm-was...
what do you mean by this? optical flow isn't really a learning problem? it's a classical problem with very good classical algorithms
BTW, also the classical algorithms deal very badly with noise and repetitive textures, e.g. a video of a forest in the afternoon.
> Where do I go from here? Keep trying different models?
> ...after [the labeling is] all done is this even going to work?
> [How to label]
> If we spend all this time and money labelling footage, how well is this going to work? How much footage do we need to label?
Generally, you're discussing the space of model improvement and refinement. This is the costliest and most dangerous part of any ML pipeline. Without good evaluation, stakeholder support, and real reason to believe that the algorithm can be improved this is just a hole to throw money into.
The short answer to most questions is that you don't really know. Generally speaking, more data will improve ML algorithm performance, especially if that data is more specific to your problem. That said, more data may not actually substantially improve performance.
You will get much more leverage by using existing systems, accepting whatever error rate you receive, and building systems and processes around these tools to play to their strengths. People have suggested asking the floor managers to wear a certain color. You could also use the probabilistic bounds implied by the accuracies you're seeing to build a system which doesn't replace manual monitoring, but augments it.
Perhaps you can emit a warning when there's a likelihood exceeding some threshold that there aren't enough people on the floor. This makes it easier for the person monitoring manually, catches the worst case scenarios, and helps improve the accuracy of the entire monitoring system.
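As a sketch of what the alerting layer around an imperfect detector could look like: the function below takes whatever smoothed supervisor-count estimate you end up with and only fires once a shortfall has lasted longer than the 20 minutes from the original example. The sample format, function name and one-sample-per-minute cadence are all assumptions made for illustration.

```python
def find_alerts(samples, min_supervisors=2, max_gap_seconds=20 * 60):
    """samples: time-ordered (timestamp_seconds, estimated_supervisor_count).
    Returns the timestamps at which an alert would fire: once per shortfall,
    as soon as the shortfall has lasted longer than max_gap_seconds."""
    alerts = []
    shortfall_start = None
    alerted = False
    for timestamp, count in samples:
        if count >= min_supervisors:
            # Coverage is back; reset the clock.
            shortfall_start = None
            alerted = False
            continue
        if shortfall_start is None:
            shortfall_start = timestamp
        if not alerted and timestamp - shortfall_start > max_gap_seconds:
            alerts.append(timestamp)
            alerted = True
    return alerts


# One sample per minute: 10 minutes covered, 30 minutes short, then covered again.
samples = [(m * 60, 2) for m in range(10)]
samples += [(m * 60, 1) for m in range(10, 40)]
samples += [(m * 60, 2) for m in range(40, 45)]
print(find_alerts(samples))
```

Note that a single spurious "2 supervisors" reading resets the clock here, which is exactly why a low false positive rate matters more than raw per-frame accuracy.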
Not only can these systems be implemented more cheaply, they will provide early wins for your stakeholders and provide groundwork for a case to invest in the actual ML. They might also reduce the problem space that you're working in to a place where you can judge accuracy better and build theories about why the models might be underperforming. This will support experiments to try out new models, augment the system with other models, or even try to fine-tune or improve the models themselves for your particular situation.
In terms of software development lifecycles, it's relatively late in the game when you can afford the often nearly bottomless investment of "machine learning research". Early stages should just implement existing, simple models with minimal variation and work on refining the problem such that bigger tools can be supported down the line if the value is there.
It has been challenging communicating many of these realities to non-technical folks, who seem to be quite misguided about implementing these types of systems as opposed to "non-ML" systems where there is a less imperfect and more predictable idea of what's possible, how well it will work, and how much effort is required to pull it off.
I personally believe this is false, but also false in a way that we're remarkably far away from that. Even more than software, predictive automation is a process. It often relies on particular customization to your own situation to be successful. It can demand vast resources. It's wildly difficult to debug.
So we should be working to retrain those around us. ML is a process.
I was able to answer my own versions of many of those questions after the first few video lessons. It demonstrated to us that our data is a great fit for machine learning. I didn't feel comfortable turning my experiments into something production-worthy but I feel confident enough to at least have conversations about it and sketch out a possible plan for what a contractor could work on this year.
1. Deep learning (by itself) is often a shitty solution. It takes a lot of fiddling with not just the models, but also the training data — to get anything useful. Often the data generation team/effort becomes larger than the model-building effort.
2. It is hopeless to use neural networks as an end-to-end solution. This example will involve studying whether detections are correlated/independent in neighboring frames... whether information can be pooled across frames... whether you can use that to build a robust real-time of the scene of interest, etc. That will involve lots of judicious software system design using broader ideas from ML / statistical reasoning.
This is why I find it hopelessly misleading to tell people to just find tutorials with TensorFlow/Pytorch and get started. You really need to understand what’s going on to be able to build useful systems.
That’s apart from all the thorny ethical questions raised by monitoring humans.
Apologies - I figured the primary intent of my comment - i.e. the questions at the end, would be the focus of most responses
i haven't used it but microsoft has this
>"If we spend all this time and money labelling footage, how well is this going to work?"
"not well at all because we don't have facebook/google scale training data. let's try to figure out a conventional way to do it". for the supervisors problem i would recommend bluetooth beacons.
Start by labeling some data yourself. If you need to scale things up, you're going to need very clear rubrics for how things should be labeled and you're not going to be able to make them without having labeled some data yourself.
Definitely think about what the easiest form of your task is. Labeling bounding boxes is time intensive, labeling whether there are 2 or more supervisors on the floor should be a lot easier, and you can easily label a bunch of frames all at once.
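Here's a rough sketch of the "label a bunch of frames all at once" idea: the labeller only notes time intervals and how many supervisors were visible, and a few lines of code expand that into per-frame binary labels. The interval format, the function name and the 5 fps sampling rate are assumptions for illustration, not something any particular tool produces.

```python
def expand_interval_labels(intervals, fps, min_supervisors=2):
    """intervals: (start_second, end_second, supervisor_count), end exclusive.
    Returns {frame_index: 1 or 0}, where 1 means 'enough supervisors'."""
    labels = {}
    for start, end, count in intervals:
        label = 1 if count >= min_supervisors else 0
        for frame in range(int(start * fps), int(end * fps)):
            labels[frame] = label
    return labels


# Two minutes of footage described with three hand-written intervals.
intervals = [(0, 60, 2), (60, 90, 1), (90, 120, 3)]
labels = expand_interval_labels(intervals, fps=5)
print(len(labels), sum(labels.values()))
```

Three lines of human annotation turn into 600 labelled frames, which is a very different cost from drawing boxes on each of them.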
You're going to need to figure out what tooling you will need for labeling, is this available out of the box, or will you need something custom?
Label X data points yourself and do some transfer learning. Label another X data points and see how much better things get.
The rough rule of thumb is performance increases logarithmically with data. After you have a few points on the curve about how much better things get from more data, fit a logarithmic curve and make a prediction of how much data you will need, though be prepared that you might be off by a factor of 10.
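For what it's worth, the curve fit is only a few lines. This sketch does an ordinary least-squares fit of score = a + b * ln(size) and inverts it to estimate the dataset size for a target score. The measurements are made-up numbers, and the function names are mine.

```python
from math import exp, log


def fit_log_curve(sizes, scores):
    """Least-squares fit of score = a + b * ln(size). Returns (a, b)."""
    xs = [log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def size_needed(a, b, target):
    """Dataset size at which the fitted curve reaches the target score."""
    return exp((target - a) / b)


# Made-up measurements: labelled examples vs. validation accuracy.
sizes = [200, 400, 800, 1600]
scores = [0.62, 0.66, 0.70, 0.74]
a, b = fit_log_curve(sizes, scores)
print(round(size_needed(a, b, 0.90)))
```

With these toy numbers each doubling of data buys 4 points of accuracy, so getting from 0.74 to 0.90 takes four more doublings. That is the kind of answer you can take to a stakeholder, with the factor-of-10 caveat attached.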
As others have mentioned, it's worth thinking about false positive/negative tradeoffs and how much you care about either.
If the numbers you're extrapolating to aren't satisfactory, then yeah, you need to keep messing around with your training until you bend the curve enough that it seems like you'll get there with labeled data.
As in, if you compute your per-frame score and compare it over bigger chunks of time, is it sufficiently different when 2 supervisors are on the floor and when they are not?
In my case it probably used transfer learning on something like a ResNet-152 or Inception. Regardless, it approaches the limits of what an expert in machine learning can accomplish, so you'll know very quickly whether you need higher quality video / yellow vests.
Generally speaking, as classification systems themselves are pretty dumb there isn't really a way to know what architecture will work best for your task, other than trial and error. Of course you can optimize parameters in a less chaotic way (grid-search or AutoML). In my experience it mostly boils down to data. Try augmentation methods, acquiring more data or transfer learning with varying degrees of layer relearning.
In general, annotating data for object detection or segmentation tends to be very hard to do effectively—expect low quality and inconsistent labels.
- Manually scan through a couple of hours of data and set up a human baseline.
- Run standard algorithms and find their accuracy.
- Find errors in the model and analyze why the errors are happening. Is the model classifying some other object as a supervisor? Is the model not classifying the supervisor in certain lighting conditions or scenarios?
- Retrain the model with the failure scenarios so that it learns.
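The error-analysis step in the list above can start as something this simple: tag each reviewed clip by hand with the condition it was shot in, then count where the model is wrong. The conditions and results below are hypothetical, purely to show the shape of the tally.

```python
from collections import Counter


def error_breakdown(results):
    """results: dicts with 'condition', 'truth', 'predicted'.
    Returns {condition: (errors, total, error_rate)}."""
    totals = Counter()
    errors = Counter()
    for r in results:
        totals[r["condition"]] += 1
        if r["truth"] != r["predicted"]:
            errors[r["condition"]] += 1
    return {c: (errors[c], totals[c], errors[c] / totals[c]) for c in totals}


# Hypothetical reviewed clips, tagged by hand with the viewing condition.
results = [
    {"condition": "daylight", "truth": 2, "predicted": 2},
    {"condition": "daylight", "truth": 1, "predicted": 1},
    {"condition": "daylight", "truth": 2, "predicted": 1},
    {"condition": "night", "truth": 2, "predicted": 0},
    {"condition": "night", "truth": 2, "predicted": 1},
    {"condition": "occluded", "truth": 1, "predicted": 2},
    {"condition": "occluded", "truth": 2, "predicted": 2},
]
for condition, (bad, total, rate) in sorted(error_breakdown(results).items()):
    print(f"{condition}: {bad}/{total} wrong ({rate:.0%})")
```

Whichever bucket dominates tells you what footage to label next for retraining.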
In general, it's much better to not use machine learning at all if at all possible.
You can use Google for labelling (Mechanical Turk style), and AutoML Vision to train your model. It's going to be a bit pricey, but cheaper than your time to do the equivalent and will give you an educated guess at how much work it'll be to beat it. It costs about $100 to train a cloud vision model, I think (not including labelling)? You can also try the API for free to see how well Google does at finding people, they have better off the shelf models than you can get publicly.
You can try exploiting other things. Is your scene static? Try using frame differences as a feature. If it's a fixed environment then you should get a boost when fine tuning a model, versus some general person detector. COCO pretrained models should be quite good at finding people out of the box.
I wrote my own labelling tool specifically for Yolo which you may find useful (ie you label your data and export to a train-ready format): https://github.com/jveitchmichaelis/deeplabel
People who are not experienced are usually terrible at tagging images. They're not consistent, they miss objects and they don't understand why it's an issue. It will be faster to pay an "expert" service like Mechanical Turk, or do it yourself.
Basically a lot of your questions are open research problems. How much data do you need? Not a clue. It depends how your model is failing, which is always worth checking anyway. Figure out what the model is bad at and try to improve it; it should be doable to figure out where that 25% is going.
You should do better with a model like Faster-RCNN or its ilk. AutoML will do something like this, and you can try Facebook’s Detectron2 toolkit, or the Tensorflow Object Detection API.
Detecting unique people is a hard problem, by the way (eg two people versus the same person detected twice). You're better off just using an established method like RFID tags for presence/absence.
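For the narrow "same person detected twice in one frame" case, the usual first step is to drop boxes that overlap heavily with a higher-scoring one (greedy non-maximum suppression on intersection-over-union). A minimal sketch, assuming detections arrive as (score, (x1, y1, x2, y2)) tuples and a 0.5 overlap threshold; both are illustrative choices. It does nothing for telling people apart across frames, which is the genuinely hard part.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def suppress_duplicates(detections, iou_threshold=0.5):
    """detections: list of (score, box). Greedy non-maximum suppression:
    keep the highest-scoring box, drop any box overlapping a kept one."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) < iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.90, (10, 10, 50, 100)),    # person A
    (0.75, (12, 14, 52, 104)),    # person A again, slightly shifted
    (0.80, (200, 20, 240, 110)),  # person B
]
print(len(suppress_duplicates(detections)))
```

Most detection toolkits already run something like this internally, so if you still see doubles it is usually a threshold to tune, not code to write.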
Another sibling made a great point. Don't detect people, train a model to output the number of people in the frame. This is how ML is applied to camera trap data with animals. In your case you can reduce this to a binary classification problem - >= 2 people, positive output.
Deep Learning for Programmers: An Interactive Tutorial with CUDA, OpenCL, DNNL, Java, and Clojure.
I've got a lower-end model, second-hand, still not cheap, but it's so cool.
You get real wire-o binding (not spiral binding!), so your book lays open flat on the table and the pages are right next to each other, not slightly displaced vertically like with spiral binding.
And I just did a quick price check, and https://xpress.lulu.com/ will do it for $10 as well, with shipping in 2 or 3 days (US).
This stuff is pretty fresh, so it's understandable, but the NLP chapter would be greatly enhanced by covering these newer topics