This is just a data pipeline. Multiple tools support what you describe. Alternatively, you could run it as a batch job, or use Spark to process it as a stream.
This is how a lot of things work already. Neural nets are bad at strict logic. If you give one net one of two tasks, depending on a state, it takes a lot of training before the learned tasks separate from each other. Ideally, the aspects unique to each task would be completely separate and anything common to both tasks would occur once in the net and get used for both. In practice, sharing one net increases training requirements by so much that you are better off creating two nets and logically separating them with your own code, or even creating a third net to detect which state you are in.
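To make the "two nets plus your own code" idea concrete, here is a minimal sketch. Everything in it is a stand-in: `detect_state`, `net_a` and `net_b` are hypothetical plain functions standing in for trained models, and the state names are made up. The only point is the structure, where ordinary code handles the strict logic and each net handles one task.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[float]


def make_router(
    detect_state: Callable[[Features], str],
    experts: Dict[str, Callable[[Features], float]],
) -> Callable[[Features], float]:
    """Build a function that picks one task-specific model per input."""

    def route(x: Features) -> float:
        state = detect_state(x)   # could itself be a third, small net
        return experts[state](x)  # strict branching lives in plain code

    return route


# Hypothetical stand-ins for trained models (assumptions, not real nets).
def detect_state(x: Features) -> str:
    return "a" if sum(x) >= 0 else "b"


def net_a(x: Features) -> float:
    return max(x)


def net_b(x: Features) -> float:
    return min(x)


route = make_router(detect_state, {"a": net_a, "b": net_b})
```

Calling `route([1.0, 2.0])` dispatches to `net_a`, while `route([-1.0, -3.0])` dispatches to `net_b`. Neither net ever has to learn the branching itself.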
What you describe is how I imagine the first recognizably intelligent AI system being built -- a set of communicating independent modules all running at the same time.
Cool! For any JavaScript devs, this looks like a capable library that attempts the same thing, and it seems to get pretty impressive results: https://github.com/jwagner/smartcrop.js/
Cool tech, but seems like over-engineering. As far as I can tell, Facebook just crops to the center and top of images which works pretty well. Meanwhile Twitter has spent years cropping images awkwardly based on trying to find the relevant regions, and now is doubling down on trying to find relevant regions...
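For comparison, a "center and top" crop is only a few lines of arithmetic. This is a rough sketch of that idea, not Facebook's actual code: the function name and the (left, top, right, bottom) box convention are assumptions made for illustration.

```python
def center_top_crop_box(width, height, target_w, target_h):
    """Return (left, top, right, bottom) for the largest crop with the
    target aspect ratio, centered horizontally and anchored to the top."""
    target_ratio = target_w / target_h
    if width / height > target_ratio:
        # Image is too wide: keep the full height, trim both sides equally.
        crop_h = height
        crop_w = round(height * target_ratio)
    else:
        # Image is too tall: keep the full width, cut off the bottom.
        crop_w = width
        crop_h = round(width / target_ratio)
    left = (width - crop_w) // 2
    top = 0  # anchored to the top edge
    return (left, top, left + crop_w, top + crop_h)
```

A 1000x500 image cropped to a square keeps the middle 500 pixels horizontally, and a tall 400x800 image cropped to 2:1 keeps only the top 200 rows.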
One net to crop the image, one net to recognize it, another to synthesize the words to describe it and presto: a new application.
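Chaining nets like that is just function composition. A toy sketch, where `crop`, `recognize` and `describe` are hypothetical stand-ins for the three nets and the "image" is a plain dict:

```python
from functools import reduce


def compose(*stages):
    """Chain stages left to right: each stage's output feeds the next."""

    def run(x):
        return reduce(lambda acc, stage: stage(acc), stages, x)

    return run


# Hypothetical stand-ins for the three nets (assumptions for illustration).
def crop(image):
    return {**image, "cropped": True}


def recognize(image):
    return image["label"]


def describe(label):
    return f"A photo of a {label}."


describe_image = compose(crop, recognize, describe)
```

Here `describe_image({"label": "cat"})` returns `"A photo of a cat."`. Swapping any stage for a real model leaves the wiring unchanged, as long as each stage accepts what the previous one emits.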