Methods or frameworks for segmenting touching objects

We are currently developing software that classifies food on a canteen tray from a single top-down image. Our problem is that we need a method or an existing framework (not OpenCV, we are already using it :) ) that can segment out objects that touch each other. For example, a tray holds several items: a soup, the main dish next to it, something to drink, and maybe two Snickers bars, and they all touch each other.

What ideas or methods would you suggest for segmenting out each object, so that we can send each one to our neural networks?