Can someone tell me if the following makes sense? I want to use Bayesian networks for sysops. What I have in mind is feeding logs, process information (e.g. resource usage), and other signals into a Bayesian network as nodes, then training it in a production environment. From my admittedly limited understanding, a properly configured network like that should be able to triage issues, infer likely causes, and even take actions to correct them. Of course you would need to limit its ability to act so it doesn't decide to hose your entire system. Since it can observe the system, it could even train itself by observing the results of its own actions.
Does anyone know of anything like this in use, or have ideas why this would be a bad idea, apart from being difficult to get right?
A Bayesian network updates probabilities based on evidence: you can learn the conditional probabilities from data, and an inference algorithm then computes the posterior distribution over the unobserved nodes given what you have observed. However, to suggest actions or further diagnostic tests you also need decision and utility nodes. The extended model is called an influence diagram (or decision network). With these extensions you can determine the best action and the value of information (whether to run another test or act now).
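To make the "update probabilities based on evidence" step concrete, here is a minimal sketch of the smallest possible network: one hidden cause node and one observed alert node. The cause names and all the numbers are made up for illustration; a real network would have many more nodes and you would use a proper inference library rather than this hand-rolled Bayes rule.

```python
# Hypothetical two-node network: Cause -> HighCpuAlert (observed).
# All names and probabilities below are illustrative assumptions.
PRIOR = {"memory_leak": 0.1, "traffic_spike": 0.2, "healthy": 0.7}

# P(high_cpu_alert = True | cause)
LIKELIHOOD = {"memory_leak": 0.6, "traffic_spike": 0.9, "healthy": 0.05}


def posterior(prior, likelihood):
    """Return P(cause | alert observed) using Bayes' rule."""
    unnormalized = {cause: prior[cause] * likelihood[cause] for cause in prior}
    evidence = sum(unnormalized.values())  # P(alert observed)
    return {cause: p / evidence for cause, p in unnormalized.items()}


post = posterior(PRIOR, LIKELIHOOD)
most_likely = max(post, key=post.get)
```

With these numbers, seeing the alert moves most of the probability mass from "healthy" to "traffic_spike", which is the triage behaviour the question is after.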
The actions are chosen based on utilities (cost/benefit), so if an action has a potentially very bad outcome and there is an alternative (e.g. messaging someone), the system will act rationally, consistent with the utility model: it picks whichever action has the highest expected utility.
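As a rough sketch of how the utility model drives the choice, the snippet below picks the action with the highest expected utility given a posterior over causes. The actions, the utility values, and the posteriors are all invented for illustration; in a real influence diagram they would come from your decision and utility nodes.

```python
# Hypothetical utilities: UTILITY[action][true_state].
# Restarting a healthy service is costly; paging someone is cheap but slower.
UTILITY = {
    "restart_service": {"memory_leak": 50, "healthy": -100},
    "page_operator": {"memory_leak": 30, "healthy": -10},
    "do_nothing": {"memory_leak": -80, "healthy": 0},
}


def expected_utility(action, posterior):
    """Sum of utility weighted by the probability of each state."""
    return sum(posterior[state] * UTILITY[action][state] for state in posterior)


def best_action(posterior):
    """Action with the highest expected utility under the posterior."""
    return max(UTILITY, key=lambda action: expected_utility(action, posterior))


# Moderately sure of a leak: the risky restart loses to paging a human.
uncertain = {"memory_leak": 0.7, "healthy": 0.3}
# Very sure of a leak: the restart becomes worth the risk.
confident = {"memory_leak": 0.95, "healthy": 0.05}

choice_uncertain = best_action(uncertain)
choice_confident = best_action(confident)
```

The same utility table gives different actions as the evidence gets stronger, which is how you keep the system from doing something drastic on weak evidence.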
I suggest you look at a package that can handle both Bayesian networks and influence diagrams, such as Hugin, Netica (from Norsys), or GeNIe. There are others, but I haven't tried them.
Sounds like it could work. With any supervised learning algorithm, the key is to have good labeled data that accurately captures the function you want to learn (i.e. data that records errors together with their causes and the proper sysops fix).
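To illustrate what "good labeled data" buys you, here is a minimal sketch that estimates P(symptom | cause) from labeled incident records by counting, with add-one smoothing so unseen combinations don't get probability zero. The record format and the incident names are assumptions made up for this example, not taken from any real system.

```python
from collections import Counter, defaultdict


def learn_cpt(records, alpha=1.0):
    """Estimate P(symptom | cause) from (cause, symptom) pairs.

    Uses add-alpha (Laplace) smoothing over the set of observed symptoms.
    """
    symptoms = sorted({symptom for _, symptom in records})
    counts = defaultdict(Counter)
    for cause, symptom in records:
        counts[cause][symptom] += 1

    cpt = {}
    for cause, counter in counts.items():
        total = sum(counter.values()) + alpha * len(symptoms)
        cpt[cause] = {s: (counter[s] + alpha) / total for s in symptoms}
    return cpt


# Hypothetical labeled incidents: (diagnosed cause, observed symptom).
RECORDS = [
    ("disk_full", "write_errors"),
    ("disk_full", "write_errors"),
    ("disk_full", "high_latency"),
    ("memory_leak", "high_latency"),
    ("memory_leak", "oom_kill"),
]

cpt = learn_cpt(RECORDS)
```

With only a handful of labeled incidents the smoothing term dominates, so the estimates are only as good as the volume and accuracy of the labels.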