"Troubleshooting" is the problem of integrated active diagnosis and repair, in which the repairer must repeatedly select between (imperfect) diagnostic tests and repairs to perform to restore functionality of a faulty system, while minimising the expected outage time and/or repair cost. The problem arises in many technical systems, including power networks, cloud computing, vehicle repair, etc.
Given prior probability distributions over faults and over test false positives/negatives, the troubleshooting problem is a partially observable Markov decision process (POMDP). Finding optimal solutions to general POMDPs is computationally intractable. In practical contexts, however, the relationship between faults, their effects on the system, and the available tests often exhibits a simple structure, such as a tree. We aim to exploit that structure to find optimal troubleshooting strategies tractably. The results may also apply to other problems that can be modelled as POMDPs with specific structure.
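To make the role of the prior and the test error rates concrete, the following is a minimal sketch of the belief update that underlies the POMDP view. It assumes exactly one fault is present and that a test either "covers" a fault or does not. The function name `update_belief`, the fault names, and all numeric values are hypothetical and chosen only for illustration; they do not come from the papers listed below.

```python
from typing import Dict, Set


def update_belief(
    prior: Dict[str, float],
    covers: Set[str],
    positive: bool,
    fp_rate: float,
    fn_rate: float,
) -> Dict[str, float]:
    """Bayes update of a single-fault belief after one imperfect test.

    prior    -- fault -> probability (assumed to sum to 1: exactly one fault)
    covers   -- faults the test is designed to detect
    positive -- the observed test outcome
    fp_rate  -- P(test positive | true fault is not covered)
    fn_rate  -- P(test negative | true fault is covered)
    """
    unnormalised = {}
    for fault, p in prior.items():
        if fault in covers:
            likelihood = (1.0 - fn_rate) if positive else fn_rate
        else:
            likelihood = fp_rate if positive else (1.0 - fp_rate)
        unnormalised[fault] = p * likelihood

    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {fault: v / total for fault, v in unnormalised.items()}


# Illustrative (made-up) numbers: three candidate faults, one noisy disk test.
prior = {"disk": 0.5, "network": 0.3, "config": 0.2}
posterior = update_belief(
    prior, covers={"disk"}, positive=True, fp_rate=0.1, fn_rate=0.2
)
```

In this sketch the belief (the distribution over faults) plays the part of the POMDP state: each test outcome maps one belief to another, and a troubleshooting strategy chooses the next test or repair as a function of the current belief.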
- Heckerman, Breese, Rommelse. Troubleshooting under Uncertainty. Workshop on the Principles of Diagnosis, 1994.
- Xu, Zhu, Sun, Tran, Weber, Fu, Bass. Error diagnosis of cloud application operation using Bayesian networks and online optimisation. European Dependable Computing Conference, 2015.