Probabilistic inference was shown effective in non-deterministic diagnosis of end-to-end service failures. Since exact probabilistic diagnosis is known to be an NP-hard problem, approximate techniques were investigated. They were shown efficient and accurate in isolating root causes of end-to-end disorder in networks composed of tens of nodes but did not scale well to bigger networks. In addition, the requirement that a centralized manager posess a global knowledge of the system structure and state made the techniques difficult to apply in real-life. This paper investigates an approach to improving the scalability and feasibility of probabilistic diagnosis by exploiting the domain semantics of computer networks. The proposed technique divides the computational effort and system knowledge among multiple, hierarchically organized managers. Each manager performs fault localization in the domain it manages and requires only the knowledge of its own domain. We show through simulation that ...
Malgorzata Steinder, Adarshpal S. Sethi