As a centrally managed IP network grows, the cost of monitoring network devices and the number of reported events increase super-linearly. This in turn degrades the performance of the event correlation engine, which is responsible for suppressing dependent events and escalating root-cause events to a network administrator. To address this scalability problem, the paper proposes a distributed framework that partitions the network into smaller management domains and enables concurrent monitoring and event correlation within those domains. The gain in performance, however, comes with the challenge of correlating cross-domain events, which arise when a failure in one domain induces events in other domains. The paper investigates such situations and shows that, in the worst case, it would be impossible to determine the root cause.
IEEE SRDS 2009 - 28th International Symposium on Reliable Distributed Systems
September 27-30, 2009. Niagara Falls, New York
"A Framework for Distributed Monitoring and Root Cause Analysis for Large IP Networks", Dipyaman Banerjee, Venkateswara Reddy M, Mudhakar Srivatsa