Overlay network monitoring enables distributed Internet applications to detect and recover from path outages and periods of degraded performance within seconds. For an overlay network with n end hosts, existing systems either require O(n^2) measurements, and thus lack scalability, or can only estimate latency but not congestion or failures. Unlike other network tomography systems, we characterize end-to-end losses (this extends to any additive metric, including latency) rather than individual link losses. We find a minimal basis set of k linearly independent paths that can fully describe all the O(n^2) paths. We selectively monitor and measure the loss rates of these paths, then apply them to estimate the loss rates of all other paths. By extensively studying synthetic and real topologies, we find that for reasonably large n (e.g., 100), k is only in the range of O(n log n). This is explained by the moderately hierarchical nature of Internet routing. Our scheme only assumes the ...
Yan Chen, David Bindel, Randy H. Katz
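
The basis-path idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy topology, and the greedy rank test used to pick the basis are all assumptions made for illustration. It also assumes link losses are independent, so that log(1 - loss) is additive along a path and every path is a row of a 0/1 path-by-link matrix G.

```python
import numpy as np

def select_basis_paths(G):
    """Greedily pick row indices of G that span its row space (hypothetical helper)."""
    basis = []
    for i in range(G.shape[0]):
        candidate = basis + [i]
        # Keep path i only if it is linearly independent of the paths kept so far.
        if np.linalg.matrix_rank(G[candidate]) == len(candidate):
            basis = candidate
    return basis

def estimate_all_loss_rates(G, basis, measured_loss):
    """Infer the loss rate of every path from measurements on the basis paths only."""
    # Assumed transform: log(1 - loss) is additive over the links of a path.
    b_basis = np.log(1.0 - np.asarray(measured_loss, dtype=float))
    # Minimum-norm least-squares solution; individual link values need not be
    # identifiable, but their sums along any path in the row space are.
    x, *_ = np.linalg.lstsq(G[basis], b_basis, rcond=None)
    return 1.0 - np.exp(G @ x)

# Toy topology (illustrative): 6 paths over 4 links; links 0 and 1 always appear together.
G = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

# Made-up link loss rates, used only to generate consistent path measurements.
link_loss = np.array([0.01, 0.02, 0.05, 0.0])
true_path_loss = 1.0 - np.exp(G @ np.log(1.0 - link_loss))

basis = select_basis_paths(G)            # [0, 1, 3]: k = 3 paths monitored instead of 6
estimated = estimate_all_loss_rates(G, basis, true_path_loss[basis])
print(basis, np.allclose(estimated, true_path_loss))
```

In this toy case only three of the six paths are measured, and the remaining three are recovered exactly because their rows are linear combinations of the basis rows. For larger path matrices, a rank-revealing factorization would be a more efficient way to select the basis than repeated rank computations.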