Sciweavers

INFOCOM
2011
IEEE

Optimal sampling algorithms for frequency estimation in distributed data

Consider a distributed system with n nodes where each node holds a multiset of items. In this paper, we design sampling algorithms that allow us to estimate the global frequency of any item with a standard deviation of εN, where N denotes the total cardinality of all these multisets. Our algorithms have a communication cost of O(n + √n/ε), which is never worse than the O(n + 1/ε²) cost of uniform sampling, and could be much better when n ≪ 1/ε². In addition, we prove that one version of our algorithm is instance-optimal in a fairly general sampling framework. We also design algorithms that achieve optimality on the bit level, by combining Bloom filters of various granularities. Finally, we present some simulation results comparing our algorithms with previous techniques. Other than the performance improvement, our algorithms are also much simpler and easily implementable in a large-scale distributed system.
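To make the cost comparison in the abstract concrete, the short Python sketch below evaluates the leading terms of the two bounds for a few values of n and ε. It is only an illustration under stated assumptions: the hidden constants of the O-notation are dropped, and the function names `uniform_sampling_cost` and `proposed_cost` are hypothetical, not taken from the paper. The "never worse" claim is consistent with the AM-GM inequality, since √n/ε = √(n · 1/ε²) ≤ (n + 1/ε²)/2, so n + √n/ε is at most 1.5 times n + 1/ε².

```python
import math


def uniform_sampling_cost(n: int, eps: float) -> float:
    """Leading terms of the O(n + 1/eps^2) bound, constants dropped (assumption)."""
    return n + 1.0 / eps**2


def proposed_cost(n: int, eps: float) -> float:
    """Leading terms of the O(n + sqrt(n)/eps) bound, constants dropped (assumption)."""
    return n + math.sqrt(n) / eps


if __name__ == "__main__":
    # Illustrative parameter choices only; they are not taken from the paper.
    for n, eps in [(100, 0.001), (10_000, 0.001), (1_000_000, 0.001)]:
        u = uniform_sampling_cost(n, eps)
        p = proposed_cost(n, eps)
        print(f"n={n:>9}  eps={eps}  uniform~{u:,.0f}  proposed~{p:,.0f}  ratio={u / p:.1f}")
```

With these leading terms, the gap is largest when n is far below 1/ε² and vanishes as n approaches 1/ε², which matches the abstract's statement.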
Zengfeng Huang, Ke Yi, Yunhao Liu, Guihai Chen
Added 30 Aug 2011
Updated 30 Aug 2011
Type Journal
Year 2011
Where INFOCOM