Duke Computer Science Colloquium
Heavy Hitters Via Cluster-Preserving Clustering
Monday, October 31, 2016
12:00pm - 1:00pm
D106 LSRC, Duke
Pizza will be served prior to Dr. Nelson's talk at 11:45.
In the "heavy hitters" or "frequent items" problem, one must process a stream of items and report those items that occur frequently. For example, a telecommunications company may wish to find popular destination IP addresses in a packet stream across one of their links, or a search engine may wish to report popular query words. A more general problem arises when there are two streams and we must report those items whose frequencies deviate significantly between them; the former problem is a special case, since we can artificially pretend that the first of the two streams is the empty stream. Such a problem naturally arises in trend detection and anomaly detection.
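For context, the one-stream frequent-items problem can be illustrated with the classic Misra-Gries summary (a textbook baseline, not the algorithm of this talk): with k-1 counters it finds every item occurring more than len(stream)/k times, possibly along with some false positives.

```python
def misra_gries(stream, k):
    """Misra-Gries summary: using at most k-1 counters, return a candidate
    set guaranteed to contain every item with frequency > len(stream)/k."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Stream item plus every tracked item each "pay" one count;
            # drop counters that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
candidates = misra_gries(stream, k=3)  # "a" occurs 5 > 9/3 times, so it must appear
```

A second pass over the stream can then compute exact counts for the (small) candidate set to remove false positives.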
Several algorithms in the literature solve these problems, such as the CountMin sketch, CountSketch, Hierarchical CountSketch and others. The goal in designing such algorithms is (1) to guarantee finding frequent items under as lax a definition of "frequent" as possible while still limiting output size, while keeping low the (2) memory consumption, (3) processing time per stream item, (4) query time to report the list of frequent items, and (5) failure probability. Previous solutions could perform well on various subsets of these metrics, but not on all 5 simultaneously. We design a new algorithm, ExpanderSketch, which performs well on all 5. Our main innovation is a novel reduction to a new graph-clustering problem we formulate, in which finding most of every cluster guarantees finding all the frequent items. We then solve this problem by devising a novel spectral clustering algorithm, potentially of independent interest, based on divide-and-conquer and local search.
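To make the CountMin sketch mentioned above concrete, here is a minimal illustrative implementation (a toy with simple salted hashing, not the ExpanderSketch of the talk): d rows of w counters each, where a point query returns the minimum over rows and overestimates the true count by at most (stream length)/w per row with constant probability.

```python
import random

class CountMinSketch:
    """Toy CountMin sketch: `depth` hash rows of `width` counters.
    query() never underestimates; the overestimate shrinks as width grows
    and the failure probability shrinks exponentially in depth."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # One random salt per row stands in for a pairwise-independent hash.
        self.salts = [rng.getrandbits(32) for _ in range(depth)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def update(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # Each cell only overestimates, so the minimum is the tightest bound.
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch(width=2048, depth=4)
for dst in ["10.0.0.1"] * 5 + ["10.0.0.2"]:
    cms.update(dst)
estimate = cms.query("10.0.0.1")  # at least 5; exactly 5 absent collisions
```

Note that a plain CountMin sketch only answers point queries; recovering the list of all frequent items efficiently (metric 4 above) is exactly where hierarchical schemes, and this talk's ExpanderSketch, come in.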
Jelani Nelson is an Assistant Professor of Computer Science at the John A. Paulson School of Engineering and Applied Sciences at Harvard University. His research focuses on theoretical computer science, and more specifically on algorithms for large and high-dimensional data sets.
Hosted by: Pankaj Agarwal