On Gapped Set Intersection Size Estimation

Chen Chen, Jianbin Qin, Wei Wang

Published in CIKM, 2015

There exists considerable literature on estimating the cardinality of set intersection result. In this paper, we consider a generalized problem for integer sets where, given a gap parameter _, two elements are deemed as matches if their numeric difference equals _ or is within _. We call this problem the gapped set intersection size estimation (GSISE), and it can be used to model applications in database systems, data mining, and information retrieval. We first distinguish two subtypes of the estimation problem: the point gap estimation and range gap estimation. We propose optimized sketches to tackle the two problems efficiently and effectively with theoretical guarantees. We demonstrate the usage of our proposed techniques in mining top-K related keywords efficiently, by integrating with an inverted index. Finally, substantial experiments based on a large subset of the ClueWed09 dataset demonstrate the efficiency and effectiveness of the proposed methods.

PDF BibTex