Generalizing the Pigeonhole Principle for Similarity Search in Hamming Space

Jianbin Qin, Chuan Xiao, Yaoshu Wang, Wei Wang, Xuemin Lin, Yoshiharu Ishikawa, and Guoren Wang.

Published in TKDE, 2019

A distance search in Hamming space finds binary vectors whose Hamming distances are no more than a threshold from a query vector. It is a fundamental problem in many applications, such as image retrieval, near-duplicate Web page detection, and scientific databases. State-of-the-art approaches to Hamming distance search are mainly based on the pigeonhole principle to generate a set of candidates and then verify them. We observe that the constraint by the pigeonhole principle is not always tight and may bring about unnecessary candidates. We also observe that the distribution in real data is often skew, but most existing solutions adopt a simple equi-width partitioning and allocate the same threshold to all the parts, hence failing to exploit the data skewness to optimize query processing. In this paper, we propose a new form of the pigeonhole principle which allows variable partitioning and threshold allocation. Based on the new principle, we develop a tight constraint of candidates and devise cost-aware methods for partitioning and threshold allocation to optimize query processing. In addition, we extend our methods to answer Hamming distance join queries. We also discuss the application of the pigeonhole principle in set similarity search, a problem that can be converted to Hamming distance search equivalently. Our evaluation on datasets with various data distributions shows the robustness of our solution and its superior query processing performance to the state-of-the-art methods.

PDF BibTex