Efficient Similarity Query Processing


More information here

Similarity matching is one of the oldest computer science fundamental problems. It tries to recognise similar individuals based on the information known about them. This problem plays a significant role in many data management applications, including schema matching, entity matching, data cleaning, record linkage, plagiarism detection, etc. As a result, over the past few decades, it has received significant attention. Many information systems nowadays store and process a massive amount of data, such as customer profiles, emails and electronic documents, and text data on the Internet (e.g., Web pages, and tweets). These data are usually modelled as strings, sets or binary vectors in Computer Science. Interestingly, string, set and binary are three most common data presentation in modelling real world entities. For example, the human genome and proteins are modelled and analysed as sequences (i.e., strings) of amino acids for sequence alignment and clustering, a web document is modelled as a set of tokens for efficient duplicate detection, and digital photos are modelled as a sequence of binary feature values in Google's Image Search.

In this talk, I focus on similarity query processing in various data types. 1) Similarity Query on String with Edit Similarity. 2) Similarity Query on Set with Overlap similarity. 3) Similarity Query on Binary with Hamming Distance. To accommodate each different data set. Various of different index structure is discussed. In the end, this talk will discuss the prefix filtering and its pro and cons when dealing with similarity queries.