TY - GEN
T1 - Efficient processing of hamming-distance-based similarity-search queries over MapReduce
AU - Tang, Mingjie
AU - Yu, Yongyang
AU - Aref, Walid G.
AU - Malluhi, Qutaibah M.
AU - Ouzzani, Mourad
N1 - Publisher Copyright:
© 2015, Copyright is with the authors.
PY - 2015
Y1 - 2015
N2 - Similarity search is crucial to many applications. Of particular interest are two flavors of the Hamming distance range query, namely, the Hamming select and the Hamming join (Hamming-select and Hamming-join, respectively). Hamming distance is widely used in approximate near neighbor search for high dimensional data, such as images and document collections. For example, using predefined similarity hash functions, high-dimensional data is mapped into one-dimensional binary codes that are, then linearly scanned to perform Hamming-distance comparisons. These distance comparisons on the binary codes are usually costly and, often involves excessive redundancies. This paper introduces a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries. An efficient search algorithm based on the HA-index is presented. A distributed version of the HA-index is introduced and algorithms for realizing Hamming distance-select and Hamming distance-join operations on a MapReduce platform are prototyped. Extensive experiments using real datasets demonstrates that the HA-index and the corresponding search algorithms achieve up to two orders of magnitude speedup over existing state-of-the-art approaches, while saving more than ten times in memory space.
AB - Similarity search is crucial to many applications. Of particular interest are two flavors of the Hamming distance range query, namely, the Hamming select and the Hamming join (Hamming-select and Hamming-join, respectively). Hamming distance is widely used in approximate near neighbor search for high dimensional data, such as images and document collections. For example, using predefined similarity hash functions, high-dimensional data is mapped into one-dimensional binary codes that are, then linearly scanned to perform Hamming-distance comparisons. These distance comparisons on the binary codes are usually costly and, often involves excessive redundancies. This paper introduces a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries. An efficient search algorithm based on the HA-index is presented. A distributed version of the HA-index is introduced and algorithms for realizing Hamming distance-select and Hamming distance-join operations on a MapReduce platform are prototyped. Extensive experiments using real datasets demonstrates that the HA-index and the corresponding search algorithms achieve up to two orders of magnitude speedup over existing state-of-the-art approaches, while saving more than ten times in memory space.
UR - https://www.scopus.com/pages/publications/84976287044
U2 - 10.5441/002/edbt.2015.32
DO - 10.5441/002/edbt.2015.32
M3 - Conference contribution
AN - SCOPUS:84976287044
T3 - EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings
SP - 361
EP - 372
BT - EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings
A2 - Popa, Lucian
A2 - Alonso, Gustavo
A2 - Van den Bussche, Jan
A2 - Barcelo, Pablo
A2 - Teubner, Jens
A2 - Paredaens, Jan
A2 - Ugarte, Martin
A2 - Geerts, Floris
PB - OpenProceedings.org, University of Konstanz, University Library
T2 - 18th International Conference on Extending Database Technology, EDBT 2015
Y2 - 23 March 2015 through 27 March 2015
ER -