Fast locality-sensitive hashing

Anirban Dasgupta; Ravi Kumar; Tamas Sarlos

doi:10.1145/2020408.2020578

Outline

Fast locality-sensitive hashing

Anirban Dasgupta

2011, Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

https://doi.org/10.1145/2020408.2020578

visibility

…

description

9 pages

link

1 file

Abstract

Locality-sensitive hashing (LSH) is a basic primitive in several large-scale data processing applications, including nearest-neighbor search, de-duplication, clustering, etc. In this paper we propose a new and simple method to speed up the widely-used Euclidean realization of LSH. At the heart of our method is a fast way to estimate the Euclidean distance between two d-dimensional vectors; this is achieved by the use of randomized Hadamard transforms in a non-linear setting. This decreases the running time of a (k, L)parameterized LSH from O(dkL) to O(d log d + kL). Our experiments show that using the new LSH in nearest-neighbor applications can improve their running times by significant amounts. To the best of our knowledge, this is the first running time improvement to LSH that is both provable and practical.

Figures (6)

PROOF. Let us define random variables 6; = 1 if 7 € Si, and apply Bernstein’s inequality [29, Theorem 2.7] to the random vari- ables w; = 5; v7. Note that The next theorem shows that the collision probabilities of ACHash are very similar to that of the naive LSH. Let u = ||a — y|| and let

Recall. The first column in Figure 2 shows the recall levels achieved at the chosen parameter tuples for each of the radius values. As dis- cussed above, the recall values are very close to each other, but are not aligned perfectly. The differences are at most a few percentage points. We observe that the recall for DHHash is almost always more than that of Naive at the chosen parameter tuple, thus mak- ing our claims of query time improvements justified.

(a) Recall, LSH query time, and LSH computation time for FLICKR (b) Recall, LSH query time, and LSH computation time for QUERIES.

Figure 3: Number of LSH candidates and space usage for all four datasets.

References (33)

REFERENCES
N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302-322, 2009.
N. Ailon and E. Liberty. Fast dimension reduction using Rademacher series on dual BCH codes. Discrete and Computational Geometry, 42(4):615-630, 2009.
A. Andoni. Nearest Neighbor Search: the Old, the New, and the Impossible. PhD thesis, MIT, 2009.
A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117-122, 2008.
Y. Bachrach and E. Porat. Fast pseudo-random fingerprints. Arxiv preprint arXiv:1009.5791, 2010.
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proc. 16th WWW, pages 131-140, 2007.
A. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proc. 6th WWW, pages 391-404, 1997.
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630-659, 2000.
M. S. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th STOC, pages 380-388, 2002.
S. Chien and N. Immorlica. Semantic similarity between search engine queries using temporal correlation. In Proc. 14th WWW, pages 2-11, 2005.
S. Danziger, J. Zeng, Y. Wang, R. Brachmann, and R. Lathrop. Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants. Bioinformatics, 23(13):i104, 2007.
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. 20th SOCG, pages 253-262, 2004.
K. Eshghi and S. Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In Proc. 14th KDD, pages 221-229, 2008.
G. Feigenblat, E. Porat, and A. Shiftan. Even better framework for min-wise based algorithms. Arxiv preprint arXiv:1102.3537, 2011.
I. K. Fodor. A survey of dimension reduction techniques. Technical Report UCRL-ID-148494, Lawrence Livermore National Laboratory, 2002.
A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. 25th VLDB, pages 518-529, 1999.
M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115-1145, 1995.
J. He, W. Liu, and S. Chang. Scalable similarity search with optimized kernel hashing. In Proc. 16th KDD, pages 1129-1138, 2010.
M. R. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In Proc. 29th SIGIR, pages 284-291, 2006.
W. Hoeffding. Probability inequalities for sums of bounded random variables. J. ASA, 58(301):13-30, 1963.
R. Horn and C. Johnson. Matrix analysis. Cambridge Univ Press, 1990.
P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th STOC, pages 604-613, 1998.
P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-preserving hashing in multidimensional spaces. In Proc. 29th STOC, pages 618-625, 1997.
I. Ipsen and R. Rehman. Perturbation bounds for determinants and characteristic polynomials. SIAM J. Matrix Analysis and Applications, 30(2):762-776, 2008.
N. Koudas and D. Srivatsava. Approximate joins: Concepts and techniques. In Proc. 31st VLDB, page 1363, 2005.
E. Liberty, N. Ailon, and A. Singer. Dense fast random projections and lean Walsh transforms. In Proc. 12th RANDOM, pages 512-522, 2008.
Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In Proc. VLDB, pages 950-961, 2007.
G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. 16th WWW, pages 141-150, 2007.
C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics, volume 16, pages 195-248. Springer, 1998.
R. Panigrahy. Entropy-based nearest neighbor search in high dimensions. In Proc. 17th SODA, pages 1186-1195, 2006.
J. Vybiral. A variant of the Johnson-Lindenstrauss lemma for circulant matrices. J. Functional Analysis, 260(4):1096-1105, 2011.
R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity search methods in high dimensional spaces. In Proc. 24th VLDB, pages 194-205, 1998.

Fast locality-sensitive hashing

Sign up for access to the world's latest research

Abstract

Related papers

References (33)

Related papers

Related topics

Cited by