A One-Pass Space-Efficient Algorithm for Finding Quantiles

Abstract

We present an algorithm for finding the quantile values of a large unordered dataset with unknown distribution. The algorithm has the following features: i) it requires only one pass over the data; ii) it is space efficient: it uses a small bounded amount of memory independent of the number of values in the dataset; and iii) the true quantile is guaranteed to lie within the lower and upper bounds produced by the algorithm. Empirical evaluation using synthetic data with various distributions as well as real data shows that the bounds obtained are quite tight. The algorithm has several applications in database systems, for example in database governors, query optimization, load balancing in multiprocessor database systems, and data mining.
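The three properties listed above can be illustrated with a much simpler technique than the one the paper develops. The sketch below is not the paper's algorithm: it is a fixed-bucket counting scheme, and it assumes the caller already knows boundaries (`edges`, a hypothetical parameter) that span the data, which the paper's setting of unknown distribution does not grant. The function name `quantile_bounds` and the rank convention (the quantile is the element of rank ceil(phi * n) in sorted order) are likewise assumptions made for illustration.

```python
import math
from bisect import bisect_right


def quantile_bounds(values, phi, edges):
    """Return (lower, upper) bounds that bracket the phi-quantile.

    Illustrative sketch only, not the algorithm from the paper.
    values: any iterable; it is read exactly once.
    phi:    quantile fraction, 0 < phi <= 1.
    edges:  sorted bucket boundaries, ASSUMED to satisfy
            edges[0] <= v <= edges[-1] for every value v.
    """
    # Memory is one counter per bucket, independent of the data size.
    counts = [0] * (len(edges) - 1)
    n = 0
    for v in values:  # the single pass over the data
        if not edges[0] <= v <= edges[-1]:
            raise ValueError("value outside the assumed range")
        # Bucket i covers [edges[i], edges[i+1]); the last bucket also
        # includes its right edge.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
        n += 1
    if n == 0:
        raise ValueError("empty input")

    # 1-based rank of the quantile element in sorted order.
    rank = max(1, math.ceil(phi * n))

    # The element of that rank lies in the first bucket whose cumulative
    # count reaches the rank, so the bucket edges bracket the true quantile.
    seen = 0
    for i, c in enumerate(counts):
        seen += c
        if seen >= rank:
            return edges[i], edges[i + 1]
```

Calling `quantile_bounds(stream, 0.5, edges)` on any iterable returns an interval guaranteed to contain the median under the stated assumption. The tightness of the bounds here depends entirely on how well the fixed buckets match the data, which is the limitation an algorithm for unknown distributions has to overcome.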
