Robust methodologies for modeling web click distributions

2007, Proceedings of the 16th international conference on World Wide Web - WWW '07

https://doi.org/10.1145/1242572.1242642

Abstract

Metrics such as click counts are vital to online businesses, but measuring them has been problematic because they include high-variance robot traffic. We posit that, by applying statistical methods more rigorous than those employed to date, we can build a robust model of the distribution of clicks, from which we can set probabilistically sound thresholds to address outliers and robots. Prior research in this domain has used inappropriate statistical methodology to model distributions, and current industrial practice eschews this research in favor of conservative ad hoc click-level thresholds. The prevailing belief is that such distributions are scale-free power law distributions, but using more rigorous statistical methods we find that the best description of the data is instead provided by a scale-sensitive Zipf-Mandelbrot mixture distribution. Our results are based on ten datasets from various verticals in the Yahoo domain. Since mixture models can overfit the data, we take care to use the BIC log-likelihood method, which penalizes overly complex models. Using a mixture model in the web activity domain makes sense because there are likely multiple classes of users. In particular, we have noticed that a significantly large set of "users" visits the Yahoo portal exactly once a day. We surmise these may be robots testing internet connectivity by pinging the Yahoo main website. Our quantitative analysis is backed by graphical analysis in which empirical distributions are plotted against theoretical distributions in log-log space using robust cumulative distribution plots. This methodology has two advantages: first, plotting in log-log space allows one to visually differentiate the various exponential distributions; second, cumulative plots are much more robust to outliers. We plan to use the results of this work to remove robot traffic from web metrics business intelligence systems.
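The ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a truncated Zipf-Mandelbrot form p(k) proportional to (k + q)^(-s) on k = 1..n_max, which may differ from the parameterization used in the paper, and it uses the standard Schwarz criterion BIC = p ln(n) - 2 ln(L). All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def zipf_mandelbrot_pmf(n_max, s, q):
    """Truncated Zipf-Mandelbrot pmf on k = 1..n_max (assumed form).

    p(k) is proportional to (k + q) ** -s; q = 0 gives a plain power law.
    """
    weights = [(k + q) ** -s for k in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mixture_pmf(pmf_a, pmf_b, weight_a):
    """Two-component mixture of pmfs defined on the same support."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pmf_a, pmf_b)]


def log_likelihood(counts, pmf):
    """Log-likelihood of click counts, each an integer in 1..len(pmf)."""
    return sum(math.log(pmf[c - 1]) for c in counts)


def bic(log_lik, n_params, n_obs):
    """Schwarz criterion: lower is better; each parameter costs ln(n_obs)."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


def empirical_ccdf(counts):
    """Return (value, fraction of observations >= value), sorted by value."""
    n = len(counts)
    freq = Counter(counts)
    remaining = n
    points = []
    for value in sorted(freq):
        points.append((value, remaining / n))
        remaining -= freq[value]
    return points


# Toy per-user click counts (made up for illustration only).
clicks = [1, 1, 1, 1, 2, 2, 3, 5, 8, 40]
n_max = max(clicks)

# Candidate models with hand-picked (not fitted) parameters.
power_law = zipf_mandelbrot_pmf(n_max, s=1.5, q=0.0)
zm = zipf_mandelbrot_pmf(n_max, s=1.5, q=0.5)
zm_mix = mixture_pmf(zm, zipf_mandelbrot_pmf(n_max, s=3.0, q=0.0), 0.7)

# Parameter counts: power law (s), Zipf-Mandelbrot (s, q), mixture (s, q, s2, weight).
for name, pmf, n_params in [("power law", power_law, 1),
                            ("Zipf-Mandelbrot", zm, 2),
                            ("mixture", zm_mix, 4)]:
    score = bic(log_likelihood(clicks, pmf), n_params, len(clicks))
    print(f"{name}: BIC = {score:.2f}")

# Points for a log-log cumulative plot: (log value, log tail fraction).
loglog_points = [(math.log(v), math.log(p)) for v, p in empirical_ccdf(clicks)]
```

In practice the parameters would be estimated by maximum likelihood before the BIC values are compared, and the log-log points would be plotted against the corresponding theoretical tail probabilities. The cumulative form is used here because each point aggregates every observation at or above a value, so a single extreme count moves the curve only slightly.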
G.3 [Mathematics of Computing]: Probability and Statistics

Modern businesses rely on accurate counts of web pageviews and clicks to calculate growth rates and market share. Per-user page-views in the millions (per month) and other

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
