Streaming Random Forests

2007, 11th International Database Engineering and Applications Symposium (IDEAS 2007)

https://doi.org/10.1109/IDEAS.2007.4318108

Abstract

Recent research on data-stream mining targets applications that must process huge volumes of data, such as sensor-data analysis and financial applications. Data-stream mining algorithms must meet the requirements of stream-management systems: they must be online and incremental, processing each data record only once (or a few times); adaptive to changes in the data distribution; and fast enough to keep up with high arrival rates. We consider the problem of data-stream classification and introduce Streaming Random Forests, an online, incremental stream-classification ensemble algorithm that extends Breiman's Random Forests, a standard classification algorithm. Our algorithm is designed for multi-class classification problems. It handles evolving data streams in which training and test records arrive at random rates, and it automatically adjusts its parameters based on the data seen so far. Experimental results on real and synthetic data show that, without losing classification accuracy, the algorithm handles multi-class problems in which the underlying class boundaries drift, as well as cases in which blocks of training records are too small to build or update the classification model.
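The stream requirements named in the abstract (online, incremental, each record processed once, multi-class, ensemble voting) can be illustrated with a minimal sketch. This is not the Streaming Random Forests algorithm: as a simplifying assumption, each ensemble member here is an incremental nearest-centroid classifier over a random feature subset rather than a decision tree, and there is no drift handling or parameter adaptation. The names `CentroidMember` and `StreamingEnsemble` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


class CentroidMember:
    """Incremental nearest-centroid classifier on a random feature subset.

    Assumption: stands in for a tree so the sketch stays short.
    """

    def __init__(self, n_features, subset_size, rng):
        # Each member sees only a random subset of the features.
        self.features = rng.sample(range(n_features), subset_size)
        self.sums = defaultdict(lambda: [0.0] * subset_size)
        self.counts = Counter()

    def update(self, x, y):
        # One pass per record: only running sums and counts are stored.
        sums = self.sums[y]
        for i, f in enumerate(self.features):
            sums[i] += x[f]
        self.counts[y] += 1

    def predict(self, x):
        if not self.counts:
            return None  # no training record seen yet

        def squared_distance(label):
            n = self.counts[label]
            return sum(
                (x[f] - s / n) ** 2
                for f, s in zip(self.features, self.sums[label])
            )

        return min(self.counts, key=squared_distance)


class StreamingEnsemble:
    """Majority vote over members that are updated record by record."""

    def __init__(self, n_members, n_features, subset_size, seed=0):
        rng = random.Random(seed)
        self.members = [
            CentroidMember(n_features, subset_size, rng)
            for _ in range(n_members)
        ]

    def update(self, x, y):
        for member in self.members:
            member.update(x, y)

    def predict(self, x):
        votes = Counter(
            p for p in (m.predict(x) for m in self.members) if p is not None
        )
        return votes.most_common(1)[0][0] if votes else None
```

Because each member keeps only per-class running sums, memory use does not grow with the length of the stream, and training and test records can be interleaved in any order by alternating calls to `update` and `predict`.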
