Twitter-scale New Event Detection via K-term Hashing

Abstract

First Story Detection (FSD) is hard because the most accurate systems become progressively slower with each document they process. We present a novel approach to FSD that operates in constant time and space and scales to very high-volume streams. We show that, when computing novelty over a large dataset of tweets, our method runs 192 times faster than a state-of-the-art baseline without sacrificing accuracy. Our method can perform FSD on the full Twitter stream using a single core of modest hardware.
