Twitter-scale New Event Detection via K-term Hashing
Abstract
First Story Detection (FSD) is hard because the most accurate systems become progressively slower with each document they process. We present a novel approach to FSD that operates in constant time and space and scales to very high-volume streams. We show that when computing novelty over a large dataset of tweets, our method runs 192 times faster than a state-of-the-art baseline without sacrificing accuracy. Our method can perform FSD on the full Twitter stream using a single core of modest hardware.
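The abstract does not spell out the algorithm, so the following is only a rough sketch of how a hashing-based novelty score can stay within constant time and space. It assumes that each document is broken into combinations of up to `max_k` terms, that each combination is hashed into a fixed-size bit array, and that novelty is the fraction of combinations not seen before. The class name `KTermNoveltySketch`, the parameters `num_bits` and `max_k`, the whitespace tokenizer, and the scoring rule are all illustrative assumptions, not the authors' implementation.

```python
import hashlib
from itertools import combinations


class KTermNoveltySketch:
    """Hypothetical fixed-size memory of hashed term combinations."""

    def __init__(self, num_bits=1 << 20, max_k=2):
        self.num_bits = num_bits  # memory is allocated once and never grows
        self.max_k = max_k        # longest term combination that is hashed
        self.bits = bytearray(num_bits // 8 + 1)

    def _position(self, kterm):
        # Stable hash of one term combination, mapped to a bit position.
        digest = hashlib.md5(" ".join(kterm).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def _kterms(self, text):
        # Assumption: naive lowercase whitespace tokenization.
        terms = sorted(set(text.lower().split()))
        for k in range(1, self.max_k + 1):
            yield from combinations(terms, k)

    def _is_set(self, pos):
        return bool(self.bits[pos >> 3] & (1 << (pos & 7)))

    def novelty(self, text):
        """Return the fraction of unseen combinations, then store them."""
        positions = [self._position(kt) for kt in self._kterms(text)]
        if not positions:
            return 0.0
        # Score against the memory as it was before this document arrived.
        unseen = sum(1 for p in positions if not self._is_set(p))
        for p in positions:
            self.bits[p >> 3] |= 1 << (p & 7)
        return unseen / len(positions)


sketch = KTermNoveltySketch()
first = sketch.novelty("earthquake hits tokyo")   # nothing stored yet
repeat = sketch.novelty("earthquake hits tokyo")  # every combination stored
```

In this sketch the work per document depends only on the number of terms in that document, not on how many documents came before, and the bit array has a fixed size. The trade-off is that hash collisions can make an unseen combination look familiar, so scores are approximate.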