Final Documentation

Abstract

This documentation covers the use of three clustering techniques, K-Means, Gaussian Mixture Models (GMM), and HDBSCAN, to analyze and classify data derived from tweets. Tweets are represented through Word2Vec and tf-idf measures, which the approach uses as its dimensionality-reduction step. Clustering performance is evaluated with the Silhouette Score, the Calinski-Harabasz Index, and the Davies-Bouldin Index.
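As a rough illustration of the evaluation described above, the following sketch clusters a handful of tf-idf vectors with K-Means and a GMM and computes the three metrics with scikit-learn. The toy tweets, the choice of three clusters, and the `evaluate` helper are assumptions made for this example and are not taken from the paper; Word2Vec features and HDBSCAN are left out to keep the sketch dependent on scikit-learn only.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.mixture import GaussianMixture

# Hypothetical toy tweets; the real corpus is not reproduced here.
tweets = [
    "truck breakdown on main road causing slow traffic",
    "bus breakdown near market slow traffic",
    "car breakdown on highway traffic moving slowly",
    "flyover closed for repair use service road",
    "flyover work in progress expect delays",
    "flyover traffic heavy near junction",
    "marathon event today roads diverted",
    "road blocked due to procession diversion in place",
    "diversion near stadium road blocked for rally",
]

# tf-idf yields one weighted term vector per tweet.
# GaussianMixture and two of the metrics expect a dense array.
dense = TfidfVectorizer().fit_transform(tweets).toarray()


def evaluate(labels):
    """Return the three internal clustering metrics for one labelling."""
    return {
        "silhouette": silhouette_score(dense, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(dense, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(dense, labels),        # lower is better
    }


k = 3  # assumed number of clusters for the toy data

kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(dense)
gmm_labels = GaussianMixture(
    n_components=k, covariance_type="diag", random_state=0
).fit_predict(dense)

for name, labels in [("kmeans", kmeans_labels), ("gmm", gmm_labels)]:
    print(name, evaluate(labels))
```

The three metrics need only the feature matrix and the predicted labels, so the same helper can score any labelling. For HDBSCAN the noise points (label -1) would normally be filtered out first.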

Lastly, we see that the noise contains an amalgamation of residual artifacts from the previous clusters. Based on these wordclouds, we can identify 4 major categories of traffic-related incidents:

  1. Breakdown related: Cluster 3
  2. Flyover related (can be a part of breakdowns): Cluster 1
  3. City programs: Clusters 2 and 0
  4. Road blockade/diversions: Cluster 4

References

  1. Szymanski, G., & Ciota, Z. (2002). Hidden Markov Models Suitable for Text Generation. WSEAS International Conference on Signal, Speech and Image Processing (WSEAS ICOSSIP 2002), 1, 3081-3084.
  2. Parikh, R. (2012). ET: Events from Tweets. 613-620.
  3. Lofi, C., & Krestel, R. (2012). iParticipate: Automatic tweet generation from local government data. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7239 LNCS(PART 2), 295-298. https://doi.org/10.1007/978-3-642-29035-0_24
  4. Tapiador, A., Carrera, D., & Salvachúa, J. (2012). Social stream, a social network framework. 1st International Conference on Future Generation Communication Technologies, FGCT 2012, 52-57. https://doi.org/10.1109/FGCT.2012.6476557
  5. Sakaki, T., Okazaki, M., & Matsuo, Y. (2013). Tweet analysis for real-time event detection and earthquake reporting system development. IEEE Transactions on Knowledge and Data Engineering, 25, 919-931. https://doi.org/10.1109/TKDE.2012.29
  6. Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7819 LNAI(PART 2), 160-172. https://doi.org/10.1007/978-3-642-37456-2_14
  7. Lloret, E., & Palomar, M. (2013). Towards automatic tweet generation: A comparative study from the text summarization perspective in the journalism genre. Expert Systems with Applications, 40(16), 6624-6630. https://doi.org/10.1016/j.eswa.2013.06.021
  8. Rosi, A., Mamei, M., & Zambonelli, F. (2013). Integrating social sensors and pervasive services: Approaches and perspectives. International Journal of Pervasive Computing and Communications, 9(4), 294-310. https://doi.org/10.1108/IJPCC-09-2013-0022
  9. Anastasi, G., Antonelli, M., Bechini, A., Brienza, S., D'Andrea, E., De Guglielmo, D., … Segatori, A. (2013). Urban and social sensing for sustainable mobility in smart cities. 2013 Sustainable Internet and ICT for Sustainability, SustainIT 2013. https://doi.org/10.1109/SustainIT.2013.6685198
  10. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. 1-12. Retrieved from http://arxiv.org/abs/1301.3781
  11. Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 238-247.
  12. Konstas, I. (2014). Joint Models for Concept-to-text Generation.
  13. D'Andrea, E., Ducange, P., Lazzerini, B., & Marcelloni, F. (2015). Real-Time Detection of Traffic from Twitter Stream Analysis. IEEE Transactions on Intelligent Transportation Systems, 16(4), 2269-2283. https://doi.org/10.1109/TITS.2015.2404431
  14. Atefeh, F., & Khreich, W. (2015). A Survey of Techniques for Event Detection in Twitter. Computational Intelligence, 31(1), 132-164.
  15. Gutierrez, C., Figueiras, P., Oliveira, P., Costa, R., & Jardim-Goncalves, R. (2015). Twitter mining for traffic events detection. Proceedings of the 2015 Science and Information Conference, SAI 2015, 371-378. https://doi.org/10.1109/SAI.2015.7237170
  16. Sidhaye, P., & Cheung, J. C. K. (2015). Indicative Tweet Generation: An Extractive Summarization Problem? (September), 138-147. https://doi.org/10.18653/v1/d15-1014
  17. Campello, R. J. G. B., Moulavi, D., Zimek, A., & Sander, J. (2015). Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM Transactions on Knowledge Discovery from Data, 10(1), 1-51. https://doi.org/10.1145/2733381
  18. Medina, C. P., & Ramon, M. R. R. (2015). Using TF-IDF to Determine Word Relevance in Document Queries Juan. New Educational Review, 42(4), 40-51. https://doi.org/10.15804/tner.2015.42.4.03
  19. Tran, K., & Popowich, F. (2016). Automatic Tweet Generation From Traffic Incident Data. 59-66. https://doi.org/10.18653/v1/w16-3512
  20. McInnes, L., & Healy, J. (2017). Accelerated Hierarchical Density Based Clustering. IEEE International Conference on Data Mining Workshops, ICDMW, 2017-Novem, 33-42. https://doi.org/10.1109/ICDMW.2017.12
  21. McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. The Journal of Open Source Software, 2(11), 205. https://doi.org/10.21105/joss.00205
  22. Sherstinsky, A. (2018). Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network. 1-39. Retrieved from http://arxiv.org/abs/1808.03314
  23. Natural Language Toolkit NLTK 3.4.3 documentation, https://www.nltk.org/, last accessed 16/05/2019
  24. pandas Python Data Analysis Library, https://pandas.pydata.org/, last accessed 15/05/2019
  25. Industrial-Strength Natural Language Processing in Python, https://spacy.io/, last accessed 16/05/2019
  26. gensim topic modelling for humans, https://radimrehurek.com/gensim/, last accessed 17/05/2019
  27. scikit-learn Machine learning in Python, https://scikit-learn.org/stable/, last accessed 27/05/2019
  28. NumPy, https://www.numpy.org/, last accessed 27/05/2019
  29. matplotlib, https://matplotlib.org/, last accessed 16/05/2019
  30. BokehJS, https://bokeh.pydata.org/en/latest/docs/dev_guide/bokehjs.html, last accessed 16/05/2019
  31. The hdbscan Clustering Library, https://hdbscan.readthedocs.io/en/latest/, last accessed 20/05/2019
  32. HDBSCAN, Fast Density based Clustering, the How and the Why - John Healy, https://www.youtube.com/watch?v=dGsxd67IFiU&feature=youtu.be, last accessed 20/05/2019
  33. tweet-clustering, https://github.com/pksohn/tweet-clustering, last accessed 20/05/2019
  34. Comparing Python Clustering Algorithms, https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html, last accessed 20/05/2019
  35. The Illustrated Word2vec, https://jalammar.github.io/illustrated-word2vec/, last accessed 17/05/2019
  36. Do More with Twitter Data, https://twitterdev.github.io/do_more_with_twitter_data/clustering-users.html, last accessed 14/05/2019

import matplotlib.pyplot as plt

# Label the axes and title of the tweets-per-cluster figure
plt.ylabel('# of tweets')
plt.xlabel('cluster label')
plt.title('Classification of tweets with kmeans')
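The fragment above only labels a figure; the plotting call that precedes it is not part of this excerpt. A minimal sketch of how such a tweets-per-cluster bar chart could be produced is shown below. The `labels` list is hypothetical data standing in for K-Means output with five clusters (0 to 4), and the output filename is likewise an assumption.

```python
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical K-Means labels, one per tweet (clusters 0 to 4).
labels = [0, 0, 1, 2, 2, 2, 3, 4, 4, 1, 3, 3]

# Count how many tweets fall into each cluster.
counts = Counter(labels)
cluster_ids = sorted(counts)
tweets_per_cluster = [counts[c] for c in cluster_ids]

# One bar per cluster, labelled as in the fragment above.
fig, ax = plt.subplots()
ax.bar([str(c) for c in cluster_ids], tweets_per_cluster)
ax.set_ylabel("# of tweets")
ax.set_xlabel("cluster label")
ax.set_title("Classification of tweets with kmeans")
fig.savefig("tweets_per_cluster.png")
```

With HDBSCAN labels, the same code would show the noise points as an extra bar labelled -1.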