Classification of Textual Data in Distributed Environment
2018 Second International Conference on Advances in Computing, Control and Communication Technology (IAC3T)
Data is now generated at a very fast pace and in large volumes through internet usage and other sources, a phenomenon termed Big Data. A large portion of the generated data is text collected from emails, blogs, social networking sites, e-commerce reviews, etc., which requires deep analysis to extract meaningful patterns for applications such as business decision making, social media monitoring, and spam detection. The volume makes this data difficult to process and store, so it must be handled using parallel computing tools and machine learning algorithms. In this work, we use a Naive Bayes classifier to classify textual data in a Hadoop environment using Mahout. The experiment is carried out on the 20 Newsgroups dataset and achieves an accuracy of 88.38%. After evaluating the results, we found that when the number of Hadoop clusters is increased, processing speed increases, as Apache Hadoop can process large datasets efficiently using the map-reduce paradigm.
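As an illustration of the technique named in the abstract, the following is a minimal sketch of a multinomial Naive Bayes text classifier whose counting phase is written as a map step and a reduce step. It is not the Mahout implementation used in the paper and it does not run on Hadoop; the function names (`map_document`, `reduce_counts`, `train`, `classify`), the Laplace smoothing parameter `alpha`, and the four-document toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict
from functools import reduce


def map_document(label, text):
    """Map step (hypothetical helper): emit one document's word counts keyed by label."""
    return label, Counter(text.lower().split())


def reduce_counts(acc, mapped):
    """Reduce step (hypothetical helper): merge one document's counts into the totals."""
    label, counts = mapped
    acc["docs"][label] += 1
    acc["words"][label].update(counts)
    return acc


def train(examples):
    """Aggregate per-class document and word counts over (label, text) pairs."""
    init = {"docs": Counter(), "words": defaultdict(Counter)}
    mapped = (map_document(label, text) for label, text in examples)
    return reduce(reduce_counts, mapped, init)


def classify(model, text, alpha=1.0):
    """Return the label with the highest log-posterior under Laplace smoothing."""
    vocab = {w for counts in model["words"].values() for w in counts}
    total_docs = sum(model["docs"].values())
    best_label, best_score = None, -math.inf
    for label, n_docs in model["docs"].items():
        counts = model["words"][label]
        denom = sum(counts.values()) + alpha * len(vocab)
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            if word in vocab:  # words unseen in training are ignored
                score += math.log((counts[word] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy corpus, invented for illustration only (not the 20 Newsgroups data).
TOY_DATA = [
    ("sport", "the team won the hockey game"),
    ("sport", "a great baseball game last night"),
    ("tech", "the new graphics card driver crashed"),
    ("tech", "install the driver and reboot the computer"),
]

model = train(TOY_DATA)
prediction = classify(model, "hockey game tonight")
print(prediction)
```

Because the training phase only sums counts, the per-document map outputs can be computed independently and merged in any order, which is the property that lets this kind of classifier be distributed under the map-reduce paradigm.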
Papers by Bhaskar Pant