Key research themes
1. How can semantic and structural attributes improve context-based email classification?
This research area focuses on using the semantic and structural characteristics of emails to improve classification accuracy. By representing emails not merely as text but as structured entities (e.g., graphs capturing semantic roles and event types), classifiers can better distinguish closely related classes such as social, personal, and professional emails. This approach moves beyond traditional bag-of-words or keyword models by incorporating the contextual and layout features inherent in emails, which matters for applications such as event management and prioritization.
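As a rough illustration of treating an email as a structured entity rather than flat text, the sketch below turns a message into a small list of labelled edges plus a few structural features. It is a minimal sketch, not the method of any particular study: the function name `email_to_graph`, the relation names, and the `EVENT_WORDS` lexicon are hypothetical, and it assumes a plain-text, non-multipart message.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parseaddr

# Hypothetical event lexicon; a real system would derive event types differently.
EVENT_WORDS = {"meeting", "deadline", "party", "conference"}

def email_to_graph(raw):
    """Return (edges, features) for a plain-text, non-multipart email."""
    msg = message_from_string(raw)
    sender = parseaddr(msg.get("From", ""))[1]
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    body = msg.get_payload()

    # Edges are (source, relation, target) triples.
    edges = [(sender, "sent_to", r) for r in recipients]
    words = set(re.findall(r"[a-z]+", body.lower()))
    edges += [("message", "mentions_event", w) for w in sorted(words & EVENT_WORDS)]

    # Structural (layout-level) features that a bag-of-words model would miss.
    features = {
        "n_recipients": len(recipients),
        "is_reply": msg.get("Subject", "").lower().startswith("re:"),
        "n_body_lines": len(body.strip().splitlines()),
    }
    return edges, features

raw = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Re: project meeting\n"
    "\n"
    "Hi both,\n"
    "The meeting moves to Friday; the deadline stays.\n"
)
edges, features = email_to_graph(raw)
```

The edge list and feature dictionary could then feed a downstream classifier alongside, or instead of, word counts.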
2. What machine learning models and feature engineering techniques yield high performance in spam email detection?
The surge in spam emails calls for robust, efficient spam detection systems. This research theme investigates supervised learning algorithms (such as Naive Bayes, Support Vector Machines (SVM), Random Forests, and boosting-based ensembles) together with feature extraction strategies such as TF-IDF, bag-of-words, and word embeddings. It examines how these algorithms perform on benchmark datasets (e.g., Enron, Spambase, Ling-Spam) in terms of precision, recall, and accuracy, while also considering computational efficiency and adaptability to evolving spam tactics.
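To make the pairing of a bag-of-words representation with a Naive Bayes classifier concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, along with a precision and recall helper. The class name, the tokenizer, and the four training sentences are illustrative assumptions and are not taken from any of the benchmark datasets named above.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens (bag-of-words view)."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in tokenize(text):
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[label][w] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

def precision_recall(y_true, y_pred, positive="spam"):
    """Precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy training data, invented for illustration only.
train_texts = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train_texts, train_labels)
```

Calling `model.predict(...)` on held-out messages and passing the results to `precision_recall` gives the kind of per-class metrics that comparisons on the benchmark datasets report; swapping raw counts for TF-IDF weights or embeddings changes only the feature step.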
3. How do specialized language and regional characteristics influence email classification techniques?
This research question addresses the challenges and methods involved in classifying emails written in specific languages, particularly Arabic, whose morphology and syntax differ markedly from those of widely studied languages such as English. The focus is on adapting deep learning and natural language processing approaches to cope with limited training data, complex morphology, and language-specific lexicons so that business emails can be classified effectively. Understanding these tailored models is essential for accurate automatic email classification and filtering in regional and low-resource language contexts.
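One common way to reduce surface variation in Arabic text before classification is light orthographic normalization. The sketch below is an assumption about a typical preprocessing step, not a method attributed to the studies summarized here: it strips diacritics and tatweel, unifies alef variants, and maps alef maqsura to ya. The function name `normalize_arabic` is hypothetical.

```python
import re

# Tashkeel (short-vowel and related marks, U+064B to U+0652) plus tatweel (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, or hamza below.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")

def normalize_arabic(text):
    """Apply light orthographic normalization to Arabic text."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)  # unify to bare alef
    text = text.replace("\u0649", "\u064A")   # alef maqsura -> ya
    return text

# A vocalized spelling of the name "Ahmad", written with escapes for clarity.
vocalized = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F"
normalized = normalize_arabic(vocalized)
```

Normalizing in this way lets differently written forms of the same word share one feature, which can help when training data is limited; whether each rule helps depends on the task, since some distinctions it removes are meaningful.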