International Journal of Electrical and Computer Engineering (IJECE), Apr 1, 2024
Heart disease (HD) accounts for more deaths every year than other illnesses. The World Health Organization (WHO) estimated that heart disease caused 17.9 million deaths in 2016, representing 31% of all deaths worldwide. Three-quarters of these deaths occur in low- and middle-income nations. Machine learning (ML), owing to its precision in pattern recognition and classification, has proven effective in supporting decision-making and risk prediction from the large volume of HD data generated by the healthcare sector. This study therefore aims to develop a logistic regression model (LRM) for predicting the ten-year risk of HD. The study explores different methodologies for improving the performance of a base LRM in predicting whether a person will develop HD within ten years. The results demonstrate the capability of the LRM in predicting ten-year HD risk. The LRM achieves 97.35% accuracy with recursive feature elimination and random under-sampling. This implies that the LRM can play an important role in precautionary measures against HD.
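The random under-sampling step named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `random_under_sample`, the fixed seed, and the toy data are all assumptions made for illustration, and the paper's actual pipeline (including recursive feature elimination) is not reproduced here.

```python
import random
from collections import defaultdict


def random_under_sample(rows, labels, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)  # fixed seed (assumption) so the sketch is repeatable
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    # Every class is reduced to the size of the rarest class.
    target = min(len(group) for group in by_class.values())
    sampled_rows, sampled_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            sampled_rows.append(row)
            sampled_labels.append(label)
    return sampled_rows, sampled_labels


# Hypothetical toy data: six negative records and two positive records.
rows = [[i] for i in range(8)]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
balanced_rows, balanced_labels = random_under_sample(rows, labels)
```

Under-sampling discards majority-class records, so it trades training data for balance; whether that trade is worthwhile depends on the dataset.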
This paper compares the performance of different supervised learning algorithms for email spam detection. The comparison considers performance measures such as the area under the curve (AUC), F-score, precision, and the confusion matrix. The paper evaluates eight supervised learning algorithms. The first stage collected the email dataset from the Kaggle repository. The second stage involved pre-processing, duplicate removal, and feature scaling. After the pre-processing stage, the study employed a synthetic minority technique to balance the samples representing spam and non-spam emails in the dataset. In the final stage, the supervised learning algorithms were trained on the pre-processed dataset and the test results were analyzed. The comparison shows that the random forest (RF) model achieves higher accuracy than the other models, reaching 96.6% accuracy in email spam detection. Thus, the results demonstrate that different models tend to perform differently in email spam detection.
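Precision and F-score, two of the measures listed in this abstract, follow directly from the confusion matrix. The sketch below shows the standard definitions; the function name `classification_metrics` and the counts passed to it are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common metrics from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, with spam treated as the positive class.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```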
Chronic kidney disease is one of the leading causes of death around the world. Early detection of chronic kidney disease is crucial to reducing the mortality caused by the disease. Machine learning methods have recently become popular for the detection of chronic kidney disease. This study investigates the influence of resampling on chronic kidney disease detection using an imbalanced chronic kidney disease dataset. Choosing an optimal feature subset for medical datasets is important for improving the performance of data-driven predictive models. The influence of imbalanced class distribution on predictive models has become an increasingly important topic due to recent advances in automatic decision-making processes and the continuous expansion in the volume of data collected by medical institutions. To address the identified research gap, an experimental evaluation of the synthetic minority oversampling and near miss undersampling techniques was performed on a real-world chronic kidney disease dataset using several classification methods: decision tree, random forest, K-nearest neighbor, adaptive boosting, and support vector machine. The results demonstrate that a number of variables, including performance metrics, classification algorithm, and dataset characteristics, influence the best class distribution. The study also offers useful information about resampling methods for imbalanced classification problems, which will help improve classification accuracy.
International Journal of Electrical and Computer Engineering (IJECE)
This study investigates the Shapley additive explanation (SHAP) of the extreme gradient boosting (XGBoost) model for breast cancer diagnosis. The study employed Wisconsin’s breast cancer dataset, characterized by 30 features extracted from an image of a breast cell. The SHAP module generated explainer values representing the impact of each feature on breast cancer diagnosis. The experiment computed SHAP values for the 569 samples of the breast cancer dataset. The SHAP explanation indicates that perimeter and concave points have the highest impact on breast cancer diagnosis. SHAP explains the XGBoost model's diagnosis outcome by showing the features that affect the model. The developed XGBoost model achieves an accuracy of 98.42%.
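The idea behind SHAP values is the Shapley value from cooperative game theory: each feature's average marginal contribution over all orderings of the features. The sketch below computes exact Shapley values for a toy additive value function. It is not the SHAP library and not the study's XGBoost model; the feature names and weights are hypothetical, and exact enumeration is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical additive "model": each present feature adds a fixed amount.
weights = {"perimeter": 3.0, "concave_points": 2.0, "texture": 0.5}


def toy_value(coalition):
    return sum(weights[feature] for feature in coalition)


phi = shapley_values(list(weights), toy_value)
```

For an additive value function like this one, each feature's Shapley value equals its own weight, and the values sum to the value of the full feature set.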
International Journal of Power Electronics and Drive Systems, Jun 1, 2023
Feature selection improves the classification performance of machine learning models. It also identifies the important features and eliminates those with little significance. Furthermore, feature selection reduces the dimensionality of training and testing data points. This study proposes a feature selection method that uses a multivariate sample similarity measure. The method selects features with significant contributions using a machine-learning model. The multivariate sample similarity measure is evaluated using the University of California, Irvine heart disease dataset and compared with existing feature selection methods. The evaluation uses metrics such as minimum subset selected, accuracy, F1-score, and area under the curve (AUC). The results show that the proposed method identifies chest pain, thallium scan, and major vessels scanned using X-rays as features with a high capability to distinguish between healthy and heart disease patients, with 99.6% accuracy.
Heart disease identification is one of the most challenging tasks and requires highly experienced cardiologists. However, in developing nations such as Ethiopia, there are few cardiologists, and heart disease detection is more challenging. As an alternative to relying on cardiologists, this study proposes a more effective model for heart disease detection by employing random forest and sequential feature selection (SFS). SFS is an effective approach to improving the performance of the random forest model on heart disease detection. SFS removes unrelated features in the heart disease dataset that tend to mislead the random forest model. Thus, removing inappropriate and duplicate features from the training set with the sequential feature selection approach plays a significant role in improving the performance of the proposed model. The proposed feature selection approach is evaluated using a real-world clinical heart disease dataset collected from University of California Irvine (...
Bulletin of Electrical Engineering and Informatics
Distributed denial of service is a form of cyber-attack that involves sending a large volume of network traffic to a target system such as a DHCP, domain name server (DNS), or HTTP server. The attack aims to exhaust the computing resources of the target system, such as memory and the processor, thereby blocking legitimate users from accessing the service provided by the server. Network intrusion prevention ensures the security of a network and protects the server from such attacks. Thus, this paper presents a predictive model that identifies distributed denial of service attacks (DDSA) using Bernoulli-Naive Bayes. The developed model is evaluated on a publicly available Kaggle dataset. The method is tested with a confusion matrix, receiver operating characteristics (ROC) curve, and accuracy to measure its performance. The experimental results show an 85.99% accuracy in detecting DDSA with the proposed method. Hence, the Bernoulli-Naive Bayes-based method was found to be effective and significant...
Bulletin of Electrical Engineering and Informatics
The objective of this study is to evaluate the effectiveness of different regression models in concrete compressive strength estimation. A concrete compressive strength dataset is employed to fit the regression models. Regression models such as linear regressor, ridge regressor, k-neighbors regressor, decision tree regressor, random forest regressor, gradient boosting regressor, AdaBoost regressor, and support vector regressor are used to develop the model that predicts the concrete strength. Cross-validation techniques and grid search are used to tune the parameters for better model performance. The Python 3.8 programming language is used to conduct the experiment. The performance evaluation reveals that the gradient boosting regressor performs better than the other models in terms of root mean square error (RMSE).
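The RMSE metric used for this comparison can be written out in a few lines. The sketch below shows the standard definition only; the function name `rmse` and the sample values are hypothetical and unrelated to the study's dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical compressive strengths (observed) and model predictions.
observed = [30.0, 42.5, 55.0]
predicted = [28.0, 44.5, 55.0]
error = rmse(observed, predicted)
```

A lower RMSE means predictions sit closer to the observed values, which is the basis on which regressors are ranked in a comparison of this kind.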
Bulletin of Electrical Engineering and Informatics
This article evaluates the performance of the support vector machine (SVM), decision tree (DT), and random forest (RF) on a dataset that contains the medical records of 299 patients with heart failure (HF) collected at the Faisalabad Institute of Cardiology and the Allied hospital in Pakistan. The dataset contains 13 descriptive features covering physical, clinical, and lifestyle information. The study compared the performance of the three classification algorithms with pre-processing techniques such as min-max scaling and principal component analysis (PCA). The simulation results show that the performance of the DT and RF decreased with dimensionality reduction, while the SVM improved with dimensionality reduction. The SVM achieved 84.44%. Thus, feature scaling improves the performance of the SVM. The RF performs at 82.22% and the DT at 81.11%, and the SVM shows an improvement of 1.64% with scaled features compared to the original dataset.
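Min-max scaling, one of the pre-processing steps mentioned in this abstract, maps each feature to the range [0, 1]. A minimal sketch follows; the function name `min_max_scale`, the choice to map constant columns to zero, and the sample values are assumptions made for illustration.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map it to zeros (an assumed convention).
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Hypothetical ejection-fraction readings.
scaled = min_max_scale([10, 20, 30])
```

Distance- and margin-based models such as the SVM are sensitive to feature ranges, which is consistent with the abstract's observation that scaling helped the SVM.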
Indonesian Journal of Electrical Engineering and Computer Science
The existing heart failure risk prediction models are developed based on machine learning predictors. The objective of this study is to identify the key risk factors that affect the survival time of heart patients and to develop a heart failure survival prediction model using the identified risk factors. A Cox proportional hazard regression method is applied to generate the proposed heart failure survival model. We used the dataset from the University of California Irvine (UCI) clinical heart failure data repository. To develop the model, we used multiple risk factors: age, anemia, creatinine phosphokinase, diabetes history, ejection fraction, presence of high blood pressure, platelet count, serum creatinine, sex, and smoking history. Among the risk factors, high blood pressure is identified as one of the novel risk factors for heart failure. We validated the performance of the model via statistical and empirical validation. The experimental result shows that the pro...
International Journal of Informatics and Communication Technology (IJ-ICT), 2021
In this study, the author proposes a k-nearest neighbor (KNN) based heart disease prediction model and conducts an experiment to evaluate its predictive performance. The heart disease data were obtained from the Kaggle machine learning data repository. The dataset consists of 1025 observations, of which 499 (48.68%) are heart disease negative and 526 (51.32%) are heart disease positive. The performance of the KNN algorithm is analyzed on the test set. The analysis of the experimental results on the Kaggle heart disease data shows that the accuracy of the KNN is 91.99%.
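The KNN decision rule behind a model of this kind can be sketched as a majority vote among the nearest training points. This is a generic illustration using Euclidean distance, not the study's model: the function name `knn_predict`, the value of `k`, and the toy points are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Pair each training row's Euclidean distance to the query with its label.
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature records: label 0 = negative, label 1 = positive.
train_rows = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_labels = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(train_rows, train_labels, [1.1, 1.0])
```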
Recent years have seen an upsurge in the acceptance of illness diagnosis and prediction using ML algorithms. An ML model can be employed in the diagnosis of breast cancer. In this research, an effective breast cancer prediction model with a grid search approach is presented. Grid search is applied to the random forest approach to find the best n-estimator value, which may provide the highest possible accuracy for predicting breast cancer. The accuracy of the proposed model is then compared with that of a standard random forest model (RFM). The experimental result analysis demonstrates that the optimized model has 97.07 percent accuracy, whereas the regular random forest technique has an accuracy of 94.73 percent in breast cancer detection.
Proceedings of Engineering and Technology Innovation
This study aims to explore the effectiveness of the Shapley additive explanation (SHAP) technique in developing a transparent, interpretable, and explainable ensemble method for heart disease diagnosis using random forest algorithms. First, the features with high impact on heart disease prediction are selected by SHAP using 1025 heart disease records obtained from a publicly available Kaggle data repository. Next, the features with the greatest influence on heart disease prediction are used to develop an interpretable ensemble learning model that automates heart disease diagnosis by employing the SHAP technique. Finally, the performance of the developed model is evaluated. The SHAP values are used to obtain better performance in heart disease diagnosis. The experimental result shows that 100% prediction accuracy is achieved with the developed model. In addition, the experiment shows that age, chest pain, and maximum heart rate have positive impact on the pre...
Bulletin of Electrical Engineering and Informatics
Explaining why a model outputs diabetes positive or negative is crucial for diabetes diagnosis, because reasoning about the predictive outcome helps to understand why the model assigned an instance to the diabetes-positive or diabetes-negative class. In recent years, high predictive accuracy and promising results have been achieved with models ranging from simple linear models to complex deep neural networks. However, complex models such as ensembles and deep learning involve a trade-off between accuracy and interpretability. In response to the problem of interpretability, different approaches have been proposed to explain the predictive outcome of complex models. However, the relationship between the proposed approaches and the preferred approach for diabetes prediction is not clear. To address this problem, the authors aimed to implement and compare existing model interpretation approaches, local interpretable model agnostic explanation (LIME), Shapley additive explanation (SHAP) and permutation f...
IAES International Journal of Artificial Intelligence (IJ-AI), 2021
In this study, a breast cancer prediction model is proposed with decision tree and adaptive boosting (AdaBoost). Furthermore, an extensive experimental evaluation of the predictive performance of the proposed model is conducted. The study is conducted on a breast cancer dataset collected from the Kaggle data repository. The dataset consists of 569 observations, of which 212 or 37.25% are benign or breast cancer negative and 62.74% are malignant or breast cancer positive. The class distribution shows that the dataset is highly imbalanced, and a learning algorithm such as decision tree is biased toward the benign observations, resulting in poor performance on predicting the malignant observations. To improve the performance of the decision tree on the malignant observations, a boosting algorithm, namely adaptive boosting, is employed. Finally, the predictive performance of the decision tree and adaptive boosting is analyzed. The analysis on predictive performance of the model on the Kaggle b...
Indonesian Journal of Electrical Engineering and Computer Science, 2022
Breast cancer is the most common type of cancer, occurring mostly in females. In recent years, many researchers have devoted themselves to automating the diagnosis of breast cancer by developing different machine learning models. However, the quality and quantity of features in a breast cancer diagnostic dataset have a significant effect on the accuracy and efficiency of a predictive model. Feature selection is an effective method for reducing the dimensionality and improving the accuracy of a predictive model. The purpose of feature selection is to determine the features required for training the model and to remove irrelevant and duplicate features. A duplicate feature is a feature that is highly correlated with another feature. The objective of this study is to conduct experimental research on three different feature selection methods for breast cancer prediction. Sequential, embedded and chi-square feature selection are implemented using a breast cancer diagnostic dataset. The study compares the performance of sequential embedde...
Heart disease is one of the causes of death throughout the world. Heart disease cannot be easily identified by medical experts and practitioners, as its detection requires expertise and experience. Hence, developing better-performing models for heart disease detection using machine-learning algorithms is crucial for detecting heart disease at an early stage. However, employing a machine learning algorithm involves determining the relationship between the heart failure dataset features. In this study, correlation analysis is employed to identify the relationship among the heart failure dataset features, and a predictive model for heart failure detection is developed with K-nearest neighbor (KNN). Pearson correlation is employed to identify the relationship between the features in the heart failure dataset, and the effect of strong correlation with the target feature on the performance of the KNN model is analyzed. The experimental result shows that highly correlated features significantly affected the performance of KNN for heart failure detection. Finally, the performance of KNN is evaluated, and the result reveals that the model has an acceptable level of performance, with a highest accuracy of 97.07% on heart failure prediction.
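The Pearson correlation used in this analysis measures the linear relationship between two features on a scale from -1 to 1. The sketch below shows the standard formula; the function name `pearson` and the sample values are hypothetical and are not taken from the heart failure dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical feature column and target column.
feature = [1.0, 2.0, 3.0, 4.0]
target = [2.0, 4.0, 6.0, 8.0]
r = pearson(feature, target)
```

Computing this coefficient between each feature and the target is one way to rank features before fitting a model such as KNN.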
Papers by Tsehay Assegie