Papers by Georgios K. Kostopoulos

A Semi-Supervised Regression Algorithm for Grade Prediction of Students in Distance Learning Courses
International Journal on Artificial Intelligence Tools, Jun 1, 2019
Applying data mining methods in the educational field has gained a lot of attention among researchers in recent years. Educational Data Mining has turned into an effective tool for uncovering hidden relationships in educational data and predicting students’ learning outcomes. Several supervised methods have been successfully applied with the purpose of identifying students at risk of failing or of predicting their academic performance. Recently, the implementation of Semi-Supervised Learning (SSL) methods in the educational process indicated their superiority over the supervised ones. SSL is an emerging subfield of machine learning seeking to effectively exploit a small pool of labeled examples together with a large pool of unlabeled ones. On this basis, a small number of students’ data from previous years may be used as the training set of a learning model to predict future outcomes of current students. A number of rewarding studies deal with the implementation of classification methods in the educational field, in contrast to regression, which is deemed to be a scarcely explored task. In this paper, a novel semi-supervised regression (SSR) algorithm is presented for predicting the final grade of undergraduate students in a distance online course. To the best of our knowledge, there is no study dealing with the implementation of SSR methods in the educational field. A plethora of attributes related to students’ characteristics, academic performance and interaction within the course online platform form the training set, while several experiments were carried out confirming the superiority of the proposed algorithm over familiar regression methods. The experimental results show that the predictive performance of the proposed algorithm increases significantly over time, achieving a MAE value of less than 1.2358 before the middle of the academic year, which provides the advantage of early warnings and interventions.
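For readers unfamiliar with the metric quoted in the abstract above, the mean absolute error (MAE) is the average absolute difference between predicted and actual values. The following is a minimal sketch only; the function name and the grade values are hypothetical and are not taken from the paper.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all students."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical final grades (illustrative values only, not from the study)
actual_grades = [7.0, 5.5, 8.0, 4.0]
predicted_grades = [6.0, 6.5, 8.5, 5.5]
mae = mean_absolute_error(actual_grades, predicted_grades)
```

A lower MAE means the predicted grades lie closer, on average, to the grades students actually obtained.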

IEEE Transactions on Learning Technologies, Apr 1, 2019
Educational Data Mining has gained a lot of attention among scientists in recent years and constitutes an efficient tool for unraveling the concealed knowledge in educational data. Recently, Semi-Supervised Learning (SSL) methods have been gradually implemented in the educational process, demonstrating their usability and effectiveness. Co-training is a representative SSL method aiming to exploit both labeled and unlabeled examples, provided that each example is described by two feature views. Nevertheless, it is yet to be used in various scientific fields, including the educational field, since the assumption about the existence of two feature views cannot be easily put into practice. Within this context, the main purpose of the present study is to evaluate the efficiency of a proposed co-training method for early prognosis of undergraduate students' performance in the final examinations of a distance course based on a plethora of attributes which are naturally divided into two distinct views, since they originate from different sources. More specifically, the first view consists of attributes regarding students' characteristics and academic achievements which are manually filled out by their tutors, while the second one consists of attributes tracking students' online activity in the course learning management system, which are automatically recorded by the system. The experimental results demonstrate the superiority of the proposed co-training method as opposed to state-of-the-art semi-supervised and supervised methods.

Applied sciences, Mar 21, 2020
Transferring knowledge from one domain to another has gained a lot of attention among scientists in recent years. Transfer learning is a machine learning approach aiming to exploit the knowledge retrieved from one problem for improving the predictive performance of a learning model for a different but related problem. This is particularly the case when there is a lack of data regarding a problem, but there is plenty of data about another related one. To this end, the present study intends to investigate the effectiveness of transfer learning from deep neural networks for the task of students' performance prediction in higher education. Since building predictive models in the Educational Data Mining field through transfer learning methods has been poorly studied so far, we consider this study an important step in this direction. Therefore, a plethora of experiments were conducted based on data originating from five compulsory courses of two undergraduate programs. The experimental results demonstrate that the prognosis of students at risk of failure can be achieved with satisfactory accuracy in most cases, provided that datasets of students who have attended other related courses are available.

Journal of Intelligent and Fuzzy Systems, Aug 26, 2018
Nowadays, Semi-Supervised Learning lies at the core of the Machine Learning field, trying to exploit unlabeled data as effectively as possible, together with a small amount of labeled data, in order to improve predictive performance. Depending on the nature of the output class, Semi-Supervised Classification and Semi-Supervised Regression constitute the basic components of Semi-Supervised Learning. Over the last two decades, various studies have dealt with the implementation of Semi-Supervised Classification techniques in many real-world problems, in contrast with Semi-Supervised Regression, which is deemed to be a more general and less explored case. This survey aims to provide a detailed review of Semi-Supervised Regression methods and implemented algorithms in recent years. Our in-depth study reveals the relatively few studies that deal with this specific problem. Moreover, we seek to classify these methods by proposing a schema and categorizing all the related methods that have been developed in recent years according to specific criteria.

Journal of Intelligent and Fuzzy Systems, Jul 27, 2018
Semi-supervised learning is an emerging subfield of machine learning, with a view to building efficient classifiers exploiting a limited pool of labeled data together with a large pool of unlabeled ones. Most of the studies regarding semi-supervised learning deal with classification problems, whose goal is to learn a function that maps an unlabeled instance into a finite number of classes. In this paper, a new semi-supervised classification algorithm, which is based on a voting methodology, is proposed. This ensemble method is called CST-Voting. Ensemble methods have been effectively applied in various scientific fields and often perform better than the individual classifiers from which they originate. The efficiency of the proposed algorithm is compared to three familiar semi-supervised learning methods on a plethora of benchmark datasets using three representative supervised classifiers as base learners. Experimental results demonstrate the predominance of the proposed method, outperforming classical semi-supervised classification algorithms, as illustrated by the accuracy measurements and confirmed by the Friedman Aligned Ranks nonparametric test.

IEEE Access, 2020
In many real-world applications, scientists are often confronted with the problem of incomplete datasets due to several reasons. The direct analysis of datasets with missing values in attributes inevitably results in inaccurate learning models and erroneous results. Effectively addressing the challenge of missing values is an essential step of the data mining process. Imputation is often employed to overcome the shortcomings incurred by missing data during the preprocessing stage of data analysis. Therefore, a plethora of statistical and machine learning methods have been proposed and employed with a view to imputing the missing values in incomplete data with their potential or actual values. In this context, the main objective of this paper is to put forward an iterative stepwise imputation method based on the semi-supervised learning approach, called IRSSI. Semi-supervised methods have proved to be particularly effective for exploiting incomplete or partially labeled data with regard to the values of the target attribute. The proposed algorithm was experimentally evaluated on real-world benchmark datasets and artificially generated datasets using different high ratios of missing data. The experimental results demonstrate the efficiency of the IRSSI algorithm compared to typical imputation methods.

Predicting Student Performance in Distance Higher Education Using Active Learning
Communications in computer and information science, 2017
Students’ performance prediction in higher education has been identified as one of the most important research problems in machine learning. Educational data mining constitutes an important branch of machine learning trying to effectively analyze students’ academic behavior and predict their performance. Over recent years, several machine learning methods have been effectively used in the educational field with remarkable results, especially supervised classification methods. The early identification of students at risk of failing is of utmost importance for the academic staff and the universities. In this paper, we investigate the effectiveness of active learning methodologies in predicting students’ performance in distance higher education. As far as we are aware, there exists no study dealing with the implementation of active learning methodologies in the educational field. Several experiments take place in our research comparing the accuracy measures of familiar active learners and demonstrating their efficiency by the exploitation of a small labeled dataset together with a large pool of unlabeled data.

Estimating student dropout in distance higher education using semi-supervised techniques
Nowadays, distance higher education has rapidly increased due to the advancement and integration of information and communication technology. Students who attend online distance courses often have family obligations and job commitments and are usually at 'high risk' of dropping out during their attendance. It is highly important to identify such students, since paying extra attention and offering support to them could minimize the possibility of student failure or even dropout. The present research intends to study whether semi-supervised techniques could be useful in student dropout prediction in distance higher education. Semi-supervised learning aims to generate reliable predictions using few labeled and many unlabeled data. Labeled data are often difficult to obtain, as they require many experts and a lot of human effort and time in experiments. As far as we are aware, several studies propose and compare supervised methods for predicting students' dropout rates in higher education, but none of them investigates the effectiveness of semi-supervised methods. The results of our experiments reveal that a good predictive accuracy can be achieved using few labeled data in comparison to well-known supervised learning algorithms. For that purpose, we have developed a web-based tool to estimate whether an individual student is going to drop out.

Applied sciences, Dec 20, 2019
Educational Data Mining (EDM) has emerged over the last two decades, concerned with the development and implementation of data mining methods in order to facilitate the analysis of vast amounts of data originating from a wide variety of educational contexts. Predicting students' progression and learning outcomes, such as dropout, performance and course grades, is regarded among the most important tasks of the EDM field. Therefore, applying appropriate machine learning algorithms for building accurate predictive models is of utmost importance for both educators and data scientists. Considering the high-dimensional input space and the complexity of machine learning algorithms, the process of building accurate and robust learning models requires advanced data science skills, while being time-consuming and error-prone in most cases. In addition, choosing the proper method for a given problem formulation and configuring the optimal parameters' values for a specific model is a demanding task, whilst it is often very difficult to understand and explain the produced results. In this context, the main purpose of the present study is to examine the potential use of advanced machine learning strategies in educational settings from the perspective of hyperparameter optimization. More specifically, we investigate the effectiveness of automated Machine Learning (autoML) for the task of predicting students' learning outcomes based on their participation in online learning platforms. At the same time, we limit the search space to tree-based and rule-based models in order to achieve transparent and interpretable results. To this end, a plethora of experiments were carried out, revealing that autoML tools achieve consistently superior results. Hopefully our work will help nonexpert users (e.g., educators and instructors) in the field of EDM to conduct experiments with appropriate automated parameter configurations, thus achieving highly accurate and comprehensible results.

Multi-objective Optimization of C4.5 Decision Tree for Predicting Student Academic Performance
Applying data mining methods in the educational field has gained a lot of attention among scientists over the last years. Educational Data Mining forms an ever-developing research area aiming to unveil the hidden knowledge in educational data and improve students’ learning behavior and outcomes. To this end, a plethora of data mining methods have already been implemented in various educational settings, solving a variety of tasks, including the prediction of students’ academic performance. Decision trees have proven to be a quite effective method for both classification and regression problems, showing a number of considerable advantages, such as efficiency, simplicity, flexibility and interpretability. Moreover, the configuration of parameter values often has a material impact on building optimal trees in terms of accuracy and/or size. In this context, the main objective of our study is to yield a highly accurate and interpretable classification tree for the early prognosis of students at risk of failing in a university course. Thereby, effective intervention and support actions could be initiated to motivate students and enhance their performance. The experimental results demonstrate that the induction of the C4.5 decision tree classifier through an evolutionary algorithm, such as the Speed-constrained Multi-objective Particle Swarm Optimization algorithm, yields more accurate and easier to construe trees.

Self-trained eXtreme Gradient Boosting Trees
Semi-Supervised Learning (SSL) is an ever-growing research area offering a powerful set of methods, either single or multi-view, for exploiting both labeled and unlabeled instances in the most effective manner. Self-training is a representative SSL algorithm which has been efficiently implemented for solving several classification problems in a wide range of scientific fields. Moreover, self-training has served as the base for the development of several self-labeled methods. In addition, gradient boosting is an advanced machine learning technique, a boosting algorithm for both classification and regression problems, which produces a predictive model in the form of decision trees. In this context, the principal objective of this paper is to put forward an improved self-training algorithm for classification tasks utilizing the efficacy of eXtreme Gradient Boosting (XGBoost) trees in a self-labeled scheme in order to build a highly accurate and robust classification model. A number of experiments on benchmark datasets were executed demonstrating the superiority of the proposed method over representative semi-supervised methods, as statistically verified by the Friedman non-parametric test.
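The generic self-training loop mentioned in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: a simple nearest-centroid classifier stands in for XGBoost so the example needs no third-party packages, and the function names, the confidence measure, the 0.8 threshold and the toy data are all hypothetical.

```python
import math

def fit_centroids(X, y):
    """Stand-in base learner: one mean vector (centroid) per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is exp(-distance), normalized over classes."""
    scores = {c: math.exp(-math.dist(x, m)) for c, m in centroids.items()}
    total = sum(scores.values())
    label = max(scores, key=scores.get)
    return label, scores[label] / total

def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_iter=10):
    """Repeatedly add confidently pseudo-labeled instances to the labeled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_iter):
        centroids = fit_centroids(X_lab, y_lab)
        confident, remaining = [], []
        for x in pool:
            label, conf = predict_with_confidence(centroids, x)
            (confident if conf >= threshold else remaining).append((x, label))
        if not confident:  # nothing passed the threshold: stop early
            break
        for x, label in confident:
            X_lab.append(x)
            y_lab.append(label)
        pool = [x for x, _ in remaining]
    return fit_centroids(X_lab, y_lab)

# Toy data (hypothetical): two labeled points and five unlabeled ones
X_labeled = [(0.0, 0.0), (5.0, 5.0)]
y_labeled = [0, 1]
X_unlabeled = [(0.5, 0.2), (4.8, 5.1), (0.1, 0.9), (5.5, 4.6), (2.5, 2.5)]
model = self_train(X_labeled, y_labeled, X_unlabeled)
```

In this toy run, the ambiguous point `(2.5, 2.5)` never reaches the confidence threshold and is therefore left out of the enlarged training set, which is the intended behavior of a threshold-based self-labeling scheme.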

Fuzzy-based active learning for predicting student academic performance using autoML: a step-wise approach
Journal of Computing in Higher Education, May 12, 2021
Predicting students’ learning outcomes is one of the main topics of interest in the area of Educational Data Mining and Learning Analytics. To this end, a plethora of machine learning methods has been successfully applied for solving a variety of predictive problems. However, it is of utmost importance for both educators and data scientists to develop accurate learning models at low cost. Fuzzy logic constitutes an appropriate approach for building models of high performance and transparency. In addition, active learning reduces both the time and cost of labeling effort, by exploiting a small set of labeled data along with a large set of unlabeled data in the most efficient way. Moreover, choosing the proper method for a given problem formulation and configuring the optimal parameter setting is a demanding task, considering the high-dimensional input space and the complexity of machine learning algorithms. As such, exploring the potential of automated machine learning (autoML) strategies from the perspective of machine learning adeptness is important. In this context, the present study introduces a fuzzy-based active learning method for predicting students’ academic performance which combines, in a modular way, autoML practices. Numerous experiments were carried out, revealing the efficiency of the proposed method for the accurate prediction of students at risk of failure. These insights may have the potential to support the learning experience and be useful to the wider science of learning.

Applied sciences, Nov 26, 2020
Multi-view learning is a machine learning approach aiming to exploit the knowledge retrieved from data, represented by multiple feature subsets known as views. Co-training is considered the most representative form of multi-view learning, a very effective semi-supervised classification algorithm for building highly accurate and robust predictive models. Even though it has been implemented in various scientific fields, it has not been adequately used in educational data mining and learning analytics, since the hypothesis about the existence of two feature views cannot be easily implemented. Some notable studies have emerged recently dealing with semi-supervised classification tasks, such as student performance or student dropout prediction, while semi-supervised regression is uncharted territory. Therefore, the present study attempts to implement a semi-supervised regression algorithm for predicting the grades of undergraduate students in the final exams of a one-year online course, which exploits three independent and naturally formed feature views, since they are derived from different sources. Moreover, we examine a well-established framework for interpreting the acquired results regarding their contribution to the final outcome per student/instance. For this purpose, a plethora of experiments is conducted based on data offered by the Hellenic Open University and representative machine learning algorithms. The experimental results demonstrate that the early prognosis of students at risk of failure can be accurately achieved compared to supervised models, even for a small amount of initially collected data from the first two semesters. The robustness of the applied semi-supervised regression scheme along with supervised learners and the investigation of features' reasoning could highly benefit the educational domain.

Deep Dense Neural Network for Early Prediction of Failure-Prone Students
Learning and analytics in intelligent systems, 2020
In recent years, the constant technological advances in computing power as well as in the processing and analysis of large amounts of data have given a powerful impetus to the development of a new research area. Deep Learning is a burgeoning subfield of machine learning which is currently getting a lot of attention among scientists, offering considerable advantages over traditional machine learning techniques. Deep neural networks have already been successfully applied for solving a wide variety of tasks, such as natural language processing, text translation, image classification, object detection and speech recognition. A great many notable studies have recently emerged concerning the use of Deep Learning methods in the area of Educational Data Mining and Learning Analytics. Their potential use in the educational field opens up new horizons for educators so as to enhance their understanding and analysis of data coming from varied educational settings, thus improving learning and teaching quality as well as the educational outcomes. In this context, the main purpose of the present study is to evaluate the efficiency of deep dense neural networks with a view to the early prediction of failure-prone students in distance higher education. A plethora of student features coming from different educational sources were employed in our study regarding students’ characteristics, academic achievements and activity in the course learning management system. The experimental results reveal that Deep Learning methods may contribute to building more accurate predictive models, whilst identifying students in trouble soon enough to provide in-time and effective interventions.

Algorithms, Jan 16, 2020
In recent years, a forward-looking subfield of machine learning has emerged with important applications in a variety of scientific fields. Semi-supervised learning is increasingly being recognized as a burgeoning area embracing a plethora of efficient methods and algorithms seeking to exploit a small pool of labeled examples together with a large pool of unlabeled ones in the most efficient way. Co-training is a representative semi-supervised classification algorithm originally based on the assumption that each example can be described by two distinct feature sets, usually referred to as views. Since such an assumption can hardly be met in real-world problems, several variants of the co-training algorithm have been proposed dealing with the absence or existence of a natural two-view feature split. In this context, a Static Selection Ensemble-based co-training scheme operating under a random feature split strategy is outlined regarding binary classification problems, where the type of the base ensemble learner is a soft-Voting one composed of two participants. Ensemble methods are commonly used to boost the predictive performance of learning models by using a set of different classifiers, while the Static Ensemble Selection approach seeks to find the most suitable structure of ensemble classifier based on a specific criterion through a pool of candidate classifiers. The efficacy of the proposed scheme is verified through several experiments on a plethora of benchmark datasets, as statistically confirmed by the Friedman Aligned Ranks non-parametric test over the behavior of classification accuracy, F1-score, and Area Under Curve metrics.
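The soft-voting idea referred to in the abstract above, with two participants, amounts to averaging the class probabilities of both classifiers and choosing the class with the highest average. The sketch below illustrates only that step; the function name and the probability values are hypothetical, and the paper's ensemble selection and co-training procedure are not reproduced here.

```python
def soft_vote(prob_a, prob_b):
    """Average two class-probability dicts; return (winning label, averaged probabilities)."""
    classes = set(prob_a) | set(prob_b)
    avg = {c: (prob_a.get(c, 0.0) + prob_b.get(c, 0.0)) / 2 for c in classes}
    return max(avg, key=avg.get), avg

# Hypothetical outputs of two base classifiers for one instance
probs_first = {"positive": 0.6, "negative": 0.4}
probs_second = {"positive": 0.3, "negative": 0.7}
label, averaged = soft_vote(probs_first, probs_second)
```

Here the two classifiers disagree, and the more confident one decides the outcome, which is the practical difference between soft voting and a simple majority (hard) vote.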

Enhancing high school students' performance based on semi-supervised methods
High school educators evaluate students' performance on a daily basis using several assessment methods. Identifying weak and low-performing students as soon as possible during the academic year is of utmost importance for teachers and educational institutions. Well-planned assignments and activities, additional learning material and supplementary lessons may motivate students and enhance their performance. Over recent years, educational data mining has led to the development of several efficient methods for the prediction of students' performance. Semi-supervised learning constitutes the appropriate tool to exploit data originating from educational institutions, since there is often a lack of labeled data, while unlabeled data is vast. In our study, several well-known semi-supervised techniques are used for the prognosis of high school students' performance in the final examinations of the “Mathematics” module. The experimental results demonstrate the efficiency of semi-supervised learning methods, especially the Self-training, Co-training and Tri-training algorithms, compared to familiar supervised methods.

Early dropout prediction in distance higher education using active learning
2017 8th International Conference on Information, Intelligence, Systems & Applications (IISA), 2017
Students' dropout prediction in higher education is an important and challenging research topic for universities. The successful implementation of a distance learning course is fundamental for educational institutions; for this reason, the reduction of dropout rates is of vital importance. Although the use of machine learning methods in the educational field is relatively new, significant studies have been presented in recent years dealing with the dropout phenomenon. These studies point out several factors influencing successful course completion, while indicating the complexity and difficulty of accurate early dropout prediction. The main purpose of this research is to investigate the efficiency of active learning methodologies to predict students' dropout rates in a distance web-based course in a timely manner. Active learning is typical of methods that try to effectively use unlabeled data along with a small amount of labeled data. A plethora of experiments are conducted using a variety of active learners, indicating that an early prediction of high-risk students can be obtained.

Brain Function Assessment in Learning, 2017
In recent years, there has been a growing research interest in applying data mining techniques in education. Educational Data Mining has become an efficient tool for teachers and educational institutions trying to effectively analyze the academic behavior of students and predict their progress and performance. The main objective of this study is to classify junior high school students' performance in the final examinations of the "Geography" module in a set of five pre-defined classes using active learning. The exploitation of a small set of labeled examples together with a large set of unlabeled ones to build efficient classifiers is the key point of the active learning framework. To the best of our knowledge, no study exists dealing with the implementation of active learning methods for predicting students' performance. Several assessment attributes related to students' grades in homework assignments, oral assessment, short tests and semester exams constitute the dataset, while a number of experiments are carried out demonstrating the advantage of active learning compared to familiar supervised methods, such as the Naïve Bayes classifier.

Evaluating Active Learning Methods for Bankruptcy Prediction
Brain Function Assessment in Learning, 2017
The prediction of corporate bankruptcy has been addressed as an increasingly important financial problem and has been extensively analyzed in the accounting literature. Over recent years, several machine learning methods, such as neural networks (NNs) and ensemble methods, have been effectively applied to build accurate predictive models for detecting business failure with remarkable results. This paper investigates the effectiveness of the active learning framework to predict bankruptcy using financial data from a set of Greek firms. Active learning is an emerging subfield of machine learning exploiting a small amount of labeled data together with a large pool of unlabeled data to improve learning accuracy. As far as we know, there exists no study dealing with the implementation of active learning methodologies in the financial field. Several experiments take place in our research comparing the accuracy measures of familiar active learners and demonstrating their efficiency in contrast to representative supervised methods.

Electronics, 2021
Over recent years, massive open online courses (MOOCs) have gained increasing popularity in the field of online education. Students with different needs and learning specificities are able to attend a wide range of specialized online courses offered by universities and educational institutions. As a result, large amounts of data regarding students’ demographic characteristics, activity patterns, and learning performances are generated and stored in institutional repositories on a daily basis. Unfortunately, a key issue in MOOCs is low completion rates, which directly affect student success. Therefore, it is of utmost importance for educational institutions and faculty members to find more effective practices and reduce non-completer ratios. In this context, the main purpose of the present study is to employ a plethora of state-of-the-art supervised machine learning algorithms for predicting student dropout in a MOOC for smart city professionals at an early stage. The experimental re...