Papers by Negin Daneshpour

Extracting Urgent Questions from MOOC Discussions: A BERT-Based Multi-output Classification Approach
Arabian Journal for Science and Engineering, May 31, 2024
Online discussion forums are widely used by students to ask and answer questions related to their learning topics. However, not all questions posted by students receive timely and appropriate feedback from instructors, which can affect the quality and effectiveness of the online learning experience. Therefore, it is important to automatically identify and prioritize student questions from online discussion forums so that instructors can provide better support and guidance to the students. In this paper, we propose a novel hybrid convolutional neural network (CNN) + bidirectional gated recurrent unit (Bi-GRU) multi-output classification model, which can perform this task with high accuracy and efficiency. Our model has two outputs: the first classifies whether a post is a question or not, and the second classifies whether a detected question is urgent or not. The model leverages the advantages of both CNN and Bi-GRU layers to capture local and global features of the input data, and uses the Bidirectional Encoder Representations from Transformers (BERT) model to provide rich, contextualized word embeddings. The model achieves an F1-weighted score of 94.8% when classifying whether posts are questions, and an 88.5% F1-weighted score when classifying questions as urgent or non-urgent. Distinguishing and classifying urgent student questions with high accuracy and coverage can help instructors provide timely and appropriate feedback, a key factor in reducing dropout rates and improving completion rates.
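As a rough illustration of the architecture described above, the following sketch wires a frozen BERT encoder into a Conv1D + Bi-GRU backbone with two sigmoid output heads, assuming TensorFlow/Keras and the Hugging Face transformers library; the layer sizes, dropout rate, and sequence length are illustrative placeholders, not the paper's reported configuration.

# Minimal sketch of a BERT -> CNN -> Bi-GRU multi-output classifier
# (illustrative layer sizes, not the paper's exact configuration).
import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 128  # assumed maximum sequence length

bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False  # use BERT only for contextualized embeddings

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Contextualized token embeddings: (batch, MAX_LEN, 768)
embeddings = bert(input_ids, attention_mask=attention_mask).last_hidden_state

# Local features via 1-D convolution, global features via a bidirectional GRU
x = tf.keras.layers.Conv1D(128, kernel_size=3, padding="same", activation="relu")(embeddings)
x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64))(x)
x = tf.keras.layers.Dropout(0.3)(x)

# Two outputs: question vs. non-question, and urgent vs. non-urgent
question_out = tf.keras.layers.Dense(1, activation="sigmoid", name="is_question")(x)
urgency_out = tf.keras.layers.Dense(1, activation="sigmoid", name="is_urgent")(x)

model = tf.keras.Model([input_ids, attention_mask], [question_out, urgency_out])
model.compile(optimizer="adam",
              loss={"is_question": "binary_crossentropy", "is_urgent": "binary_crossentropy"},
              metrics=["accuracy"])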

Several techniques exist to select and materialize a proper set of data in a suitable structure to manage the queries submitted to online analytical processing systems. These techniques are called view management techniques and cover three research areas: 1) selecting views to materialize, 2) processing and rewriting queries using the materialized views, and 3) maintaining the materialized views. Several parameters should be considered in order to find the most suitable algorithm for view management. As various studies have proposed view selection algorithms, the most suitable algorithm for view materialization should be selected and adapted based on the properties of the application. In this paper, we identify the parameters relevant to view selection algorithms and classify the algorithms based on these parameters. We also present a system to evaluate the algorithms and compare them with respect to the values of the evaluation parameters. Based on the results of these activities, we propose a roadmap that helps choose the most efficient view selection algorithm for a given application type.

DBHC: A DBSCAN-based hierarchical clustering algorithm
Data and Knowledge Engineering, Sep 1, 2021
Clustering is the process of partitioning the objects of a dataset into groups according to the similarities and dissimilarities between them. DBSCAN is one of the most important density-based clustering algorithms. Despite its numerous advantages, DBSCAN has two important input parameters, MinPts and Eps, and determining their values is still a great challenge because they depend heavily on the data distribution. To overcome this challenge, the characteristics of these parameters are first investigated and the data distribution is analyzed. Then a DBSCAN-based hierarchical clustering (DBHC) method is proposed in this paper. DBHC first determines the parameter values using the notion of the k nearest neighbors and the k-dist plot. Because most real-world data is not distributed uniformly, several values must be produced for the Eps parameter. DBHC then executes the DBSCAN algorithm several times, once for each Eps value produced earlier. Finally, DBHC merges the obtained clusters if their number is larger than the number estimated by the user. To evaluate the performance of DBHC, several experiments were performed on benchmark datasets from the UCI repository, and the results were compared with previous works. The results consistently showed that DBHC leads to better results than the other works.
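A minimal sketch of this general workflow, assuming scikit-learn and SciPy: Eps candidates are read off the sorted k-dist curve, DBSCAN is run once per candidate, and the resulting clusters are merged hierarchically down to the user-estimated count. The quantile-based Eps heuristic and the merging rule are simplified placeholders rather than the exact DBHC procedure.

# Simplified DBSCAN-based hierarchical workflow (placeholder heuristics, not the exact DBHC rules)
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
from scipy.cluster.hierarchy import linkage, fcluster

def eps_candidates(X, k=4, n_values=3):
    # k-dist plot: distance of each point to its k-th nearest neighbor, sorted ascending
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    kdist = np.sort(dists[:, k])
    # take a few quantiles of the curve as Eps candidates (placeholder heuristic)
    return [np.quantile(kdist, q) for q in np.linspace(0.5, 0.95, n_values)]

def dbhc_like(X, k=4, target_clusters=3):
    # run DBSCAN once per Eps candidate, keep the run producing the most clusters
    runs = [DBSCAN(eps=e, min_samples=k).fit_predict(X) for e in eps_candidates(X, k)]
    best = max(runs, key=lambda lab: len(set(lab)) - (1 if -1 in lab else 0))
    cluster_ids = sorted(set(best) - {-1})
    centroids = np.array([X[best == c].mean(axis=0) for c in cluster_ids])
    if len(centroids) <= target_clusters:
        return best
    # merge cluster centroids hierarchically down to the user-estimated number of clusters
    merged = fcluster(linkage(centroids, method="average"), t=target_clusters, criterion="maxclust")
    mapping = dict(zip(cluster_ids, merged))
    return np.array([mapping.get(lab, -1) for lab in best])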

Automatic Error Detecting in Databases, Based on Clustering and Nearest Neighbor
Statistics, Optimization and Information Computing, Jun 9, 2021
Data quality has diverse dimensions, of which accuracy is the most important. Data cleaning is one of the preprocessing steps in data mining and consists of detecting errors and repairing them. Noise is a common type of error that occurs in databases. This paper proposes an automated noise detection method based on k-means clustering. First, each attribute (A_j) is temporarily removed from the data and k-means clustering is applied to the remaining attributes. Then the k nearest neighbors are found within each cluster, and a value is predicted for A_j in each record from those neighbors. The proposed method detects noisy attributes by comparing the predicted values with the stored ones. It is able to identify several noisy values in a single record and can detect noise in fields of different data types. Experiments show that the method detects, on average, 92% of the noise present in the data. The proposed method is compared with a noise detection method based on association rules; the results indicate that it improves noise detection by 13% on average.
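A simplified sketch of this clustering-plus-nearest-neighbor idea for numeric data, assuming scikit-learn: each attribute is held out in turn, records are clustered on the remaining attributes with k-means, the held-out attribute is predicted from neighbors inside each cluster, and large prediction errors are flagged as noise. The residual-based threshold is an assumption, not the paper's exact criterion.

# Flag noisy cells by predicting each attribute from its cluster neighbors
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsRegressor

def flag_noise(X, n_clusters=3, n_neighbors=5, z_thresh=3.0):
    X = np.asarray(X, dtype=float)
    flags = np.zeros_like(X, dtype=bool)
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)                       # temporarily drop attribute j
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(others)
        for c in range(n_clusters):
            idx = np.where(labels == c)[0]
            if len(idx) <= n_neighbors:
                continue
            knn = KNeighborsRegressor(n_neighbors=n_neighbors)
            knn.fit(others[idx], X[idx, j])                    # predict A_j from cluster neighbors
            resid = X[idx, j] - knn.predict(others[idx])
            std = resid.std() or 1.0
            flags[idx, j] = np.abs(resid) > z_thresh * std     # flag large prediction errors as noise
    return flags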
پردازش علائم و دادهها, Sep 1, 2022
Multivariate time series data are found in a variety of fields such as bioinformatics, biology, genetics, astronomy, geography, and finance. Many time series datasets contain missing data, and imputing missing values in multivariate time series is a challenging topic that needs to be addressed carefully before learning from or predicting the series. Much research has been done on techniques for time series missing data imputation, which usually involve simple analytic methods or modeling for specific applications or univariate time series.
Dynamic View Management System for Query Prediction to View Materialization
IGI Global eBooks, 2013
The New Approach toward Refreshing Data Warehouse
International Conference on Computational Intelligence, 2004
پردازش علائم و دادهها, Jun 1, 2017
Expert Systems With Applications, Nov 1, 2020
پردازش علائم و دادهها, Dec 1, 2017

Data engineering approach to efficient data warehouse: Life cycle development revisited
A data warehouse (DW) refers to technologies for collecting, integrating, and analyzing large volumes of homogeneous and heterogeneous data in order to provide information that enables better decision making. To achieve the main purpose of a data warehouse, providing analytical responses to online queries, many parameters must be considered in the development life cycle. Among all the factors involved in DW efficiency, data quality should be taken most seriously. Today, a data warehouse architecture typically consists of several components that consolidate data from several operational and historical databases to support a variety of front-end query, reporting, and analytical tools. The back end of the architecture relies mainly on the Extract-Transform-Load (ETL) process, which is usually preferred to be realized as a tool. Designing and implementing an application-dependent ETL to pipeline validated and verified data is labor intensive and typically consumes a large fraction of the effort in data warehouse projects. The outcome of our experiment building a DW with the recommended methodology on thirty-three million actual population records confirms that the DW development life cycle has to be revisited. Many works have reported the impact of data quality on DW efficiency, but less attention has been paid to the data engineering aspects of revising the development life cycle to obtain an efficient DW. Our investigation in this experiment shows that the following three steps facilitate the life-cycle process and yield a more tailored DW: 1) data cleaning as a pre-processing phase before data cleansing in ETL; 2) identifying query types and their operations before the transform phase of ETL; and 3) identifying and materializing suitable views for each query before the load phase of ETL. The results regarding accuracy, effort, and time have been tested and are significantly promising.
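A compact skeleton of the revised pipeline implied by these three steps, with trivial stand-in implementations so the ordering is concrete; every function body here is a placeholder for illustration, not a component of the reported methodology.

# Placeholder pipeline showing where the three proposed steps sit relative to ETL
def pre_clean(records):
    # step 1: data cleaning before the ETL cleansing stage (here: drop incomplete rows)
    return [r for r in records if all(v is not None for v in r.values())]

def classify_query_types(query_log):
    # step 2: identify query types before the transform phase (here: a crude keyword check)
    return {q: ("aggregate" if "GROUP BY" in q.upper() else "detail") for q in query_log}

def select_views(workload):
    # step 3: pick views to materialize before the load phase (here: all aggregate queries)
    return [q for q, kind in workload.items() if kind == "aggregate"]

def build_warehouse(raw_records, query_log):
    cleaned = pre_clean(raw_records)
    workload = classify_query_types(query_log)
    views = select_views(workload)
    return cleaned, workload, views   # actual load/materialize phases omitted in this sketch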

پردازش علائم و دادهها, Mar 1, 2022
Accuracy and validity of data are prerequisites for the appropriate operation of any software system. There is always a possibility of errors occurring in data due to human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity; there should be only one of them in a data source, but for reasons such as the aggregation of data sources and human error during data entry, several copies of an entity may appear in a data source. This problem leads to errors in the operations or output results of a system and imposes significant costs on the related organization or business. Therefore, the data cleaning process, especially duplicate record detection, has become one of the most important areas of computer science in recent years. Many solutions have been presented for detecting duplicates in different situations, but almost all of them are time-consuming, and since the volume of data grows every day, previous methods no longer offer sufficient performance. Incorrectly detecting two different records as duplicates is another problem recent works face; it matters because duplicates are usually deleted, so some correct data may be lost. It therefore seems necessary to present new methods. In this paper, a method is proposed that reduces the required processing volume using hierarchical clustering with appropriate features. In this method, the similarity between records has been …
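Since the abstract is cut off before the method details, the sketch below only illustrates the general idea of using clustering to cut down pairwise comparisons: records are first grouped on cheap numeric features via hierarchical clustering, and the expensive string similarity is computed only within each group. The features, block count, and similarity threshold are assumptions, not the paper's settings.

# Hierarchical clustering as a blocking step before expensive pairwise comparison
from itertools import combinations
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from difflib import SequenceMatcher

def candidate_duplicates(features, names, n_blocks=10, threshold=0.9):
    # features: numeric matrix used only for blocking; names: strings to compare
    blocks = fcluster(linkage(np.asarray(features, float), method="ward"),
                      t=n_blocks, criterion="maxclust")
    pairs = []
    for b in set(blocks):
        idx = np.where(blocks == b)[0]
        for i, j in combinations(idx, 2):               # expensive comparison only within a block
            if SequenceMatcher(None, names[i], names[j]).ratio() >= threshold:
                pairs.append((int(i), int(j)))
    return pairs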

Statistics, Optimization and Information Computing, May 19, 2019
A data warehouse is designed for answering analytical queries and stores historical data. The response time of analytical queries in a data warehouse is long, so reducing it is a critical problem. There are many algorithms for this problem; some of them materialize frequently used views. Previously posed queries contain important information that can be used in the future. This paper proposes an algorithm for view materialization that finds proper views using previous queries and materializes them so that they can answer future queries. The view selection algorithm has four steps: first, previous queries are clustered with the SOM method; then frequent queries are found with the Apriori algorithm; in the third step the problem is converted into 0/1 knapsack equations; and finally, the optimal queries are joined to create a single view for each cluster. This paper improves the first and third steps: it uses the SOM algorithm for clustering previous queries in the first step and solves the 0/1 knapsack equations with the shuffled frog leaping algorithm in the third step. Experimental results show that it improves previous view selection algorithms in terms of response time and storage space.
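As a worked illustration of the third step only, the sketch below states the 0/1 knapsack selection of views under a storage budget; it uses plain dynamic programming as a stand-in for the shuffled frog leaping solver described in the paper, and the benefit and size values in the example are hypothetical.

# 0/1 knapsack formulation: choose the views that maximize benefit within a storage budget
def select_views(benefits, sizes, budget):
    # benefits[i]: estimated gain of materializing query i's view
    # sizes[i]:    integer storage cost of that view; budget: total storage available
    n = len(benefits)
    best = [[0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if sizes[i - 1] <= b:
                best[i][b] = max(best[i][b], best[i - 1][b - sizes[i - 1]] + benefits[i - 1])
    # backtrack to recover which queries were chosen
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= sizes[i - 1]
    return sorted(chosen)

# e.g. select_views([60, 100, 120], [10, 20, 30], 50) returns [1, 2]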

Tabriz Journal of Electrical Engineering, Nov 22, 2018
The presence of missing values in real-world data is a very prevalent and inevitable problem, so it is necessary to fill these missing values accurately before the data are used in the knowledge discovery process. This paper proposes three novel methods for filling numeric missing values. All of the proposed methods apply regression models to subsets of attributes that are strongly correlated. These subsets are selected using forward-selection-based approaches, which try to maximize the correlation between the missing attribute and the other attributes; the correlation coefficient is used to measure the relationships between attributes. The priority of each missing attribute for imputation is also considered in the proposed methods. The performance of the proposed methods is evaluated on five real-world datasets with different missing ratios and compared with five estimation methods: mean imputation, k nearest neighbours imputation, a fuzzy c-means based imputation, a decision tree based imputation, and a regression based imputation algorithm called "Incremental Attribute Regression Imputation" (IARI). Two well-known evaluation criteria, Root Mean Squared Error (RMSE) and Coefficient of Determination (CoD), are used to compare the performance of the proposed methods with the other imputation methods. Experimental results show that the proposed methods perform better than the compared methods, even when the missing ratio is high.
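A minimal sketch of the correlation-guided regression idea for a single numeric column, assuming pandas and scikit-learn; greedy selection of the most correlated predictors stands in for the paper's forward-selection procedure, and the predictor count and handling of incomplete predictor rows are simplifications.

# Impute one numeric column from its most correlated predictors via linear regression
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_column(df, target, max_predictors=3):
    # rank other numeric attributes by absolute correlation with the target attribute
    corr = df.corr(numeric_only=True)[target].drop(target).abs()
    predictors = corr.sort_values(ascending=False).index[:max_predictors].tolist()
    train = df.dropna(subset=[target] + predictors)        # complete rows for fitting
    missing = df[df[target].isna()].dropna(subset=predictors)
    if missing.empty:
        return df.copy()
    model = LinearRegression().fit(train[predictors], train[target])
    out = df.copy()
    out.loc[missing.index, target] = model.predict(missing[predictors])
    return out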
A new method for privacy preserving association rule mining using homomorphic encryption with a secure communication protocol
Wireless Networks, Nov 21, 2022
A Near Real-Time Data Warehouse Architecture Based on Ontology
پردازش علائم و دادهها, Mar 1, 2019