Papers by Mikhail Zymbler
Вестник Южно-Уральского государственного университета, Mar 1, 2013
A Parallel Algorithm for Market Basket Analysis on Cell Processors
Вестник Южно-Уральского государственного университета. Серия: Математическое моделирование и программирование, 2010

arXiv (Cornell University), Dec 28, 2018
Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional databases. Apriori is a classical frequent itemset mining algorithm, which performs iterative passes over the database, combining the generation of candidate itemsets based on the frequent itemsets found at the previous iteration with the pruning of clearly infrequent itemsets. The Dynamic Itemset Counting (DIC) algorithm is a variation of Apriori that tries to reduce the number of passes made over a transactional database while keeping the number of itemsets counted in a pass relatively low. In this paper, we address the problem of accelerating DIC on the Intel Xeon Phi many-core system for the case when the transactional database fits in main memory. Intel Xeon Phi provides a large number of small compute cores with vector processing units. The paper presents a parallel implementation of DIC based on OpenMP technology and thread-level parallelism. We exploit a bit-based internal layout for transactions and itemsets. This technique reduces the memory space for storing the transactional database, simplifies support counting via logical bitwise operations, and allows this step to be vectorized. Experimental evaluation on the platforms of the Intel Xeon CPU and the Intel Xeon Phi coprocessor with large synthetic and real databases showed good performance and scalability of the proposed algorithm.
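The bit-based support counting described above can be sketched as follows (a minimal serial illustration with our own naming, not the authors' vectorized implementation): each item maps to a bitmap over transactions, and the support of an itemset is a bitwise AND of its items' bitmaps followed by a population count.

```python
# Illustrative bit-based support counting: Python ints stand in for the
# fixed-width machine words a vectorized implementation would process.
from functools import reduce

def support(itemset, bitmaps):
    """Count transactions containing every item of `itemset`.

    `bitmaps` maps an item to an integer whose i-th bit is set iff
    transaction i contains the item.
    """
    combined = reduce(lambda a, b: a & b, (bitmaps[i] for i in itemset))
    return bin(combined).count("1")  # population count

# Toy database of 4 transactions over items a, b, c.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
bitmaps = {
    item: sum(1 << i for i, t in enumerate(transactions) if item in t)
    for item in {"a", "b", "c"}
}
print(support(("a", "b"), bitmaps))  # → 2
```

On real hardware the same AND-and-popcount loop over word-sized chunks is what makes the support step amenable to vectorization.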

Vyčislitelʹnye metody i programmirovanie, Jun 29, 2023
Nowadays, time series data processing is carried out in a wide range of scientific and practical applications, in which recovering single points or blocks of time series values that are missing due to hardware or software failures, or to the human factor, is a topical problem. The article presents SANNI (Snippet and Artificial Neural Network-based Imputation), a method for recovering missing values of a time series processed offline. SANNI includes two neural network models: the Recognizer and the Reconstructor. The Recognizer determines the snippet (typical subsequence) of the series that the given subsequence with a missing point most closely resembles, and consists of three groups of layers: convolutional, recurrent, and fully connected. The Reconstructor, using the Recognizer's output and the input subsequence with the gap, recovers the missing point. The Reconstructor consists of three groups of layers: convolutional, recurrent, and fully connected. The layer topologies of the Recognizer and the Reconstructor are parameterized with respect to the number of snippets and the snippet length, respectively. Methods for preparing the training sets of these neural network models are presented. Computational experiments showed that SANNI ranks among the top three of the state-of-the-art analytical and neural network methods. Keywords: time series, imputation of missing values, time series snippets, MPdist measure, recurrent neural networks. Acknowledgments: this work was financially supported by the Russian Science Foundation (grant no. 23-21-00465).

Vyčislitelʹnye metody i programmirovanie, Jan 20, 2019
Nowadays, subsequence similarity search is required in a wide range of time series mining applications: climate modeling, financial forecasts, medical research, etc. In most of these applications, the Dynamic Time Warping (DTW) similarity measure is used, since DTW has been empirically confirmed as one of the best similarity measures for the majority of subject domains. Since the DTW measure has a quadratic computational complexity with respect to the length of the query subsequence, a number of parallel algorithms have been developed for various many-core architectures, namely FPGA, GPU, and Intel MIC. In this paper, we propose a new parallel algorithm for subsequence similarity search in very large time series on computer cluster systems whose nodes are based on Intel Xeon Phi Knights Landing (KNL) many-core processors. Computations are parallelized on two levels: by MPI at the level of all cluster nodes and by OpenMP within a single cluster node. The algorithm involves additional data structures and redundant computations, which make it possible to efficiently use the capabilities of vector computations on Phi KNL. Experimental evaluation of the algorithm on real-world and synthetic datasets shows that the proposed algorithm is highly scalable.
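To make the quadratic DTW cost mentioned above concrete, here is a minimal serial dynamic-programming sketch of the DTW distance (illustrative only; the paper's contribution lies in parallelizing and vectorizing this kernel, which the sketch does not attempt):

```python
# Classic O(n*m) DTW: d[i][j] is the minimal accumulated cost of aligning
# the first i points of x with the first j points of y.
import math

def dtw(x, y):
    n, m = len(x), len(y)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # extend the cheapest of the three admissible warping steps
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # → 0.0 (the repeated 2 is warped away)
```

The doubly nested loop is exactly the quadratic dependence on the query length that motivates the parallel algorithm.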
A Parallel Discord Discovery Algorithm for a Graphics Processor
Pattern Recognition and Image Analysis, Jun 1, 2023

Mathematics
Currently, discovering subsequence anomalies in time series remains one of the most topical research problems. A subsequence anomaly refers to successive points in time that are collectively abnormal, although each point is not necessarily an outlier. Among numerous approaches to discovering subsequence anomalies, the discord concept is considered one of the best. A time series discord is intuitively defined as a subsequence of a given length that is maximally far away from its non-overlapping nearest neighbor. The recently introduced MERLIN algorithm discovers time series discords of every possible length in a specified range, thereby eliminating the need to set even that sole parameter to discover discords in a time series. However, MERLIN is serial, and its parallelization could increase the performance of discord discovery. In this article, we introduce a novel parallelization scheme for GPUs called PALMAD (parallel arbitrary length MERLIN-based anomaly discovery). As opposed to...
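The discord definition above can be illustrated with a brute-force sketch (our own simplified code, not PALMAD or MERLIN, which are far more efficient): for each subsequence, compute the Euclidean distance to its nearest non-overlapping neighbor, and return the subsequence that maximizes that distance.

```python
# Brute-force discord discovery for a fixed subsequence length m.
# Assumes the series is long enough that every subsequence has at least
# one non-overlapping neighbor.
import math

def discord(ts, m):
    n = len(ts) - m + 1
    subs = [ts[i:i + m] for i in range(n)]
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    best_idx, best_d = -1, -math.inf
    for i in range(n):
        # nearest neighbor among non-overlapping subsequences only
        nn = min(dist(subs[i], subs[j]) for j in range(n) if abs(i - j) >= m)
        if nn > best_d:
            best_idx, best_d = i, nn
    return best_idx, best_d

# The spike around the value 5 makes its subsequence the discord.
ts = [0, 1, 0, 1, 0, 5, 0, 1, 0, 1, 0]
idx, d = discord(ts, 3)
print(idx, d)
```

MERLIN repeats this search for every length in a range while pruning candidates; PALMAD maps that work onto GPU threads.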

Вестник Южно-Уральского государственного университета. Серия: Вычислительная математика и информатика, 2019
Data Mining is aimed at discovering understandable knowledge from data, which can be used for decision-making in various fields of human activity. The Big Data phenomenon is a characteristic feature of the modern information society. The processes of cleaning and structuring Big Data lead to the formation of very large databases and data warehouses. Despite the emergence of a large number of NoSQL DBMSs, the main database management tool is still the relational DBMS. Integration of data mining into relational DBMSs is one of the promising directions for the development of relational databases. Integration makes it possible both to avoid the overhead of exporting the analyzed data from the repository and importing the analysis results back into it, and to use system services embedded in the DBMS architecture for data analysis. The paper provides an overview of methods and approaches to integrating data mining into a DBMS and gives a classification of these approaches. SQL language extensions that provide syntactic support for data mining in a DBMS are introduced. Examples of implementations of data mining algorithms in SQL and of data analysis systems in relational databases are considered.

Numerical Methods and Programming (Vychislitel'nye Metody i Programmirovanie), 2019
The PAM (Partitioning Around Medoids) algorithm is a partitioning clustering algorithm in which only the objects being clustered (medoids) are chosen as cluster centers. Medoid-based clustering is used in a wide range of applications: segmentation of medical and satellite images, analysis of DNA microarrays and texts, etc. Parallel implementations of PAM currently exist for GPU and FPGA systems, but not for many-core accelerators of the Intel Many Integrated Core (MIC) architecture. This article proposes PhiPAM, a new parallel clustering algorithm for Intel MIC accelerators. Computations are parallelized using OpenMP. The algorithm employs a specialized data layout in memory and tiling techniques that allow computations to be efficiently vectorized on Intel MIC systems. Experiments on real datasets showed good scalability of the algorithm.
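The medoid-based clustering that PhiPAM accelerates can be sketched serially as follows (a toy illustration under our own naming; PhiPAM adds the MIC-specific data layout, tiling, and OpenMP parallelism on top of such logic): medoids are actual data points, and swaps are accepted greedily whenever they reduce the total distance of points to their nearest medoid.

```python
# Minimal PAM-style clustering: try swapping each medoid with each
# non-medoid point and keep the swap if the total cost decreases.
def pam(points, k, dist):
    cost = lambda meds: sum(min(dist(p, points[m]) for m in meds) for p in points)
    medoids = list(range(k))  # naive initialization: first k points
    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for h in range(len(points)):
                if h in medoids:
                    continue
                trial = medoids[:mi] + [h] + medoids[mi + 1:]
                if cost(trial) < cost(medoids):
                    medoids, improved = trial, True
    return sorted(medoids)

# Two well-separated groups; the medoids land one in each group.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
d = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
print(pam(pts, 2, d))  # → [0, 3]
```

The repeated cost evaluations over all point-medoid pairs are the distance computations that a vectorized layout and tiling make cache- and SIMD-friendly.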

Bulletin of the South Ural State University. Ser. Computer Technologies, Automatic Control & Radioelectronics, 2019
Today, fingerprint identification is the most common method of biometric identification. Existing fingerprint identification models have defects that reduce the speed and quality of identification. Most models do not take the topological characteristics of images into account; for example, the classical method of measuring the ridge count may produce incorrect results in areas of significant curvature of the ridge lines. This paper presents a new mathematical model for fingerprint identification that takes their topological characteristics into account. Identification is performed on the basis of templates. A template contains a list of minutiae detected in the image and a list of ridge lines. Sets of topological vectors are constructed for the ridge lines and minutiae. The result of building topological vectors does not depend on the location of the minutiae and takes their possible mutations into account, which increases the stability of the proposed model. The stability of the model is further ensured by combining the base topological vectors constructed for all minutiae and ridge lines into an expanded topological vector. This representation significantly reduces the size of the template and optimizes memory use. Fingerprints are compared using a Delaunay triangulation built over the list of constructed topological vectors; 112 possible classes for topological vectors are defined. This approach increases the speed of identification by up to 10 times while maintaining its accuracy. The proposed classification is resistant to rotation and displacement of images.
Vyčislitelʹnye metody i programmirovanie, Jun 14, 2019
Integration of Fuzzy c-Means Clustering algorithm with PostgreSQL database management system
Numerical Methods and Programming (Vychislitel'nye Metody i Programmirovanie), May 17, 2012

Computers
Botanical plants suffer from several types of diseases that must be identified early to improve the production of fruits and vegetables. Mango is one of the most popular and desirable fruits worldwide due to its taste and richness in vitamins. However, plant diseases also affect the production and quality of these plants. This study proposes a convolutional neural network (CNN)-based metaheuristic approach for disease diagnosis and detection. The proposed approach involves preprocessing, image segmentation, feature extraction, and disease classification. First, the image of mango leaves is enhanced using histogram equalization and contrast enhancement. Then, a geometric mean-based neutrosophic with a fuzzy c-means method is used for segmentation. Next, the essential features are retrieved from the segmented images, including the Upgraded Local Binary Pattern (ULBP), color, and pixel features. Finally, these features are given into the disease detection phase, which is modeled using ...

Mathematics, 2022
The crude oil market has become one of the emerging financial markets, and its volatility effect is paramount and has been considered an issue of utmost importance. This study examines the dynamics of this volatile market by employing a hybrid approach based on an extreme learning machine (ELM) as a regressor and the improved grey wolf optimizer (IGWO) for forecasting the crude oil rate on the West Texas Intermediate (WTI) and Brent crude oil datasets. The datasets are augmented using technical indicators (TIs) and statistical measures (SMs) to obtain better insight into the forecasting ability of the proposed model. The differential evolution (DE) strategy has been used for evolution, and the survival of the fittest (SOF) principle has been used for elimination while implementing the GWO, to achieve a better convergence rate and accuracy. Whereas, the algorithmic simplicity, use of fewer parameters, and easy implementation of DE efficiently decide the evo...

Bulletin of the South Ural State University. Series "Computational Mathematics and Software Engineering", 2020
A time series is a sequence of chronologically ordered numerical values that reflect the course of some process or phenomenon. Currently, one of the most topical classes of time series processing problems comes from Industry 4.0 and Internet of Things applications. A typical task in these applications is smart control and predictive maintenance of complex machines and mechanisms equipped with various sensors. Such sensors take readings at a high rate and, within a relatively short time, produce time series of tens of millions to billions of elements. The data collected from the sensors are accumulated and mined to make strategically important decisions. Time series processing requires specific system software, different from existing relational DBMSs and NoSQL systems. Time series processing systems must provide, on the one hand, efficient operations for appending new atomic values arriving in streaming mode and, on the other hand, efficient mining operations in which the time series is treated as a whole. The article discusses the specifics of time series processing in comparison with data of relational and non-relational nature, and gives formal definitions of the basic time series mining problems. An overview of the main capabilities of the three most popular modern time series processing systems is presented: InfluxDB, OpenTSDB, and TimescaleDB. Keywords: time series processing and mining, NoSQL, relational DBMS, InfluxDB, OpenTSDB, TimescaleDB.

Parallel Computational Technologies
Communications in Computer and Information Science, 2019
Efficiency is a major weakness in modern supercomputers. Low efficiency of user applications is one of the main reasons for that. There are many software tools for analyzing and improving the performance of parallel applications. However, supercomputer users often do not have sufficient knowledge and skills to apply these tools correctly in their specific case. Moreover, users often do not know that their applications work inefficiently. The main goal of our project is to help any HPC user to detect performance flaws in their applications and find out how to deal with them. To this end, we plan to develop an open-source software solution that performs automatic massive analysis of all jobs running on a supercomputer to identify those with efficiency issues and helps users to conduct a detailed analysis of an individual program (using existing software tools) to identify and eliminate the root causes of the loss of efficiency.

Mathematics
Summarization of a long time series often occurs in analytical applications related to decision-making, modeling, planning, and so on. Informally, summarization aims at discovering a small set of typical patterns (subsequences) that briefly represent the long time series. Apparent approaches to summarization, such as motifs, shapelets, and cluster centroids, either require training data or do not provide an analyst with information on what fraction of the time series a typical subsequence corresponds to. The recently introduced time series snippet concept overcomes the above-mentioned limitations. A snippet is a subsequence that is similar to many other subsequences of the time series with respect to a specially defined similarity measure based on the Euclidean distance. However, the original Snippet-Finder algorithm has cubic time complexity with respect to the lengths of the time series and the snippet. In this article, we propose the PSF (Parallel Snippet-Find...
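The snippet idea can be illustrated with a deliberately simplified sketch (assumption: plain Euclidean distance between fixed, non-overlapping segments stands in for the MPdist-based measure used by Snippet-Finder): each candidate segment is scored by the fraction of segments for which it is the nearest representative.

```python
# Toy snippet scoring: split the series into non-overlapping segments of
# length m, then score each candidate by the fraction of segments that
# it represents (i.e., for which it is the nearest candidate).
import math

def snippet_fraction(ts, m):
    segs = [ts[i:i + m] for i in range(0, len(ts) - m + 1, m)]
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    fractions = []
    for c in segs:
        owned = sum(1 for s in segs if min(segs, key=lambda x: dist(s, x)) == c)
        fractions.append(owned / len(segs))
    return fractions

# Three "flat" segments and one "spike" segment: the flat pattern covers
# 3/4 of the series, the spike 1/4.
ts = [0, 0, 0, 0, 5, 5, 0, 0]
print(snippet_fraction(ts, 2))  # → [0.75, 0.75, 0.25, 0.75]
```

The real algorithm evaluates every subsequence as a candidate under MPdist, which is what drives the cubic complexity that PSF parallelizes.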
Вестник Южно-Уральского государственного университета, Jun 13, 2019

Вестник Южно-Уральского государственного университета, Sep 15, 2021
Currently, large time series are used in a wide range of subject areas. Modern time series DBMSs (TSDBMSs), however, offer a modest set of built-in data mining tools. The use of third-party time series mining systems leads to undesirable overhead costs for exporting data outside the TSDBMS, converting data, and importing analysis results back. At the same time, embedding data mining methods into relational DBMSs (RDBMSs), which dominate the market of data management tools, is a topical issue. However, there are still no implementations of time series mining methods in RDBMSs. The article proposes an approach to the management and mining of time series data within an RDBMS based on the matrix profile concept. A matrix profile is a data structure that, for each subsequence of a time series, stores the index of and the distance to its nearest neighbor. The matrix profile serves as the basis for detecting motifs, anomalies, and other primitives of time series mining. The proposed approach is implemented in the PostgreSQL RDBMS. The experimental results showed a higher efficiency of the proposed approach compared to the TSDBMSs InfluxDB and OpenTSDB.
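The matrix profile structure defined above can be computed naively as follows (a quadratic brute-force sketch for illustration only; practical implementations rely on much faster algorithms than this all-pairs scan):

```python
# Naive matrix profile: for each subsequence, store the distance to and
# the index of its nearest non-trivial (non-overlapping) neighbor.
import math

def matrix_profile(ts, m):
    n = len(ts) - m + 1
    subs = [ts[i:i + m] for i in range(n)]
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    profile, index = [], []
    for i in range(n):
        # exclude trivial matches that overlap the query subsequence
        cands = [(dist(subs[i], subs[j]), j) for j in range(n) if abs(i - j) >= m]
        d, j = min(cands)
        profile.append(d)
        index.append(j)
    return profile, index

# A perfectly repeating series: every subsequence has an exact match,
# so the whole profile is zero (a motif everywhere, no discord).
prof, idx = matrix_profile([0, 1, 0, 1, 0, 1, 0], 2)
print(min(prof))  # → 0.0
```

Low profile values indicate motifs and high values indicate discords, which is why this single structure supports several mining primitives at once.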