Unstructured Data

2023, Journal of Big Data

Key finding: Systematic literature review revealed task-specific and data-type-related challenges in IE subtasks such as named entity recognition and relation extraction for text, visual relationship detection for images, audio event... Read more

Limitations of information extraction methods and techniques for heterogeneous unstructured big data

2023, International Journal of Engineering Business Management

Key finding: This structured literature review highlights that existing IE techniques are predominantly designed for single unstructured data types and struggle to efficiently process multifaceted unstructured big data. It identifies the... Read more

An analytical study of information extraction from unstructured and multidimensional big data

2023, Journal of Big Data

Key finding: By conducting an exhaustive survey on IE across text, images, audio, and video, the paper synthesizes knowledge on subtasks and associated techniques, illustrating gaps in handling unstructured big data's multidimensionality... Read more

2. How can the usability of unstructured text in big data analytics be enhanced to improve insight extraction?

This research area focuses on understanding and improving the practical usability of unstructured textual data within big data analytics workflows. Unstructured text poses technical and conceptual challenges that degrade its usability in analytics contexts. Models and validation techniques that address these challenges aim to streamline the path from raw data to knowledge extraction by accounting for subjective intentions and contextual needs.

Usability enhancement model for unstructured text in big data

by Khor Wang

2024, Journal of Big Data

Key finding: Through a systematic literature review and Delphi validation method with industry and academic experts, this work developed and validated a usability enhancement model for unstructured text data. The model identifies key... Read more

An analytical study of information extraction from unstructured and multidimensional big data

2023, Journal of Big Data

Key finding: The paper's identification of challenges in IE subtasks for unstructured text supports the need for usability enhancement by highlighting data preprocessing and transformation limitations. These insights contribute to... Read more

Data Preparation: A Technological Perspective and Review

by Pavel Pankin

2023, SN Computer Science

Key finding: The review elucidates the critical role of data preparation, including profiling, matching, format transformation, and cleaning, as foundational processes enhancing unstructured textual data usability for analysis. It... Read more

3. What algorithmic and machine learning methods can reconstruct or extract structure from partially structured or uncertain unstructured data?

This theme addresses the problem of restoring structure—a core challenge when dealing with messy, semi-structured, or corrupted unstructured datasets that lack consistent schemas. Techniques that infer latent table or relational structures using supervised or unsupervised machine learning provide a pathway to convert unstructured data into analyzable formats. This has broad implications where data export or storage processes disrupt original organization, necessitating intelligent recovery mechanisms.
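
As a rough illustration of what recovering structure from a messy delimited file can involve, the sketch below guesses a delimiter by checking which candidate character occurs a consistent, non-zero number of times per line, then keeps only the rows that match the most common column count. It is a simple frequency heuristic written for this page, not the method of any paper listed here; the candidate delimiter set and the function names `infer_delimiter` and `restore_table` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical candidate set; real files may use other separators.
CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]


def infer_delimiter(lines):
    """Return the candidate whose per-line count is most consistent, or None."""
    best, best_score = None, -1.0
    for delim in CANDIDATE_DELIMITERS:
        counts = [line.count(delim) for line in lines if line.strip()]
        if not counts:
            continue
        mode_count, mode_freq = Counter(counts).most_common(1)[0]
        if mode_count == 0:
            # The typical line does not contain this character at all.
            continue
        score = mode_freq / len(counts)  # share of lines agreeing with the mode
        if score > best_score:
            best, best_score = delim, score
    return best


def restore_table(lines):
    """Split lines on the inferred delimiter and separate out irregular rows."""
    delim = infer_delimiter(lines)
    if delim is None:
        return None, [], list(lines)
    rows = [line.split(delim) for line in lines if line.strip()]
    width = Counter(len(row) for row in rows).most_common(1)[0][0]
    table = [row for row in rows if len(row) == width]
    rejected = [row for row in rows if len(row) != width]
    return delim, table, rejected


sample = [
    "id;name;city",
    "1;Ann;Oslo",
    "free text note, with a comma",
    "2;Bob;Rome",
]
delimiter, table, rejected = restore_table(sample)
```

Rows that do not fit the dominant column count are returned separately rather than discarded, so a later step (manual review or a learned model) can decide what to do with them.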

Restoration of Data Structures Using Machine Learning Techniques

by Branislava Cvijetic

2023, IEEE Access

Key finding: The paper proposes the STCExtract algorithm, a two-phase machine learning-based approach for reconstructing table structures and columns from messy, unstructured delimited files. Using clustering algorithms (k-means,... Read more

Algorithmic methods to explore the automation of the appraisal of structured and unstructured digital data

by Basma Makhlouf Shabou

2022, Records Management Journal

Key finding: This interdisciplinary study developed computational scoring methods and 33 software functionalities for appraisal of diverse record formats combining archival metrics with data mining techniques. The evaluation showed that... Read more

ORA-SS: An object-relationship-attribute model for semi-structured data

by Gillian Dobbie

2025, Citeseer

Key finding: The ORA-SS model advances the semantic representation of semi-structured data by explicitly distinguishing objects, relationships, and attributes including attributes of relationships. This richer schema supports more... Read more

All papers in Unstructured Data

Abordagem para integração automática de dados estruturados e não estruturados em um contexto Big Data

by Keylla Saes

2025

Corrected version containing the changes requested by the examining committee on 22 November 2018. The original version is held in the reserved collection of the EACH-USP Library and in the USP Digital Library of Theses and Dissertations (Biblioteca Digital de Teses e Dissertações da USP)... more

Conversion of Unstructured to Structured: A Solution Using Data Science and NOSQL

by shagufta praveen

2025, Revista GEINTEC

Exponentially rise of unstructured data is a question to all data scientists today. The graph of unstructured data is at such height that it consumes most of the storage of all clouds present today. Analysis through unstructured data is... more

A Big Data and Semantics Assisted Knowledge Representation Model for Skill Mapping using Qualitative Approach

by Hinweis Science and Engineering

2025, Hinweis Science and Engineering

Conventional systems are inundated by unstructured Big Data and solicit for models adept of managing such data. Recruitment related data is represented using Semantic technologies such as Resource Description Framework and ResumeRDF... more

The quantification of unstructured information and its use in predictive modeling

by Matthias Blume

2025, IEEE Systems and Information Engineering Design Symposium, 2003

Managing text-based information is crucial when trying to extract valuable information from documents. Assigning a numerical value to the text-based (unstructured) information is one of the ways to extract value. This research studied the... more

Incorporating topic membership in review rating prediction from unstructured data: a gradient boosting approach

by Dimitris Zissis

2025, Annals of Operations Research

Rating prediction is a crucial element of business analytics as it enables decision-makers to assess service performance based on expressive customer feedback. Enhancing rating score predictions and demand forecasting through... more

Integration and Analysis of Unstructured Data for Decision Making: Text Analytics Approach

by Ise A Orobor

2025, International Journal of Open Information Technologies

Relational Database Management System (RDBMS) which is highly relied on by organizations for decision making are limited in their design to integrate and analyze data from unstructured sources. Research has shown that large part of... more

The GOOSE Dataset for Perception in Unstructured Environments

by Thorsten Luettel

2025, arXiv (Cornell University)

Fig. 1. The GOOSE dataset was recorded over the course of a year and covers all seasons and a wide range of weather conditions. The first column shows color images recorded with the RGB+NIR (near-infrared) camera. The second column is the... more

Applying Causal Machine Learning to Spatiotemporal Data Analysis: An Investigation of Opportunities and Challenges

by christian mulomba

2025, IEEE Access

Traditional spatiotemporal data analysis often relies on predictive models that overlook causal relationships, making it difficult to identify true drivers and formulate effective interventions. To bridge this gap, we review causal machine learning (CML) techniques for spatiotemporal data, aiming to provide robust insights into their unique advantages. Our literature review reveals that fewer than 1% of studies in major databases explicitly integrate CML with spatiotemporal analysis. After rigorous screening, we analyze 51 relevant papers, categorizing their contributions into four key areas (totaling 62 methodological approaches due to multi-category papers): 1) causal effect discovery and estimation (32 approaches), 2) prediction accuracy enhancement (19), 3) pattern recognition limitations (10), and 4) interpretability (1). This distribution highlights a critical research gap, particularly in interpretability and comprehensive frameworks. We further examine unique challenges in spatiotemporal data, such as spatial autocorrelation and temporal dependencies, that complicate causal inference but also present opportunities for innovation. Promising approaches include the synergy of spatiotemporal Granger causality and structural equation modeling with spatial lags, which capture complex interdependencies while preserving interpretability. Future directions include developing interpretable causal models, advancing real-time causal inference in dynamic environments, and addressing computational challenges (scalability, efficiency, and complexity-interpretability trade-offs). We also discuss ethical considerations, such as bias mitigation in causal discovery and societal implications of spatiotemporal causal inference. By synthesizing challenges and opportunities, this work advances the application of CML in spatiotemporal analysis, with implications for climate science, economics, epidemiology, and urban planning.
INDEX TERMS Causal machine learning, spatiotemporal data analysis, synergy methods, ethics.

Delta lake

by mostafa mokhtar

2025, Proceedings of the VLDB Endowment

Cloud object stores such as Amazon S3 are some of the largest and most cost-effective storage systems on the planet, making them an attractive target to store large data warehouses and data lakes. Unfortunately, their implementation as... more

Integrating Structured and Unstructured EHR Data Using an FHIR-based Type System: A Case Study with Medication Data

by Hongfang Liu

2025, AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science

Standards-based modeling of electronic health records (EHR) data holds great significance for data interoperability and large-scale usage. Integration of unstructured data into a standard data model, however, poses unique challenges... more

A common type system for clinical natural language processing

by Hongfang Liu

2025

Background: One challenge in reusing clinical data stored in electronic medical records is that these data are heterogenous. Clinical Natural Language Processing (NLP) plays an important role in transforming information in clinical text... more

NLP Based Clinical Data Analysis for Assessing Readmissions of Patients with COPD

by Priyanka Medhe

2025

Natural language processing is a computer science field, which focuses on interactions between computers and human (natural) languages. The human languages are ambiguous unlike computer languages, which make its analysis and processing... more

A formalism to model Spanish texts as algebraic structures

by Edgardo Samuel Barraza Verdesoto

2025

The text has been treated all the time like unstructured information because, apparently, its components, do not have properties that permit management like numbers. This short paper resumes a proposal that changes this approach by... more

AI Driven Productivity Enhancing Blue-Collar Workers Efficency in Smart Warehousing -A Comphrensive Review in Logistic Industry

by IJRASET Publication

2025, International Journal for Research in Applied Science & Engineering Technology (IJRASET)

The logistics industry is undergoing a transformative shift due to the integration of Industry 4.0 technologies, particularly artificial intelligence (AI). Smart warehouse management systems (WMS) are increasingly utilizing AI to enhance... more

Enhancing Machine Learning Methods for Robust Real-Time Classification of Bilingual Documents

by santhosh SG

2025

Understanding the critical role of auditing and compliance in complex and regulated environments

by Dwaraka Nath Kummari

2025, Deep Science Publishing

Information Fusion for Entity Matching in Unstructured Data

by omar ali

2025, IFIP Advances in Information and Communication Technology

Every day the global media system produces an abundance of news stories, all containing many references to people. An important task is to automatically generate reliable lists of people by analysing news content. We describe a system... more

A semantic approach to enable data integration for the domain of flood risk management

by Barry Hankin

2025, Environmental Challenges

With so many things around us continuously producing and processing data, be it mobile phones, or sensors attached to devices, or satellites sitting thousands of kilometres above our heads, data is becoming increasingly heterogeneous.... more

Creation of unstructured big data from customer service

by Arda Gezdur

2025, The International Journal of Logistics Management

Purpose Customer service provision is a growing phenomenon on social media and parcel shipping companies have been among the most prominent adopters. This has coincided with greater interest in the development of analysis techniques for... more

Utilizing Cloud Technologies To Reduce Bottlenecks In Retirement Claim Approvals For Scalable And Efficient Processing

by Akshay Sharma

2025, INTERNATIONAL JOURNAL OF CURRENT SCIENCE (IJCSPUB)

The processing of retirement claims is complicated and ineffective because of human labor processes, disjointed data systems, and outdated technology. Retirees are frustrated, get an error or delay and feel like they have been left behind... more

Utilizing Cloud Technologies To Reduce Bottlenecks In Retirement Claim Approvals For Scalable And Efficient Processing

by Satish Kabade

2025, INTERNATIONAL JOURNAL OF CURRENT SCIENCE - (IJCSPUB)

Evaluating the Role of RDBMS and Data Warehousing in Modern Database Management

by Riya Raj Singh

2025

Automate Data Classification in an Unstructured Data Flow using Self-Organizing Maps

by Dilushinie Fernando

2025, Zenodo (CERN European Organization for Nuclear Research)

Nowadays, when protecting the information of an organization, professionals would consider the level of confidentiality and sensitivity of the data as a major concern. This is reflected in a manual process where ideas, decisions, and... more

Text Classification Using Hybrid Machine Learning Algorithms on Big Data

by Ikechukwu E . Onyenwe

2025, arXiv (Cornell University)

Recently, there are unprecedented data growth originating from different online platforms which contribute to big data in terms of volume, velocity, variety and veracity (4Vs). Given this nature of big data which is unstructured,... more

Survey on Sentiment Analysis and its Classification Technique

by Preeti Suryawanshi

2025

In today's world everyone is trying different product and services. They are always commenting on such things on microblogging sites like Twitter and Facebook. Sentiment analysis also known as opinion mining used for finding the polarity... more

House price prediction with Convolutional Neural Network (CNN)

by Mohit Jain

2025, World Journal of Advanced Engineering Technology and Sciences

In this paper, we examine the applicability of Convolutional Neural Networks (CNNs) for predicting the cost of houses with an inclusion of visual and non-visual elements. Traditional end-to-end patterns of supervised machine learning... more

Artificial Intelligence-Powered Precision Medicine for Cardiovascular Disease Prevention and Management

by Yudi Kurniawan Budi Susilo

2025

Artificial intelligence (AI) is transforming precision medicine, particularly in cardiovascular disease prevention and management. This bibliometric analysis examines the research landscape from 2020 to 2024, focusing on AI's role in... more

Ontology Based Hotel Information Extraction from Unstructured Text

by Amy Aung

2025, International Conference on Advances in Engineering and Technology (ICAET'2014) March 29-30, 2014 Singapore

Ontologies play a central role in the Semantic Web and in many other technological developments. Multiple ontology-based approaches, loosely grouped under the heading 'semantic interoperability', have come to the fore as potential... more

Unlocking the Power of EHRs: Harnessing Unstructured Data for Machine Learning-based Outcome Predictions

by omid jafarinezhad

2025, medRxiv (Cold Spring Harbor Laboratory)

The integration of Electronic Health Records (EHRs) with Machine Learning (ML) models has become imperative in examining patient outcomes due to the vast amounts of clinical data they provide. However, critical information regarding... more

Extracting social determinants of health from electronic health records using natural language processing: a systematic review

by Myrna Weissman

2025, Journal of the American Medical Informatics Association

Objective: Social determinants of health (SDoH) are nonclinical dispositions that impact patient health risks and clinical outcomes. Leveraging SDoH in clinical decision-making can potentially improve diagnosis, treatment planning, and... more

ENHANCING R&D KNOWLEDGE MANAGEMENT: INTEGRATING LARGE LANGUAGE MODELS WITH NOSQL DATABASES FOR EXPERIMENT DOCUMENTATION ACCESS

by Tanmoy Biswas

2025, IAEME Publication

The integration of Large Language Models (LLMs) with NoSQL databases offers a novel solution for managing and retrieving research and development (R&D) experiment documentation. This paper investigates the architectural design, real-world... more

Performance Evaluation of Structured and Unstructured Data in PIG/HADOOP and MONGO-DB Environments

by Sri Ram

2025

The exponential development of data initially exhibited difficulties for prominent organizations, for example, Google, Yahoo, Amazon, Microsoft, Facebook, Twitter and so forth. The size of the information that needs to be handled by cloud... more

Learning Deep Parameterized Skills from Demonstration for Re-targetable Visuomotor Control

by Devesh jha

2025, ArXiv

Robots need to learn skills that can not only generalize across similar problems but also be directed to a specific goal. Previous methods either train a new skill for every different goal or do not infer the specific target in the... more

Automated ETL Pipelines for Modern Data Warehousing: Architectures, Challenges, and Emerging Solutions

by Deepak Chanda

2025, The Eastasouth Journal of Information System and Computer Science

The paper addresses the evolution of automated Extract, Transform, Load (ETL) pipelines in contemporary data warehousing environments, highlighting their essential role in enabling timely analytics and business intelligence. Recent... more

Scaling-up assessment from a contextual behavioral science perspective: Potential uses of technology for analysis of unstructured text data

by Karen Kate Kellum

2025, Journal of Contextual Behavioral Science

With technological advancement, we have increased access to unstructured text data and the means to analyze it. Such information is available from electronic health records, blogs, social media posts, and other sources and is being used... more

INTERNAL WORKLOAD OF NOSQL RELATIONAL DATABASE

by Mirzakhmet Syzdykov

2025

In this article we present the scheme of internal workload of typical relational database supporting NoSQL human interaction protocol, we state that the production consumption of query complexity cannot be avoided and modern techniques... more

Deep Learning Frameworks for Multi-Modal Data Fusion in Retail Supply Chains: Enhancing Forecast Accuracy and Agility

by Phanish Lakkarasu

2025, JAIBDD

Traffic flow forecasting is a key problem of intelligent transport systems and represents a challenging task due to the spatial-temporal correlation features and long temporal interdependence of the considered data. Conventional methods deal with this either by spatial forecasting given observed counts at previous times or by temporal forecasting given observed traffic counts in neighbouring locations. In order to fully exploit the spatio-temporal properties observed in the data, a hybrid multimodal deep learning method for short-term traffic flow forecasting called HaMDeepT is proposed. Specifically, the HaMDeepT method can jointly and adaptively learn the spatial-temporal correlation features and long temporal interdependence of multi-modality traffic data through an attention-based auxiliary multimodal deep learning architecture. The base module of this method consists of a 1D CNN and GRU with the attention mechanism. The forecasted spatio-temporal traffic demand (counts of traffic passing through different locations at regular time intervals) is dependent on far more critical spatial factors than other point sensors such as weather stations. This HaMDeepT method, in terms of 3D CNN-GRU, which uses a stack of 3D Convolutional Neural Networks (3D CNN) and Gated Recurrent Units (GRUs) combined with the Correlation and Relative Operation layers to model both the spatial context features and the temporal dependencies of traffic count data at all locations, has a better performance compared to other network architectures. It overcomes the drawback of a fixed and handcrafted graph Laplacian matrix representation of the spatial relationships of the locations used by the ST-Graph. It uses the Correlation layer to estimate the spatial correlation features for each traffic count data point with others, focusing on the stations with major impacts on the target location, and the Relative Operation layer to model the relative distances thereafter. 
Using these novel methods, the traffic flow forecasting results for the miniNYC dataset are more accurate and offer a more intuitive visualisation of the spatial structure that affects the performance of the predictions.

Deep Learning Frameworks for Multi-Modal Data Fusion in Retail Supply Chains: Enhancing Forecast Accuracy and Agility

by Srinivas Kalisetty

2025, Journal of Artificial intelligence and Big Data Disciplines (JAIBDD)

Question answering systems: the story till the Arabic linked data

by N. Doumi

2025, International Journal of Artificial Intelligence and Soft Computing

Question answering system (QAS) is essential to satisfy the need to query information available in various formats, including structured data (ontology, databases) or unstructured data (document, web). The QAS provides a correct response... more

Oracle ETL Tools and AI Integration: New Data Management Approach

by Kiran Veernapu

2025, International Journal of Multidisciplinary Research and Growth Evaluation

Industries like Healthcare produce enormous amounts of data. Collecting, cleaning, and processing the data to make the data available for deep insights is a greater need in today's competitive world. This process of data integration and... more

Leveraging MongoDB Multi-Sharding to Decrease Latency to Store and Retrieve Fuel Transaction

by Rohith Varma Vegesna

2025, Journal of Artificial Intelligence, Machine Learning and Data Science

The exponential growth of fuel transactions necessitates highly efficient storage and retrieval systems to facilitate real-time operational analytics, fraud detection and decision-making. Traditional relational database systems face... more

Using Text Mining to Analyze Quality Aspects of Unstructured Data: A Case Study for "stock-touting" Spam Emails

by Mohamed Zaki

2025

This material is brought to you by the Americas Conference on Information Systems (AMCIS) at AIS Electronic Library (AISeL). It has been accepted for inclusion in AMCIS 2010 Proceedings by an authorized administrator of AIS Electronic... more

Dai Medicine Knowledge Graph Based on Bidirectional Extraction Technology: Precise Resolution of Entity Relations and Graph Construction

by liming liu

2025, Dai Medicine Knowledge Graph Based on Bidirectional Extraction Technology: Precise Resolution of Entity Relations and Graph Construction

This paper proposes a novel approach to constructing a Dai medicine knowledge graph, which is based on a bidirectional entity-relation joint extraction framework. Dai medicine, recognized as one of the four major traditional ethnic... more

Graph Model for Detection of text unstructured data such as Sarcasm

by Armando Jipsion

2025, Zenodo (CERN European Organization for Nuclear Research)

Sarcasm is frequently characterized as verbal incongruity to communicate scorn. It is a nuanced type of language with which people express something contrary to what is suggested. Perhaps the greatest test in building frameworks to... more

Relation extraction in structured and unstructured data: a comparative investigation on smartphone titles in the e-commerce domain

by Livy Real

2025, Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana (STIL 2021)

As large amounts of unstructured data are generated on a regular basis, expressing or storing knowledge in a way that is useful remains a challenge. In this context, Relation Extraction (RE) is the task of automatically identifying... more

An Appropriate Big Data Clusters with K-Means Method

by Sandip Kahate

2025, International Journal of Advance Research and Innovative Ideas in Education

The problems related to big data are increasing now days since the arrival of data becomes faster and existing techniques are not capable to handle such big data. Big data shows various attributes due to that complexity and problems... more

Understanding Benefits and Limitations of Unstructured Data Collection for Repurposing Organizational Data

by Roman Lukyanenko

2025, Lecture notes in business information processing

With the growth of machine learning and other computationally intensive techniques for analyzing data, new opportunities emerge to repurpose organizational information sources. In this study, we explore the effectiveness of unstructured... more

Toward theory and method of hybrid data collection

by Roman Lukyanenko

2025, Design Science Research in Information Systems and Technology

Enhanced Feature Model Based Hybrid Neural Network for Text Detection on Signboard, Billboard and News Tickers

by SATHISHKUMAR Veerappampalayam Easwaramoorthy

2025, IEEE Access

Recognizing text from the nature scene images and videos has been the challenging task of computer vision and machine learning research community in recent years. These texts are difficult to recognize because of their shapes, complex... more