Academia.edu

Unstructured Data

600 papers
14 followers
About this topic
Unstructured data refers to information that does not have a predefined data model or organization, making it challenging to collect, process, and analyze. It encompasses various formats, including text, images, audio, and video, and is often characterized by its lack of a specific structure or schema.

Key research themes

1. What are the challenges and limitations of information extraction methods across different types of unstructured big data?

This line of research investigates the effectiveness and limitations of information extraction (IE) techniques when applied to various unstructured data types such as text, images, audio, and video, especially in the context of large-scale (big) data. Its importance stems from the need to transform heterogeneous, high-volume unstructured data into structured formats usable for analytics and decision-making. Understanding these challenges is crucial to improve IE systems' scalability, accuracy, and applicability across multidimensional unstructured datasets.

Key finding: Systematic literature review revealed task-specific and data-type-related challenges in IE subtasks such as named entity recognition and relation extraction for text, visual relationship detection for images, audio event... Read more
Key finding: This structured literature review highlights that existing IE techniques are predominantly designed for single unstructured data types and struggle to efficiently process multifaceted unstructured big data. It identifies the... Read more
Key finding: By conducting an exhaustive survey on IE across text, images, audio, and video, the paper synthesizes knowledge on subtasks and associated techniques, illustrating gaps in handling unstructured big data's multidimensionality... Read more

2. How can the usability of unstructured text in big data analytics be enhanced to improve insight extraction?

This research area focuses on understanding and improving the practical usability of unstructured textual data within big data analytics workflows. Researchers recognize that unstructured text presents unique technical and conceptual challenges that degrade usability in analytics contexts. Addressing these issues with models and validation techniques aims to optimize the process from raw data to insightful knowledge extraction by accounting for subjective intentions and contextual needs.

Key finding: Through a systematic literature review and Delphi validation method with industry and academic experts, this work developed and validated a usability enhancement model for unstructured text data. The model identifies key... Read more
Key finding: The paper's identification of challenges in IE subtasks for unstructured text supports the need for usability enhancement by highlighting data preprocessing and transformation limitations. These insights contribute to... Read more
Key finding: The review elucidates the critical role of data preparation, including profiling, matching, format transformation, and cleaning, as foundational processes enhancing unstructured textual data usability for analysis. It... Read more

3. What algorithmic and machine learning methods can reconstruct or extract structure from partially structured or uncertain unstructured data?

This theme addresses the problem of restoring structure—a core challenge when dealing with messy, semi-structured, or corrupted unstructured datasets that lack consistent schemas. Techniques that infer latent table or relational structures using supervised or unsupervised machine learning provide a pathway to convert unstructured data into analyzable formats. This has broad implications where data export or storage processes disrupt original organization, necessitating intelligent recovery mechanisms.

Key finding: The paper proposes the STCExtract algorithm, a two-phase machine learning-based approach for reconstructing table structures and columns from messy, unstructured delimited files. Using clustering algorithms (k-means,... Read more
Key finding: This interdisciplinary study developed computational scoring methods and 33 software functionalities for appraisal of diverse record formats combining archival metrics with data mining techniques. The evaluation showed that... Read more
Key finding: The ORA-SS model advances the semantic representation of semi-structured data by explicitly distinguishing objects, relationships, and attributes including attributes of relationships. This richer schema supports more... Read more

All papers in Unstructured Data

Corrected version containing the changes requested by the examining committee on November 22, 2018. The original version is held in the reserved collection of the EACH-USP Library and in the USP Digital Library of Theses and Dissertations... more
The exponential rise of unstructured data is a challenge for all data scientists today. The volume of unstructured data has grown so much that it consumes most of the cloud storage available today. Analysis through unstructured data is... more
Conventional systems are inundated by unstructured Big Data and call for models capable of managing such data. Recruitment-related data is represented using Semantic technologies such as Resource Description Framework and ResumeRDF... more
Managing text-based information is crucial when trying to extract valuable information from documents. Assigning a numerical value to the text-based (unstructured) information is one of the ways to extract value. This research studied the... more
Rating prediction is a crucial element of business analytics as it enables decision-makers to assess service performance based on expressive customer feedback. Enhancing rating score predictions and demand forecasting through... more
Relational Database Management Systems (RDBMS), which organizations rely on heavily for decision making, are limited by design in their ability to integrate and analyze data from unstructured sources. Research has shown that a large part of... more
Fig. 1. The GOOSE dataset was recorded over the course of a year and covers all seasons and a wide range of weather conditions. The first column shows color images recorded with the RGB+NIR (near-infrared) camera. The second column is the... more
Traditional spatiotemporal data analysis often relies on predictive models that overlook causal relationships, making it difficult to identify true drivers and formulate effective interventions. To bridge this gap, we review causal... more
Cloud object stores such as Amazon S3 are some of the largest and most cost-effective storage systems on the planet, making them an attractive target to store large data warehouses and data lakes. Unfortunately, their implementation as... more
Standards-based modeling of electronic health records (EHR) data holds great significance for data interoperability and large-scale usage. Integration of unstructured data into a standard data model, however, poses unique challenges... more
Background: One challenge in reusing clinical data stored in electronic medical records is that these data are heterogenous. Clinical Natural Language Processing (NLP) plays an important role in transforming information in clinical text... more
Natural language processing is a field of computer science that focuses on interactions between computers and human (natural) languages. Unlike computer languages, human languages are ambiguous, which makes their analysis and processing... more
Text has long been treated as unstructured information because its components apparently lack properties that permit it to be managed like numbers. This short paper summarizes a proposal that changes this approach by... more
The logistics industry is undergoing a transformative shift due to the integration of Industry 4.0 technologies, particularly artificial intelligence (AI). Smart warehouse management systems (WMS) are increasingly utilizing AI to enhance... more
Every day the global media system produces an abundance of news stories, all containing many references to people. An important task is to automatically generate reliable lists of people by analysing news content. We describe a system... more
With so many things around us continuously producing and processing data, be it mobile phones, or sensors attached to devices, or satellites sitting thousands of kilometres above our heads, data is becoming increasingly heterogeneous.... more
Purpose Customer service provision is a growing phenomenon on social media and parcel shipping companies have been among the most prominent adopters. This has coincided with greater interest in the development of analysis techniques for... more
The processing of retirement claims is complicated and ineffective because of manual processes, disjointed data systems, and outdated technology. Retirees are frustrated, encounter errors or delays, and feel like they have been left behind... more
Nowadays, when protecting the information of an organization, professionals would consider the level of confidentiality and sensitivity of the data as a major concern. This is reflected in a manual process where ideas, decisions, and... more
Recently, there has been unprecedented data growth originating from different online platforms, which contributes to big data in terms of volume, velocity, variety and veracity (4Vs). Given this nature of big data which is unstructured,... more
In today's world, everyone is trying different products and services and commenting on them on microblogging sites like Twitter and Facebook. Sentiment analysis, also known as opinion mining, is used for finding the polarity... more
In this paper, we examine the applicability of Convolutional Neural Networks (CNNs) for predicting the cost of houses with an inclusion of visual and non-visual elements. Traditional end-to-end patterns of supervised machine learning... more
Artificial intelligence (AI) is transforming precision medicine, particularly in cardiovascular disease prevention and management. This bibliometric analysis examines the research landscape from 2020 to 2024, focusing on AI's role in... more
Ontologies play a central role in the Semantic Web and in many other technological developments. Multiple ontology-based approaches, loosely grouped under the heading 'semantic interoperability', have come to the fore as potential... more
The integration of Electronic Health Records (EHRs) with Machine Learning (ML) models has become imperative in examining patient outcomes due to the vast amounts of clinical data they provide. However, critical information regarding... more
Objective: Social determinants of health (SDoH) are nonclinical dispositions that impact patient health risks and clinical outcomes. Leveraging SDoH in clinical decision-making can potentially improve diagnosis, treatment planning, and... more
The integration of Large Language Models (LLMs) with NoSQL databases offers a novel solution for managing and retrieving research and development (R&D) experiment documentation. This paper investigates the architectural design, real-world... more
The exponential growth of data initially posed difficulties for prominent organizations such as Google, Yahoo, Amazon, Microsoft, Facebook, and Twitter. The size of the information that needs to be handled by cloud... more
Robots need to learn skills that can not only generalize across similar problems but also be directed to a specific goal. Previous methods either train a new skill for every different goal or do not infer the specific target in the... more
The paper addresses the evolution of automated Extract, Transform, Load (ETL) pipelines in contemporary data warehousing environments, highlighting their essential role in enabling timely analytics and business intelligence. Recent... more
With technological advancement, we have increased access to unstructured text data and the means to analyze it. Such information is available from electronic health records, blogs, social media posts, and other sources and is being used... more
In this article we present the scheme of internal workload of typical relational database supporting NoSQL human interaction protocol, we state that the production consumption of query complexity cannot be avoided and modern techniques... more
Traffic flow forecasting is a key problem of intelligent transport systems and represents a challenging task due to the spatial-temporal correlation features and long temporal interdependence of the considered data. Conventional methods... more
Question answering system (QAS) is essential to satisfy the need to query information available in various formats, including structured data (ontology, databases) or unstructured data (document, web). The QAS provides a correct response... more
Industries like healthcare produce enormous amounts of data. Collecting, cleaning, and processing the data to make it available for deep insights is a pressing need in today's competitive world. This process of data integration and... more
The exponential growth of fuel transactions necessitates highly efficient storage and retrieval systems to facilitate real-time operational analytics, fraud detection and decision-making. Traditional relational database systems face... more
This material is brought to you by the Americas Conference on Information Systems (AMCIS) at AIS Electronic Library (AISeL). It has been accepted for inclusion in AMCIS 2010 Proceedings by an authorized administrator of AIS Electronic... more
This paper proposes a novel approach to constructing a Dai medicine knowledge graph, which is based on a bidirectional entity-relation joint extraction framework. Dai medicine, recognized as one of the four major traditional ethnic... more
Sarcasm is frequently characterized as verbal incongruity to communicate scorn. It is a nuanced type of language with which people express something contrary to what is suggested. Perhaps the greatest test in building frameworks to... more
As large amounts of unstructured data are generated on a regular basis, expressing or storing knowledge in a way that is useful remains a challenge. In this context, Relation Extraction (RE) is the task of automatically identifying... more
Problems related to big data are increasing nowadays because data arrives faster and existing techniques are not capable of handling such big data. Big data shows various attributes due to that complexity and problems... more
With the growth of machine learning and other computationally intensive techniques for analyzing data, new opportunities emerge to repurpose organizational information sources. In this study, we explore the effectiveness of unstructured... more
Recognizing text in natural scene images and videos has been a challenging task for the computer vision and machine learning research community in recent years. These texts are difficult to recognize because of their shapes, complex... more