This paper describes Bibster, a Peer-to-Peer system for exchanging bibliographic metadata among researchers. We show how Bibster exploits ontologies in data-representation, query formulation, query routing, and query result presentation.... more
This paper describes the design and implementation of Bibster, a Peer-to-Peer system for exchanging bibliographic data among researchers. Bibster exploits ontologies in data-storage, query formulation, query-routing and answer... more
A novel access structure for similarity search in metric data, called Similarity Hashing (sH), is proposed. Its multi-level hash structure of separable buckets on each level supports easy insertion and bounded search costs, because at... more
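The bucketing idea behind hash-based similarity search in metric data can be illustrated with a single ball-partitioning split into "separable" buckets plus an exclusion set. This is a minimal sketch in that spirit; the pivot, the split parameters, and the single level are illustrative simplifications, not the paper's actual multi-level SH structure.

```python
# Minimal sketch: one ball-partitioning split into two separable buckets plus an
# exclusion set, so a range query with radius r <= rho visits only one bucket
# (plus the exclusion set). Parameters dm and rho are illustrative.

def ball_partition(objects, pivot, dist, dm, rho):
    """Split objects around a pivot into inner/outer buckets and an exclusion set."""
    inner, outer, exclusion = [], [], []
    for o in objects:
        d = dist(o, pivot)
        if d <= dm - rho:
            inner.append(o)
        elif d > dm + rho:
            outer.append(o)
        else:
            exclusion.append(o)
    return inner, outer, exclusion


def range_query(q, r, pivot, dist, dm, rho, inner, outer, exclusion):
    """Answer a range query of radius r <= rho using one bucket plus the exclusion set."""
    assert r <= rho, "separability only holds for query radius <= rho"
    candidates = exclusion + (inner if dist(q, pivot) <= dm else outer)
    return [o for o in candidates if dist(q, o) <= r]


if __name__ == "__main__":
    # Toy metric space: integers with absolute difference as the distance.
    data = list(range(100))
    dist = lambda a, b: abs(a - b)
    inner, outer, excl = ball_partition(data, pivot=50, dist=dist, dm=25, rho=3)
    print(range_query(60, 2, 50, dist, 25, 3, inner, outer, excl))
```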
We present sequential and parallel algorithms for the Frontier A* (FA*) algorithm augmented with a form of Delayed Duplicate Detection (DDD). The sequential algorithm, FA*-DDD, overcomes the leak-back problem associated with the combination... more
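The core of delayed duplicate detection is to generate successors without any immediate duplicate check and remove duplicates in a single pass per layer, while guarding against states leaking back into the layers still in memory. The sketch below is a simplified breadth-first illustration of that idea on a toy state space, not the paper's FA*-DDD algorithm.

```python
# Minimal sketch of frontier search with delayed duplicate detection (DDD):
# successors are emitted with duplicates (as if streamed to disk), then one
# de-duplication pass is run over the whole layer.

def frontier_search_ddd(start, goal, successors):
    frontier, previous = {start}, set()
    depth = 0
    while frontier:
        if goal in frontier:
            return depth
        # Generation phase: successors with duplicates, no per-node check.
        generated = [s for node in frontier for s in successors(node)]
        # Delayed duplicate detection: dedupe the layer and drop states that
        # would leak back into the two layers still kept in memory.
        next_frontier = set(generated) - frontier - previous
        previous, frontier = frontier, next_frontier
        depth += 1
    return None

# Toy state space: integers, moves are +1, -1, or *2.
succ = lambda n: [n + 1, n - 1, n * 2]
print(frontier_search_ddd(1, 10, succ))   # minimal number of moves from 1 to 10
```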
Rapid advances in technology have led to heavy use of databases, which causes duplication in database management. Replicated data records generate multiple copies of similar data associated with a record, incomplete and also... more
In this paper, we introduce a statistical rule-based method to create rules for SpamAssassin to detect spam in different languages. The theoretical framework for generating and maintaining multilingual rules is also illustrated. The... more
Intestinal duplications are congenital malformations of the gastrointestinal tract which contain a muscular wall of two layers and a lining which resembles some part of the gastrointestinal tract (1-4). The following is a case of bleeding... more
Today, bibliographical information is kept in a variety of data sources worldwide, some of them publicly available, and some of them also offering information about citations made in publications. But as most of those sources cover... more
The research field of data integration is an area of growing practical importance, especially considering the increasing availability of huge amounts of data from more and more source systems. Accordingly, current research includes... more
Similarity Join is an important operation in data integration and cleansing, record linkage, data deduplication and pattern matching. It finds similar string pairs from two collections of strings. A number of approaches have been proposed as... more
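The basic operation can be illustrated with a small, unoptimized similarity join using Jaccard similarity over character 3-grams; the collections, threshold, and the nested-loop evaluation are illustrative only, whereas practical approaches add prefix/length filtering and indexing.

```python
# Minimal sketch of a similarity join: report string pairs from two collections
# whose Jaccard similarity over character 3-grams meets a threshold.

def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def similarity_join(left, right, threshold=0.5):
    """Return all pairs (l, r, score) with Jaccard similarity >= threshold."""
    right_grams = [(r, ngrams(r)) for r in right]
    pairs = []
    for l in left:
        lg = ngrams(l)
        for r, rg in right_grams:
            score = jaccard(lg, rg)
            if score >= threshold:
                pairs.append((l, r, round(score, 2)))
    return pairs

print(similarity_join(["John Smith", "Jane Doe"],
                      ["Jon Smith", "J. Doe", "Jane Doe"]))
```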
Duplicate detection is a critical task for any organization's database. Duplicates are the same real-world entities or objects represented in different structures and different formats. We can find out the... more
Spammers continue to use new methods, and as the types of email content become more complex, text-based anti-spam methods are no longer sufficient to prevent spam. Spam-image creation techniques are designed to bypass well-known image spam... more
To combine information from heterogeneous sources, equivalent data in the multiple sources must be identified. This task is the field matching problem. Specifically, the task is to determine whether or not two syntactic values are... more
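The field matching decision boils down to scoring how similar two syntactic values are and thresholding that score. A minimal sketch follows; difflib's ratio is used here only as a stand-in for the edit-distance-style measures discussed in this literature, and the normalization and threshold are illustrative.

```python
# Minimal sketch of field matching: decide whether two field values likely
# denote the same entity, using a normalized string-similarity ratio.

from difflib import SequenceMatcher

def normalize(value: str) -> str:
    # Cheap canonicalization: case folding, collapsing whitespace, dropping dots.
    return " ".join(value.lower().replace(".", "").split())

def fields_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Return True if the two field values are similar enough to be considered equivalent."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(fields_match("Dept. of Computer Science", "Department of Computer Science"))
print(fields_match("IBM Corp.", "Microsoft Corp."))
```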
Data de-duplication is a simple concept backed by smart technology. Data blocks are stored only once; de-duplication systems decrease storage consumption by identifying distinct chunks of data with identical... more
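The store-each-block-once idea can be sketched with a content-addressed chunk store: chunks are keyed by their hash, and files are kept only as lists of chunk references. The fixed-size chunking, class name, and in-memory storage below are illustrative assumptions; real systems typically use content-defined chunking and persistent storage.

```python
# Minimal sketch of block-level de-duplication: identical chunks are stored once,
# keyed by their SHA-256 digest; files are ordered lists of chunk digests.

import hashlib

CHUNK_SIZE = 4096

class DedupStore:
    def __init__(self):
        self.chunks = {}   # digest -> chunk bytes (stored once)
        self.files = {}    # file name -> ordered list of digests

    def put(self, name: str, data: bytes) -> None:
        digests = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # duplicate chunks are not stored again
            digests.append(digest)
        self.files[name] = digests

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[d] for d in self.files[name])

store = DedupStore()
store.put("a.bin", b"hello world" * 1000)
store.put("b.bin", b"hello world" * 1000)   # fully duplicate content
print(len(store.chunks), "unique chunks for 2 files")
assert store.get("b.bin") == b"hello world" * 1000
```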
Data quality in companies is decisive and critical to the benefits their products and services can provide. However, in heterogeneous IT infrastructures where, e.g., different applications for Enterprise Resource Planning (ERP), Customer... more
Road Accidents, Beware Everywhere
Centered around the data cleaning and integration research area, in this paper we propose SjClust, a framework to integrate similarity join and clustering into a single operation. The basic idea of our proposal consists in introducing a... more
Six breakpoint regions for rearrangements of human chromosome 15q11-q14 have been described. These rearrangements involve deletions found in approximately 70% of Prader-Willi or Angelman's syndrome patients (PWS, AS), duplications... more
This paper presents the details of the system we prepared as a participant in the PAN 2014 task on 'Source Retrieval: Uncovering Plagiarism, Authorship, and Social Software Misuse'. Our work is focused on intelligent chunking of suspicious... more
Near-duplicate documents exacerbate the problem of information overload. Research in detecting near-duplicates has attracted a lot of attention from both industry and academia. In this paper, we focus on addressing this problem for... more
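One common technique for near-duplicate document detection is SimHash: documents whose fingerprints differ in only a few bits are treated as near-duplicates. The sketch below illustrates that general technique, not necessarily this paper's method; the tokenization, hash choice, and 64-bit size are illustrative.

```python
# Minimal SimHash sketch: build a 64-bit fingerprint per document and compare
# fingerprints by Hamming distance.

import hashlib

def simhash(text: str, bits: int = 64) -> int:
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
doc3 = "completely unrelated text about database systems"
print(hamming(simhash(doc1), simhash(doc2)))  # typically a small distance: near-duplicates
print(hamming(simhash(doc1), simhash(doc3)))  # typically a larger distance
```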
Proposed threshold-based and rule-based approaches to detecting duplicates in bibliographic databases
Bibliographic databases are used to measure the performance of researchers, universities and research institutions. Thus, high data quality is required and data duplication must be avoided. One of the weaknesses of the threshold-based approach... more
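The contrast between a threshold-based check and a rule-based check for bibliographic records can be sketched as follows; the fields, thresholds, and rules are illustrative assumptions, not the paper's actual approach.

```python
# Minimal sketch: threshold-based vs rule-based duplicate checks for bibliographic records.

from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def threshold_duplicate(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
    # One aggregated similarity score compared against a single global threshold.
    score = (sim(r1["title"], r2["title"]) + sim(r1["authors"], r2["authors"])) / 2
    return score >= threshold

def rule_duplicate(r1: dict, r2: dict) -> bool:
    # Explicit rules: identical DOI, or same year plus near-identical title.
    if r1.get("doi") and r1.get("doi") == r2.get("doi"):
        return True
    return r1["year"] == r2["year"] and sim(r1["title"], r2["title"]) >= 0.9

a = {"title": "Duplicate Detection in Bibliographic Databases",
     "authors": "A. Author, B. Writer", "year": 2015, "doi": "10.1000/xyz"}
b = {"title": "Duplicate detection in bibliographic data bases",
     "authors": "Author A., Writer B.", "year": 2015, "doi": "10.1000/xyz"}
print(threshold_duplicate(a, b), rule_duplicate(a, b))
```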
Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid... more
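Once near-duplicate pairs have been detected, grouping them is naturally done with union-find so that each group can be merged or reduced to a single representative. A minimal sketch follows; the resource identifiers and the input pair list are illustrative.

```python
# Minimal sketch: group detected near-duplicate resources with union-find.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Pairs previously flagged as near-duplicates (illustrative resource ids).
near_duplicate_pairs = [("doc1", "doc2"), ("doc2", "doc5"), ("doc3", "doc4")]

uf = UnionFind()
for a, b in near_duplicate_pairs:
    uf.union(a, b)

groups = {}
for doc in {d for pair in near_duplicate_pairs for d in pair}:
    groups.setdefault(uf.find(doc), []).append(doc)
print(list(groups.values()))   # e.g. [['doc1', 'doc2', 'doc5'], ['doc3', 'doc4']]
```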
Texts propagate through many social networks and provide evidence for their structure. We present efficient algorithms for detecting clusters of reused passages embedded within longer documents in large collections. We apply these... more
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of... more
In this study, we present newly discovered duplicates of three significant Old Babylonian literary texts. 1) An unpublished Louvre duplicate (AO 6161) of the Papulegara hymns collection, which is currently housed at the British Museum. 2)... more
Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the... more
When querying Knowledge Bases (KBs), users are faced with large sets of data, often without knowing their underlying structures. It follows that users may make mistakes when formulating their queries, therefore receiving an unhelpful... more
We propose efficient techniques for processing various Top-K count queries on data with noisy duplicates. Our method differs from existing work on duplicate elimination in two significant ways: First, we dedup on the fly only the part of... more
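Counting over data with noisy duplicate keys requires mapping values onto canonical representatives before aggregation. The sketch below illustrates that idea with a simple similarity-based grouping as a stand-in for the on-the-fly de-duplication described here; the threshold and normalization are illustrative.

```python
# Minimal sketch of a Top-K count query over noisy duplicate keys: values are
# mapped onto canonical representatives before counting.

from collections import Counter
from difflib import SequenceMatcher

def canonical(value: str, seen: list, threshold: float = 0.85) -> str:
    """Map a value onto an already-seen representative if it is similar enough."""
    v = " ".join(value.lower().split())
    for rep in seen:
        if SequenceMatcher(None, v, rep).ratio() >= threshold:
            return rep
    seen.append(v)
    return v

def top_k_counts(values, k=2):
    seen, counts = [], Counter()
    for v in values:
        counts[canonical(v, seen)] += 1
    return counts.most_common(k)

authors = ["John Smith", "john  smith", "Jon Smith", "Anna Jones", "A. Jones", "B. Lee"]
print(top_k_counts(authors, k=2))
```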
Community detection is an important task in network analysis. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. In many social and information networks, these... more
Federated query engines allow for linked data consumption using SPARQL endpoints. Replicating data fragments from different sources enables data reorganization and provides the basis for more effective and efficient federated query... more
Current relational database systems are deterministic in nature and lack support for approximate matching. The result of approximate matching would be tuples annotated with the percentage of similarity, but the existing relational... more
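What such an approximate match looks like can be sketched outside the engine: each returned tuple is annotated with a similarity percentage against the query value. The `approx_select` helper, the sample rows, and the 50% cutoff below are hypothetical and purely illustrative.

```python
# Minimal sketch of approximate matching over tuples, annotating each result
# with a similarity percentage against the query value.

from difflib import SequenceMatcher

rows = [
    (1, "Jonathan Smith", "Berlin"),
    (2, "John Smyth", "Munich"),
    (3, "Maria Garcia", "Madrid"),
]

def approx_select(rows, column, value, min_similarity=50):
    """Return tuples whose `column` approximately matches `value`, with a score in %."""
    results = []
    for row in rows:
        pct = round(SequenceMatcher(None, row[column].lower(), value.lower()).ratio() * 100)
        if pct >= min_similarity:
            results.append(row + (pct,))
    return sorted(results, key=lambda r: r[-1], reverse=True)

for r in approx_select(rows, column=1, value="John Smith"):
    print(r)
```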
One significant challenge to scaling entity resolution algorithms to massive datasets is understanding how performance changes after moving beyond the realm of small, manually labeled reference datasets. Unlike traditional machine... more
This paper presents a search and retrieval framework that enables the management of Intellectual Property in the World Wide Web. This twofold framework helps users to detect digital rights infringements of their copyrighted content. In... more
While big data benefits are numerous, most of the collected data is of poor quality and, therefore, cannot be effectively used as it is. One of the leading big data quality challenges in pre-processing is data duplication. Indeed, the gathered... more
Record de-duplication is an important part of the data cleaning process of a data warehouse. Identification of multiple duplicate entries of a single entity in a data warehouse is known as de-duplication. A lot of research has been carried out... more
The quality of record de-duplication is a key factor in the decision-making process. Correctness in the identification of duplicates from a dataset provides a strong foundation for inference. Blocking is a popular technique in de-duplication. In... more
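Blocking reduces the number of pairwise comparisons by grouping records under a cheap blocking key and comparing only within each block. The sketch below shows the mechanics; the records and the blocking key (first letters of the surname plus the city) are illustrative assumptions.

```python
# Minimal sketch of blocking for de-duplication: group by a cheap key, then
# generate candidate pairs only within each block.

from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Jonathan Smith", "city": "Berlin"},
    {"id": 2, "name": "Jon Smith", "city": "Berlin"},
    {"id": 3, "name": "Maria Garcia", "city": "Madrid"},
    {"id": 4, "name": "M. Garcia", "city": "Madrid"},
]

def blocking_key(record):
    # Illustrative key: first three letters of the surname plus the city.
    surname = record["name"].split()[-1].lower()
    return surname[:3] + "|" + record["city"].lower()

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)   # only within-block pairs are compared, e.g. (1, 2) and (3, 4)
```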
The Kanseki Repository is a large repository of premodern Chinese texts. Currently it holds more than 9000 texts, covering all periods of Chinese history from early antiquity to the beginning of the 20th century. The repository is... more