This paper describes Bibster, a Peer-to-Peer system for exchanging bibliographic metadata among researchers. We show how Bibster exploits ontologies in data-representation, query formulation, query routing, and query result presentation.... more
This paper describes the design and implementation of Bibster, a Peer-to-Peer system for exchanging bibliographic data among researchers. Bibster exploits ontologies in data-storage, query formulation, query-routing and answer... more
A novel access structure for similarity search in metric data, called Similarity Hashing (sH), is proposed. Its multi-level hash structure of separable buckets on each level supports easy insertion and bounded search costs, because at... more
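The bucketing idea behind hash-based similarity search in metric data can be illustrated with a single ball-partitioning split into "separable" buckets plus an exclusion set. This is a minimal sketch in that spirit; the pivot, the split parameters, and the single level are illustrative simplifications, not the paper's actual multi-level SH structure.

```python
# Minimal sketch: one ball-partitioning split into two separable buckets plus an
# exclusion set, so a range query with radius r <= rho visits only one bucket
# (plus the exclusion set). Parameters dm and rho are illustrative.

def ball_partition(objects, pivot, dist, dm, rho):
    """Split objects around a pivot into inner/outer buckets and an exclusion set."""
    inner, outer, exclusion = [], [], []
    for o in objects:
        d = dist(o, pivot)
        if d <= dm - rho:
            inner.append(o)
        elif d > dm + rho:
            outer.append(o)
        else:
            exclusion.append(o)
    return inner, outer, exclusion


def range_query(q, r, pivot, dist, dm, rho, inner, outer, exclusion):
    """Answer a range query of radius r <= rho using one bucket plus the exclusion set."""
    assert r <= rho, "separability only holds for query radius <= rho"
    candidates = exclusion + (inner if dist(q, pivot) <= dm else outer)
    return [o for o in candidates if dist(q, o) <= r]


if __name__ == "__main__":
    # Toy metric space: integers with absolute difference as the distance.
    data = list(range(100))
    dist = lambda a, b: abs(a - b)
    inner, outer, excl = ball_partition(data, pivot=50, dist=dist, dm=25, rho=3)
    print(range_query(60, 2, 50, dist, 25, 3, inner, outer, excl))
```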
We present sequential and parallel algorithms for the Frontier A* (FA*) algorithm augmented with a form of Delayed Duplicate Detection (DDD). The sequential algorithm, FA*-DDD, overcomes the leak-back problem associated with the combination... more
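The core of delayed duplicate detection is to generate successors without any immediate duplicate check and remove duplicates in a single pass per layer, while guarding against states leaking back into the layers still in memory. The sketch below is a simplified breadth-first illustration of that idea on a toy state space, not the paper's FA*-DDD algorithm.

```python
# Minimal sketch of frontier search with delayed duplicate detection (DDD):
# successors are emitted with duplicates (as if streamed to disk), then one
# de-duplication pass is run over the whole layer.

def frontier_search_ddd(start, goal, successors):
    frontier, previous = {start}, set()
    depth = 0
    while frontier:
        if goal in frontier:
            return depth
        # Generation phase: successors with duplicates, no per-node check.
        generated = [s for node in frontier for s in successors(node)]
        # Delayed duplicate detection: dedupe the layer and drop states that
        # would leak back into the two layers still kept in memory.
        next_frontier = set(generated) - frontier - previous
        previous, frontier = frontier, next_frontier
        depth += 1
    return None

# Toy state space: integers, moves are +1, -1, or *2.
succ = lambda n: [n + 1, n - 1, n * 2]
print(frontier_search_ddd(1, 10, succ))   # minimal number of moves from 1 to 10
```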
Rapid advances in technology have led to heavy use of databases, which causes duplication in database management. Replicated data records generate multiple copies of similar data associated with a record, incomplete and also... more
In this paper, we introduce a statistical rule-based method to create rules for SpamAssassin to detect spam in different languages. The theoretical framework for generating and maintaining multilingual rules is also illustrated. The... more
Intestinal duplications are congenital malformations of the gastrointestinal tract which contain a muscular wall of two layers and a lining which resembles some part of the gastrointestinal tract (1-4). The following is a case of bleeding... more
Today, bibliographical information is kept in a variety of data sources worldwide, some of them publicly available, and some of them also offering information about citations made in publications. But as most of those sources cover... more
The research field of data integration is an area of growing practical importance, especially considering the increasing availability of huge amounts of data from more and more source systems. Accordingly, current research includes... more
Similarity Join is an important operation in data integration and cleansing, record linkage, data deduplication and pattern matching. It finds similar string pairs from two collections of strings. A number of approaches have been proposed as... more
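The basic operation can be illustrated with a small, unoptimized similarity join using Jaccard similarity over character 3-grams; the collections, threshold, and the nested-loop evaluation are illustrative only, whereas practical approaches add prefix/length filtering and indexing.

```python
# Minimal sketch of a similarity join: report string pairs from two collections
# whose Jaccard similarity over character 3-grams meets a threshold.

def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def similarity_join(left, right, threshold=0.5):
    """Return all pairs (l, r, score) with Jaccard similarity >= threshold."""
    right_grams = [(r, ngrams(r)) for r in right]
    pairs = []
    for l in left:
        lg = ngrams(l)
        for r, rg in right_grams:
            score = jaccard(lg, rg)
            if score >= threshold:
                pairs.append((l, r, round(score, 2)))
    return pairs

print(similarity_join(["John Smith", "Jane Doe"],
                      ["Jon Smith", "J. Doe", "Jane Doe"]))
```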
Duplicate detection is a critical task for any organization's database. Duplicates are the same real-world entities or objects represented in different structures and different formats. We can find out the... more
Spammers continue to use new methods, and as the types of email content become more complex, text-based anti-spam methods are no longer sufficient to prevent spam. Spam-image creation techniques are designed to bypass well-known image spam... more
To combine information from heterogeneous sources, equivalent data in the multiple sources must be identified. This task is the field matching problem. Specifically, the task is to determine whether or not two syntactic values are... more
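The field matching decision boils down to scoring how similar two syntactic values are and thresholding that score. A minimal sketch follows; difflib's ratio is used here only as a stand-in for the edit-distance-style measures discussed in this literature, and the normalization and threshold are illustrative.

```python
# Minimal sketch of field matching: decide whether two field values likely
# denote the same entity, using a normalized string-similarity ratio.

from difflib import SequenceMatcher

def normalize(value: str) -> str:
    # Cheap canonicalization: case folding, collapsing whitespace, dropping dots.
    return " ".join(value.lower().replace(".", "").split())

def fields_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Return True if the two field values are similar enough to be considered equivalent."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(fields_match("Dept. of Computer Science", "Department of Computer Science"))
print(fields_match("IBM Corp.", "Microsoft Corp."))
```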
Data de-duplication is a simple concept backed by smart technology. Data blocks are stored only once; de-duplication systems decrease storage consumption by identifying distinct chunks of data with identical... more
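The store-each-block-once idea can be sketched with a content-addressed chunk store: chunks are keyed by their hash, and files are kept only as lists of chunk references. The fixed-size chunking, class name, and in-memory storage below are illustrative assumptions; real systems typically use content-defined chunking and persistent storage.

```python
# Minimal sketch of block-level de-duplication: identical chunks are stored once,
# keyed by their SHA-256 digest; files are ordered lists of chunk digests.

import hashlib

CHUNK_SIZE = 4096

class DedupStore:
    def __init__(self):
        self.chunks = {}   # digest -> chunk bytes (stored once)
        self.files = {}    # file name -> ordered list of digests

    def put(self, name: str, data: bytes) -> None:
        digests = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # duplicate chunks are not stored again
            digests.append(digest)
        self.files[name] = digests

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[d] for d in self.files[name])

store = DedupStore()
store.put("a.bin", b"hello world" * 1000)
store.put("b.bin", b"hello world" * 1000)   # fully duplicate content
print(len(store.chunks), "unique chunks for 2 files")
assert store.get("b.bin") == b"hello world" * 1000
```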
Data quality in companies is decisive and critical to the benefits their products and services can provide. However, in heterogeneous IT infrastructures where, e.g., different applications for Enterprise Resource Planning (ERP), Customer... more
Road Accidents, Beware Everywhere
Centered around the data cleaning and integration research area, in this paper we propose SjClust, a framework to integrate similarity join and clustering into a single operation. The basic idea of our proposal consists in introducing a... more
Six breakpoint regions for rearrangements of human chromosome 15q11-q14 have been described. These rearrangements involve deletions found in approximately 70% of Prader-Willi or Angelman's syndrome patients (PWS, AS), duplications... more
This paper presents the details of the system we prepared as a participant in the PAN 2014 task on 'Source Retrieval: Uncovering Plagiarism, Authorship, and Social Software Misuse'. Our work is focused on intelligent chunking of suspicious... more
Near-duplicate documents exacerbate the problem of information overload. Research in detecting near-duplicates has attracted a lot of attention from both industry and academia. In this paper, we focus on addressing this problem for... more
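One common technique for near-duplicate document detection is SimHash: documents whose fingerprints differ in only a few bits are treated as near-duplicates. The sketch below illustrates that general technique, not necessarily this paper's method; the tokenization, hash choice, and 64-bit size are illustrative.

```python
# Minimal SimHash sketch: build a 64-bit fingerprint per document and compare
# fingerprints by Hamming distance.

import hashlib

def simhash(text: str, bits: int = 64) -> int:
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
doc3 = "completely unrelated text about database systems"
print(hamming(simhash(doc1), simhash(doc2)))  # typically a small distance: near-duplicates
print(hamming(simhash(doc1), simhash(doc3)))  # typically a larger distance
```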
Proposed threshold-based and rule-based approaches to detecting duplicates in bibliographic databases
Bibliographic databases are used to measure the performance of researchers, universities and research institutions. Thus, high data quality is required and data duplication must be avoided. One of the weaknesses of the threshold-based approach... more
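The contrast between a threshold-based check and a rule-based check for bibliographic records can be sketched as follows; the fields, thresholds, and rules are illustrative assumptions, not the paper's actual approach.

```python
# Minimal sketch: threshold-based vs rule-based duplicate checks for bibliographic records.

from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def threshold_duplicate(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
    # One aggregated similarity score compared against a single global threshold.
    score = (sim(r1["title"], r2["title"]) + sim(r1["authors"], r2["authors"])) / 2
    return score >= threshold

def rule_duplicate(r1: dict, r2: dict) -> bool:
    # Explicit rules: identical DOI, or same year plus near-identical title.
    if r1.get("doi") and r1.get("doi") == r2.get("doi"):
        return True
    return r1["year"] == r2["year"] and sim(r1["title"], r2["title"]) >= 0.9

a = {"title": "Duplicate Detection in Bibliographic Databases",
     "authors": "A. Author, B. Writer", "year": 2015, "doi": "10.1000/xyz"}
b = {"title": "Duplicate detection in bibliographic data bases",
     "authors": "Author A., Writer B.", "year": 2015, "doi": "10.1000/xyz"}
print(threshold_duplicate(a, b), rule_duplicate(a, b))
```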
Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid... more
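Once near-duplicate pairs have been detected, grouping them is naturally done with union-find so that each group can be merged or reduced to a single representative. A minimal sketch follows; the resource identifiers and the input pair list are illustrative.

```python
# Minimal sketch: group detected near-duplicate resources with union-find.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Pairs previously flagged as near-duplicates (illustrative resource ids).
near_duplicate_pairs = [("doc1", "doc2"), ("doc2", "doc5"), ("doc3", "doc4")]

uf = UnionFind()
for a, b in near_duplicate_pairs:
    uf.union(a, b)

groups = {}
for doc in {d for pair in near_duplicate_pairs for d in pair}:
    groups.setdefault(uf.find(doc), []).append(doc)
print(list(groups.values()))   # e.g. [['doc1', 'doc2', 'doc5'], ['doc3', 'doc4']]
```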
Texts propagate through many social networks and provide evidence for their structure. We present efficient algorithms for detecting clusters of reused passages embedded within longer documents in large collections. We apply these... more
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of... more
In this study, we present newly discovered duplicates of three significant Old Babylonian literary texts. 1) An unpublished Louvre duplicate (AO 6161) of the Papulegara hymns collection, which is currently housed at the British Museum. 2)... more
Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the... more
When querying Knowledge Bases (KBs), users are faced with large sets of data, often without knowing their underlying structures. It follows that users may make mistakes when formulating their queries, therefore receiving an unhelpful... more
We propose efficient techniques for processing various Top-K count queries on data with noisy duplicates. Our method differs from existing work on duplicate elimination in two significant ways: First, we dedup on the fly only the part of... more
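Counting over data with noisy duplicate keys requires mapping values onto canonical representatives before aggregation. The sketch below illustrates that idea with a simple similarity-based grouping as a stand-in for the on-the-fly de-duplication described here; the threshold and normalization are illustrative.

```python
# Minimal sketch of a Top-K count query over noisy duplicate keys: values are
# mapped onto canonical representatives before counting.

from collections import Counter
from difflib import SequenceMatcher

def canonical(value: str, seen: list, threshold: float = 0.85) -> str:
    """Map a value onto an already-seen representative if it is similar enough."""
    v = " ".join(value.lower().split())
    for rep in seen:
        if SequenceMatcher(None, v, rep).ratio() >= threshold:
            return rep
    seen.append(v)
    return v

def top_k_counts(values, k=2):
    seen, counts = [], Counter()
    for v in values:
        counts[canonical(v, seen)] += 1
    return counts.most_common(k)

authors = ["John Smith", "john  smith", "Jon Smith", "Anna Jones", "A. Jones", "B. Lee"]
print(top_k_counts(authors, k=2))
```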
Community detection is an important task in network analysis. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. In many social and information networks, these... more
Federated query engines allow for linked data consumption using SPARQL endpoints. Replicating data fragments from different sources enables data reorganization and provides the basis for more effective and efficient federated query... more
Current relational database systems are deterministic in nature and lack support for approximate matching. The result of approximate matching would be tuples annotated with the percentage of similarity, but the existing relational... more
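What such an approximate match looks like can be sketched outside the engine: each returned tuple is annotated with a similarity percentage against the query value. The `approx_select` helper, the sample rows, and the 50% cutoff below are hypothetical and purely illustrative.

```python
# Minimal sketch of approximate matching over tuples, annotating each result
# with a similarity percentage against the query value.

from difflib import SequenceMatcher

rows = [
    (1, "Jonathan Smith", "Berlin"),
    (2, "John Smyth", "Munich"),
    (3, "Maria Garcia", "Madrid"),
]

def approx_select(rows, column, value, min_similarity=50):
    """Return tuples whose `column` approximately matches `value`, with a score in %."""
    results = []
    for row in rows:
        pct = round(SequenceMatcher(None, row[column].lower(), value.lower()).ratio() * 100)
        if pct >= min_similarity:
            results.append(row + (pct,))
    return sorted(results, key=lambda r: r[-1], reverse=True)

for r in approx_select(rows, column=1, value="John Smith"):
    print(r)
```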
One significant challenge to scaling entity resolution algorithms to massive datasets is understanding how performance changes after moving beyond the realm of small, manually labeled reference datasets. Unlike traditional machine... more
This paper presents a search and retrieval framework that enables the management of Intellectual Property in the World Wide Web. This twofold framework helps users to detect digital rights infringements of their copyrighted content. In... more
While big data benefits are numerous, most of the collected data is of poor quality and, therefore, cannot be effectively used as it is. One of the leading big data quality challenges in pre-processing is data duplication. Indeed, the gathered... more
Record de-duplication is an important part of the data cleaning process of a data warehouse. Identification of multiple duplicate entries of a single entity in a data warehouse is known as de-duplication. A lot of research has been carried out... more
The quality of record de-duplication is a key factor in the decision-making process. Correctness in the identification of duplicates from a dataset provides a strong foundation for inference. Blocking is a popular technique in de-duplication. In... more
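Blocking reduces the number of pairwise comparisons by grouping records under a cheap blocking key and comparing only within each block. The sketch below shows the mechanics; the records and the blocking key (first letters of the surname plus the city) are illustrative assumptions.

```python
# Minimal sketch of blocking for de-duplication: group by a cheap key, then
# generate candidate pairs only within each block.

from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Jonathan Smith", "city": "Berlin"},
    {"id": 2, "name": "Jon Smith", "city": "Berlin"},
    {"id": 3, "name": "Maria Garcia", "city": "Madrid"},
    {"id": 4, "name": "M. Garcia", "city": "Madrid"},
]

def blocking_key(record):
    # Illustrative key: first three letters of the surname plus the city.
    surname = record["name"].split()[-1].lower()
    return surname[:3] + "|" + record["city"].lower()

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)   # only within-block pairs are compared, e.g. (1, 2) and (3, 4)
```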
The Kanseki Repository is a large repository of premodern Chinese texts. Currently it holds more than 9000 texts, covering all periods of Chinese history from early antiquity to the beginning of the 20th century. The repository is... more