On the editing distance between unordered labeled trees
1992, Information Processing Letters
https://doi.org/10.1016/0020-0190(92)90136-J…
7 pages
Abstract
Zhang, K., R. Statman and D. Shasha, On the editing distance between unordered labeled trees, Information Processing Letters 42 (1992) 133-139. This paper considers the problem of computing the editing distance between unordered, labeled trees. We give efficient polynomial-time algorithms for the case when one tree is a string or has a bounded number of leaves. By contrast, we show that the problem is NP-complete even for binary trees having a label alphabet of size two.
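For orientation, the simplest degenerate case arises when both trees are paths: each is then just a string of labels, and the editing distance reduces to the classical string-to-string correction problem. The following is a minimal Python sketch of that dynamic program, assuming unit costs by default. The function and parameter names are illustrative and not taken from the paper, and this is not the paper's algorithm for comparing a string with a general unordered tree.

```python
def string_edit_distance(x, y, ins=1, dele=1, sub=1):
    """Edit distance between two label sequences (the path-versus-path case).

    ins, dele and sub are assumed per-operation costs; identical labels
    are matched at cost 0.
    """
    m, n = len(x), len(y)
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = [j * ins for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [i * dele] + [0] * n
        for j in range(1, n + 1):
            relabel = 0 if x[i - 1] == y[j - 1] else sub
            curr[j] = min(
                prev[j] + dele,         # delete x[i-1]
                curr[j - 1] + ins,      # insert y[j-1]
                prev[j - 1] + relabel,  # match or relabel
            )
        prev = curr
    return prev[n]
```

For example, `string_edit_distance("kitten", "sitting")` evaluates to 3 under unit costs.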
Related papers
2018
Almost 30 years ago, Zhang and Shasha published a seminal paper describing an efficient dynamic programming algorithm computing the tree edit distance, that is, the minimum number of node deletions, insertions, and replacements that are necessary to transform one tree into another. Since then, the tree edit distance has had widespread applications, for example in bioinformatics and intelligent tutoring systems. However, the original paper of Zhang and Shasha can be challenging to read for newcomers and it does not describe how to efficiently infer the optimal edit script. In this contribution, we provide a comprehensive tutorial to the tree edit distance algorithm of Zhang and Shasha. We further prove metric properties of the tree edit distance, and describe efficient algorithms to infer the cheapest edit script, as well as a summary of all cheapest edit scripts between two trees.
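The definition above can be made concrete with a small Python sketch of the ordered tree edit distance under unit costs. It uses a plain memoized recursion on the rightmost roots of two forests, which is the recurrence that underlies such dynamic programs, not the optimized algorithm of Zhang and Shasha, and it does not recover an edit script. The tree encoding (a `(label, children)` tuple) and all function names are assumptions made for illustration.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children), where children is a tuple
# of trees; a forest is a tuple of trees. Nested tuples are hashable, so
# forests can be used directly as memoization keys.

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not f:
        _, kids = g[-1]
        # Insert the rightmost root of g; its children join the forest.
        return forest_distance(f, g[:-1] + kids) + 1
    if not g:
        _, kids = f[-1]
        # Delete the rightmost root of f; its children join the forest.
        return forest_distance(f[:-1] + kids, g) + 1
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + f_kids, g) + 1
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    # Map the two rightmost roots onto each other (free if labels agree),
    # then solve the children forests and the remaining forests separately.
    replace = (
        forest_distance(f_kids, g_kids)
        + forest_distance(f[:-1], g[:-1])
        + (0 if f_label == g_label else 1)
    )
    return min(delete, insert, replace)


def tree_edit_distance(t1, t2):
    """Unit-cost edit distance between two ordered, labeled trees."""
    return forest_distance((t1,), (t2,))
```

With `t1 = ("a", (("b", ()), ("c", ())))` and `t2 = ("a", (("c", ()),))`, the call `tree_edit_distance(t1, t2)` gives 1, corresponding to deleting the node labeled `b`.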
2012
The aim of this thesis is to compare tree edit distance methods in the context of detecting structural similarity between two XML Schema documents. The methods search for the minimum number of edit operations leading from one tree to another. We have analysed and implemented a wide range of the existing tree edit distance approaches. It is important to understand that the distance computed by the algorithms is affected by the set of edit operations used; therefore, the strength in detecting XML Schema similarity differs in each approach. The first part of this work contains the description of the approaches used and the necessary notation. The second part provides implementation details and an analysis of the described methods, which consists of a theoretical comparison and an empirical evaluation on real and synthetic XML data. The resulting implementation is available in the form of a Java SE application.
Journal of Computational Biology, 2006
Consequently, there is a need to design metrics and algorithms to compare trees. A natural comparison metric is the "Tree Edit Distance," the number of simple edit (insert/delete) operations needed to transform one tree into the other. Rooted-ordered trees, where the order between the siblings is significant, can be compared in polynomial time. Rooted-unordered trees are used to describe processes or objects where the topology, rather than the order or the identity of each node, is important. For example, in immunology, rooted-unordered trees describe the process of immunoglobulin (antibody) gene diversification in the germinal center over time. Comparing such trees has been proven to be a difficult computational problem that belongs to the set of NP-Complete problems. Comparing two trees can be viewed as a search problem in graphs. A* is a search algorithm that explores the search space in an efficient order. Using a good lower bound estimation of the degree of difference between the two trees, A* can reduce search time dramatically. We have designed and implemented a variant of the A* search algorithm suitable for calculating tree edit distance. We show here that A* is able to perform an edit distance measurement in reasonable time for trees with dozens of nodes.
In a number of practical situations, data have structure and the relations among its component parts need to be coded with suitable data models. Trees are usually utilized for representing data for which hierarchical relations can be defined. This is the case in a number of fields like image analysis, natural language processing, protein structure, or music retrieval, to name a few. In those cases, procedures for comparing trees are very relevant. An approximate tree edit distance algorithm has been introduced for working with trees labeled only at the leaves. In this paper, it has been applied to handwritten character recognition, providing accuracies comparable to those by the most comprehensive search method, being as efficient as the fastest.
Algorithmica, 2003
In this paper we propose a dynamic programming algorithm to compare two quotiented trees using a constrained edit distance. A quotiented tree is a tree defined with an additional equivalence relation on vertices and such that the quotient graph is also a tree. The core of the method relies on an adaptation of an algorithm recently proposed by Zhang for comparing unordered rooted trees. This method is currently being used in plant architecture modelling to quantify different types of variability between plants represented by quotiented trees.
Similarity Search and Applications
The unordered tree edit distance is a natural metric to compute distances between trees without intrinsic child order, such as representations of chemical molecules. While the unordered tree edit distance is MAX SNP-hard in principle, it is feasible for small cases, e.g. via an A* algorithm. Unfortunately, current heuristics for the A* algorithm assume unit costs for deletions, insertions, and replacements, which limits our ability to inject domain knowledge. In this paper, we present three novel heuristics for the A* algorithm that work with custom cost functions. In experiments on two chemical data sets, we show that custom costs make the A* computation faster and improve the error of a 5-nearest neighbor regressor, predicting chemical properties. We also show that, on these data, polynomial edit distances can achieve similar results as the unordered tree edit distance.
6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268), 2000
A common model for computing the similarity of two strings X and Y of lengths m and n respectively, with m ≥ n, is to transform X into Y through a sequence of three types of edit operations: insertion, deletion, and substitution. The model assumes a given cost function which assigns a non-negative real weight to each edit operation. The amortized weight for a given edit sequence is the ratio of its weight to its length, and the minimum of this ratio over all edit sequences is the normalized edit distance. Existing algorithms for normalized edit distance computation with proven complexity bounds require O(mn^2) time in the worst case. We give an O(mn log n)-time algorithm for the problem when the cost function is uniform, i.e., the weight of each edit operation is constant within the same type, except that substitutions can have different weights depending on whether they are matching or non-matching.
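The ratio in this definition can be illustrated with a short Python sketch. It is a straightforward length-indexed dynamic program that tracks the cheapest edit sequence for every possible sequence length and then takes the best weight-to-length ratio; it is not the O(mn log n) algorithm of the paper. The sketch assumes uniform weights per operation type and assumes that a matching substitution counts toward the sequence length; the function name and default weights are illustrative.

```python
import math


def normalized_edit_distance(x, y, ins=1.0, dele=1.0, sub=1.0, match=0.0):
    """Minimum over edit sequences of (total weight) / (number of operations).

    Assumption: matching substitutions cost `match` and count as one
    operation toward the sequence length.
    """
    m, n = len(x), len(y)
    if m == 0 and n == 0:
        return 0.0
    inf = math.inf
    # prev[i][j]: cheapest way to turn x[:i] into y[:j] with exactly k-1 ops.
    prev = [[inf] * (n + 1) for _ in range(m + 1)]
    prev[0][0] = 0.0
    best = inf
    for k in range(1, m + n + 1):  # no useful sequence is longer than m + n
        curr = [[inf] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                cost = inf
                if i > 0:
                    cost = min(cost, prev[i - 1][j] + dele)
                if j > 0:
                    cost = min(cost, prev[i][j - 1] + ins)
                if i > 0 and j > 0:
                    step = match if x[i - 1] == y[j - 1] else sub
                    cost = min(cost, prev[i - 1][j - 1] + step)
                curr[i][j] = cost
        if curr[m][n] < inf:
            best = min(best, curr[m][n] / k)
        prev = curr
    return best
```

For instance, with the default weights, `normalized_edit_distance("ab", "a")` is 0.5: the best sequence matches `a` and deletes `b`, for a weight of 1 spread over 2 operations.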
ACM Transactions on Database Systems, 2010
When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the representations in the sources differ.
International Journal of Pattern Recognition and Artificial Intelligence, 2012
We model the edit distance as a function in a labeling space. A labeling space is a Euclidean space where coordinates are the edit costs. Through this model, we define a class of cost. A class of cost is a region in the labeling space in which all the edit costs have the same optimal labeling. Moreover, we characterize the distance value through the labeling space. This new point of view of the edit distance gives us the opportunity of defining some interesting properties that are useful for a better understanding of the edit distance. Finally, we show the usefulness of these properties through some applications.
International Journal of Foundations of Computer Science, 1996
We consider the problem of comparing CUAL graphs (Connected, Undirected, Acyclic graphs with nodes being Labeled). This problem is motivated by the study of information retrieval for bio-chemical and molecular databases. Suppose we define the distance between two CUAL graphs G1 and G2 to be the weighted number of edit operations (insert node, delete node and relabel node) to transform G1 to G2. By reduction from exact cover by 3-sets, one can show that finding the distance between two CUAL graphs is NP-complete. In view of the hardness of the problem, we propose a constrained distance metric, called the degree-2 distance, by requiring that any node to be inserted (deleted) have no more than 2 neighbors. With this metric, we present an efficient algorithm to solve the problem. The algorithm runs in time O(N1 N2 D^2) for general weighting edit operations and in time [Formula: see text] for integral weighting edit operations, where Ni, i=1, 2, is the number of nodes in Gi, D=min{d1, d2} a...

References (8)
- M.R. Garey and D.S. Johnson, Computers and Intractability (Freeman, New York, 1979).
- B. Shapiro and K. Zhang, Comparing multiple RNA secondary structures using tree comparisons, Comput. Appl. Biosci. 6 (4) (1990) 309-318.
- K.C. Tai, The tree-to-tree correction problem, J. ACM 26 (1979) 422-433.
- R.A. Wagner and M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168-173.
- K. Zhang, The editing distance between trees: algorithms and applications, Ph.D. Thesis, Dept. of Computer Science, Courant Institute, 1989.
- K. Zhang and D. Shasha, Simple fast algorithms for the editing distance between trees and related problems, SIAM J. Comput. 18 (1989) 1245-1262.
- K. Zhang, R. Statman and D. Shasha, On the editing distance between unordered labeled trees, Tech. Rept.