We propose a method for efficiently finding all parallel passages in a large corpus, even if the passages are not quite identical due to rephrasing and orthographic variation. The key ideas are the representation of each word in the corpus by its two most infrequent letters, finding matched pairs of strings of four or five words that differ by at most one word, and then identifying clusters of such matched pairs. Using this method, over 4600 parallel pairs of passages were identified in the Babylonian Talmud, a Hebrew-Aramaic corpus of over 1.8 million words, in just over 11 seconds. Empirical comparisons on sample data indicate that the coverage obtained by our method is essentially the same as that obtained using slow exhaustive methods.
Keywords: approximate matching; fuzzy matching; text reuse
INTRODUCTION
Ancient text corpora in classical languages such as Greek, Latin, Hebrew and Aramaic typically include numerous examples of text reuse, including repetitions of long passages of 20 words or more. Identifying such passages is important because it allows scholars to trace the development of ideas and concepts through time and across geographical ranges. Additionally, even within a given time period and geographical location, the identification of multiple parallel sources for any given idea provides a platform for scholarly inquiry. Identifying all examples of text reuse within such a large corpus is challenging for several reasons, including the large number of comparisons that must be done and the fact that matches tend to be only approximate.
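The first two ideas in the abstract above (reducing each word to its two rarest letters, then matching short word windows that differ in at most one word) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the alphabetical tie-breaking between equally rare letters, the fixed window length, and the wildcard-signature trick for finding near-matches are all assumptions made for illustration, and the final clustering step is omitted.

```python
from collections import Counter, defaultdict
from itertools import combinations


def letter_counts(words):
    """Count how often each letter occurs across the whole corpus."""
    return Counter(ch for w in words for ch in w)


def word_key(word, counts):
    """Reduce a word to its two rarest distinct letters.

    Assumption: ties are broken alphabetically, and the key ignores
    letter order, so transposed spellings map to the same key.
    """
    rare = sorted(set(word), key=lambda ch: (counts[ch], ch))[:2]
    return "".join(rare)


def matched_windows(words, n=4):
    """Return pairs of start positions whose n-word windows differ
    in at most one word (after reduction to two-letter keys)."""
    counts = letter_counts(words)
    keys = [word_key(w, counts) for w in words]

    # Index every window once per position, with that position replaced
    # by a wildcard. Two windows that share a signature agree everywhere
    # except possibly at the wildcarded position.
    buckets = defaultdict(set)
    for i in range(len(keys) - n + 1):
        window = keys[i:i + n]
        for skip in range(n):
            sig = tuple(window[:skip] + ["*"] + window[skip + 1:])
            buckets[sig].add(i)

    pairs = set()
    for starts in buckets.values():
        for a, b in combinations(sorted(starts), 2):
            if b - a >= n:  # ignore windows that overlap each other
                pairs.add((a, b))
    return sorted(pairs)
```

Bucketing by signature avoids comparing every window against every other window, which is one plausible way to get the kind of speed the abstract reports; the paper itself may use a different indexing scheme.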
Most text analysis and retrieval work to date has focused on the topic of a text; that is, what it is about. However, a text also contains much useful information in its style, or how it is written. This includes information about its author, its purpose, feelings it is meant to evoke, and more. This article develops a new type of lexical feature for use in stylistic text classification, based on taxonomies of various semantic functions of certain choice words or phrases. We demonstrate the usefulness of such features for the stylistic text classification tasks of determining author identity and nationality, the gender of literary characters, a text's sentiment (positive/negative evaluation), and the rhetorical character of scientific journal articles. We further show how the use of functional features aids in gaining insight about stylistic differences among different kinds of texts.
The theory revision problem is the problem of how best to go about revising a deficient domain theory using information contained in examples that expose inaccuracies. In this paper we present our approach to the theory revision problem for propositional domain theories. The approach described here, called PTR, uses probabilities associated with domain theory elements to numerically track the "flow" of proof through the theory. This allows us to measure the precise role of a clause or literal in allowing or preventing a (desired or undesired) derivation for a given example. This information is used to efficiently locate and repair flawed elements of the theory. PTR is proved to converge to a theory which correctly classifies all examples, and shown experimentally to be fast and accurate even for deep theories.
While it has often been observed that the product of translation is somehow different from non-translated text, scholars have emphasized two distinct bases for such differences. Some have noted interference from the source language spilling over into translation in a source-language-specific way, while others have noted general effects of the process of translation that are independent of source language. Using a series of text categorization experiments, we show that both these effects exist and that, moreover, there is a continuum between them. There are many effects of translation that are consistent among texts translated from a given source language, some of which are consistent even among texts translated from families of source languages. Significantly, we find that even for widely unrelated source languages and multiple genres, differences between translated texts and non-translated texts are sufficient for a learned classifier to accurately determine if a given text is translated or original.
The summers spent with Arnt and colleagues from the Copenhagen Business School at the retreat in Skagen were as close to heaven as I'll ever get. In his enviably laid-back way, Arnt managed to mold us into an irrepressible think tank. We wrote articles, he produced grant proposals, and all of us reveled in long walks along the magnificent beaches of that picturesque town on the tip of Jutland. Our explorations of the "migratory dune," our visits to the "sunken church," and the wonderful evenings spent lolling in the spacious parlor of the mansion that became our home during those magical weeks are etched in my memory and in my heart. Back in Copenhagen, it was Arnt who introduced me to Karen Blixen's home, and all the stories that go with it, and Arnt, never rushed, never appearing pressured, who took me to see the opera building and other highlights of Copenhagen architecture. It was also Arnt who set up CRITT and so many other cornerstones of collaborative research at CBS. And on an even more personal level, it was Arnt who did me the unforgettable honor of coediting and publishing Interpreting Studies and Beyond, a very special volume of the Copenhagen Studies in Language. I feel privileged to have been given this opportunity to express my appreciation, in some small way,
Suppose a domain expert gives us a domain theory which is meant to classify examples as positive or negative examples of some concept. Now suppose, as is often the case, that the expert specifies parts of the theory which might be in need of repair, as opposed to those parts of the theory which are certainly not in need of repair. We say that such a theory is partially mutable. There might be some non-empty set of examples each of which has a classification in the partially mutable theory which is invariant under all possible sets of repairs to unreliable components of the theory. We call such examples stable. We present an efficient algorithm for identifying stable examples for a large class of first-order clausal theories with negation and recursion. We further show how to use stability to arbitrate between the theory and a noisy oracle to improve classification accuracy. We present experimental results on some flawed theories which illustrate the approach.
One common approach to using a prior domain theory as a learning bias is to revise the theory in accordance with a set of training examples. More recently, another class of methods has arisen in which the theory is reinterpreted, either by probabilizing it, or by using its components in constructive induction. Revision-based methods tend to work best when flaws in the given theory are localized, whereas reinterpretation methods tend to work well when flaws are distributed evenly throughout the theory. This paper describes a 'meta-learning' algorithm which, given a flawed domain theory, determines the general nature of the theory's flaws by analyzing the information flow in the theory. The method works by first 'probabilizing' the theory, and then selectively 'de-probabilizing' components, based on the theory's performance on a preclassified set of training examples. This method distinguishes between those parts of the theory which should be interpreted as given and those which need to be revised or reinterpreted. This allows us to directly determine the nature of the information contained in the theory, and hence to exploit the theory in the best way possible.
This paper describes a method for the detection and removal of shadows in RGB images. The shadows considered have hard borders. The proposed method begins with a segmentation of the color image. It is then decided whether a segment is a shadow by examining its neighboring segments. We use the method introduced in Finlayson et al. [1] to remove the shadows by zeroing the shadow's borders in an edge representation of the image, and then re-integrating the edge representation using the method introduced by Weiss [2]. This is done for all of the color channels, thus leaving a shadow-free color image. Unlike previous methods, the present method requires neither a calibrated camera nor multiple images. This method is complementary to current illumination correction algorithms. Examination of a number of examples indicates that this method yields a significant improvement over previous methods.
We present a web mining method for discovering and enhancing relationships in which a specified concept (word class) participates. We discover a whole range of relationships focused on the given concept, rather than generic known relationships as in most previous work. Our method is based on clustering patterns that contain concept words and other words related to them. We evaluate the method on three different rich concepts and find that in each case the method generates a broad variety of relationships with good precision.
Proceedings of the SIGIR 2007 International Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, PAN 2007, Amsterdam, Netherlands, July 27, 2007
Papers by Moshe Koppel