Papers by Martin van den Berg

Computer Networks, 1999
The rapid growth of the World Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are good access points to many relevant pages within a few links. We report on extensive focused-crawling experiments using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though both are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs: it discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius. Our anecdotes suggest that focused crawling is very effective for building high-quality collections of Web documents on specific topics, using modest desktop hardware.
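
The crawl loop the abstract sketches can be illustrated with a short best-first search in Python. This is only a minimal sketch: the classifier.relevance, distiller.is_hub, fetch, and extract_links interfaces, the 0.5 relevance threshold, and the hub boost factor are illustrative assumptions, not the paper's implementation.

    import heapq

    def focused_crawl(seed_urls, classifier, distiller, fetch, extract_links, budget=1000):
        """Best-first crawl: the classifier scores how relevant a page is to the
        focus topics, and pages the distiller flags as good access points (hubs)
        have their out-links boosted in the priority frontier."""
        frontier = [(-1.0, url) for url in seed_urls]   # max-heap via negated priority
        heapq.heapify(frontier)
        visited, collected = set(), []
        while frontier and len(visited) < budget:
            _, url = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            page = fetch(url)
            relevance = classifier.relevance(page)      # estimated probability the page is on-topic
            if relevance < 0.5:                         # prune irrelevant regions of the Web
                continue
            collected.append((url, relevance))
            boost = 2.0 if distiller.is_hub(url) else 1.0   # hubs point to many relevant pages
            for link in extract_links(page):
                if link not in visited:
                    heapq.heappush(frontier, (-(relevance * boost), link))
        return collected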

We describe the architecture of a hypertext resource discovery system using a relational database. Such a system can answer questions that combine page contents, metadata, and hyperlink structure in powerful ways, such as "find the number of links from an environmental protection page to a page about oil and natural gas over the last year." A key problem in populating the database in such a system is to discover web resources related to the topics involved in such queries. We argue that a keyword-based "find similar" search based on a giant all-purpose crawler is neither necessary nor adequate for resource discovery. Instead we exploit the properties that pages tend to cite pages with related topics, and that, given that a page u cites a page about a desired topic, it is very likely that u cites additional desirable pages. We exploit these properties by using a crawler controlled by two hypertext mining programs: (1) a classifier that evaluates the relevance of a region of the web to the user's interest, and (2) a distiller that evaluates a page as an access point for a large neighborhood of relevant pages. Our implementation uses IBM's Universal Database, not only for robust data storage, but also for integrating the computations of the classifier and distiller into the database. This results in a significant increase in I/O efficiency: a factor of ten for the classifier and a factor of three for the distiller. In addition, ad-hoc SQL queries can be used to monitor the crawler and dynamically change crawling strategies. We report on experiments to establish that our system is efficient, effective, and robust.
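
To make the kind of combined content-and-link query above concrete, here is a hedged sketch in Python with an embedded SQL query. The toy schema (page, page_topic, and link tables) and the use of SQLite are illustrative assumptions only; they are not the schema or the IBM Universal Database integration described in the paper.

    import sqlite3

    # Assumed toy schema: page(url, crawled_on), page_topic(url, topic), link(src, dst).
    # Count links from "environmental protection" pages to "oil and natural gas"
    # pages crawled in the last year.
    QUERY = """
    SELECT COUNT(*)
    FROM link
    JOIN page_topic AS src_t ON src_t.url = link.src
    JOIN page_topic AS dst_t ON dst_t.url = link.dst
    JOIN page       AS src_p ON src_p.url = link.src
    WHERE src_t.topic = 'environmental protection'
      AND dst_t.topic = 'oil and natural gas'
      AND src_p.crawled_on >= date('now', '-1 year');
    """

    def count_cross_topic_links(db_path="crawl.db"):
        with sqlite3.connect(db_path) as conn:
            (n,) = conn.execute(QUERY).fetchone()
        return n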

… ACL Workshop Text …, 2004
In this paper, we present algorithms to address the shortcomings of both purely structural and purely statistical methods of sentence-extraction summarization. We present the PALSUMM hybrid summarization algorithms that use structural methods based on discourse ...
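
As a rough illustration of what a hybrid structural/statistical sentence extractor can look like, the sketch below linearly combines a term-frequency salience score with a weight derived from each sentence's depth in a discourse tree. The scoring functions and the linear combination are illustrative assumptions, not the PALSUMM algorithms themselves.

    import re
    from collections import Counter

    def hybrid_summary(sentences, discourse_depth, k=3, alpha=0.5):
        """Score each sentence by a statistical term-frequency salience and a
        structural weight (shallower nodes in the discourse tree score higher),
        then return the top-k sentences in document order.
        discourse_depth[i] is the depth of sentence i in an assumed discourse tree."""
        tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
        df = Counter(w for toks in tokens for w in set(toks))   # cross-sentence frequency

        def stat_score(toks):
            return sum(df[w] for w in set(toks)) / (len(toks) or 1)

        def struct_score(i):
            return 1.0 / (1 + discourse_depth[i])

        scores = [alpha * stat_score(toks) + (1 - alpha) * struct_score(i)
                  for i, toks in enumerate(tokens)]
        top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:k])
        return [sentences[i] for i in top]
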
Calculating Valence of Expressions within Documents for Searching a Document Index
Annual Meeting of the Berkeley Linguistics Society, 1997
Proceedings of the Twenty-Third Annual Meeting of the Berkeley Linguistics Society: General Session and Parasession on Pragmatics and Grammatical Structure (1997)
Systems and methods for dynamic reading fluency proficiency assessment
System and method for teaching writing using microanalysis of text
Systems and methods for generating analytic summaries
Full Dynamic Plural Logic
This paper was sent to the proceedings of the Fourth Hungarian Symposium of Logic and Language. The main change is that formulas are now consistently written as infix relations. This made some textual changes necessary. Some typographical and stylistic errors have also been corrected.
System and method for teaching second language writing skills using the linguistic discourse model
Discourse Structure and Sentiment
2011 IEEE 11th International Conference on Data Mining Workshops, 2011
In this paper we discuss the application of the Linguistic Discourse Model (LDM) to sentiment analysis at the discourse level. Based on the observation that naturally occurring discourse is interpretable though often not coherent, the LDM provides a unified and explanatory approach to discourse sentiment assignment. Special attention is paid here to the well-known problem of computing sentiment in movie reviews, which are characterized by shifting contexts of sentiment source and target.
A discourse perspective on verb phrase anaphora
Proceedings of the International Conference on Formal Ontology in Information Systems - Volume 2001, 2001
Proceedings of the 2004 ACL Workshop on Discourse Annotation - DiscAnnotation '04, 2004
In this paper, we introduce LiveTree, a core component of LIDAS, the Linguistic Discourse Analysis System for automatic discourse parsing with the Unified Linguistic Discourse Model (U-LDM) (X et al., 2004). LiveTree is an integrated workbench for supervised and unsupervised creation, storage and manipulation of the discourse structure of text documents under the U-LDM. The LiveTree environment provides tools for manual and automatic U-LDM segmentation and discourse parsing. Document management, grammar testing, manipulation of discourse structures, and creation and editing of discourse relations are also supported.
Proceedings of the 2004 ACL Workshop on Discourse Annotation - DiscAnnotation '04, 2004
In this paper, we describe how the LIDAS System (Linguistic Discourse Analysis System), a discourse parser built as an implementation of the Unified Linguistic Discourse Model (U-LDM), uses information from sentential syntax and semantics along with lexical semantic information to build the Open Right Discourse Parse Tree (DPT) that serves as a representation of the structure of the discourse (Polanyi et al., 2004; Thione, 2004a, b). More specifically, we discuss how discourse segmentation, sentence-level discourse parsing, and text-level discourse parsing depend on the relationship between sentential syntax and discourse. Specific discourse rules that use syntactic information are used to identify possible attachment points and attachment relations for attaching each Basic Discourse Unit to the DPT.
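
A minimal sketch of the data structure the abstract refers to: an open-right discourse parse tree in which a new Basic Discourse Unit may only attach at a node on the right edge. The node fields and the attachment helper are illustrative assumptions; the U-LDM's actual discourse rules for choosing the attachment point and relation are not reproduced here.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        label: str                      # e.g. a BDU's text, for illustration
        relation: Optional[str] = None  # relation under which this node was attached
        children: List["Node"] = field(default_factory=list)

    def right_edge(root: Node) -> List[Node]:
        """Possible attachment points: the nodes on the path from the root down
        its rightmost children (the open right edge of the DPT)."""
        edge, node = [root], root
        while node.children:
            node = node.children[-1]
            edge.append(node)
        return edge

    def attach(root: Node, bdu: Node, site: Node, relation: str) -> None:
        """Attach a new BDU at a chosen right-edge node under a discourse relation.
        In the real parser the site and relation are selected by discourse rules
        over sentential syntax and semantics; here they are passed in explicitly."""
        assert any(site is n for n in right_edge(root)), "BDUs attach only along the right edge"
        bdu.relation = relation
        site.children.append(bdu)

    # Example: attach a second BDU at the deepest open node on the right edge.
    # root = Node("S1"); attach(root, Node("S2"), right_edge(root)[-1], "coordination")
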
Discourse Grammar and Dynamic Logic
In this paper, I will make a first stab at combining the discourse parsing mechanisms developed in (Prust 1992, Prust et al. 1994) and applied in (Polanyi and Scha 1983, Polanyi 1988) with the dynamic logic for plurals presented at the Ninth Amsterdam Colloquium (van den Berg 1994) and in my dissertation (van den Berg 1996a).
Journal of Logic, Language and Information - JOLLI, 2003
In this article we argue that discourse structure constrains the set of possible constituents in a discourse that can provide the relevant context for structuring information in a target sentence, while information structure critically constrains discourse structure ambiguity. For the speaker, the discourse structure provides a set of possible contexts for continuation, while information structure assignment is independent of discourse structure. For the hearer, the information structure of a sentence together with discourse structure instructs dynamic semantics how rhematic information should be used to update the meaning representation of the discourse (Polanyi and van den Berg, 1996).