Papers by Martin van den Berg

Computer Networks, 1999
The rapid growth of the World Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are good access points to many relevant pages within a few links. We report on extensive focused-crawling experiments using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though both are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs: it discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius. Our anecdotes suggest that focused crawling is very effective for building high-quality collections of Web documents on specific topics, using modest desktop hardware.
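
The crawl loop the abstract sketches can be illustrated with a short best-first search in Python. This is only a minimal sketch: the classifier.relevance, distiller.is_hub, fetch, and extract_links interfaces, the 0.5 relevance threshold, and the hub boost factor are illustrative assumptions, not the paper's implementation.

    import heapq

    def focused_crawl(seed_urls, classifier, distiller, fetch, extract_links, budget=1000):
        """Best-first crawl: the classifier scores how relevant a page is to the
        focus topics, and pages the distiller flags as good access points (hubs)
        have their out-links boosted in the priority frontier."""
        frontier = [(-1.0, url) for url in seed_urls]   # max-heap via negated priority
        heapq.heapify(frontier)
        visited, collected = set(), []
        while frontier and len(visited) < budget:
            _, url = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            page = fetch(url)
            relevance = classifier.relevance(page)      # estimated probability the page is on-topic
            if relevance < 0.5:                         # prune irrelevant regions of the Web
                continue
            collected.append((url, relevance))
            boost = 2.0 if distiller.is_hub(url) else 1.0   # hubs point to many relevant pages
            for link in extract_links(page):
                if link not in visited:
                    heapq.heappush(frontier, (-(relevance * boost), link))
        return collected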

We describe the architecture of a hypertext resource discovery system using a relational database. Such a system can answer questions that combine page contents, metadata, and hyperlink structure in powerful ways, such as "find the number of links from an environmental protection page to a page about oil and natural gas over the last year." A key problem in populating the database in such a system is to discover web resources related to the topics involved in such queries. We argue that a keyword-based "find similar" search based on a giant all-purpose crawler is neither necessary nor adequate for resource discovery. Instead we exploit the properties that pages tend to cite pages with related topics, and that, given that a page u cites a page about a desired topic, it is very likely that u cites additional desirable pages. We exploit these properties by using a crawler controlled by two hypertext mining programs: (1) a classifier that evaluates the relevance of a region of the web to the user's interest, and (2) a distiller that evaluates a page as an access point for a large neighborhood of relevant pages. Our implementation uses IBM's Universal Database, not only for robust data storage, but also for integrating the computations of the classifier and distiller into the database. This results in a significant increase in I/O efficiency: a factor of ten for the classifier and a factor of three for the distiller. In addition, ad-hoc SQL queries can be used to monitor the crawler and dynamically change crawling strategies. We report on experiments to establish that our system is efficient, effective, and robust.
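
To make the kind of combined content-and-link query above concrete, here is a hedged sketch in Python with an embedded SQL query. The toy schema (page, page_topic, and link tables) and the use of SQLite are illustrative assumptions only; they are not the schema or the IBM Universal Database integration described in the paper.

    import sqlite3

    # Assumed toy schema: page(url, crawled_on), page_topic(url, topic), link(src, dst).
    # Count links from "environmental protection" pages to "oil and natural gas"
    # pages crawled in the last year.
    QUERY = """
    SELECT COUNT(*)
    FROM link
    JOIN page_topic AS src_t ON src_t.url = link.src
    JOIN page_topic AS dst_t ON dst_t.url = link.dst
    JOIN page       AS src_p ON src_p.url = link.src
    WHERE src_t.topic = 'environmental protection'
      AND dst_t.topic = 'oil and natural gas'
      AND src_p.crawled_on >= date('now', '-1 year');
    """

    def count_cross_topic_links(db_path="crawl.db"):
        with sqlite3.connect(db_path) as conn:
            (n,) = conn.execute(QUERY).fetchone()
        return n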

… ACL Workshop Text …, 2004
In this paper, we present algorithms to address the shortcomings of both purely structural and purely statistical methods of sentence-extraction summarization. We present the PALSUMM hybrid summarization algorithms that use structural methods based on discourse ...
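
As a rough illustration of what a hybrid structural/statistical sentence extractor can look like, the sketch below linearly combines a term-frequency salience score with a weight derived from each sentence's depth in a discourse tree. The scoring functions and the linear combination are illustrative assumptions, not the PALSUMM algorithms themselves.

    import re
    from collections import Counter

    def hybrid_summary(sentences, discourse_depth, k=3, alpha=0.5):
        """Score each sentence by a statistical term-frequency salience and a
        structural weight (shallower nodes in the discourse tree score higher),
        then return the top-k sentences in document order.
        discourse_depth[i] is the depth of sentence i in an assumed discourse tree."""
        tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
        df = Counter(w for toks in tokens for w in set(toks))   # cross-sentence frequency

        def stat_score(toks):
            return sum(df[w] for w in set(toks)) / (len(toks) or 1)

        def struct_score(i):
            return 1.0 / (1 + discourse_depth[i])

        scores = [alpha * stat_score(toks) + (1 - alpha) * struct_score(i)
                  for i, toks in enumerate(tokens)]
        top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:k])
        return [sentences[i] for i in top]
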
Calculating Valence of Expressions within Documents for Searching a Document Index
Annual Meeting of the Berkeley Linguistics Society, 1997
Proceedings of the Twenty-Third Annual Meeting of the Berkeley Linguistics Society: General Session and Parasession on Pragmatics and Grammatical Structure (1997)
Systems and methods for dynamic reading fluency proficiency assessment
System and method for teaching writing using microanalysis of text
Systems and methods for generating analytic summaries
Full Dynamic Plural Logic
This paper was sent to the proceedings of the Fourth Hungarian Symposium of Logic and Language. The main change is that formulas are now consistently written as infix relations. This made some textual changes necessary. Some typographical and stylistic errors have also been corrected.
System and method for teaching second language writing skills using the linguistic discourse model
Discourse Structure and Sentiment
2011 IEEE 11th International Conference on Data Mining Workshops, 2011
In this paper we discuss the application of the Linguistic Discourse Model (LDM) to sentiment analysis at the discourse level. Based on the observation that naturally occurring discourse is interpretable though often not coherent, the LDM provides a unified and explanatory approach to discourse sentiment assignment. Special attention is paid here to the well-known problem of computing sentiment in movie reviews, which are characterized by shifting contexts of sentiment source and target.
A discourse perspective on verb phrase anaphora
Proceedings of the International Conference on Formal Ontology in Information Systems - Volume 2001, 2001
Proceedings of the 2004 ACL Workshop on Discourse Annotation - DiscAnnotation '04, 2004
In this paper, we introduce LiveTree, a core component of LIDAS, the Linguistic Discourse Analysis System for automatic discourse parsing with the Unified Linguistic Discourse Model (U-LDM) (X et al., 2004). LiveTree is an integrated workbench for supervised and unsupervised creation, storage and manipulation of the discourse structure of text documents under the U-LDM. The LiveTree environment provides tools for manual and automatic U-LDM segmentation and discourse parsing. Document management, grammar testing, manipulation of discourse structures, and creation and editing of discourse relations are also supported.
Proceedings of the 2004 ACL Workshop on Discourse Annotation - DiscAnnotation '04, 2004
In this paper, we describe how the LIDAS System (Linguistic Discourse Analysis System), a discourse parser built as an implementation of the Unified Linguistic Discourse Model (U-LDM), uses information from sentential syntax and semantics along with lexical semantic information to build the Open Right Discourse Parse Tree (DPT) that serves as a representation of the structure of the discourse (Polanyi et al., 2004; Thione, 2004a, b). More specifically, we discuss how discourse segmentation, sentence-level discourse parsing, and text-level discourse parsing depend on the relationship between sentential syntax and discourse. Specific discourse rules that use syntactic information are used to identify possible attachment points and attachment relations for attaching each Basic Discourse Unit to the DPT.
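
A minimal sketch of the data structure the abstract refers to: an open-right discourse parse tree in which a new Basic Discourse Unit may only attach at a node on the right edge. The node fields and the attachment helper are illustrative assumptions; the U-LDM's actual discourse rules for choosing the attachment point and relation are not reproduced here.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        label: str                      # e.g. a BDU's text, for illustration
        relation: Optional[str] = None  # relation under which this node was attached
        children: List["Node"] = field(default_factory=list)

    def right_edge(root: Node) -> List[Node]:
        """Possible attachment points: the nodes on the path from the root down
        its rightmost children (the open right edge of the DPT)."""
        edge, node = [root], root
        while node.children:
            node = node.children[-1]
            edge.append(node)
        return edge

    def attach(root: Node, bdu: Node, site: Node, relation: str) -> None:
        """Attach a new BDU at a chosen right-edge node under a discourse relation.
        In the real parser the site and relation are selected by discourse rules
        over sentential syntax and semantics; here they are passed in explicitly."""
        assert any(site is n for n in right_edge(root)), "BDUs attach only along the right edge"
        bdu.relation = relation
        site.children.append(bdu)

    # Example: attach a second BDU at the deepest open node on the right edge.
    # root = Node("S1"); attach(root, Node("S2"), right_edge(root)[-1], "coordination")
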
Discourse Grammar and Dynamic Logic
In this paper, I will make a first stab at combining the discourse parsing mechanisms developed in (Prust 1992, Prust et al. 1994) and applied in (Polanyi and Scha 1983, Polanyi 1988) with the dynamic logic for plurals presented at the Ninth Amsterdam Colloquium (van den Berg 1994) and in my dissertation (van den Berg 1996a).
Journal of Logic, Language and Information - JOLLI, 2003
In this article we argue that discourse structure constrains the set of possible constituents in a discourse that can provide the relevant context for structuring information in a target sentence, while information structure critically constrains discourse structure ambiguity. For the speaker, the discourse structure provides a set of possible contexts for continuation, while information structure assignment is independent of discourse structure. For the hearer, the information structure of a sentence together with discourse structure instructs dynamic semantics how rhematic information should be used to update the meaning representation of the discourse (Polanyi and van den Berg, 1996).