Figure 5 – uploaded by István T. Nagy

Table 5. Performance comparison between different feature strings and slot size thresholds. We take all the papers extracted from PDF files as input to run the algorithm. Identical TP-URLs are first eliminated (and their candidate anchor blocks are therefore merged) by utilizing a hash table. This pre-processing step results in about 1.46 million distinct TP-URLs. The number is larger than our collection size (0.9 million) because some cited papers are not in our paper collection. We tested four kinds of feature strings, all of which are generated from the paper title: unigrams, bigrams, trigrams, and 4-grams. Table 4 shows the slot size distribution corresponding to each kind of feature string. The performance comparison among different feature strings and slot size thresholds is shown in Table 5. It seems that bigrams achieve a good trade-off between accuracy and performance.
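The pre-processing and feature-string generation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, titles are assumed to be tokenized on whitespace, and the slot size threshold is assumed to mean "discard any slot that holds more papers than the threshold".

```python
from collections import defaultdict


def word_ngrams(title, n):
    """Return the word n-grams of a title as space-joined, lower-cased strings."""
    words = title.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def merge_anchor_blocks(records):
    """Merge candidate anchor blocks that share the same key.

    A dict plays the role of the hash table: identical keys collapse
    into one entry whose blocks are concatenated in input order.
    """
    merged = defaultdict(list)
    for key, block in records:
        merged[key].append(block)
    return dict(merged)


def build_slots(titles, n, max_slot_size):
    """Index paper ids by title n-gram and drop oversized slots.

    Assumption: a slot is the set of papers sharing one feature string,
    and slots larger than max_slot_size are skipped.
    """
    slots = defaultdict(set)
    for paper_id, title in titles.items():
        for gram in word_ngrams(title, n):
            slots[gram].add(paper_id)
    return {g: ids for g, ids in slots.items() if len(ids) <= max_slot_size}
```

Under these assumptions, a larger n gives more specific feature strings and smaller slots, while a lower threshold trades recall for speed.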

Related Figures (51)

Table 1: The size of the textual corpus which contains affiliation information. freely available for non-commercial use*.

Table 4: Accuracies of subject detection methods. To find predicated relationships among the other types of entities (affiliation, position type, start year, end year) we used a very simple heuristic. As the affiliation slot is the head of the tuple, we simply assigned every other detected entity to the nearest affiliation and regarded the earlier predicated year token as the start year.
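The nearest-affiliation heuristic can be sketched as below. This is only an illustration under stated assumptions: entities are given as hypothetical `(token_position, kind, text)` triples, distance is measured in token positions, "earlier" is read as "appearing first in the text", and ties go to the affiliation that appears first.

```python
def assign_to_affiliations(entities):
    """Attach every non-affiliation entity to its nearest affiliation.

    entities: list of (position, kind, text) with kind in
    {"AFF", "POS", "YEAR"} (hypothetical labels for this sketch).
    Returns one record per affiliation, in order of appearance.
    """
    affs = sorted((pos, text) for pos, kind, text in entities if kind == "AFF")
    if not affs:
        return []
    records = [{"affiliation": text, "positions": [], "years": []} for _, text in affs]

    for pos, kind, text in sorted(entities):
        if kind == "AFF":
            continue
        # Index of the affiliation with the smallest token distance.
        nearest = min(range(len(affs)), key=lambda i: abs(affs[i][0] - pos))
        if kind == "POS":
            records[nearest]["positions"].append(text)
        elif kind == "YEAR":
            records[nearest]["years"].append(text)

    for rec in records:
        years = rec.pop("years")
        # The earlier year token is taken as the start year.
        rec["start_year"] = years[0] if years else None
        rec["end_year"] = years[1] if len(years) > 1 else None
    return records
```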

Table 2: The results achieved by CRF. evaluation scheme, while Table 3 lists the results of a baseline method which labels each member of the university and position type gazetteers and identifies years using regular expressions. This comparison highlights the fact that labeling each occurrence of these easily recognisable classes cannot be applied. It gives an extremely low precision, thus contextual information has to be leveraged.

Table 3: NER baseline results. 4.5 The assignment of roles

Figure 2. The main process of extracting (a) anchor text in general web search and (b) pseudo-anchor text in academic search. Before describing our approach in detail, we first recall how anchor text is processed in general Web search. Assume that a collection of documents has been crawled and stored on local disk. In the first step, each web page is parsed and the out links (or forward links) within the page are extracted. Each link comprises a URL and its corresponding anchor text. In the second step, all links are accumulated according to their destination URLs (i.e. the anchor texts of all links pointing to the same URL are merged). Thus, we can get all anchor text corresponding to each web page. Figure 2 (a) demonstrates this process.
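The two steps for general Web search can be sketched with the Python standard library. This is a minimal illustration, not a production crawler pipeline: `LinkExtractor` and `accumulate_anchor_text` are hypothetical names, pages are assumed to be available as HTML strings, and URL normalization is omitted.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Step 1: collect (destination URL, anchor text) pairs from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text inside an open <a href=...> element counts as anchor text.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def accumulate_anchor_text(pages):
    """Step 2: merge the anchor texts of all links by destination URL."""
    anchors = defaultdict(list)
    for html in pages:
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        for url, text in parser.links:
            anchors[url].append(text)
    return dict(anchors)
```

Each key of the returned dictionary is a destination URL, and its value is the list of anchor texts that other pages used when linking to it.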

for each term in the anchor blocks, a discrete degree of being anchor text. The main reason for taking such an approach is twofold. First, we believe that assigning each term a fuzzy degree of being anchor text is more appropriate than a binary judgment as either an anchor-term or non-anchor-term. Second, since the importance of a term for a “link” may be determined by many factors in paper search, a machine-learning approach could be more flexible and general than approaches that compute term degrees by a specially designed formula.

Figure 6 shows the performance comparison between the results of two baseline paper ranking algorithms and the results of including pseudo-anchor text in ranking.

Table 2. Statistical significance tests (t-test over nDCG@3). From Figure 6, we can see that the overall performance is greatly improved by including pseudo-anchor information. Table 2 shows the t-test results, where a “>” indicates that the algorithm in the row outperforms that in the column with a p-value of 0.05 or less, and a “>>” means a p-value of 0.01 or less.

Table 4. Slot distribution with different feature strings. When feature strings are fixed, the slot size threshold can be used to tune the tradeoff between accuracy and performance.

Table 1: Teufel’s (1999) Argumentative Zones. 2 Argumentative Zoning. Teufel (1999) introduced a new rhetorical analysis for scientific texts called Argumentative Zoning. Each sentence of an article from the scientific literature is classified into one of seven basic rhetorical structures shown in Table 1. 3 Maximum Entropy models. Maximum entropy (ME) or log-linear models are statistical models that can incorporate evidence from a diverse range of complex and potentially overlapping features. Unlike Naive Bayes (NB), the features can be conditionally dependent given the class, which is important since feature sets in NLP rarely satisfy this independence constraint.

large difference to classification accuracy.

Table 3: Teufel and Moens (2002)’s and our NB performance on CMP-LG. Table 4: History features on the CMP-LG corpus with ME model of unigram/bigram features only.

Table 5: Subtractive analysis CMP-LG ME model

Table 8: Subtractive analysis ASTRO ME model. Table 7: Final CMP-LG ME performance.

Table 9: Final ASTRO ME model performance. Figure 1: Examples of sentences with the given tags in the astronomical corpus.

Table 10: Comparing CMP-LG and ASTRO directly on the basic annotation scheme. Table 10 compares the performance of our Naive Bayes and Maximum Entropy classifiers on the two corpora for just the basic annotation scheme: Background, Own and Other. The features used are the set of Teufel features we have implemented (so it does not include unigram or bigram features).

The goal of our study is to classify English research papers (Language L1=English, Genre G1=research papers) into a patent classification using a patent data set written in Japanese (Language L2=Japanese, Genre G2=patents). Figure 2 shows the system configuration. Our system consists of a "Japanese index creating module" and a "document classification module". In the following, we explain both modules. When a title and abstract pair, as shown in Figure 3, is given, the module creates a Japanese index, shown in Figure 4, using translation models for research papers and for patents.

3.2 System Configuration

We proposed several methods that automatically classify research papers into the IPC system using two translation models. To confirm the effectiveness of our method, we conducted some examinations using the data of the NTCIR-7 Patent Mining Task. The results showed that one of our methods, "SMT(Paper)+Index(Patent)", obtained a MAP score of 0.2897. This score was higher than that of "SMT(Paper)", which used translation results by the translation model for research papers, and this indicates that our method is effective for cross-genre, cross-lingual document classification. Table 3: Recall for top n results (SMT(Paper)+Index(Patent)).

NTCIR-7, Proceedings of the 7th NTCIR Workshop Meeting: 351-353. 5 Conclusion

Table 2. Causes of silence: 1. Incorrect analysis by the parser; 2. Inadequacy of the framework for the task; 3. Not SUMMARY or PROBLEM sentence according to our definition.

Figure 1: Principal information needs and tasks of participants with regard to citations. In the first table, information needs are prefixed by ‘md’ for meta-data and ‘co’ for content-oriented. ‘Freq’ indicates the number of occurrences in the results.

Figure 5: A sample pop-up with an automatically generated summary, triggered by a mouse action over the citation. Extracted sentences are grouped together by section titles. Words that match with the citation context are coloured and emboldened.

Figure 1: CGI interface used for matching new references to existing papers.

Figure 3: Snapshot of the different statistics for a paper

Table 2: Network Statistics of the citation and collaboration network. The remaining authors (11,180-10,409) are not cited and are therefore removed from the network analysis.

Table 3: Degree Statistics of the citation and collaboration networks

Table 7: Authors with the highest h-index. Table 6: Authors with most incoming citations (the values in parentheses are using non-self-citations).

Table 8: Authors with the least average shortest path (ASP) length in the author collaboration network

of the author. We have been able to annotate 8,578 authors this way: 6,396 male and 2,182 female. The citation text that we have extracted for each paper is a good resource to generate summaries of the contributions of that paper. We have previously developed systems that cluster the similarity networks to generate short, yet informative, summaries of individual papers (Qazvinian and Radev 2008) and of more general scientific topics, such as Dependency Parsing and Machine Translation (Radev et al. 2009).

Figure 2: Digital library interface with faceted navigation, continued, from http://berkeley.worldcat.org .

Figure 1: Worldcat consortium digital library interface using faceted navigation. The instance shown is the University of California version, from http://berkeley.worldcat.org .

Figure 3: University of Chicago digital library interface using faceted navigation, using an interface from AquaBrowser.

Figure 1: A web page with a list of references. Paper titles are displayed in bold. {hongchin, jprabawa, kanmy}@comp.nus.edu.sg

Tests were conducted using these 40 pages to obtain the reference string recognition algorithm’s accuracy. A reference string is considered found if there exists, in the set of confirmed reference strings C’, a parsed text segment c that contains the entire title as well as all the authors’ names. Each parsed text segment can only be used to identify one reference string, so if any text segment contains more than one reference string, only one of those reference strings will be considered found. In order to determine the effect of each stage on overall recognition accuracy, some stages of the recognition algorithm were disabled in testing. The results are presented in Table 2. As all test pages come from university domains, all pass the first URL test. When the keyword search is deactivated, all 40 test pages pass Stage 1. Otherwise, 19 pages with reference strings and 6 pages without reference strings pass Stage 1.
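The "considered found" criterion can be sketched as follows. This is an illustrative reading, not the evaluation script used in the study: `count_found` is a hypothetical name, matching is assumed to be case-insensitive substring containment, and segments are claimed greedily in the order the gold references are listed.

```python
def count_found(references, segments):
    """Count gold reference strings that are found among parsed segments.

    references: list of (title, [author names]) gold entries.
    segments:   list of parsed text segments (the confirmed set).
    A segment matches when it contains the whole title and every author
    name; each segment may be used for at most one reference.
    """
    used = set()
    found = 0
    for title, authors in references:
        for i, segment in enumerate(segments):
            if i in used:
                continue
            text = segment.lower()
            if title.lower() in text and all(a.lower() in text for a in authors):
                used.add(i)  # this segment cannot identify another reference
                found += 1
                break
    return found
```

With this rule, a segment that accidentally spans two reference strings contributes only one found reference, which penalizes under-segmentation.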

Table 1: List of classifier features. information about the token; 3) Contextual features, which are lexical or local features of a token’s neighbours. Table 1 gives an exhaustive list of features used in FireCite.

Figure 5: Screenshot of FireCite prototype illustrating (a) the reference string library, (b) button appended to each reference string, and (c) button state after the reference string has been added to the list.

Table 4: Performance evaluation of the system.

Table 1: Corpus composition. To our knowledge, this is the first corpus constructed in the context of paper summarization related to collections of citing papers. We then linked each c-site to its anchor, each anchor to its reference, and any background information to the c-site it supplemented. We also decided on annotating entire sentences, even if only part of a sentence referred to the cited paper. Table 1 outlines our corpus. Analysis of the corpus provided some interesting insights, though a larger corpus is required to confirm the frequency and validity of such phenomena. The more salient discoveries are itemized below. These phenomena may also co-occur.

Table 2: Evaluation results for coreference resolution against the MUC-7 formal corpus. salient for increased performance. We also extended this list by adding a cosine-similarity metric between two noun phrases; it uses bag-of-words to create a vector for each noun phrase (where each word is a term in the vector) to compute their similarity. The intuition behind this is that noun phrases with more similar surface forms should be more likely to corefer. resolution with coreference-chains. This is because coreference-chains match noun phrases that appear with other noun phrases to which they refer, a characteristic present in these three categories. On the other hand, cue-phrases do not detect any c-site sentence that does not use keywords (e.g. “In addition”). In the following section we discuss our implementation of a coreference chain-based extraction technique, and how we then applied it to the c-site extraction task. An analysis of the results then follows.
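The bag-of-words cosine-similarity feature mentioned above can be sketched in a few lines. This is a minimal illustration under assumptions not stated in the text: noun phrases are lower-cased and split on whitespace, raw term counts are used as vector weights, and an empty phrase yields a similarity of 0.

```python
import math
from collections import Counter


def cosine_similarity(np1, np2):
    """Cosine similarity between the bag-of-words vectors of two noun phrases."""
    v1 = Counter(np1.lower().split())
    v2 = Counter(np2.lower().split())
    # Dot product over the words the two phrases share.
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(
        sum(c * c for c in v2.values())
    )
    return dot / norm if norm else 0.0
```

Identical surface forms score close to 1 and phrases with no words in common score 0, which matches the stated intuition that similar surface forms are more likely to corefer.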

Table 3: Features used for coreference resolution.

Table 4: Evaluation results for c-site extraction w/o background information
