Table 5 Performance comparison between different feature strings and slot size thresholds

We take all the papers extracted from PDF files as input to run the algorithm. Identical TP-URLs are first eliminated using a hash table, so their candidate anchor blocks are merged. This preprocessing step results in about 1.46 million distinct TP-URLs. The number is larger than our collection size (0.9 million) because some cited papers are not in our paper collection. We tested four kinds of feature strings, all generated from the paper title: unigrams, bigrams, trigrams, and 4-grams. Table 4 shows the slot size distribution corresponding to each kind of feature string. The performance comparison among different feature strings and slot size thresholds is shown in Table 5. Bigrams appear to achieve a good trade-off between accuracy and performance.
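
The preprocessing and feature-string steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the text does not confirm: n-grams are taken over whitespace-separated lowercase words (the text does not say whether they are word-level or character-level), a "slot" is the set of titles sharing one feature string, and slots larger than the threshold are simply skipped. All function names and the `max_slot_size` parameter are hypothetical.

```python
from collections import defaultdict


def merge_duplicate_urls(records):
    """Merge the candidate anchor blocks of identical URLs.

    `records` is an iterable of (url, anchor_block) pairs. A dict plays
    the role of the hash table mentioned in the text.
    """
    merged = defaultdict(list)
    for url, anchor_block in records:
        merged[url].append(anchor_block)
    return dict(merged)


def word_ngrams(title, n):
    """Return the set of word n-grams of a title.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    words = title.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_slots(titles, n):
    """Map each feature string to the ids of the titles that contain it."""
    slots = defaultdict(set)
    for title_id, title in enumerate(titles):
        for feature in word_ngrams(title, n):
            slots[feature].add(title_id)
    return dict(slots)


def prune_slots(slots, max_slot_size):
    """Keep only slots with at most `max_slot_size` entries.

    Assumption: oversized slots are dropped rather than truncated.
    """
    return {f: ids for f, ids in slots.items() if len(ids) <= max_slot_size}
```

In this sketch a smaller `n` yields fewer distinct feature strings but larger slots, while a larger `n` yields smaller slots that are more sensitive to small differences between titles. That is one way to picture the trade-off reported for bigrams, though the measured figures are those in Tables 4 and 5, not anything derived from this code.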