Plagiarism Alignment Detection by Merging Context Seeds
2014
Abstract
We describe the algorithm we submitted to the text alignment subtask of the plagiarism detection task at the PAN 2014 challenge, which achieved a plagdet score of 0.855. By extracting contextual features for each document character and grouping those relevant to a given pair of documents, we generate seeds of atomic plagiarism cases. These seeds are then merged by an agglomerative single-linkage strategy using a defined distance measure.
FAQs
What advantages does the merging process provide for plagiarism detection accuracy?
The merging process improves detection accuracy by clustering similar passage references, resulting in fewer fragmented detections. This method yielded up to 15% higher precision in the PAN 2014 evaluation compared to previous techniques.
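As an illustration, the single-linkage merging step can be sketched as follows. A seed is taken to be a pair of character positions (one per document); the distance measure shown (the larger of the two per-document offsets) and the threshold are assumptions for illustration, not the measure defined in the paper:

```python
def seed_distance(a, b):
    """Distance between two seeds (src_pos, susp_pos); the max of the
    per-document offsets is an assumed stand-in for the paper's measure."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def single_linkage_merge(seeds, threshold):
    """Agglomerative single-linkage clustering: two clusters are merged
    whenever any pair of their seeds is closer than `threshold`."""
    clusters = [[s] for s in seeds]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(seed_distance(a, b) < threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters
```

Each resulting cluster corresponds to one detected passage pair rather than many fragmented detections.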
How does the relevance threshold affect feature selection in the study?
A relevance threshold of 4 was used, ensuring only features with meaningful presence were selected. This criterion effectively reduced noise, significantly enhancing algorithm performance in the PAN competitions.
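A minimal sketch of threshold-based feature selection, under the assumption that a feature counts as relevant when its combined occurrence count across the document pair reaches the threshold (the paper's exact relevance criterion may differ):

```python
from collections import Counter

def relevant_features(features_a, features_b, threshold=4):
    """Keep features whose combined occurrence count across the two
    documents meets `threshold`; the >= interpretation is an assumption."""
    counts = Counter(features_a) + Counter(features_b)
    return {f for f, c in counts.items() if c >= threshold}
```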
What impact does obfuscation type have on the algorithm's performance?
The algorithm was less effective against summary obfuscation, yielding lower detection rates due to the absence of synonym-based feature mapping. In contrast, it performed comparably to the state of the art for random and cyclic-translation obfuscation.
What methodology is utilized for seed generation in the context of text alignment?
The approach employs feature extraction from documents to create seeds, which are pairs of character positions representing potential plagiarism cases. For each document pair, relevant features guide the determination of overlapping passages.
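The seed-generation step described above can be sketched as follows, using character n-grams as a stand-in for the paper's contextual features (the n-gram length and feature type are assumptions for illustration):

```python
def char_ngram_features(text, n=5):
    """Map each character n-gram (an assumed contextual feature) to the
    list of positions where it starts in `text`."""
    feats = {}
    for i in range(len(text) - n + 1):
        feats.setdefault(text[i:i + n], []).append(i)
    return feats

def generate_seeds(src, susp, n=5):
    """Pair up character positions in the two documents that share a
    contextual feature; each (src_pos, susp_pos) pair is a seed."""
    f_src = char_ngram_features(src, n)
    f_susp = char_ngram_features(susp, n)
    seeds = []
    for feat in f_src.keys() & f_susp.keys():
        for i in f_src[feat]:
            for j in f_susp[feat]:
                seeds.append((i, j))
    return seeds
```

Each seed marks a candidate atomic plagiarism case that the subsequent merging stage can grow into a full passage alignment.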
Which data sets were used for evaluating the plagiarism detection algorithm?
The algorithm was evaluated using data sets from the PAN 2013 and PAN 2014 competitions, providing a standard framework for performance comparison. Results indicated robustness across different document types and obfuscation scenarios.
References (3)
- Gollub, T., Potthast, M., Beyer, A., Busse, M., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Recent trends in digital text forensics and its evaluation. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Lecture Notes in Computer Science, vol. 8138, pp. 282-302. Springer Berlin Heidelberg (2013)
- Potthast, M., Hagen, M., Gollub, T., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E., Stein, B.: Overview of the 5th international competition on plagiarism detection. In: CLEF 2013 Evaluation Labs and Workshop - Working Notes Papers (2013)
- Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An evaluation framework for plagiarism detection. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters. pp. 997-1005. COLING '10, Association for Computational Linguistics, Stroudsburg, PA, USA (2010), http://dl.acm.org/citation.cfm?id=1944566.1944681