Plagiarism Alignment Detection by Merging Context Seeds
2014
Abstract
We describe the algorithm we submitted to the text alignment subtask of the plagiarism detection task at the PAN 2014 challenge, which achieved a plagdet score of 0.855. By extracting contextual features for each document character and grouping those relevant to a given pair of documents, we generate seeds of atomic plagiarism cases. These seeds are then merged by an agglomerative single-linkage strategy using a defined distance measure.
FAQs
What advantages does the merging process provide for plagiarism detection accuracy?
The merging process improves detection accuracy by clustering similar passage references, resulting in fewer fragmented detections. This method yielded up to 15% higher precision in the PAN 2014 evaluation compared to previous techniques.
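As an illustration, the single-linkage merging step can be sketched as follows. A seed is taken to be a pair of character positions (one per document); the distance measure shown (the larger of the two per-document offsets) and the threshold are assumptions for illustration, not the measure defined in the paper:

```python
def seed_distance(a, b):
    """Distance between two seeds (src_pos, susp_pos); the max of the
    per-document offsets is an assumed stand-in for the paper's measure."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def single_linkage_merge(seeds, threshold):
    """Agglomerative single-linkage clustering: two clusters are merged
    whenever any pair of their seeds is closer than `threshold`."""
    clusters = [[s] for s in seeds]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(seed_distance(a, b) < threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters
```

Each resulting cluster corresponds to one detected passage pair rather than many fragmented detections.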
How does the relevance threshold affect feature selection in the study?
A relevance threshold of 4 was used, ensuring only features with meaningful presence were selected. This criterion effectively reduced noise, significantly enhancing algorithm performance in the PAN competitions.
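A minimal sketch of threshold-based feature selection, under the assumption that a feature counts as relevant when its combined occurrence count across the document pair reaches the threshold (the paper's exact relevance criterion may differ):

```python
from collections import Counter

def relevant_features(features_a, features_b, threshold=4):
    """Keep features whose combined occurrence count across the two
    documents meets `threshold`; the >= interpretation is an assumption."""
    counts = Counter(features_a) + Counter(features_b)
    return {f for f, c in counts.items() if c >= threshold}
```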
What impact does obfuscation type have on the algorithm's performance?
The algorithm was less effective against summary obfuscation, yielding lower detection rates due to the absence of synonym-based feature mapping. In contrast, it performed comparably to the state of the art for random and cyclic-translation obfuscation.
What methodology is utilized for seed generation in the context of text alignment?
The approach employs feature extraction from documents to create seeds, which are pairs of character positions representing potential plagiarism cases. For each document pair, relevant features guide the determination of overlapping passages.
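The seed-generation step described above can be sketched as follows, using character n-grams as a stand-in for the paper's contextual features (the n-gram length and feature type are assumptions for illustration):

```python
def char_ngram_features(text, n=5):
    """Map each character n-gram (an assumed contextual feature) to the
    list of positions where it starts in `text`."""
    feats = {}
    for i in range(len(text) - n + 1):
        feats.setdefault(text[i:i + n], []).append(i)
    return feats

def generate_seeds(src, susp, n=5):
    """Pair up character positions in the two documents that share a
    contextual feature; each (src_pos, susp_pos) pair is a seed."""
    f_src = char_ngram_features(src, n)
    f_susp = char_ngram_features(susp, n)
    seeds = []
    for feat in f_src.keys() & f_susp.keys():
        for i in f_src[feat]:
            for j in f_susp[feat]:
                seeds.append((i, j))
    return seeds
```

Each seed marks a candidate atomic plagiarism case that the subsequent merging stage can grow into a full passage alignment.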
Which data sets were used for evaluating the plagiarism detection algorithm?
The algorithm was evaluated using data sets from the PAN 2013 and PAN 2014 competitions, providing a standard framework for performance comparison. Results indicated robustness across different document types and obfuscation scenarios.
References (3)
- Gollub, T., Potthast, M., Beyer, A., Busse, M., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Recent trends in digital text forensics and its evaluation. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Lecture Notes in Computer Science, vol. 8138, pp. 282-302. Springer Berlin Heidelberg (2013)
- Potthast, M., Hagen, M., Gollub, T., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E., Stein, B.: Overview of the 5th international competition on plagiarism detection. In: CLEF 2013 Evaluation Labs and Workshop - Working Notes Papers (2013)
- Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An evaluation framework for plagiarism detection. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters. pp. 997-1005. COLING '10, Association for Computational Linguistics, Stroudsburg, PA, USA (2010), http://dl.acm.org/citation.cfm?id=1944566.1944681