Analyzing Semantic Concept Patterns to Detect Academic Plagiarism
2017, Proceedings of the International Workshop on Mining Scientific Publications (WOSP) at the ACM/IEEE Joint Conference on Digital Libraries (JCDL)
https://doi.org/10.1145/3127526.3127535
Abstract
Detecting academic plagiarism is a pressing problem, e.g., for educational and research institutions, funding agencies, and academic publishers. Existing plagiarism detection systems reliably identify copied text, or near copies of text, but often fail to detect disguised forms of academic plagiarism, such as paraphrases, translations, and idea plagiarism. We present Semantic Concept Pattern Analysis, an approach that performs an integrated analysis of semantic text relatedness and structural text similarity. Using 25 officially retracted academic plagiarism cases, we demonstrate that our approach can detect plagiarism that established text matching approaches would not identify. We view our approach as a promising addition to improve the detection capabilities for strong paraphrases. We plan to further improve Semantic Concept Pattern Analysis and include the approach as part of an integrated detection process that analyzes heterogeneous similarity features to better identify the many possible forms of plagiarism in academic documents.