PDLK: Plagiarism detection using linguistic knowledge

2015, Expert Systems with Applications

https://doi.org/10.1016/J.ESWA.2015.07.048

Abstract

Plagiarism is the reuse of someone else's ideas, work, or words without sufficient attribution to the source. This paper presents a method for detecting external plagiarism that integrates the semantic relations between words with their syntactic composition. Existing methods fail to capture meaning when they compare a source-document sentence with a suspicious-document sentence, both when the two sentences share the same surface text (the same words) and when one is a paraphrase of the other. This leads to inaccurate or unnecessary matches. The proposed method can improve plagiarism detection performance because it avoids selecting a source sentence whose similarity to the suspicious sentence is high but whose meaning is different. It does so by computing the semantic and syntactic similarity between each pair of sentences. In addition, the method expands the words in each sentence to address the problem of limited information, bridging the lexical gap between semantically similar contexts that are expressed in different wording. The method can also identify various kinds of plagiarism, such as exact copying, paraphrasing, sentence transformation, and changes to word structure within sentences. Experimental results show that the proposed method improves on the performance of the systems that participated in PAN-PC-11, and that it outperforms other existing techniques on the PAN-PC-10 and PAN-PC-11 datasets.
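The abstract does not give the formulas, so the following is only a minimal sketch of the general idea: a sentence-to-sentence score that combines a semantic component (with word expansion through synonyms) and a syntactic component (word order). Everything specific here is an assumption made for illustration: the tiny `SYNONYMS` table stands in for a real lexical resource, the word-order measure is one common formulation from the sentence-similarity literature rather than the one confirmed by the paper, and the weight `alpha=0.8` and all function names are hypothetical.

```python
import math
import re

# Hypothetical toy synonym table; a real system would query a lexical resource.
SYNONYMS = {
    "car": {"automobile"},
    "automobile": {"car"},
    "buy": {"purchase"},
    "purchase": {"buy"},
}


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def match_position(word, tokens):
    """1-based position of the word, or of one of its synonyms, in tokens; 0 if absent."""
    candidates = {word} | SYNONYMS.get(word, set())
    for i, tok in enumerate(tokens, start=1):
        if tok in candidates:
            return i
    return 0


def joint_words(t1, t2):
    """Unique words of both sentences, in order of first appearance."""
    return list(dict.fromkeys(t1 + t2))


def semantic_similarity(t1, t2):
    """Cosine of binary vectors over the joint word set, after synonym expansion."""
    joint = joint_words(t1, t2)
    v1 = [1.0 if match_position(w, t1) else 0.0 for w in joint]
    v2 = [1.0 if match_position(w, t2) else 0.0 for w in joint]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


def word_order_similarity(t1, t2):
    """1 - |r1 - r2| / |r1 + r2|, where r1 and r2 hold word positions."""
    joint = joint_words(t1, t2)
    r1 = [match_position(w, t1) for w in joint]
    r2 = [match_position(w, t2) for w in joint]
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total if total else 0.0


def sentence_similarity(s1, s2, alpha=0.8):
    """Weighted mix of semantic and word-order similarity (alpha is assumed)."""
    t1, t2 = tokenize(s1), tokenize(s2)
    return alpha * semantic_similarity(t1, t2) + (1 - alpha) * word_order_similarity(t1, t2)
```

Under these assumptions, "the dog bit the man" and "the man bit the dog" share every word, so the semantic component alone cannot separate them, but the word-order component lowers their combined score. In the other direction, synonym expansion lets "I buy a car" and "I purchase an automobile" score well above what exact word overlap would give.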
