Shared information and program plagiarism detection

2004, IEEE Transactions on Information Theory

https://doi.org/10.1109/TIT.2004.830793

Abstract

A fundamental question in information theory and in computer science is how to measure similarity, or the amount of shared information, between two sequences. We have proposed a metric based on Kolmogorov complexity to answer this question, and have proven it to be universal. We apply this metric to measure the amount of shared information between two computer programs, to enable plagiarism detection. We have designed and implemented a practical system, SID (Software Integrity Diagnosis system), that approximates this metric with a heuristic compression algorithm. Experimental results demonstrate that SID has clear advantages over other plagiarism detection systems. The SID system server is online at
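The idea of approximating a Kolmogorov-complexity metric with a compressor can be illustrated with a minimal sketch. The abstract does not state the formula, so the normalized form used here, (K(x) - K(x|y)) / K(xy), is an assumption, as are the approximations K(x) ~ C(x), K(x|y) ~ C(yx) - C(y), and K(xy) ~ C(xy), where C is compressed length. The sketch uses general-purpose zlib as a stand-in for SID's own heuristic compressor, and the names `compressed_size` and `shared_information` and the sample programs are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a crude stand-in for K)."""
    return len(zlib.compress(data, 9))


def shared_information(x: bytes, y: bytes) -> float:
    """Estimate the fraction of information that x shares with y.

    Assumed approximations (not taken from the abstract):
      K(x)   ~ C(x)
      K(x|y) ~ C(y + x) - C(y)
      K(xy)  ~ C(x + y)
    """
    c_x = compressed_size(x)
    c_y = compressed_size(y)
    c_x_given_y = compressed_size(y + x) - c_y
    c_xy = compressed_size(x + y)
    score = (c_x - c_x_given_y) / c_xy
    # Real compressors are imperfect, so clamp the estimate into [0, 1].
    return max(0.0, min(1.0, score))


# Hypothetical sample programs: an original, a copy with renamed
# identifiers, and an unrelated piece of code.
ORIGINAL = b"""
def bubble_sort(items):
    n = len(items)
    for i in range(n):
        for j in range(0, n - i - 1):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items
"""

RENAMED = ORIGINAL.replace(b"items", b"data").replace(b"bubble_sort", b"my_sort")

UNRELATED = b"""
class Account:
    def __init__(self, owner, balance=0):
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.balance += amount
"""

if __name__ == "__main__":
    print("renamed copy:", shared_information(ORIGINAL, RENAMED))
    print("unrelated:   ", shared_information(ORIGINAL, UNRELATED))
```

A renamed copy should score noticeably higher than unrelated code. A byte-level compressor like zlib is still disturbed by identifier renaming, which is one reason a practical detector might compress a token stream rather than raw source text.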
