Evaluation of Statistical Complexity in Viral Genome Sequences
2019, RECPAD 2019
Abstract
In algorithmic information theory, the Kolmogorov complexity of an object is the length of the shortest computer program that produces the object as output. It is a measure of the computational resources needed to specify the object. However, Kolmogorov complexity is non-computable; as such, it can only be approximated. Among the most notable approximations are data compressors: the bitstream produced by a lossless data compression algorithm allows the original data to be reconstructed with the appropriate decoder, so its length can be seen as an upper bound on the algorithmic complexity of the sequence. In this paper, we evaluate the use of the Normalized Compression (NC) as the compression measure for analysing various viral DNA sequences, and we evaluate how it changes when substitutions and permutations are performed on those sequences. Finally, we draw conclusions regarding the nature of these sequences.
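As a rough illustration of the measure discussed in the abstract, the sketch below computes a normalized compression value for a DNA string and shows how it responds to substitutions and permutations. It rests on several assumptions that the abstract does not confirm: NC is taken to be the compressed size in bits divided by len(x) * log2(alphabet size), Python's general-purpose `lzma` module stands in for whatever compressor the authors used, and the helper names (`normalized_compression`, `substitute`, `permute_blocks`) and the synthetic test sequence are hypothetical.

```python
import lzma
import math
import random

ALPHABET = "ACGT"


def normalized_compression(seq: str, alphabet_size: int = 4) -> float:
    """Compressed size in bits over len(seq) * log2(alphabet_size).

    Assumed definition of NC; lzma is a stand-in compressor, so the value
    is only an upper-bound style estimate and can exceed 1.0.
    """
    compressed_bits = 8 * len(lzma.compress(seq.encode("ascii")))
    return compressed_bits / (len(seq) * math.log2(alphabet_size))


def substitute(seq: str, rate: float, rng: random.Random) -> str:
    """Replace each symbol, with probability `rate`, by a different symbol."""
    out = []
    for s in seq:
        if rng.random() < rate:
            out.append(rng.choice([a for a in ALPHABET if a != s]))
        else:
            out.append(s)
    return "".join(out)


def permute_blocks(seq: str, block_size: int, rng: random.Random) -> str:
    """Cut the sequence into fixed-size blocks and shuffle their order."""
    blocks = [seq[i:i + block_size] for i in range(0, len(seq), block_size)]
    rng.shuffle(blocks)
    return "".join(blocks)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic, highly repetitive sequence (not real viral data).
repetitive = "ACGTTGCA" * 2000
mutated = substitute(repetitive, 0.01, rng)    # 1% substitutions
shuffled = permute_blocks(repetitive, 1, rng)  # symbol-level permutation

for name, seq in [("repetitive", repetitive),
                  ("mutated", mutated),
                  ("shuffled", shuffled)]:
    print(f"{name:>10}: NC = {normalized_compression(seq):.3f}")
```

Under these assumptions, the repetitive sequence should yield the lowest value, a small substitution rate should raise it, and a symbol-level permutation should push it close to (or slightly above) 1, since shuffling destroys the repeats while preserving base composition. A specialised DNA compressor would be expected to give tighter estimates than `lzma`.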