A New Burrows Wheeler Transform Markov Distance
2019, arXiv (Cornell University)
https://doi.org/10.48550/ARXIV.1912.13046
Abstract
Prior work inspired by compression algorithms has described how the Burrows Wheeler Transform can be used to create a distance measure for bioinformatics problems. We describe issues with this approach that were not widely known, and introduce our new Burrows Wheeler Markov Distance (BWMD) as an alternative. The BWMD avoids the shortcomings of earlier efforts and allows us to tackle problems in variable-length DNA sequence clustering. BWMD is also more adaptable to other domains, which we demonstrate on malware classification tasks. Unlike other compression-based distance metrics known to us, BWMD works by embedding sequences into a fixed-length feature vector. This allows us to provide significantly improved clustering performance on larger malware corpora, a weakness of prior methods.
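To make the idea of a fixed-length embedding concrete, the following minimal sketch computes a naive Burrows Wheeler Transform and then maps any sequence to a vector of the same size. The abstract does not define BWMD itself, so the `embed` and `distance` functions, the choice of adjacent-symbol pair frequencies, the `$` sentinel, and the Euclidean comparison are all illustrative assumptions and should not be read as the paper's actual construction.

```python
from collections import Counter
from itertools import product
import math


def bwt(s: str, sentinel: str = "$") -> str:
    """Naive Burrows Wheeler Transform: sort all rotations, keep the last column."""
    assert sentinel not in s, "sentinel must not occur in the input"
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


def embed(s: str, alphabet: str = "ACGT") -> list[float]:
    """Hypothetical embedding (an assumption, not the paper's definition):
    relative frequencies of adjacent symbol pairs in the BWT output.
    The vector length depends only on the alphabet, never on len(s).
    Pairs containing symbols outside `alphabet` are ignored."""
    symbols = alphabet + "$"
    out = bwt(s)
    counts = Counter(zip(out, out[1:]))
    total = max(len(out) - 1, 1)  # avoid division by zero for empty input
    return [counts[pair] / total for pair in product(symbols, repeat=2)]


def distance(a: str, b: str, alphabet: str = "ACGT") -> float:
    """Euclidean distance between the two illustrative embeddings."""
    return math.dist(embed(a, alphabet), embed(b, alphabet))
```

Because both inputs are reduced to vectors of identical length, sequences of different lengths can be compared directly, for example `distance("ACGTACGTAC", "ACGT")`, and the vectors can be handed to standard clustering or indexing tools.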