Online Defense of Trojaned Models using Misattributions

2021, arXiv

Abstract

This paper proposes a new approach to detecting neural Trojans in deep neural networks during inference. The approach monitors the inference of a machine learning model, computes the attribution of the model's decision to different features of the input, and then statistically analyzes these attributions to detect whether an input sample contains a Trojan trigger. Anomalous attributions, referred to as misattributions, are then followed by reverse-engineering of the trigger to confirm whether the input sample is truly poisoned. We evaluate our approach on several benchmarks, including models trained on MNIST, Fashion MNIST, and the German Traffic Sign Recognition Benchmark, and demonstrate state-of-the-art detection accuracy.
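To make the pipeline in the abstract concrete, below is a minimal sketch of the attribution-and-screening loop in PyTorch. This is not the paper's implementation: the choice of Integrated Gradients as the attribution method, the top-k "concentration" summary statistic, the z-score outlier test, and all function names (integrated_gradients, concentration, fit_clean_stats, is_misattributed) and parameters (k, z_thresh) are illustrative assumptions. The paper additionally reverse-engineers the trigger for flagged inputs; that confirmation step is omitted here.

```python
import torch

def integrated_gradients(model, x, target, baseline=None, steps=32):
    # Approximate Integrated Gradients of the target-class logit with
    # respect to a single input x of shape (1, C, H, W). Chosen here as
    # a stand-in attribution method, not necessarily the paper's.
    if baseline is None:
        baseline = torch.zeros_like(x)
    grads = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps).tolist():
        xi = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        logit = model(xi)[0, target]
        grads += torch.autograd.grad(logit, xi)[0]
    return (x - baseline) * grads / steps

def concentration(attr, k=25):
    # Fraction of total attribution mass carried by the k largest features;
    # a small localized trigger tends to concentrate attribution sharply.
    flat = attr.abs().flatten()
    top = flat.topk(min(k, flat.numel())).values.sum()
    return (top / (flat.sum() + 1e-12)).item()

def fit_clean_stats(model, clean_loader, device="cpu"):
    # Estimate mean/std of the concentration statistic on known-clean inputs.
    scores = []
    for x, _ in clean_loader:
        x = x.to(device)
        preds = model(x).argmax(dim=1)
        for i in range(x.size(0)):
            attr = integrated_gradients(model, x[i:i + 1], preds[i].item())
            scores.append(concentration(attr))
    s = torch.tensor(scores)
    return s.mean().item(), s.std().item()

def is_misattributed(model, x, mu, sigma, z_thresh=3.0):
    # Flag an input whose attribution concentration is a statistical
    # outlier relative to the clean distribution: a candidate
    # "misattribution" in the paper's terminology.
    pred = model(x).argmax(dim=1).item()
    score = concentration(integrated_gradients(model, x, pred))
    return (score - mu) / (sigma + 1e-12) > z_thresh
```

In the paper, an input flagged at this stage is further vetted by reverse-engineering the putative trigger before being declared poisoned; the sketch above stops at the statistical screening step.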
