MalPacDetector: An LLM-Based Malicious NPM Package Detector

IEEE T-IFS

https://doi.org/10.1109/TIFS.2025.3580336

Abstract

The Node Package Manager (NPM) registry contains millions of JavaScript packages shared among developers worldwide. However, attackers have also abused NPM to spread malicious packages, which makes detecting such packages important. Existing malicious NPM package detectors suffer from, among other things, high false positive rates and/or high false negative rates. In this paper, we propose a novel Malicious NPM Package Detector (MalPacDetector), which leverages a Large Language Model (LLM) to generate features automatically and dynamically, rather than asking experts to define them manually. To evaluate the effectiveness of MalPacDetector and existing detectors, we construct a new NPM package dataset that overcomes the weaknesses of existing datasets (e.g., a small number of examples and a high repetition rate of malicious fragments). The experimental results show that MalPacDetector outperforms existing detectors, achieving a false positive rate of 1.3% and a false negative rate of 7.5%. In particular, MalPacDetector detects 39 previously unknown malicious packages, which have been confirmed by the NPM security team.
