Advancing Backdoor Attack Detection in Transformer Models Using Feature Squeezing and Statistical Anomaly Filtering Techniques
Abstract
The rapid proliferation of Transformer-based architectures across critical socio-technical infrastructures has introduced significant security vulnerabilities, most notably sophisticated backdoor attacks. In these attacks, malicious triggers are clandestinely inserted during training or fine-tuning; the triggers remain dormant during normal operation but activate specific, harmful behaviors when predefined input patterns appear. This research advances backdoor detection by integrating feature squeezing with statistical anomaly filtering. By reducing the complexity of the input space and systematically identifying deviations in the distributions of latent representations, the proposed framework enhances the robustness of large-scale language and vision Transformers. The study provides a comprehensive system-level analysis of how these defensive layers interact with existing model deployments, emphasizing the trade-offs between computational overhead and security efficacy. The discussion further extends to the governance of AI supply chains, the policy implications of vulnerable foundation models, and the sustainability of long-term defensive strategies in evolving adversarial landscapes. The findings suggest that a multi-layered, statistical approach to anomaly detection can significantly mitigate the risks posed by poisoned datasets and compromised third-party model providers, thereby reinforcing the integrity of the broader artificial intelligence ecosystem.
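To make the two defensive layers concrete, the sketch below pairs bit-depth feature squeezing with a Mahalanobis-distance filter over latent activations. It is a minimal illustration under stated assumptions rather than the framework evaluated in this study: the callables model_predict and model_latent, as well as the alarm thresholds, are hypothetical placeholders that would be supplied by, and calibrated against, the defended Transformer.

```python
# Minimal illustrative sketch (not the authors' implementation): bit-depth
# feature squeezing combined with a Mahalanobis-distance filter over latent
# representations. `model_predict` and `model_latent` are hypothetical
# callables returning class probabilities and a hidden-state vector.
import numpy as np

def squeeze_bit_depth(x, bits=4):
    """Reduce inputs (assumed normalized to [0, 1]) to `bits` bits of precision."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def prediction_shift(model_predict, x, bits=4):
    """L1 distance between predictions on the original and squeezed inputs."""
    p_orig = model_predict(x)
    p_squeezed = model_predict(squeeze_bit_depth(x, bits))
    return float(np.abs(p_orig - p_squeezed).sum())

class LatentAnomalyFilter:
    """Flags inputs whose latent vectors deviate from the clean-data distribution."""
    def __init__(self, clean_latents):
        # clean_latents: (n_samples, d) activations collected on trusted data
        self.mean = clean_latents.mean(axis=0)
        cov = np.cov(clean_latents, rowvar=False)
        self.cov_inv = np.linalg.pinv(cov)  # pseudo-inverse for numerical stability

    def mahalanobis(self, z):
        diff = z - self.mean
        return float(np.sqrt(diff @ self.cov_inv @ diff))

def is_suspicious(model_predict, model_latent, x, clean_filter,
                  shift_threshold=0.5, distance_threshold=10.0):
    """Raise an alarm if either defensive layer fires.
    Thresholds are placeholders; in practice they are calibrated on held-out clean data."""
    if prediction_shift(model_predict, x) > shift_threshold:
        return True
    return clean_filter.mahalanobis(model_latent(x)) > distance_threshold
```

In such a design, the squeezing stage flags inputs whose predictions change sharply once fine-grained perturbations are smoothed away, while the statistical filter flags inputs whose hidden representations fall far outside the distribution observed on trusted data; either alarm can justify rejecting or quarantining the input, at the cost of the additional forward passes reflected in the computational-overhead trade-off noted above.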
License
Copyright (c) 2026 International Journal of Artificial Intelligence Research

This article is published under the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



