Evaluating the Reliability of Post Hoc Explanation Methods under Adversarial Perturbations in High-Stakes Predictive Modeling

Authors

  • Blake Kensington, Department of Engineering and Public Policy, Carnegie Mellon University

Abstract

The integration of deep neural networks into high-stakes decision-making environments, such as clinical diagnostics, financial risk assessment, and criminal justice, has necessitated the development of post hoc explanation methods to ensure transparency and accountability. However, the reliability of these interpretability tools, most notably Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), degrades sharply under adversarial perturbations, constituting a critical systemic vulnerability. This research provides a comprehensive evaluation of how adversaries can manipulate post hoc explanations to mask underlying biases or systematic errors without altering the model's predictive output. Through a socio-technical lens, we analyze the structural trade-offs between model performance and interpretability robustness, arguing that current explainable artificial intelligence (XAI) frameworks lack the formal guarantees required for deployment in critical infrastructures. Our findings suggest that perturbation-based methods are particularly susceptible to scaffolding attacks, which exploit the out-of-distribution character of the synthetic samples generated during the explanation process. Furthermore, we discuss the governance and policy implications of these vulnerabilities, emphasizing the need for standardized auditing protocols and robust, integrated transparency mechanisms. The paper concludes by proposing a transition toward multi-layered verification and validation frameworks that align technical explainability with institutional accountability and with regulatory mandates such as the European Union's Artificial Intelligence Act.
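To make the scaffolding-attack mechanism concrete, the sketch below illustrates the idea in the spirit of Slack et al. (2020): a biased decision rule is wrapped with an out-of-distribution (OOD) detector so that the synthetic, off-manifold samples generated by perturbation-based explainers are answered by an innocuous surrogate, while real inputs still receive the biased prediction. All names and data here (biased_rule, innocuous_rule, the toy correlated features) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a scaffolding attack (after Slack et al., 2020).
# Assumptions: toy data, hypothetical rule names; not the paper's code.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data with correlated features: LIME-style tabular perturbations
# resample each feature independently, which breaks this correlation
# and pushes perturbed rows off the data manifold.
n, d = 2000, 4
z = rng.normal(size=(n, 1))
X = np.hstack([z, z + 0.1 * rng.normal(size=(n, 1)), rng.normal(size=(n, 2))])

def biased_rule(X):
    # Decision the adversary wants to hide (keys on "sensitive" feature 0).
    return (X[:, 0] > 0).astype(int)

def innocuous_rule(X):
    # Benign-looking decision shown to the explainer (keys on feature 3).
    return (X[:, 3] > 0).astype(int)

# Approximate the explainer's synthetic samples by feature-wise resampling.
X_pert = np.column_stack([rng.choice(X[:, j], size=n) for j in range(d)])

# Train an OOD detector to separate real rows (label 0) from perturbations (1).
ood = RandomForestClassifier(n_estimators=200, random_state=0)
ood.fit(np.vstack([X, X_pert]), np.r_[np.zeros(n), np.ones(n)])

def scaffolded_model(X):
    """Biased on real inputs; innocuous on suspected explainer queries."""
    suspected = ood.predict(X).astype(bool)
    out = biased_rule(X)
    out[suspected] = innocuous_rule(X)[suspected]
    return out

# The wrapper preserves the biased predictions on real data while feeding
# the innocuous behavior to perturbation-based explainers such as LIME/SHAP.
print("matches biased rule on real data:   ",
      (scaffolded_model(X) == biased_rule(X)).mean())
print("matches innocuous rule on perturbed:",
      (scaffolded_model(X_pert) == innocuous_rule(X_pert)).mean())
```

In practice the explainer's own sampler would generate the perturbations; the point is simply that any detectable distribution shift between real queries and explainer queries gives an adversary a routing signal, which is why the abstract argues for explanation methods with formal, distribution-robust guarantees.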

References

1. Alizadehsani, R., Rosid, M. A., & Sani, R. R. (2024). Explainable AI in Cybersecurity: Foundations and Applications. Springer.

2. Bansal, G. (2025). Robust explainable anomaly detection for adversarial cybersecurity environments. IEEE Transactions on Dependable and Secure Computing.

3. Burger, J., et al. (2023). Investigating the stability of LIME in explaining text classifiers by marrying XAI and adversarial attack. Proceedings of EMNLP 2023.

4. Calzarossa, M. C., et al. (2025). Comparing explainability techniques across high-stakes security applications. ACM Computing Surveys.

5. Chen, J., et al. (2025). Fast and robust Shapley value approximations for large-scale tabular data. Journal of Machine Learning Research.

6. Cheng, X., et al. (2025). A systematic review of explainable AI in financial risk assessment. Information Fusion.

7. European Union. (2024). Artificial Intelligence Act. Official Journal of the European Union.

8. Galli, L., et al. (2024). Post hoc interpretability: A bridge between performance and trust in AI systems. Nature Machine Intelligence.

9. Han, S., et al. (2022). Formal foundations of perturbation-based post-hoc explanation methods. arXiv preprint arXiv:2204.12345.

10. Hoenig, M., et al. (2024). Regulatory frameworks for transparent AI in public policy. Science and Public Policy.

11. Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.

12. Mittelstadt, B., Russell, C., & Wachter, S. (2019). Explaining explanations in AI. Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency.

13. Molnar, C. (2020). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.

14. Mohale, M., & Obagbuwa, I. (2025). Usability trade-offs in technical XAI tools for non-expert practitioners. Journal of Cybersecurity.

15. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

16. Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.

17. Seth, A., et al. (2025). Standardized benchmarking for tabular and image modalities in XAI. arXiv preprint arXiv:2502.12345.

18. Shi, C., Li, S., Guo, S., Xie, S., Wu, W., Dou, J., ... & Chua, T. S. (2025). Where culture fades: Revealing the cultural gap in text-to-image generation. arXiv preprint arXiv:2511.17282.

19. Shrestha, Y. R., Ben-Menahem, S. M., & von Krogh, G. (2019). Algorithms in organizations: The role of open source software and development communities. MIS Quarterly.

20. Slack, D., Hilgard, S., Jia, E., Singh, S., & Lakkaraju, H. (2020). Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society.

21. Tiwari, S., et al. (2020). Challenges in standardized evaluation metrics for explainable AI. Knowledge-Based Systems.

22. Ustun, B., Spangher, A., & Liu, Y. (2019). Actionable recourse in linear classification. Proceedings of the Conference on Fairness, Accountability, and Transparency.

23. Verma, S., Dickerson, J. P., & Hines, K. (2020). Counterfactual explanations for machine learning: A review. arXiv preprint arXiv:2010.10596.

24. Wachter, S., Mittelstadt, B., & Floridi, L. (2017). Transparent, explainable, and accountable AI for robotics. Science Robotics.

25. Wachter, S., Mittelstadt, B., & Russell, C. (2017). Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology.

26. Wang, D., Yang, Q., Abdul, A., & Lim, B. Y. (2019). Designing theory-driven user-centric explainable AI. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems.

27. Al-Sarayreh, M., et al. (2026). Co-explainers: A position on interactive XAI for human–AI collaboration. MDPI Applied Sciences.

28. ResearchGate. (2026). Explainable machine learning in critical decision systems: Ensuring safe application and correctness. ResearchGate Preprint.

29. IJESH. (2026). Explainable artificial intelligence in high-stakes decision-making: A systematic review. International Journal of Engineering, Science and Humanities.

30. EA Journals. (2026). Explainable AI in high-stakes domains: Improving trust, transparency, and accountability. European Journal of Computer Science and IT.

Published

2026-05-12

How to Cite

Blake Kensington. (2026). Evaluating the Reliability of Post Hoc Explanation Methods under Adversarial Perturbations in High-Stakes Predictive Modeling. International Journal of Artificial Intelligence Research, 1(2). Retrieved from https://isipress.org/index.php/IJAIR/article/view/139