Improving Interpretability in Large Language Models through Layer-Wise Relevance Propagation for Identifying Latent Reasoning Biases
Abstract
The rapid deployment of large language models (LLMs) across critical socio-technical infrastructures has made it necessary to rethink how we understand their internal decision-making processes. As these models transition from experimental prototypes to foundational components of legal, medical, and financial systems, the inherent "black box" nature of transformer architectures presents significant risks from latent reasoning biases. This research explores the application of Layer-Wise Relevance Propagation (LRP) as a systemic diagnostic tool for enhancing interpretability within high-capacity neural networks. Unlike traditional local explanation methods, which often fail to capture the hierarchical dependencies of deep attention mechanisms, LRP offers a robust framework for redistributing output logit scores back through the network layers to identify which specific neurons and structural components contribute most heavily to biased outcomes. This paper examines the structural trade-offs between model performance and transparency and proposes a governance framework that integrates LRP-based auditing into the deployment pipeline. We analyze how latent biases are encoded within the model’s embedding spaces and reinforced through multi-head attention layers, arguing that interpretability is not merely a technical luxury but a functional requirement for systemic robustness. By mapping the relevance flow across transformer blocks, this study identifies critical architectural vulnerabilities where bias becomes entrenched. The findings suggest that systemic interventions at the layer level, rather than simple output filtering, are essential for ensuring the ethical alignment and long-term sustainability of artificial intelligence systems in complex societal environments.
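To make the redistribution step concrete, the sketch below applies the standard LRP-ε rule from the interpretability literature to a single linear layer: the relevance R_k assigned to each output neuron k is divided among the inputs j in proportion to their contribution a_j·w_jk to that output's pre-activation. This is a minimal illustration of the generic rule, not the specific propagation rules used in this study; the function name and shapes are illustrative, and components such as attention and LayerNorm require dedicated LRP variants in practice.

```python
import numpy as np

def lrp_epsilon_linear(a, W, R_out, eps=1e-6):
    """Redistribute output relevance R_out back to the inputs of a linear layer
    (z = a @ W) with the standard LRP-epsilon rule.

    a      : (d_in,)        input activations of the layer
    W      : (d_in, d_out)  weight matrix
    R_out  : (d_out,)       relevance assigned to the layer's outputs
    returns: (d_in,)        relevance redistributed to the inputs
    """
    z = a @ W
    z = z + eps * np.where(z >= 0, 1.0, -1.0)   # epsilon stabilizer; keeps the sign, avoids division by zero
    s = R_out / z                               # relevance per unit of pre-activation for each output
    return a * (W @ s)                          # each input receives its contribution-weighted share

# Toy usage (illustrative values): relevance of one output logit spread over 4 input features
rng = np.random.default_rng(0)
a, W = rng.normal(size=4), rng.normal(size=(4, 3))
R_in = lrp_epsilon_linear(a, W, R_out=np.array([1.0, 0.0, 0.0]))
print(R_in, R_in.sum())   # the total redistributed relevance stays close to 1.0
```

Summed over the inputs, the redistributed relevance approximately equals the relevance placed on the outputs; this conservation property is what allows an output logit score to be traced back through successive layers to individual neurons, attention components, and embedding dimensions, as described above.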