Energy-Efficient Edge Intelligence through Adaptive Fast–Slow Inference Scheduling in LLM-Driven Systems

Hugo Jorgensen; Krishna J. Sood; Milos Hayes

Authors

Hugo Jorgensen, Department of Computer Science, University of Alabama at Birmingham, Birmingham, AL, USA.
Krishna J. Sood, Department of Computer Science, University of North Texas, Denton, TX, USA.
Milos Hayes, Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA.

Keywords:

edge intelligence, large language models, fast-slow inference, energy efficiency, adaptive scheduling, sustainable AI, dual-process theory, system architecture

Abstract

The deployment of large language models on edge devices presents a fundamental tension between computational intensity and energy constraints. While LLMs offer unprecedented capabilities in natural language understanding, reasoning, and generation, their execution on resource-limited edge hardware incurs prohibitive energy costs that undermine the sustainability of ubiquitous intelligence. This paper proposes an adaptive fast-slow inference scheduling framework that dynamically allocates computational resources by distinguishing between low-complexity queries requiring rapid, approximate responses and high-stakes tasks demanding deep, deliberative reasoning. Drawing inspiration from dual-process theories of cognition, the framework leverages a lightweight trigger model to classify incoming requests and routes them to either a fast inference path using compressed, quantized models or a slow inference path employing full-precision LLMs with chain-of-thought processing. We examine the architectural trade-offs inherent in such a system, including latency, accuracy, energy consumption, and memory footprint. The discussion extends to system-level considerations such as robustness to adversarial perturbations, fairness across diverse user populations, governance of autonomous decision-making, and policy implications for sustainable AI infrastructure. Through analytical reasoning and cross-domain comparisons with prior work in energy-aware computing, we demonstrate that adaptive scheduling can reduce overall energy consumption by orders of magnitude while maintaining acceptable accuracy for the majority of queries. The framework also introduces governance mechanisms for handling ambiguous cases, ensuring that critical decisions are not sacrificed for efficiency. This work contributes a systems-oriented perspective on reconciling the growing demand for intelligent edge services with the imperative of environmental sustainability.
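The routing idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `InferencePath`, `toy_trigger`, `route`, and `expected_energy`, the cue-word heuristic, the `threshold` and `margin` parameters, and all energy figures are hypothetical. The sketch assumes that ambiguous trigger scores are escalated to the slow path, which is one possible reading of the governance mechanism for ambiguous cases.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class InferencePath:
    """One inference path; energy_joules is an illustrative per-query average."""
    name: str
    energy_joules: float
    run: Callable[[str], str]


def toy_trigger(query: str) -> float:
    """Hypothetical complexity score in [0, 1].

    Stands in for the lightweight trigger model: longer queries and
    reasoning cue words push the score up. A real trigger would be a
    small trained classifier.
    """
    cues = ("why", "prove", "diagnose", "plan", "explain")
    words = query.lower().split()
    cue_hits = sum(w.strip("?.,!") in cues for w in words)
    length_term = min(len(words) / 40.0, 1.0)
    return min(1.0, 0.5 * length_term + 0.5 * min(cue_hits, 2) / 2)


def route(
    query: str,
    trigger: Callable[[str], float],
    fast: InferencePath,
    slow: InferencePath,
    threshold: float = 0.5,
    margin: float = 0.1,
) -> InferencePath:
    """Pick a path for one query.

    Assumption: scores inside the ambiguity band just below the
    threshold are escalated to the slow path, so borderline requests
    are not traded away for efficiency.
    """
    score = trigger(query)
    if score >= threshold - margin:
        return slow
    return fast


def expected_energy(p_fast: float, e_fast: float, e_slow: float, e_trigger: float) -> float:
    """Expected energy per query: the trigger always runs, then one path runs."""
    if not 0.0 <= p_fast <= 1.0:
        raise ValueError("p_fast must be a probability in [0, 1]")
    return e_trigger + p_fast * e_fast + (1.0 - p_fast) * e_slow


if __name__ == "__main__":
    # Placeholder paths; the energy values are made up for illustration.
    fast = InferencePath("fast", 1.0, lambda q: "[quantized model] " + q)
    slow = InferencePath("slow", 50.0, lambda q: "[full-precision CoT] " + q)

    for q in ("What time is it?", "Explain why the pump pressure drops and plan a repair"):
        path = route(q, toy_trigger, fast, slow)
        print(path.name, "->", path.run(q))

    print(expected_energy(p_fast=0.8, e_fast=1.0, e_slow=50.0, e_trigger=0.1))
```

With these made-up figures, the slow path accounts for most of the expected energy even when it handles only a fifth of the traffic, so the fraction of queries the trigger sends to the fast path matters as much as the per-path cost. The trigger's own energy is paid on every query, which is why it has to stay lightweight.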

