Cross-View Semantic World Modeling for Embodied Robot Navigation Using 360-Degree Generative Scene Priors

Milos C. Lindgren; Lars Greene

Authors

Milos C. Lindgren, Department of Computer Science, George Mason University, Fairfax, VA, USA.
Lars Greene, School of Computing, Clemson University, Clemson, SC, USA.

Keywords:

embodied navigation, world modeling, 360-degree scene generation, generative priors, semantic mapping, cross-view learning, robotic infrastructure, policy governance

Abstract

Embodied robot navigation in unstructured, partially observable environments remains a fundamental challenge in autonomous systems. Traditional approaches rely on explicit geometric mapping and localization, which often fail under perceptual aliasing, dynamic occlusions, or incomplete sensor coverage. This paper introduces a cross-view semantic world modeling framework that leverages 360-degree generative scene priors to synthesize consistent, semantically annotated representations of the environment from sparse egocentric observations. By integrating large-scale generative models that produce panoramic scene completions from limited viewpoints, the proposed system enables a robot to reason about occluded regions, plan navigation paths more robustly, and align heterogeneous sensory modalities across spatial scales. The architecture comprises three core components: a cross-view encoder that extracts latent representations from egocentric video streams, a 360-degree generative prior module that produces coherent multimodal scene layouts, and a semantic grounding layer that maps synthetic content onto a structured world model. We discuss structural trade-offs between generative fidelity and computational efficiency, governance considerations for deploying generative priors in safety-critical robotics, and sustainability implications of training large scene priors on distributed infrastructure. Through comparative analysis with conventional mapping pipelines and emerging neural radiance field methods, we highlight the advantages of embedding generative scene priors into a closed-loop planning and control pipeline. We also examine policy implications concerning real-world deployment, the fairness of generative representations across diverse environments, and the robustness of learned priors under distribution shift. This work contributes a system-level perspective on how generative artificial intelligence can reshape embodied navigation by bridging the gap between perception and semantic understanding, and outlines future directions for scalable, accountable world modeling.

