Machine Learning-Based Analysis of Fusion Protein–Driven Transcriptional Dysregulation in Cancer Cells

Dominik Russell; Congcheng Yuan; Enzo Weber

Authors

Dominik Russell, School of Information Technology, University of Cincinnati, Cincinnati, OH, USA.
Congcheng Yuan, Department of Computer Science, University of North Texas, Denton, TX, USA.
Enzo Weber, School of Computing, Clemson University, Clemson, SC, USA.

Keywords:

fusion proteins, transcriptional dysregulation, machine learning, cancer genomics, systems biology, deep learning, chromatin architecture, precision medicine, data governance, interpretability

Abstract

Fusion proteins resulting from chromosomal rearrangements are among the most potent drivers of oncogenic transcriptional dysregulation. They aberrantly activate or repress gene expression programs by rewiring chromatin landscapes, recruiting co-factors, and altering phase-separated condensates, which makes their effects difficult to analyze. Machine learning approaches, particularly deep learning architectures designed for high-dimensional genomic data, have become important tools for dissecting the complexity of fusion protein biology. This paper provides a systems-level analysis of how machine learning models are used to integrate multi-omics datasets (chromatin immunoprecipitation sequencing, RNA sequencing, Hi-C, and proteomics) to predict fusion protein binding targets, classify downstream transcriptional effects, and infer regulatory grammar. We examine the architectural trade-offs between convolutional neural networks, graph neural networks, and transformer-based models in capturing spatial, sequence, and structural dependencies. Beyond algorithmic considerations, we address the infrastructure required for large-scale genomic data processing, including cloud-based pipelines, data lakes, and federated learning frameworks that enable collaborative model training while preserving data sovereignty. We discuss robustness and reproducibility in the context of batch effects, class imbalance, and model calibration. We also evaluate ethical dimensions: equitable access to predictive biomarkers, algorithmic fairness across ancestrally diverse populations, and governance of clinical translation. We conclude by outlining future directions that emphasize sustainability of computational resources, integration of mechanistic models with statistical learning, and the policy frameworks needed to responsibly deploy fusion protein–targeted therapies in precision oncology.

