Journal of Computer Sciences and Applications
ISSN (Print): 2328-7268 ISSN (Online): 2328-725X Website: https://www.sciepub.com/journal/jcsa Editor-in-chief: Minhua Ma, Patricia Goncalves
Journal of Computer Sciences and Applications. 2026, 14(1), 21-30
DOI: 10.12691/jcsa-14-1-4

Improving Medical Multimodal Retrieval with Graph-RAG and Fusion Methods with MMedRAG++

Dr. A. Shilpa Gupta1, M. Uday Reddy1, K. Jayanth Kumar Reddy1, P. Medha Goud1, T. Harshitha Reddy1 and Aditi Jopat1

1Department of Computer Science Engineering, Keshav Memorial College of Engineering, Ibrahimpatnam, Telangana, India

Pub. Date: May 20, 2026

Cite this paper:
Dr. A. Shilpa Gupta, M. Uday Reddy, K. Jayanth Kumar Reddy, P. Medha Goud, T. Harshitha Reddy and Aditi Jopat. Improving Medical Multimodal Retrieval with Graph-RAG and Fusion Methods with MMedRAG++. Journal of Computer Sciences and Applications. 2026; 14(1):21-30. doi: 10.12691/jcsa-14-1-4

Abstract

We introduce MMedRAG++, a medical multimodal retrieval system that extends conventional Retrieval-Augmented Generation (RAG) with fusion-based representation learning and graph-based reranking (Graph-RAG). Baseline systems use neither fusion strategies nor graph-based reranking; MMedRAG++ adds both to improve cross-modal embeddings and retrieval coherence. Experiments are conducted primarily on PMC-OA, a challenging dataset in which only ~10% of captions are unique and many image-text pairs are unrelated; IU-Xray is used for modality-specific subtasks. Graph-RAG improves retrieval centrality and diversity, while fusion strategies, including Cross-Attention and DeepSet Fusion, enhance embedding quality. Quantitative evaluation on PMC-OA and IU-Xray confirms improved retrieval coherence and cross-modal alignment over baseline configurations, with a reported Top-1 accuracy of 18.7% and a Top-10 accuracy of 57.4%.
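To make the idea of graph-based reranking concrete, the following is a minimal sketch and not the MMedRAG++ implementation. It assumes that retrieved candidates are plain embedding vectors, that graph edges are positive cosine similarities between candidates, that centrality is computed with PageRank by power iteration, and that the final score mixes query similarity and centrality through a hypothetical weight `alpha`. The function names and the toy vectors are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pagerank(weights, damping=0.85, iters=50):
    """Power-iteration PageRank on a dense, non-negative weight matrix."""
    n = len(weights)
    out_sums = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                if out_sums[i] == 0:
                    # Dangling node: spread its rank uniformly over all nodes.
                    new[j] += damping * rank[i] / n
                else:
                    new[j] += damping * rank[i] * weights[i][j] / out_sums[i]
        rank = new
    return rank


def graph_rerank(query, candidates, alpha=0.5):
    """Return candidate indices, best first.

    Score = alpha * cos(query, candidate) + (1 - alpha) * normalized centrality.
    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    n = len(candidates)
    if n == 0:
        return []
    # Similarity graph over candidates; self-loops and negative edges are dropped.
    weights = [
        [max(cosine(candidates[i], candidates[j]), 0.0) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    central = pagerank(weights)
    top = max(central)
    scores = [
        alpha * cosine(query, c) + (1.0 - alpha) * central[i] / top
        for i, c in enumerate(candidates)
    ]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


# Toy example: two mutually similar candidates and one outlier.
candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]
order = graph_rerank(query, candidates)
```

In this toy setup the outlier (index 2) is ranked last, because it is both dissimilar to the query and weakly connected in the candidate graph. In a real system the embeddings would come from the fused image-text encoder rather than hand-written vectors.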

Keywords:
Medical AI, RAG, Graph-RAG, Multimodal Fusion, Contrastive Learning, Medical Image-text Retrieval

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
