Improving Medical Multimodal Retrieval with Graph-RAG and Fusion Methods with MMedRAG++

Dr. A. Shilpa Gupta; M. Uday Reddy; K. Jayanth Kumar Reddy; P. Medha Goud; T. Harshitha Reddy; Aditi Jopat

Journal of Computer Sciences and Applications. 2026, 14(1), 21-30
DOI: 10.12691/jcsa-14-1-4

Open Access Article

Dr. A. Shilpa Gupta¹, M. Uday Reddy¹, K. Jayanth Kumar Reddy¹, P. Medha Goud¹, T. Harshitha Reddy¹ and Aditi Jopat¹

¹Department of Computer Science Engineering, Keshav Memorial College of Engineering, Ibrahimpatnam, Telangana, India

Pub. Date: May 20, 2026

Cite this paper:
Dr. A. Shilpa Gupta, M. Uday Reddy, K. Jayanth Kumar Reddy, P. Medha Goud, T. Harshitha Reddy and Aditi Jopat. Improving Medical Multimodal Retrieval with Graph-RAG and Fusion Methods with MMedRAG++. Journal of Computer Sciences and Applications. 2026; 14(1):21-30. doi: 10.12691/jcsa-14-1-4

Abstract

We introduce MMedRAG++, a medical multimodal retrieval system that extends conventional Retrieval-Augmented Generation (RAG) with fusion-based representation learning and graph-based reranking (Graph-RAG). Baseline systems use neither fusion strategies nor graph-based reranking; MMedRAG++ adds both to improve cross-modal embeddings and retrieval coherence. Experiments are conducted primarily on PMC-OA, a challenging dataset in which only about 10% of captions are unique and many image-text pairs are unrelated; IU-Xray is used for modality-specific subtasks. Graph-RAG improves retrieval centrality and diversity, while fusion strategies, including Cross-Attention and DeepSet Fusion, improve embedding quality. Quantitative evaluation on PMC-OA and IU-Xray confirms better retrieval coherence and cross-modal alignment than the baseline configurations. The reported Top-1 accuracy is 18.7% and the Top-10 accuracy is 57.4%.
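The abstract does not spell out how the graph-based reranking or the Top-k metric is computed, so the following is only a minimal sketch under stated assumptions: candidates are linked by non-negative cosine similarity, centrality is a PageRank-style score, and the final ordering mixes that centrality with the original retrieval score. The function names (`pagerank_rerank`, `top_k_accuracy`) and the `damping`, `iters`, and `mix` parameters are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def pagerank_rerank(embeddings, base_scores, damping=0.85, iters=50, mix=0.5):
    """Return candidate indices, best first.

    Assumption: edges are non-negative cosine similarities between the
    retrieved candidates, and centrality is computed by power iteration.
    """
    n = len(embeddings)
    if n == 0:
        return []
    # Weighted adjacency matrix without self-loops.
    w = [[max(cosine(embeddings[i], embeddings[j]), 0.0) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]
    rank = [1.0 / n] * n
    for _ in range(iters):
        # Nodes with no outgoing weight spread their rank uniformly.
        dangling = sum(rank[i] for i in range(n) if out[i] == 0) / n
        rank = [
            (1 - damping) / n + damping * (
                dangling + sum(rank[i] * w[i][j] / out[i]
                               for i in range(n) if out[i] > 0))
            for j in range(n)
        ]
    # rank * n rescales centrality to mean 1; the mixing weight is arbitrary.
    combined = [mix * base_scores[i] + (1 - mix) * rank[i] * n for i in range(n)]
    return sorted(range(n), key=lambda i: combined[i], reverse=True)


def top_k_accuracy(ranked_lists, gold, k):
    """Fraction of queries whose gold item appears in the first k results."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)


if __name__ == "__main__":
    # Toy data: candidates 0 and 1 are near-duplicates, candidate 2 is an outlier.
    emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    base = [0.5, 0.5, 0.55]
    print(pagerank_rerank(emb, base))
    print(top_k_accuracy([[1, 0, 2], [2, 0, 1]], gold=[0, 1], k=2))
```

In this toy setup the outlier has the highest raw retrieval score but little support from the other candidates, so the centrality term pushes it down the list. Under this reading, Top-1 and Top-10 accuracy correspond to `top_k_accuracy` with `k=1` and `k=10`.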

Keywords:
Medical AI, RAG, Graph-RAG, Multimodal Fusion, Contrastive Learning, Medical Image-Text Retrieval

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
