Distilling Cross-Encoder Signals into Bi-Encoders for Domain Retrieval
DOI:
https://doi.org/10.64917/feaiml/Volume02Issue10-04
Keywords:
Dense Retrieval, Knowledge Distillation, Cross-Encoder, Bi-Encoder, Information Retrieval, Teacher-Student Learning, Domain Adaptation, Retrieval-Augmented Generation
Abstract
Dense retrieval models have become crucial for information retrieval tasks, but their success frequently relies on large, computationally expensive architectures. While existing approaches such as REFINE [1] achieve strong performance through model fusion, which requires dual-model serving and weighted interpolation at inference time, this work proposes a novel fusion-free teacher-student knowledge distillation framework that achieves competitive performance with significantly simpler deployment. The key innovation lies in transferring fine-grained cross-encoder relevance judgments directly into a single deployable bi-encoder through listwise knowledge distillation with temperature-scaled soft labels, eliminating the need for runtime model fusion while preserving retrieval quality. Unlike standard contrastive fine-tuning, which relies solely on hard binary labels, this approach combines an InfoNCE contrastive loss with KL-divergence-based distillation from cross-encoder probability distributions, enabling the student to learn nuanced ranking signals beyond simple positive-negative distinctions. The distillation framework uses cross-encoder/ms-marco-MiniLM-L12-v2 as the teacher model and trains a student encoder, bge-large-en-v1.5, on synthetically generated query-document pairs from the SQuAD and RAG datasets with hybrid hard negative mining (BM25 + dense retrieval) and dynamic negative refresh. The student model is trained with a combined objective function balancing contrastive learning and knowledge distillation, with performance evaluated using standard retrieval metrics including Recall@k, MAP, NDCG, and MRR. Results demonstrate that the distilled student model significantly outperforms the vanilla BGE baseline across both datasets, achieving a 19.66% improvement in Recall@3 on SQuAD and 3.63% on RAG, while maintaining a simpler deployment architecture than fusion-based methods. Notably, the approach outperforms REFINE by 3.3% on RAG despite eliminating runtime fusion overhead. These findings show that teacher-student distillation with listwise soft labels provides an effective and practical approach for enhancing dense retrieval models in data-scarce scenarios without requiring extensive architectural modifications, dual-model serving, or large-scale training data.
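To make the combined objective described above concrete, the following is a minimal PyTorch-style sketch, not the authors' released code. It assumes the positive passage sits at index 0 of each query's candidate list, and the mixing weight alpha and both temperature values are illustrative placeholders rather than settings reported in the paper.

# Minimal sketch (illustrative only) of the combined training objective from
# the abstract: InfoNCE over a positive plus mined hard negatives, blended with
# a temperature-scaled KL-divergence distillation term against cross-encoder
# (teacher) scores over the same candidate list.
import torch
import torch.nn.functional as F

def combined_retrieval_loss(
    query_emb: torch.Tensor,       # [B, d]    student query embeddings
    doc_emb: torch.Tensor,         # [B, K, d] per query: 1 positive + K-1 hard negatives
    teacher_scores: torch.Tensor,  # [B, K]    cross-encoder relevance scores for the same lists
    tau_contrastive: float = 0.05, # assumed contrastive temperature
    tau_distill: float = 2.0,      # assumed distillation temperature
    alpha: float = 0.5,            # assumed mixing weight between the two terms
) -> torch.Tensor:
    # Cosine-similarity scores between each query and its candidate list.
    q = F.normalize(query_emb, dim=-1).unsqueeze(1)   # [B, 1, d]
    d = F.normalize(doc_emb, dim=-1)                  # [B, K, d]
    student_scores = (q * d).sum(-1)                  # [B, K]

    # InfoNCE: the positive is assumed to be index 0 of each candidate list.
    labels = torch.zeros(student_scores.size(0), dtype=torch.long,
                         device=student_scores.device)
    loss_infonce = F.cross_entropy(student_scores / tau_contrastive, labels)

    # Listwise distillation: KL divergence between temperature-scaled teacher
    # and student distributions over each candidate list (soft labels).
    teacher_probs = F.softmax(teacher_scores / tau_distill, dim=-1)
    student_logp = F.log_softmax(student_scores / tau_distill, dim=-1)
    loss_kd = F.kl_div(student_logp, teacher_probs, reduction="batchmean")
    loss_kd = loss_kd * (tau_distill ** 2)  # standard Hinton-style rescaling

    return alpha * loss_infonce + (1.0 - alpha) * loss_kd

In practice, the teacher scores would come from scoring each query-candidate pair with cross-encoder/ms-marco-MiniLM-L12-v2, and the candidate lists from the hybrid BM25 + dense hard-negative mining with dynamic refresh described in the abstract.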
References
Ambuje Gupta, Mrinal Rawat, Andreas Stolcke, and Roberto Pieraccini. REFINE on Scarce Data: Retrieval Enhancement through Fine-Tuning via Model Fusion of Embedding Models. arXiv preprint arXiv:2410.12890, 2024.
Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. Optimizing dense retrieval model training with hard negatives.
Jeongsu Yu. Efficient fine-tuning methodology of text embedding models for information retrieval: contrastive learning penalty (CLP). arXiv preprint arXiv:2412.17364, 2024.
Zhenghao Lin, Yeyun Gong, Xiao Liu, Hang Zhang, Chen Lin, Anlei Dong, Jian Jiao, Jingwen Lu, Daxin Jiang, Rangan Majumder, and Nan Duan. PROD: Progressive Distillation for Dense Retrieval.
Yuxiang Lu, Yiding Liu, Jiaxiang Liu, Yunsheng Shi, Zhengjie Huang, Shikun Feng, Yu Sun, Hao Tian, Hua Wu, Shuaiqiang Wang, Dawei Yin, and Haifeng Wang. ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self On-the-fly Distillation for Dense Passage Retrieval. arXiv preprint arXiv:2205.09153, 2022.
Christos Tsirigotis, Vaibhav Adlakha, Joao Monteiro, Aaron Courville, and Perouz Taslakian. BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation. arXiv preprint arXiv:2508.06781, 2025.
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. arXiv preprint arXiv:2004.04906, 2020.
Wayne Xin Zhao, Jing Liu, Ruiyang Ren, and Ji-Rong Wen. Dense Text Retrieval based on Pretrained Language Models: A Survey. arXiv preprint arXiv:2211.14876, 2022.
Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks. arXiv preprint arXiv:2010.08240, 2021.
Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B. Hall, and Ming-Wei Chang. Promptagator: Few-shot Dense Retrieval From 8 Examples. arXiv preprint arXiv:2209.11755, 2022.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531, 2015.
Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. InPars: Data Augmentation for Information Retrieval using Large Language Models. arXiv preprint arXiv:2202.05144, 2022.
License
Copyright (c) 2025 Suraj Balaso Desai

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their articles published in this journal. All articles are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly cited.