Distilling Cross-Encoder Signals into Bi-Encoders for Domain Retrieval
Abstract
Dense retrieval models have become crucial for information retrieval tasks, but their success often depends on large, computationally expensive architectures. Existing approaches such as REFINE [1] achieve strong performance through model fusion, which requires dual-model serving and weighted interpolation at inference time; this work instead proposes a novel fusion-free teacher-student knowledge distillation framework that achieves competitive performance with significantly simpler deployment. The key innovation is transferring fine-grained cross-encoder relevance judgments directly into a single deployable bi-encoder through listwise knowledge distillation with temperature-scaled soft labels, eliminating the need for runtime model fusion while preserving retrieval quality. Unlike standard contrastive fine-tuning, which relies solely on hard binary labels, this approach combines an InfoNCE contrastive loss with KL-divergence-based distillation from cross-encoder probability distributions, enabling the student to learn nuanced ranking signals beyond simple positive-negative distinctions. The distillation framework uses cross-encoder/ms-marco-MiniLM-L12-v2 as the teacher model and trains a student encoder, bge-large-en-v1.5, on synthetically generated query-document pairs from the SQuAD and RAG datasets, with hybrid hard-negative mining (BM25 + dense retrieval) and dynamic negative refresh. The student model is trained with a combined objective that balances contrastive learning and knowledge distillation, and performance is evaluated with standard retrieval metrics including Recall@k, MAP, NDCG, and MRR. Results demonstrate that the distilled student model significantly outperforms the vanilla BGE baseline on both datasets, achieving a 19.66% improvement in Recall@3 on SQuAD and 3.63% on RAG, while retaining a simpler deployment architecture than fusion-based methods. Notably, the approach outperforms REFINE by 3.3% on RAG despite eliminating runtime fusion overhead. These findings demonstrate that teacher-student distillation with listwise soft labels is an effective and practical approach for enhancing dense retrieval models in data-scarce scenarios without requiring extensive architectural modifications, dual-model serving, or large-scale training data.
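As a hedged illustration of the combined objective described above, the training loss can be sketched as an InfoNCE term plus a temperature-scaled, KL-divergence-based listwise distillation term. The weighting coefficient \(\lambda\), the temperatures \(\tau\) and \(\tau'\), and the exact similarity function are illustrative placeholders, not values taken from this work; the paper's precise formulation may differ.

\[
\mathcal{L} = \mathcal{L}_{\text{InfoNCE}} + \lambda \, \mathcal{L}_{\text{KD}},
\qquad
\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\!\big(\mathrm{sim}(q, d^{+})/\tau'\big)}{\sum_{d \in \{d^{+}\} \cup \mathcal{N}_q} \exp\!\big(\mathrm{sim}(q, d)/\tau'\big)},
\]
\[
\mathcal{L}_{\text{KD}} = \tau^{2} \, \mathrm{KL}\!\left( \operatorname{softmax}\!\left(\tfrac{s^{T}}{\tau}\right) \,\Big\|\, \operatorname{softmax}\!\left(\tfrac{s^{S}}{\tau}\right) \right),
\]

where \(q\) is a query, \(d^{+}\) its positive document, \(\mathcal{N}_q\) its mined hard negatives, and \(s^{T}\), \(s^{S}\) the teacher (cross-encoder) and student (bi-encoder) relevance scores over the query's candidate list. This reflects the abstract's description of combining contrastive learning with listwise soft-label distillation, under the standard temperature-scaled softmax assumption.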