RELIC: Enhancing Reward Model Generalization for Low-Resource Indic Languages with Few-Shot Examples

¹University of Maryland   ²Indian Institute of Technology Bombay   ³Indian Institute of Technology Patna   ⁴Adobe Research   ⁵Indian Institute of Technology Kanpur
denotes equal contribution

Abstract

Reward models are essential for aligning large language models (LLMs) with human preferences. However, most open-source multilingual reward models are primarily trained on preference datasets in high-resource languages, resulting in unreliable reward signals for low-resource Indic languages. Collecting large-scale, high-quality preference data for these languages is prohibitively expensive, making preference-based training approaches impractical. To address this challenge, we propose RELIC, a novel in-context learning framework for reward modeling in low-resource Indic languages. RELIC trains a retriever with a pairwise ranking objective to select in-context examples from auxiliary high-resource languages that most effectively highlight the distinction between preferred and less-preferred responses. Extensive experiments on three preference datasets—PKU-SafeRLHF, WebGPT, and HH-RLHF—using state-of-the-art open-source reward models demonstrate that RELIC significantly improves reward model accuracy for low-resource Indic languages, consistently outperforming existing example selection methods. For example, on Bodo—a low-resource Indic language—with a LLaMA-3.2-3B reward model, RELIC achieves accuracy improvements of 12.81% and 10.13% over zero-shot prompting and the state-of-the-art example selection method, respectively.

Algorithm

RELIC learns language-specific retrievers with a pairwise ranking loss and then uses them to select few-shot in-context examples from auxiliary high-resource languages for low-resource target-language queries.
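At inference time, the trained retriever embeds the target-language query and the candidate high-resource examples with its two encoders (φₚ and ψₚ in Algorithm 1 below), and the highest-scoring candidates are placed in the reward model's prompt as in-context examples. The sketch below illustrates this selection step only; it is not the released implementation, and both the choice of k = 4 and the use of plain cosine similarity over precomputed embeddings are assumptions.

# Hypothetical illustration of the selection step: given embeddings produced by a
# trained RELIC retriever (query encoder for the target-language query, example
# encoder for candidate high-resource examples), pick the k most similar
# candidates to use as in-context examples.
import numpy as np

def select_icl_examples(query_emb: np.ndarray, candidate_embs: np.ndarray, k: int = 4):
    """Return indices of the top-k candidates by cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity of each candidate to the query
    return np.argsort(-scores)[:k]      # indices of the k best in-context examples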

Algorithm 1: Training the Retriever

            Input:
              Dₜ = low-resource target-language example bank
              {Dᴴ₁, …, Dᴴ_P} = high-resource language example banks (P in total)
            Output:
              Φ = set of trained retrievers, one per high-resource language

            1. Φ ← ∅
            2. For p = 1, …, P do
              a. Initialize encoders φₚ, ψₚ from a pre-trained M-BERT encoder
              b. (φₚ*, ψₚ*) ← arg min₍φₚ,ψₚ₎ Lₚₐᵢᵣ(Dₜ, Dᴴₚ)
              c. R̂ₚ ← (φₚ*, ψₚ*)
              d. Φ ← Φ ∪ {R̂ₚ}
            3. Return Φ
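A minimal PyTorch sketch of Algorithm 1 is given below. It is not the released implementation: it assumes a bi-encoder over M-BERT (φₚ for target-language queries, ψₚ for candidate examples), a margin-based form of the pairwise ranking loss Lₚₐᵢᵣ with an assumed margin of 1.0, and that each training instance already pairs a target-language query with one candidate example labeled helpful and one labeled unhelpful, with those labels derived as in the paper's ranking objective.

# Minimal sketch of Algorithm 1 (not the authors' code).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # M-BERT, as in Algorithm 1

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

class BiEncoderRetriever(torch.nn.Module):
    """phi encodes target-language queries; psi encodes candidate ICL examples."""
    def __init__(self):
        super().__init__()
        self.phi = AutoModel.from_pretrained(MODEL_NAME)
        self.psi = AutoModel.from_pretrained(MODEL_NAME)
        self.tok = AutoTokenizer.from_pretrained(MODEL_NAME)

    def embed(self, encoder, texts):
        batch = self.tok(texts, padding=True, truncation=True, return_tensors="pt")
        out = encoder(**batch)
        return F.normalize(mean_pool(out.last_hidden_state, batch["attention_mask"]), dim=-1)

    def forward(self, queries, pos_examples, neg_examples):
        q = self.embed(self.phi, queries)          # target-language queries
        pos = self.embed(self.psi, pos_examples)   # candidate examples labeled helpful
        neg = self.embed(self.psi, neg_examples)   # candidate examples labeled unhelpful
        s_pos = (q * pos).sum(-1)
        s_neg = (q * neg).sum(-1)
        # Pairwise ranking (margin) loss; the margin of 1.0 is an assumed hyperparameter.
        return F.relu(1.0 - (s_pos - s_neg)).mean()

# One training step on toy stand-ins for D_t and one high-resource bank D_H_p.
retriever = BiEncoderRetriever()
optimizer = torch.optim.AdamW(retriever.parameters(), lr=2e-5)
loss = retriever(
    queries=["prompt and response pair in the target language"],
    pos_examples=["high-resource example that separates the responses well"],
    neg_examples=["high-resource example that does not"],
)
loss.backward()
optimizer.step()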
          

Algorithm 2: High-Resource Example Bank Selection

            Input:
              Dₜ = low-resource target-language examples
              {Dᴴ₁, …, Dᴴ_P} = high-resource language example banks (P in total)
              ρ = pre-trained M-BERT encoder
              γ = percentile threshold on bank similarity
            Output:
              Dᵃᵘˣ ⊆ {Dᴴ₁, …, Dᴴ_P}

            1. Compute the mean target-language embedding
                eₜ = (1/|Dₜ|) ∑₍xᵢ,yᵢ₎∈Dₜ ρ(xᵢ, yᵢ)
            2. sim_list ← []
            3. For p = 1, …, P do
              a. eₚ = (1/|Dᴴₚ|) ∑₍xⱼ,yⱼ₎∈Dᴴₚ ρ(xⱼ, yⱼ)
              b. simₚ = cosine_sim(eₚ, eₜ)
              c. sim_list ← sim_list ∪ {simₚ}
            4. Dᵃᵘˣ ← {Dᴴₚ | simₚ ≥ percentile(sim_list, γ)}
            5. Return Dᵃᵘˣ
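The bank-selection step maps directly to code. The sketch below follows Algorithm 2 (mean M-BERT embedding per bank, cosine similarity to the target bank, percentile cutoff γ); the prompt/response concatenation inside ρ and the default γ = 50 are assumptions, not values from the paper.

# Minimal sketch of Algorithm 2 (not the authors' code): keep the high-resource
# banks whose mean M-BERT embedding is most similar to the target-language bank.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
enc = AutoModel.from_pretrained(MODEL_NAME)

@torch.no_grad()
def bank_embedding(examples):
    """Mean embedding of a bank; each example is a (prompt, response) pair."""
    texts = [f"{x} {y}" for x, y in examples]   # assumed concatenation for rho(x, y)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = enc(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    return pooled.mean(0).numpy()               # e_t or e_p in Algorithm 2

def select_banks(target_bank, high_resource_banks, gamma=50):
    """Keep banks whose similarity to the target is >= the gamma-th percentile."""
    e_t = bank_embedding(target_bank)
    sims = {}
    for name, bank in high_resource_banks.items():
        e_p = bank_embedding(bank)
        sims[name] = float(np.dot(e_p, e_t) / (np.linalg.norm(e_p) * np.linalg.norm(e_t)))
    cutoff = np.percentile(list(sims.values()), gamma)
    return [name for name, s in sims.items() if s >= cutoff]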
          

Experimental Results

The table below reports accuracy (%) on the PKU-SafeRLHF preference dataset, i.e., the fraction of test pairs in which the reward model assigns a higher score to the safe response. We evaluate five retrieval- or prompt-only baselines (Zero-shot, Random, BM25, Top-K, and EPR) and compare them with our method, RELIC.

In the “Finetuning-Based” column, ✓ denotes methods that update model weights on preference data and ✗ denotes purely in-context or retrieval-only approaches; every method reported here, including RELIC, is ✗. Results are shown for four low-resource Indic languages (Bodo, Santali, Manipuri, Odia), using two LLaMA-3 reward-model variants per language: LM-3.1-8B (8B parameters) and LM-3.2-3B (3B parameters). RELIC achieves the best accuracy in every column, and the Δ row gives its absolute improvement over the strongest baseline.

Method              Finetuning-Based   Bodo                  Santali               Manipuri              Odia
                                       LM-3.1-8B  LM-3.2-3B  LM-3.1-8B  LM-3.2-3B  LM-3.1-8B  LM-3.2-3B  LM-3.1-8B  LM-3.2-3B
Zero-shot           ✗                  54.32      53.18      43.76      51.90      47.88      53.65      62.53      52.27
Random              ✗                  55.41      52.67      45.88      50.66      45.39      54.03      56.45      53.77
BM25                ✗                  48.22      56.09      43.94      56.82      48.33      55.47      66.94      63.38
Top-K               ✗                  57.53      58.91      45.84      55.48      50.91      51.32      68.34      60.72
EPR                 ✗                  58.74      60.57      46.66      59.11      54.18      55.64      71.67      62.01
RELIC (Ours)        ✗                  64.29      62.73      67.92      65.48      62.84      59.91      77.67      69.93
Absolute Gain (∆)                      5.55       2.16       21.26      6.37       8.66       4.27       6.00       6.55
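For reference, the accuracy reported above is the fraction of preference pairs in which the reward model scores the safe response above the unsafe one. The sketch below shows how such a metric could be computed, with a hypothetical reward_fn scorer; it is an illustration, not the authors' evaluation code.

# Hypothetical illustration of the pairwise accuracy metric.
from typing import Callable, List, Tuple

def pairwise_accuracy(
    pairs: List[Tuple[str, str, str]],          # (prompt, safe_response, unsafe_response)
    reward_fn: Callable[[str, str], float],     # assumed scorer: reward_fn(prompt, response) -> scalar
) -> float:
    """Percentage of pairs where the safe response receives the higher reward."""
    correct = sum(
        reward_fn(prompt, safe) > reward_fn(prompt, unsafe)
        for prompt, safe, unsafe in pairs
    )
    return 100.0 * correct / max(len(pairs), 1)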

Figure 2: Histogram and UMAP visualizations for Santali. Top row: histograms of zero-shot reward scores for safe and unsafe responses show substantial overlap, whereas adding in-context learning (ICL) examples selected by RELIC yields more distinct, separable score distributions. Bottom row: UMAP projections of the reward model's final hidden-layer representations of safe and unsafe responses, with and without the ICL examples selected by RELIC.

BibTeX

@misc{ghosal2025relicenhancingrewardmodel,
  title={Relic: Enhancing Reward Model Generalization for Low-Resource Indic Languages with Few-Shot Examples}, 
  author={Soumya Suvra Ghosal and Vaibhav Singh and Akash Ghosh and Soumyabrata Pal and Subhadip Baidya and Sriparna Saha and Dinesh Manocha},
  year={2025},
  eprint={2506.16502},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.16502}, 
}