🧠 In-Context Meta-Reinforcement Learning | 🪞 Self-Reflection | 🔁 Learning How to Search at Test Time
Teng Xiao*, Yige Yuan*, Hamish Ivison, Faeze Brahman, Huaisheng Zhu, Nathan Lambert,
Pradeep Dasigi, Noah A. Smith, Hannaneh Hajishirzi
Existing RL-based agentic search methods optimize within a single episode, treating each attempt in isolation. We introduce MR-Search, a meta-reinforcement learning framework that trains agents to improve across episodes via explicit self-reflection. MR-Search organizes training into meta-episodes composed of multiple inner episodes: after each failed attempt, the agent generates a self-reflection that is prepended to the next episode's context, enabling progressive strategy refinement. We train this policy with a multi-turn RL algorithm that performs fine-grained credit assignment across the full multi-episode trajectory, achieving 9.2%–19.3% improvements over strong baselines on agentic search benchmarks.
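The meta-episode loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `run_episode`, `generate_reflection`, and `is_correct` are hypothetical stand-ins for the policy rollout, reflection generation, and reward check.

```python
# Sketch of a meta-episode: multiple inner episodes, each failure producing a
# self-reflection that is prepended to the next attempt's context.

def run_meta_episode(question, run_episode, generate_reflection, is_correct,
                     max_inner_episodes=3):
    """Roll out up to `max_inner_episodes` attempts, prepending accumulated
    self-reflections to each new attempt's context."""
    reflections = []  # self-reflections accumulated across inner episodes
    trajectory = []   # full multi-episode trajectory for credit assignment
    for _ in range(max_inner_episodes):
        # Prepend all prior reflections to the current episode's context.
        context = "\n".join(reflections + [question])
        answer = run_episode(context)
        trajectory.append((context, answer))
        if is_correct(answer):
            return answer, trajectory
        # Failed attempt: reflect, then retry with the reflection in context.
        reflections.append(generate_reflection(context, answer))
    return None, trajectory
```

During training, the reward signal is assigned over the whole `trajectory` rather than a single attempt, which is what distinguishes the meta-episode objective from single-episode RL.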
# Requires: torch==2.6.0, sglang==0.4.6.post3, sgl-kernel==0.1.1, flash-attn==2.7.4.post1, verl==0.5.x
conda create -n verl-sglang python=3.10
conda activate verl-sglang
USE_MEGATRON=0 bash install_vllm_sglang_mcore.sh
python -m pip install --no-deps -e .
# Dependencies for retrieval
python -m pip install pyserini==1.2.0 uvicorn==0.35.0 fastapi==0.116.1
conda install -c pytorch -c nvidia faiss-gpu=1.8.0

(1) Download and process the training/eval dataset.
bash scripts/data_process/data_process.sh

(2) Download and process the retrieval corpus.
save_path=/path/to/retrieval/data
python scripts/download.py --save_path $save_path
cat $save_path/part_* > $save_path/e5_Flat.index
gzip -d $save_path/wiki-18.jsonl.gz

(1) Launch a retrieval server.
# Set $WORK_DIR in retrieval_launch.sh before running
# IP and PORT can be configured in `search_r1/search/retrieval_server.py` (line 392)
# Default endpoint: http://127.0.0.1:8000
conda activate verl-sglang
bash retrieval_launch.sh

(2) Run RL training.
# SEARCH_IP must match the IP/PORT configured in `search_r1/search/retrieval_server.py` (line 392)
# Default: SEARCH_IP="http://127.0.0.1:8000/retrieve"
conda activate verl-sglang
bash train_grpo_step.sh

All checkpoints and evaluation results are saved to the $WORK_DIR/save directory.
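Before starting training, it can help to sanity-check that the retrieval server is reachable. The snippet below is an illustrative client only: the `/retrieve` endpoint matches the default SEARCH_IP above, but the payload field names (`queries`, `topk`) are assumptions and should be adjusted to match your server's schema.

```python
# Hypothetical client for the retrieval server started by retrieval_launch.sh.
import json
import urllib.request

def build_retrieve_request(queries, topk=3,
                           url="http://127.0.0.1:8000/retrieve"):
    """Build a JSON POST request for the retrieval endpoint (schema assumed)."""
    payload = json.dumps({"queries": queries, "topk": topk}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_retrieve_request(["who wrote the iliad"])
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(json.loads(resp.read().decode("utf-8")))
```

If the call fails, verify that the IP/PORT in `search_r1/search/retrieval_server.py` matches the URL used here.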
We thank the open-source projects this work builds on.
If you find this work useful, please cite our paper:
@article{MetaAgent2026,
title = {Meta-Reinforcement Learning with Self-Reflection for Agentic Search},
author = {Xiao, Teng and Yuan, Yige and Ivison, Hamish and Brahman, Faeze and Zhu, Huaisheng and Lambert, Nathan and Dasigi, Pradeep and Smith, Noah A. and Hajishirzi, Hannaneh},
journal = {arXiv preprint arXiv:2603.11327},
year = {2026}
}
