This directory provides scripts and datasets for generating fine-tuning data with long reasoning chains and reflection signals using LLMs, and for converting that data into formats suitable for LLaMA-Factory SFT and Hugging Face GRPO training.
- `visual_model_data/data_maker.py`: Uses an LLM to generate visual Chain-of-Thought (CoT) data that includes long reasoning and reflection information.
- `visual_model_data/o1_data_visual_cot_all.json`: Raw data produced by `data_maker.py`.
- `visual_model_data/sft_data_maker.py`: Converts the raw data into an Alpaca-style format compatible with LLaMA-Factory supervised fine-tuning (SFT); see the record sketch after this list.
- `visual_model_data/alpaca_format_o1_data_visual_cot.json`: The SFT-ready dataset output by `sft_data_maker.py`.
- `rl/convert_to_hf_vl.py`: Further converts the SFT dataset into a Hugging Face vision-language dataset for GRPO training.
- `visual_model_data/android_lab_o1_visual_hf/`: The resulting Hugging Face-formatted dataset used for GRPO.
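For reference, a single SFT record in the Alpaca-style file might look like the sketch below. The field names follow the multimodal Alpaca convention used by LLaMA-Factory; the example values and the image path are illustrative, and the exact keys emitted by `sft_data_maker.py` may differ.

```python
# Hypothetical Alpaca-style record; keys follow LLaMA-Factory's multimodal
# Alpaca convention, but the actual schema of
# alpaca_format_o1_data_visual_cot.json may differ.
record = {
    "instruction": "Open the Settings app and enable Wi-Fi.",  # illustrative task prompt
    "input": "",                                               # optional extra context
    "output": "...long reasoning with reflection, then the final action...",
    "images": ["ground_data/android-lab-train/example_screenshot.png"],  # hypothetical path
}
```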
Before starting, place the fine-tuning data `android-lab-train` provided by the AndroidLab benchmark in the `./ground_data` directory. The data can be downloaded directly from the original AndroidLab project or from this project's Hugging Face repository; a minimal download sketch follows.
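A minimal sketch for fetching the data via `huggingface_hub`, assuming it is hosted as a Hugging Face dataset repo. `REPO_ID` and the `android-lab-train*` pattern are placeholders; substitute this project's actual repository name and file layout.

```python
# Sketch only: REPO_ID and the allow_patterns layout are placeholders.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="REPO_ID",                      # this project's HF dataset repo (placeholder)
    repo_type="dataset",
    local_dir="./ground_data",              # the scripts below expect the data here
    allow_patterns=["android-lab-train*"],  # fetch only the training data (hypothetical layout)
)
```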
- Generate raw data (long reasoning + reflection) with the LLM:
  - Run `visual_model_data/data_maker.py` to produce `visual_model_data/o1_data_visual_cot_all.json`.
- Convert raw data to LLaMA-Factory SFT format:
  - Run `visual_model_data/sft_data_maker.py` to produce `visual_model_data/alpaca_format_o1_data_visual_cot.json`.
- Convert SFT data to Hugging Face V+L format for GRPO:
  - Run `rl/convert_to_hf_vl.py` to create/update `visual_model_data/android_lab_o1_visual_hf/`; a quick load check is sketched after this list.
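To verify the final dataset, it can be loaded with the standard Hugging Face `datasets` API, assuming `convert_to_hf_vl.py` writes it with `save_to_disk` (the usual on-disk dataset layout).

```python
# Sanity-check the GRPO dataset, assuming it was written with save_to_disk.
from datasets import load_from_disk

ds = load_from_disk("visual_model_data/android_lab_o1_visual_hf")
print(ds)  # shows features and row counts (a Dataset or DatasetDict)
```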
Note: The current `o1_data_visual_cot_all.json` does not include data for the "pimusic" app; that data is provided separately in `o1_data_visual_cot_pimusic.json`. We plan to merge the two files as soon as possible.
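In the meantime, the two files can be combined locally with a short script like the sketch below, assuming both contain top-level JSON lists of records; the location of the pimusic file and the merged output path are assumptions.

```python
# Merge the two raw-data files; the pimusic and output paths are assumptions.
import json

paths = [
    "visual_model_data/o1_data_visual_cot_all.json",
    "visual_model_data/o1_data_visual_cot_pimusic.json",  # assumed location
]
merged = []
for p in paths:
    with open(p, encoding="utf-8") as f:
        merged.extend(json.load(f))

with open("visual_model_data/o1_data_visual_cot_merged.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False, indent=2)
```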