This repository contains the datasets used in the DeepPrune experiments, organized into four categories: pre-experiment, fine-tuning, offline evaluation, and online evaluation.
The pre_exp_data/ directory contains datasets used in preliminary experiments. These datasets facilitate a comparative analysis between semantic similarity (computed using Sentence-BERT) and zero-shot judgments from large language models (LLMs).
The finetune_data/ directory contains datasets formatted for use with Llama-Factory:
- train.jsonl – Training data
- test.jsonl – Evaluation data
Each line in these .jsonl files is a JSON object with the following fields:
{
"instruction": "A system prompt or task description",
"input": "Two truncated responses, to be checked for whether their answers are identical",
"output": "The expected model response: identical/not identical"
}
These datasets are used to fine-tune base models before applying the DeepPrune pruning strategy.
⚠️ The .jsonl files here have already been truncated and can be used to fine-tune models directly. If you want to try other strategies, use train.json and test.json to generate your own datasets.
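As a quick sanity check before fine-tuning, the .jsonl files can be read line by line and validated against the three fields listed above. The following is a minimal sketch, not part of the DeepPrune code: the helper names `read_jsonl` and `label_counts` are hypothetical, and it assumes only that each non-empty line is one JSON object with those fields.

```python
import json

# Fields documented for the fine-tuning files.
REQUIRED_FIELDS = ("instruction", "input", "output")


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file.

    Raises ValueError if a record lacks one of the documented fields.
    """
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in record]
            if missing:
                raise ValueError(f"{path}:{line_no} is missing fields: {missing}")
            yield record


def label_counts(records):
    """Count how often each value of the "output" field occurs."""
    counts = {}
    for record in records:
        label = record["output"].strip()
        counts[label] = counts.get(label, 0) + 1
    return counts
```

For example, `label_counts(read_jsonl("finetune_data/train.jsonl"))` would report how the training labels are distributed, which helps spot class imbalance before a run.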
In the offline_test_data/ directory, we provide model-generated responses from the following models on a shared set of problems:
- glm-4.5-air
- Qwen3-4B-Thinking-2507
- QwQ-32B
These outputs are used to evaluate the performance of models after fine-tuning.
The online_test_data/ directory contains datasets collected through active querying of large language models. Specifically:
- For each problem, we gathered 512 model-generated answers from:
  - DeepSeek-R1-0528-Qwen3-8B
  - gpt-oss-20b
  - Qwen3-32B
Each JSON file in this folder includes the following fields:
{
"problem": "The original question or task",
"answer": "The model's generated response",
"true_answer": "The ground-truth or reference answer"
}
These datasets are used to empirically validate the effectiveness of DeepPrune in real-world, dynamic settings, measuring how pruning impacts output quality under diverse sampling conditions.
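Because each problem has many sampled answers, it is often convenient to group the records by problem before analysis. The sketch below shows one way to do that. It rests on an assumption this README does not confirm: that each file holds a JSON array of records with the three fields above (a single top-level object is also accepted). The helper names `load_records` and `group_by_problem` are hypothetical.

```python
import json
from collections import defaultdict


def load_records(path):
    """Load records from one JSON file.

    Assumption: the top-level value is a list of records; a single
    object is wrapped in a list so callers can treat both alike.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data if isinstance(data, list) else [data]


def group_by_problem(records):
    """Map each problem to its reference answer and sampled answers."""
    grouped = defaultdict(lambda: {"true_answer": None, "answers": []})
    for record in records:
        entry = grouped[record["problem"]]
        entry["true_answer"] = record["true_answer"]
        entry["answers"].append(record["answer"])
    return dict(grouped)
```

With the data grouped this way, `len(group["answers"])` gives the number of samples available for a problem, which can be compared against the expected 512.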