📁 DeepPrune Datasets

This repository contains the datasets used in the DeepPrune experiments, organized into four categories: pre-experiment, fine-tuning, offline evaluation, and online evaluation.

🔍 Pre-experiment Datasets

The pre_exp_data/ directory contains datasets used in preliminary experiments. These datasets support a comparison between semantic similarity (computed with Sentence-BERT) and zero-shot judgments from large language models (LLMs).
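As a rough illustration of such a comparison, the sketch below scores each answer pair by the cosine similarity of its embeddings, turns that score into a same/different verdict with a threshold, and measures how often the verdict agrees with the LLM's judgment. The threshold value (0.8) and the function names are hypothetical, and the embeddings are assumed to be computed beforehand (for example with the sentence-transformers library's `SentenceTransformer(...).encode(...)`); this is not necessarily how the pre-experiment itself was scored.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # degenerate (all-zero) embedding
    return dot / (norm_u * norm_v)


def agreement_rate(pairs, llm_judgments, threshold=0.8):
    """Fraction of pairs where the similarity-based verdict matches the LLM verdict.

    pairs: list of (embedding_a, embedding_b) tuples
    llm_judgments: list of bools (True = the LLM judged the answers identical)
    threshold: hypothetical cut-off at or above which a pair counts as "identical"
    """
    if len(pairs) != len(llm_judgments) or not pairs:
        raise ValueError("need two non-empty lists of equal length")
    matches = 0
    for (u, v), llm_says_same in zip(pairs, llm_judgments):
        similarity_says_same = cosine_similarity(u, v) >= threshold
        matches += similarity_says_same == llm_says_same
    return matches / len(pairs)
```

Sweeping `threshold` over a range of values shows whether any single similarity cut-off reproduces the LLM judgments well.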

🛠️ Fine-tuning Datasets

Located in the finetune_data/ directory, these datasets are formatted for use with LLaMA-Factory and include:

- train.jsonl – training data
- test.jsonl – evaluation data

Each line in these .jsonl files is a JSON object with the following fields:

{
  "instruction": "The system prompt or task description",
  "input": "Two truncated answers, to be judged on whether their answers are identical",
  "output": "The expected model response: identical/not identical"
}
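Before fine-tuning, it can help to check that every line parses and carries the three documented fields, and to look at the label balance. A minimal sketch using only the standard library; the function names are illustrative, and the exact label strings in the files are not assumed.

```python
import json
from collections import Counter

# Fields documented for each record in finetune_data/*.jsonl
REQUIRED_FIELDS = ("instruction", "input", "output")


def parse_finetune_lines(lines):
    """Parse JSON Lines text and check that every record has the documented fields."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        missing = [key for key in REQUIRED_FIELDS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        records.append(record)
    return records


def load_finetune_records(path):
    """Read a .jsonl file such as finetune_data/train.jsonl."""
    with open(path, encoding="utf-8") as f:
        return parse_finetune_lines(f)


def label_distribution(records):
    """Count how often each distinct "output" value occurs."""
    return Counter(record["output"].strip() for record in records)
```

For example, `label_distribution(load_finetune_records("finetune_data/train.jsonl"))` reports how many training examples carry each output label.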

These datasets are used to fine-tune base models before applying the DeepPrune pruning strategy.

⚠️ The .jsonl files here have already been truncated and can be used to fine-tune models directly. To try other strategies, generate your own datasets from train.json and test.json.
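If you build your own variant, one possible shape for the conversion step is sketched below: truncate two answers, wrap them in an input template, and emit a record in the instruction/input/output layout shown above. Everything beyond that layout is hypothetical: word-level truncation (real pipelines often truncate by tokens), the "Answer A / Answer B" template, the default `max_words`, and the assumption that the labels are literally `identical` / `not identical`. The schema of train.json and test.json is not documented here, so mapping their fields onto `answer_a`, `answer_b`, and `same_answer` is left to you.

```python
import json


def truncate_words(text, max_words):
    """Keep only the first max_words whitespace-separated words of text."""
    return " ".join(text.split()[:max_words])


def build_pair_record(instruction, answer_a, answer_b, same_answer, max_words=200):
    """Build one record in the instruction/input/output layout of the .jsonl files.

    Hypothetical choices: word-level truncation, the "Answer A / Answer B"
    input template, and the default max_words. Replace them with your own strategy.
    """
    prefix_a = truncate_words(answer_a, max_words)
    prefix_b = truncate_words(answer_b, max_words)
    return {
        "instruction": instruction,
        "input": f"Answer A:\n{prefix_a}\n\nAnswer B:\n{prefix_b}",
        "output": "identical" if same_answer else "not identical",
    }


def to_jsonl_line(record):
    """Serialize one record as a single JSON Lines entry (newlines are escaped)."""
    return json.dumps(record, ensure_ascii=False)
```

Writing `to_jsonl_line(record)` followed by a newline for each record produces a file in the same format as train.jsonl.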

📊 Offline Evaluation Datasets

In the offline_test_data/ directory, we provide model-generated responses from the following models on a shared set of problems:

- glm-4.5-air
- Qwen3-4B-Thinking-2507
- QwQ-32B

These outputs are used to evaluate the performance of models after fine-tuning.
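The exact offline evaluation protocol is not spelled out here. Assuming the task is scoring a fine-tuned judge's identical/not identical verdicts against reference labels, standard binary-classification metrics apply; treating "identical" as the positive class is a choice made for this sketch, and the function name is illustrative.

```python
def judge_metrics(predictions, references, positive="identical"):
    """Accuracy, precision and recall of a binary identical/not identical judge.

    predictions, references: equal-length lists of label strings.
    positive: the label treated as the positive class (an assumption).
    """
    if len(predictions) != len(references) or not predictions:
        raise ValueError("need two non-empty lists of equal length")
    positive = positive.strip().lower()
    tp = fp = fn = correct = 0
    for pred, ref in zip(predictions, references):
        pred, ref = pred.strip().lower(), ref.strip().lower()
        correct += pred == ref
        if pred == positive and ref == positive:
            tp += 1  # judge and reference both say "identical"
        elif pred == positive:
            fp += 1  # judge says "identical", reference disagrees
        elif ref == positive:
            fn += 1  # judge missed an "identical" pair
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": correct / len(predictions), "precision": precision, "recall": recall}
```

Reporting precision and recall alongside accuracy matters when one label dominates, since a judge that always answers the majority label can still score high accuracy.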

🌐 Online Evaluation Datasets

The online_test_data/ directory contains datasets collected by actively querying large language models. Specifically:

For each problem, we gathered 512 model-generated answers from:
- DeepSeek-R1-0528-Qwen3-8B
- gpt-oss-20b
- Qwen3-32B

Each JSON file in this folder includes the following fields:

{
  "problem": "The original question or task",
  "answer": "The model's generated response",
  "true_answer": "The ground-truth or reference answer"
}
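These fields are enough to measure, per problem, how often the sampled answers are correct and whether a simple majority vote over them recovers the true answer. The sketch below assumes each file holds a flat list of such records (e.g. loaded with `json.load`), which the README does not confirm. The default `extract` function only strips whitespace; real responses need a problem-specific answer-extraction step, which is left as a placeholder.

```python
from collections import Counter, defaultdict


def summarize_online_results(records, extract=lambda text: str(text).strip()):
    """Per-problem accuracy and majority-vote correctness.

    records: iterable of dicts with "problem", "answer" and "true_answer" keys.
    extract: maps a raw response to its final answer (placeholder default).
    """
    by_problem = defaultdict(list)
    truth = {}
    for r in records:
        by_problem[r["problem"]].append(extract(r["answer"]))
        truth[r["problem"]] = str(r["true_answer"]).strip()
    summary = {}
    for problem, answers in by_problem.items():
        correct = sum(a == truth[problem] for a in answers)
        # Ties go to the answer seen first, because Counter keeps insertion order.
        majority, _ = Counter(answers).most_common(1)[0]
        summary[problem] = {
            "accuracy": correct / len(answers),
            "majority_correct": majority == truth[problem],
        }
    return summary
```

Running the same summary on a pruned subset of the 512 answers and on the full set gives one way to see how much pruning changes the final result.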

These datasets are used to empirically validate the effectiveness of DeepPrune in real-world, dynamic settings, measuring how pruning affects output quality under diverse sampling conditions.