Skip to content
This repository was archived by the owner on Apr 12, 2026. It is now read-only.

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 

README.md

Getting Started with Sequence Classification

Welcome to the Sequence Classification example! This guide will help you get started with training a sequence classification model using the Transformers and Accelerate Python libraries.

Installation

To install the required packages, you can use pip:

pip install -U torch transformers accelerate tqdm python-dotenv

Usage

To use the Sequence Classification example, you can run the fine_tune_sequence_classification.py script:

python fine_tune_sequence_classification.py

This will train a sequence classification model on a dataset and evaluate its performance on the validation and test sets.

Configuration

The behavior of the Sequence Classification example can be configured using environment variables. Here are the available environment variables and their default values:

  • CHECKPOINT: The path or identifier of the pre-trained checkpoint to use. Default is distilbert-base-uncased.
  • NUM_EPOCHS: The number of epochs to train for. Default is 3.
  • LR: The learning rate to use for the optimizer. Default is 3e-5.
  • DATA_PATH: The path to the dataset. This is a required environment variable.
  • TOKENIZER: The path or identifier of the tokenizer to use. Default is distilbert-base-uncased.
  • TRAIN_RATIO: The ratio of examples to use for training. Default is 0.8.
  • EVAL_RATIO: The ratio of examples to use for evaluation. Default is 0.1.
  • VAL_RATIO: The ratio of examples to use for validation. Default is 0.05.
  • TEST_RATIO: The ratio of examples to use for testing. Default is 0.05.
  • SEED: The random seed to use for shuffling the dataset. Default is 42.
  • BATCH_SIZE: The batch size to use for training, evaluation, and validation. Default is 16.

GPT Description

This Python script defines a Trainer class that can be used to fine-tune a pre-trained sequence classification model using the Hugging Face Transformers library. The Trainer class provides methods for preparing the dataset, training the model, and evaluating the model's performance. The split_dataset function is also defined in the script, which can be used to split a dataset into training, evaluation, validation, and test subsets.

The script includes an example usage section that demonstrates how to use the Trainer class and split_dataset function with a custom dataset. The example usage section shows how to load a pre-trained model, prepare the dataset, fine-tune the model, and evaluate the model's performance. The example usage section also shows how to save the fine-tuned model to disk for later use.

Finally, the script includes a unit test class TestFineTuneSequenceClassificationModel that tests the split_dataset, prepare, train, and evaluate methods of the Trainer class. The unit test class provides a set of test cases that can be used to verify the correctness of the Trainer class implementation. The unit test class can be run using a testing framework such as unittest to ensure that the Trainer class is working as expected.

To improve the readability of the code, it may be helpful to add comments to explain the purpose of each method and variable. Additionally, it may be helpful to break up the Trainer class into smaller, more focused classes or functions to improve the modularity of the code. Finally, it may be helpful to add more error handling and input validation to the code to make it more robust and prevent unexpected errors.