Question Answering with DistilBERT
This repository contains code to train a question answering model based on the DistilBERT architecture on SQuAD (the Stanford Question Answering Dataset). The model learns to answer questions from a given context paragraph. Training uses PyTorch, the Hugging Face transformers library, and the datasets library.
Prerequisites
Before running the code, make sure you have the following installed:
- NVIDIA GPU (optional, but recommended for faster training)
- NVIDIA CUDA Toolkit (if using a GPU)
- Python 3.x
- Jupyter Notebook or another Python environment
Installation
You can set up your environment by running the following commands:
!nvidia-smi # Check GPU availability
!pip install -q transformers datasets torch tqdm
Usage
- Loading and Preprocessing Data: The code loads the SQuAD dataset and selects a subset for training. You can adjust the subset_size variable to control how many examples are used.
- Tokenization and Dataset Creation: The QADataset class preprocesses and tokenizes the data for training. It converts question and context pairs into a tokenized format suitable for DistilBERT and computes the start and end token positions of each answer within the context (a sketch of these two steps appears after this list).
- Model Configuration: The model is based on the DistilBERT architecture, specifically the "distilbert-base-cased" checkpoint.
- Training Loop: The code sets up a training loop for a specified number of epochs, training the model to predict the start and end positions of the answer span in the context paragraph (see the sketch in the Training section below).
- Saving the Model: The final trained model is saved to a specified directory in Google Drive. You can adjust the final_model_output_dir variable to change the save location.
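The following is a minimal sketch of the loading and tokenization steps above. It assumes the fast tokenizer matching the "distilbert-base-cased" checkpoint; the subset_size value, max_length, and the exact structure of QADataset are illustrative rather than the notebook's exact settings.

# Sketch: load a SQuAD subset and build a token-level QA dataset.
import torch
from torch.utils.data import Dataset
from datasets import load_dataset
from transformers import DistilBertTokenizerFast

subset_size = 5000  # illustrative value; adjust as needed
squad = load_dataset("squad", split="train").select(range(subset_size))
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-cased")

class QADataset(Dataset):
    # Tokenizes (question, context) pairs and maps each answer's character
    # span onto start/end token positions in the context.
    def __init__(self, examples, tokenizer, max_length=384):
        self.encodings = tokenizer(
            [ex["question"] for ex in examples],
            [ex["context"] for ex in examples],
            truncation="only_second",   # truncate the context, never the question
            max_length=max_length,
            padding="max_length",
            return_offsets_mapping=True,
        )
        self.start_positions, self.end_positions = [], []
        for i, ex in enumerate(examples):
            answer_start = ex["answers"]["answer_start"][0]
            answer_end = answer_start + len(ex["answers"]["text"][0])
            sequence_ids = self.encodings.sequence_ids(i)
            start_tok = end_tok = 0  # falls back to 0 if the answer was truncated away
            for idx, (start, end) in enumerate(self.encodings["offset_mapping"][i]):
                if sequence_ids[idx] != 1:   # skip question and special tokens
                    continue
                if start <= answer_start < end:
                    start_tok = idx
                if start < answer_end <= end:
                    end_tok = idx
            self.start_positions.append(start_tok)
            self.end_positions.append(end_tok)

    def __len__(self):
        return len(self.start_positions)

    def __getitem__(self, idx):
        return {
            "input_ids": torch.tensor(self.encodings["input_ids"][idx]),
            "attention_mask": torch.tensor(self.encodings["attention_mask"][idx]),
            "start_positions": torch.tensor(self.start_positions[idx]),
            "end_positions": torch.tensor(self.end_positions[idx]),
        }

train_dataset = QADataset(list(squad), tokenizer)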
Training
To train the model, follow these steps:
- Run the provided code cells in a Jupyter Notebook or Python environment.
- The code will load the dataset, tokenize it, and set up the training loop (a condensed sketch of the loop appears after this list).
- The model's training progress will be displayed using a progress bar.
- After training completes, the final trained model will be saved to the specified directory in Google Drive.
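For reference, here is a condensed sketch of the training loop and the saving step, assuming the train_dataset and tokenizer built in the Usage sketch above. The hyperparameter values, the AdamW optimizer, and the final_model_output_dir path are illustrative assumptions, not the notebook's exact settings.

# Sketch: fine-tune DistilBERT for extractive QA and save the result.
import torch
from torch.utils.data import DataLoader
from transformers import DistilBertForQuestionAnswering
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-cased").to(device)

batch_size = 16        # illustrative hyperparameters; tune to your hardware
learning_rate = 3e-5
num_epochs = 3

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

model.train()
for epoch in range(num_epochs):
    progress = tqdm(train_loader, desc=f"Epoch {epoch + 1}/{num_epochs}")
    for batch in progress:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)       # loss is computed from the start/end positions
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        progress.set_postfix(loss=outputs.loss.item())

# Save the trained model and tokenizer; point this at your Drive if desired.
final_model_output_dir = "/content/drive/MyDrive/qa-distilbert"  # example path
model.save_pretrained(final_model_output_dir)
tokenizer.save_pretrained(final_model_output_dir)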
Notes
- This code assumes you are running in Google Colab with Google Drive mounted, so that the final model can be saved to Drive; a mounting snippet is shown after this list. If you're using a different environment, you might need to adjust the saving mechanism.
- Make sure you have sufficient space in your Google Drive to save the model.
- You can modify hyperparameters such as batch size, learning rate, and the number of epochs to experiment with different training settings.
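If you are working in Colab, mounting Drive before training looks like the following; the mount point is Colab's default, and the save path is only an example.

# Mount Google Drive in Colab so the trained model can be saved there.
from google.colab import drive
drive.mount("/content/drive")  # prompts for authorization
# Any path under /content/drive/MyDrive/ can then be used as final_model_output_dir.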
Credits
- The code in this repository is based on the Hugging Face Transformers library and the SQuAD dataset.
- DistilBERT
- SQuAD dataset
License
This code is provided under the MIT License. Feel free to modify and use it as needed.