A decoder-only model with a GPT-2-style architecture, trained on the MiniPile dataset from Hugging Face. The following configuration was used for training, which ran on 3 NVIDIA A100 GPUs.
batch_size: 4
block_size: 1024
gradient_accumulation_steps: 21
max_iters: 200000
lr_decay_iters: 180000
warmup_iters: 20000 # 10% of max_iters
weight_decay: 0.1
dropout: 0.1
device: 'cuda'
n_layer: 16
n_head: 16
n_embd: 2048
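With gradient accumulation across 3 GPUs, the effective batch size is larger than `batch_size` alone suggests. A quick sketch of the tokens processed per optimizer step, assuming `batch_size` is the per-GPU micro-batch (the usual convention, but an assumption here):

```python
# Values taken from the configuration above; the per-GPU
# interpretation of batch_size is an assumption.
batch_size = 4                    # sequences per GPU per micro-step
gradient_accumulation_steps = 21  # micro-steps per optimizer step
n_gpus = 3                        # NVIDIA A100s used for training
block_size = 1024                 # tokens per sequence

sequences_per_step = batch_size * gradient_accumulation_steps * n_gpus
tokens_per_step = sequences_per_step * block_size

print(sequences_per_step)  # 252 sequences per optimizer step
print(tokens_per_step)     # 258048 tokens per optimizer step
```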
The file pytorch_model.bin contains the final checkpoint from the last iteration. The file "checkpoint_iter60k" contains an intermediate checkpoint from the 60,000-th iteration.
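A minimal sketch of how a checkpoint like pytorch_model.bin can be loaded with PyTorch. The exact contents (a bare state_dict vs. a dict that also holds optimizer state) depend on the training script, so this round-trips a toy state_dict to show the pattern rather than the real model:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy stand-in for the real model; the actual architecture is a
# 16-layer GPT-2-style decoder (n_embd=2048, n_head=16).
model = nn.Linear(8, 8)

# Save and reload a state_dict the same way the real checkpoint would be used.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "pytorch_model.bin")
    torch.save(model.state_dict(), path)

    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state)  # raises if keys or shapes mismatch

loaded_keys = sorted(state.keys())
print(loaded_keys)
```

For the real checkpoint, the same `torch.load(..., map_location="cpu")` call applies; pass the resulting dict (or its model sub-entry, if the script saved one) to `load_state_dict`.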