A decoder-only model with a GPT-2-like architecture, trained on the MiniPile dataset from Hugging Face. The following configuration was used for training, which was done on 3 NVIDIA A100 GPUs.

batch_size: 4

block_size: 1024

gradient_accumulation_steps: 21

max_iters: 200000

lr_decay_iters: 180000

warmup_iters: 20000 # 10% of max_iters

weight_decay: 0.1

dropout: 0.1

device: 'cuda'

n_layer: 16

n_head: 16

n_embd: 2048
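As a sanity check, the configuration above implies a model of roughly 0.9B parameters and an effective batch of 252 sequences per optimizer step. This sketch assumes a standard GPT-2-style transformer block (4·d² attention + 8·d² MLP weights, biases and layer norms ignored) and the GPT-2 BPE vocabulary of 50257 tokens with a tied output head; neither the vocabulary size nor weight tying is stated above, so treat both as assumptions.

```python
# Rough parameter count for the configuration above (weights only).
n_layer, n_embd, block_size = 16, 2048, 1024
vocab_size = 50257  # assumption: GPT-2 BPE vocabulary

token_emb = vocab_size * n_embd   # tied with the output head (assumption)
pos_emb = block_size * n_embd
per_block = 12 * n_embd ** 2      # 4*d^2 attention + 8*d^2 MLP weights
total = token_emb + pos_emb + n_layer * per_block
print(f"~{total / 1e6:.0f}M parameters")  # ~910M

# Effective batch size per optimizer step across the 3 GPUs
batch_size, grad_accum, n_gpus = 4, 21, 3
eff_batch = batch_size * grad_accum * n_gpus
tokens_per_step = eff_batch * block_size
print(eff_batch, tokens_per_step)  # 252 sequences, 258048 tokens
```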

The file pytorch_model.bin contains the final checkpoint from the last iteration. The file "checkpoint_iter60k" contains an intermediate checkpoint from the 60,000th iteration.
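A minimal sketch for loading either checkpoint for inspection, assuming both are standard PyTorch checkpoints saved with `torch.save` (the exact contents, e.g. a raw `state_dict` versus a dict that also carries optimizer state, are not stated above and may differ):

```python
def load_checkpoint(path: str):
    """Load a training checkpoint onto CPU for inspection.

    Assumes the file was written with torch.save; map_location="cpu"
    lets the checkpoint be examined on a machine without a GPU.
    """
    import torch  # imported here so the helper is easy to copy into scripts
    return torch.load(path, map_location="cpu")

if __name__ == "__main__":
    # Hypothetical usage; adjust the path to where the files were downloaded.
    state = load_checkpoint("pytorch_model.bin")
    print(type(state))
```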