# 12 epochs, batch size 4, gradient accumulation steps 1, tail 4096. This seems to be the optimal setup.