distilbert-base-uncased trained for 680K steps with batch size 64 on C4, MSMARCO, Wikipedia, S2ORC, and News; the released checkpoint is the one with the lowest loss on the dev dataset.
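
As a minimal usage sketch (assuming the checkpoint is published on the Hugging Face Hub; the model ID below is a placeholder, not the actual repository name), the model can be loaded with the `transformers` library like any other DistilBERT checkpoint:

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder model ID -- substitute the actual Hub ID of this checkpoint.
model_id = "your-org/distilbert-base-uncased-680k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a sentence and take the [CLS] token's hidden state as its embedding.
inputs = tokenizer("MSMARCO is a passage ranking dataset.", return_tensors="pt")
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]
```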