BERT trained on YFCC15M, with the same capacity as the CLIP text encoder