# Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models (Findings of EMNLP 2023)

- Paper: https://arxiv.org/abs/2310.13312
- GitHub: https://github.com/deep-over/FiLM
## FiLM (Financial Language Model) Models 🌟
FiLM (Financial Language Model) is a pre-trained language model (PLM) optimized for the financial domain and built on a diverse range of financial corpora. Initialized from RoBERTa-base, FiLM is further pre-trained and becomes the first financial PLM to surpass RoBERTa-base on financial-domain tasks.

To train FiLM, we categorized our financial corpora into specific groups and gathered a diverse range of corpora to ensure optimal performance.
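For illustration, below is a minimal sketch of what this kind of continued pre-training looks like with Hugging Face `transformers`. The corpus file name and all hyperparameters are placeholders, not the exact settings used for FiLM.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Start from RoBERTa-base, as FiLM does
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForMaskedLM.from_pretrained('roberta-base')

# 'financial_corpus.txt' is a placeholder for your own financial corpus
dataset = load_dataset('text', data_files={'train': 'financial_corpus.txt'})

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=['text'])

# Standard masked-language-modeling objective (15% masking)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir='film-pretraining',
                         per_device_train_batch_size=8,
                         num_train_epochs=1)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized['train'],
                  data_collator=collator)
trainer.train()
```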
We offer two versions of the FiLM model, each tailored to a specific use case in the financial domain:
### FiLM: Base model

This is our foundational model, trained on the entire range of corpora listed in the corpus table (see Types of Training Corpora below). It is ideal for a wide array of financial applications. 📊
### FiLM (5.5B): Optimized for SEC filings
This model is specialized for SEC filings. We expanded the training set by adding 3.1 billion tokens from an SEC filings corpus. The dataset is sourced from EDGAR-CORPUS: Billions of Tokens Make The World Go Round (Loukas et al., ECONLP 2021) and can be downloaded from Zenodo. 📑
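If you want to fetch EDGAR-CORPUS programmatically, a hedged sketch follows; the Hub dataset ID and config name are assumptions on our part, not taken from this README, so verify them against the Zenodo release referenced above.

```python
from datasets import load_dataset

# Assumption: EDGAR-CORPUS is mirrored on the Hugging Face Hub as
# 'eloukas/edgar-corpus'; the 'year_2020' config name is illustrative.
edgar = load_dataset('eloukas/edgar-corpus', 'year_2020', split='train')
print(edgar.column_names)  # inspect the fields of the release schema
```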
Loading the tokenizer and model: because FiLM is initialized from RoBERTa-base, the tokenizer is loaded from 'roberta-base'.

```python
from transformers import AutoTokenizer, AutoModel

# FiLM reuses the RoBERTa-base tokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
# Load the SEC-specialized variant of FiLM
model = AutoModel.from_pretrained('HYdsl/FiLM-SEC')
```
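As a quick sanity check, the snippet below (continuing from the load above) encodes a financial sentence and mean-pools a sentence embedding; the example sentence is arbitrary.

```python
import torch

# Encode a sample financial sentence and mean-pool a sentence embedding
inputs = tokenizer(
    "The company reported a 12% increase in quarterly revenue.",
    return_tensors='pt',
)
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)
print(embedding.shape)
```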
## Types of Training Corpora 📚