Varta-T5

Model Description

Varta-T5 is a model pre-trained on the full training set of Varta in 14 Indic languages (Assamese, Bhojpuri, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, and Urdu) and English, using span corruption and gap-sentence generation as objectives.

Varta is a large-scale news corpus for Indic languages, comprising 41.8 million news articles in 14 Indic languages (and English) drawn from a variety of high-quality sources. The dataset and the model are introduced in the accompanying paper (arXiv:2305.05858); the code is released in the accompanying repository.

Uses

You can use the raw model for span in-filling (see below), but it is mostly intended to be fine-tuned on a downstream task.

Note that the text-to-text framework allows the same model to be used for any NLP task, including text generation tasks (e.g., machine translation, document summarization, question answering) and classification tasks (e.g., sentiment analysis).
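
As an illustration, a downstream task such as headline generation can be cast as plain input-text/target-text pairs and trained with the standard sequence-to-sequence loss. The snippet below is a minimal sketch with a made-up article/headline pair; it is not the fine-tuning setup used in the paper.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("rahular/varta-t5")
model = AutoModelForSeq2SeqLM.from_pretrained("rahular/varta-t5")

# Hypothetical article/headline pair, framed as text-to-text.
article = "The city council approved a new budget for public transport on Monday."
headline = "Council approves new transport budget"

inputs = tokenizer(article, return_tensors="pt", truncation=True)
labels = tokenizer(headline, return_tensors="pt", truncation=True).input_ids

# Standard seq2seq cross-entropy loss; plug this into your own training loop.
loss = model(**inputs, labels=labels).loss
loss.backward()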

Bias, Risks, and Limitations

This work is mainly dedicated to the curation of a new multilingual dataset for Indic languages, many of which are low-resource. During data collection, we faced several limitations that can potentially result in ethical concerns; please refer to the paper for a detailed discussion.

How to Get Started with the Model

You can use this model directly for span in-filling.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the pre-trained encoder-decoder model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("rahular/varta-t5")
model = AutoModelForSeq2SeqLM.from_pretrained("rahular/varta-t5")
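
For example, you can mask a span with a sentinel token and let the model fill it in. This is a minimal sketch: it assumes the tokenizer exposes T5-style sentinel tokens such as <extra_id_0>, and the input sentence is purely illustrative.

# Mask a span with a sentinel token and let the model generate the missing text.
input_text = "The capital of India is <extra_id_0>."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))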

Training Details

Training Data

Varta contains 41.8 million high-quality news articles in 14 Indic languages and English. With 34.5 million non-English article-headline pairs, it is the largest document-level dataset of its kind.

Pretraining

Since data sizes across languages in Varta vary from 1.5K (Bhojpuri) to 14.4M articles (Hindi), we use standard temperature-based sampling to upsample data when necessary.
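
As a rough illustration, temperature-based sampling raises each language's share of the data to the power 1/T before renormalizing, so low-resource languages are sampled more often than their raw share would suggest. The sketch below uses the Hindi and Bhojpuri article counts mentioned above, a made-up Tamil count, and an illustrative temperature; the exact values used for pretraining are described in the paper.

# Sketch of temperature-based sampling over per-language article counts.
# The Tamil count and the temperature T are illustrative only; see the paper
# for the values actually used.
article_counts = {"hi": 14_400_000, "ta": 1_000_000, "bho": 1_500}
T = 3.0  # T > 1 flattens the distribution, upsampling low-resource languages

total = sum(article_counts.values())
weights = {lang: (count / total) ** (1.0 / T) for lang, count in article_counts.items()}
norm = sum(weights.values())
sampling_probs = {lang: w / norm for lang, w in weights.items()}
print(sampling_probs)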

Evaluation Results

Please refer to the paper for detailed evaluation results.

Citation

@misc{aralikatte2023varta,
      title={V\=arta: A Large-Scale Headline-Generation Dataset for Indic Languages}, 
      author={Rahul Aralikatte and Ziling Cheng and Sumanth Doddapaneni and Jackie Chi Kit Cheung},
      year={2023},
      eprint={2305.05858},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}