Model Card for budgetlongformer-diverse

<!-- Provide a quick summary of what the model is/does. [Optional] --> Legal pretrained model using Replaced Token Detection (RTD) task, trained on Pile-of-Law dataset with 4096 tokens as context windows.

Model Details

Model Description

<!-- Provide a longer summary of what this model is/does. --> Legal pretrained model using ELECTRA objective task, trained on Pile-of-Law dataset with 4096 tokens as context windows.

Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. --> <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->

The model can directly be used to generate embeddings for example for similarity search. It likely works best on US focused legal data.

Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app --> <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->

The model can be finetuned for any NLU task or when coupled with a decoder also for generative tasks. In our experiments on summarization with the BillSum dataset, we found that random initialization of the decoder improved performance.

Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. --> <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->

This model will likely work worse on non-legal text in non-English languages originating from outside the US.

Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

Considerations about the training dataset

Social Impact of Dataset

As described in the dataset card, the internal variation allows contextual privacy rules to be learned. If robust mechanisms for this are developed they can applied more broadly. As discussed in ``On the Opportunities and Risks of Foundation Models'', legal language models can help improve access to justice in various ways. But they can also be used in potentially harmful ways. While such models are not ready for most production environments and are the subject of significant research, we ask that model users and model creators using this model, particularly when creating generative models (e.g. attaching a decoder), consider the impacts of their model and make a good faith effort to weigh the benefits against the harms of their method. As our license, the training dataset license also restricts commercial usage.

Discussion of Biases

The data reflects the biases of governments and courts. As discussed in their work Pile of Law, these can be significant, though more recent text will likely be less overtly toxic. Please consult the above statement and keep it in mind in the use and/or any modification of this model, implementing responsible use.

Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

As with any large LM there is the risk of it producing biased or unfair output. Researchers using the model should put into place respective safeguards to identify biased and/or toxic language.

Training Details

Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The diverse model was trained on caselaw (“Court Listener Opinions” & “Court Listener Docket Entry Documents”), legislation (“US Code”, “State Codes” & “EURLEX”) and contracts (“Atticus Contracts” & “EDGAR Contracts”) from public Pile-of-Law dataset. To balance the training data, we limited the number of documents to 500K (this affects Court Listener Opinions, Court Listener Docket Entry Documents and EDGAR Contracts).

Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

Preprocessing

More information needed

Speeds, Sizes, Times

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

More information needed

Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

Testing Data, Factors & Metrics

Testing Data

<!-- This should link to a Data Card if possible. -->

We tested the model on the BillSum and PubMed summarization datasets achieving SotA Rouge scores for the respective parameter sizes in August 2022.

Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

More information needed

Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

We followed the standard in research on summarization datasets and used Rouge 1, 2 and L.

Results

More information needed

Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Model Architecture and Objective

We used a Longformer attention window of 256 as generator and discriminator. The generator model was three times smaller than the discriminator model. In particular, the generator’s depth (number of hidden layers) instead of its width (embedding size, hidden size and intermediate size). We used a MLM probability of 25% for the generator.

Compute Infrastructure

Amazon SageMaker Notebooks.

Hardware

4 x 16GB NVIDIA V100

Software

transformers

Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

BibTeX:

@misc{niklaus2022budgetlongformer, title={BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch?}, author={Joel Niklaus and Daniele Giofré}, year={2022}, eprint={2211.17135}, archivePrefix={arXiv}, primaryClass={cs.CL} }

Model Card Authors

<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->

Joel Niklaus, Daniele Giofré