Legal Act Extraction Model
As legislation grows more complex, keeping track of changes in its interconnections and hierarchical structure becomes a challenging task. Entity extraction (also known as token classification) facilitates document analysis by assigning a label to each token in a text.
Which data elements to extract and how to label them depends mostly on the business problem at hand. The only hard limit comes from tokenization: an element cannot be smaller than a single token as split by the tokenizer. As long as each data element corresponds to at least one whole token, it can represent a legal term, a legal entity, a legal party, a deadline, and so on.
This model is fine-tuned to label mentions of legal acts and their articles. The extracted information can be used to build an interconnectivity map of legal acts.
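The token-boundary constraint can be illustrated with a minimal sketch. It assumes a whitespace tokenizer as a stand-in for the real RoBERTa tokenizer (which splits text into subword units), and the helper names `token_offsets` and `snap_to_tokens` are hypothetical. The idea is that a character span cutting through a token has to be widened to whole tokens before it can carry a label.

```python
import re


def token_offsets(text):
    """Stand-in tokenizer: one token per run of non-whitespace characters."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def snap_to_tokens(text, start, end):
    """Widen a character span so that it covers only whole tokens.

    Returns None when the span does not overlap any token.
    """
    covered = [(s, e) for s, e in token_offsets(text) if s < end and e > start]
    if not covered:
        return None
    return covered[0][0], covered[-1][1]


text = "Article 9 of Directive 97/23/EC"
# The span 16..25 starts inside "Directive" and ends inside "97/23/EC",
# so it is widened to cover both tokens completely.
span = snap_to_tokens(text, 16, 25)
print(text[span[0]:span[1]])
```

With a subword tokenizer the same principle applies, only the token boundaries fall in different places.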
Model Description
This model is a fine-tuned checkpoint of RoBERTa-large. More details about RoBERTa-large are available in the RoBERTa-large model card.
The model uses the following labels:
Id | Label | Description |
---|---|---|
0 | O | Not a legal act and not an article |
1 | abbreviation_relevant_following_act | A legal act abbreviation relevant to the following legal act |
2 | abbreviation_relevant_previous_act | A legal act abbreviation relevant to a previously mentioned legal act |
3 | another_act | A legal act |
4 | another_act_abbreviation | A legal act mentioned as an abbreviation |
5 | another_act_equal_previous_act | An assumed legal act introduced previously |
6 | another_act_sequence_end | Inside a sequence of legal acts |
7 | another_act_sequence_start | At the beginning of a sequence of legal acts |
8 | another_article_equal_previous_article | An assumed article introduced previously |
9 | article_current | An article mentioning itself |
10 | article_relevant_current_act | An article of the same legal act as the one being processed |
11 | article_relevant_current_act_range_end | A range end of articles belonging to the current act |
12 | article_relevant_current_act_range_start | A range start of articles belonging to the current act |
13 | article_relevant_following_act | An article of a following legal act |
15 | article_relevant_following_act_range_end | A range end of articles belonging to a following act |
16 | article_relevant_following_act_range_start | A range start of articles belonging to a following legal act |
17 | article_relevant_previous_act | An article of a previously mentioned legal act |
18 | article_relevant_previous_act_range_end | A range end of articles belonging to a previously mentioned legal act |
19 | article_relevant_previous_act_range_start | A range start of articles belonging to a previously mentioned legal act |
20 | current_act | A legal act mentioning itself |
21 | treaty_abbreviation | A treaty mentioned as an abbreviation |
22 | treaty_name | A treaty |
23 | service_label | A token comprising more than one label |
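The `_range_start` and `_range_end` labels in the table come in pairs, so a downstream step would typically join them into a single article range. The following is a minimal sketch of that step, not part of the model itself. It assumes entity dictionaries shaped like the pipeline output (`entity_group`, `word`); the function name `pair_article_ranges` and the sample entities are hypothetical and hand-written, not real model output.

```python
START_SUFFIX = "_range_start"
END_SUFFIX = "_range_end"


def pair_article_ranges(entities):
    """Join each *_range_start entity with the next matching *_range_end."""
    ranges = []
    pending = None  # most recent unmatched range start
    for ent in entities:
        label = ent["entity_group"]
        if label.endswith(START_SUFFIX):
            pending = ent
        elif label.endswith(END_SUFFIX) and pending is not None:
            base = label[: -len(END_SUFFIX)]
            # Only pair a start and an end that share the same label prefix.
            if pending["entity_group"][: -len(START_SUFFIX)] == base:
                ranges.append((base, pending["word"], ent["word"]))
            pending = None
    return ranges


# Hand-written entities for a phrase such as "Articles 5 to 9" (illustrative only).
sample = [
    {"entity_group": "article_relevant_current_act_range_start", "word": "5"},
    {"entity_group": "article_relevant_current_act_range_end", "word": "9"},
]
print(pair_article_ranges(sample))
```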
Intended Uses & Limitations
The model can be used to extract mentions of legal acts and their articles from text.
Limitations
This legal-act extraction model is highly domain-specific and will perform well on legal texts. Using it in other domains is not recommended, although you are free to test it out. It is intended for English documents only.
How To Use
from transformers import (
    TokenClassificationPipeline,
    RobertaForTokenClassification,
    RobertaTokenizerFast,
)

legal_act_extraction_model = RobertaForTokenClassification.from_pretrained(
    'Lexemo/roberta_large_legal_act_extraction')
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")
pypeline = TokenClassificationPipeline(model=legal_act_extraction_model,
                                       tokenizer=tokenizer,
                                       aggregation_strategy='simple')
# Inference
import pandas as pd
from tabulate import tabulate
text = """When Member States adopt those measures, they shall contain a
reference to this Directive or be accompanied by such reference on the
occasion of their official publication. They shall also include a statement
that references in existing laws, regulations and administrative provisions
to Article 9 of Directive 97/23/EC shall be construed as references to
Article 13 of this Directive. Member States shall determine how such
reference is to be made and how that statement is to be formulated."""
entities = pypeline(text)
df = pd.DataFrame(entities)
print(tabulate(df, showindex=True, headers=df.columns))
# Output
    entity_group                       score  word                  start    end
--  ------------------------------  --------  ------------------  -------  -----
 0  current_act                     0.999999  Directive                80     89
 1  article_relevant_following_act  0.999995  9                       296    297
 2  another_act                     0.999999  Directive 97/23/EC      301    319
 3  article_relevant_following_act  0.999996  13                      364    366
 4  current_act                     0.999999  Directive               375    384
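As one possible way to turn the output above into an interconnectivity map, the sketch below links each `article_relevant_following_act` entity to the next act entity in the text. The entity list reproduces the rows of the output table (scores omitted). The function name `link_articles_to_acts`, the `ACT_LABELS` set, and the `"this act"` placeholder for the act being processed are assumptions made for illustration; a full implementation would need to handle the remaining labels as well.

```python
from collections import defaultdict

# Labels treated as act mentions in this sketch (an illustrative subset).
ACT_LABELS = {"current_act", "another_act"}

# Entities copied from the output table above, without the score column.
entities = [
    {"entity_group": "current_act", "word": "Directive", "start": 80, "end": 89},
    {"entity_group": "article_relevant_following_act", "word": "9", "start": 296, "end": 297},
    {"entity_group": "another_act", "word": "Directive 97/23/EC", "start": 301, "end": 319},
    {"entity_group": "article_relevant_following_act", "word": "13", "start": 364, "end": 366},
    {"entity_group": "current_act", "word": "Directive", "start": 375, "end": 384},
]


def link_articles_to_acts(entities):
    """Map each act to the articles that point to it as 'the following act'."""
    links = defaultdict(list)
    for i, ent in enumerate(entities):
        if ent["entity_group"] != "article_relevant_following_act":
            continue
        # The first act mention after the article is taken as its owner.
        following = next(
            (e for e in entities[i + 1:] if e["entity_group"] in ACT_LABELS), None
        )
        if following is None:
            continue
        if following["entity_group"] == "current_act":
            act_name = "this act"  # placeholder for the act being processed
        else:
            act_name = following["word"]
        links[act_name].append(ent["word"])
    return dict(links)


print(link_articles_to_acts(entities))
```

For the sample text this associates Article 9 with Directive 97/23/EC and Article 13 with the act being processed.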
Fine-tuning hyper-parameters
- learning_rate = 2e-5
- batch_size = 4
- weight_decay = 0.01
- max_seq_length = 514
- num_train_epochs = 56
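For readers who want to reproduce a similar setup, the values above could be collected as keyword arguments for `transformers.TrainingArguments`. This is only a sketch: the original training script is not shown, mapping `batch_size` to `per_device_train_batch_size` is an assumption, and the output directory in the comment is a placeholder.

```python
# Keyword names follow transformers.TrainingArguments.
# Mapping batch_size to per_device_train_batch_size is an assumption.
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 4,
    "weight_decay": 0.01,
    "num_train_epochs": 56,
}

# max_seq_length is not a TrainingArguments field; it would typically be
# applied when tokenizing the training data.
max_seq_length = 514

# With transformers installed, the values could be passed along like this
# ("out" is a placeholder directory):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **training_kwargs)
print(training_kwargs)
```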