Banks Taxonomy

Overview

ABILaBERT was created to classify a text into one or more concepts of a taxonomy describing the banking domain. The taxonomy can be bank-specific or general to the domain knowledge: it is modeled for the text classifier through a pre-training process acting over the taxonomy itself.

In this work, we consider the ABILab Process Tree Taxonomy as a general, i.e., bank-independent, formalization of the processes currently active in the Italian banking eco-system. The objective of this taxonomy is to achieve a complete and shared mapping of banking processes, covering all areas of activity at a level of detail that can be considered common across different banks and financial organizations, without explicit reference to existing organizational structures, products offered or delivery channels.

To remedy the complete absence of training data, we adopt a Zero-Shot Learning approach based on a semantic model: we exploit the taxonomy itself as a source of information, in particular by making explicit all relationships between concepts. In the proposed augmentation, we map individual pieces of taxonomic information (e.g. relations) into short texts declaring the corresponding semantic evidence (e.g. the correctness of a definition for the concept name, or the declaration of the underlying hierarchical relation). We call the processes of recognizing this information, that is, accepting such texts as true, Sub-Tasks.
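As a minimal sketch of this verbalization step, the snippet below maps a hypothetical taxonomy fragment into short statements. The fragment, the helper names and the hierarchy template are illustrative assumptions; only the definition template (" definisce il termine ") matches the one used in the code example later in this card.

# Hypothetical taxonomy fragment: a concept with its gloss and its parent process
taxonomy = {
    "definizione budget aziendale": {
        "definition": "Processo di gestione del piano di budget ...",
        "parent": "pianificazione e controllo di gestione",  # illustrative parent concept
    },
}

def definition_statement(definition, concept):
    # Definition Recognition: does `definition` correctly define `concept`?
    return definition + " definisce il termine " + concept

def hierarchy_statement(child, parent):
    # Relation recognition (hypothetical template): is `child` a sub-process of `parent`?
    return child + " è un sotto-processo di " + parent

for concept, info in taxonomy.items():
    print(definition_statement(info["definition"], concept))
    print(hierarchy_statement(concept, info["parent"]))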

As a result, a dataset of more than 1 million examples was obtained, with which we trained the initial gilberto-uncased-from-camembert model.

For more information on dataset creation, training, and the classification of a text into one or more concepts of the taxonomy, please refer to the paper (Margiotta et al., 2021), titled "Knowledge-based neural pre-training for Intelligent Document Management", available at: link. Here we refer only to the use of the model for solving the Sub-Tasks used in training.

Sub-Tasks for Domain-Specific Pre-training

The Sub-Tasks aim at acquiring domain knowledge implicitly from definitions and from relational texts, i.e., statements about direct subsumption relationships between concepts at different levels of the taxonomy.

In particular, the model was trained to predict whether such statements hold, i.e., whether a candidate text correctly defines a concept (Definition Recognition) and whether a direct subsumption relation holds between two concepts.

Examples of Sub-Tasks
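For illustration, a Definition Recognition instance pairs a candidate definition with a concept name using the pattern shown in the code example below, i.e. "<candidate definition> definisce il termine <concept>": built with the concept "definizione budget aziendale", the statement for the budget-management definition would be expected to be accepted as true, while built with the unrelated concept "lavorazione assegni tratti" it would be expected to be rejected. Relational Sub-Tasks work analogously, on short statements asserting a direct subsumption relation between two concepts of the Process Tree.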

Code Example

The following is a brief Python snippet demonstrating correct use of the model for prediction in the Definition Recognition task.

Define a list of input sentences

We create the examples by composing two parts: "banking_text", a candidate definition of a concept, and "concept", the name of the concept for which we want to establish whether the definition is true.

inputs = []

# Candidate definition (in Italian) and the candidate concept names to test
banking_text = "Processo di gestione del piano di budget attraverso l'individuazione delle regole di predisposizione, la predisposizione effettiva e il controllo del suo rispetto"
concepts = ["definizione budget aziendale", "lavorazione assegni tratti", ...]

# Compose one Definition Recognition statement per candidate concept:
# "<banking_text> definisce il termine <concept>"
for concept in concepts:
    inputs.append(banking_text + " definisce il termine " + concept)
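Each element of inputs is therefore the candidate definition, the fixed pattern " definisce il termine " and one concept name concatenated into a single statement; abbreviated, the first one reads:

Processo di gestione del piano di budget ... definisce il termine definizione budget aziendale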

Download the model

import torch
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler
from transformers import CamembertTokenizer, CamembertForSequenceClassification

tokenizer = CamembertTokenizer.from_pretrained("Abilab-Uniroma2/ABILaBERT")
model = CamembertForSequenceClassification.from_pretrained("Abilab-Uniroma2/ABILaBERT")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
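The CamemBERT tokenizer and sequence-classification classes are used because ABILaBERT was initialized from gilberto-uncased-from-camembert, which follows the CamemBERT architecture; the last two lines simply move the model to a GPU when one is available.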

Set up the functions that produce the prediction output

def generateDataLoader(sentences):
    # Tokenize all statements, padding/truncating them to 256 tokens
    encoded_data_classifications = tokenizer.batch_encode_plus(
        sentences,
        add_special_tokens=True,
        return_attention_mask=True,
        max_length=256,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    input_ids_classifications = encoded_data_classifications['input_ids']
    attention_masks_classifications = encoded_data_classifications['attention_mask']
    # Dummy labels: only needed to build the TensorDataset, they do not affect the logits
    labels_classifications = torch.tensor([0] * len(sentences))
    dataset_classifications = TensorDataset(input_ids_classifications, attention_masks_classifications, labels_classifications)

    return DataLoader(dataset_classifications,
                      sampler=SequentialSampler(dataset_classifications),
                      batch_size=16)

def prediction(dataloader_val):
    model.eval()
    predictions = []

    for batch in dataloader_val:
        # Move the batch to the same device as the model
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }
        with torch.no_grad():
            outputs = model(**inputs)

        # outputs[0] is the loss (computed on the dummy labels), outputs[1] the logits
        logits = outputs[1]
        predictions.append(logits.detach().cpu())

    # Concatenate the per-batch logits into a single (num_sentences, 2) tensor
    return torch.cat(predictions, dim=0)

def showPrediction(inputs, outputs):
    # outputs[i] holds the two logits (index 0 = FALSE, index 1 = TRUE) for the i-th statement
    for i, x in enumerate(inputs):
        if outputs[i][0] > outputs[i][1]:
            print(f'INPUT:\t{x}\nOUTPUT:\t\x1b[31mFALSE\x1b[0m')
        else:
            print(f'INPUT:\t{x}\nOUTPUT:\t\x1b[92mTRUE\x1b[0m')

Get the output

def modelPredictions(inputs):
    # Tokenize the statements, run the model and print a TRUE/FALSE verdict for each one
    dataLoader = generateDataLoader(inputs)
    outputs = prediction(dataLoader)
    showPrediction(inputs, outputs)

modelPredictions(inputs)