Banks Taxonomy

Overview

ABILaBERT was created to classify a text into one or more concepts of a taxonomy describing the banking domain. The taxonomy can be bank-specific or general to the domain knowledge: it is modeled for the text classifier through a pre-training process acting over the taxonomy itself.

In this work, we consider the ABILab Process Tree Taxonomy as a general, i.e., bank-independent, formalization of the processes currently active in the Italian banking eco-system. The objective of this taxonomy is to achieve a complete and shared mapping of banking processes, covering all areas of activity at a level of detail that can be considered common across different banks and financial organizations, without explicit reference to existing organizational structures, products offered or delivery channels.

To remedy the complete absence of training data, we adopt a Zero-Shot Learning approach based on a semantic model: we exploit the taxonomy itself as a source of information, in particular by making explicit all relationships between concepts. In the proposed augmentation, we map individual pieces of taxonomic information (e.g. relations) into short texts declaring the corresponding semantic evidence (e.g. the correctness of a definition for the concept name, or the declaration of the underlying hierarchical relation). We call the processes of recognizing this information, that is, accepting such texts as true, Sub-Tasks.
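As a minimal sketch of this verbalization step, the snippet below maps a hypothetical taxonomy fragment into short statements. The fragment, the helper names and the hierarchy template are illustrative assumptions; only the definition template (" definisce il termine ") matches the one used in the code example later in this card.

# Hypothetical taxonomy fragment: a concept with its gloss and its parent process
taxonomy = {
    "definizione budget aziendale": {
        "definition": "Processo di gestione del piano di budget ...",
        "parent": "pianificazione e controllo di gestione",  # illustrative parent concept
    },
}

def definition_statement(definition, concept):
    # Definition Recognition: does `definition` correctly define `concept`?
    return definition + " definisce il termine " + concept

def hierarchy_statement(child, parent):
    # Relation recognition (hypothetical template): is `child` a sub-process of `parent`?
    return child + " è un sotto-processo di " + parent

for concept, info in taxonomy.items():
    print(definition_statement(info["definition"], concept))
    print(hierarchy_statement(concept, info["parent"]))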

As a result, a dataset of more than 1 million examples was obtained, with which we trained the initial gilberto-uncased-from-camembert model.

For more information on dataset creation, training, and the classification of a text into one or more concepts of the taxonomy, please refer to the paper (Margiotta et al., 2021), titled "Knowledge-based neural pre-training for Intelligent Document Management", available at: link. Here we refer only to the use of the model for solving the Sub-Tasks used in training.

Sub-Tasks for Domain-Specific Pre-training

The Sub-Tasks aim at acquiring domain knowledge implicitly from definitions and from relational texts, i.e., statements about direct subsumption relationships between concepts at different levels of the taxonomy.

In particular, the model was trained to predict whether such statements hold, i.e., whether a candidate text correctly defines a concept (Definition Recognition) and whether a direct subsumption relation holds between two concepts.

Examples of Sub-Tasks
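For illustration, a Definition Recognition instance pairs a candidate definition with a concept name using the pattern shown in the code example below, i.e. "<candidate definition> definisce il termine <concept>": built with the concept "definizione budget aziendale", the statement for the budget-management definition would be expected to be accepted as true, while built with the unrelated concept "lavorazione assegni tratti" it would be expected to be rejected. Relational Sub-Tasks work analogously, on short statements asserting a direct subsumption relation between two concepts of the Process Tree.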

Code Example

The following is a brief Python snippet demonstrating correct use of the model for prediction in the Definition Recognition task.

Define a list of input sentences

We create the examples by composing two parts: "banking_text", a candidate definition of a concept, and "concept", the name of the concept for which we want to establish whether the definition is true.

inputs = []

# Candidate definition (in Italian) and the candidate concept names to test
banking_text = "Processo di gestione del piano di budget attraverso l'individuazione delle regole di predisposizione, la predisposizione effettiva e il controllo del suo rispetto"
concepts = ["definizione budget aziendale", "lavorazione assegni tratti", ...]

# Compose one Definition Recognition statement per candidate concept:
# "<banking_text> definisce il termine <concept>"
for concept in concepts:
    inputs.append(banking_text + " definisce il termine " + concept)
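Each element of inputs is therefore the candidate definition, the fixed pattern " definisce il termine " and one concept name concatenated into a single statement; abbreviated, the first one reads:

Processo di gestione del piano di budget ... definisce il termine definizione budget aziendale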

Download the model

import torch
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler
from transformers import CamembertTokenizer, CamembertForSequenceClassification

tokenizer = CamembertTokenizer.from_pretrained("Abilab-Uniroma2/ABILaBERT")
model = CamembertForSequenceClassification.from_pretrained("Abilab-Uniroma2/ABILaBERT")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
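The CamemBERT tokenizer and sequence-classification classes are used because ABILaBERT was initialized from gilberto-uncased-from-camembert, which follows the CamemBERT architecture; the last two lines simply move the model to a GPU when one is available.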

Set up the functions that produce the prediction output

def generateDataLoader(sentences):
    # Tokenize all statements, padding/truncating them to 256 tokens
    encoded_data_classifications = tokenizer.batch_encode_plus(
        sentences,
        add_special_tokens=True,
        return_attention_mask=True,
        max_length=256,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    input_ids_classifications = encoded_data_classifications['input_ids']
    attention_masks_classifications = encoded_data_classifications['attention_mask']
    # Dummy labels: only needed to build the TensorDataset, they do not affect the logits
    labels_classifications = torch.tensor([0] * len(sentences))
    dataset_classifications = TensorDataset(input_ids_classifications, attention_masks_classifications, labels_classifications)

    return DataLoader(dataset_classifications,
                      sampler=SequentialSampler(dataset_classifications),
                      batch_size=16)

def prediction(dataloader_val):
    model.eval()
    predictions = []

    for batch in dataloader_val:
        # Move the batch to the same device as the model
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }
        with torch.no_grad():
            outputs = model(**inputs)

        # outputs[0] is the loss (computed on the dummy labels), outputs[1] the logits
        logits = outputs[1]
        predictions.append(logits.detach().cpu())

    # Concatenate the per-batch logits into a single (num_sentences, 2) tensor
    return torch.cat(predictions, dim=0)

def showPrediction(inputs, outputs):
    # outputs[i] holds the two logits (index 0 = FALSE, index 1 = TRUE) for the i-th statement
    for i, x in enumerate(inputs):
        if outputs[i][0] > outputs[i][1]:
            print(f'INPUT:\t{x}\nOUTPUT:\t\x1b[31mFALSE\x1b[0m')
        else:
            print(f'INPUT:\t{x}\nOUTPUT:\t\x1b[92mTRUE\x1b[0m')

Get the output

def modelPredictions(inputs):
    # Tokenize the statements, run the model and print a TRUE/FALSE verdict for each one
    dataLoader = generateDataLoader(inputs)
    outputs = prediction(dataLoader)
    showPrediction(inputs, outputs)

modelPredictions(inputs)