# flan-t5-base-geoqa

This model is a fine-tuned version of [google/flan-t5-base](https://huggingface.co/google/flan-t5-base) on the `petroglyphs-nlp-consulting/res_syn_sentences_qa_lg` dataset.
## Model description

The `google/flan-t5-base` model has been fine-tuned on 24,750 question-answer pairs obtained from the `petroglyphs-nlp-consulting/res_syn_sentences_qa_lg` dataset.
## Intended uses & limitations

The model can be used for question-answering inference on Geosciences topics, specifically Oil and Gas reservoir geology.
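A minimal inference sketch with the `transformers` library is shown below. The repository id is an assumption inferred from the model name and may need adjusting, and the question is a made-up example:

```python
# Usage sketch: the repository id below is assumed from the model name,
# not confirmed by the card; adjust it to the actual Hub id if different.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "petroglyphs-nlp-consulting/flan-t5-base-geoqa"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Example geoscience question (illustrative only)
question = "What controls porosity in a sandstone reservoir?"
inputs = tokenizer(question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```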
## Training and evaluation data
- Training: 24,750 question-answer pairs
- Validation: 4,980 question-answer pairs
- Test: 250 question-answer pairs
Evaluation was run on the test set using the following functions (answer normalization, exact match, and token-level F1):

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def f1_score(prediction, ground_truth):
    """Token-level F1 between a predicted and a gold answer."""
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

def exact_match_score(prediction, ground_truth):
    """True if prediction and gold answer are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)

def evaluate(gold_answers, predictions):
    """Average exact-match and F1 scores (as percentages) over all pairs."""
    f1 = exact_match = total = 0
    for gold_answer, prediction in zip(gold_answers, predictions):
        total += 1
        exact_match += exact_match_score(prediction, gold_answer)
        f1 += f1_score(prediction, gold_answer)
    exact_match = 100.0 * exact_match / total
    f1 = 100.0 * f1 / total
    return {'exact_match': exact_match, 'f1': f1}
```
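As a worked illustration of the token-level F1 computed above, consider a prediction that contains one extra token relative to the gold answer (the two strings are made-up examples):

```python
from collections import Counter

pred_tokens = "porosity of reservoir is high".split()   # 5 tokens
gold_tokens = "reservoir porosity is high".split()      # 4 tokens

# Multiset intersection of tokens, as in f1_score above
common = Counter(pred_tokens) & Counter(gold_tokens)
num_same = sum(common.values())                         # 4 shared tokens

precision = num_same / len(pred_tokens)                 # 4/5 = 0.8
recall = num_same / len(gold_tokens)                    # 4/4 = 1.0
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.889
```

Note that token order does not matter: F1 only compares the multisets of normalized tokens.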
Results on the test set:

- `exact_match`: 70.4
- `f1`: 83.08
### Comparison with vanilla google/flan-t5-base

- `exact_match`: 65.2
- `f1`: 80.16
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 16
- total_train_batch_size: 512
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 6
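The reported total train batch size follows directly from the per-device batch size and the gradient accumulation steps; a quick sanity check:

```python
train_batch_size = 32
gradient_accumulation_steps = 16

# Effective batch size = per-step batch size x accumulation steps
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 512
```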
### Training results

- train_loss: 0.0675
- train_samples_per_second: 40.608
- train_steps_per_second: 0.079
- epoch: 15
### Training hardware

2 × RTX Titan GPUs (24 GB each)
### Framework versions
- Transformers 4.27.2
- Pytorch 2.0.0+cu117
- Datasets 2.10.1
- Tokenizers 0.13.2
## Environmental footprint

Compute ran on a single GPU, with the second card used only for memory sharing. Estimated energy use: 280 W × 3 h = 0.84 kWh; at 0.3 kg CO2 eq./kWh, this corresponds to about 0.25 kg CO2 eq.