flan-t5-text2sparql-custom-tokenizer

This model is a fine-tuned version of google/flan-t5-base on the lc_quad dataset. It achieves the following results on the evaluation set:

Loss: 1.8039

Model description

This model uses the T5 tokenizer only for the input and a custom tokenizer for the SPARQL queries. This has led to a dramatic improvement in performance, although the model is not quite usable yet.

Intended uses & limitations

Because we used two different tokenizers, you cannot simply use this model in a pipeline. Use the following Python code as a starting point:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_checkpoint = "InfAI/flan-t5-text2sparql-custom-tokenizer"
question = "What was the population of Clermont-Ferrand on 1-1-2013?"
gold_answer = "SELECT ?obj WHERE { wd:Q42168 p:P1082 ?s . ?s ps:P1082 ?obj . ?s pq:P585 ?x filter(contains(YEAR(?x),'2013')) }"

# Load the fine-tuned model
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# The question is encoded with the original T5 tokenizer,
# the generated query is decoded with the custom SPARQL tokenizer
tokenizer_in = AutoTokenizer.from_pretrained("google/flan-t5-base")
tokenizer_out = AutoTokenizer.from_pretrained("InfAI/sparql-tokenizer")

# Prefix the question with the task prompt
sample = f"Create SPARQL Query: {question}"

inputs = tokenizer_in([sample], return_tensors="pt")
outputs = model.generate(**inputs)

print(f"Gold answer: {gold_answer}")
print("       Model:" + tokenizer_out.decode(outputs[0], skip_special_tokens=True))

Gold answer: SELECT ?obj WHERE { wd:Q42168 p:P1082 ?s . ?s ps:P1082 ?obj . ?s pq:P585 ?x filter(contains(YEAR(?x),'2013')) }
      Model: SELECT?obj WHERE { wd:Q4754 p:P1082?s.?s ps:P1082?obj.?s pq:P585?x filter(contains(YEAR(?x),'2013')) }

Common errors include:

A stray closing curly brace at the end
One of subject / predicate / object is wrong, while the other two are correct
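
The first error type can be detected, and often repaired, with a simple post-processing step. The following is only a minimal sketch, not part of this repository: the helper names `brace_balance` and `strip_stray_closing_braces` are hypothetical, and the code counts braces naively, so it assumes that no curly braces occur inside string literals of the query.

```python
def brace_balance(query: str) -> int:
    """Return (#opening braces - #closing braces); 0 means balanced."""
    return query.count("{") - query.count("}")


def strip_stray_closing_braces(query: str) -> str:
    """Drop trailing '}' characters until the braces are balanced.

    Naive approach (assumption): braces inside string literals are
    counted as well, so such queries may be handled incorrectly.
    """
    query = query.rstrip()
    while brace_balance(query) < 0 and query.endswith("}"):
        query = query[:-1].rstrip()
    return query


# Hypothetical model output with one stray closing brace at the end
example = "SELECT ?obj WHERE { wd:Q42168 p:P1082 ?s . ?s ps:P1082 ?obj } }"
print(strip_stray_closing_braces(example))
```

The second error type (a wrong subject, predicate, or object) cannot be fixed this way, because the query is syntactically valid in that case.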

Training and evaluation data

More information needed

Training procedure

We trained the model for 50 epochs, which was far more than necessary. The validation loss stagnates after about 25 epochs (it reaches its minimum of 1.7244 at epoch 23 and slowly rises afterwards, while the training loss keeps falling), and a manual look at some examples from the validation set showed us that the queries do not improve beyond this point with these hyperparameters. We were aware that the number of epochs was probably too high, but our goal was to find out how many epochs benefit performance.

There are two avenues we will explore to get rid of the common errors listed above:

Continue training with different hyperparameters
Apply more preprocessing to the dataset

The results will be uploaded to this repo.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 16
eval_batch_size: 16
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 50
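
The relation between these values can be checked with a few lines of arithmetic. This is a sketch under two assumptions: training ran on a single device (`num_devices = 1` is not stated in this card), and each epoch consists of the 301 optimizer steps reported in the training results. The number of examples per epoch is only an estimate, because the last accumulation step of an epoch may be incomplete.

```python
train_batch_size = 16
gradient_accumulation_steps = 4
num_devices = 1  # assumption: single-device training

# Effective batch size per optimizer step
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices

steps_per_epoch = 301  # taken from the training results
num_epochs = 50

total_steps = steps_per_epoch * num_epochs
approx_examples_per_epoch = steps_per_epoch * total_train_batch_size

print(f"total_train_batch_size:    {total_train_batch_size}")
print(f"total optimizer steps:     {total_steps}")
print(f"approx. examples / epoch:  {approx_examples_per_epoch}")
```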

Training results

Training Loss	Epoch	Step	Validation Loss
No log	1.0	301	2.6503
3.2271	2.0	602	2.3894
3.2271	3.0	903	2.2532
2.3957	4.0	1204	2.1631
2.18	5.0	1505	2.0788
2.18	6.0	1806	2.0195
2.0209	7.0	2107	1.9681
2.0209	8.0	2408	1.9353
1.9087	9.0	2709	1.8936
1.8114	10.0	3010	1.8683
1.8114	11.0	3311	1.8556
1.7254	12.0	3612	1.8284
1.7254	13.0	3913	1.8099
1.6556	14.0	4214	1.7932
1.5891	15.0	4515	1.7823
1.5891	16.0	4816	1.7691
1.528	17.0	5117	1.7569
1.528	18.0	5418	1.7578
1.4784	19.0	5719	1.7561
1.4288	20.0	6020	1.7514
1.4288	21.0	6321	1.7372
1.3793	22.0	6622	1.7318
1.3793	23.0	6923	1.7244
1.3436	24.0	7224	1.7382
1.3073	25.0	7525	1.7254
1.3073	26.0	7826	1.7494
1.2692	27.0	8127	1.7378
1.2692	28.0	8428	1.7387
1.242	29.0	8729	1.7290
1.2107	30.0	9030	1.7391
1.2107	31.0	9331	1.7458
1.1817	32.0	9632	1.7528
1.1817	33.0	9933	1.7521
1.1661	34.0	10234	1.7672
1.136	35.0	10535	1.7594
1.136	36.0	10836	1.7564
1.1216	37.0	11137	1.7670
1.1216	38.0	11438	1.7724
1.1031	39.0	11739	1.7766
1.0834	40.0	12040	1.7756
1.0834	41.0	12341	1.7786
1.0707	42.0	12642	1.7947
1.0707	43.0	12943	1.7931
1.058	44.0	13244	1.7925
1.0489	45.0	13545	1.7939
1.0489	46.0	13846	1.7969
1.0421	47.0	14147	1.7982
1.0421	48.0	14448	1.7994
1.0357	49.0	14749	1.8018
1.03	50.0	15050	1.8039
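
The validation losses in the table above can be used to check where training stops paying off. The sketch below finds the epoch with the lowest validation loss and simulates a simple patience-based early-stopping rule. The function `early_stop_epoch` and the patience of 3 are illustrative assumptions; no early stopping was used for the run reported here.

```python
# Validation loss per epoch (epochs 1 to 50), copied from the table above
val_losses = [
    2.6503, 2.3894, 2.2532, 2.1631, 2.0788, 2.0195, 1.9681, 1.9353, 1.8936, 1.8683,
    1.8556, 1.8284, 1.8099, 1.7932, 1.7823, 1.7691, 1.7569, 1.7578, 1.7561, 1.7514,
    1.7372, 1.7318, 1.7244, 1.7382, 1.7254, 1.7494, 1.7378, 1.7387, 1.7290, 1.7391,
    1.7458, 1.7528, 1.7521, 1.7672, 1.7594, 1.7564, 1.7670, 1.7724, 1.7766, 1.7756,
    1.7786, 1.7947, 1.7931, 1.7925, 1.7939, 1.7969, 1.7982, 1.7994, 1.8018, 1.8039,
]


def early_stop_epoch(losses, patience):
    """Return the epoch at which training would stop.

    Training stops once the loss has not improved on its best value
    for `patience` consecutive epochs (hypothetical rule).
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(losses)


# Epoch numbers are 1-based, list indices are 0-based
best_epoch = val_losses.index(min(val_losses)) + 1
stop_epoch = early_stop_epoch(val_losses, patience=3)

print(f"Best epoch: {best_epoch} (validation loss {min(val_losses):.4f})")
print(f"Early stopping with patience 3 would end training after epoch {stop_epoch}")
```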

Framework versions

Transformers 4.18.0
Pytorch 1.10.2+cu102
Datasets 2.4.0
Tokenizers 0.12.1