<h1>Model description</h1>

This is a fine-tuned BioBERT model for extracting primary outcomes from articles reporting clinical trials. This is the second version of the model; the original model development was reported in:

Anna Koroleva, Sanjay Kamath, Patrick Paroubek. Extracting primary and reported outcomes from articles reporting randomized controlled trials using pre-trained deep language representations. Preprint: https://easychair.org/publications/preprint/qpml

The original work was conducted within the PhD project "Assisted authoring for avoiding inadequate claims in scientific reporting", part of the Methods for Research on Research (MiRoR, http://miror-ejd.eu/) programme.

Model creator: Anna Koroleva

<h1>Intended uses & limitations</h1>

The model is intended to be used for extracting primary outcomes from texts of articles reporting clinical trials.

The main limitation is that the model was trained on a fairly small sample (2,000 sentences) annotated by a single annotator. Annotating more data or involving more annotators was not possible within the PhD project.

Another possible issue with using the model is the complex nature of outcomes: a typical description of an outcome can include the outcome name, a measurement tool and timepoints, e.g. "Health-Related Quality of Life at 12 months, measured using the Assessment of Quality of Life instrument". Ideally, this should be broken into three separate entities ("Health-Related Quality of Life" as the outcome, "at 12 months" as the timepoint, "the Assessment of Quality of Life instrument" as the measurement tool), and the relations between the three should be extracted to capture all the outcome-related information. However, in our annotation we annotated such examples as a single outcome entity.

<h1>How to use</h1>

The model should be used with the BioBERT tokeniser. Sample code for getting model predictions is shown below:


  import numpy as np
  from transformers import AutoTokenizer, AutoModelForTokenClassification

  # Load the BioBERT tokeniser and the fine-tuned model
  tokenizer = AutoTokenizer.from_pretrained('dmis-lab/biobert-v1.1')
  model = AutoModelForTokenClassification.from_pretrained('aakorolyova/primary_outcome_extraction')

  text = 'Primary endpoints were overall survival in patients with oesophageal squamous cell carcinoma and PD-L1 combined positive score (CPS) of 10 or more, and overall survival and progression-free survival in patients with oesophageal squamous cell carcinoma, PD-L1 CPS of 10 or more, and in all randomised patients.'

  # BERT-based models accept at most 512 tokens, so truncate to that length
  encoded_input = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors='pt')

  # Take the argmax over the logits to get the predicted label ID for each token
  output = model(**encoded_input)['logits']
  output = np.argmax(output.detach().numpy(), axis=2)
  print(output)
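
The output above is a matrix of predicted label IDs, one per input token. To see which tokens were tagged as parts of a primary outcome, the IDs can be mapped back to tokens and label names. A minimal sketch continuing the example above (the exact tag names depend on the model configuration; inspect model.config.id2label):

  # Map predicted label IDs back to tokens and label names;
  # the label set comes from the model config (model.config.id2label)
  tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])
  labels = [model.config.id2label[int(label_id)] for label_id in output[0]]
  for token, label in zip(tokens, labels):
      if token not in tokenizer.all_special_tokens:
          print(f'{token}\t{label}')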

Some more useful functions can be found in our GitHub repository: https://github.com/aakorolyova/DeSpin-2.0

<h1>Training data</h1>

Training data can be found at https://github.com/aakorolyova/DeSpin-2.0/tree/main/data/Primary_Outcomes

<h1>Training procedure</h1>

The model was fine-tuned using the Hugging Face Trainer API. Training scripts can be found at https://github.com/aakorolyova/DeSpin-2.0
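
For illustration, a fine-tuning setup with the Trainer API could look like the sketch below. The toy dataset, the label set (num_labels=3, e.g. O / B-OUT / I-OUT) and the hyperparameters are assumptions made for the sake of a runnable example, not the exact settings used to train this model; see the repository above for the actual scripts.

  import torch
  from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                            Trainer, TrainingArguments)

  tokenizer = AutoTokenizer.from_pretrained('dmis-lab/biobert-v1.1')
  # num_labels=3 is an assumption (e.g. an O / B-OUT / I-OUT tag set)
  model = AutoModelForTokenClassification.from_pretrained('dmis-lab/biobert-v1.1', num_labels=3)

  # Toy dataset with every token labelled 0 ('O'); real training data
  # carries token-level outcome labels aligned with the tokenised text
  class ToyDataset(torch.utils.data.Dataset):
      def __init__(self, texts):
          self.encodings = tokenizer(texts, truncation=True, padding=True)
          self.labels = [[0] * len(ids) for ids in self.encodings['input_ids']]

      def __len__(self):
          return len(self.labels)

      def __getitem__(self, idx):
          item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
          item['labels'] = torch.tensor(self.labels[idx])
          return item

  train_dataset = ToyDataset(['The primary outcome was overall survival.'])

  training_args = TrainingArguments(output_dir='out', num_train_epochs=1, per_device_train_batch_size=8)
  trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
  trainer.train()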

<h1>Evaluation</h1>

Precision: 74.41%

Recall: 88.7%

F1: 80.93%
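
For reference, entity-level precision, recall and F1 for token-classification output of this kind can be computed with the seqeval library. A small illustrative example (the OUT tag name is hypothetical, and this may not be the exact evaluation setup used for the figures above):

  # Entity-level metrics with seqeval: one of the two true outcome
  # entities is predicted exactly, so precision = 1.0 and recall = 0.5
  from seqeval.metrics import precision_score, recall_score, f1_score

  y_true = [['B-OUT', 'I-OUT', 'O', 'B-OUT', 'O']]
  y_pred = [['B-OUT', 'I-OUT', 'O', 'O', 'O']]

  print(precision_score(y_true, y_pred))  # 1.0
  print(recall_score(y_true, y_pred))     # 0.5
  print(f1_score(y_true, y_pred))         # ~0.67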