beto-prescripciones-medicas
Fine-tuned version of BETO for entity detection in medical prescriptions. More models and details can be found in our repository. This is a fine-tuned version of bert-clinical-scratch-wl-es from the PLN group @ CMM, which is itself a fine-tuned version of bert-base-spanish-wwm-uncased (BETO) from DCC UChile.
This work is part of a project that aims to build entity recognition models for prescription data from Minsal (the Chilean Ministry of Health), developed for the MDS7201 course of the Data Science MSc program at UChile. We use data from a Chilean hospital, which is not publicly available, but we do provide the files with which we trained the models. The procedure is as follows:
- We use a regular-expression (RegEx) model to tag around 100k unique samples from the original dataset.
- We fine-tune bert-clinical-scratch-wl-es on the RegEx-tagged data (5 epochs).
- We further fine-tune the model on human-tagged data (800 samples, 20 epochs).
- The model is evaluated on human-tagged data (200 samples).
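As a rough illustration of the first step, RegEx tagging could look like the sketch below. The patterns and label names here are hypothetical simplifications for illustration, not the actual rules used to tag the 100k samples:

```python
import re

# Hypothetical patterns: the actual RegEx rules used in the project
# are more extensive and are not reproduced here.
PATTERNS = {
    "CANT-ADMIN": re.compile(r"\d+"),
    "UND-ADMIN": re.compile(r"(COMPRIMIDO|CAPSULA|ML|MG)", re.IGNORECASE),
    "VIA-ADMIN": re.compile(r"(ORAL|INTRAVENOSA|TOPICA)", re.IGNORECASE),
}

def regex_tag(text):
    """Assign a label to each whitespace token, defaulting to 'O' (outside)."""
    tagged = []
    for token in text.split():
        label = "O"
        for name, pattern in PATTERNS.items():
            if pattern.fullmatch(token):
                label = name
                break
        tagged.append((token, label))
    return tagged

# Tokens matching no pattern fall through to 'O'
print(regex_tag("PARACETAMOL 1 COMPRIMIDO ORAL"))
```

Token-level tags produced this way are noisy, which is why the human-tagged samples are used for a second fine-tuning pass.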
The resulting evaluation metrics are:

| f1 | precision | recall |
|---|---|---|
| 0.93 | 0.92 | 0.94 |
Collaborators:
- Daniel Carmona G. (Electrical Engineering)
- Martín Sepúlveda (Electrical Engineering)
- Monserrat Prado (Computer Science Engineering)
- Camilo Carvajal Reyes (Mathematical Engineering)
Supervised by:
- Patricio Wolff (Minsal)
- Constanza Contreras (MDS7201 instructor)
- Francisco Förster (MDS7201 instructor)
Example
We provide a demo. There we introduce the functions needed to translate the model's output into understandable tags.
We also provide a complementary model: beto-prescripciones-medicas-ADMIN, which further tags the tokens that the current model labels as ADMIN. The demo includes this model, and the combined output of both is shown below:
| ACTIVE_PRINCIPLE | FORMA_FARMA | CANT-ADMIN | UND-ADMIN | VIA-ADMIN | PERIODICITY | DURATION |
|---|---|---|---|---|---|---|
| PARACETAMOL | 500 MG COMPRIMIDO | 1 | COMPRIMIDO | ORAL | cada 6 horas | durante 3 dias |
This example is also shown in this notebook, which uses the model as a black box.
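The kind of post-processing the demo performs can be sketched as follows. The BIO label scheme and the exact entity names below are assumptions for illustration; the demo's actual helper functions may differ:

```python
def merge_bio_tags(tokens, labels):
    """Merge token-level BIO labels into (entity_text, entity_type) spans.

    Assumes a standard BIO scheme: 'B-X' opens an entity of type X,
    'I-X' continues it, and 'O' marks tokens outside any entity.
    """
    entities = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["PARACETAMOL", "500", "MG", "COMPRIMIDO", "cada", "6", "horas"]
labels = ["B-ACTIVE_PRINCIPLE", "B-FORMA_FARMA", "I-FORMA_FARMA",
          "I-FORMA_FARMA", "B-PERIODICITY", "I-PERIODICITY", "I-PERIODICITY"]
print(merge_bio_tags(tokens, labels))
# [('PARACETAMOL', 'ACTIVE_PRINCIPLE'), ('500 MG COMPRIMIDO', 'FORMA_FARMA'), ('cada 6 horas', 'PERIODICITY')]
```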
Reproducibility
Training parameters (fine-tuning on RegEx-tagged data):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)
```
Training parameters (fine-tuning on human-tagged data):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=20,
    weight_decay=0.01,
)
```
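These arguments would typically be passed to a Hugging Face Trainer. The sketch below shows how that wiring might look; the checkpoint id, `NUM_LABELS`, and the dataset variables are placeholders, since the tagged datasets are not public and this fragment will not run as-is:

```python
# Sketch only: NUM_LABELS, train_dataset and eval_dataset are placeholders.
from transformers import AutoModelForTokenClassification, Trainer

model = AutoModelForTokenClassification.from_pretrained(
    "plncmm/bert-clinical-scratch-wl-es",  # base checkpoint; exact hub id may differ
    num_labels=NUM_LABELS,                 # placeholder: size of the tag set
)

trainer = Trainer(
    model=model,
    args=training_args,           # the TrainingArguments defined above
    train_dataset=train_dataset,  # placeholder: RegEx- or human-tagged split
    eval_dataset=eval_dataset,    # placeholder: held-out tagged split
)
trainer.train()
```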