SpanMarker for Disease Named Entity Recognition

This is a SpanMarker model for disease Named Entity Recognition, trained on the ncbi_disease dataset with bert-base-cased as the underlying encoder. See train.py for the training script.

Metrics

This model achieves the following results on the test set:

Overall Precision: 0.8661
Overall Recall: 0.8971
Overall F1: 0.8813
Overall Accuracy: 0.9837
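
As a quick consistency check, the reported F1 can be recomputed from the precision and recall above. This sketch assumes F1 is the usual harmonic mean of the two, which the card does not state explicitly; the function name `f1_score` is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the test set.
test_f1 = f1_score(0.8661, 0.8971)
print(round(test_f1, 4))  # 0.8813
```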

Labels

Label	Examples
DISEASE	"ataxia-telangiectasia", "T-cell leukaemia", "C5D", "neutrophilic leukocytosis", "pyogenic infection"

Usage

To use this model for inference, first install the span_marker library:

pip install span_marker

You can then run inference with this model like so:

from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-ncbi-disease")
# Run inference
entities = model.predict("Canavan disease is inherited as an autosomal recessive trait that is caused by the deficiency of aspartoacylase (ASPA).")
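
A minimal sketch of post-processing the predictions, assuming that `model.predict` returns, for a single input sentence, a list of dictionaries with `span`, `label`, and `score` keys, as described in the SpanMarker documentation. The helper `summarize_entities` and its `min_score` parameter are hypothetical names introduced here for illustration.

```python
from typing import Any, Dict, List, Tuple


def summarize_entities(
    entities: List[Dict[str, Any]], min_score: float = 0.0
) -> List[Tuple[str, str, float]]:
    """Return (span, label, score) tuples for entities scoring at least min_score."""
    return [
        (entity["span"], entity["label"], round(float(entity["score"]), 4))
        for entity in entities
        if entity["score"] >= min_score
    ]
```

For example, `summarize_entities(entities, min_score=0.5)` would keep only predictions with a confidence of at least 0.5; the threshold is arbitrary and should be tuned for the task at hand.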

See the SpanMarker repository for documentation and additional information on this library.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 32
eval_batch_size: 32
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3
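
To illustrate how `lr_scheduler_type: linear` and `lr_scheduler_warmup_ratio: 0.1` interact, here is a dependency-free sketch of a linear schedule with warmup: the learning rate rises linearly from 0 to its peak over the first 10% of steps, then decays linearly to 0. It approximates the behavior of the Transformers linear scheduler and is not taken from the training script; `total_steps` is a hypothetical input, since the card does not state the total number of optimizer steps.

```python
import math


def linear_warmup_lr(
    step: int,
    total_steps: int,
    peak_lr: float = 5e-05,
    warmup_ratio: float = 0.1,
) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: scale linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

Under these assumptions, the peak learning rate of 5e-05 is reached once warmup ends, and the rate reaches 0 at the final step.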

Training results

Training Loss	Epoch	Step	Validation Loss	Overall Precision	Overall Recall	Overall F1	Overall Accuracy
0.0038	1.41	300	0.0059	0.8141	0.8579	0.8354	0.9818
0.0018	2.82	600	0.0054	0.8315	0.8720	0.8513	0.9840

Framework versions

SpanMarker 1.2.4
Transformers 4.31.0
PyTorch 1.13.1+cu117
Datasets 2.14.3
Tokenizers 0.13.2