Model Description

TinyBioBERT is a distilled version of BioBERT, trained for 100k steps with a total batch size of 192 on the PubMed dataset.

Distillation Procedure

This model uses a distillation method called ‘transformer-layer distillation’, which is applied to each layer of the student to align its attention maps and hidden states with those of the teacher.
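The per-layer alignment described above can be sketched for a single student/teacher layer pair. This is a minimal illustration, not the authors' training code: it assumes a mean-squared-error objective and a learned linear projection from the student's hidden size to the teacher's, as in the TinyBERT formulation. The function name, argument names, and tensor shapes are hypothetical.

```python
import numpy as np


def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))


def transformer_layer_distillation_loss(
    student_attn: np.ndarray,    # (heads, seq_len, seq_len)
    teacher_attn: np.ndarray,    # (heads, seq_len, seq_len)
    student_hidden: np.ndarray,  # (seq_len, d_student)
    teacher_hidden: np.ndarray,  # (seq_len, d_teacher)
    projection: np.ndarray,      # (d_student, d_teacher), learned in practice
) -> float:
    """Loss for one student layer matched to one teacher layer.

    Assumption: both terms use MSE and are summed with equal weight.
    """
    # Attention term: pull the student's attention maps towards the teacher's.
    attn_loss = mse(student_attn, teacher_attn)
    # Hidden-state term: project the student's states into the teacher's
    # dimension before comparing, since the student is narrower.
    hidden_loss = mse(student_hidden @ projection, teacher_hidden)
    return attn_loss + hidden_loss


if __name__ == "__main__":
    # Toy shapes for illustration only; they are not the model's real sizes.
    rng = np.random.default_rng(0)
    heads, seq_len, d_student, d_teacher = 2, 5, 4, 6
    loss = transformer_layer_distillation_loss(
        rng.random((heads, seq_len, seq_len)),
        rng.random((heads, seq_len, seq_len)),
        rng.random((seq_len, d_student)),
        rng.random((seq_len, d_teacher)),
        rng.random((d_student, d_teacher)),
    )
    print(f"layer loss: {loss:.4f}")
```

In a full training loop this loss would be summed over all student layers, each paired with a chosen teacher layer; how layers are paired is not specified here and is left as an assumption.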

Architecture and Initialisation

This model uses 4 hidden layers with a hidden dimension size and an embedding size of 312, resulting in a total of 15M parameters. Because its hidden dimension is smaller than the teacher's, the student's weights cannot be copied from the teacher, so the model is randomly initialised.

Citation

If you use this model, please consider citing the following paper:

@misc{https://doi.org/10.48550/arxiv.2209.03182,
  doi = {10.48550/ARXIV.2209.03182},
  url = {https://arxiv.org/abs/2209.03182},
  author = {Rohanian, Omid and Nouriborji, Mohammadmahdi and Kouchaki, Samaneh and Clifton, David A.},
  keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences, 68T50},
  title = {On the Effectiveness of Compact Biomedical Transformers},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}