Molecular BERT Pretrained Using ChEMBL Database
This model was pretrained following the methodology described in the paper "Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration". The original model was trained with custom code; this project adapts it for use with the Hugging Face Transformers framework.
Model Details
The architecture is based on BERT, with the following configuration:
BertConfig(
    vocab_size=70,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=1024,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=max_seq_len,
    type_vocab_size=1,
    pad_token_id=tokenizer_pretrained.vocab["[PAD]"],
    position_embedding_type="absolute",
)
- Optimizer: AdamW
- Learning rate: 1e-4
- Learning rate scheduler: none
- Epochs: 50
- Mixed precision (AMP): enabled
- GPU: single NVIDIA RTX 3090
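For a sense of scale, the configuration above can be turned into an approximate parameter count. The sketch below assumes the standard BERT encoder layout (embeddings, identical Transformer layers, and a pooler, with no task head). The helper name `bert_parameter_count` is hypothetical, and `ASSUMED_MAX_SEQ_LEN = 512` is a placeholder, since this card does not state the value of `max_seq_len`.

```python
def bert_parameter_count(vocab_size, hidden_size, num_layers, intermediate_size,
                         max_position_embeddings, type_vocab_size=1,
                         include_pooler=True):
    """Count weights and biases of a standard BERT encoder (no task head)."""
    h, i = hidden_size, intermediate_size
    # Token, position and segment embeddings, plus LayerNorm scale and bias.
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h + 2 * h
    # Self-attention: Q, K, V and output projections (weights + biases), then LayerNorm.
    attention = 4 * (h * h + h) + 2 * h
    # Feed-forward: expand to intermediate size, project back, then LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + 2 * h
    # Optional pooler: one dense layer applied to the first token.
    pooler = h * h + h if include_pooler else 0
    return embeddings + num_layers * (attention + feed_forward) + pooler


# Assumption: placeholder value, the card leaves max_seq_len unspecified.
ASSUMED_MAX_SEQ_LEN = 512

total = bert_parameter_count(
    vocab_size=70,
    hidden_size=256,
    num_layers=8,
    intermediate_size=1024,
    max_position_embeddings=ASSUMED_MAX_SEQ_LEN,
)
print(f"{total:,} parameters")
```

Under these assumptions the encoder has roughly 6.5 million parameters. The number of attention heads does not change the count, because the heads only partition the existing projection matrices.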
Pretraining Database
The model was pretrained on data from ChEMBL version 33. The database can be downloaded from ChEMBL, and the pretraining dataset is also available on the Hugging Face Datasets Hub as ChEMBL_v33_pretraining.
Performance
The pretrained model achieves an accuracy of 0.9672 on a held-out test set comprising 10% of the ChEMBL dataset.
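This card does not say how the 10% test set was drawn or exactly what the accuracy measures. As a minimal sketch, assuming a seeded random split and an accuracy defined as the fraction of positions where the prediction matches the reference, an evaluation of this kind might look as follows. The helpers `holdout_split` and `accuracy` and the toy SMILES list are illustrative and are not taken from the original training code.

```python
import random


def holdout_split(items, test_fraction=0.1, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, expected):
    """Fraction of positions where the prediction equals the reference."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy over an empty sequence")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Toy stand-ins for ChEMBL SMILES strings (assumption, not real training data).
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "C1CCCCC1",
          "CC(C)O", "OCCO", "CCCC", "c1ccncc1", "CC#N"]
train, test = holdout_split(smiles)
print(len(train), len(test))
```

Fixing the seed keeps the held-out set identical across runs, so scores from different checkpoints remain comparable.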