# Model Card for hogru/MolReactGen-USPTO50K-Reaction-Templates
<!-- Provide a quick summary of what the model is/does. -->
MolReactGen is a model that generates reaction templates in SMARTS format (this model) and molecules in SMILES format.
## Model Details

### Model Description
<!-- Provide a longer summary of what this model is. -->
MolReactGen is based on the GPT-2 transformer decoder architecture and has been trained on a pre-processed version of the USPTO-50K dataset. More information can be found in the introductory slides linked under Model Sources below.
- Developed by: Stephan Holzgruber
- Model type: Transformer decoder
- License: MIT
### Model Sources
<!-- Provide the basic links for the model. -->
- Repository: https://github.com/hogru/MolReactGen
- Presentation: https://github.com/hogru/MolReactGen/blob/main/presentations/Slides%20(A4%20size).pdf
- Poster: https://github.com/hogru/MolReactGen/blob/main/presentations/Poster%20(A0%20size).pdf
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
The main use of this model is to pass the master's examination of the author ;-)
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
The model can be used in a Hugging Face text generation pipeline. For the intended use case, a wrapper around the raw text generation pipeline is needed; this is the `generate.py` from the repository.

The model has a default `GenerationConfig()` (`generation_config.json`) which can be overridden. Depending on the number of reaction templates to be generated (`num_return_sequences` in the JSON file), this might take a while. The generation code mentioned above (`generate.py`) shows a progress bar during generation.
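For illustration, the following is a minimal sketch of the underlying pipeline usage, without the post-processing that `generate.py` adds. It assumes the tokenizer defines a beginning-of-sequence token; the sampling settings and `num_return_sequences=5` are placeholder values, not the defaults from `generation_config.json`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

checkpoint = "hogru/MolReactGen-USPTO50K-Reaction-Templates"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Raw text-generation pipeline; generate.py wraps this with the
# post-processing needed for the intended use case.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Start generation from the BOS token (assuming the tokenizer defines one)
# and sample a handful of reaction templates; num_return_sequences
# overrides the value from generation_config.json.
outputs = generator(
    tokenizer.bos_token,
    do_sample=True,
    num_return_sequences=5,
)
for output in outputs:
    print(output["generated_text"])
```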
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The model generates reaction templates that are similar to the USPTO-50K training data. Any checks of the generated reaction templates, e.g. for chemical feasibility, must be addressed by the user of the model.
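As an example of a minimal first-pass check, the following sketch uses RDKit (not part of this repository) to test whether a generated template at least parses as reaction SMARTS; chemical feasibility requires further, domain-specific validation:

```python
from rdkit.Chem import AllChem


def is_parsable_template(smarts: str) -> bool:
    """Return True if the generated template parses as reaction SMARTS.

    This is only a syntactic check, not a check of chemical feasibility.
    """
    try:
        rxn = AllChem.ReactionFromSmarts(smarts)
    except ValueError:
        return False
    if rxn is None:
        return False
    # Validate() returns (number_of_warnings, number_of_errors)
    _, num_errors = rxn.Validate(silent=True)
    return num_errors == 0
```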
## Training Details

### Training Data
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
Pre-processed version of the USPTO-50K dataset, originally introduced by Schneider et al.
### Training Procedure

The default Hugging Face `Trainer()` has been used, with an `EarlyStoppingCallback()`; a sketch of the overall setup is shown after the hyperparameter list below.
#### Preprocessing

The training data was pre-processed (tokenized) with a `PreTrainedTokenizerFast()` that was trained on the training data, using a bespoke RegEx pre-tokenizer which "understands" the SMARTS syntax.
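A rough sketch of how such a tokenizer can be built with the `tokenizers` library follows; the regular expression, the word-level model, and the special tokens are simplified placeholders, and the actual pre-tokenizer and training setup are defined in the repository:

```python
from tokenizers import Regex, Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Simplified, illustrative pattern; the real SMARTS-aware regex is in the repository.
smarts_pattern = Regex(r"\[[^\]]+\]|Br|Cl|@@|%\d{2}|.")

# Assumption: a word-level vocabulary over the pre-tokenized SMARTS tokens.
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Split(smarts_pattern, behavior="isolated")

trainer = trainers.WordLevelTrainer(special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train_from_iterator(training_smarts, trainer=trainer)  # training_smarts: iterable of template strings

# Wrap it so it can be used with the Trainer / pipeline APIs.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    bos_token="[BOS]",
    eos_token="[EOS]",
)
```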
#### Training Hyperparameters
- Batch size: 8
- Gradient accumulation steps: 4
- Mixed precision: fp16, native amp
- Learning rate: 0.0005
- Learning rate scheduler: Cosine
- Learning rate scheduler warmup: 0.1
- Optimizer: AdamW with betas=(0.9,0.95) and epsilon=1e-08
- Number of epochs: 43 (early stopping)
More configuration options can be found in the `conf` directory of the repository.
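Putting the pieces together, the sketch below shows one way the procedure and hyperparameters above could map onto `TrainingArguments` and the `Trainer`. The maximum number of epochs, the early-stopping patience, the model configuration, and the dataset variables are assumptions; the exact values are in the `conf` directory:

```python
from transformers import (
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    GPT2Config,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
)

# GPT-2 style decoder sized to the bespoke tokenizer (fast_tokenizer from the
# preprocessing sketch above); the actual model configuration is in the repository.
model = GPT2LMHeadModel(GPT2Config(vocab_size=fast_tokenizer.vocab_size))

training_args = TrainingArguments(
    output_dir="molreactgen-uspto50k-reaction-templates",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    fp16=True,
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-8,
    num_train_epochs=100,           # assumed maximum; early stopping ended training after 43 epochs
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,    # required by EarlyStoppingCallback
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,    # placeholder: tokenized training split
    eval_dataset=eval_dataset,      # placeholder: tokenized validation split
    data_collator=DataCollatorForLanguageModeling(fast_tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience value is an assumption
)
trainer.train()
```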
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
Please see the slides / the poster mentioned above.
### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
Please see the slides / the poster mentioned above.
### Results
Please see the slides / the poster mentioned above.
## Technical Specifications

### Framework versions
- Transformers 4.27.1
- PyTorch 1.13.1
- Datasets 2.10.1
- Tokenizers 0.13.2
### Hardware
- Local PC running Ubuntu 22.04
- NVIDIA GeForce RTX 3080 Ti (12 GB)