
Model Card for hogru/MolReactGen-GuacaMol-Molecules


MolReactGen is a model that generates molecules in SMILES format (this model) and reaction templates in SMARTS format.

Model Details

Model Description


MolReactGen is based on the GPT-2 transformer decoder architecture and has been trained on the GuacaMol dataset. More information can be found in these introductory slides.

Model Sources


Uses


The main use of this model is to pass the author's master's examination ;-)

Direct Use


The model can be used with a Hugging Face text generation pipeline. For the intended use case, a wrapper around the raw text generation pipeline is needed; this wrapper is generate.py from the repository. The model ships with a default GenerationConfig() (generation_config.json), which can be overridden. Depending on the number of molecules to be generated (num_return_sequences in the JSON file), generation might take a while. generate.py shows a progress bar during generation.
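A minimal sketch of raw pipeline usage without the generate.py wrapper is shown below; the empty prompt and the sampling arguments are assumptions, and generate.py remains the recommended entry point.

```python
# Minimal sketch: raw text generation pipeline, without the generate.py
# wrapper. Assumes the tokenizer and generation_config.json bundled with the
# model repository are used as-is; the empty prompt relies on the tokenizer
# adding a beginning-of-sequence token.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="hogru/MolReactGen-GuacaMol-Molecules",
)

# num_return_sequences and do_sample override the defaults from the model's
# GenerationConfig for this call only.
outputs = generator("", num_return_sequences=5, do_sample=True)
for output in outputs:
    print(output["generated_text"])  # one generated SMILES string per entry
```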

Bias, Risks, and Limitations


The model generates molecules that are similar to the GuacaMol training data, which itself is based on ChEMBL. Any checks of the generated molecules, e.g. for chemical feasibility, must be addressed by the user of the model.
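For illustration, a basic validity filter with RDKit might look like the sketch below; RDKit is not a dependency of this model, and this only shows the kind of check that is left to the user.

```python
# Hedged example: basic SMILES validity filtering with RDKit. RDKit is not a
# dependency of this model; this only illustrates the kind of check that is
# left to the user.
from rdkit import Chem


def is_valid_smiles(smiles: str) -> bool:
    """Return True if RDKit can parse the SMILES string into a molecule."""
    return Chem.MolFromSmiles(smiles) is not None


generated = ["c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "not-a-molecule"]
valid = [s for s in generated if is_valid_smiles(s)]
print(valid)  # the unparseable entry is filtered out
```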

Training Details

Training Data


GuacaMol dataset

Training Procedure

The default Hugging Face Trainer() has been used, with an EarlyStoppingCallback().
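A minimal sketch of this setup follows; all argument values and variable names are illustrative assumptions, and the actual hyperparameters are in the conf directory of the repository.

```python
# Sketch of the training setup: default Hugging Face Trainer with an
# EarlyStoppingCallback. All argument values are illustrative assumptions;
# the real hyperparameters live in the repository's conf directory.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments


def build_trainer(model, train_dataset, eval_dataset):
    """Wire up a Trainer with early stopping for a causal LM on tokenized SMILES."""
    training_args = TrainingArguments(
        output_dir="molreactgen-guacamol",
        evaluation_strategy="epoch",   # "eval_strategy" in newer transformers releases
        save_strategy="epoch",
        load_best_model_at_end=True,   # required by EarlyStoppingCallback
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    )
    return Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
```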

Preprocessing

The training data was pre-processed with a PreTrainedTokenizerFast() trained on the training data, using a character-level pre-tokenizer and Unigram as the sub-word tokenization algorithm with a vocabulary size of 88. Other tokenizers can be configured.
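A rough sketch of such a tokenizer setup with the tokenizers library is shown below; the special token names, the Regex-based character split, and the tiny stand-in corpus are assumptions, and the repository's tokenizer configuration is authoritative.

```python
# Sketch of the tokenizer setup: character-level pre-tokenization followed by
# Unigram sub-word training, wrapped as a PreTrainedTokenizerFast. The special
# tokens and the Regex-based character split are assumptions.
from tokenizers import Regex, Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(models.Unigram())
# Split every character into its own pre-token before Unigram training.
tokenizer.pre_tokenizer = pre_tokenizers.Split(Regex("."), behavior="isolated")

trainer = trainers.UnigramTrainer(
    vocab_size=88,
    special_tokens=["<bos>", "<eos>", "<pad>", "<unk>"],  # assumed names
    unk_token="<unk>",
)

smiles_corpus = ["c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # stand-in for GuacaMol SMILES
tokenizer.train_from_iterator(smiles_corpus, trainer=trainer)

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<bos>",
    eos_token="<eos>",
    pad_token="<pad>",
    unk_token="<unk>",
)
print(fast_tokenizer.tokenize("c1ccccc1"))
```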

Training Hyperparameters

More configuration options can be found in the conf directory of the repository.

Evaluation


Please see the slides / the poster mentioned above.

Metrics


Please see the slides / the poster mentioned above.

Results

Please see the slides / the poster mentioned above.

Technical Specifications

Framework versions

Hardware