
This repo contains a fully fine-tuned LLaMA-7B, trained on the 🧬 protein-oriented instructions from the 🧪 Mol-Instructions dataset.

Instructions for running it can be found at https://github.com/zjunlp/Mol-Instructions.

Please refer to our paper for more details.


<h3> 🧬 Tasks</h3>

<details> <summary><b>Protein design</b></summary>

  1. The presence of Mg(2+) is necessary for the protein to function in the desired environment.
  2. The AMP, (6S)-NADPHX binding site should be located in a region of the protein that is accessible to the ligand.
  3. The designed protein should have ATP binding, NADPHX epimerase activity, metal ion binding, ADP-dependent NAD(P)H-hydrate dehydratase activity to facilitate nicotinamide nucleotide metabolic process.
  4. For general function, the protein should catalyze the epimerization of the S- and R-forms of NAD(P)HX, a damaged form of NAD(P)H that results from enzymatic or heat-dependent hydration.
MSNELVLSREQVRRVDQRAIEAYGVPGIVLMENAGRGAAEIIRAACPSAQRVLIACGPGNNGGDGFVIARHLANAGWMVELLLACPADRITGDAQGNHEIIRRMNLPCAVMADARDLEAANDRFATADVIVDALLGTGASGPPREPIASLIRAINEAHRRVSAQPAPSVFAVDIPSGLDCDTGEAANPTVRADHTITFVARKIGFRNPAARDLLGRVHVVDIGAPRAAIQDALTGKSG

</details>

<details> <summary><b>Catalytic activity prediction</b></summary>

</details>

<details> <summary><b>Protein function prediction</b></summary>

</details>

<details> <summary><b>Functional description generation</b></summary>

</details>

<details> <summary><b>Domain/Motif prediction</b></summary>

</details>

<h3> ๐Ÿ“ Demo</h3>

As illustrated in our repository, we provide an example of how to perform generation.

For the model fine-tuned on protein-oriented instructions, you can conveniently recover the model weights we trained with the following command.

Please download llama-7b-hf to obtain the pre-trained weights of LLaMA-7B, and point `$BASE_MODEL_PATH` to the location where those weights are saved.

Then replace $DIFF_WEIGHT_PATH with the path of our provided diff weights, and replace $RECOVER_WEIGHT_PATH with the desired path to save the recovered weights. If the directory of recovered weights lacks required files (e.g., tokenizer configuration files), you can copy them from $DIFF_WEIGHT_PATH.
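For instance, a minimal sketch of that copy step (the paths and filenames below are illustrative placeholders created for demonstration, not the repo's actual layout):

```shell
# Illustrative sketch: copy tokenizer files that the recovery step does not
# produce into the recovered-weights directory. Placeholder paths only.
DIFF_WEIGHT_PATH=./demo_diff
RECOVER_WEIGHT_PATH=./demo_recovered
mkdir -p "$DIFF_WEIGHT_PATH" "$RECOVER_WEIGHT_PATH"
touch "$DIFF_WEIGHT_PATH/tokenizer_config.json"   # stand-in for a real file
cp "$DIFF_WEIGHT_PATH"/tokenizer* "$RECOVER_WEIGHT_PATH"/
```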

```shell
python weight_diff.py recover \
  --path_raw $BASE_MODEL_PATH \
  --path_diff $DIFF_WEIGHT_PATH \
  --path_tuned $RECOVER_WEIGHT_PATH
```
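Conceptually, recovery adds each released diff tensor back onto the matching base tensor (tuned = base + diff). A minimal sketch of that idea, with plain Python lists standing in for weight tensors (the repo's actual weight_diff.py operates on model checkpoints):

```python
# Conceptual sketch of diff-weight recovery: tuned = base + diff,
# parameter by parameter. Plain lists stand in for weight tensors.
def recover(base, diff):
    assert base.keys() == diff.keys(), "state dicts must share parameter names"
    return {name: [b + d for b, d in zip(base[name], diff[name])]
            for name in base}

base = {"w": [1.0, 2.0]}   # toy base weights
diff = {"w": [0.5, -0.5]}  # toy released diff
tuned = recover(base, diff)
print(tuned["w"])  # [1.5, 1.5]
```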

After that, you can execute the following command to generate outputs with the fine-tuned LLaMA model.

```shell
python generate.py \
    --CLI True \
    --protein True \
    --base_model $RECOVER_WEIGHT_PATH
```
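The generation script expects instruction-style prompts; Mol-Instructions follows the Alpaca prompt format. A sketch of assembling such a prompt (the template text here is the standard Alpaca one and may differ from what generate.py actually builds, so check the repository for the exact template):

```python
# Standard Alpaca-style prompt template (assumed; verify against generate.py).
TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

# Hypothetical task: ask the model to describe a protein's function.
prompt = TEMPLATE.format(
    instruction="Please evaluate the protein sequence and describe its function.",
    input="MSNELVLSREQ...",  # truncated example sequence
)
print(prompt)
```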

<h3> 🚨 Limitations</h3>

The current state of the model, obtained via instruction tuning, is a preliminary demonstration. Its capacity to handle real-world, production-grade tasks remains limited.

<h3> 📚 References</h3>

If you use our repository, please cite the following related paper:

```bibtex
@article{molinst,
  title={Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models},
  author={Fang, Yin and Liang, Xiaozhuan and Zhang, Ningyu and Liu, Kangwei and Huang, Rui and Chen, Zhuo and Fan, Xiaohui and Chen, Huajun},
  journal={arXiv preprint arXiv:2306.08018},
  year={2023}
}
```

<h3> ๐Ÿซฑ๐Ÿปโ€๐Ÿซฒ Acknowledgements</h3>

We appreciate LLaMA, Hugging Face Transformers LLaMA, Alpaca, Alpaca-LoRA, Chatbot Service, and many other related works for their open-source contributions.