FlyBaseGeneAbstractClassifier

This repository hosts the FlyBaseGeneAbstractClassifier, a machine learning model that takes a Drosophila gene paired with a paper abstract and assigns the pair one of two labels:

LABEL_1: The gene is a topic of the paper.
LABEL_0: The gene is not a topic of the paper.
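As a small illustration, the two labels above can be wrapped in a lookup so downstream code does not compare raw strings. This is only a sketch: the names `LABEL_MEANINGS` and `is_topic` are hypothetical and not part of the model or the transformers library.

```python
# Hypothetical helper: maps the model's label strings to their meaning.
LABEL_MEANINGS = {
    "LABEL_1": "The gene is a topic of the paper.",
    "LABEL_0": "The gene is not a topic of the paper.",
}


def is_topic(label: str) -> bool:
    """Return True for LABEL_1, False for LABEL_0, and reject anything else."""
    if label not in LABEL_MEANINGS:
        raise ValueError(f"unexpected label: {label!r}")
    return label == "LABEL_1"
```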

The model was trained on a dataset built from open-access papers tagged by FlyBase as of February 2022. The training set consists of 43,000 gene-abstract pairs, and the model was tested on a separate set of 4,846 gene-abstract pairs.

Requirements

The model requires the transformers library. It was trained on a system with the following hardware:

CPU count: 6
GPU count: 1
GPU type: NVIDIA A100-SXM4-40GB

Usage

To use the model, follow these steps:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

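# Initialize the tokenizer and the model.
# NOTE: "scibert" is kept as written in the original instructions. If it does not
# resolve, a full Hub ID such as "allenai/scibert_scivocab_uncased" may be what
# was intended (this is an assumption, not confirmed by the authors).
tokenizer = AutoTokenizer.from_pretrained("scibert", model_max_length=512)
model = AutoModelForSequenceClassification.from_pretrained("cgrivaz/FlyBaseGeneAbstractClassifier", num_labels=2)

# Create a classification pipeline.
# "text-classification" is the canonical task name; "sentiment-analysis" is an alias for it.
model_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)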
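Once the pipeline exists, a single gene-abstract pair could be classified roughly as sketched below. This README does not state how the gene and the abstract were combined during training, so the input layout here (gene as the first segment, abstract as the second, via the pipeline's `text` / `text_pair` form) is an assumption, and `classify_pair` is a hypothetical helper name.

```python
from typing import Any, Callable, Dict


def classify_pair(pipe: Callable[..., Any], gene: str, abstract: str) -> Dict[str, Any]:
    """Classify one gene-abstract pair with a text-classification pipeline.

    Assumption: the gene symbol is the first segment and the abstract the
    second; the input format used in training is not documented here.
    """
    # The text / text_pair dictionary is the pipeline's input form for pairs.
    inputs = {"text": gene, "text_pair": abstract}
    # truncation=True is forwarded to the tokenizer so that long abstracts
    # are cut to the tokenizer's maximum length instead of raising an error.
    result = pipe(inputs, truncation=True)
    # A single input may come back as a dict or as a one-element list,
    # depending on the input form and library version; normalize to a dict.
    if isinstance(result, list):
        result = result[0]
    return result
```

Calling `classify_pair(model_pipeline, "dpp", abstract_text)`, where `abstract_text` holds the abstract as a string, should return a dictionary with a `label` key (LABEL_1 or LABEL_0) and a `score` key.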

Training and Evaluation

Detailed information about the training process and evaluation metrics can be found on the project's Weights & Biases page.

Limitations and Future Work

This is the initial version of the model. Potential biases have not been thoroughly investigated, and there are likely areas for improvement. Users are encouraged to provide feedback and report any issues they encounter.

Contributing

Contributions that improve the model are welcome. Please open an issue or submit a pull request.

License

MIT