Transformer Model for Medical English to Spanish Translation
- Model Name: Med_English2Spanish
- Model Type: Transformer-based Neural Machine Translation (NMT) Model
- Task: English to Spanish Medical Translation
Model Description
Med_English2Spanish is a specialized neural machine translation model for translating medical content from English to Spanish. It has been fine-tuned on medical-domain text, with the aim of producing accurate, contextually appropriate translations for healthcare professionals and researchers.
About Dataset:
The dataset used to train Med_English2Spanish is a subset of the "WMT-16-PubMed" dataset, curated and adapted for this machine translation task. It was compiled by collecting data from various reputable sources on the internet and integrating content from another medical dataset, which yields a diverse collection of medical documents.
- Training dataset: https://huggingface.co/datasets/ayoubkirouane/med_en2es
- Source dataset (WMT-16-PubMed): https://huggingface.co/datasets/qanastek/WMT-16-PubMed
Dataset Statistics:
- Source: Adapted from the WMT-16-PubMed dataset and other reputable medical sources.
- Total Examples: 286,000
- Content: The dataset comprises a wide range of medical texts, including research papers, clinical notes, and medical literature, covering various subfields within the healthcare domain.
- Data Cleaning: The dataset underwent rigorous data cleaning and preprocessing, including the removal of personally identifiable information (PII) to ensure privacy and compliance with ethical standards.
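The card does not describe how PII was actually removed, so the following is only a minimal sketch of what a regex-based cleaning step for a parallel corpus could look like. The patterns, the placeholder tokens, and the function names `scrub_pii` and `clean_pair` are all hypothetical and are not taken from the authors' pipeline.

```python
import re

# Hypothetical patterns: the model card does not document the real cleaning rules.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Loose phone pattern: optional "+", then at least 9 digits/separators in total.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def scrub_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def clean_pair(en: str, es: str) -> tuple:
    """Trim whitespace and scrub both sides of an English/Spanish sentence pair."""
    return scrub_pii(en.strip()), scrub_pii(es.strip())
```

A pattern this loose will also redact some harmless numeric spans (for example, two years separated by a space), so a real pipeline would likely need stricter rules or a dedicated de-identification tool.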
Ethical Considerations
Med_English2Spanish is intended for medical professionals and researchers. Care has been taken to minimize biases in translations and ensure privacy by stripping PII during preprocessing. However, users are encouraged to review translations for accuracy in sensitive medical contexts.
- Bias and Fairness: We have attempted to reduce bias, but users should be aware of potential translation biases.
- Privacy: PII has been removed, but users should handle sensitive data with caution.
- Transparency: The model does not explain its decisions; its behavior can only be assessed by inspecting its inputs and outputs.
Intended Use
Med_English2Spanish is designed for medical professionals, researchers, and students. It can be used for tasks like translating medical documents, research papers, and clinical notes from English to Spanish.
- Target Audience: Medical professionals, researchers, and students.
- Use Cases: Medical document translation, research paper translation, clinical note translation.
Limitations
- Data Limitations: Performance may degrade on extremely rare medical terms. The model is intended for English to Spanish only; other languages or translation directions are not supported.
- Generalization: The model may struggle with highly specialized subfields of medicine.
- Domain-Specific Challenges: Slang and colloquialisms may not be accurately translated.
Usage
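# Install dependencies (the leading "!" is notebook syntax; drop it in a shell). PyTorch must also be installed.
!pip -q install transformers[sentencepiece] sacremoses

# Load the tokenizer and model directly from the Hugging Face Hub
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("ayoubkirouane/Med_English2Spanish")
model = AutoModelForSeq2SeqLM.from_pretrained("ayoubkirouane/Med_English2Spanish")

# Tokenize a batch of English sentences and generate the Spanish translations
src_text = ['Adult pneumococcal sepsis: Should we rule out congenital anesthesia?']
inputs = tokenizer(src_text, return_tensors="pt", padding=True)
translated = model.generate(**inputs)

# Decode the generated token IDs back into text
print(tokenizer.batch_decode(translated, skip_special_tokens=True))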
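For longer documents, the same calls can be wrapped so that sentences are translated in fixed-size batches. The sketch below is one possible arrangement, not part of the published model: the helper names `batched`, `translate_in_batches`, and `make_hf_translator`, and the default batch size of 8, are assumptions chosen for illustration. The model is only downloaded when `make_hf_translator` is called.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_in_batches(
    texts: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumed default; tune to available memory
) -> List[str]:
    """Translate `texts` batch by batch, preserving the input order."""
    results: List[str] = []
    for batch in batched(texts, batch_size):
        outputs = translate_batch(batch)
        if len(outputs) != len(batch):
            raise ValueError("translator returned a different number of outputs")
        results.extend(outputs)
    return results


def make_hf_translator(
    model_name: str = "ayoubkirouane/Med_English2Spanish",
) -> Callable[[List[str]], List[str]]:
    """Build a batch translator backed by the Hugging Face model (downloads on first call)."""
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()

    def translate_batch(batch: List[str]) -> List[str]:
        # truncation=True clips inputs to the tokenizer's configured maximum length
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():  # inference only, no gradients needed
            generated = model.generate(**inputs)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate_batch
```

With these helpers, `translate_in_batches(sentences, make_hf_translator())` returns one Spanish string per English input sentence. Passing the translator in as a callable keeps the batching logic independent of the model, so it can be exercised with a stub.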