Model card for PM-AI/paraphrase-distilroberta-base-v2_de-en
For internal purposes and for testing, we used knowledge distillation to make a monolingual paraphrasing model from Sentence Transformers usable for German and English. We chose sentence-transformers/paraphrase-distilroberta-base-v2 because, to our knowledge, no multilingual version of this model is publicly available. In addition, it was trained on significantly more samples than its predecessor: 83.3 million instead of 24.6 million, roughly 3.4 times as many.
Training
- Download of datasets
- Execution of knowledge distillation
Training Data
Datasets used, according to the official source:
- AllNLI
- sentence-compression
- SimpleWiki
- altlex
- msmarco-triplets
- quora_duplicates
- coco_captions
- flickr30k_captions
- yahoo_answers_title_question
- S2ORC_citation_pairs
- stackexchange_duplicate_questions
- wiki-atomic-edits
Training Execution
First, we downloaded several German-English parallel datasets via get_parallel_data_*.py.
These datasets are Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, and News-Commentary.
Then we started knowledge distillation with make_multilingual_sys.py.
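To illustrate what the distillation step optimizes, here is a minimal sketch of the MSE objective commonly used for multilingual knowledge distillation of sentence embeddings. It is not the code of make_multilingual_sys.py. The names `distillation_loss`, `toy_teacher` and `toy_student` are hypothetical, and the toy encoders only stand in for the real teacher and student models. The idea is that the teacher embeds the English sentence, and the student is trained to reproduce that embedding for both the English sentence and its German translation.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    teacher: Callable[[str], Vector],
    student: Callable[[str], Vector],
    pairs: Sequence[Tuple[str, str]],
) -> float:
    """Average MSE over a batch of (english, german) sentence pairs.

    The teacher only sees the English sentence. The student should map
    both the English and the German sentence onto the teacher's vector.
    """
    total = 0.0
    for english, german in pairs:
        target = teacher(english)
        total += mse(student(english), target)
        total += mse(student(german), target)
    return total / (2 * len(pairs))


# Hypothetical toy encoders, for illustration only.
def toy_teacher(sentence: str) -> Vector:
    return [float(len(sentence)), 1.0]


def toy_student(sentence: str) -> Vector:
    return [float(len(sentence)), 0.0]


batch = [("Hello world", "Hallo Welt")]
loss = distillation_loss(toy_teacher, toy_student, batch)
print(loss)
```

A student that already matches the teacher on both languages yields a loss of zero, which is the state the training tries to approach.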
Parameterization of training
- Script: make_multilingual_sys.py
- Datasets: Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, News-Commentary
- GPU: NVIDIA A40 (Driver Version: 515.48.07; CUDA Version: 11.7)
- Batch Size: 64
- Max Sequence Length: 256
- Train Max Sentence Length: 600
- Max Sentences Per Train File: 1000000
- Teacher Model: sentence-transformers/paraphrase-distilroberta-base-v2
- Student Model: xlm-roberta-base
- Loss Function: MSE Loss
- Learning Rate: 2e-5
- Epochs: 20
- Evaluation Steps: 10000
- Warmup Steps: 10000
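The two data-related parameters above can be illustrated with a short sketch. We assume here that they behave as in the upstream Sentence Transformers multilingual example, that is, "Train Max Sentence Length" is a character limit per sentence and "Max Sentences Per Train File" caps the number of sentence pairs read from each file. The function name `load_parallel_sentences` and the tab-separated input format are assumptions made for illustration, not taken from make_multilingual_sys.py.

```python
from typing import Iterable, List, Tuple

# Values from the parameter list above.
TRAIN_MAX_SENTENCE_LENGTH = 600           # assumed to count characters
MAX_SENTENCES_PER_TRAIN_FILE = 1_000_000  # assumed to count sentence pairs


def load_parallel_sentences(
    lines: Iterable[str],
    max_sentences: int = MAX_SENTENCES_PER_TRAIN_FILE,
    max_sentence_length: int = TRAIN_MAX_SENTENCE_LENGTH,
) -> List[Tuple[str, str]]:
    """Read tab-separated (english, german) pairs with both limits applied."""
    pairs: List[Tuple[str, str]] = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip malformed lines without a translation
        if max(len(part) for part in parts) > max_sentence_length:
            continue  # skip pairs where either side is too long
        pairs.append((parts[0], parts[1]))
        if len(pairs) >= max_sentences:
            break  # stop once the per-file cap is reached
    return pairs


sample = [
    "Hello\tHallo\n",
    "x" * 601 + "\tzu lang\n",  # dropped: exceeds 600 characters
    "Good morning\tGuten Morgen\n",
    "Bye\tTschüss\n",           # not read: cap of 2 already reached
]
pairs = load_parallel_sentences(sample, max_sentences=2)
print(pairs)
```

With the values used for this model, overly long sentences are dropped before training and no single parallel file contributes more than one million pairs.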
Acknowledgment
This work is a collaboration between Technical University of Applied Sciences Wildau (TH Wildau) and sense.AI.tion GmbH. You can contact us via:
- Philipp Müller (M.Eng.); Author
- Prof. Dr. Janett Mohnke; TH Wildau
- Dr. Matthias Boldt, Jörg Oehmichen; sense.AI.tion GmbH
This work was funded by the European Regional Development Fund (EFRE) and the State of Brandenburg. Project: "ProFIT: Natürlichsprachliche Dialogassistenten in der Pflege" (natural-language dialogue assistants in nursing care).
<div style="display:flex"> <div style="padding-left:20px;"> <a href="https://efre.brandenburg.de/efre/de/"><img src="https://huggingface.co/datasets/PM-AI/germandpr-beir/resolve/main/res/EFRE-Logo_rechts_oweb_en_rgb.jpeg" alt="Logo of European Regional Development Fund (EFRE)" width="200"/></a> </div> <div style="padding-left:20px;"> <a href="https://www.senseaition.com"><img src="https://senseaition.com/wp-content/uploads/thegem-logos/logo_c847aaa8f42141c4055d4a8665eb208d_3x.png" alt="Logo of senseaition GmbH" width="200"/></a> </div> <div style="padding-left:20px;"> <a href="https://www.th-wildau.de"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f6/TH_Wildau_Logo.png/640px-TH_Wildau_Logo.png" alt="Logo of TH Wildau" width="180"/></a> </div> </div>