Sentence-Doctor

Sentence Doctor is a T5 model that attempts to correct errors and mistakes in sentences. The model works on English, German, and French text.

1. Problem:

Many NLP pipelines depend on upstream tasks like text extraction, OCR, speech-to-text, and sentence boundary detection. As a consequence, errors introduced by these tasks can affect the quality of downstream models in your applications, especially since those models are often trained on clean input.

2. Solution:

Here we provide a model that attempts to reconstruct sentences based on their context (surrounding text). The task is pretty straightforward: given an "erroneous" sentence and its context, reconstruct the intended sentence.

3. Use Cases:

- Repair sentences that were split in the wrong places by sentence boundary detection.
- Clean up noisy sentences produced by text extraction, OCR, or speech-to-text before they enter your NLP pipeline.

4. Disclaimer

Note how we always emphasize the word attempt. The current version of the model was trained on only 150K sentences from the Tatoeba dataset (https://tatoeba.org/eng), 50K per language (En, Fr, De). Hence, we strongly encourage you to fine-tune the model on your own dataset. We might release a version trained on more data.

5. Datasets

We generated synthetic data from the Tatoeba dataset (https://tatoeba.org/eng) by randomly applying different transformations to words and characters with certain probabilities. The datasets are available in the data folder, where sentence_doctor_dataset_300K is a larger dataset with 100K sentences per language.
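As an illustration of how such noisy examples can be produced (this is not the actual generation code; the function name and probabilities are our own), here is a minimal sketch:

```python
import random

def corrupt(sentence: str, p_char: float = 0.05, p_word: float = 0.1) -> str:
    """Illustrative corruption: randomly drop whole words and single characters."""
    kept_words = []
    for word in sentence.split():
        if random.random() < p_word:
            continue  # drop the whole word
        kept = "".join(c for c in word if random.random() >= p_char)  # drop characters
        kept_words.append(kept)
    return " ".join(kept_words)

print(corrupt("I am a medical doctor."))  # e.g. "I am a medcal doctor."
```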

6. Usage

6.1 Preprocessing

text = "That is my job I am a medical doctor I save lives"
sentences = ["That is my job I a", "m a medical doct", "I save lives"]

Here is the single preprocessing step for the model:

input_text = "repair_sentence: " + sentences[1] + " context: {" + sentences[0] + "}{" + sentences[2] + "} </s>"

Explanation:

```python
print(input_text)  # repair_sentence: m a medical doct context: {That is my job I a}{or I save lives} </s>
```

The context is optional, so the input could also be `repair_sentence: m a medical doct context: {}{} </s>`
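If you repair many fragments, you could wrap this step in a small helper. The function below is our own convenience sketch, not part of the repository:

```python
def build_input(sentence: str, left_context: str = "", right_context: str = "") -> str:
    """Assemble the model input in the expected format; both context parts may be empty."""
    return f"repair_sentence: {sentence} context: {{{left_context}}}{{{right_context}}} </s>"

print(build_input(sentences[1], sentences[0], sentences[2]))
# repair_sentence: m a medical doct context: {That is my job I a}{or I save lives} </s>
print(build_input("m a medical doct"))
# repair_sentence: m a medical doct context: {}{} </s>
```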

6.2 Inference


```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  # AutoModelWithLMHead is deprecated

tokenizer = AutoTokenizer.from_pretrained("flexudy/t5-base-multi-sentence-doctor")
model = AutoModelForSeq2SeqLM.from_pretrained("flexudy/t5-base-multi-sentence-doctor")

input_text = "repair_sentence: m a medical doct context: {That is my job I a}{or I save lives} </s>"

input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

assert sentence == "I am a medical doctor."
```
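Building on the tokenizer and model loaded above, a small wrapper of our own making ties preprocessing and generation together, with or without context:

```python
def repair(sentence: str, left_context: str = "", right_context: str = "") -> str:
    """Repair one sentence fragment; context is optional."""
    input_text = f"repair_sentence: {sentence} context: {{{left_context}}}{{{right_context}}} </s>"
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    outputs = model.generate(input_ids, max_length=32, num_beams=1)
    return tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

print(repair("m a medical doct", "That is my job I a", "or I save lives"))  # I am a medical doctor.
print(repair("I save live"))  # works without context too
```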

7. Fine-tuning

We also provide a script, train_any_t5_task.py, that may help you fine-tune any text-to-text task with T5. We added #TODO comments throughout to help you train with ease. For example:

```python
# TODO Set your training epochs
config.TRAIN_EPOCHS = 3
```

If you don't want to read the #TODO comments, just pass in your data like this:

```python
# TODO Where is your data ? Enter the path
trainer.start("data/sentence_doctor_dataset_300.csv")
```

and voilà! Please feel free to correct any mistakes in the code and make a pull request.
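If you want to prepare your own training file, here is a minimal sketch; the two-column input/target layout and the file path are assumptions on our part, so check the #TODO comments in train_any_t5_task.py for the exact format it expects:

```python
import csv

# Hypothetical training pairs in the model's input format: (input_text, target).
pairs = [
    ("repair_sentence: m a medical doct context: {That is my job I a}{or I save lives} </s>",
     "I am a medical doctor."),
]

with open("data/my_sentence_doctor_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(pairs)
```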

8. Attribution