Grammatical Error Correction
You can test the model at Grammatical Error Correction.
For more information, please contact us at sg-nlp@aisingapore.org.
Model Details
- Model Name: Cross Sentence GEC
- Description: This model is based on the convolutional encoder-decoder architecture described in the associated paper.
- Paper: Cross-sentence grammatical error correction. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, July 2019 (pp. 435-445).
- Author(s): Chollampatt, S., Wang, W., & Ng, H. T. (2019).
- URL: https://aclanthology.org/P19-1042
How to Get Started With the Model
Install Python package
SGnlp is an initiative by AI Singapore's NLP Hub. It aims to bridge the gap between research and industry, promote translational research, and encourage adoption of NLP techniques in the industry.
Various NLP models, in addition to grammatical error correction, are available in the Python package. You can try them out at SGNLP-Demo | SGNLP-Github.
pip install sgnlp
Examples
For the full code (such as Grammatical Error Correction), please refer to this GitHub repository. Alternatively, you can try out the Demo | SGNLP-Docs.
Example of Grammatical Error Correction:
from sgnlp.models.csgec import (
    CsgConfig,
    CsgModel,
    CsgTokenizer,
    CsgecPreprocessor,
    CsgecPostprocessor,
    download_tokenizer_files,
)

# Load the model configuration and the pretrained weights.
config = CsgConfig.from_pretrained("https://storage.googleapis.com/sgnlp-models/models/csgec/config.json")
model = CsgModel.from_pretrained(
    "https://storage.googleapis.com/sgnlp-models/models/csgec/pytorch_model.bin",
    config=config,
)

# Download the files for the source, context and target tokenizers.
download_tokenizer_files(
    "https://storage.googleapis.com/sgnlp-models/models/csgec/src_tokenizer/",
    "csgec_src_tokenizer",
)
download_tokenizer_files(
    "https://storage.googleapis.com/sgnlp-models/models/csgec/ctx_tokenizer/",
    "csgec_ctx_tokenizer",
)
download_tokenizer_files(
    "https://storage.googleapis.com/sgnlp-models/models/csgec/tgt_tokenizer/",
    "csgec_tgt_tokenizer",
)

# Build the tokenizers from the downloaded files.
src_tokenizer = CsgTokenizer.from_pretrained("csgec_src_tokenizer")
ctx_tokenizer = CsgTokenizer.from_pretrained("csgec_ctx_tokenizer")
tgt_tokenizer = CsgTokenizer.from_pretrained("csgec_tgt_tokenizer")

preprocessor = CsgecPreprocessor(src_tokenizer=src_tokenizer, ctx_tokenizer=ctx_tokenizer)
postprocessor = CsgecPostprocessor(tgt_tokenizer=tgt_tokenizer)

# Input passages to correct; each string is one passage.
texts = [
    "All of us are living in the technology realm society. Have you ever wondered why we use these tools to connect "
    "ourselves with other people? It started withthe invention of technology which has evolved tremendously over the "
    "past few decades. In the past, we travel by ship and now we can use airplane to do so. In the past, it took a few "
    "days to receive a message as we need to post our letter and now, we can use e-mail which stands for electronic "
    "message to send messages to our friends or even use our handphone to send our messages.",
    "Machines have replaced a bunch of coolies and heavy labor. Cars and trucks diminish the redundancy of long time "
    "shipment. As a result, people have more time to enjoy advantage of modern life. One can easily travel to the "
    "other half of the globe to see beautiful scenery that one dreams for his lifetime. One can also easily see his "
    "deeply loved one through internet from miles away.",
]

# Preprocess the passages, decode, and convert the predicted ids back to text.
batch_source_ids, batch_context_ids = preprocessor(texts)
predicted_ids = model.decode(batch_source_ids, batch_context_ids)
predicted_texts = postprocessor(predicted_ids)
Training
The training data comprises the Lang-8 and NUCLE datasets. They have to be requested from NAIST and NUS respectively.
Evaluation
The reported evaluation scores are based on the CoNLL-2014 benchmark. The full dataset can be downloaded from the shared task page.
Evaluation Scores
- Retrained scores: N/A. The demo uses the authors' original code.
- Scores reported in paper: Single Model F0.5: 53.06, Ensemble + BERT Rescoring F0.5: 54.87
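The scores above are F0.5 values, which weight precision more heavily than recall. As a minimal sketch of how such a score is derived from precision and recall, the snippet below uses the standard F-beta formula. The helper name `f_beta` and the precision and recall values are illustrative assumptions; they are not taken from the paper or from the sgnlp package.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 favours precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only; these are NOT results from the paper.
precision, recall = 0.60, 0.30

f05 = f_beta(precision, recall)            # precision-weighted score
f1 = f_beta(precision, recall, beta=1.0)   # balanced score, for comparison

print(f"F0.5 = {f05:.4f}, F1 = {f1:.4f}")
```

With these made-up inputs, F0.5 comes out higher than F1 because precision is the stronger of the two components, which is the behaviour a grammar checker usually wants: proposing a wrong correction is treated as worse than missing an error.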
Model Parameters
- Model Inputs: Source sentence (the sentence to be corrected), context (the two immediately preceding sentences), and target (either padding tokens and the start token, or the last 3 previously predicted tokens).
- Model Outputs: Array of logits for each token in the target vocabulary. This can be converted into the probability distribution for the next word using the softmax function.
- Model Inference Info: Not available.
- Usage Scenarios: Grammar and spell checker app / feature.
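The input and output descriptions above can be illustrated with a small, self-contained sketch. It pairs each sentence with up to two preceding sentences as context, and converts a toy logit vector into a next-token probability distribution with softmax. The helpers `pair_with_context` and `softmax`, and the logit values, are hypothetical and written only for illustration; the actual pairing and decoding are handled inside the sgnlp preprocessor and model and may differ in detail.

```python
import math


def pair_with_context(sentences, num_context=2):
    """Pair each sentence with up to `num_context` sentences that precede it."""
    pairs = []
    for i, source in enumerate(sentences):
        context = sentences[max(0, i - num_context):i]
        pairs.append((" ".join(context), source))
    return pairs


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


sentences = [
    "Machines have replaced a bunch of coolies and heavy labor.",
    "Cars and trucks diminish the redundancy of long time shipment.",
    "As a result, people have more time to enjoy advantage of modern life.",
]
pairs = pair_with_context(sentences)

# Made-up logits over a 3-token vocabulary (assumption, not model output).
toy_logits = [2.0, 1.0, 0.1]
probs = softmax(toy_logits)

# Greedy choice: index of the most probable next token.
next_token_id = max(range(len(probs)), key=probs.__getitem__)
```

In this sketch the first sentence has an empty context because nothing precedes it, and greedy selection is only one way to use the distribution; a real decoder may use beam search instead.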
Other Information
- Original Code: link