
RoBERTa base model fine-tuned on pronoun fill masking

This is RoBERTa base fine-tuned to fill masked pronouns only. The model is intended for post-processing machine-translated text, where sentence-level translation may lack the context needed to choose the correct pronoun.

This model was trained on 10B tokens of literature (a private light novel and book dataset, as well as books1 and 20% of books3 from The Pile).

This model achieves 88% top-1 accuracy when evaluated with a sliding window of 512 tokens (84% without a sliding window).
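The sliding-window evaluation is not described in further detail here, so the following is only a rough sketch of how a long token sequence might be split into overlapping windows. The function name `sliding_windows` and the `stride` value are assumptions made for illustration, not the settings used for the reported numbers.

```python
def sliding_windows(token_ids, window=512, stride=256):
    """Yield (start, chunk) pairs that cover token_ids with overlapping windows.

    `stride` is a hypothetical choice: the amount of overlap used in the
    reported evaluation is not stated.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    start = 0
    while True:
        yield start, token_ids[start:start + window]
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(token_ids):
            break
        start += stride


# Example: a 1000-token sequence yields windows starting at 0, 256 and 512.
starts = [start for start, _ in sliding_windows(list(range(1000)))]
```

When feeding windows to the model, the special start and end tokens also count toward the 512-token limit, so the usable window for text may need to be slightly smaller.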

How to use

Mask all pronoun tokens, then use the fill-mask pipeline to get the model's predictions.

PRONOUN_TOKENS = {
    'I',                           'ĠI',
    'you',    'You',    'Ġyou',    'ĠYou',
    'he',     'He',     'Ġhe',     'ĠHe',
    'she',    'She',    'Ġshe',    'ĠShe',
    'it',     'It',     'Ġit',     'ĠIt',
    'we',     'We',     'Ġwe',     'ĠWe',
    'they',   'They',   'Ġthey',   'ĠThey',
    'my',     'My',     'Ġmy',     'ĠMy',
    'your',   'Your',   'Ġyour',   'ĠYour',
    'his',    'His',    'Ġhis',    'ĠHis',
    'her',    'Her',    'Ġher',    'ĠHer',
    'its',    'Its',    'Ġits',    'ĠIts',
    'our',    'Our',    'Ġour',    'ĠOur',
    'their',  'Their',  'Ġtheir',  'ĠTheir',
    'mine',   'Mine',   'Ġmine',   'ĠMine',
    'yours',  'Yours',  'Ġyours',  'ĠYours',
    'hers',   'Hers',   'Ġhers',   'ĠHers',
    'ours',   'Ours',   'Ġours',   'ĠOurs',
    'theirs', 'Theirs', 'Ġtheirs', 'ĠTheirs',
}
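
A minimal sketch of the two steps, assuming the Hugging Face `transformers` library. The helper names `mask_pronouns` and `predict_pronouns` are hypothetical, and `model_name` must be supplied by the caller because the model ID is not given here. The set is rebuilt programmatically only to keep the example self-contained; it contains the same tokens as `PRONOUN_TOKENS` above.

```python
# Rebuild the same token set as PRONOUN_TOKENS above: each pronoun in lower and
# capitalized form, with and without the "Ġ" prefix that marks a leading space.
_BASE = ["you", "he", "she", "it", "we", "they", "my", "your", "his", "her",
         "its", "our", "their", "mine", "yours", "hers", "ours", "theirs"]
PRONOUN_TOKENS = {"I", "ĠI"}
for word in _BASE:
    PRONOUN_TOKENS |= {word, word.capitalize(), "Ġ" + word, "Ġ" + word.capitalize()}


def mask_pronouns(tokens, mask_token="<mask>"):
    """Replace each pronoun token with the mask token.

    Returns the masked token list and the positions that were masked.
    """
    masked, positions = [], []
    for i, token in enumerate(tokens):
        if token in PRONOUN_TOKENS:
            masked.append(mask_token)
            positions.append(i)
        else:
            masked.append(token)
    return masked, positions


def predict_pronouns(text, model_name, top_k=1):
    """Mask every pronoun in `text` and return the top_k predictions per mask.

    `model_name` is an assumption: pass the ID or local path of this model.
    """
    # Imported lazily so mask_pronouns stays usable without transformers installed.
    from transformers import AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    masked, positions = mask_pronouns(tokenizer.tokenize(text), tokenizer.mask_token)
    if not positions:
        return []
    masked_text = tokenizer.convert_tokens_to_string(masked)
    fill = pipeline("fill-mask", model=model_name, tokenizer=tokenizer)
    results = fill(masked_text, top_k=top_k)
    # A single mask returns a flat list of candidates; normalize to one list per mask.
    if isinstance(results[0], dict):
        results = [results]
    return [[candidate["token_str"] for candidate in candidates] for candidates in results]
```

This sketch passes the whole text to the pipeline in one call, so inputs longer than the model's 512-token limit would need to be split first.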