donut-base-ascii

This is "naver-clova-ix/donut-base" with all non-ASCII tokens removed, which makes the model a good fit for basic English use cases where the text is primarily a-zA-Z0-9 and basic punctuation.

The original model, "naver-clova-ix/donut-base", did not have a token for "1", so that has also been added. The notebook remove-donut-tokens.ipynb details the whole process.
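
As a rough illustration of the filtering step (the notebook remains the authoritative reference), a minimal sketch might look like the following. It assumes a SentencePiece-style vocabulary in which "▁" (U+2581) marks a word boundary and should be ignored when testing for ASCII. The names `split_vocab` and `missing_tokens` and the toy vocabulary are hypothetical, made up for this example.

```python
from __future__ import annotations

# Assumption: word-boundary marker used by SentencePiece-style tokenizers.
SPIECE_UNDERLINE = "\u2581"


def is_ascii_token(token: str) -> bool:
    """True if the token is pure ASCII once the boundary marker is ignored."""
    return token.replace(SPIECE_UNDERLINE, "").isascii()


def split_vocab(vocab: dict[str, int]) -> tuple[dict[str, int], list[int]]:
    """Return (tokens to keep, sorted ids of the tokens to remove)."""
    keep = {tok: idx for tok, idx in vocab.items() if is_ascii_token(tok)}
    drop = sorted(idx for tok, idx in vocab.items() if tok not in keep)
    return keep, drop


def missing_tokens(vocab: dict[str, int], required: str) -> list[str]:
    """List the single-character tokens in `required` that the vocab lacks."""
    return [ch for ch in required if ch not in vocab]


# Toy vocabulary, invented for illustration only.
toy_vocab = {
    "<s>": 0,
    "\u2581the": 1,
    "\u2581\u65e5\u672c": 2,  # a non-ASCII word piece
    "\u00e9": 3,              # "é"
    "0": 4,
    "2": 5,
    "\u2581": 6,
}

keep, drop = split_vocab(toy_vocab)
absent = missing_tokens(keep, "012")  # "1" is absent, as in the original model
```

A real run would also need to resize the model's embedding and output layers to match the kept ids, which this sketch does not attempt.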

This model has received no training beyond that of the original model.

I made a whole video about it: https://youtu.be/Uzr553x1gdM

I ran a quick generation speed test comparing this model against the default model, both with and without bad_words_ids. The bad_words_ids list covered only 12k tokens rather than the 30k that were removed, and it was still noticeably slower.

Speed script here
Launched with this

| approach | time to generate 10 tokens |
| - | - |
| "naver-clova-ix/donut-base" | 205ms |
| "naver-clova-ix/donut-base" + 12k bad_words_ids | 280ms |
| "donut-base-ascii" | 195ms |
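
For context on the bad_words_ids row, here is a minimal sketch of how such a list could be assembled. It is not the speed script itself. It assumes the same SentencePiece-style "▁" marker as the tokenizer vocabulary, and `build_bad_words_ids`, its `limit` parameter, and the toy vocabulary are hypothetical names for illustration.

```python
from __future__ import annotations

# Assumption: word-boundary marker used by SentencePiece-style tokenizers.
SPIECE_UNDERLINE = "\u2581"


def build_bad_words_ids(
    vocab: dict[str, int], limit: int | None = None
) -> list[list[int]]:
    """Build a list of single-token sequences covering non-ASCII tokens.

    Each banned token becomes its own one-element list, since the
    bad_words_ids argument expects a list of token-id sequences.
    """
    ids = sorted(
        idx
        for tok, idx in vocab.items()
        if not tok.replace(SPIECE_UNDERLINE, "").isascii()
    )
    if limit is not None:
        ids = ids[:limit]  # e.g. cap the list at a subset of the tokens
    return [[idx] for idx in ids]


# Toy vocabulary, invented for illustration only.
toy_vocab = {"<s>": 0, "\u2581the": 1, "\u2581\u65e5\u672c": 2, "\u00e9": 3}

bad_words_ids = build_bad_words_ids(toy_vocab)
capped = build_bad_words_ids(toy_vocab, limit=1)
```

The resulting list would be passed as `bad_words_ids=bad_words_ids` to the model's `generate` call. Removing the tokens from the model avoids carrying this list around at generation time, which is the trade-off the table compares.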