
Massively Multilingual Speech (MMS) - Common Crawl Language Models

This repository contains n-gram language models (LMs) trained on Common Crawl data (Conneau et al. 2020b, NLLB_Team et al. 2022) using the KenLM library.

For the following languages, the LMs are not included in the repository (because of the 50GB limit on HuggingFace) and can be downloaded using the links provided below.

Mandarin Chinese (Simplified) - Download LM

Japanese - Download LM

Thai - Download LM

Cantonese (Traditional) - Download LM

Table of Contents

Example

Check out the code at https://huggingface.co/spaces/mms-meta/MMS/blob/main/asr.py, which uses the LMs to decode the output of ASR models.
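To illustrate the general idea of LM-assisted decoding, here is a minimal, self-contained sketch of rescoring ASR hypotheses with an n-gram LM. It is not the code from `asr.py`. The bigram table, the flat penalty for unseen bigrams, the hypothesis scores, and the names `lm_score`, `rescore`, `lm_weight`, and `word_score` are all assumptions made up for this example. A real setup would obtain the LM scores from a KenLM model instead of a hand-written dictionary.

```python
# Toy bigram log-probabilities, invented for illustration only.
# A real KenLM model would supply these scores from its ARPA or binary file.
TOY_BIGRAM_LOGPROB = {
    ("<s>", "recognize"): -1.0,
    ("recognize", "speech"): -0.5,
    ("speech", "</s>"): -0.3,
    ("<s>", "wreck"): -2.0,
    ("wreck", "a"): -1.0,
    ("a", "nice"): -1.0,
    ("nice", "beach"): -1.5,
    ("beach", "</s>"): -0.5,
}

# Flat penalty for bigrams missing from the table (assumption: a real
# n-gram LM would back off to lower-order n-grams instead).
UNSEEN_LOGPROB = -5.0


def lm_score(sentence: str) -> float:
    """Sum the bigram log-probabilities of a sentence, with boundary tokens."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        TOY_BIGRAM_LOGPROB.get(pair, UNSEEN_LOGPROB)
        for pair in zip(words, words[1:])
    )


def rescore(hypotheses, lm_weight=1.0, word_score=0.0):
    """Return the hypothesis text with the best combined score.

    Each hypothesis is a (text, acoustic_log_score) pair. The combined score
    is: acoustic + lm_weight * lm_score(text) + word_score * number_of_words.
    """
    def combined(hypothesis):
        text, acoustic = hypothesis
        return acoustic + lm_weight * lm_score(text) + word_score * len(text.split())

    return max(hypotheses, key=combined)[0]


if __name__ == "__main__":
    # Hypothetical acoustic scores: the acoustic model slightly prefers
    # the implausible transcription.
    hyps = [("wreck a nice beach", -4.0), ("recognize speech", -4.5)]
    print(rescore(hyps, lm_weight=0.0))  # acoustic score only
    print(rescore(hyps, lm_weight=1.0))  # acoustic score plus LM score
```

With `lm_weight=0.0` the acoustically preferred hypothesis wins. With `lm_weight=1.0` the LM penalizes the unlikely word sequence and the other hypothesis is selected. In practice the LM weight and word score are tuned on held-out data.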

Supported Languages

We provide language models for 102 languages. Click the following to toggle the list of all languages supported by this checkpoint, given as ISO 639-3 codes. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview. <details> <summary>Click to toggle</summary>

Model details

Additional Links