Massively Multilingual Speech (MMS) - Common Crawl Language Models

This repository contains n-gram language models trained on Common Crawl data (Conneau et al. 2020b; NLLB Team et al. 2022) using the KenLM library.
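As a rough illustration of how a KenLM n-gram model might be queried, the sketch below uses the `kenlm` Python bindings (assumed to be installed separately). The model path is a placeholder rather than a file name from this repository, and the helper names are hypothetical. KenLM reports log10 probabilities, so the first helper converts a sentence score into a perplexity, counting the end-of-sentence token.

```python
def log10_to_perplexity(log10_prob: float, num_words: int) -> float:
    """Convert a sentence-level log10 probability into perplexity.

    The +1 accounts for the end-of-sentence token that is scored
    together with the words.
    """
    return 10.0 ** (-log10_prob / (num_words + 1))


def score_sentence(model_path: str, sentence: str):
    """Return (log10 probability, perplexity) for a whitespace-tokenized sentence.

    `model_path` is a placeholder for a downloaded LM file.
    """
    import kenlm  # imported lazily so the helper above works without it

    model = kenlm.Model(model_path)
    log10_prob = model.score(sentence, bos=True, eos=True)
    return log10_prob, log10_to_perplexity(log10_prob, len(sentence.split()))
```

For example, `score_sentence("path/to/lm.bin", "hello world")` would return the score and perplexity of that sentence, provided the placeholder path points to a real model file.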

The LMs for the following languages are not included in this repository (due to the 50 GB limit on Hugging Face) and can be downloaded using the links below.

Mandarin Chinese (Simplified) - Download LM

Japanese - Download LM

Thai - Download LM

Cantonese (Traditional) - Download LM

Example

Check out the code at https://huggingface.co/spaces/mms-meta/MMS/blob/main/asr.py, which uses the LMs to decode the output of ASR models.
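To give an intuition for what LM-assisted decoding does, here is a minimal toy sketch that rescores candidate transcripts by adding a weighted LM score to an acoustic score. It is not the implementation in `asr.py`: the bigram table, the penalty for unseen bigrams, and the `lm_weight` and `word_score` parameters are assumptions chosen for illustration, and a real setup would query a KenLM model inside a beam-search decoder.

```python
# Toy bigram log10 probabilities standing in for a real n-gram LM.
TOY_BIGRAMS = {
    ("<s>", "the"): -0.3,
    ("the", "cat"): -0.7,
    ("the", "cap"): -2.0,
    ("cat", "sat"): -0.5,
    ("cap", "sat"): -2.5,
    ("sat", "</s>"): -0.2,
}
UNSEEN_LOG10 = -4.0  # flat penalty for bigrams missing from the toy table


def lm_log10_score(words):
    """Sum bigram log10 probabilities over the sentence, including <s> and </s>."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(TOY_BIGRAMS.get(pair, UNSEEN_LOG10) for pair in zip(tokens, tokens[1:]))


def rescore(hypotheses, lm_weight=1.0, word_score=0.0):
    """Pick the best transcript from (text, acoustic_score) pairs.

    Combined score = acoustic + lm_weight * LM score + word_score * word count.
    """
    def total(hypothesis):
        text, acoustic = hypothesis
        words = text.split()
        return acoustic + lm_weight * lm_log10_score(words) + word_score * len(words)

    return max(hypotheses, key=total)[0]


candidates = [("the cap sat", -1.0), ("the cat sat", -1.2)]
best_with_lm = rescore(candidates, lm_weight=1.0)
best_without_lm = rescore(candidates, lm_weight=0.0)
```

With the LM switched off, the acoustically stronger "the cap sat" wins. With the LM weight set to 1.0, the more plausible "the cat sat" is chosen instead.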

Supported Languages

We provide language models for 102 languages. Click below to toggle the list of all supported languages, given as ISO 639-3 codes. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview. <details> <summary>Click to toggle</summary>

afr
amh
ara
asm
ast
azj
bel
ben
bos
bul
cat
ceb
ces
ckb
cmn
cym
dan
deu
ell
eng
est
fas
fin
fra
ful
gle
glg
guj
hau
heb
hin
hrv
hun
hye
ibo
ind
isl
ita
jav
jpn
kam
kan
kat
kaz
kea
khm
kir
kor
lao
lav
lin
lit
ltz
lug
luo
mal
mar
mkd
mlt
mon
mri
mya
nld
nob
npi
nso
nya
oci
orm
ory
pan
pol
por
pus
ron
rus
slk
slv
sna
snd
som
spa
srp
swe
swh
tam
tel
tgk
tgl
tha
tur
ukr
umb
urd
uzb
vie
wol
xho
yor
yue
zlm
zul </details>

Model details

Developed by: Vineel Pratap et al.
Model type: Multilingual Automatic Speech Recognition model
Language(s): 102 languages, see supported languages
License: CC-BY-NC 4.0
Num parameters: 1 billion
Audio sampling rate: 16,000 Hz
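Audio recorded at a different rate generally needs to be resampled before it is passed to a 16 kHz (16,000 Hz) speech model. The sketch below shows the idea with plain linear interpolation in pure Python. The function name is hypothetical, and linear interpolation applies no anti-aliasing filter, so real pipelines would normally use a dedicated resampler from an audio library.

```python
TARGET_RATE_HZ = 16_000  # sampling rate expected by the model


def resample_linear(samples, source_rate_hz, target_rate_hz=TARGET_RATE_HZ):
    """Resample a mono signal (sequence of floats) by linear interpolation."""
    if source_rate_hz == target_rate_hz or not samples:
        return list(samples)
    n_out = max(1, round(len(samples) * target_rate_hz / source_rate_hz))
    step = source_rate_hz / target_rate_hz  # input samples per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Blend the two neighbouring input samples.
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


one_second_48k = [0.0] * 48_000
one_second_16k = resample_linear(one_second_48k, 48_000)
```

One second of 48 kHz audio becomes 16,000 samples after resampling, which matches the expected rate.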

Cite as:

@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
  journal={arXiv},
  year={2023}
}

Additional Links

Blog post
Transformers documentation
Paper
GitHub Repository
Other MMS checkpoints
MMS base checkpoints:
- facebook/mms-1b
- facebook/mms-300m
Official Space
