GlotLID

Description

GlotLID is a fastText language identification (LID) model that supports more than 1600 languages.

How to use

Here is how to use this model to detect the language of a given text:

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.predict("Hello, world!")
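
The call above returns a pair: a tuple of labels and an array of matching probabilities. The sketch below shows one way to turn that pair into readable results. It assumes the labels follow the `__label__<language>_<script>` shape (for example `__label__eng_Latn`); `parse_label` and `identify` are hypothetical helper names, not part of fastText or GlotLID.

```python
LABEL_PREFIX = "__label__"  # prefix fastText puts in front of every label


def parse_label(label):
    """Split a label such as '__label__eng_Latn' into (language, script).

    Assumption: the part after the prefix is '<language>_<script>'.
    """
    code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    language, _, script = code.partition("_")
    return language, script


def identify(model, text, k=3):
    """Return the top-k guesses for `text` as (language, script, probability)."""
    # fastText's predict() raises ValueError on strings that contain a newline,
    # so line breaks are replaced with spaces before predicting.
    labels, probs = model.predict(text.replace("\n", " "), k=k)
    return [(*parse_label(label), float(prob)) for label, prob in zip(labels, probs)]
```

With the `model` loaded above, `identify(model, "Hello, world!", k=3)` would give the three most likely language and script pairs, most probable first.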

If you prefer not to use huggingface_hub, download the model directly:

$ wget https://huggingface.co/cis-lmu/glotlid/resolve/main/model.bin

>>> import fasttext

>>> model = fasttext.load_model("/path/to/model.bin")
>>> model.predict("Hello, world!")

License

The model is distributed under the Apache License, Version 2.0.

Version

Earlier versions of GlotLID are kept in the repository.

To download a specific version, append the version number to the filename.

model.bin always refers to the latest version (v2).
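
As a rough sketch, a pinned version could be loaded as shown below. The exact naming scheme is an assumption here: the code guesses a `model_v<N>.bin` pattern, so check the repository file list before relying on it. `versioned_filename` and `load_glotlid` are hypothetical helper names.

```python
def versioned_filename(version=None):
    """Build the model filename for a given version.

    Assumption: versioned files are named 'model_v<N>.bin'.
    With no version, 'model.bin' (the latest release) is used.
    """
    if version is None:
        return "model.bin"
    return f"model_v{version}.bin"


def load_glotlid(version=None, repo_id="cis-lmu/glotlid"):
    """Download the requested model file and load it with fastText."""
    # Imported here so the filename helper works without these packages installed.
    import fasttext
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(repo_id=repo_id, filename=versioned_filename(version))
    return fasttext.load_model(model_path)
```

Under that assumption, `load_glotlid(1)` would fetch the first release, while `load_glotlid()` follows whatever `model.bin` currently points to.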

References

If you use this model, please cite the following paper:

@inproceedings{
  kargaran2023glotlid,
  title={{GlotLID}: Language Identification for Low-Resource Languages},
  author={Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023},
  url={https://openreview.net/forum?id=dl4e3EBz5j}
}