Description
German word embedding model trained by Müller with the following parameter configuration:
- a corpus as big as possible (and as diverse as possible without being informal)
- filtering of punctuation and stopwords
- forming bigram tokens
- using skip-gram as training algorithm with hierarchical softmax
- window size between 5 and 10
- dimensionality of feature vectors of 300 or more
- using negative sampling with 10 samples
- ignoring all words with total frequency lower than 50
For more information, see https://devmount.github.io/GermanWordEmbeddings/
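As a rough illustration, the configuration listed above could be mapped onto gensim's `Word2Vec` keyword arguments as sketched below. This is not the author's training script. The keyword names assume gensim 4.x, `WORD2VEC_PARAMS` and `train` are hypothetical names, and `window=5` and `vector_size=300` are simply the lower ends of the stated ranges. Corpus preprocessing (filtering punctuation and stopwords, forming bigram tokens) is assumed to have happened already.

```python
# Hypothetical mapping of the listed settings onto gensim 4.x Word2Vec
# keyword arguments; the values the author actually used may differ.
WORD2VEC_PARAMS = {
    "sg": 1,             # skip-gram as training algorithm
    "hs": 1,             # hierarchical softmax
    "negative": 10,      # negative sampling with 10 samples
    "window": 5,         # assumption: lower end of the stated 5-10 range
    "vector_size": 300,  # assumption: lower end of "300 or more"
    "min_count": 50,     # ignore words with total frequency lower than 50
}


def train(sentences):
    """Train a model on an iterable of token lists (requires gensim >= 4.0)."""
    # Imported lazily so the parameter dict can be inspected without gensim.
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **WORD2VEC_PARAMS)
```

Calling `train(sentences)` with an iterable of preprocessed token lists would return a `Word2Vec` object whose vectors are available under its `wv` attribute.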
How to use?
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download
path = hf_hub_download(repo_id="Word2vec/german_model", filename="german.model")
model = KeyedVectors.load_word2vec_format(path, binary=True, unicode_errors="ignore")
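Once loaded, `model` can be queried like any gensim `KeyedVectors` object. Because words with a total frequency below 50 were dropped during training, a lookup can miss, so the minimal sketch below checks the vocabulary first. The helper name `nearest_neighbors` is hypothetical, and the `key_to_index` attribute assumes gensim 4.x.

```python
def nearest_neighbors(model, word, topn=5):
    """Return up to `topn` (word, similarity) pairs, or [] if `word` is unknown."""
    # Rare words were excluded from the vocabulary during training,
    # so check membership before querying to avoid a KeyError.
    if word not in model.key_to_index:
        return []
    return model.most_similar(positive=[word], topn=topn)
```

For example, `nearest_neighbors(model, "Haus")` would return the closest neighbours of that token if it is in the vocabulary, and an empty list otherwise.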
Citation
@thesis{mueller2015,
author = {{Müller}, Andreas},
title = "{Analyse von Wort-Vektoren deutscher Textkorpora}",
school = {Technische Universität Berlin},
year = 2015,
month = jun,
type = {Bachelor's Thesis},
url = {https://devmount.github.io/GermanWordEmbeddings}
}