
Model Card for Maken Books


Maken Books is a Doc2Vec model trained using Gensim on almost 600,000 books from the National Library of Norway.

Model Details

Model Description

Model Sources



Direct Use

The model can be used to cluster book-length texts or to find similarities between long-form texts using an embedding space of 1024 dimensions. It is used in production at the Maken site.

import re
from pathlib import Path

from gensim.models.doc2vec import Doc2Vec
from huggingface_hub import snapshot_download

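# Download the model repository from the Hugging Face Hub and load the Doc2Vec model
model = Doc2Vec.load(str(
    Path(snapshot_download("NbAiLab/maken-books")) / "model.bin"
))

book = "A long text"
# Split on runs of non-word characters and drop empty strings
words = [c for c in re.split(r"\W+", book) if len(c) > 0]
# Infer a 1024-dimensional embedding for the tokenized text
embedding = model.infer_vector(words)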
# array([ 0.01048528, -0.00491689,  0.01981961, ...,  0.00250911,
#        -0.00657777, -0.01207202], dtype=float32)
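
Once two texts have been embedded with `infer_vector`, their similarity can be scored. The card does not say which metric the Maken site uses; the sketch below assumes cosine similarity, a common choice for Doc2Vec vectors. The function name `cosine_similarity` and the toy three-dimensional vectors are illustrative only and stand in for real 1024-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors.

    Works on any sequence of numbers, including the NumPy arrays
    returned by model.infer_vector. Returns 0.0 if either vector is all zeros.
    """
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors (assumption): placeholders for real book embeddings.
emb_a = [0.1, -0.2, 0.3]
emb_b = [0.2, -0.4, 0.6]   # same direction as emb_a
emb_c = [-0.3, 0.0, 0.1]   # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values close to 0 indicate unrelated directions. How well these scores track topical similarity for a given pair of books is not something the card quantifies.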

Bias, Risks, and Limitations

The majority of the books used for training are written in Norwegian, either Bokmål or Nynorsk. As such, the semantics of the embedding space might not work as expected with books in other languages, since no work has been done to align the languages in the embedding space.

Training Details

Training Data

Books from the National Library of Norway up to November 21, 2022.

Training Procedure

Preprocessing

Plain text files are split on runs of non-word characters (whitespace and punctuation) with re.split(r"\W+", book).
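
As a minimal sketch of what this split does, the helper below wraps the same regular expression. The name `tokenize` is hypothetical, and dropping empty strings follows the Direct Use snippet; the card does not state whether training did the same. In Python 3, `\W` is Unicode-aware for `str` input, so letters such as `å` stay inside a token, while digits and underscores also count as word characters.

```python
import re


def tokenize(text):
    """Split text on runs of non-word characters and drop empty tokens."""
    return [token for token in re.split(r"\W+", text) if token]


print(tokenize("Bokmål og Nynorsk."))
# ['Bokmål', 'og', 'Nynorsk']
```

Note that the split keeps the original casing and does not separate digits from letters, so text passed to `infer_vector` should go through the same step to match the training vocabulary.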

Training Hyperparameters

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Technical Specifications

Model Architecture and Objective

Doc2Vec using distributed memory (PV-DM) with 1024 dimensions, while ignoring all words with total frequency lower than 1000.
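
The three settings named above map onto Gensim `Doc2Vec` constructor arguments roughly as follows. This is a sketch, not the actual training script: only `dm`, `vector_size`, and `min_count` come from the card, the names `DOC2VEC_PARAMS` and `build_model` are hypothetical, and every other hyperparameter (window, epochs, workers, and so on) is unknown and left at Gensim's defaults.

```python
# Settings stated in the model card, expressed as Gensim keyword arguments.
DOC2VEC_PARAMS = {
    "dm": 1,              # 1 selects distributed memory (PV-DM)
    "vector_size": 1024,  # dimensionality of the document embeddings
    "min_count": 1000,    # ignore words with total frequency below 1000
}


def build_model():
    """Create an untrained Doc2Vec model with the settings above.

    Gensim is imported lazily so that the settings can be inspected
    without the library installed. All unlisted parameters are Gensim
    defaults, which may differ from those used to train Maken Books.
    """
    from gensim.models.doc2vec import Doc2Vec

    return Doc2Vec(**DOC2VEC_PARAMS)
```

A model created this way still needs `build_vocab` and `train` calls on a corpus of tagged documents before it can infer vectors.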

Compute Infrastructure


Architecture:           x86_64
  CPU op-mode(s):       32-bit, 64-bit
  Address sizes:        46 bits physical, 48 bits virtual
  Byte Order:           Little Endian
CPU(s):                 96
  On-line CPU(s) list:  0-95
Vendor ID:              GenuineIntel
  Model name:           Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
    CPU family:         6
    Model:              85
    Thread(s) per core: 2
    Core(s) per socket: 24



Model Card Authors and Contact

Javier de la Rosa

Disclaimer

The models published in this repository are intended for general-purpose use and are available to third parties. These models may have bias and/or other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence. In no event shall the owner of the models (The National Library of Norway) be liable for any results arising from the use made by third parties of these models.