Model Card for Simple Latin BERT

A simple BERT masked language model for Latin, built for my portfolio and trained on Latin corpora from the Classical Language Toolkit.

NOT suitable for production or commercial use.
This model performs poorly and has not been evaluated.

This model comes with its own tokenizer! It lowercases input automatically.

Check the training notebooks folder for the preprocessing and training scripts.

Inspired by

This repo, which has a BERT model for Latin that is actually useful!
This tutorial
This tutorial
This tutorial

Table of Contents
Model Details
- Model Description
Uses
- Direct Use
- Downstream Use
Training Details
- Training Data
- Training Procedure
  - Preprocessing
  - Speeds, Sizes, Times
Evaluation

Model Details

Model Description

A simple BERT masked language model for Latin, built for my portfolio and trained on Latin corpora from the Classical Language Toolkit.

NOT suitable for production or commercial use.
This model performs poorly and has not been evaluated.

This model comes with its own tokenizer!

Check the notebooks folder for the preprocessing and training scripts.

Developed by: Luis Antonio VASQUEZ
Model type: Language model
Language(s) (NLP): la
License: mit

Uses

Direct Use

This model can be used directly for Masked Language Modelling.
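
As a minimal sketch of direct use, the snippet below queries the model through the transformers fill-mask pipeline. MODEL_PATH is a placeholder, not the real location of this model, and the helper names mask_word and predict_masked are made up for illustration. It also assumes the tokenizer uses the standard BERT [MASK] token.

```python
MODEL_PATH = "path/to/simple-latin-bert"  # placeholder: local folder or Hub id of the model


def mask_word(sentence, index, mask_token="[MASK]"):
    """Lowercase a sentence and replace the word at `index` with the mask token."""
    words = sentence.lower().split()
    words[index] = mask_token
    return " ".join(words)


def predict_masked(text, model_path=MODEL_PATH, top_k=5):
    """Return (token, score) pairs for the single masked position in `text`."""
    # Imported lazily so mask_word stays usable without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_path)
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=top_k)]
```

Calling predict_masked(mask_word("Gallia est omnis divisa in partes tres", 1)) would then list candidate tokens for the second word. Given the model's limited training, the suggestions should be treated as a demonstration only.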

Downstream Use

This model could be used as a base model for other NLP tasks, such as text classification (for example, with the BertForSequenceClassification class from transformers).
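
A hedged sketch of that downstream setup follows. The label set and the load_classifier helper are hypothetical, and model_path stands for wherever the model files live. The classification head created this way is randomly initialised, so it would still need fine-tuning on labelled data.

```python
# Hypothetical label set; replace it with the classes of your own task.
LABELS = ["prose", "verse"]
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def load_classifier(model_path):
    """Load the pretrained encoder with a new, untrained classification head."""
    # Imported lazily so the label mappings above need no extra dependencies.
    from transformers import AutoTokenizer, BertForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = BertForSequenceClassification.from_pretrained(
        model_path,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```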

Training Details

Training Data

The training data comes from the following corpora, freely available through the Classical Language Toolkit:

The Latin Library
Latin section of the Perseus Digital Library
Latin section of the Tesserae Project
Corpus Grammaticorum Latinorum

Training Procedure

Preprocessing

For preprocessing, the raw text of each corpus was extracted by parsing. It was then lowercased and written to txt files, ideally with one sentence per line.

Other annotations in the corpora, such as entity tags and POS tags, were discarded.
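
The lowercasing and one-sentence-per-line steps can be sketched roughly as below. This is not the code from the notebooks: the punctuation-based sentence splitter is an assumption chosen for brevity, and the function names are illustrative.

```python
import re
from pathlib import Path

# Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def to_sentence_lines(raw_text):
    """Lowercase raw text and return one approximate sentence per list item."""
    flattened = " ".join(raw_text.split())  # collapse newlines and repeated spaces
    sentences = _SENTENCE_END.split(flattened.lower())
    return [s for s in sentences if s]


def write_corpus_file(raw_text, out_path):
    """Write the lowercased sentences to a txt file, one per line."""
    lines = to_sentence_lines(raw_text)
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

A splitter this simple will mis-handle abbreviations, which is one reason the one-sentence-per-line layout is only approximate.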

Training hyperparameters:

Epochs: 1
Batch size: 64
Attention heads: 12
Hidden Layers: 12
Max input size: 512 tokens
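
For orientation, the values listed above might map onto transformers objects as in the sketch below. The mapping is an assumption: "Max input size" is taken to mean max_position_embeddings, the batch size is assumed to sit on a single GPU, and output_dir is a placeholder. Settings the list does not mention are left at library defaults. In practice vocab_size would have to match the model's own tokenizer.

```python
# Architecture values from the list above; everything not listed
# (vocabulary size, hidden size, ...) stays at the transformers default.
MODEL_KWARGS = {
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "max_position_embeddings": 512,  # assumed mapping for "Max input size"
}

TRAINING_KWARGS = {
    "num_train_epochs": 1,
    "per_device_train_batch_size": 64,  # assumes the batch of 64 sat on one GPU
}


def build_model_and_args(output_dir="latin-bert-out"):  # placeholder directory
    """Create an untrained BertForMaskedLM and matching TrainingArguments."""
    # Imported lazily so the settings above can be inspected without transformers.
    from transformers import BertConfig, BertForMaskedLM, TrainingArguments

    model = BertForMaskedLM(BertConfig(**MODEL_KWARGS))
    args = TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
    return model, args
```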

Speeds, Sizes, Times

Once the dataset was ready, training this model on a 16 GB Nvidia graphics card took around 10 hours.

Evaluation

No evaluation was performed on this model.