Vocab is:

\n\" !$&'#,/+=-<>*@.:;[]^?0123456789abcdefghijklmnopqrstuvwxyzèé§↨
§ (typed with Alt+21) is used as the end-of-file/end-of-sample marker.
↨ (typed with Alt+23) acts as the shift key: it is removed, and the character that follows it is replaced with its uppercase form.
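
As a rough illustration of how a character-level vocab like this can be turned into token IDs, here is a minimal sketch. The ID order (position in the string) and the names `VOCAB`, `stoi`, `itos`, `encode`, `decode` and `cut_at_eos` are assumptions made for the example; the actual IDs used in training are not stated here.

```python
# Vocab string copied from above. Assigning IDs by position is an assumption.
VOCAB = "\n\" !$&'#,/+=-<>*@.:;[]^?0123456789abcdefghijklmnopqrstuvwxyzèé§↨"
EOS = "§"  # end-of-file/end-of-sample marker

stoi = {ch: i for i, ch in enumerate(VOCAB)}  # character -> ID
itos = {i: ch for ch, i in stoi.items()}      # ID -> character


def encode(text):
    """Map each character to its ID (raises KeyError for characters outside the vocab)."""
    return [stoi[ch] for ch in text]


def decode(ids):
    """Map a list of IDs back to a string."""
    return "".join(itos[i] for i in ids)


def cut_at_eos(text):
    """Keep only the text before the first end-of-sample marker."""
    return text.split(EOS, 1)[0]


ids = encode("↨hello there§")
print(decode(ids))
print(cut_at_eos("first sample§second sample"))
```

Text must already be in the lowercase-plus-↨ form before it is encoded this way, since uppercase letters are not in the vocab.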

The model was trained on scraped YouTube subtitles and Whisper-generated transcripts of YouTube/TV shows, totalling approximately 2.3 billion tokens after processing.

The data was deduplicated, samples written entirely in UPPERCASE were removed, and a 'ranker' was run to remove random data that had somehow ended up in the YouTube subtitles (such as total gibberish).
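
The first two cleaning steps could look roughly like the sketch below. It is not the script that was actually used: exact-match deduplication and the function name `clean_samples` are assumptions, and the 'ranker' is not reproduced because its workings are not described here.

```python
def clean_samples(samples):
    """Drop empty samples, exact duplicates, and samples written entirely in uppercase."""
    seen = set()
    kept = []
    for sample in samples:
        key = sample.strip()
        if not key or key in seen:
            continue  # empty, or already seen (exact match after stripping whitespace)
        seen.add(key)
        # str.isupper() is True only when the text contains at least one cased
        # character and every cased character is uppercase.
        if key.isupper():
            continue
        kept.append(sample)
    return kept


raw = ["Hello there", "HELLO THERE", "Hello there", "123", ""]
print(clean_samples(raw))  # ['Hello there', '123']
```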

Training took 72 hours and was stopped when overfitting occurred (this is checkpoint 264000 out of a planned 400000).

gradient_accumulation_steps = 2 # used to simulate larger batch sizes
batch_size = 45 # if gradient_accumulation_steps > 1, this is the micro-batch size
block_size = 768    
n_layer = 12
n_head = 8
n_embd = 512
dropout = 0.00001 # for pretraining 0 is good, for finetuning try 0.1+
bias = False # do we use bias inside LayerNorm and Linear layers?
learning_rate = 0.0008 # max learning rate
min_lr = 0.00008
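
For a sense of scale, the settings above can be combined with the checkpoint number and dataset size into a back-of-the-envelope estimate. This is only a sketch: it assumes a single training process in which each iteration consumes `gradient_accumulation_steps * batch_size` sequences of `block_size` tokens, and it uses the common `12 * n_layer * n_embd**2` rule of thumb for the transformer-block weights (embeddings excluded). The variable names `iters_done` and `dataset_tokens` are introduced for the example.

```python
# Values copied from the settings and text above.
gradient_accumulation_steps = 2
batch_size = 45
block_size = 768
n_layer = 12
n_embd = 512

iters_done = 264_000    # checkpoint at which training was stopped
dataset_tokens = 2.3e9  # approximate dataset size in tokens

# Assumption: single process, so one iteration sees this many tokens.
tokens_per_iter = gradient_accumulation_steps * batch_size * block_size
tokens_seen = tokens_per_iter * iters_done
passes_over_data = tokens_seen / dataset_tokens

# Rule-of-thumb weight count for the attention + MLP blocks only.
approx_block_params = 12 * n_layer * n_embd ** 2

print(tokens_per_iter)             # 69120
print(tokens_seen)                 # 18247680000
print(round(passes_over_data, 1))  # 7.9
print(approx_block_params)         # 37748736
```

Under these assumptions the checkpoint has seen the dataset roughly eight times, which fits with overfitting showing up before the planned 400000 iterations.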

Function to fix text from the model (restores the uppercase letters):

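def remove_caseifer(text):
    """Undo add_caseifer: drop each ↨ marker and uppercase the character after it."""
    new_text = ""
    i = 0
    while i < len(text):
        if text[i] == "↨":
            # A marker at the very end has nothing to uppercase and is just dropped.
            if i + 1 < len(text):
                new_text += text[i + 1].upper()
                i += 1  # also consume the character that was just uppercased
        else:
            new_text += text[i]
        i += 1
    return new_text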

Function to prepare text for the model (lowercases it and inserts the ↨ markers):

def add_caseifer(text):
    """Prepare text for the model: uppercase letters become ↨ + lowercase,
    {} and () become [], and characters outside tokenlist are dropped."""
    # Characters accepted as input; anything else is silently dropped.
    tokenlist = set("\n\" !$&'#,/+=-<>*@.:;[]{}()^?0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzèé")
    replace_map = {  # Brackets the vocab does not contain are mapped to [ and ]
        "{": "[",
        "(": "[",
        "}": "]",
        ")": "]"
    }
    upperlist = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    new_text = ""
    for char in text:
        if char not in tokenlist:
            continue  # character is not supported by the model
        if char in upperlist:
            new_text += "↨" + char.lower()
        elif char in replace_map:
            new_text += replace_map[char]
        else:
            new_text += char
    return new_text
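
To show how the two functions fit together, here is a self-contained round-trip sketch. `encode_case` and `decode_case` are hypothetical compact restatements of `add_caseifer` and `remove_caseifer`, written with `str.translate` and `re.sub` so the example runs on its own; they are intended to behave the same way but are not the original code.

```python
import re

SHIFT = "↨"
# Characters that pass through unchanged (the vocab without § and ↨).
ALLOWED = set("\n\" !$&'#,/+=-<>*@.:;[]^?0123456789abcdefghijklmnopqrstuvwxyzèé")
BRACKETS = str.maketrans("{}()", "[][]")  # fold {} and () into []


def encode_case(text):
    """Compact equivalent of add_caseifer (hypothetical helper)."""
    out = []
    for ch in text.translate(BRACKETS):
        if "A" <= ch <= "Z":
            out.append(SHIFT + ch.lower())  # mark the letter as uppercase
        elif ch in ALLOWED:
            out.append(ch)
        # anything else is dropped
    return "".join(out)


def decode_case(text):
    """Compact equivalent of remove_caseifer (hypothetical helper)."""
    # Each marker is removed together with the next character, which is
    # re-emitted in uppercase; a marker at the very end is simply removed.
    return re.sub(SHIFT + "(.?)", lambda m: m.group(1).upper(), text, flags=re.DOTALL)


encoded = encode_case("Hello (World)!")
print(encoded)               # ↨hello [↨world]!
print(decode_case(encoded))  # Hello [World]!
```

The round trip is lossy by design: the original bracket type is not recovered, and any character outside the accepted set is gone after encoding.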