bilingual-gpt-neox-4b


Overview

This repository provides an English-Japanese bilingual GPT-NeoX model with 3.8 billion parameters.


Benchmarking


How to use the model

Notice: Because the model is sensitive to decoding hyperparameters (e.g., temperature, top_p, top_k, repetition_penalty), try several settings to find the ones that work best for your task.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer (slow implementation, use_fast=False) and the model weights.
tokenizer = AutoTokenizer.from_pretrained("rinna/bilingual-gpt-neox-4b", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("rinna/bilingual-gpt-neox-4b")

# Move the model to the GPU when one is available.
if torch.cuda.is_available():
    model = model.to("cuda")

# Japanese prompt (roughly "Kitaro Nishida ...", an open-ended sentence start).
text = "西田幾多郎は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=100,
        min_new_tokens=100,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
# Sample output (sampling is random, so your result will differ):
"""
西田幾多郎は、その著書「自覚の哲学」の中で、次のように書きました。  
「知識を、自分のものと考えることに満足していると、自己の限界に目覚めることを忘れてしまう。しかし、他者との協同なしには、自己の本当の理解に達することはできないのだ。知識は他者と相互の、協同の力によってこそ、得られるのである。」(引用終わり)  
この一節を、私たちは今から学び直すべきです。そして、これからの社会をリードする子どもたちに、その能力を伸ばすべく、
"""

# English prompt.
text = "Socrates says"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=100,
        min_new_tokens=100,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)

# Sample output (sampling is random, so your result will differ):
"""
Socrates says: he thinks that philosophy, as opposed to myth, can be demonstrated; as opposed to poetry, that it is not possible to have knowledge of the unknowable (that is, neither by reason nor by any art of divination). So in this case he is in agreement with Socrates in not thinking that we could prove the existence of the gods or of fate. Now, I do not know the content of Xenophon's _Symposium_, but he must have made a point of this passage that has ex
"""

# Code prompt.
text = "def bubble_sort(array):"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=200,
        min_new_tokens=200,
        do_sample=True,
        temperature=1.0,
        top_p=0.5,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
# Sample output (sampling is random, so your result will differ):
"""
def bubble_sort(array):
    for i in range(len(array)):
        for j in range(len(array)-1):
            if array[j] > array[j+1]:
                array[j], array[j+1] = array[j+1], array[j]
    return array

print(bubble_sort([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))

The code above will sort the array from 1 to 10 in the following order:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10

However, I am not sure how to do
"""
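
The notice above suggests exploring decoding settings for your task. One way to do that, sketched below, is to build a small grid of keyword arguments and pass each combination to `model.generate`. The helper names `decoding_grid` and `generate_with` are hypothetical, and the candidate values are illustrative assumptions rather than recommended settings for this model.

```python
from itertools import product


def decoding_grid(temperatures=(0.7, 1.0), top_ps=(0.5, 0.95),
                  top_ks=(0, 50), repetition_penalties=(1.0, 1.1)):
    """Return one dict of `generate` keyword arguments per combination.

    The default candidate values are illustrative assumptions only.
    top_k=0 turns top-k filtering off in transformers.
    """
    return [
        {
            "do_sample": True,
            "temperature": t,
            "top_p": p,
            "top_k": k,
            "repetition_penalty": r,
        }
        for t, p, k, r in product(temperatures, top_ps, top_ks, repetition_penalties)
    ]


def generate_with(model, tokenizer, text, settings, max_new_tokens=100):
    """Generate a continuation of `text` using one settings dict from the grid."""
    token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        **settings,
    )
    return tokenizer.decode(output_ids.tolist()[0])


grid = decoding_grid()
print(len(grid), "candidate settings")

# Usage, assuming `model` and `tokenizer` were loaded as shown earlier:
# for settings in grid:
#     print(settings)
#     print(generate_with(model, tokenizer, "Socrates says", settings))
```

Because sampling is random, comparing several generations per setting gives a fairer picture than judging each setting from a single output.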

Tokenization

The model uses a SentencePiece-based tokenizer.
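
To see how the tokenizer splits a given string, a small helper like the one below can list each piece next to its ID. This is a minimal sketch: `inspect_tokens` is a hypothetical name, and it relies only on the standard `tokenize` and `convert_tokens_to_ids` methods of a transformers tokenizer.

```python
def inspect_tokens(tokenizer, text):
    """Return (piece, id) pairs for `text`, without adding special tokens."""
    pieces = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(pieces)
    return list(zip(pieces, ids))


# Usage, assuming `tokenizer` was loaded as in "How to use the model":
# for piece, token_id in inspect_tokens(tokenizer, "西田幾多郎は、"):
#     print(token_id, repr(piece))
```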


License

The MIT license