japanese-stablelm causal-lm

Japanese-StableLM-Instruct-Alpha-7B

japanese-stablelm-icon

"A parrot able to speak Japanese, ukiyoe, edo period" — Stable Diffusion XL

Model Description

japanese-stablelm-instruct-alpha-7b is a 7B parameter decoder-only language models pre-trained built on top of the Japanese-StableLM-Base-Alpha-7B model and further fine-tuned on various instruction-following datasets.

Usage

First install additional dependencies in requirements.txt:

pip install sentencepiece einops

Then start generating text with japanese-stablelm-instruct-alpha-7b by using the following code snippet:

import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("novelai/nerdstash-tokenizer-v1", additional_special_tokens=['▁▁'])

model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/japanese-stablelm-instruct-alpha-7b",    
    trust_remote_code=True,
)
model.half()
model.eval()

if torch.cuda.is_available():
    model = model.to("cuda")

def build_prompt(user_query, inputs="", sep="\n\n### "):
    sys_msg = "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"
    p = sys_msg
    roles = ["指示", "応答"]
    msgs = [": \n" + user_query, ": "]
    if inputs:
        roles.insert(1, "入力")
        msgs.insert(1, ": \n" + inputs)
    for role, msg in zip(roles, msgs):
        p += sep + role + msg
    return p

# this is for reproducibility.
# feel free to change to get different result
seed = 42
torch.manual_seed(seed)

# Infer with prompt without any additional input
user_inputs = {
    "user_query": "VR とはどのようなものですか?",
    "inputs": ""
}
prompt = build_prompt(**user_inputs)

input_ids = tokenizer.encode(
    prompt, 
    add_special_tokens=False, 
    return_tensors="pt"
)

tokens = model.generate(
    input_ids.to(device=model.device),
    max_new_tokens=256,
    temperature=1,
    top_p=0.95,
    do_sample=True,
)

out = tokenizer.decode(tokens[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
print(out)
"""バーチャルリアリティは、現実の世界のように見える仮想世界の 3D 仮想現実のシミュレーションです。これは、ヘッドセットを介して、ユーザーが見たり、聞いたり、体験できるものです。"""
seed = 42
torch.manual_seed(seed)

# Infer with prompt with additional input
user_inputs = {
    "user_query": "VR について、以下の比較対象との違いを箇条書きで教えてください。",
    "inputs": "比較対象: AR"
}
prompt = build_prompt(**user_inputs)

input_ids = tokenizer.encode(
    prompt, 
    add_special_tokens=False, 
    return_tensors="pt"
)

tokens = model.generate(
    input_ids.to(device=model.device),
    max_new_tokens=256,
    temperature=1,
    top_p=0.95,
    do_sample=True,
)

out = tokenizer.decode(tokens[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
print(out)
"""
以下は、VR と AR の比較対象の比較です。
1. VR はユーザーが3D の世界を体験することを可能にし、ユーザーが自分の目で世界を見ることを可能にします。
2. VR は、ユーザーが目の前の環境をより詳細に感じ、より多くのことができるようにすることを可能にします。
3. VR は、ユーザーの感覚を刺激し、拡張することを可能にします。
4. VR は、視覚的、触覚的、および聴覚的な感覚体験を提供するために使用されます。
5. AR は、現実の世界に重ね合わせて、情報を表示し、ユーザーに拡張現実体験を提供することを可能にします。
6. AR は、ユーザーが仮想オブジェクトを仮想環境に持ち込むことを可能にするため、物理的な世界をシミュレートするのに最適です。
7. VR は、3D 世界を実現する仮想世界を作成することに最適です。
8. AR は、ユーザーが現実世界のオブジェクトをシミュレートし、現実世界の現実的な世界に重ね合わせて情報を表示することを可能にします。
9. VR は、ユーザーの感覚や感情に与える影響が最も大きいと考えられています。
"""

Model Details

Training

Parameters Hidden Size Layers Heads Sequence Length
7B 4096 32 32 1024

Training Dataset

japanese-stablelm-instruct-alpha-7b is fine-tuned on a combination of following datasets:

Use and Limitations

Intended Use

This model is intended to be used by the open-source community in chat-like applications in adherence with the research license.

Limitations and bias

Although the aforementioned datasets help to steer the base language models into "safer" distributions of text, not all biases and toxicity can be mitigated through fine-tuning. We ask that users be mindful of such potential issues that can arise in generated responses. Do not treat model outputs as substitutes for human judgment or as sources of truth. Please use responsibly.

Authors

Acknowledgements

We are utilizing the v1 version of the novelai-tokenizer, introduced by NovelAI, because it processes both Japanese and English text both effectively and efficiently. We extend our gratitude to NovelAI for allowing us to use their remarkable work. For more details about the tokenizer, please refer to their blog post.

We are grateful for the contributions of the EleutherAI Polyglot-JA team in helping us to collect a large amount of pre-training data in Japanese. Polyglot-JA members includes Hyunwoong Ko (Project Lead), Fujiki Nakamura (originally started this project when he committed to the Polyglot team), Yunho Mo, Minji Jung, KeunSeok Im, and Su-Kyeong Jang.

We are also appreciative of AI Novelist/Sta (Bit192, Inc.) and the numerous contributors from Stable Community Japan for assisting us in gathering a large amount of high-quality Japanese textual data for model training.

How to cite

@misc{JapaneseStableLMInstructAlpha7B, 
      url={[https://huggingface.co/stabilityai/japanese-stablelm-instruct-alpha-7b](https://huggingface.co/stabilityai/japanese-stablelm-instruct-alpha-7b)}, 
      title={Japanese StableLM Instruct Alpha 7B}, 
      author={Lee, Meng and Nakamura, Fujiki and Shing, Makoto and McCann, Paul and Akiba, Takuya and Orii, Naoki}
}

Citations

@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}
@software{gpt-neox-library,
  title = {{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
  author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
  url = {https://www.github.com/eleutherai/gpt-neox},
  doi = {10.5281/zenodo.5879544},
  month = {8},
  year = {2021},
  version = {0.0.1},
}