Model Card for KEByT5-base (580M parameters)
<!-- Provide a quick summary of what the model is/does. --> KEByT5: Korean-Enhanced/Enriched Byte-level Text-to-Text Transfer Transformer (T5)
Cross-modal, Multilingual Friendly Token-free Pretrained Language Model
- This pretrained language model aims to be a token-free pretrained language model that readily supports cross-lingual knowledge exchange and non-text modalities such as vision and audio.
- The model is currently at the preview stage, and fine-tuning is required before use.
Acknowledgements
- This pretrained language model was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training).
Model Details
This pretrained language model is provided in the following sizes:
- kebyt5-mini : 124M
- kebyt5-small : 330M
- kebyt5-base : 580M
- kebyt5-large : 1.23B (to be released later)
In particular, the small and base models share the same network architecture and size as google/byt5-small and google/byt5-base; since the tokenizer (ByT5Tokenizer) and implementation are identical, the two model families can be swapped without any modification. In huggingface transformers, they are likewise used through T5ForConditionalGeneration.
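As an illustration of this drop-in compatibility, a minimal loading sketch follows; the local checkpoint path is a placeholder, not an official Hub ID.

```python
# Minimal loading sketch (assumption: the KEByT5 checkpoint has been obtained
# and saved locally; "path/to/kebyt5-base" is a placeholder, not an official ID).
from transformers import ByT5Tokenizer, T5ForConditionalGeneration

# The byte-level tokenizer is the same one shipped with google/byt5-*.
tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")

# The same class loads either google/byt5-base or a kebyt5-base checkpoint.
model = T5ForConditionalGeneration.from_pretrained("path/to/kebyt5-base")
```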
Model Description
<!-- Provide a longer summary of what this model is. -->
- Developed by: Language Intelligence Research Section, Electronics and Telecommunications Research Institute (ETRI)
- Model type: Encoder-Decoder Transformer, specifically, ByT5.
- Language(s) (NLP): Korean, English, Chinese, Japanese.
- License: [More Information Needed]
- Finetuned from model [optional]: the kebyt5-small/-base/-large model weights were initialized from google/byt5-* as a warm start (language-adaptation pre-training)
Model Sources [optional]
<!-- Provide the basic links for the model. -->
- Repository: https://github.com/etri-crossmodal/etri-llm-byt5 (currently a private repository; it will be made public when ready)
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> Use of this pretrained language model is restricted to research and educational purposes.
Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
The currently released model has been trained only with the corrupted-span denoising objective used for T5 pre-training, so a fine-tuning step is required before it can be applied to real downstream tasks.
Downstream Use [optional]
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
As a token-free model, it is robust to complex or noisy input and is well suited to generating short sequences (e.g., language understanding, dialogue response generation). Since pre-training used sequences of 1024 bytes, it is not well suited to tasks involving sequences longer than that.
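As an illustration only (the checkpoint path, example pair, and learning rate are placeholders, not recommendations), a single text-to-text fine-tuning step that respects the 1024-byte limit might look like this:

```python
# Hypothetical single fine-tuning step; real training would loop over a task dataset.
import torch
from transformers import ByT5Tokenizer, T5ForConditionalGeneration

tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("path/to/kebyt5-base")  # placeholder path
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Inputs are truncated to 1024 bytes, matching the pre-training sequence length.
inputs = tokenizer("질문: 한국의 수도는 어디인가요?", return_tensors="pt",
                   max_length=1024, truncation=True)
labels = tokenizer("서울입니다.", return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```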
Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
[More Information Needed]
Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]
Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model.
[More Information Needed]
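Until an official example is provided, a minimal inference sketch might look as follows (the checkpoint path is a placeholder; since the released weights have only been pre-trained with span denoising, the raw output is not expected to be directly useful without fine-tuning):

```python
# Minimal inference sketch for the span-denoising-only checkpoint (placeholder path).
from transformers import ByT5Tokenizer, T5ForConditionalGeneration

tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("path/to/kebyt5-base")

text = "안녕하세요. 오늘 날씨가 참 좋네요."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```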
Training Details
Training Data
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
[More Information Needed]
Training Procedure [optional]
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Preprocessing
[More Information Needed]
Speeds, Sizes, Times
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
[More Information Needed]
Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
Testing Data, Factors & Metrics
Testing Data
<!-- This should link to a Data Card if possible. -->
[More Information Needed]
Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
[More Information Needed]
Results
[More Information Needed]
Summary
Model Examination [optional]
<!-- Relevant interpretability work for the model goes here -->
[More Information Needed]
Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
- Trained on 4 × NVIDIA A100 80GB GPUs
Hardware
[More Information Needed]
Software
- pytorch-lightning
- huggingface/transformers
- microsoft/deepspeed
- ...
Citation [optional]
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]