Model Card for KEByT5-base (580M parameters)
<!-- Provide a quick summary of what the model is/does. --> KEByT5: Korean-Enhanced/Enriched Byte-level Text-to-Text Transfer Transformer (T5)
Cross-modal, Multilingual Friendly Token-free Pretrained Language Model
- This pretrained language model aims to be a token-free pretrained language model that readily supports cross-lingual knowledge exchange and non-text modalities such as vision and audio.
- The model is currently at the preview stage, and fine-tuning is required before use.
Acknowledgements
- This pretrained language model was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training).
Model Details
This pretrained language model is provided in the following sizes:
- kebyt5-mini : 124M
- kebyt5-small : 330M
- kebyt5-base : 580M
- kebyt5-large : 1.23B (to be released later)
In particular, the small and base models share the same network architecture and size as google/byt5-small and google/byt5-base; since the tokenizer (ByT5Tokenizer) and implementation are identical, the two model families can be swapped without any modification. In huggingface transformers, they are likewise used through T5ForConditionalGeneration.
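As an illustration of this drop-in compatibility, a minimal loading sketch follows; the local checkpoint path is a placeholder, not an official Hub ID.

```python
# Minimal loading sketch (assumption: the KEByT5 checkpoint has been obtained
# and saved locally; "path/to/kebyt5-base" is a placeholder, not an official ID).
from transformers import ByT5Tokenizer, T5ForConditionalGeneration

# The byte-level tokenizer is the same one shipped with google/byt5-*.
tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")

# The same class loads either google/byt5-base or a kebyt5-base checkpoint.
model = T5ForConditionalGeneration.from_pretrained("path/to/kebyt5-base")
```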
Model Description
<!-- Provide a longer summary of what this model is. -->
- Developed by: Language Intelligence Research Section, Electronics and Telecommunications Research Institute (ETRI)
- Model type: Encoder-Decoder Transformer, specifically, ByT5.
- Language(s) (NLP): Korean, English, Chinese, Japanese.
- License: [More Information Needed]
- Finetuned from model [optional]: the kebyt5-small/-base/-large model weights were initialized from google/byt5-* as a warm start (language-adaptation pre-training)
Model Sources [optional]
<!-- Provide the basic links for the model. -->
- Repository: https://github.com/etri-crossmodal/etri-llm-byt5 (currently a private repository; it will be made public when ready)
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> Use of this pretrained language model is restricted to research and educational purposes.
Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
The currently released model has been trained only with the corrupted-span denoising objective used for T5 pre-training, so a fine-tuning step is required before it can be applied to real downstream tasks.
Downstream Use [optional]
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
As a token-free model, it is robust to complex or noisy input and is well suited to generating short sequences (e.g., language understanding, dialogue response generation). Since pre-training used sequences of 1024 bytes, it is not well suited to tasks involving sequences longer than that.
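As an illustration only (the checkpoint path, example pair, and learning rate are placeholders, not recommendations), a single text-to-text fine-tuning step that respects the 1024-byte limit might look like this:

```python
# Hypothetical single fine-tuning step; real training would loop over a task dataset.
import torch
from transformers import ByT5Tokenizer, T5ForConditionalGeneration

tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("path/to/kebyt5-base")  # placeholder path
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Inputs are truncated to 1024 bytes, matching the pre-training sequence length.
inputs = tokenizer("질문: 한국의 수도는 어디인가요?", return_tensors="pt",
                   max_length=1024, truncation=True)
labels = tokenizer("서울입니다.", return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```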
Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
[More Information Needed]
Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]
Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model.
[More Information Needed]
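Until an official example is provided, a minimal inference sketch might look as follows (the checkpoint path is a placeholder; since the released weights have only been pre-trained with span denoising, the raw output is not expected to be directly useful without fine-tuning):

```python
# Minimal inference sketch for the span-denoising-only checkpoint (placeholder path).
from transformers import ByT5Tokenizer, T5ForConditionalGeneration

tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("path/to/kebyt5-base")

text = "안녕하세요. 오늘 날씨가 참 좋네요."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```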
Training Details
Training Data
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
[More Information Needed]
Training Procedure [optional]
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Preprocessing
[More Information Needed]
Speeds, Sizes, Times
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
[More Information Needed]
Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
Testing Data, Factors & Metrics
Testing Data
<!-- This should link to a Data Card if possible. -->
[More Information Needed]
Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
[More Information Needed]
Results
[More Information Needed]
Summary
Model Examination [optional]
<!-- Relevant interpretability work for the model goes here -->
[More Information Needed]
Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
- Trained on 4 × NVIDIA A100 80GB GPUs
Hardware
[More Information Needed]
Software
- pytorch-lightning
- huggingface/transformers
- microsoft/deepspeed
- ...
Citation [optional]
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]