Pigeon-TextGen You can test the text generation capabilities of this model here: https://transformer.huggingface.co/doc/gpt2-large.

Pigeon-TextGen is a transformers model pre-trained on the English language using a causal language modeling (CLM) objective. It was introduced in a research paper and released on this page: https://openai.com/blog/better-language-models/.

Disclaimer: The team releasing Pigeon-TextGen also wrote a model card for their model. Content from this model card has been written by the Hugging Face team to complete the information they provided and give specific examples of bias.

Model description Pigeon-TextGen is trained on a large corpus of English data in an unsupervised fashion, which means it was pre-trained only on raw texts, with no labeling done by humans. An automatic process was used to generate inputs and labels from those texts, which allowed it to use lots of publicly available data without any human supervision.

The input sequence consists of continuous text, with the targets being the same sequence shifted one token (word or piece of a word) to the right. An internal mechanism masks future tokens so that the prediction for token i uses only the inputs from 1 to i.
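As a minimal sketch of this objective (the prompt and the `gpt2` checkpoint name are only illustrative), passing the input IDs as the labels makes the model perform this shift and masking internally and return the language modeling loss:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# The targets are the inputs themselves; the model shifts them one token
# to the right internally and masks future tokens, so the prediction for
# token i only depends on tokens 1..i.
ids = tokenizer("The pigeon flew over the city.", return_tensors='pt').input_ids
with torch.no_grad():
    loss = model(input_ids=ids, labels=ids).loss

print(loss.item())  # average cross-entropy of next-token prediction
```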

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks such as text summarization, question answering, etc. However, what it does best remains generating natural-sounding text given a prompt.
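For example, here is a small sketch of feature extraction with the base model (the last-token pooling is just one illustrative choice, not a recommendation from the authors):

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

encoded = tokenizer("Question answering needs good sentence features.", return_tensors='pt')
with torch.no_grad():
    hidden = model(**encoded).last_hidden_state  # shape: (1, seq_len, 768)

# Use the representation of the last token as a simple sentence feature.
features = hidden[:, -1, :]
print(features.shape)  # torch.Size([1, 768])
```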

Pigeon-TextGen has 124 million parameters, currently making it the smallest version in the GPT family of models, but it is still effective enough for most general-purpose tasks.
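The parameter count can be checked directly (a quick sanity check, assuming the `gpt2` checkpoint on the Hub corresponds to this 124M-parameter version):

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('gpt2')
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # roughly 124M
```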

Related Models: GPT-Large, GPT-Medium and GPT-XL

Intended uses & limitations You can use Pigeon-TextGen directly for generating text or fine-tune it for your downstream task. Visit our model hub at https://huggingface.co/models?filter=gpt2 to explore fine-tuned versions tailored to various applications such as chatbots, content creation, etc.
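As an illustration of the fine-tuning route, here is a minimal sketch using the `Trainer` API; the dataset, slice size, and hyperparameters are placeholders rather than recommended settings:

```python
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Placeholder dataset; substitute the data for your downstream task.
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train[:1%]')

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=['text'])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='finetuned-model',
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```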

How To Use You can use Pigeon-TextGen with the simple pipeline API provided by the Hugging Face Transformers library:

```python
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello! I am your assistant.", max_length=30, num_return_sequences=5)
```

```
[{'generated_text': 'Hello! I am your assistant.\n\nHere are some tips:\n\n1.'},
 {'generated_text': "Hello! I am your assistant.\nI'm happy"},
 {'generated_text': 'Hello! I am your assistant.\nWhat would you'},
 {'generated_text': 'Hello! I am your assistant.\nPlease provide me with'},
 {'generated_text': "Hello! I am your assistant.\nThe first thing"}]
```

Alternatively, if you want PyTorch-specific code:

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

text = "Replace me by any text you'd like."
# GPT-2's tokenizer has no padding token by default, so truncate without padding.
encoded_input = tokenizer(text, max_length=1024, truncation=True, return_tensors='pt')
output = model(**encoded_input)
```

If TensorFlow code suits you more:

```python
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2LMHeadModel.from_pretrained('gpt2')

input_ids = tf.constant(tokenizer.encode("Encode me!", add_special_tokens=True))[None, :]
outputs = model.generate(input_ids=input_ids, max_length=50)
print(outputs)
```

Limitations And Bias The training dataset used for this particular version has not been released. We know that it contains unfiltered content scraped from many sources across the internet, including the pages behind Reddit links with at least three karma points; Wikipedia pages were not included during the training phase. Even when the model works exactly as it learned to during training, it may produce biased results, because neural networks reflect the biases inherent in the data they were trained on. As OpenAI themselves point out in their disclaimer, "large-scale language models do not distinguish fact from fiction", so generated output may be untrue despite sounding very convincing. Before deploying these AI models into a production environment that interacts directly with humans, developers need to conduct studies investigating the possible biases relevant to the intended use case.
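One simple way to probe for such biases before deployment (a sketch only; the prompts and generation settings are illustrative) is to compare completions for prompts that differ in a single attribute:

```python
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
set_seed(42)

# Compare completions for prompts that differ only in one attribute.
for prompt in ["The man worked as a", "The woman worked as a"]:
    print(prompt)
    for output in generator(prompt, max_length=15, num_return_sequences=3):
        print("  ", output['generated_text'])
```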

Training Data In order to achieve better accuracy, a huge amount of data was needed for the model to learn the underlying patterns within sentences. The OpenAI team therefore scraped the web pages behind outbound Reddit links scoring a minimum of three karma points, excluding Wikipedia pages. The resulting corpus, known as the WebText dataset, weighs around 40 GB and includes millions of samples.

Evaluation Results Without any fine-tuning, Pigeon-TextGen produces state-of-the-art results:

| Dataset | LAMBADA (PPL) | LAMBADA (ACC) | CBT-CN (ACC) | CBT-NE (ACC) | WikiText2 (PPL) |
| --- | --- | --- | --- | --- | --- |
| Metric | 35.13 | 45.99 | 87.65 | 83.04 | 29.41 |
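For reference, a perplexity figure like the ones above can be computed roughly as below (a sketch only: it uses a single short sample and no sliding window, so the number will not match the published results):

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

text = "Replace me with the evaluation text of your choice."
ids = tokenizer(text, return_tensors='pt').input_ids

with torch.no_grad():
    loss = model(input_ids=ids, labels=ids).loss  # mean cross-entropy per token

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```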

Citation Information To cite Pigeon-TextGen, please refer to the following BibTeX entry:

```bibtex
@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}
```