
GPT-J-Title-Teaser-1k


gptj-title-teaser-1k
Version 1.0 / 22 December 2022

A proof of concept for multitask fine-tuning of GPT-J-6B-8bit for German news title and teaser generation.

Model Details

Model Description

Uses


This model is not intended for use! It is a preliminary version of gptj-title-teaser-10k, built to prove the multitask fine-tuning approach.
For actual use, please refer to gptj-title-teaser-10k.

Training Details

Training Data


The model was fine-tuned on a collection of 1,000 news items scraped from various German-language online news outlets.

For each news item, the dataset contains title, teaser, and fulltext.

[
  {
    "title": ...,
    "teaser": ...,
    "fulltext": ...
  },
  ...
]
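As a small illustration, such a collection can be loaded as plain JSON. The snippet below is a sketch; the filename news_items_1k.json is an assumption for the example, not the actual dataset file.

import json

# Load the scraped news items (the filename is assumed for illustration only).
with open("news_items_1k.json", encoding="utf-8") as f:
    news_items = json.load(f)

# Each entry carries the three fields used for fine-tuning.
first = news_items[0]
print(first["title"])
print(first["teaser"])
print(len(first["fulltext"]), "characters of fulltext")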

Training Procedure


The model was fine-tuned on both tasks jointly using a causal language modeling (CLM) objective.
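The sketch below illustrates such a setup with the Hugging Face Trainer. It is illustrative only: the base checkpoint name, hyperparameters, and toy examples are assumptions, and the 8-bit loading and memory optimizations used for GPT-J-6B-8bit are omitted.

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "EleutherAI/gpt-j-6B"  # assumed base checkpoint for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy inputs in the multitask prompt format described under Preprocessing below.
texts = [
    "[Text]: Beispieltext ... \n [Title]: Beispieltitel",
    "[Text]: Beispieltext ... \n [Teaser]: Ein kurzer Beispielteaser.",
]
dataset = Dataset.from_list([{"text": t} for t in texts]).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

# mlm=False selects the causal LM objective: the labels are the input ids,
# shifted internally so each position predicts the next token.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gptj-title-teaser-1k", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()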

Preprocessing

For each news item, two training inputs were constructed by concatenating the fulltext with the title and the teaser, respectively, as shown below.

f"[Text]: {item.fulltext} \n [Title]: {item.title}"
f"[Text]: {item.fulltext} \n [Teaser]: {item.teaser}"

This results in one input per task for each news item.

Note: The inserted prompt "[Text]:" marks the beginning of the news item's fulltext.
In the same manner, "[Title]:" prompts for the news item's title and "[Teaser]:" for its teaser.
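A minimal sketch of this preprocessing step: it assumes each news item is available as a plain dictionary with title, teaser, and fulltext keys, and the helper name build_training_inputs is made up for the example.

def build_training_inputs(item: dict) -> list[str]:
    """Build the two task-specific training strings for one news item."""
    return [
        f"[Text]: {item['fulltext']} \n [Title]: {item['title']}",
        f"[Text]: {item['fulltext']} \n [Teaser]: {item['teaser']}",
    ]

# Example: one news item yields one training input per task.
item = {
    "title": "Beispieltitel",
    "teaser": "Ein kurzer Teaser.",
    "fulltext": "Der vollständige Artikeltext ...",
}
title_input, teaser_input = build_training_inputs(item)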

Evaluation

1,000 German news articles proved sufficient to validate the approach. Evaluation showed that the model improved compared to the GPT-J baseline.

The evaluation also suggested that there is still opportunity for improvement with more data.
For the model trained with the same approach but ten times the amount of data, please refer to gptj-title-teaser-10k.

Environmental Impact


Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Glossary

News Item, aka news article. A particular piece of news, usually from a journalistic source.
Snippet. A small section of text that is related to a news item.
Title, aka headline. A few words that reflect the essence of the news story.
Teaser, aka lede. A few sentences that spark curiosity about the "best of the rest" of the news story.