Summarization using LLM and measuring the performance with ROUGE— Part 1

Kunal Gandhi
4 min read · Oct 21, 2023


LLMs (Large Language Models) are the main focus of today’s AI world, especially in the area of Generative AI. In this article, we will try a few LLMs from Hugging Face through the built-in pipeline and measure the performance of each model with ROUGE.

Summarization: There are two ways to perform summarization.

  1. Abstractive Summarization: Here we try to create a summary that represents the purpose and captures the essence of the document. This is harder to achieve, as we may need to generate new words and sentences that are not present in the document, which can introduce grammatical and semantic issues.
  2. Extractive Summarization: Extractive summarization selects and extracts complete sentences from the source text to create the summary. It does not generate new sentences but rather chooses the sentences that are most informative or representative of the content.

The Hugging Face models we will use here perform abstractive summarization. Let’s get to the point.

First, you need to import the following libraries:

# to load the dataset
from datasets import load_dataset
# to create summarization pipeline
from transformers import pipeline
# to calculate rouge score
from rouge_score import rouge_scorer
import pandas as pd

Please install these libraries through pip if you do not already have them.
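For reference, the packages can be installed roughly like this (assuming the usual PyPI names; note that the rouge_score module is published as rouge-score, and the transformers pipeline also needs a backend such as PyTorch):

pip install datasets transformers rouge-score pandas torch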

Now let’s load the dataset that we are going to use to measure the performance of the LLMs.

xsum_dataset = load_dataset("xsum", version="1.2.0") 
xsum_sample = xsum_dataset["train"].select(range(5))
display(xsum_sample.to_pandas())

As you can see, the dataset has 3 columns.

  • document: Input news article.
  • summary: One sentence summary of the article.
  • id: BBC ID of the article.

You can find more about this dataset here.

Let’s create the summarization pipeline and generate summaries by passing in the documents.

summarizer_t5 = pipeline(
    task="summarization",
    model="t5-small",
)

results = summarizer_t5(xsum_sample["document"], min_length=20, max_length=40, truncation=True)

# convert to a pandas dataframe with generated and reference summaries, and print
opt_result = pd.DataFrame.from_dict(results).rename({"summary_text": "generated_summary"}, axis=1).join(pd.DataFrame.from_dict(xsum_sample))[
    ["generated_summary", "summary", "document"]
]
display(opt_result.head())

The pipeline takes mainly three arguments: task, model, and tokenizer. Here we are using the default tokenizer that ships with the model.
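For illustration, here is a minimal sketch of passing the tokenizer explicitly, which is equivalent to the default behaviour for t5-small:

from transformers import AutoTokenizer, pipeline

# load the tokenizer that ships with the model and pass it in explicitly
tokenizer_t5 = AutoTokenizer.from_pretrained("t5-small")
summarizer_t5 = pipeline(task="summarization", model="t5-small", tokenizer=tokenizer_t5)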

We are passing 20 as the minimum length and 40 as the maximum length (in tokens) for the generated summary, with truncation=True so that long articles are cut down to the model’s maximum input length.
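The pipeline returns one dictionary per input document with a summary_text key, which is why the code above renames that column. A quick way to confirm this:

print(results[0].keys())  # dict_keys(['summary_text'])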

Now let’s measure the performance by calculating the ROUGE score.

What is ROUGE?

ROUGE stands for “Recall-Oriented Understudy for Gisting Evaluation.” It’s a metric designed to measure the quality of summaries by comparing them to human reference summaries. ROUGE is a collection of metrics, with the most commonly used one being ROUGE-N, which measures the overlap of N-grams (contiguous sequences of N words) between the system-generated summary and the reference summary.

How to Calculate ROUGE?

Let’s calculate ROUGE-1 for the following example:

Reference Summary: Weather is hot here

Generated Summary: Weather is very hot here

Recall = (overlapping unigrams) / (total unigrams in the reference summary) = 4 / 4 = 1.0

Precision = (overlapping unigrams) / (total unigrams in the generated summary) = 4 / 5 = 0.8

F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * 0.8 / 1.8 ≈ 0.89
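As a quick sanity check, we can reproduce this with the rouge_score library (using the same toy sentences):

from rouge_score import rouge_scorer

# score(reference, prediction) returns precision, recall and F1 for each metric
scorer = rouge_scorer.RougeScorer(['rouge1'])
print(scorer.score("Weather is hot here", "Weather is very hot here")['rouge1'])
# -> precision=0.8, recall=1.0, fmeasure≈0.89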

We can calculate ROUGE-2, ROUGE-3 … ROUGE-N in the same way using bi-grams, tri-grams and N-grams.

ROUGE-L: Measures the longest common subsequence (LCS) between the system and reference summaries. Unlike ROUGE-N, it does not require the matching words to be consecutive, only to appear in the same order, so it can reward summaries that preserve the overall sentence structure even when exact n-grams differ.
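To make the difference concrete, here is a small sketch (the reordered sentence is invented purely for illustration): shuffling the words keeps ROUGE-1 at 1.0 but drops ROUGE-L, because the longest common subsequence shrinks.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'])
# same words, different order
scores = scorer.score("Weather is hot here", "hot here Weather is")
print(scores['rouge1'].fmeasure)  # 1.0
print(scores['rougeL'].fmeasure)  # 0.5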

def calculate_rouge(data):
    # ROUGE-1, ROUGE-2 and ROUGE-L with stemming enabled
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    # score(reference, prediction) -> keep the F1 score (fmeasure) for each metric
    data["r1_fscore"] = data.apply(lambda row: scorer.score(row["summary"], row["generated_summary"])['rouge1'].fmeasure, axis=1)
    data["r2_fscore"] = data.apply(lambda row: scorer.score(row["summary"], row["generated_summary"])['rouge2'].fmeasure, axis=1)
    data["rl_fscore"] = data.apply(lambda row: scorer.score(row["summary"], row["generated_summary"])['rougeL'].fmeasure, axis=1)
    return data

score_ret = calculate_rouge(opt_result)

print("ROUGE - 1 : ",score_ret["r1_fscore"].mean())
print("ROUGE - 2 : ",score_ret["r2_fscore"].mean())
print("ROUGE - L : ",score_ret["rl_fscore"].mean())

I have tried two pre-trained models for summarization; a sketch of the second pipeline follows the comparison below.

  1. t5-small
  2. facebook/bart-large-cnn
(Figure: ROUGE score comparison of the two models)
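For completeness, here is a sketch of the second pipeline (summarizer_bart and results_bart are names introduced here for illustration); generation and the ROUGE calculation work exactly as shown above for t5-small:

# same steps as before, only the model changes
summarizer_bart = pipeline(
    task="summarization",
    model="facebook/bart-large-cnn",
)
results_bart = summarizer_bart(xsum_sample["document"], min_length=20, max_length=40, truncation=True)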

These are pre-trained models; we can further fine-tune them to perform better. You can find the list of models available for summarization tasks on Hugging Face here.

While ROUGE is a valuable tool, it has its limitations. For example, it doesn’t consider the fluency and coherence of the summary. It focuses on word overlap, which means a summary can receive a high ROUGE score even if it’s not very readable.

Please find the code in the git repo.

