Throughout this course, you've seen statements like, "the model demonstrated good performance on this task," or "this fine-tuned model showed a large improvement in performance over the base model." What do statements like this mean? How can you formalize the improvement in performance of your fine-tuned model over the pre-trained model you started with? Let's explore several metrics that are used by developers of large language models, which you can use to assess the performance of your own models and to compare them to other models out in the world.

In traditional machine learning, you can assess how well a model is doing by looking at its performance on training and validation datasets where the output is already known. You're able to calculate simple metrics, such as accuracy, which states the fraction of all predictions that are correct, because the models are deterministic and each prediction is simply right or wrong. But with large language models, where the output is non-deterministic and language-based, evaluation is much more challenging.

Take, for example, the sentence, "Mike really loves drinking tea." This is quite similar to, "Mike adores sipping tea." But how do you measure the similarity? Now look at these other two sentences: "Mike does not drink coffee" and "Mike does drink coffee." There is only one word of difference between these two sentences, yet the meaning is completely different. For humans like us, with our squishy organic brains, the similarities and differences are easy to see. But when you train a model on millions of sentences, you need an automated, structured way to make measurements.

ROUGE and BLEU are two widely used evaluation metrics for different tasks. ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is primarily employed to assess the quality of automatically generated summaries by comparing them to human-generated reference summaries. BLEU, or Bilingual Evaluation Understudy, is an algorithm designed to evaluate the quality of machine-translated text, again by comparing it to human-generated translations. Now, the word "bleu" is French for blue, so you might hear people pronouncing this metric as "blue," but here I'm going to stick with BLEU.

Before we start calculating metrics, let's review some terminology. In the anatomy of language, a unigram is a single word, a bigram is two words, and an n-gram is a group of n words. Pretty straightforward stuff.

First, let's look at the ROUGE-1 metric. To do so, consider a human-generated reference sentence, "It is cold outside," and a generated output, "It is very cold outside." You can perform simple metric calculations, similar to other machine learning tasks, using recall, precision, and F1. The recall metric measures the number of words, or unigrams, that are matched between the reference and the generated output, divided by the number of unigrams in the reference. In this case, that gets a perfect score of 1, as all four words in the reference appear in the generated output. Precision measures the unigram matches divided by the size of the output, here 4 out of 5, or 0.8. The F1 score is the harmonic mean of these two values.

These are very basic metrics that focus only on individual words, hence the 1 in the name, and don't consider the ordering of the words, so they can be deceptive. It's easy to generate sentences that score well but would be subjectively poor. Stop for a moment and imagine that the sentence generated by the model differed by just one word, "not," giving "It is not cold outside." The scores would be exactly the same.
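To make that arithmetic concrete, here is a minimal sketch of the ROUGE-1 calculation just described. The `rouge_1` function is a hypothetical helper written purely for illustration, not the official rouge_score implementation.

```python
# Minimal, illustrative sketch of ROUGE-1 recall, precision, and F1.
# This is a hypothetical helper, not the official rouge_score package.

def rouge_1(reference: str, generated: str) -> dict:
    ref_words = reference.lower().split()
    gen_words = generated.lower().split()
    # Unigram matches: generated words that also appear in the reference.
    matches = sum(1 for word in gen_words if word in ref_words)
    recall = matches / len(ref_words)      # matches / words in the reference
    precision = matches / len(gen_words)   # matches / words in the output
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1}

print(rouge_1("It is cold outside", "It is very cold outside"))
# recall 1.0, precision 0.8, f1 ~0.89
print(rouge_1("It is cold outside", "It is not cold outside"))
# identical scores, even though the meaning is reversed
```

Running it on both generated sentences shows exactly the problem described above: the scores are identical even though one sentence says the opposite of the reference.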
You can get a slightly better score by taking into account bigrams, or collections of two words at a time, from the reference and generated sentences. By working with pairs of words, you're acknowledging, in a very simple way, the ordering of the words in the sentence. Using bigrams, you can calculate ROUGE-2: the recall, precision, and F1 score are now computed with bigram matches instead of individual words. In this example, the recall is 2 out of 3, the precision is 2 out of 4, and the F1 score is about 0.57. You'll notice that these scores are lower than the ROUGE-1 scores, and with longer sentences there's a greater chance that bigrams won't match, so the scores may be even lower.

Rather than continuing with ROUGE numbers that use ever larger n-grams of three or four words, let's take a different approach. Instead, you'll look for the longest common subsequence present in both the generated output and the reference output. In this case, the longest matching subsequences are "it is" and "cold outside," each with a length of 2. You can now use the LCS value to calculate the recall, precision, and F1 score, where the numerator in both the recall and the precision calculations is the length of the longest common subsequence, in this case 2. Collectively, these three quantities are known as the ROUGE-L score.

As with all of the ROUGE scores, you need to take the values in context. You can only use the scores to compare the capabilities of models if the scores were determined for the same task, for example summarization. ROUGE scores for different tasks are not comparable to one another.

As you've seen, a particular problem with simple ROUGE scores is that a bad completion can still result in a good score. Take, for example, this generated output: "cold cold cold cold." Because this output contains one of the words from the reference sentence, it scores quite highly, even though the same word is repeated multiple times; the ROUGE-1 precision is a perfect 1. One way to counter this issue is to use a clipping function that limits the number of unigram matches to the maximum count of that unigram within the reference. Here there is only one appearance of "cold" in the reference, so a modified precision with a clip on the unigram matches gives a dramatically reduced score of 1 out of 4, or 0.25. However, you'll still be challenged if the generated words are all present, but just in a different order. For example, the generated sentence "outside cold it is" scores perfectly, even on the modified precision with the clipping function, because all of the words in the generated output are present in the reference. So, while using a different ROUGE score can help, the n-gram size that produces the most useful score depends on the sentence, the sentence length, and your use case.

Note that many language model libraries, for example Hugging Face, which you used in the first week's lab, include implementations of the ROUGE score that you can use to easily evaluate the output of your model. You'll get to try the ROUGE score and use it to compare the model's performance before and after fine-tuning in this week's lab.

The other score that can be useful in evaluating the performance of your model is the BLEU score, which stands for Bilingual Evaluation Understudy. Just to remind you, the BLEU score is useful for evaluating the quality of machine-translated text. The score itself is calculated from the average precision over multiple n-gram sizes.
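Before walking through the BLEU calculation, here's what the ROUGE scores discussed above look like in code. This is a hedged sketch using the Hugging Face `evaluate` library mentioned a moment ago; it assumes the `evaluate` and `rouge_score` packages are installed, and the exact values may differ slightly from the hand calculations depending on the library's tokenization.

```python
# Hedged sketch: scoring the same example with the Hugging Face `evaluate`
# library, which wraps the rouge_score package (both assumed pip-installed).
import evaluate

rouge = evaluate.load("rouge")

scores = rouge.compute(
    predictions=["It is very cold outside"],
    references=["It is cold outside"],
)
print(scores)
# Returns F1-style scores under keys such as 'rouge1', 'rouge2', and 'rougeL'.
```

This is the kind of call you'll make in this week's lab to compare the model's output before and after fine-tuning.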
So, BLEU is just like the ROUGE-1 precision that we looked at before, but calculated for a range of n-gram sizes and then averaged. Let's take a closer look at what this measures and how it's calculated. The BLEU score quantifies the quality of a translation by checking how many n-grams in the machine-generated translation match those in the reference translation. To calculate the score, you average precision across a range of different n-gram sizes. If you were to calculate this by hand, you would carry out multiple calculations and then average all of the results to find the BLEU score.

For this example, let's take a look at a longer sentence so that you can get a better sense of the score's value. The reference, human-provided sentence is, "I am very happy to say that I am drinking a warm cup of tea." Now, since you've seen these individual calculations in depth when you looked at ROUGE, I'll show you the results of BLEU using a standard library. Calculating the BLEU score is easy with pre-written libraries from providers like Hugging Face, and I've done just that for each of our candidate sentences; a sketch of the calculation appears at the end of this section. The first candidate is, "I am very happy that I am drinking a cup of tea," and the BLEU score is 0.495. As we get closer and closer to the original sentence, we get a score that is closer and closer to 1.

Both ROUGE and BLEU are quite simple metrics and are relatively low cost to calculate. You can use them for quick reference as you iterate over your models, but you shouldn't use them alone to report the final evaluation of a large language model. Use ROUGE for diagnostic evaluation of summarization tasks and BLEU for translation tasks. For an overall evaluation of your model's performance, however, you'll need to look at one of the evaluation benchmarks that have been developed by researchers. Let's take a look at some of these in more detail in the next video.
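As promised, here is a hedged sketch of the BLEU comparison from this video, using the Hugging Face `evaluate` library (assumed installed along with its BLEU dependencies). The exact number depends on the implementation's n-gram settings and smoothing, so it may not reproduce 0.495 precisely.

```python
# Hedged sketch: BLEU for one candidate against the human reference sentence,
# via the Hugging Face `evaluate` library (assumed pip-installed).
import evaluate

bleu = evaluate.load("bleu")

reference = "I am very happy to say that I am drinking a warm cup of tea"
candidate = "I am very happy that I am drinking a cup of tea"

# `references` takes a list of reference sentences per prediction.
result = bleu.compute(predictions=[candidate], references=[[reference]])
print(result["bleu"])  # moves closer to 1.0 as the candidate approaches the reference
```

Swapping in candidates that are closer to the reference sentence should push the score toward 1, matching the trend described above.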