my_ml_notes

Evaluating_LLMs


Ref. All about evaluating Large language models

Evaluating Embeddings

Ref. Amazon Bedrock: How good (bad) is Titan Embeddings?

Evaluating Models

TL;DR:

Building a reliable benchmark for LLM chatbots is challenging [ref. here]. Examples of challenges are:

Below are high-level descriptions of some benchmarks for model evaluation:

MT-Bench

Arena Hard

A benchmark built from live data from Chatbot Arena.

It contains 500 challenging user queries. GPT-4 is used as the judge to compare each model's responses against a baseline model (GPT-4-0314).

Below are some example questions used by the benchmark, taken from here.

{"question_id":"1f07cf6d146d4038b2b93aaba3935ce0","category":"arena-hard-v0.1","cluster":"AI & Sequence Alignment Challenges","turns":[{"content":"Explain the book the Alignment problem by Brian Christian. Provide a synopsis of themes and analysis. Recommend a bibliography of related reading. "}]}

{"question_id":"379a490a6eae40608abf3501807b2545","category":"arena-hard-v0.1","cluster":"Advanced Algebra and Number Theory","turns":[{"content":" Consider the state:\n$$\\ket{\\psi} = \\frac{\\ket{00} + \\ket{01} + \\ket{10}}{\\sqrt{3}}$$\n\n(a). Calculate the reduced density matrix of the second qubit of $\\ket{\\psi}$."}]}


Evaluating RAG

Evaluating LLM applications is very important because of the non-deterministic behaviour of the models.

Every RAG evaluation needs to consider the following components: the retriever and the generator.

The performance of the retriever is influenced by the chunking strategy and the embedding model used, while the performance of the generator is influenced by the choice of model and prompting technique.

RAGAS (Retrieval Augmented Generation Assessment)

RAGAS Paper.

Generation: faithfulness, answer relevancy

Retrieval: context precision, context recall

from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
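
Using these metric objects, a minimal evaluation sketch looks like the following (the example data is hypothetical, and the ground-truth column name differs across RAGAS versions):

from datasets import Dataset
from ragas import evaluate

# Toy evaluation set: question, generated answer, retrieved contexts, and a
# reference answer (needed for context_recall). Hypothetical example data.
eval_data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "ground_truths": [["Paris is the capital of France."]],
})

# Each metric makes LLM / embedding calls under the hood and returns a score in [0, 1]
result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)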

The picture below is based on RAGAS - Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines. See the AWS Bedrock example here and a blog to try here.

(Figure: RAGAS evaluation framework overview)

faithfulness (generation):

The factual consistency of the answer with the retrieved context, given the question. It is performed in two steps: first, the LLM breaks the generated answer into a set of individual statements; second, the LLM checks each statement against the retrieved context and counts how many of them can be inferred from it.
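
As a sketch of the scoring, following the RAGAS paper (notation is mine), faithfulness is the fraction of answer statements supported by the context:

$$\text{faithfulness} = \frac{|V|}{|S|}$$

where $S$ is the set of statements extracted from the answer and $V \subseteq S$ is the subset that can be inferred from the retrieved context.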

answer_relevancy (generation):

A measure of how relevant the answer is to the question. It evaluates how closely the generated answer aligns with the initial question or instruction.

Given an answer, the LLM generates the probable questions that the answer could be answering, and then computes the similarity between those generated questions and the actual question asked.

The implementation looks like:

Generate a question for the given answer.
answer: [answer]

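As a sketch of how the score is computed (following the RAGAS paper; notation is mine): the LLM generates $n$ questions $q_1, \dots, q_n$ from the answer, and the score is the mean embedding similarity between those questions and the original question $q$:

$$\text{answer relevancy} = \frac{1}{n} \sum_{i=1}^{n} \operatorname{sim}(q, q_i)$$

where $\operatorname{sim}$ is the cosine similarity between the question embeddings.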

context_precision (retrieval):

A measure of how relevant the retrieved context is to the question. It conveys the quality of the retrieval pipeline.

Also known as context relevancy, it measures the signal-to-noise ratio in the retrieved contexts. Given a question, the LLM identifies the sentences from the retrieved context that are needed to answer it.

This metric aims to penalise the inclusion of redundant information. The steps used to calculate it are: first, prompt the LLM to extract the sentences from the retrieved context that are relevant to answering the question (using the prompt below); then divide the number of extracted sentences by the total number of sentences in the context.

Please extract relevant sentences from the provided context that can potentially help answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase “Insufficient Information”. While extracting candidate sentences you’re not allowed to make any changes to sentences from given context
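
As a sketch of the resulting score (following the context relevance definition in the RAGAS paper; phrasing is mine):

$$\text{context relevancy} = \frac{\text{number of extracted relevant sentences}}{\text{total number of sentences in the retrieved context}}$$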

AWS FMEval and SageMaker Clarify (by Amazon)

Amazon SageMaker Clarify (SM Clarify) allows you to evaluate and compare foundation models. FMEval is the open-source package that powers SageMaker Clarify's foundation model evaluations.

The FMEval/SM Clarify library can help to evaluate:

It allows models to be evaluated using automatic model evaluation and also evaluation by human workers.

Automatic model evaluation metrics implemented by the tool are: factual knowledge, semantic robustness, prompt stereotyping, toxicity, and accuracy.

Human workers might evaluate your model on more subjective dimensions such as helpfulness or style.

The table below shows a summary of the metrics you can use for each task with FMEval.

| Tasks versus Metrics | Factual knowledge | Semantic robustness | Prompt stereotyping | Toxicity | Accuracy |
|---|---|---|---|---|---|
| Open-ended generation | x | x | x | x | - |
| Text summarization | - | x | - | x | x |
| Question and answer | - | x | - | x | x |
| Classification | - | x | - | - | x |
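
Below is a minimal, hedged sketch of running one of the automatic evaluations (factual knowledge) against a Bedrock-hosted model with the fmeval package; the model ID is only an example, and argument names may differ between fmeval versions:

from fmeval.eval_algorithms.factual_knowledge import (
    FactualKnowledge,
    FactualKnowledgeConfig,
)
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

# Model runner that sends prompts to a Bedrock model and extracts the generated text
model_runner = BedrockModelRunner(
    model_id="amazon.titan-text-express-v1",   # example model ID
    content_template='{"inputText": $prompt}',
    output="results[0].outputText",
)

# Run the factual-knowledge evaluation on the algorithm's built-in dataset(s)
eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))
eval_outputs = eval_algo.evaluate(model=model_runner, save=True)

for output in eval_outputs:
    print(output.dataset_name, output.dataset_scores)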

LightEval (by Hugging Face)

lighteval was originally built on top of the great Eleuther AI Harness (which is powering the Open LLM Leaderboard). We also took a lot of inspiration from the amazing HELM, notably for metrics.

Metrics:

Langchain ContextQAEvalChain

Langchain Evaluation
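
A minimal, hedged sketch of grading answers against retrieved context with ContextQAEvalChain (the example data and the ChatOpenAI judge are assumptions, and argument defaults may differ between LangChain versions):

from langchain.chat_models import ChatOpenAI
from langchain.evaluation.qa import ContextQAEvalChain

# LLM used as the grader
llm = ChatOpenAI(model="gpt-4", temperature=0)
eval_chain = ContextQAEvalChain.from_llm(llm)

# Each example pairs a question with the context that should support the answer;
# predictions hold the RAG pipeline's generated answers (toy data).
examples = [{"query": "What is the capital of France?",
             "context": "Paris is the capital and largest city of France."}]
predictions = [{"result": "The capital of France is Paris."}]

graded = eval_chain.evaluate(
    examples,
    predictions,
    question_key="query",
    context_key="context",
    prediction_key="result",
)
print(graded)  # list of graded verdicts, e.g. CORRECT / INCORRECT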

LLMTest_NeedleInAHaystack

| Tool | Metrics |
|---|---|
| RAGAS | Generation: faithfulness, answer_relevancy; Retrieval: context precision, context recall |
| FMEval | Accuracy, Toxicity, Semantic robustness, Prompt stereotyping |

Reference: