Ref. All about evaluating Large language models
Ref. Amazon Bedrock: How good (bad) is Titan Embeddings?
TL;DR:
Benchmarks normally do not reflect business use cases, but rather general knowledge.
GPT-4 is used as the judge -> thus a certain bias is expected.
Building a reliable benchmark for LLM chatbots is a challenge [ref. here]. Examples of challenges are:
clearly identifying/separating model capabilities
Below are some high-level descriptions of some benchmarks for model evaluation:
A benchmark built from live data from Chatbot Arena.
It contains 500 challenging user queries. GPT-4 is used as the judge to compare each model's responses against a baseline model (GPT-4-0314).
See below some example questions used by the benchmark, taken from here.
{"question_id":"1f07cf6d146d4038b2b93aaba3935ce0","category":"arena-hard-v0.1","cluster":"AI & Sequence Alignment Challenges","turns":[{"content":"Explain the book the Alignment problem by Brian Christian. Provide a synopsis of themes and analysis. Recommend a bibliography of related reading. "}]}
{"question_id":"379a490a6eae40608abf3501807b2545","category":"arena-hard-v0.1","cluster":"Advanced Algebra and Number Theory","turns":[{"content":" Consider the state:\n$$\\ket{\\psi} = \\frac{\\ket{00} + \\ket{01} + \\ket{10}}{\\sqrt{3}}$$\n\n(a). Calculate the reduced density matrix of the second qubit of $\\ket{\\psi}$."}]}
LLM application evaluation is very important due to the non-deterministic behaviour of the models.
Every RAG evaluation needs to consider the following components:
The performance of the retriever is influenced by the chunking strategy and the embedding model used, while the performance of the generator is influenced by the choice of model and prompting technique.
Generation: faithfulness, answer relevancy
Retrieval: context precision, context recall
from ragas.metrics import (
answer_relevancy,
faithfulness,
context_recall,
context_precision,
)
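As a minimal sketch of how these metrics come together (assuming a ragas 0.1-style API; the toy question/answer/context values below are made up):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

# Toy evaluation set: each row needs the question, the generated answer,
# the retrieved contexts and a ground-truth reference answer.
data = {
    "question": ["What is the capital of Norway?"],
    "answer": ["The capital of Norway is Oslo."],
    "contexts": [["Oslo is the capital and most populous city of Norway."]],
    "ground_truth": ["Oslo"],
}
dataset = Dataset.from_dict(data)

# ragas scores each row with its configured judge LLM and embedding model.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```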
The picture below is based on RAGAS - an evaluation framework for your Retrieval Augmented Generation (RAG) pipelines. Ref. AWS Bedrock example here and a blog to try here.
The factual consistency of the answer with respect to the retrieved context, given the question. Performed in two steps: first, the statements made in the generated answer are extracted; then each statement is checked against the retrieved context to see whether it can be inferred from it.
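A rough sketch of that two-step faithfulness check; the extract_statements and is_supported helpers below are hypothetical stand-ins for LLM calls, not the actual RAGAS implementation.

```python
def faithfulness_score(answer: str, context: str, llm) -> float:
    # Step 1: break the generated answer into individual factual statements.
    statements = llm.extract_statements(answer)  # hypothetical LLM helper
    if not statements:
        return 0.0

    # Step 2: verify each statement against the retrieved context.
    supported = [
        s for s in statements
        if llm.is_supported(statement=s, context=context)  # hypothetical LLM helper
    ]

    # Faithfulness = fraction of statements supported by the context.
    return len(supported) / len(statements)
```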
A measure of how relevant the answer is to the question. It evaluates how closely the generated answer aligns with the initial question or instruction.
Given an answer, the LLM generates the probable question(s) that the generated answer would be answering and computes the similarity to the actual question asked.
The prompt used for this looks like:
Generate a question for the given answer.
answer: [answer]
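A sketch of how answer relevancy could be computed along those lines; generate_question and embed are hypothetical helpers standing in for the LLM and embedding model that RAGAS configures.

```python
import numpy as np


def answer_relevancy_score(question: str, answer: str, llm, embed, n: int = 3) -> float:
    # Generate n plausible questions the answer could be answering,
    # using the "Generate a question for the given answer" prompt above.
    generated_questions = [llm.generate_question(answer) for _ in range(n)]  # hypothetical helper

    # Embed the original question and the generated questions.
    q_vec = embed(question)                       # hypothetical helper
    g_vecs = [embed(g) for g in generated_questions]

    # Score = mean cosine similarity between original and generated questions.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return float(np.mean([cosine(q_vec, g) for g in g_vecs]))
```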
A measure of how relevant the retrieved context is to the question. It conveys the quality of the retrieval pipeline.
Also known as context relevancy, it measures the signal-to-noise ratio in the retrieved contexts. Given a question, the LLM figures out which sentences from the retrieved context are needed to answer it.
This metric aims to penalise the inclusion of redundant information. The steps used to calculate the metric are:
Please extract relevant sentences from the provided context that can potentially help answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase “Insufficient Information”. While extracting candidate sentences you’re not allowed to make any changes to sentences from given context
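A sketch of how the resulting signal-to-noise ratio could be computed; extract_relevant_sentences is a hypothetical stand-in for an LLM call using the extraction prompt above.

```python
def context_relevancy_score(question: str, context: str, llm) -> float:
    # All sentences in the retrieved context (naive split for illustration).
    context_sentences = [s for s in context.split(".") if s.strip()]
    if not context_sentences:
        return 0.0

    # Ask the LLM to extract only the sentences needed to answer the question;
    # it may return "Insufficient Information" if nothing is relevant.
    relevant = llm.extract_relevant_sentences(question=question, context=context)  # hypothetical helper
    if not relevant or relevant == "Insufficient Information":
        return 0.0

    # Signal-to-noise ratio: relevant sentences over all retrieved sentences.
    return len(relevant) / len(context_sentences)
```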
Amazon SageMaker Clarify (SM Clarify) allows you to evaluate and compare foundation models. FMEval is the open-source package for SageMaker Clarify.
The FMEval/SM Clarify library can help to evaluate:
It supports both automatic model evaluation and evaluation by human workers.
The automatic model evaluation metrics implemented by the tool are:
Human workers might evaluate your model on more subjective dimensions such as helpfulness or style.
The table below shows a summary of the metrics you can use for each task with FMEval.
Tasks versus Metrics | Factual knowledge | Semantic robustness | Prompt stereotyping | Toxicity | Accuracy |
---|---|---|---|---|---|
Open-ended generation | x | x | x | x | - |
Text summarization | - | x | - | x | x |
Question and answer | - | x | - | x | x |
Classification | - | x | - | - | x |
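As a minimal sketch, a single sample can be scored with FMEval's factual-knowledge evaluator (usage per the fmeval README; the sample strings below are made up, and full-dataset runs would additionally need a DataConfig and a ModelRunner, e.g. for Bedrock or a SageMaker endpoint):

```python
from fmeval.eval_algorithms.factual_knowledge import (
    FactualKnowledge,
    FactualKnowledgeConfig,
)

# The delimiter lets a target contain several acceptable answers ("<OR>").
eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))

# Score one response: checks whether the target answer appears in the model output.
scores = eval_algo.evaluate_sample(
    target_output="Oslo<OR>Oslo, Norway",
    model_output="The capital of Norway is Oslo.",
)
print(scores)
```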
lighteval
It was originally built on top of the great Eleuther AI Harness (which powers the Open LLM Leaderboard) and also took a lot of inspiration from the amazing HELM, notably for metrics.
Metrics:
Tool | Metrics
---|---
RAGAS | Generation: faithfulness, answer_relevancy; Retrieval: context_precision, context_recall
FMEval | Accuracy, Toxicity, Semantic robustness, Prompt stereotyping
Paper: FELM: Benchmarking Factuality Evaluation of Large Language Models
Blog: Evaluate LLMs and RAG a practical example using Langchain and Hugging Face
LLM evaluation using MLflow: https://mlflow.org/docs/latest/llms/llm-evaluate/index.html
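A small sketch of the MLflow approach, evaluating pre-computed model outputs as a static dataset (the example rows are made up; assumes MLflow's built-in question-answering evaluator):

```python
import mlflow
import pandas as pd

# Pre-computed model outputs evaluated as a static dataset (no model object needed).
eval_data = pd.DataFrame({
    "inputs": ["What is MLflow?"],
    "predictions": ["MLflow is an open-source platform for managing the ML lifecycle."],
    "ground_truth": ["MLflow is an open-source platform for the machine learning lifecycle."],
})

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        model_type="question-answering",  # enables MLflow's built-in QA metrics
    )
    print(results.metrics)
```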