Metrics
Evaluation can be split into three levels:
- Retrieval-Level Metrics

  These measure how relevant the retrieved documents are to the query (a small computation sketch follows this list).

  - Recall@k: Measures whether the correct document appears in the top-k results.
  - Precision@k: Measures the proportion of the top-k results that are relevant.
  - Mean Reciprocal Rank (MRR): Focuses on the rank of the first relevant document.
  - Normalized Discounted Cumulative Gain (NDCG): Gives more credit to relevant documents that appear higher in the list.
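
A minimal sketch of how these retrieval metrics can be computed in plain Python, assuming binary relevance labels; the document IDs, function names, and example values are illustrative rather than taken from any particular evaluation library.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k if k else 0.0


def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


retrieved = ["doc3", "doc1", "doc7", "doc2", "doc9"]   # ranked retriever output
relevant = {"doc1", "doc2"}                            # ground-truth labels
print(recall_at_k(retrieved, relevant, k=5))           # 1.0
print(precision_at_k(retrieved, relevant, k=5))        # 0.4
print(mrr(retrieved, relevant))                        # 0.5 (first hit at rank 2)
print(ndcg_at_k(retrieved, relevant, k=5))             # ≈ 0.65
```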
- Generation-Level Metrics

  These assess the quality of the final output from the LLM, given the retrieved context (a small scoring sketch follows this list).

  - BLEU / ROUGE / METEOR: Compare the generated output with reference texts using n-gram overlap.
  - BERTScore: Uses embeddings instead of raw tokens to compare semantic similarity.
  - GPTScore / LLM-as-a-Judge: Use another LLM to assess answer quality (faithfulness, helpfulness, etc.).
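
As a rough illustration of n-gram overlap scoring, here is a simplified ROUGE-1-style comparison in plain Python. In practice you would rely on maintained packages (e.g. rouge-score, sacrebleu, or bert-score); the `rouge1` helper below is purely illustrative.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict[str, float]:
    """Unigram overlap between a generated answer and a reference answer."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token is matched at most as many times as it
    # occurs in the other text.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge1(
    candidate="The capital of France is Paris.",
    reference="Paris is the capital of France.",
))
```

Real ROUGE/BLEU implementations add tokenization, stemming, higher-order n-grams, and brevity penalties, which this sketch omits for brevity.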
- End-to-End Metrics (Human-Centric)

  Used to assess the system as a whole from the user's perspective (a measurement sketch follows this list).

  - Faithfulness: Is the generated answer actually grounded in the retrieved documents?
  - Helpfulness: Does the response actually address the user's need?
  - Toxicity / Bias: Are responses free of harmful or offensive content?
  - Latency: Total time taken from query to response.
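
End-to-end checks are usually wired around the full pipeline. The sketch below measures query-to-response latency and delegates a faithfulness check to a judge model; `answer_query` and `judge_llm` are hypothetical stand-ins for your own pipeline entry point and LLM client, not real library calls.

```python
import time

# Hypothetical judging prompt; phrasing and rubric are assumptions, not a standard.
FAITHFULNESS_PROMPT = """You are grading a RAG system.
Context:
{context}

Answer:
{answer}

Is every claim in the answer supported by the context? Reply with only
"faithful" or "unfaithful"."""


def evaluate_query(query: str, answer_query, judge_llm) -> dict:
    """Run one query through the pipeline and record latency plus a faithfulness verdict."""
    start = time.perf_counter()
    answer, retrieved_context = answer_query(query)   # run the full RAG pipeline
    latency_s = time.perf_counter() - start           # total query-to-response time

    # judge_llm is assumed to take a prompt string and return a string verdict.
    verdict = judge_llm(FAITHFULNESS_PROMPT.format(
        context="\n\n".join(retrieved_context),
        answer=answer,
    ))
    return {
        "latency_s": latency_s,
        "faithful": verdict.strip().lower() == "faithful",
    }
```

Helpfulness and toxicity/bias are typically measured the same way: either with human raters or with additional judge prompts, aggregated over a fixed evaluation set.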