Datasets and Benchmarks

Evaluation requires standardized datasets with known answers. Here are commonly used ones for RAG:

| Dataset / Benchmark | Task Type | Domain | Description |
| --- | --- | --- | --- |
| Natural Questions (NQ) | Open-domain QA | General | Questions with long and short answers |
| TriviaQA | Open-domain QA | Trivia | Trivia questions paired with evidence documents |
| HotpotQA | Multi-hop QA | Wikipedia | Requires combining facts from multiple documents |
| FEVER | Fact verification | Wikipedia | Verify claims against retrieved evidence |
| ELI5 | Long-form QA | Reddit | Requires lengthy, detailed answers |
| MS MARCO | Passage retrieval | Web | Large-scale IR task; commonly used to test retrievers |
| BEIR Benchmark | Retrieval (multi-task) | Mixed | Heterogeneous suite of 18+ retrieval datasets |

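As a concrete starting point, one of these benchmarks can be pulled straight from the Hugging Face Hub. The sketch below assumes the `datasets` library and the public `hotpot_qa` dataset id with its `distractor` configuration; neither is prescribed by this section, and any of the datasets above could be substituted.

```python
# Minimal sketch: load a RAG benchmark from the Hugging Face Hub.
# Assumes `pip install datasets` and the public `hotpot_qa` dataset id.
from datasets import load_dataset

# HotpotQA's "distractor" configuration bundles gold and distractor paragraphs,
# which is convenient for testing multi-hop retrieval.
hotpot = load_dataset("hotpot_qa", "distractor", split="validation")

example = hotpot[0]
print(example["question"])           # the multi-hop question
print(example["answer"])             # the short gold answer
print(example["supporting_facts"])   # titles/sentence ids the answer relies on
```
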
Notable Tools:

  • BEIR Benchmark: Covers multiple retrieval tasks in diverse domains.

  • OpenAI Evals / LLM-as-a-Judge: Uses a strong LLM such as GPT-4 to grade model responses against a rubric (a minimal sketch follows this list).

  • LangChain + LlamaIndex: Frameworks useful for building custom evaluation pipelines for RAG systems.

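To make the LLM-as-a-Judge idea concrete, here is a minimal sketch that grades a RAG answer for faithfulness on a 1–5 scale. It assumes the `openai` Python SDK, an `OPENAI_API_KEY` in the environment, and the `gpt-4o` model id; the rubric and the integer-only reply format are illustrative choices, not a standard.

```python
# Minimal LLM-as-a-judge sketch (assumes `pip install openai` and OPENAI_API_KEY set).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a RAG system's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Rate the answer's faithfulness to the retrieved context on a 1-5 scale
(5 = fully supported, 1 = contradicted or unsupported).
Reply with a single integer."""

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; any strong instruction-following LLM works
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        temperature=0,  # deterministic grading
    )
    return int(response.choices[0].message.content.strip())

score = judge_faithfulness(
    question="Who wrote The Selfish Gene?",
    context="The Selfish Gene is a 1976 book by the biologist Richard Dawkins.",
    answer="Richard Dawkins wrote The Selfish Gene in 1976.",
)
print(score)  # expected: 5
```
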
Summary

  • Evaluating RAG = evaluating both retrieval and generation quality.

  • Use retrieval metrics (Recall@k, MRR) and generation metrics (ROUGE, BERTScore); a minimal sketch of the retrieval metrics follows this summary.

  • Use benchmarks like NQ, HotpotQA, and BEIR for testing.

  • For nuanced tasks, consider LLM-based evaluation as a modern alternative.
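
For reference, the two retrieval metrics named above can be computed directly from ranked document ids and a relevance mapping. The sketch below is plain Python; the `qrels`/`runs` formats and the toy values are assumptions for demonstration.

```python
# Minimal sketch of Recall@k and MRR over ranked retrieval results.
# `qrels` maps each query id to its set of relevant doc ids; `runs` maps each
# query id to the retriever's ranked list of doc ids (both assumed formats).

def recall_at_k(qrels: dict, runs: dict, k: int) -> float:
    """Fraction of relevant docs found in the top-k results, averaged over queries."""
    scores = []
    for qid, relevant in qrels.items():
        retrieved = set(runs.get(qid, [])[:k])
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores)

def mrr(qrels: dict, runs: dict) -> float:
    """Mean reciprocal rank of the first relevant doc per query (0 if none retrieved)."""
    scores = []
    for qid, relevant in qrels.items():
        rr = 0.0
        for rank, doc_id in enumerate(runs.get(qid, []), start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        scores.append(rr)
    return sum(scores) / len(scores)

# Toy example: two queries with relevance judgments and ranked retrievals.
qrels = {"q1": {"d1", "d3"}, "q2": {"d7"}}
runs = {"q1": ["d3", "d9", "d1"], "q2": ["d2", "d7", "d5"]}
print(recall_at_k(qrels, runs, k=2))  # 0.75
print(mrr(qrels, runs))               # 0.75
```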