Analysis of RAGAS Prompt, Fact Generation, and Evaluation Methodology

This document summarizes the evaluation methodology, fact generation prompts, and result analysis for RAGAS. The goal is to establish a reliable system for assessing semantic and factual alignment between generated and ground truth answers in a structured manner.


Prompt for Evaluation (TP, FP, FN Classification)

Classification Prompt

The prompt classifies each answer statement based on its alignment with the ground truth statements:

  1. True Positive (TP): Statements present in the answer that are directly supported by one or more statements in the ground truth.
  2. False Positive (FP): Statements present in the answer but not directly supported by any statement in the ground truth.
  3. False Negative (FN): Statements in the ground truth that are absent from the answer.

This classification is intended to quantify how well the generated response aligns with the essential factual elements of the ground truth.
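As a rough illustration, the counting step behind this classification can be sketched as follows. The function name and input shapes are assumptions for this sketch, not RAGAS internals; they presume an upstream LLM call has already labeled each statement:

```python
# Hypothetical sketch: turn per-statement classification labels into
# TP/FP/FN counts. Assumes the LLM step has already labeled each
# answer statement and marked which ground-truth statements are covered.
def count_classifications(answer_labels, ground_truth_covered):
    """answer_labels: list of 'TP' or 'FP', one per answer statement.
    ground_truth_covered: list of bools, one per ground-truth statement;
    False means the statement is missing from the answer (an FN)."""
    tp = sum(1 for label in answer_labels if label == "TP")
    fp = sum(1 for label in answer_labels if label == "FP")
    fn = sum(1 for covered in ground_truth_covered if not covered)
    return tp, fp, fn
```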

Fact Generation Prompt

The fact generation process analyzes the complexity of each sentence in an answer, breaks each sentence down into fully understandable statements that contain no pronouns, and ensures the resulting statements capture all essential information.

Prompt Structure:

Given a question, an answer, and sentences from the answer, analyze the complexity of each sentence under 'sentences' and break down each sentence into one or more fully understandable statements while ensuring no pronouns are used in each statement. Format the output in JSON.

Example:

  • Question: "Who was Albert Einstein and what is he best known for?"
  • Answer: "He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity and also contributed to quantum mechanics."

Output:

{
  "sentences": [
    {
      "sentence_index": 0,
      "simpler_statements": [
        "Albert Einstein was a German-born theoretical physicist.",
        "Albert Einstein is recognized as one of the greatest and most influential physicists of all time."
      ]
    },
    {
      "sentence_index": 1,
      "simpler_statements": [
        "Albert Einstein was best known for developing the theory of relativity.",
        "Albert Einstein also made important contributions to the development of quantum mechanics."
      ]
    }
  ]
}
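Downstream steps need the decomposed statements as a flat list. A minimal sketch of parsing the model's JSON output into that list, assuming the model returns valid JSON in the shape shown above (the function name is hypothetical):

```python
import json

def extract_statements(model_output: str) -> list:
    """Flatten the fact-generation JSON into a list of standalone statements.
    Assumes the model returned valid JSON with the 'sentences' /
    'simpler_statements' structure shown in the example above."""
    data = json.loads(model_output)
    statements = []
    for sentence in data["sentences"]:
        statements.extend(sentence["simpler_statements"])
    return statements
```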

Methodology for Answer Correctness

Answer correctness is determined by two critical factors:

  1. Factual Correctness: The overlap of facts between the ground truth and generated answers.

    • True Positive (TP): Facts present in both the ground truth and the generated answer.
    • False Positive (FP): Facts in the generated answer but not in the ground truth.
    • False Negative (FN): Facts in the ground truth but missing in the generated answer.
    • F1 Score Calculation: Used to quantify the accuracy of factual overlap.
  2. Semantic Similarity: Measures the alignment in meaning between the answers.

    • Steps:
      1. Vectorize both ground truth and generated answers.
      2. Compute cosine similarity between the vectors.
  3. Final Score Calculation:

    • A weighted formula combining factual and semantic scores:
      Final_score = factual_score_weight * F1Score + semantic_similarity_score_weight * semantic_similarity_score
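The steps above can be sketched end to end as follows. The weights (0.75 / 0.25) and function names are illustrative assumptions for this sketch, not values stated in this document, and the cosine similarity is computed here with plain Python rather than an embedding library:

```python
def f1_score(tp, fp, fn):
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no statements.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors (step 2 above).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def answer_correctness(tp, fp, fn, answer_vec, truth_vec,
                       factual_weight=0.75, semantic_weight=0.25):
    # Weighted combination of factual F1 and semantic similarity;
    # the default weights here are assumptions for illustration.
    return (factual_weight * f1_score(tp, fp, fn)
            + semantic_weight * cosine_similarity(answer_vec, truth_vec))
```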

TP + FP Prompt

Given a ground truth and an answer, the prompt classifies each answer statement as:

  • TP (True Positive): If supported by the ground truth.
  • FP (False Positive): If not supported by the ground truth.

Prompt Format:

{
  "reason": "<reason>",
  "category": "<return 1 if true positive and 0 if false positive>"
}
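A small sketch of mapping this response format back to a TP/FP label, assuming the model returns valid JSON and that "category" arrives as "1" or "0" (the function name is hypothetical):

```python
import json

def parse_classification(response: str) -> str:
    # Map the model's {"reason", "category"} JSON (format shown above)
    # to a TP/FP label. Assumes "category" is returned as "1" or "0".
    obj = json.loads(response)
    return "TP" if str(obj["category"]) == "1" else "FP"
```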

False Negative Prompt

This prompt identifies statements in the ground truth not represented in the answer, providing insight into missing critical information.


Evaluation Results for RAGAS

  1. Average Absolute Difference between scores: 0.8607
  2. Score Bracket Accuracy: 64.08%
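For reproducibility, the two metrics above can be computed from paired (predicted, reference) scores roughly as follows. The bracket definition here (same fixed-width bin) is an assumption, since the document does not spell it out:

```python
def average_absolute_difference(predicted, reference):
    # Mean absolute difference between predicted and reference scores.
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)

def score_bracket_accuracy(predicted, reference, bracket_size=1.0):
    # Assumed definition: two scores "agree" when they fall in the same
    # fixed-width bracket; bracket_size is an assumption, not from the doc.
    same = sum(1 for p, r in zip(predicted, reference)
               if int(p // bracket_size) == int(r // bracket_size))
    return same / len(predicted)
```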

Link to Experiment File: RAGAS Evaluation Results


Comparative Analysis with Other Approaches

The iterative fact generation and TP+FP analysis prompts enable precise evaluation of factual overlap by distinguishing supported from unsupported statements. False negatives help identify missing information, which can be further analyzed to assess the completeness of generated answers. This methodology improves accuracy compared to simple semantic or factual checks alone.

Next Steps for Improvement

A notable area for improvement is developing a method to create a standalone answer from the generated response, one that retains semantic and factual fidelity to the ground truth. This would improve the system's ability to self-assess responses independently.