Evaluation of Retrieval Accuracy Using Different Prompts and Models

This evaluation assesses retrieval performance on a discrete annotated synthetic dataset (0-1 scoring), using the Queryloop retrieval evaluation prompt and the G-eval scoring prompt across four GPT-4 model variants.

Evaluation Prompts and Methodology

1. Queryloop Retrieval Evaluation Prompt

This prompt is designed to determine if the retrieved context contains complete information to generate a provided golden answer. The evaluation follows these criteria:

  • Answer "yes" if the context contains all necessary information to fully support the golden answer.
  • Answer "no" if the context lacks complete relevant information.

Prompt (Revised Version):

You are an expert evaluation system for a question answering chatbot.

You are given the following information:
- a golden answer,
- a context (information to answer the user's query)
Your task is to evaluate if the context contains the information needed to generate the provided golden answer.
Look closely at the context provided and follow a step-by-step approach looking for each component of the question.

You have two options to answer.
Answer "no" if the context does not contain COMPLETE relevant information to generate the golden answer provided.
Answer "yes" if the context contains complete information to generate the golden answer provided.

MUST FOLLOW GUIDELINE: Make sure to answer only in "yes", and "no". Don't add anything extra in your Answer.

GOLDEN ANSWER: {golden_answer}

CONTEXT:
{context}

Answer:
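
For reference, below is a minimal sketch of how this prompt could be run as an automated judge using the OpenAI Python SDK. The function name queryloop_retrieval_judge, the choice of gpt-4o as the judge model, and the abbreviated prompt constant are illustrative assumptions, not the exact harness used for these experiments.

```python
# Minimal sketch (assumed harness, not the original): run the Queryloop
# retrieval prompt as an LLM judge and map its "yes"/"no" answer to 1/0.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The revised prompt shown above, abbreviated here; {golden_answer} and
# {context} are filled in per evaluation sample.
QUERYLOOP_PROMPT = (
    "You are an expert evaluation system for a question answering chatbot.\n"
    "...\n"  # full instruction text as written above
    "GOLDEN ANSWER: {golden_answer}\n\n"
    "CONTEXT:\n{context}\n\n"
    "Answer:"
)


def queryloop_retrieval_judge(golden_answer: str, context: str,
                              model: str = "gpt-4o") -> int:
    """Return 1 if the judge answers 'yes', 0 if it answers 'no'."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{
            "role": "user",
            "content": QUERYLOOP_PROMPT.format(
                golden_answer=golden_answer, context=context
            ),
        }],
    )
    answer = response.choices[0].message.content.strip().lower()
    return 1 if answer.startswith("yes") else 0
```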

2. G-eval Prompt for Retrieval Evaluation

This prompt provides a structured approach to evaluate whether the retrieved context directly and completely discusses the golden answer. The evaluation process involves identifying key facts from the golden answer and assessing if these are directly mentioned in the context. A binary score (0 or 1) is assigned based on this assessment.

G-eval Prompt:

You are given a golden answer and a retrieved context. Your task is to evaluate whether the retrieved context directly and completely discusses the golden answer. You will give a score of 0 or 1 based on the quality of the retrieval:

Evaluation Criteria:
Score 0: The retrieved context does not directly or completely discuss the golden answer.
Score 1: The retrieved context directly and completely discusses the golden answer.

Evaluation Steps:
Identify Key Elements: Identify and extract the key facts from the golden answer.
Check for Mentions: Check if these key elements are mentioned in the retrieved context.
Assess Completeness: Determine if the retrieved context completely discusses all key elements of the golden answer.
Assign Score: Based on the above steps, assign a score of 0 or 1:
Score 0: The retrieved context does not directly or completely discuss the golden answer.
Score 1: The retrieved context directly and completely discusses the golden answer.
Explain Score: Provide an explanation for the score you assigned.

Your answer should follow this JSON format:
{
key_facts: list of extracted key facts from the answer,
explanation: explanation of score,
score: assigned score
}
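
As with the Queryloop prompt, a short sketch of how this prompt's JSON verdict might be requested and parsed is shown below. The function name geval_retrieval_judge, the way the golden answer and retrieved context are appended to the prompt, and the use of JSON response mode are assumptions made for illustration.

```python
# Minimal sketch (assumed, not the original harness): run the G-eval prompt
# and parse its JSON verdict (key_facts, explanation, score).
import json

from openai import OpenAI

client = OpenAI()

# The G-eval prompt shown above, abbreviated, with the two inputs appended at
# the end (how they are attached is an assumption for this sketch).
G_EVAL_PROMPT = (
    "You are given a golden answer and a retrieved context. ...\n"  # full text above
    "\nGOLDEN ANSWER: {golden_answer}\n\nRETRIEVED CONTEXT:\n{retrieved_context}\n"
)


def geval_retrieval_judge(golden_answer: str, retrieved_context: str,
                          model: str = "gpt-4o") -> dict:
    """Return the parsed verdict with key_facts, explanation, and a 0/1 score."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},  # ask for well-formed JSON back
        messages=[{
            "role": "user",
            "content": G_EVAL_PROMPT.format(
                golden_answer=golden_answer,
                retrieved_context=retrieved_context,
            ),
        }],
    )
    verdict = json.loads(response.choices[0].message.content)
    verdict["score"] = int(verdict.get("score", 0))  # normalise to a 0/1 int
    return verdict
```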

Evaluation Results Summary

The experiments compared four GPT-4 model variants using both prompts: GPT-4-turbo-preview, GPT-4, GPT-4o, and GPT-4o-mini. The following confusion matrices summarize their performance in evaluating retrieved contexts:

Queryloop Retrieval Prompt Results

| Model | True Positive | True Negative | False Positive | False Negative |
| --- | --- | --- | --- | --- |
| GPT-4-turbo-preview | 11 | 12 | 4 | 1 |
| GPT-4 | 11 | 9 | 7 | 1 |
| GPT-4o | 10 | 16 | 0 | 2 |
| GPT-4o-mini | 10 | 16 | 0 | 2 |

G-eval Retrieval Prompt Results

| Model | True Positive | True Negative | False Positive | False Negative |
| --- | --- | --- | --- | --- |
| GPT-4-turbo-preview | 10 | 16 | 0 | 2 |
| GPT-4 | 11 | 12 | 3 | 1 |
| GPT-4o | 9 | 16 | 0 | 3 |
| GPT-4o-mini | 10 | 16 | 0 | 2 |
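
To make the rows easier to compare, the sketch below (not part of the original write-up) turns one row of confusion-matrix counts into standard summary metrics; the GPT-4o row of the Queryloop table is used as the worked example.

```python
# Helper for summarising a confusion-matrix row as accuracy/precision/recall/F1.
def summarize(tp: int, tn: int, fp: int, fn: int) -> dict:
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Queryloop prompt, GPT-4o row: TP=10, TN=16, FP=0, FN=2
print(summarize(10, 16, 0, 2))
# -> accuracy ~0.93, precision 1.00, recall ~0.83, f1 ~0.91
```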

Analysis

Observations

  1. Queryloop Retrieval Prompt:

    • False Positives: There is a trend of false positives in cases where the context partially or indirectly covers the golden answer but does not fully meet the completeness requirement. To mitigate these misclassifications, the prompt was slightly modified to make the need for complete information explicit.
    • Model Consistency: GPT-4o and GPT-4o-mini showed high consistency in scoring, particularly in detecting true positives and negatives with minimal false positives.
  2. G-eval Prompt:

    • Score Explanation: This prompt provides a structured output in JSON, which includes explanations and key facts from the golden answer. The explanations give insights into the reasoning behind the assigned scores.
    • False Positives: False positives were slightly less common with G-eval, suggesting the explicit scoring and structured approach may help minimize these errors.
  • Prompt Clarification: Adding the terms "directly" and "completely" to the Queryloop prompt improved the clarity of the evaluation instructions, reducing false positives where partial information was incorrectly classified as sufficient.
  • Prompt Selection for High Accuracy: The G-eval prompt may be more suitable for cases requiring a structured output and detailed explanation, whereas the Queryloop prompt could be optimized for binary pass/fail scoring with strict adherence to completeness.

Final Remarks

The results indicate that GPT-4o and GPT-4o-mini with the revised Queryloop prompt perform well for strict, binary classification needs. For contexts where partial information might cause ambiguity, the G-eval prompt’s structured approach provides valuable explanations, making it suitable for detailed evaluation tasks.