Evaluation on Annotated Test Dataset

Generation with Continuous Scoring for Retrieval Evaluation

Methodology

We calculated the average absolute difference between the ground-truth weighted-sum log-probability scores and the predicted weighted-sum log-probability scores. The evaluation method with the smallest difference was selected as the final method. If predicted scores fell outside the range [0, 1] (e.g. ratings on a 1-5 scale), they were scaled to [0, 1] before the absolute difference was computed.
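
As a concrete illustration, the selection criterion can be computed as in the minimal Python sketch below. The function and variable names are illustrative, not taken from the original experiments; 1-5 ratings are min-max scaled to [0, 1] via (s - 1) / 4 before the comparison, mirroring the scaling step described above.

def min_max_scale(scores, lo, hi):
    # Linearly map scores from [lo, hi] to [0, 1].
    return [(s - lo) / (hi - lo) for s in scores]

def mean_absolute_difference(ground_truth, predicted, score_range=(0.0, 1.0)):
    # Predicted scores outside [0, 1] (e.g. 1-5 ratings) are scaled first.
    lo, hi = score_range
    if (lo, hi) != (0.0, 1.0):
        predicted = min_max_scale(predicted, lo, hi)
    return sum(abs(g - p) for g, p in zip(ground_truth, predicted)) / len(ground_truth)

# Example: 1-5 ratings become [1.0, 0.75, 0.25] after scaling, so the
# average absolute difference is (0 + 0.25 + 0.25) / 3 = 0.1667.
ground_truth = [1.0, 0.5, 0.0]
predicted_1_to_5 = [5, 4, 2]
print(mean_absolute_difference(ground_truth, predicted_1_to_5, score_range=(1.0, 5.0)))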


Experiment Results

Single IO Prompting with Instructions (Output: 0,1)

Prompt:

You are an expert evaluation system for a question-answering chatbot.

You are given the following information:
- a golden answer,
- context (information to answer the user's query).

Your task is to evaluate if the context contains the information to generate the provided golden answer or not.

"You have two options to answer:
Answer 0 if the context does not contain COMPLETE relevant information to generate the golden answer provided.
Answer 1 if the context contains complete information to generate the golden answer provided.

MUST FOLLOW GUIDELINE: Make sure to answer only in 0, and 1. Do not add anything extra to your answer."
Result                    | Average Absolute Difference | Link to Experiment File
Without log probabilities | 0.2825984252                | Experiment File

Single IO Prompting with Instructions (Output: 1-5)

Prompt:

You will be given an answer and a retrieved context. Rate the retrieved context on retrieval quality.

1. Carefully read the answer provided to fully understand its information.
2. Comprehend the information presented in the retrieved context.
3. Compare the information in the answer with the retrieved context.

Based on your comparison, rate the retrieval quality on a scale of 1 to 5:
- 1: Poor retrieval quality; all information is missing or incorrect.
- 2: Fair retrieval quality; some key points are missing or incorrect.
- 3: Good retrieval quality; most key points are covered, but some details may be missing.
- 4: Very good retrieval quality; almost all key points and details are covered.
- 5: Excellent retrieval quality; all key points and details are directly and completely covered.

Only return the score.
Result                    | Average Absolute Difference | Link to Experiment File
Without log probabilities | 0.13                        | Experiment File

Single CoT Prompting with Instructions (Output: 0,1)

Prompt:

You are an expert evaluation system for a question-answering chatbot.

You are given the following information:
- a golden answer,
- context (information to answer the user's query).

Read and comprehend the golden answer thoroughly.
Identify key points and specific details in the golden answer.
Identify the key points from the golden answer that are present in the context.
Identify the key points from the golden answer that are missing in the context.

If the context contains all the information needed to generate the golden answer, answer 1.
If the context lacks any part of the information needed, answer 0.
Result                    | Average Absolute Difference | Link to Experiment File
Without log probabilities | 0.6548865215                | Experiment File

Single CoT Prompting with Instructions (Output: 1-5)

Prompt:

You will be given an answer and a retrieved context. Rate the retrieved context on retrieval quality.

1. Read and comprehend the golden answer thoroughly.
2. Identify key points and specific details in the golden answer.
3. Identify the key points from the golden answer that are present in the context.
4. Identify the key points from the golden answer that are missing in the context.

Based on completeness and directness, rate the retrieved context on a scale of 1 to 5:
- 1: Poor retrieval quality; all information is missing or incorrect.
- 2: Fair retrieval quality; some key points are missing or incorrect.
- 3: Good retrieval quality; most key points are covered, but some details may be missing.
- 4: Very good retrieval quality; almost all key points and details are covered.
- 5: Excellent retrieval quality; all key points and details are directly and completely covered.
Result                    | Average Absolute Difference | Link to Experiment File
Without log probabilities | 0.22                        | Experiment File

Iterative IO Prompting with Instructions (Final Method)

Fact Generation Prompt:

You are given a text. Your task is to extract unique facts from the text and return them as a list. Each fact should be a concise statement that presents a specific piece of information. The facts should not repeat or overlap in content.

Return your response in the following JSON format:
{
facts: <list of facts>
}

Fact Comparison Prompt:

You are given a fact and a context. Check if the fact is directly and completely discussed in the context. If the fact is fully covered in the context, return 1; otherwise, return 0.
Result                    | Average Absolute Difference | Link to Experiment File
Without log probabilities | 0.10                        | Experiment File
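
A minimal sketch of this iterative pipeline is shown below. It assumes a hypothetical llm(prompt) helper that returns the model's raw text completion (replace it with your own model client), and the two prompts above are elided with placeholders. The final score is the fraction of golden-answer facts covered by the retrieved context.

import json

FACT_GENERATION_PROMPT = "..."  # the fact-generation prompt above
FACT_COMPARISON_PROMPT = "..."  # the fact-comparison prompt above

def llm(prompt: str) -> str:
    # Hypothetical model call; replace with your LLM client of choice.
    raise NotImplementedError

def retrieval_score(golden_answer: str, context: str) -> float:
    # Step 1: extract unique facts from the golden answer.
    response = llm(FACT_GENERATION_PROMPT + "\n\nText:\n" + golden_answer)
    facts = json.loads(response)["facts"]
    if not facts:
        return 0.0

    # Step 2: check each fact against the retrieved context (0 or 1 per fact).
    covered = 0
    for fact in facts:
        verdict = llm(FACT_COMPARISON_PROMPT + "\n\nFact: " + fact + "\nContext: " + context)
        covered += int(verdict.strip() == "1")

    # Final score: fraction of golden-answer facts covered by the context.
    return covered / len(facts)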

Iterative IO Fact Generation with CoT Score Prompting

Fact Generation Prompt:

You are given a text. Your task is to extract unique, self-contained facts from the text. Each extracted fact should be verifiable, and no information should be missed.

Return your response in the following JSON format:
{
facts: <list of facts>
}

Fact Comparison Prompt:

You are given a fact and a context. Think step-by-step to check if the fact is directly and completely discussed in the context. If the fact is fully covered in the context, return 1; otherwise, return 0.

Give output in the following JSON format:
{
justification: <justification whether the given fact is covered by the context>,
relevant_section: <relevant section if present, otherwise empty string>,
decision: <0 or 1>
}
Result                    | Average Absolute Difference | Link to Experiment File
Without log probabilities | 0.16                        | Experiment File
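
Because this variant returns a JSON object rather than a bare 0/1, the per-fact decision has to be parsed out of the completion. A minimal sketch, assuming the completion contains one well-formed JSON object with a "decision" field as specified in the output format above:

import json

def parse_decision(completion: str) -> int:
    # Isolate the JSON object in the completion and read its "decision" field.
    start, end = completion.find("{"), completion.rfind("}") + 1
    payload = json.loads(completion[start:end])
    return int(payload["decision"])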