Evaluation of MSRpar Dataset and Improvements to Queryloop
This document presents a detailed overview of the evaluation methods and prompts used to assess chatbot response faithfulness. We have implemented two evaluation methods—Average Absolute Difference and Score Bracket Accuracy—and introduced an improved Queryloop prompt to provide structured evaluation feedback.
Evaluation Methods
1. Average Absolute Difference Between Scores
In this method, both the predicted scores and ground truth scores are scaled to fall within a range of 0 to 5. The process involves:
- Calculating the absolute difference between each predicted scaled score and its corresponding ground truth score.
- Averaging these absolute differences across all data points to produce a final evaluation score.
Result for MSRpar Dataset: The average absolute difference score was 1.2526.
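A minimal sketch of this computation, assuming both score lists have already been scaled to the 0-5 range (the rescaling itself depends on the original score ranges and is not shown here):

```python
import numpy as np

def average_absolute_difference(predicted_scaled, ground_truth_scaled):
    """Mean absolute difference between score lists already scaled to 0-5."""
    pred = np.asarray(predicted_scaled, dtype=float)
    gt = np.asarray(ground_truth_scaled, dtype=float)
    return float(np.mean(np.abs(pred - gt)))

# Example with three sentence pairs, both sides already on the 0-5 scale:
# average_absolute_difference([5.0, 2.5, 0.0], [4.6, 3.2, 1.0])  # -> approximately 0.7
```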
2. Score Bracket Accuracy
This method categorizes scaled predicted and ground truth scores into three brackets:
- 0-2: Low score bracket.
- 2-4: Moderate score bracket.
- 4-5: High score bracket.
Each prediction is assessed as accurate if it falls within the same bracket as the ground truth score. The final score bracket accuracy is calculated as the percentage of correct bracket predictions.
Result for MSRpar Dataset: The score bracket accuracy was 52.82%.
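A minimal sketch of the bracket logic. Because the brackets above share their endpoints, how a score that falls exactly on 2 or 4 is assigned is an assumption here, not something the original specifies:

```python
def score_bracket(score: float) -> str:
    """Map a 0-5 score to its bracket. Boundary handling (2 and 4) is assumed."""
    if score < 2:
        return "low"       # 0-2
    if score < 4:
        return "moderate"  # 2-4
    return "high"          # 4-5

def score_bracket_accuracy(predicted_scaled, ground_truth_scaled) -> float:
    """Percentage of pairs whose predicted and ground-truth scores share a bracket."""
    matches = [
        score_bracket(p) == score_bracket(g)
        for p, g in zip(predicted_scaled, ground_truth_scaled)
    ]
    return 100.0 * sum(matches) / len(matches)
```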
Improved Queryloop Prompt for Evaluation
Prompt Structure
The revised Queryloop prompt evaluates chatbot answers based on semantic similarity to reference answers. It categorizes each evaluation into one of three verdicts: pass, partially pass, or fail, mapped to numerical scores of 2, 1, and 0, respectively. A sketch of this mapping follows the prompt and example below.
Evaluation Prompt:
You are an expert evaluation system for a question-answering chatbot.
You are given the following information:
- a user query,
- a reference answer or golden answer, and
- a generated answer.
***YOU NEED TO ONLY FOCUS ON INFORMATION PRESENT IN THE REFERENCE ANSWER THAT IS ALSO SEMANTICALLY SIMILAR AND PRESENT IN THE GENERATED ANSWER. YOU CAN IGNORE DIFFERENT WORDING OR TONES***
For feedback, critically observe each element: the query, reference answer, and generated answer. Provide a brief reasoning on whether the generated answer is similar to the reference answer. Evaluate how the two answers convey the same or different message.
* IF THE ACTUAL INFORMATION OF THE REFERENCE ANSWER IS PRESENT IN THE GENERATED ANSWER, the answer is correct, the evaluation passes, and the final verdict is "pass".
* IF THE ACTUAL INFORMATION OF THE REFERENCE ANSWER IS EVEN SLIGHTLY PRESENT IN THE GENERATED ANSWER, the answer is partially correct, the evaluation partially passes, and the final verdict is "partially pass".
* IF THE ACTUAL INFORMATION OF THE REFERENCE ANSWER IS NOT PRESENT IN THE GENERATED ANSWER, the answer is incorrect, the evaluation fails, and the final verdict is "fail".
Your feedback should follow the JSON format:
{
    "evaluation": <your reasoning>,
    "final_verdict": <pass, partially pass, or fail>
}
Example:
question: In the early days, how were the Airbnb founders financing their startup?
golden_answer: The Airbnb founders initially funded themselves by selling breakfast cereal.
answer: They sold cereals.
feedback:
{
    "evaluation": "The generated answer is semantically similar to the reference answer as it also mentions the founders selling cereal to fund their startup. The information in the reference answer is present in the generated answer. Therefore, the evaluation passes.",
    "final_verdict": "pass"
}
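To compare these verdicts against the MSRpar ground truth, the verdict has to be parsed out of the feedback JSON and placed on the 0-5 scale. The sketch below assumes a simple linear rescaling (multiplying the 0-2 verdict score by 2.5); the exact scaling used in the experiments may differ.

```python
import json

VERDICT_TO_SCORE = {"fail": 0, "partially pass": 1, "pass": 2}

def verdict_to_scaled_score(feedback_json: str) -> float:
    """Parse the evaluator's JSON feedback and map its verdict onto the 0-5 scale."""
    feedback = json.loads(feedback_json)
    raw = VERDICT_TO_SCORE[feedback["final_verdict"].strip().lower()]
    return raw * 2.5  # assumed rescaling: 0 -> 0.0, 1 -> 2.5, 2 -> 5.0

# Example, using the feedback shown above:
# verdict_to_scaled_score('{"evaluation": "...", "final_verdict": "pass"}')  # -> 5.0
```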
Evaluation Results
Using this improved prompt, we ran tests on the MSRpar dataset. The mapped scores were used to calculate:
- Average Absolute Difference: 1.2526
- Score Bracket Accuracy: 52.82%
Queryloop Fact Generation and Fact Scoring Prompts
Fact Generation Prompt
For each reference answer, unique, self-contained facts are extracted. The prompt is structured to ensure that:
- Each fact is verifiable.
- No essential information is missed.
- The original text can be reconstructed from the generated list of facts.
Fact Generation Prompt:
You are given a text. Your task is to extract unique self-contained facts from the text. Each extracted fact should be verifiable, and no information should be missed. Make sure that the provided text is reconstructable from the list of facts provided.
Return your response in the following JSON format:
{
    "facts": <list of facts>
}
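A minimal sketch of how this prompt might be wired up. The `call_llm` argument is a placeholder, not part of Queryloop: it stands in for whatever chat-completion client is in use and is assumed to take a system prompt and a user message and return the model's raw text response.

```python
import json

FACT_GENERATION_PROMPT = (
    "You are given a text. Your task is to extract unique self-contained facts "
    "from the text. Each extracted fact should be verifiable, and no information "
    "should be missed. Make sure that the provided text is reconstructable from "
    "the list of facts provided.\n"
    "Return your response in the following JSON format:\n"
    '{\n    "facts": <list of facts>\n}'
)

def generate_facts(reference_answer: str, call_llm) -> list[str]:
    """Extract self-contained facts from a reference answer via the prompt above."""
    response = call_llm(system=FACT_GENERATION_PROMPT, user=reference_answer)
    return json.loads(response)["facts"]
```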
Fact Scoring Prompt
Each extracted fact is evaluated against the generated answer to determine whether it is fully covered in the response. The scoring is binary:
- 1: The fact is fully covered.
- 0: The fact is not fully covered.
A weighted sum of log probabilities is applied when calculating each fact score, and the final fact score is the average of all individual fact scores, resulting in a value between 0 and 1 (one possible implementation is sketched after the prompt below).
Fact Scoring Prompt:
You are given a fact and context(s). Your task is to check if the fact is directly and completely discussed in the context(s). If the fact is fully covered in the context, return 1; otherwise, return 0. Only return the score.
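The weighting scheme is not fully spelled out above, so the sketch below shows one plausible reading: the model's log probabilities for the "1" and "0" tokens are normalized and used to weight the binary scores, giving a soft per-fact score, and the per-fact scores are then averaged. Whether the scoring model exposes these token log probabilities depends on the client being used.

```python
import math

def fact_score_from_logprobs(token_logprobs: dict) -> float:
    """Soft score for one fact from the log probabilities of the "1" and "0" tokens.

    Normalizing p("1") and p("0") and using them to weight the binary scores
    gives an expected score in [0, 1]; this interpretation is an assumption.
    """
    p_one = math.exp(token_logprobs.get("1", float("-inf")))
    p_zero = math.exp(token_logprobs.get("0", float("-inf")))
    return p_one / (p_one + p_zero)

def final_fact_score(per_fact_scores: list) -> float:
    """Average of the individual fact scores; always between 0 and 1."""
    return sum(per_fact_scores) / len(per_fact_scores)
```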
Evaluation Results for Fact-Based Scoring
- Average Absolute Difference: 1.1672
- Score Bracket Accuracy: 56.06%
Summary of Results and Analysis
Results Comparison (both evaluations were run on the MSRpar dataset):
- Average Absolute Difference:
  - Correctness evaluation (improved Queryloop prompt): 1.2526
  - Fact-based scoring: 1.1672
- Score Bracket Accuracy:
  - Correctness evaluation (improved Queryloop prompt): 52.82%
  - Fact-based scoring: 56.06%
Fact-based scoring shows a modest improvement on both metrics, with a lower average absolute difference and a higher score bracket accuracy, suggesting that evaluating individual facts may offer a more precise measure of semantic faithfulness.
Links to Experiment Files: