Faithfulness Evaluation Methodology
Overview
This document outlines the methodology and analysis for evaluating faithfulness in generated chatbot responses by assessing their alignment with reference answers. The evaluation reuses the approach from the retrieval evaluation to keep assessments consistent. However, when facts are extracted from the actual response, including the question alongside the response may alter the context and affect which facts are extracted.
Problem Statement
Because questions are included alongside the actual response facts, the current methodology can introduce contextual shifts during fact generation. The sections below describe the evaluation method and the datasets used, focusing on measuring correctness accuracy.
Dataset: Microsoft Research Paraphrase Corpus (MRPC)
The MRPC dataset consists of 5,801 pairs of sentences extracted from online news sources. Each pair is annotated with a binary label, where:
- 1 indicates that the sentences are paraphrases.
- 0 indicates that the sentences are not paraphrases.
Example Data Points
Sentence 1 | Sentence 2 | Label |
---|---|---|
"The sky is clear and blue today." | "Today's sky is blue and clear." | 1 |
"The sky is clear and blue today." | "The weather is sunny." | 0 |
Issues with the MRPC Dataset
- Discrete Labels: The dataset is annotated with binary (discrete) labels, limiting the granularity of evaluation.
- Close Negative Examples: In some negative pairs, the two sentences are semantically very close, making it challenging even for human annotators to differentiate them accurately.
Additional Dataset: MSRpar Dataset
The MSRpar dataset contains 750 sentence pairs drawn from public datasets and provides continuous similarity scores (0-5), allowing for more fine-grained evaluation. The data fields are:
- Scores: Continuous values between 0 and 5.
- Sentence 1: The first sentence in the pair.
- Sentence 2: The second sentence in the pair.
MSRpar Dataset Example Data Points
Score | Sentence 1 | Sentence 2 |
---|---|---|
4.400 | "The problem likely will mean corrective changes before the shuttle fleet starts flying again." | "He said the problem needs to be corrected before the space shuttle fleet is cleared to fly again." |
0.800 | "The technology-laced Nasdaq Composite Index .IXIC inched down 1 point, or 0.11 percent, to 1,650." | "The broad Standard & Poor's 500 Index .SPX inched up 3 points, or 0.32 percent, to 970." |
Evaluation Prompts
Evaluation on MRPC Dataset
The following prompt is used to evaluate semantic similarity between a chatbot's generated answer and the reference answer:
Evaluation Prompt:
You are an expert evaluation system for a question-answering chatbot.
You are given:
- a user query,
- a reference answer (golden answer), and
- a generated answer.
Focus only on whether the information in the reference answer is also present, at least semantically, in the generated answer, disregarding differences in wording or tone.
For feedback, examine every element critically: query, reference answer, and generated answer. Provide reasoning on whether the generated answer conveys the same message as the reference or a different one. The generated answer does not need to match the wording exactly; it only needs to be semantically similar and convey the same information.
Return one of three results: "pass," "partially-pass," or "fail" based on the following:
- If the actual information in the reference answer is present in the generated answer, return "pass."
- If the information from the reference answer is only partially present in the generated answer, return "partially-pass."
- If the actual information from the reference answer is missing in the generated answer, return "fail."
Example:
Question: In the early days, how were the Airbnb founders financing their startup?
Reference Answer: The Airbnb founders initially funded themselves by selling breakfast cereal.
Generated Answer: They sold their cereals.
Evaluation: The generated answer is semantically similar to the reference answer as it also mentions the founders selling cereal to fund their startup. Therefore, the evaluation passes.
result: pass
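A minimal sketch of how this prompt can be issued to a judge model, assuming the OpenAI chat completions client and `gpt-3.5-turbo` as the judge; the template below condenses the prompt above and is not the exact production wiring.

```python
# Call the judge model with the evaluation prompt and return its verdict string.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_TEMPLATE = """You are an expert evaluation system for a question-answering chatbot.
Compare the reference answer and the generated answer as described in the evaluation
prompt and return one of "pass", "partially-pass", or "fail".

Question: {question}
Reference Answer: {reference}
Generated Answer: {generated}
Result:"""

def judge(question: str, reference: str, generated: str) -> str:
    """Return the judge's verdict for one (question, reference, generated) triple."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(question=question,
                                                    reference=reference,
                                                    generated=generated)}],
    )
    return response.choices[0].message.content.strip()
```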
Synthetic Question Generation
Because the MRPC dataset does not include questions, they are generated synthetically using GPT-3.5: for each data point, Sentence 1 is treated as the reference answer and a question is generated from it, while Sentence 2 plays the role of the chatbot-generated response.
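A minimal sketch of this question-generation step, assuming the OpenAI chat completions client; the exact prompt wording used in the original runs is not documented, so the one below is illustrative only.

```python
# Generate a synthetic question from Sentence 1 (the reference answer).
from openai import OpenAI

client = OpenAI()

def generate_question(reference_answer: str) -> str:
    """Ask GPT-3.5 for a question that is fully answered by the given sentence."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user",
                   "content": "Write a single question that is fully answered by this "
                              f"sentence:\n\n{reference_answer}"}],
    )
    return response.choices[0].message.content.strip()

# Sentence 1 supplies the reference answer; Sentence 2 is treated as the chatbot response.
question = generate_question("The sky is clear and blue today.")
```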
Evaluation Results and Analysis
Correctness Evaluation Scoring
The correctness evaluation returns one of three scores, corresponding to the judge's verdicts:
- 0: Bad response ("fail")
- 1: Moderate response ("partially-pass")
- 2: Best response ("pass")
Assumptions for Evaluation (a sketch of this scoring rule follows the list):
- Positive examples (label 1) should receive a score of 1 or 2.
- Negative examples (label 0) should receive a score of 0.
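The scoring rule above can be expressed as a small helper; the function names below are illustrative only.

```python
# Encode the evaluation assumptions: a positive pair (label 1) counts as correctly
# evaluated when the judge score is 1 or 2; a negative pair (label 0) only when it is 0.
def correctly_evaluated(label: int, score: int) -> bool:
    """Check one judge score (0, 1, or 2) against the MRPC gold label (0 or 1)."""
    return score >= 1 if label == 1 else score == 0

def tally(results):
    """results: iterable of (label, score) pairs -> {label: [correct, incorrect]} counts."""
    counts = {1: [0, 0], 0: [0, 0]}
    for label, score in results:
        counts[label][0 if correctly_evaluated(label, score) else 1] += 1
    return counts
```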
After evaluating 30 data points (15 positive and 15 negative examples), the results were as follows:
Result | Positive Examples | Negative Examples |
---|---|---|
Correctly Evaluated | 15/15 | 7/15 |
Incorrectly Evaluated | 0/15 | 8/15 |
Manual Analysis
- All positive examples were correctly evaluated.
- Misjudged negative examples were often scored as moderate (1) because of the close semantic similarity between Sentence 1 and Sentence 2.
Confusion Matrix
Evaluation | Positive | Negative |
---|---|---|
True | 15 | 7 |
False | 0 | 8 |
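Read together, the judge agrees with the gold label on 15 + 7 = 22 of the 30 pairs, an overall agreement of roughly 73%, and every disagreement involves a negative pair.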
Conclusion and Recommendations
- Correctness Eval Adjustment: When applied to the MRPC dataset, the method would be more effective if it returned binary scores (0 and 1); as it stands, the three-level scoring is better suited to datasets labeled 0, 1, and 2.
- Dataset Quality Concerns: In MRPC, the two sentences in a negative pair can be semantically very close, which may lead to inaccurate evaluations.