Source Return Evaluation Process
This document outlines the methodology, dataset preparation, and evaluation procedures for a Source Return Evaluation task. The task focuses on combining question-answer pairs, assessing the factual correctness of sourced information, and evaluating the generated answers' adherence to ideal response lengths.
Dataset and Data Preparation
- Dataset Creation:
Since no open-source dataset was found that suited this task, we created our own from the Paul Graham question-and-answer dataset. The dataset consists of two files (a chunking sketch follows the list):
- Q&A file: Contains questions and their respective answers, along with smaller chunks of related text.
- Essay Chunk File: Contains chunks of Paul Graham's essays (each chunk is at most 2000 characters).
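A simple character-based splitter is enough to produce the Essay Chunk File. The sketch below is a minimal version that caps chunks at 2000 characters; splitting on paragraph boundaries is an assumption, since the actual chunking logic is not documented.

```python
def chunk_essay(text: str, max_chars: int = 2000) -> list[str]:
    """Split one essay into chunks of at most `max_chars` characters,
    preferring to break on paragraph boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Flush the current chunk if adding this paragraph would exceed the cap.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current.strip())
            current = ""
        # Hard-split any single paragraph longer than the cap.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = f"{current}\n\n{para}" if current else para
    if current.strip():
        chunks.append(current.strip())
    return chunks
```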
Link to Data Files:
- Selecting Most Dissimilar Question and Answer Pairs:
- Step 1: 55 questions from the Q&A file were selected and embedded into a vector database (VectorDB).
- Step 2: A similarity search was performed with top_k=5 to find the closest matches.
- Step 3: The original question was excluded from the top 5, and an LLM Reranker was applied to rank the remaining 4 questions (a sketch of Steps 2-3 follows the reranker prompt below).
- Step 4: A new question was generated by combining the original question with the top-ranked question.
- Step 5: Duplicate pairs were removed, resulting in 39 unique question-answer pairs.
LLM Reranker Prompt:
{
"sorted_list": [index of most relevant question, index of second most relevant, ...]
}
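A minimal sketch of Steps 2-3. Plain NumPy cosine similarity stands in for the VectorDB, and `call_llm(prompt)` is a placeholder for whatever LLM client is used; only the `sorted_list` output schema above comes from the actual reranker prompt, the rest of the prompt wording is illustrative.

```python
import json
import numpy as np

def most_dissimilar_partner(q_index, questions, embeddings, call_llm, top_k=5):
    """For one question: run a top_k similarity search, drop the question itself,
    rerank the remaining candidates with an LLM, and return the top-ranked
    partner question to combine with the original (Step 4)."""
    q = embeddings[q_index]
    # Cosine similarity of this question against all embedded questions.
    sims = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:top_k]                 # top_k=5 closest matches
    candidates = [i for i in top if i != q_index]   # exclude the original question

    prompt = (
        "Rank the following questions by relevance to the original question.\n"
        f"Original question: {questions[q_index]}\n"
        + "\n".join(f"{j}: {questions[i]}" for j, i in enumerate(candidates))
        + '\nReturn JSON: {"sorted_list": [index of most relevant question, ...]}'
    )
    ranking = json.loads(call_llm(prompt))["sorted_list"]
    return candidates[ranking[0]]                   # top-ranked of the remaining 4
```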
- Combining Question and Answer Pairs:
These 39 most dissimilar question-answer pairs were then fed to an LLM to generate new combined question-answer pairs. The goal was to creatively merge the provided questions and answers into a coherent new pair without adding any new information (a sketch follows the prompt below).
Prompt for Combining Question and Answer Pairs:
{
"new_question": <generated question>,
"new_answer": <generated answer>
}
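A sketch of this combination step under the same assumptions: `call_llm` is a placeholder and the prompt text is paraphrased; only the `new_question`/`new_answer` output schema above is verbatim.

```python
import json

COMBINE_PROMPT = """Creatively merge the two question-answer pairs below into one
coherent new pair. Do not add any information that is not already present.

Pair 1:
Q: {q1}
A: {a1}

Pair 2:
Q: {q2}
A: {a2}

Return JSON: {{"new_question": "<generated question>", "new_answer": "<generated answer>"}}"""

def combine_pair(pair1, pair2, call_llm):
    """Merge two (question, answer) tuples into one combined pair."""
    prompt = COMBINE_PROMPT.format(q1=pair1[0], a1=pair1[1], q2=pair2[0], a2=pair2[1])
    result = json.loads(call_llm(prompt))
    return result["new_question"], result["new_answer"]
```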
Creating Data Points and Scoring
From the 39 combined question-answer pairs, 156 data points were created (four per pair), falling into four scoring categories (a generation sketch follows this list):
- Score 1: Complete answer in context, with the correct source quoted.
- Score 0.5: Complete answer not in context, with one quoted source being real but wrong.
- Score 0: Complete answer in context, with the source being real but wrong.
- Score 0: Complete answer in context, with a source that was fabricated by the LLM and is incorrect.
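One way to expand each combined pair into its four scored variants is sketched below; the field names, context layout, and fabricated source string are assumptions for illustration, but the scores mirror the four categories above.

```python
def make_data_points(question, answer, correct_chunk, other_chunk, fake_source):
    """Expand one combined Q&A pair into the four scored variants.
    `correct_chunk` and `other_chunk` are real essay chunks (dicts with an "id");
    `fake_source` is a source string that does not exist in the essay corpus."""
    return [
        # Score 1: answer supported by the context, correct source quoted.
        {"q": question, "a": answer, "context": [correct_chunk],
         "quoted_source": correct_chunk["id"], "score": 1.0},
        # Score 0.5: answer not in the context, quoted source real but wrong.
        {"q": question, "a": answer, "context": [other_chunk],
         "quoted_source": other_chunk["id"], "score": 0.5},
        # Score 0: answer in the context, but a real-yet-wrong source is quoted.
        {"q": question, "a": answer, "context": [correct_chunk, other_chunk],
         "quoted_source": other_chunk["id"], "score": 0.0},
        # Score 0: answer in the context, but the quoted source is fabricated.
        {"q": question, "a": answer, "context": [correct_chunk],
         "quoted_source": fake_source, "score": 0.0},
    ]
```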
Test Data:
Methodology for Evaluation
- Parsing Sources from Responses:
- The sources from both the generated answer and the context chunks were parsed.
- A score of 0 was given when no sources were found in the answer or when the context was not matched correctly.
- For matches, a retrieval evaluation score was calculated (see the sketch below).
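A minimal sketch of the parsing and scoring step. The `Source:` pattern and the matched-fraction score are simplifying assumptions; the exact parsing rules and retrieval-evaluation formula used in the experiment may differ.

```python
import re

def retrieval_score(generated_answer: str, context_chunks: list[dict]) -> float:
    """Parse quoted sources from the generated answer and score them against
    the context chunks."""
    quoted = re.findall(r"Source:\s*(\S+)", generated_answer)
    if not quoted:
        return 0.0                          # no sources found in the answer
    context_ids = {chunk["id"] for chunk in context_chunks}
    matched = [s for s in quoted if s in context_ids]
    if not matched:
        return 0.0                          # quoted sources do not match the context
    return len(matched) / len(quoted)       # fraction of quoted sources that match
```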
- Results Evaluation:
The average absolute difference between the computed scores and the expected scores across the 156 data points was 0.08491803674 (≈ 0.085).
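That figure is the mean absolute gap between the scores produced by the evaluator and the expected scores of the data points; a minimal sketch:

```python
def mean_absolute_difference(computed: list[float], expected: list[float]) -> float:
    """Average absolute gap between computed and expected scores."""
    assert len(computed) == len(expected)
    return sum(abs(c - e) for c, e in zip(computed, expected)) / len(computed)
```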
Link to Experiment File:
Problems Identified
- Multiple Source Scenario:
The current method does not fully cover the scenario where multiple sources are provided and one source is wrong while the others cover the full answer. This can lead to incorrect evaluation.
- Improvements Needed:
- A new scoring function should be designed to better address these complex scenarios.
Response Length Evaluation
The goal is to evaluate whether the generated response adheres to an ideal response length, either "concise" or "detailed."
Previous Method:
Evaluation Process:
- Inputs: The question, the ideal response length (concise or detailed), and the generated answer.
- Prompt: A prompt was created to assess whether the generated response aligns with the ideal response length.
{
"evaluation_result": "1 or 0"
}
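A sketch of this prompt-based judge, again assuming a placeholder `call_llm(prompt)`; the prompt text is paraphrased around the `evaluation_result` schema above.

```python
import json

LENGTH_PROMPT = """Question: {question}
Ideal response length: {ideal_length}
Generated answer: {answer}

Does the generated answer match the ideal response length?
Return JSON: {{"evaluation_result": "1 or 0"}}"""

def length_matches(question, ideal_length, answer, call_llm) -> bool:
    """Previous method: ask an LLM judge whether the answer is as "concise" or
    "detailed" as requested. Returns True when the judge answers "1"."""
    response = call_llm(LENGTH_PROMPT.format(
        question=question, ideal_length=ideal_length, answer=answer))
    return json.loads(response)["evaluation_result"] == "1"
```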
Problems Identified:
- No dataset available for evaluating response length.
- Lack of elaboration on what constitutes "concise" and "detailed" answers.
Improvement:
An improved methodology was designed (sketched after the list below):
- The number of tokens in the golden answer and generated answer was calculated.
- The difference in tokens between the two was determined.
- A threshold was established (25%-30% token difference). If the difference is below the threshold, the answer is considered aligned with the ideal length; otherwise, it is not.
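A minimal sketch of the improved check, assuming whitespace tokenisation as a stand-in for the real tokenizer and measuring the difference relative to the golden answer's length (the exact baseline is left unspecified above):

```python
def length_aligned(golden_answer: str, generated_answer: str, threshold: float = 0.30) -> bool:
    """Improved method: compare token counts of the golden and generated answers
    and accept the answer when their relative difference is within the threshold."""
    golden_tokens = len(golden_answer.split())
    generated_tokens = len(generated_answer.split())
    # Relative difference with respect to the golden answer's length.
    diff = abs(golden_tokens - generated_tokens) / max(golden_tokens, 1)
    return diff <= threshold
```

Passing threshold=0.25 gives the stricter end of the 25%-30% range.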
Link to Results:
Final Rating
To generate the final rating (see the sketch below):
- Scores from retrieval, factual correctness, and response length evaluations are aggregated.
- The final score is scaled between 0 and 10.
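A minimal aggregation sketch; equal weighting of the three component scores is an assumption, and the actual weighting may differ in the notebook.

```python
def final_rating(retrieval: float, factual: float, length: float) -> float:
    """Aggregate the retrieval, factual-correctness, and response-length scores
    (each assumed to be in [0, 1]) and scale the result to 0-10."""
    return round((retrieval + factual + length) / 3 * 10, 2)
```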
Link to Notebook:
Conclusion
The evaluation methodology is designed to assess the factual correctness of responses, the relevance of sources, and adherence to ideal response lengths. However, improvements are needed in handling multiple source scenarios, and new scoring functions should be developed to address such cases. The methodology provides a robust framework for evaluating complex question-answer pairs but requires fine-tuning for edge cases.