Retrieval Accuracy Evaluation

Overview

This document outlines the methodology, dataset creation, and evaluation process used to assess the retrieval accuracy of the AI-powered retrieval system. The focus is on creating an annotated dataset and continuously scoring the relevance of retrieved contexts in order to better evaluate system performance.

Dataset Creation

Data Source

For the retrieval evaluation, we initially faced a lack of open-source datasets suitable for our needs. Specifically, we required a dataset containing multiple large contexts for each question, as well as positive and negative examples for training. To address this gap, we generated our own dataset using a set of publicly available questions and essays.

Discrete Annotated Synthetic Dataset

We used the first 15 questions from Paul Graham's question and answer dataset and incorporated essays from his work. These essays were stored in Pinecone with a chunk size of 2000 characters. The retrieval system then fetched the top 5 documents for each of the 15 questions, and we defined positive and negative examples as follows:

  • Positive Examples: The first two of the five retrieved documents, paired with the golden response.
  • Negative Examples: The last two of the five retrieved documents, paired with the golden response.

This setup resulted in 30 data points, with half classified as positive and half as negative examples. After a manual annotation review, two data points were found to be too ambiguous to classify, reducing the dataset to 28 data points (16 negative, 12 positive). A construction sketch is shown below.

  • Dataset: Paul Graham 15 Repeat Total 30 Current Query Loop Retrieval Eval (Chunk Size 2000)
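
The labeling procedure can be sketched as follows. This is a minimal sketch rather than the exact pipeline: the Pinecone index handle (`index`), the embedding helper (`embed`), and the `qa_pairs` list of (question, golden response) tuples are placeholders.

```python
# Minimal sketch, assuming `index` is a Pinecone index over the 2000-character
# essay chunks, `embed` maps text to a query vector, and `qa_pairs` holds the
# first 15 (question, golden_response) pairs. All three are placeholders, and
# the exact Pinecone client API may differ by version.
dataset = []
for question, golden_response in qa_pairs:
    results = index.query(vector=embed(question), top_k=5, include_metadata=True)
    chunks = [match["metadata"]["text"] for match in results["matches"]]

    # The first two retrieved chunks form the positive example for this question,
    # the last two form the negative example -> 15 questions x 2 = 30 data points.
    dataset.append({"question": question, "contexts": chunks[:2],
                    "golden_response": golden_response, "label": 1})
    dataset.append({"question": question, "contexts": chunks[-2:],
                    "golden_response": golden_response, "label": 0})
```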

Annotated Test Dataset Generation

The initial dataset only supported discrete 0-1 scoring for retrieval evaluation. To improve scoring granularity, we created a continuous scoring system that annotates data according to how much of the golden response is covered by the retrieved context.

Methodology

Golden and Noisy Chunks Creation

We used four consecutive contexts from one of Paul Graham's essays as the "golden context" and supplemented these with three random contexts from different essays.
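
A minimal sketch of this chunk selection, assuming two hypothetical helpers: `essays` maps essay titles to full text and `chunk_essay` splits an essay into 2000-character chunks.

```python
import random

def build_golden_and_noisy(essays, golden_title, chunk_essay, n_golden=4, n_noisy=3):
    """Pick 4 consecutive chunks from one essay and 1 random chunk from each of 3 other essays."""
    golden_chunks = chunk_essay(essays[golden_title])  # assumes the essay yields >= 4 chunks
    start = random.randrange(len(golden_chunks) - n_golden + 1)
    golden = golden_chunks[start:start + n_golden]     # 4 consecutive "golden" chunks

    other_titles = random.sample([t for t in essays if t != golden_title], n_noisy)
    noisy = [random.choice(chunk_essay(essays[t])) for t in other_titles]  # 3 noisy chunks
    return golden, noisy
```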

Golden Response Generation

A question-answer pair was created from the golden context, with the answer drawing partial information from each of the four chunks. We used GPT-4o-mini to generate the golden response.
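
A sketch of the golden-response generation using the OpenAI chat completions API with gpt-4o-mini; the prompt wording here is an assumption, not the exact prompt used.

```python
from openai import OpenAI

client = OpenAI()

def generate_golden_qa(golden_chunks):
    """Ask gpt-4o-mini for a question plus an answer that draws on all four golden chunks."""
    prompt = (
        "Read the four context chunks below and write one question, followed by an "
        "answer that uses partial information from each chunk.\n\n"
        + "\n\n---\n\n".join(golden_chunks)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```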

Fact Extraction from Golden Response

Using the golden response, we asked GPT-4o-mini to extract the unique facts it contains as a list. Two prompt styles were used for this extraction: an input-output (IO) prompt and a chain-of-thought (CoT) prompt.
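
The fact extraction can be sketched in the same way; the two templates below merely stand in for the IO and CoT prompts and are assumptions.

```python
from openai import OpenAI

client = OpenAI()

IO_PROMPT = (
    "List the unique facts stated in the following answer, one fact per line.\n\n"
    "Answer:\n{answer}"
)
COT_PROMPT = (
    "Think step by step about the distinct claims made in the following answer, then "
    "output the unique facts as a list, one fact per line.\n\nAnswer:\n{answer}"
)

def extract_facts(golden_response, template=IO_PROMPT):
    """Extract a list of unique facts from the golden response with gpt-4o-mini."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": template.format(answer=golden_response)}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.lstrip("-*• ").strip() for line in lines if line.strip()]
```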

Scoring Methodology

We employed a Weighted Sum of Log Probabilities approach for scoring, where:

  • A set of predefined scores, S = {s1, s2, ..., sn}, is fixed in advance, and the probability of each score is obtained from the LLM's token log probabilities.

  • The final score is computed as a weighted sum:

    \text{Score} = \sum_{i=1}^{n} s_i \cdot p(s_i)

    For the binary case S = {0, 1}, this reduces to \text{Score} = 0 \times p(0) + 1 \times p(1) = p(1).
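
For a binary fact judge, this score can be computed directly from the token log probabilities that the chat completions API returns when logprobs are requested. The snippet below is a sketch under the assumption that the judge is instructed to answer with a single "0" or "1" token.

```python
import math

from openai import OpenAI

client = OpenAI()

def weighted_binary_score(judge_prompt):
    """Return Score = 0 * p(0) + 1 * p(1), taken from the judge's token log probabilities."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    probs = {entry.token.strip(): math.exp(entry.logprob) for entry in top}
    p0, p1 = probs.get("0", 0.0), probs.get("1", 0.0)
    # Weighted sum per the formula above; renormalising over {0, 1} is an optional refinement.
    return 0.0 * p0 + 1.0 * p1
```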

Fact Scoring Against Contexts

Each extracted fact was checked to determine whether it was fully discussed in a specific context. This check was performed using GPT-4o-mini, and the score was assigned as either 0 (fact not present) or 1 (fact present). The average score for each context was then calculated by summing the individual fact scores and dividing by the total number of facts.
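
Putting the judge and the averaging together, the per-context scoring can be sketched as below; `FACT_CHECK_PROMPT` is an assumed wording, and `judge` can be the `weighted_binary_score` helper sketched above (or any 0/1 judge).

```python
FACT_CHECK_PROMPT = (
    "Context:\n{context}\n\nFact:\n{fact}\n\n"
    "Is the fact fully discussed in the context? Answer with a single digit: 1 for yes, 0 for no."
)

def score_facts_against_context(facts, context, judge):
    """Score every extracted fact against one context and average over all facts."""
    fact_scores = [judge(FACT_CHECK_PROMPT.format(context=context, fact=fact)) for fact in facts]
    average = sum(fact_scores) / len(fact_scores)
    return fact_scores, average
```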

Unique Data Points

We processed seven unique contexts (four golden and three noisy), resulting in multiple evaluations of the retrieved content, with (number of facts × number of unique contexts) + 1 total calls to GPT-4o-mini.

Creating Additional Data Points

Using combinations of the seven unique contexts, we generated 120 additional data points. The combinations started with sets of two contexts and gradually grew until all seven unique contexts were included, covering every subset of size two through seven: C(7,2) + C(7,3) + ... + C(7,7) = 21 + 35 + 35 + 21 + 7 + 1 = 120 combinations.

For each combination, we calculated the combined fact score by applying the logical OR operation between the fact scores of each context in the set. The average score for each combination was calculated by summing the combined scores and dividing by the total number of facts.
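
A sketch of this combination step, assuming `per_context_scores` is a list of seven equal-length 0/1 fact-score vectors (one per unique context), all over the same ordered list of facts.

```python
from itertools import combinations

def combination_scores(per_context_scores):
    """OR-combine fact scores for every subset of 2..7 contexts and average each result.

    Subsets of size 2 through 7 of the 7 unique contexts give
    C(7,2) + ... + C(7,7) = 120 additional data points.
    """
    n_facts = len(per_context_scores[0])
    results = []
    for size in range(2, len(per_context_scores) + 1):
        for combo in combinations(per_context_scores, size):
            combined = [max(scores) for scores in zip(*combo)]  # logical OR per fact
            results.append({
                "n_contexts": size,
                "combined_scores": combined,
                "average": sum(combined) / n_facts,
            })
    return results
```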

Example

Fact Scores Against Two Contexts

  • Context 1 Fact Scores:

    • The shift from desktop to web-based applications significantly impacted startups, developers, and users: 0
    • Web-based applications introduced a more efficient, convenient, and scalable model for software development and usage: 0
    • For startups, the transition to web-based applications allowed for the creation and launch of products with fewer resources: 0
    • (Additional facts omitted for brevity)
  • Context 2 Fact Scores:

    • The shift from desktop to web-based applications significantly impacted startups, developers, and users: 1
    • Web-based applications introduced a more efficient, convenient, and scalable model for software development and usage: 1
    • For startups, the transition to web-based applications allowed for the creation and launch of products with fewer resources: 1
    • (Additional facts omitted for brevity)

Combined Fact Score Calculation

After applying the OR function between the fact scores of the two contexts, we obtained the following combined scores. The average combined score for this set was 0.7857, i.e., roughly 79% of the extracted facts were covered by at least one of the two contexts.

Example of Combined Scores:

  • The shift from desktop to web-based applications significantly impacted startups, developers, and users: 1
  • Web-based applications introduced a more efficient, convenient, and scalable model for software development and usage: 1
  • (Other facts omitted for brevity)

The final result provided an average score, indicating the percentage of the golden response covered by the combined set of contexts.

Conclusion

This retrieval accuracy evaluation framework provides a comprehensive method for assessing retrieval performance, offering both discrete and continuous scoring systems. By combining golden and noisy chunks, generating multiple contexts, and applying fact extraction and scoring, the process ensures a robust and precise evaluation of the AI system's ability to retrieve relevant content.