
Comparison Between Different Methodologies

Fact Generation Methodologies

Prompts Used for Fact Generation

We experimented with two different prompts to generate unique facts from a given text: the IO (input-output) prompt and the COT (chain-of-thought) prompt. The two differ in their instructions and in how they approach fact extraction.

IO Prompt

The IO prompt is designed to extract facts by instructing the model to read a text and return a list of unique facts. Each fact should be a specific and concise statement that does not repeat or overlap with others. The output is returned in JSON format, listing the extracted facts.

Prompt:

"You are given a text. Your task is to extract unique facts from the text and return them as a list. Each fact should be a concise statement that presents a specific piece of information. The facts should not repeat or overlap in content.

Return your response in the following JSON format
{
facts : <list of facts>
}"

COT Prompt

The COT prompt is more comprehensive. It instructs the model to thoroughly read the text, identify key points and main ideas, and extract facts by creating clear, concise statements based on those points. This prompt also returns the facts in a structured JSON format, which includes both the extracted key ideas and the facts.

Prompt:

"You are given a text. Your task is to extract unique facts from the text by following these instructions.
Read the Text Thoroughly:
1. Carefully read the entire text to understand its content.
2. Identify key points and main ideas.
3. Identify Specific Pieces of Information
4. Extract unique facts. Create each unique fact using main ideas and key points in a clear and concise manner.
5. Format the Output

Give output in the following JSON format:
{
key_ideas : <list of key ideas and main points>,
facts : <list of facts>
}"

Results from Fact Generation Experiments

We ran experiments using both the IO and COT prompts on five examples and manually analyzed the results. The findings were as follows:

  • IO Prompt: The facts generated by the IO prompt covered a higher percentage of the original response and were found to be more accurate.
  • COT Prompt: While the COT prompt also generated useful facts, it did not perform as well in terms of coverage and accuracy when compared to the IO prompt.
  • Reference: Facts Generation Comparison (5 Golden Responses, Paul Graham)

Conclusion on Fact Generation

From our analysis, the IO prompt was more effective at extracting accurate facts with better coverage of the original response. Although the COT prompt produced useful facts, it was not as precise in comparison.
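The coverage judgment above was made manually. For readers who want a rough automatic proxy, the sketch below estimates what fraction of sentences in the original response share most of their content words with at least one extracted fact; the tokenisation and the 0.6 threshold are arbitrary illustrative choices, not the procedure used in this analysis.

Example (Python sketch):

import re

def rough_coverage(original_response, facts, threshold=0.6):
    # Illustrative proxy only: fraction of response sentences whose content words
    # are mostly present in at least one extracted fact.
    def content_words(s):
        return {w for w in re.findall(r"[a-z']+", s.lower()) if len(w) > 3}

    sentences = [s for s in re.split(r"(?<=[.!?])\s+", original_response) if s.strip()]
    fact_word_sets = [content_words(f) for f in facts]
    considered = covered = 0
    for sentence in sentences:
        words = content_words(sentence)
        if not words:
            continue
        considered += 1
        best_overlap = max((len(words & fw) / len(words) for fw in fact_word_sets), default=0.0)
        if best_overlap >= threshold:
            covered += 1
    return covered / considered if considered else 0.0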

Fact Scoring Methodologies

Prompts Used for Fact Scoring

We experimented with three prompts for fact scoring, each designed to assess whether a fact is fully discussed in a given context. Each prompt was evaluated both with and without the weighted sum of log probabilities.

IO Prompt for Fact Scoring

The IO prompt for fact scoring checks if a fact is directly and completely discussed in the provided context. If the fact is fully covered, the model returns a score of 1; otherwise, it returns 0.

Prompt:

You are given a fact and a context. Your task is to check if the fact is directly and completely discussed in the context. If the fact is fully covered in the context, return 1; otherwise, return 0.
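A minimal sketch of how this scoring prompt might be called, assuming the OpenAI Python client; the way the fact and context are packed into a single user message, and the fallback to 0 on any answer that does not start with "1", are our assumptions.

Example (Python sketch):

from openai import OpenAI

client = OpenAI()

IO_SCORING_PROMPT = (
    "You are given a fact and a context. Your task is to check if the fact is directly "
    "and completely discussed in the context. If the fact is fully covered in the "
    "context, return 1; otherwise, return 0."
)

def score_fact_io(fact, context, model="gpt-4o-mini"):
    # Hypothetical helper: returns 1 if the model judges the fact fully covered, else 0.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": IO_SCORING_PROMPT},
            {"role": "user", "content": f"Fact: {fact}\n\nContext: {context}"},
        ],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip()
    return 1 if answer.startswith("1") else 0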

COT1 Prompt for Fact Scoring

The COT1 prompt instructs the model to follow a more structured thought process to extract facts from the context. However, its chain of thought was incomplete, which made it less effective for scoring.

Prompt:

You are given a text. Your task is to extract unique facts from the text by following these instructions:
1. Carefully read the entire text to understand its content.
2. Identify key points and main ideas.
3. Identify Specific Pieces of Information.
4. Extract unique facts. Create each unique fact using main ideas and key points in a clear and concise manner.
5. Format the Output."

Give output in the following JSON format:
{
key_ideas : <list of key ideas and main points>,
facts : <list of facts>
}

COT2 Prompt for Fact Scoring

The COT2 prompt is similar to the COT1 prompt but incorporates a more complete chain of thought. Its better-structured approach to identifying facts in the context led to higher scoring accuracy.

Prompt:

You are given a text. Your task is to extract unique facts from the text by following these instructions:
1. Carefully read the entire text to understand its content.
2. Identify key points and main ideas.
3. Identify Specific Pieces of Information.
4. Extract unique facts. Create each unique fact using main ideas and key points in a clear and concise manner.
5. Format the Output.

Give output in the following JSON format:
{
key_ideas : <list of key ideas and main points>,
facts : <list of facts>
}

Results from Fact Scoring Experiments

We ran multiple experiments on four distinct datasets and analyzed the results manually. The following conclusions were drawn:

  • COT2 Prompt: The COT2 prompt provided the best results in terms of fact coverage and accuracy when scoring.
  • COT1 Prompt: The COT1 prompt did not perform as well because it lacked the full chain of thought, which affected its ability to extract facts accurately.
  • IO Prompt: The IO prompt performed well, with little difference between the weighted sum of log probabilities and the simple 0/1 scores, suggesting the model was more confident in this case.
  • Reference: Paul Graham 4 Golden, 3 Noisy, Chunk Size 2000, Continuous (COT2, GPT-4o mini)
  • Reference: Paul Graham 4 Golden, 3 Noisy, Chunk Size 2000, Continuous (IO, GPT-4o mini)
  • Reference: Paul Graham 4 Golden, 3 Noisy, Chunk Size 2000, Continuous (COT1, GPT-4o mini)

Conclusion on Fact Scoring

The COT2 prompt provided the most accurate and consistent results for fact scoring, with minimal difference between the weighted and simple scoring methods, suggesting higher model confidence. The COT1 prompt, by contrast, was less effective because of its incomplete chain-of-thought approach.
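The document does not spell out how the weighted sum of log probabilities is computed. One plausible reading, sketched below under that assumption, is to request token log probabilities for the model's single-token 0/1 answer (which fits the IO scoring prompt) and weight the decision by the probability mass placed on "1" versus "0". The formula, parameter values, and helper names are assumptions, not the authors' exact method.

Example (Python sketch):

import math
from openai import OpenAI

client = OpenAI()

SCORING_PROMPT = (
    "You are given a fact and a context. Your task is to check if the fact is directly "
    "and completely discussed in the context. If the fact is fully covered in the "
    "context, return 1; otherwise, return 0."
)

def soft_fact_score(fact, context, model="gpt-4o-mini"):
    # Assumption: interpret the 'weighted sum of log probabilities' as the normalised
    # probability mass on the token '1' versus '0' in the model's one-token answer.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SCORING_PROMPT},
            {"role": "user", "content": f"Fact: {fact}\n\nContext: {context}"},
        ],
        temperature=0,
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    probs = {entry.token.strip(): math.exp(entry.logprob) for entry in top}
    p_yes, p_no = probs.get("1", 0.0), probs.get("0", 0.0)
    return p_yes / (p_yes + p_no) if (p_yes + p_no) > 0 else 0.0

Comparing this soft score against the plain 0/1 answer is one way to read the "difference between the weighted and simple scores" discussed above.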

Final Takeaways

  • For Fact Generation, the IO prompt outperformed the COT prompt in terms of accuracy and coverage.
  • For Fact Scoring, the COT2 prompt gave the best results, especially when combined with the weighted sum of log probabilities, while the IO prompt showed consistent results with high model confidence.

Final Dataset Documentation

Overview

This document provides details on the creation and structure of the final annotated dataset used for evaluating retrieval accuracy. It includes the prompts used for fact extraction and scoring, as well as the process for fact evaluation.

Dataset: Paul Graham 4 Golden, 3 Noisy, Chunk Size 2000, Continuous Evaluation

Fact Creation Methodology

To ensure comprehensive and verifiable fact extraction, we employed two specific prompts: IO Prompt (Final) and Prompt 2 (Final). These prompts allowed us to create self-contained, unique facts, ensuring no information was lost and that the original text could be reconstructed.

Prompts for Fact Creation

IO Prompt (Final)

This prompt extracts unique, non-overlapping facts from the provided text. Each fact is designed to be concise and to present a specific piece of information without redundancy.

Prompt:

You are given a text. Your task is to extract unique facts from the text and return them as a list. Each fact should be a concise statement that presents a specific piece of information. The facts should not repeat or overlap in content.

Return your response in the following JSON format:
{
facts : <list of facts>
}

Prompt 2 (Final)

This prompt is a more rigorous version, designed to extract self-contained facts that, when combined, allow for the complete reconstruction of the text. Each fact is designed to be verifiable, without omitting any information present in the source text.

Prompt:

You are given a text. Your task is to extract unique self-contained facts from the text. Each extracted fact should be verifiable, and no information should be missed. Make sure that the provided text is reconstructable from the list of facts provided.

Return your response in the following JSON format:
{
facts: <list of facts>
}
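Because Prompt 2 (Final) requires that the source text be reconstructable from the facts, it can be useful to flag source content that no fact mentions. The helper below is a crude, hedged proxy for that check (content-word coverage rather than true reconstruction); the function name and the length-4 cutoff are illustrative assumptions.

Example (Python sketch):

import re

def missing_content_words(source_text, facts):
    # Crude proxy for the 'reconstructable from the facts' requirement:
    # return source content words that appear in none of the extracted facts.
    def content_words(s):
        return {w for w in re.findall(r"[a-z']+", s.lower()) if len(w) > 3}

    fact_words = set()
    for fact in facts:
        fact_words |= content_words(fact)
    return content_words(source_text) - fact_words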

Rationale for Fact Creation Prompts

These prompts were chosen to maximize both coverage and conciseness:

  • IO Prompt (Final) is used when concise, non-overlapping facts are essential.
  • Prompt 2 (Final) provides a more detailed extraction approach, useful for maintaining the integrity of the original text through complete, self-contained facts.

Iterative Fact Evaluation Methodology

Once facts were created, they were evaluated iteratively using a Chain-of-Thought (CoT) approach: each fact was checked against a given context to determine whether it was fully covered, which improved accuracy by analyzing one fact at a time.

Iterative Fact Evaluation Prompt (CoT)

This prompt is structured to facilitate step-by-step evaluation, ensuring that each fact is directly and completely discussed in the context.

Prompt:

You are given a fact and a context. Think step by step to check if the fact is directly and completely discussed in the context. If the fact is fully covered in the context, return 1; otherwise, return 0.

1. Understand the specific information presented in the fact.
2. Read the entire context to grasp its content.
3. Look for sections in the context that relate to the fact.
4. Check if the fact is directly addressed in the context without any omissions.
5. If the fact is completely discussed in the context, return 1.
6. If the fact is not fully covered, return 0.

Give output in the following JSON format:
{
justification: <give your justification here whether or not the given fact is covered by the provided context>,
relevant_section: <relevant section if present otherwise empty string>,
decision: <0 or 1>
}
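Putting the pieces together, the sketch below shows one way the iterative CoT evaluation could drive a retrieval-accuracy number: each fact is checked against each retrieved chunk with the prompt above and counts as covered if any chunk yields decision 1. It assumes the OpenAI Python client with JSON-mode output; the loop structure, message layout, and the "covered by any chunk" aggregation rule are our assumptions rather than the documented pipeline.

Example (Python sketch):

import json
from openai import OpenAI

client = OpenAI()

COT_EVAL_PROMPT = """You are given a fact and a context. Think step by step to check if the fact is directly and completely discussed in the context. If the fact is fully covered in the context, return 1; otherwise, return 0.

1. Understand the specific information presented in the fact.
2. Read the entire context to grasp its content.
3. Look for sections in the context that relate to the fact.
4. Check if the fact is directly addressed in the context without any omissions.
5. If the fact is completely discussed in the context, return 1.
6. If the fact is not fully covered, return 0.

Give output in the following JSON format:
{
justification: <give your justification here whether or not the given fact is covered by the provided context>,
relevant_section: <relevant section if present otherwise empty string>,
decision: <0 or 1>
}"""

def fact_covered(fact, context, model="gpt-4o-mini"):
    # Evaluate a single fact against a single context chunk and return the parsed decision.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": COT_EVAL_PROMPT},
            {"role": "user", "content": f"Fact: {fact}\n\nContext: {context}"},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    result = json.loads(response.choices[0].message.content)
    return int(result.get("decision", 0)) == 1

def retrieval_accuracy(facts, retrieved_chunks, model="gpt-4o-mini"):
    # Fraction of facts covered by at least one retrieved chunk (assumed aggregation rule).
    covered = sum(
        1 for fact in facts
        if any(fact_covered(fact, chunk, model) for chunk in retrieved_chunks)
    )
    return covered / len(facts) if facts else 0.0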

Rationale for the Iterative CoT Evaluation Prompt

The step-by-step structure of the CoT prompt ensures that:

  • Accuracy is maintained by focusing on specific fact-to-context alignment.
  • Justification is provided for each decision, explaining the basis for the scoring.
  • Relevance is confirmed, with specific context sections cited when applicable.