Configuring the Retrieval Module
The Retrieval Module determines how your AI application searches through and retrieves information from your documents. By properly configuring parameters such as Chunk Size, Metric Type, Retrieval Method, Reranker, Top K, and Embedding Model, you can significantly enhance your application's ability to find and utilize relevant information.
Uploading Your Dataset
The first step in configuring the Retrieval Module is uploading the dataset your application will access. Queryloop supports various data types, each with specific characteristics and use cases.
Dataset Types at a Glance
| Type | Format | Best For | Key Features |
| --- | --- | --- | --- |
| Structured | CSV, XLS, XLSX | Financial data, catalogs, logs | Automatic metadata recognition, field-based searches |
| Unstructured | PDF, DOC, TXT | Articles, documentation, free-form text | Full-text semantic search, natural language processing |
| Unstructured with Metadata | Documents with tags | Categorized collections, multi-topic content | Combined semantic and metadata filtering |
Selecting Dataset Type
Structured Data
Structured data refers to information organized in a predefined format with explicit relationships between data points. Examples include CSV and XLS files. Non-textual fields and small text fields, like numerical values, categories, or string tags, are automatically recognized as metadata, facilitating efficient filtering and retrieval.
Best for:
- Financial records and reports
- Product catalogs
- Transaction logs
- Any tabular information
Features:
- Automatic recognition of metadata fields
- Support for both natural language queries and semantic searches
- Enhanced filtering capabilities using metadata fields
Unstructured Data
Unstructured data includes information that does not follow a specific format, such as free-form text in PDF, TXT, and DOC files.
Best for:
- Articles and research papers
- Policy documents
- General documentation
- Any free-form text content
Features:
- Full-text semantic search capabilities
- Natural language understanding of content
- Context-aware information retrieval
Unstructured Data with Metadata
Unstructured data with metadata refers to unstructured data enhanced with additional information, such as tags or fields, that describe key characteristics of the document. For example, a document might have metadata like {'field': 'LLM development'} or {'field': 'Business'}. This metadata helps categorize and filter documents during searches, allowing you to apply specific filters based on these tags (see the sketch after the feature list below).
Best for:
- Categorized document collections
- Multi-topic knowledge bases
- Subject-specific libraries
- Cross-referenced materials
Features:
- Combines semantic search with targeted metadata filtering
- Improved precision for complex or focused queries
- Enhanced categorization and relationship mapping
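As a rough illustration of how metadata narrows the candidate pool before any semantic ranking, here is a minimal sketch. The documents, the 'field' tag, and the helper function are hypothetical; Queryloop performs this filtering internally.

```python
# Hypothetical sketch of metadata filtering ahead of semantic search.
# The documents, the 'field' key, and filter_by_metadata are illustrative,
# not Queryloop's actual implementation.

documents = [
    {"text": "Prompt engineering basics...", "metadata": {"field": "LLM development"}},
    {"text": "Q3 revenue forecast...",       "metadata": {"field": "Business"}},
    {"text": "Fine-tuning embeddings...",    "metadata": {"field": "LLM development"}},
]

def filter_by_metadata(docs, key, value):
    """Keep only documents whose metadata matches the requested tag."""
    return [d for d in docs if d["metadata"].get(key) == value]

# Narrow the candidate pool before any semantic scoring happens.
candidates = filter_by_metadata(documents, "field", "LLM development")
print([d["text"] for d in candidates])  # the two LLM-development documents
```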
Understanding Dataset Limitations
When working with multiple data types, be aware of these compatibility constraints:
✅ Compatible Combinations:
- Unstructured Data Files with Structured Data Files: These can be mixed freely, allowing the system to handle both formats within the same bot.
- Unstructured Data with and without Metadata: You can mix unstructured data files with and without metadata, but the metadata format must remain consistent across all metadata uploads.
❌ Incompatible Combinations:
- Structured Data with Unstructured Data with Metadata: You cannot upload structured data to a bot configured to handle unstructured data with metadata. Currently, the formats and processing methods differ significantly, making them incompatible in a mixed setting.
- Unstructured Data with Metadata on Structured Data Bot: Similarly, you cannot upload unstructured data with metadata to a bot designed to handle structured data only, as the metadata handling and retrieval requirements differ.
- Inconsistent Metadata Formats: If you are working with unstructured data with metadata, all metadata must follow the same format. Mixing different metadata structures within the same bot setup is not supported and will result in errors.
Setting Confidentiality Levels
Control document access by assigning appropriate confidentiality levels:
Public: Documents labeled as public can be accessed by all users, regardless of their privilege level. This setting is suitable for general information that doesn't require restricted access.
Private: Documents marked as private are accessible only to specific users or groups with the necessary permissions. This setting is ideal for sensitive information that needs to be restricted to a defined set of users.
Confidential: This level is for highly sensitive or restricted documents, where access is tightly controlled and limited to users with the highest privileges. It's best used for proprietary, legal, or classified information that must be safeguarded against unauthorized access.
These confidentiality settings ensure that each user interacts only with the data they are authorized to see, enhancing data security.
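To make the privilege model concrete, here is a minimal sketch of level-based filtering. The level ordering and the comparison are assumptions for illustration only, not Queryloop's actual access-control code.

```python
# Illustrative sketch of privilege-based document filtering. The level
# names follow the settings above; the numeric ordering is an assumption.

CLEARANCE = {"public": 0, "private": 1, "confidential": 2}

def visible_documents(docs, user_level):
    """Return only documents at or below the user's privilege level."""
    return [d for d in docs if CLEARANCE[d["confidentiality"]] <= CLEARANCE[user_level]]

docs = [
    {"title": "FAQ", "confidentiality": "public"},
    {"title": "Salary bands", "confidentiality": "confidential"},
]
print([d["title"] for d in visible_documents(docs, "private")])  # ['FAQ']
```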
Configuring Retrieval Parameters
After uploading your dataset, you'll need to configure six key parameters that determine how information is processed and retrieved.
1. Chunk Size
Quick Selection Guide
| Size | Characters | Best For | Trade-offs |
| --- | --- | --- | --- |
| Tiny | 300 | Precise facts, specific details | May lose broader context |
| Small | 700 | Balanced detail with some context | Moderate context preservation |
| Medium | 1800 | Broader contextual understanding | Less precise for isolated facts |
| Large | 4200 | Comprehensive context, narrative flow | May include irrelevant information |
Selection Guidance:
- For factual Q&A (dates, statistics, definitions): Choose Tiny or Small
- For conceptual understanding (processes, relationships, theories): Choose Medium or Large
Detailed Information
Chunking is the process of breaking long pieces of text into smaller segments, known as chunks. This technique helps in managing and retrieving information more effectively, especially when dealing with lengthy or complex documents.
How Chunking Works
Queryloop uses a recursive character text splitter that prioritizes breaking the text at the paragraph level. If breaking at the paragraph level is not possible, it moves to sentences, and if that also isn't feasible, it breaks at the word level.
Smaller Chunks (Tiny, Small): Capture detailed insights but may lose broader context. Ideal for precise extractions or when dealing with data that requires granular analysis.
Larger Chunks (Medium, Large): Maintain broader context but might miss very fine details. Best for longer, continuous documents or when preserving the overall narrative is crucial.
Choosing the correct chunk size depends on the document type, retrieval needs, and the importance of detail versus context. For shorter documents, opting for larger chunks can help keep the document intact, avoiding unnecessary splitting and context loss.
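Queryloop's splitter is internal, but the recursive strategy described above can be approximated with LangChain's RecursiveCharacterTextSplitter; treat the library choice and the separator list below as illustrative assumptions, not a description of Queryloop's implementation.

```python
# Rough illustration of recursive character splitting using LangChain's
# splitter. Queryloop's own splitter is not exposed, so the library and
# exact separators here are assumptions for demonstration only.
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_document_text = (
    "Queryloop retrieves answers from your documents.\n\n"
    "Chunking splits long text into smaller pieces before embedding.\n\n"
    "Smaller chunks give precision; larger chunks keep context."
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,                      # "Small" in the table above
    chunk_overlap=50,                    # slight overlap helps preserve context
    separators=["\n\n", "\n", " ", ""],  # paragraphs first, then lines, then words
)
chunks = splitter.split_text(long_document_text)
print(len(chunks), chunks[0])
```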
2. Metric Type
Quick Selection Guide
| Metric | How It Works | Best For | Considerations |
| --- | --- | --- | --- |
| Cosine Similarity | Measures angle between vectors | General search, varying document lengths | Not sensitive to magnitude |
| Euclidean Distance | Measures straight-line distance | Exact matching, numerical precision | May bias toward longer documents |
| Dot Product | Multiplies corresponding elements | Applications needing direction and magnitude | Can favor longer vectors |
| Hybrid (Dense + Sparse) | Combines semantic and keyword matching | Complex queries needing both approaches | Computationally heavier |
Selection Guidance:
- For Short, Context-Heavy Queries (e.g., searching conversational or narrative texts): Cosine Similarity is typically sufficient and performs well.
- For Length-Sensitive Data (e.g., comparing reviews or recommendations where length variations carry meaning): Euclidean Distance might better capture differences.
- For Tasks Involving Neural Models or Where Alignment and Magnitude Both Matter: Use the Dot Product to efficiently compare relevance.
- For Diverse Content with Need for Both Semantic Understanding and Keyword Matching: The Hybrid approach will provide the most comprehensive retrieval, particularly when exact keywords and contextual meaning are both critical.
Detailed Information
The metric type determines how similarity is calculated between queries and document segments. Your choice affects which content is deemed relevant to a user's question.
Cosine Similarity:
- What It Does: Measures the angle between two vectors, focusing on direction rather than magnitude. It essentially tells you how similar two pieces of text are, regardless of their length.
- Best For: Comparing the semantic meaning of short texts, sentences, or queries where direction matters more than word count. Ideal for applications like document classification, clustering, and detecting similarities in content with varying lengths.
- Advantages: Robust to the size of the text, meaning longer or shorter texts can be compared without bias towards length.
- Disadvantages: Not sensitive to the absolute scale or length of the vectors, which means it might miss finer differences in content magnitude or emphasis.
Euclidean Distance:
- What It Does: Measures the straight-line distance between two vectors in a multidimensional space, accounting for both magnitude and direction.
- Best For: Use when exact numerical differences matter, such as in recommendation systems where the precise distance reflects the degree of dissimilarity between user preferences.
- Advantages: Captures differences in magnitude and can be useful when exact positional differences between vectors are significant.
- Disadvantages: Sensitive to vector length, meaning it may be biased towards longer texts or documents unless normalized.
Dot Product:
- What It Does: Calculates the alignment between vectors by multiplying corresponding elements and summing them up, combining both direction and magnitude information.
- Best For: Applications in neural networks and scenarios where magnitude and direction together define the level of similarity.
- Advantages: Efficient and directly measures similarity, capturing both intensity and alignment of vectors.
- Disadvantages: Can be biased towards longer vectors and does not account for scaling differences between vectors unless adjusted.
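Before turning to the hybrid approach, a small numpy sketch shows how the three dense metrics treat the same pair of vectors; the toy vectors are arbitrary and chosen only to highlight the magnitude effect.

```python
# Sketch comparing the three dense metrics on toy vectors.
import numpy as np

query = np.array([0.9, 0.1, 0.3])
doc   = np.array([1.8, 0.2, 0.6])  # same direction as the query, twice the magnitude

cosine    = query @ doc / (np.linalg.norm(query) * np.linalg.norm(doc))
euclidean = np.linalg.norm(query - doc)
dot       = query @ doc

print(f"cosine:    {cosine:.3f}")    # 1.000 -- direction only, magnitude ignored
print(f"euclidean: {euclidean:.3f}") # > 0   -- penalizes the magnitude gap
print(f"dot:       {dot:.3f}")       # large -- rewards magnitude and alignment
```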
Hybrid (Dense + Sparse Embeddings):
- What It Does: Merges dense embeddings (like those generated by neural networks) with sparse embeddings (such as BM25, a traditional information retrieval algorithm that uses term frequency and inverse document frequency). This combination leverages the strengths of both dense, context-aware models and sparse, keyword-focused models.
- Best For: Scenarios where you need a balance between deep semantic understanding and precise keyword matching. Ideal for mixed-content data where both nuanced meaning and exact term presence are critical.
- Advantages:
- Dense Embeddings: Capture complex semantic relationships, making them great for understanding context, synonyms, and nuanced meanings.
- Sparse Embeddings (BM25): Excel in precise term matching, particularly useful when exact keyword presence is vital (e.g., legal documents or technical queries).
- Hybrid Strength: Offers a robust, balanced retrieval approach that handles both deep semantic connections and exact term matches.
- Disadvantages:
- Computationally heavier due to the integration of both dense and sparse computations.
- May require tuning to balance the contribution of dense vs. sparse components based on the nature of the data and queries.
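A hedged sketch of hybrid scoring follows, fusing stand-in dense scores with BM25 scores from the third-party rank_bm25 package. The min-max normalization and the alpha weight are assumptions; Queryloop's actual fusion method is internal.

```python
# Hedged sketch of hybrid scoring: stand-in dense scores fused with BM25.
# The normalization and alpha weight are illustrative assumptions.
import numpy as np
from rank_bm25 import BM25Okapi

corpus = ["the cat sat on the mat", "stocks rallied after earnings", "a cat chased a mouse"]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "cat on mat"
sparse = np.array(bm25.get_scores(query.split()))  # keyword (sparse) scores

# Stand-in dense scores; in practice these come from an embedding model.
dense = np.array([0.82, 0.05, 0.40])

def minmax(x):
    """Scale scores to [0, 1] so the two signals are comparable."""
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5  # weight on dense vs. sparse; tune per dataset
hybrid = alpha * minmax(dense) + (1 - alpha) * minmax(sparse)
print(corpus[int(np.argmax(hybrid))])  # best hybrid match: the cat/mat document
```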
3. Retrieval Method
Quick Selection Guide
| Method | Description | Best For | Limitations |
| --- | --- | --- | --- |
| Basic | Simple vector similarity retrieval | Straightforward queries, direct matches | Less effective for complex questions |
| Chunk Window | Includes adjacent text chunks for context | Long-form content, narrative flow | Higher computational cost |
| Paraphrasing | Explores multiple query formulations | Queries with multiple interpretations | May introduce irrelevant results |
| HyDE | Creates a hypothetical answer to guide retrieval | Complex, abstract, exploratory questions | Quality depends on hypothetical document |
| Deconstruction | Breaks complex queries into sub-queries | Multi-part questions, detailed analysis | Computationally intensive |
Selection Guidance:
- For straightforward searches or when precision without context is sufficient: Choose Basic.
- When maintaining contextual understanding is crucial (e.g., long texts or sequential data): Use Chunk Window.
- To explore different ways of framing a question or when diversity in results is needed: Select Paraphrasing.
- For abstract, open-ended, or complex queries where answers might be indirectly related: Opt for HyDE.
- To tackle multi-part or very complex queries that need a breakdown for accurate retrieval: Go with Deconstruction.
Detailed Information
The retrieval method determines the technique used to find relevant information in your dataset. Different methods excel at different types of queries.
Basic:
- What It Does: Retrieves documents by finding the closest matches in a vector space using similarity metrics like Cosine Similarity.
- Best For: General searches where precise matching is needed without any special contextual requirements.
- Advantages: Simple and efficient; works well for straightforward queries and data.
- Disadvantages: Can miss context-specific details as it focuses purely on direct similarity without additional context. Not ideal for nuanced or complex information retrieval.
Chunk Window:
- What It Does: This method vectorizes individual chunks of text but keeps track of the preceding and following chunks. After retrieval, it appends these adjacent chunks to provide more context to the retrieved content.
- Best For: Situations where understanding the broader context of a text is important, such as in technical documents, articles, or long-form content.
- Advantages: Enhances comprehension by preserving the flow of information around the retrieved chunk, making responses more coherent.
- Disadvantages: Slightly increases computational load as it processes additional contextual chunks, which might not always be necessary for simpler queries.
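A minimal sketch of the chunk-window idea, assuming a window of one neighbor on each side; the retrieved index stands in for a real vector search.

```python
# Chunk-window sketch: retrieve the best-matching chunk, then stitch in
# its neighbors. `best` stands in for the index a vector search would
# return; the window size of 1 is an assumption.

chunks = ["Intro ...", "Setup steps ...", "The key configuration ...", "Troubleshooting ..."]

def with_window(chunks, i, window=1):
    """Return the retrieved chunk plus `window` neighbors on each side."""
    lo, hi = max(0, i - window), min(len(chunks), i + window + 1)
    return " ".join(chunks[lo:hi])

best = 2  # index the vector search would return for the query
print(with_window(chunks, best))  # setup + key configuration + troubleshooting
```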
Paraphrasing:
- What It Does: Rephrases the query in multiple ways to explore different linguistic expressions and perspectives of the same question, broadening the scope of results.
- Best For: Use when queries might be interpreted in various ways or when looking for diverse answers from the data, such as user-generated content or feedback analysis.
- Advantages: Expands the range of results by capturing varied expressions of the query, improving coverage and recall.
- Disadvantages: Can introduce irrelevant results if not well-tuned, as variations might stray too far from the original intent of the query.
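A sketch of the paraphrasing pattern, with a hypothetical generate_paraphrases stand-in for the LLM call and a toy retrieve function in place of the vector search; results are merged and de-duplicated by each document's best score.

```python
# Paraphrase-based retrieval sketch. Both helpers are hypothetical
# stand-ins for the LLM call and the vector search.

def generate_paraphrases(query):
    # Hypothetical LLM call; hard-coded variants for illustration.
    return [query, "how do I reset my password", "password recovery steps"]

def retrieve(query):
    # Stand-in for vector search: returns (doc_id, score) pairs.
    fake_index = {"how do I reset my password": [("doc_7", 0.91), ("doc_2", 0.64)],
                  "password recovery steps":    [("doc_7", 0.88), ("doc_9", 0.71)]}
    return fake_index.get(query, [("doc_2", 0.55)])

best = {}
for variant in generate_paraphrases("forgot my login"):
    for doc_id, score in retrieve(variant):
        best[doc_id] = max(score, best.get(doc_id, 0.0))  # keep each doc's best score

ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # doc_7 surfaces via both paraphrases
```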
HyDE (Hypothetical Document Embedding):
- What It Does: Generates a hypothetical answer or document based on the query and uses this generated text to perform retrieval, essentially searching for documents that are most similar to the hypothetical answer.
- Best For: Complex, open-ended, or abstract questions where direct search might miss relevant information. Great for exploratory searches where you're not exactly sure what the precise answer looks like.
- Advantages: Helps uncover hidden connections by broadening the retrieval to align with the intent rather than just the words of the query.
- Disadvantages: May introduce noise if the hypothetical document doesn't closely align with the actual relevant data, requiring careful balancing of generation quality.
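A sketch of the HyDE flow; both helpers are hypothetical stand-ins for the LLM and embedding calls that Queryloop wires together internally.

```python
# HyDE sketch: embed a hypothetical answer instead of the raw query.
# generate_hypothetical_answer and embed are hypothetical stand-ins.
import numpy as np

def generate_hypothetical_answer(query):
    # Stand-in for an LLM call that drafts a plausible answer.
    return "Chunk overlap preserves context across boundaries by repeating text."

def embed(text):
    # Toy embedding: seeds a random unit vector from the text (not a real model).
    rng = np.random.default_rng(sum(map(ord, text)))
    v = rng.normal(size=8)
    return v / np.linalg.norm(v)

query = "why does chunk overlap matter?"
hypothetical = generate_hypothetical_answer(query)

# Retrieval then searches for documents nearest the *hypothetical* answer,
# not the original query string.
search_vector = embed(hypothetical)
print(search_vector)
```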
Deconstruction:
- What It Does: Breaks down complex queries into simpler, more manageable sub-queries, enhancing the system's ability to retrieve relevant parts of information individually before recomposing them.
- Best For: Use in scenarios where queries are multi-faceted, layered, or too complex to handle in a single retrieval pass—like legal documents, research papers, or detailed data analysis tasks.
- Advantages: Improves retrieval accuracy by focusing on each element of a complex question, ensuring no aspect is overlooked.
- Disadvantages: Can be computationally intensive, as it requires multiple retrieval passes and reassembly, making it less efficient for simple queries.
4. Reranker
Quick Selection Guide
| Reranker | Function | Best For | Trade-offs |
| --- | --- | --- | --- |
| None | Uses initial retrieval ranking only | Simple queries, efficiency-focused | May miss nuanced relevance |
| Maximal Marginal Relevance (MMR) | Balances relevance with diversity | Exploration, overview generation | May exclude similar but relevant results |
| Cohere Rerank | Uses cross-encoders for joint analysis | Detailed matching, precise relevance | Computationally intensive |
| LLM Rerank | Leverages a language model for nuanced understanding | Complex queries requiring deep comprehension | Resource-intensive, potentially slower |
Selection Guidance:
- For quick, straightforward tasks where relevance is the only priority: Choose None.
- To maintain a diverse set of results that still align with the query: Opt for Maximal Marginal Relevance (MMR).
- When deep, context-rich alignment between query and results is critical: Use Cohere Rerank.
- For the most advanced, nuanced reranking with high accuracy and contextual fit: Go with LLM Rerank.
Detailed Information
No reranker:
- What It Does: Focuses solely on retrieving the most relevant results based on the initial query without any additional post-processing or reordering.
- Best For: Simple use cases where the initial search quality is sufficient and you want the most straightforward, computationally efficient approach.
- Advantages: Fast and efficient since it involves minimal processing beyond the initial retrieval. It's straightforward and reliable for direct, relevance-focused tasks.
- Disadvantages: Does not consider diversity or deeper contextual alignment, which can result in repetitive or narrowly focused results.
Maximal Marginal Relevance (MMR):
- What It Does: Balances relevance and diversity by iteratively selecting results that are both highly relevant to the query and distinct from previously selected items. This approach helps to ensure a varied set of results, reducing redundancy.
- Best For: Use in scenarios where you need a balanced set of results that capture different aspects of a query, such as when dealing with news aggregation, content curation, or diverse information needs.
- Advantages: Prevents the retrieval of duplicates and overly similar results, offering a broader perspective on the query topic.
- Disadvantages: The balance between relevance and diversity can sometimes dilute the precision of the most directly relevant results if not carefully tuned.
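A minimal MMR implementation in numpy, assuming unit-normalized vectors so dot products act as cosine similarities; the lambda weight shown is a common starting point, not a Queryloop default.

```python
# Minimal MMR sketch. lambda_ trades relevance against diversity;
# the default of 0.7 is a common starting point, not a Queryloop value.
import numpy as np

def mmr(query_vec, doc_vecs, k=3, lambda_=0.7):
    """Iteratively pick docs that are relevant yet dissimilar to prior picks."""
    sim_q = doc_vecs @ query_vec   # relevance to the query
    sim_d = doc_vecs @ doc_vecs.T  # doc-to-doc similarity
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            diversity = max(sim_d[i][j] for j in selected) if selected else 0.0
            return lambda_ * sim_q[i] - (1 - lambda_) * diversity
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

docs = np.array([[0.9, 0.1], [0.88, 0.12], [0.2, 0.9]])  # first two near-duplicates
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
# With diversity weighted heavily, the near-duplicate is skipped: [0, 2]
print(mmr(np.array([1.0, 0.0]), docs, k=2, lambda_=0.3))
```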
Cohere Rerank:
- What It Does: Utilizes a cross-encoder to jointly analyze and reorder search results based on coherence and contextual fit. Unlike traditional methods that assess queries and documents independently, cross-encoders evaluate the interaction between the two, offering a more integrated approach to relevance.
- Best For: Ideal for complex queries where context and detailed matching between the query and results are crucial, such as in academic search, customer support, or detailed content analysis.
- Advantages: Provides a deeper, joint understanding of query-document relevance, improving the quality of ranking through comprehensive analysis of both elements together.
- Disadvantages: Computationally intensive as it processes each query-result pair jointly, which may slow down the reranking for large datasets.
LLM Rerank:
- What It Does: Uses a large language model (LLM) to evaluate and reorder search results, enhancing the accuracy and relevance based on nuanced language understanding. The LLM can assess factors such as coherence, context, and semantic fit, making it highly adaptive to various types of queries.
- Best For: Situations where high-quality, contextually aware reranking is needed, such as in personalized search, advanced document retrieval, and contexts requiring deep semantic understanding.
- Advantages: Offers sophisticated reranking capabilities by leveraging advanced language models that can interpret complex relationships and contextual nuances.
- Disadvantages: Resource-intensive and can be slower than other reranking methods, especially with large volumes of data. Requires significant computational power and can be costly in deployment.
5. Top K
Quick Selection Guide
| Value | Returns | Best For | Considerations |
| --- | --- | --- | --- |
| 1 | Single most relevant document | Precise factual queries | May miss important context |
| 5 | Five most relevant documents | Balanced precision and context | Good for most general questions |
| 10 | Ten most relevant documents | Comprehensive overview | Includes broader context |
| 20 | Twenty most relevant documents | Research, exploration | May include less relevant information |
Selection Guidance:
- For fact-based questions: Lower values (1-5) often provide sufficient information
- For research or exploration: Higher values (10-20) offer broader perspectives
Detailed Information
Top K is a retrieval parameter that limits the number of documents returned from a search. By choosing a value for "K," you directly control the number of results, ensuring that only the most relevant documents are presented.
What It Does: Limits the number of documents retrieved by the search to the top K results, ranked by their relevance to the query. For instance, if you set K = 5, the system will return the five highest-ranked documents.
Best For: Use when you want to streamline the results and focus on the most relevant content, especially in scenarios where too many results would be overwhelming or unnecessary, such as in customer support, product searches, or targeted research.
Advantages:
- Efficiency: Helps manage large volumes of data by only presenting the most pertinent information, reducing noise and irrelevant content.
- Simplicity: Easy to implement and understand, making it ideal for straightforward retrieval tasks where only the top matches matter.
- Control: Gives users direct control over the breadth of results, allowing them to adjust the scope based on the context or complexity of their needs.
Disadvantages:
- Potentially Missed Information: By limiting results, there is a risk of overlooking less relevant but still useful information that lies beyond the top K threshold.
- Context Loss: For highly nuanced or complex queries, restricting results might exclude valuable context that would otherwise be captured in a broader search.
Choosing the Right K:
- Smaller K Values (e.g., K = 1 to 5): Best when precision is critical, and you want the most focused, relevant answers without distraction.
- Larger K Values (e.g., K = 10 or 20): Use when you need a wider range of insights or are exploring more complex queries that benefit from a broader view of relevant documents.
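Conceptually, Top K is just a cut on the ranked score list, as in this small sketch:

```python
# Top-K selection sketch: rank all candidates, keep the K best.
import numpy as np

scores = np.array([0.12, 0.87, 0.45, 0.91, 0.33])  # similarity per document
K = 3
top_k = np.argsort(scores)[::-1][:K]               # indices of the K best
print(top_k, scores[top_k])                        # [3 1 2] [0.91 0.87 0.45]
```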
6. Embedding Model
Quick Selection Guide
| Model | Dimensionality | Characteristics | Best For |
| --- | --- | --- | --- |
| text-embedding-ada-002 | 1536 | Balanced performance and efficiency | General applications, cost-effective deployment |
| text-embedding-3-small | 1536 | Fast, lightweight processing | High-volume queries, latency-sensitive applications |
| text-embedding-3-large | 3072 | Enhanced semantic understanding | Complex reasoning, nuanced semantic relationships |
| Fine-Tuned Model | Varies | Domain-adapted for specific content | Industry-specific terminology, specialized jargon |
Selection Guidance:
- For general use with balanced speed and quality: Choose text-embedding-ada-002.
- For quick, low-latency applications: Use text-embedding-3-small.
- For in-depth analysis where understanding complex semantics is crucial: Opt for text-embedding-3-large.
- For specialized contexts requiring high relevance and adaptation to specific terminology: Go with a Fine-Tuned Model, particularly if you have access to relevant training data.
Detailed Information
Embedding models are essential for transforming text into numerical vectors that capture the semantic meaning of the content. These vectors are then used in various retrieval, ranking, and classification tasks. Different embedding models offer varying levels of performance and specificity, and the choice of model can significantly impact retrieval effectiveness.
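As an illustration of what these models produce, here is a direct call to OpenAI's embeddings endpoint. Queryloop makes such calls for you, so this is not a required step, only a peek at the underlying mechanics; it assumes an OPENAI_API_KEY in the environment.

```python
# Illustration only: what an embedding model returns for a query.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",  # swap for ada-002 or 3-large
    input=["How do I configure chunk size?"],
)
vector = response.data[0].embedding
print(len(vector))  # 1536 dimensions for this model
```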
text-embedding-ada-002:
- Description: A highly efficient and versatile embedding model known for its balance between performance and computational cost. It captures general semantic information, making it suitable for a wide range of retrieval tasks.
- Best For: General-purpose searches, topic modeling, and classification tasks where high quality and speed are needed without significant resource demands.
- Advantages: Fast, accurate, and cost-effective; widely used for various natural language processing tasks.
- Disadvantages: May not capture domain-specific nuances as effectively as more specialized or fine-tuned models.
text-embedding-3-small:
- Description: A lightweight embedding model designed for low-latency environments or applications where speed is prioritized over deep semantic understanding.
- Best For: Scenarios requiring quick responses, like real-time applications or low-resource settings.
- Advantages: Extremely fast and resource-efficient, ideal for high-volume or low-latency tasks.
- Disadvantages: Limited in capturing complex or nuanced text relationships compared to larger models.
text-embedding-3-large:
- Description: A more robust embedding model with increased capacity to understand complex language patterns and relationships.
- Best For: Advanced retrieval tasks, deep semantic analysis, and contexts where a high degree of text comprehension is required.
- Advantages: Offers deeper insight and captures complex language interactions, enhancing retrieval quality in intricate queries.
- Disadvantages: Slower and more resource-intensive, which might not be ideal for high-speed applications.
Fine-Tuned Model:
- Description: This model is tailored specifically to your data by training on uploaded documents. The fine-tuning process enhances the model's ability to understand domain-specific language, concepts, and terminology.
- Best For: Highly specialized retrieval tasks where the general embedding models might miss context-specific nuances, such as industry-specific documents, legal texts, or technical manuals.
- Advantages:
- Domain Adaptation: Fine-tuning allows the model to better align with the language and context specific to your data, improving retrieval accuracy.
- Automatic Training Data Generation: The model leverages your uploaded documents to create training data automatically, streamlining the fine-tuning process without requiring manual data curation.
- Disadvantages: Requires additional training time and computational resources. Effectiveness depends on the quality and diversity of the training data provided.
Optimizing Retrieval Configuration
Finding the optimal retrieval configuration often requires experimentation and fine-tuning.
Best Practices
- Start with defaults for your content type: Queryloop offers recommended starting configurations
- Test with representative queries: Use questions that reflect actual use patterns
- Review retrieved chunks: Examine whether the system finds relevant information
- Iterate methodically: Change one parameter at a time to understand its impact
- Consider performance trade-offs: Balance accuracy against computational resources
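One way to follow these practices is sketched below: vary one parameter at a time against a baseline and compare. The evaluate_config scorer is hypothetical; stand in your own evaluation over representative queries.

```python
# One-at-a-time parameter sweep sketch. evaluate_config is a hypothetical
# stand-in for scoring a configuration against representative queries;
# the values mirror the tables above.

baseline = {"chunk_size": 700, "metric": "cosine", "top_k": 5}
variants = {"chunk_size": [300, 1800], "top_k": [1, 10]}

def evaluate_config(config):
    # Hypothetical scorer: run test queries, rate the retrieved chunks.
    return 0.0

results = {("baseline",): evaluate_config(baseline)}
for param, values in variants.items():
    for value in values:
        trial = {**baseline, param: value}  # vary exactly one parameter
        results[(param, value)] = evaluate_config(trial)

print(max(results, key=results.get))  # best-scoring configuration change
```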
By carefully configuring these parameters, you can significantly enhance your application's ability to find and utilize relevant information, resulting in more accurate, contextual, and helpful responses to user queries.