
Configuring the Retrieval Module

The Retrieval Module determines how your AI application searches through and retrieves information from your documents. By properly configuring components like Chunk Size, Metric Type, Retrieval Method, Reranker, Top K, and Embedding Model, you can significantly enhance your application's ability to find and utilize relevant information.

[Image: Retrieval Module Interface]

Uploading Your Dataset

The first step in configuring the Retrieval Module is uploading the dataset your application will access. Queryloop supports various data types, each with specific characteristics and use cases.

Dataset Types at a Glance

| Type | Format | Best For | Key Features |
| --- | --- | --- | --- |
| Structured | CSV, XLS, XLSX | Financial data, catalogs, logs | Automatic metadata recognition, field-based searches |
| Unstructured | PDF, DOC, TXT | Articles, documentation, free-form text | Full-text semantic search, natural language processing |
| Unstructured with Metadata | Documents with tags | Categorized collections, multi-topic content | Combined semantic and metadata filtering |

Selecting Dataset Type

Structured Data

Structured data refers to information organized in a predefined format with explicit relationships between data points. Examples include CSV and XLS files. Non-textual fields and small text fields, like numerical values, categories, or string tags, are automatically recognized as metadata, facilitating efficient filtering and retrieval.

Best for:

  • Financial records and reports
  • Product catalogs
  • Transaction logs
  • Any tabular information

Features:

  • Automatic recognition of metadata fields
  • Support for both natural language queries and semantic searches
  • Enhanced filtering capabilities using metadata fields

Unstructured Data

Unstructured data includes information that does not follow a specific format, such as free-form text found in documents like PDFs, TXT, and DOC files.

Best for:

  • Articles and research papers
  • Policy documents
  • General documentation
  • Any free-form text content

Features:

  • Full-text semantic search capabilities
  • Natural language understanding of content
  • Context-aware information retrieval

Unstructured Data with Metadata

Unstructured with metadata refers to unstructured data enhanced with additional information, such as tags or fields, that describe key characteristics of the document. For example, a document might have metadata like {'field': 'LLM development'} or {'field': 'Business'}. This metadata helps categorize and filter documents during searches, allowing you to apply specific filters based on these tags.

Best for:

  • Categorized document collections
  • Multi-topic knowledge bases
  • Subject-specific libraries
  • Cross-referenced materials

Features:

  • Combines semantic search with targeted metadata filtering
  • Improved precision for complex or focused queries
  • Enhanced categorization and relationship mapping

Understanding Dataset Limitations

When working with multiple data types, be aware of these compatibility constraints:

Compatible Combinations:

  • Unstructured Data Files with Structured Data Files: These can be mixed freely, allowing the system to handle both formats within the same bot.
  • Unstructured Data with and without Metadata: You can mix unstructured data files with and without metadata, but the metadata format must remain consistent across all metadata uploads.

Incompatible Combinations:

  • Structured Data with Unstructured Data with Metadata: You cannot upload structured data to a bot configured to handle unstructured data with metadata. Currently, the formats and processing methods differ significantly, making them incompatible in a mixed setting.
  • Unstructured Data with Metadata on Structured Data Bot: Similarly, you cannot upload unstructured data with metadata to a bot designed to handle structured data only, as the metadata handling and retrieval requirements differ.
  • Inconsistent Metadata Formats: If you are working with unstructured data with metadata, all metadata must follow the same format. Mixing different metadata structures within the same bot setup is not supported and will result in errors.

Setting Confidentiality Levels

Control document access by assigning appropriate confidentiality levels:

Public: Documents labeled as public can be accessed by all users, regardless of their privilege level. This setting is suitable for general information that doesn't require restricted access.

Private: Documents marked as private are accessible only to specific users or groups with the necessary permissions. This setting is ideal for sensitive information that needs to be restricted to a defined set of users.

Confidential: This level is for highly sensitive or restricted documents, where access is tightly controlled and limited to users with the highest privileges. It's best used for proprietary, legal, or classified information that must be safeguarded against unauthorized access.

These confidentiality settings ensure that each user interacts only with the data they are authorized to see, enhancing data security.

Configuring Retrieval Parameters

After uploading your dataset, you'll need to configure six key parameters that determine how information is processed and retrieved.

1. Chunk Size

Quick Selection Guide

| Size | Characters | Best For | Trade-offs |
| --- | --- | --- | --- |
| Tiny | 300 | Precise facts, specific details | May lose broader context |
| Small | 700 | Balanced detail with some context | Moderate context preservation |
| Medium | 1800 | Broader contextual understanding | Less precise for isolated facts |
| Large | 4200 | Comprehensive context, narrative flow | May include irrelevant information |

Selection Guidance:

  • For factual Q&A (dates, statistics, definitions): Choose Tiny or Small
  • For conceptual understanding (processes, relationships, theories): Choose Medium or Large

Detailed Information

Chunking is the process of breaking long pieces of text into smaller segments, known as chunks. This technique helps in managing and retrieving information more effectively, especially when dealing with lengthy or complex documents.

How Chunking Works

Queryloop uses a recursive character text splitter that prioritizes breaking the text at the paragraph level. If breaking at the paragraph level is not possible, it moves to sentences, and if that also isn't feasible, it breaks at the word level.

Smaller Chunks (Tiny, Small): Capture detailed insights but may lose broader context. Ideal for precise extractions or when dealing with data that requires granular analysis.

Larger Chunks (Medium, Large): Maintain broader context but might miss very fine details. Best for longer, continuous documents or when preserving the overall narrative is crucial.

Choosing the correct chunk size depends on the document type, retrieval needs, and the importance of detail versus context. For shorter documents, opting for larger chunks can help keep the document intact, avoiding unnecessary splitting and context loss.
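
For illustration, a recursive splitter along these lines can be sketched in a few lines of Python. This is a simplified stand-in for Queryloop's splitter, not its actual implementation; the chunk_size argument counts characters:

```python
# Simplified sketch of a recursive text splitter: try paragraph breaks
# first, then sentences, then words. Not Queryloop's actual implementation.
def split_text(text: str, chunk_size: int, separators=("\n\n", ". ", " ")) -> list[str]:
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                piece = part + sep
                if len(current) + len(piece) > chunk_size and current:
                    chunks.append(current.strip())
                    current = ""
                current += piece
            if current.strip():
                chunks.append(current.strip())
            # Recurse on any chunk that is still too large.
            return [c for chunk in chunks for c in split_text(chunk, chunk_size, separators)]
    # No separator worked: hard-split at the character level.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = split_text(open("doc.txt").read(), chunk_size=700)  # "Small" preset
```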

2. Metric Type

Quick Selection Guide

| Metric | How It Works | Best For | Considerations |
| --- | --- | --- | --- |
| Cosine Similarity | Measures angle between vectors | General search, varying document lengths | Not sensitive to magnitude |
| Euclidean Distance | Measures straight-line distance | Exact matching, numerical precision | May bias toward longer documents |
| Dot Product | Multiplies corresponding elements | Applications needing direction and magnitude | Can favor longer vectors |
| Hybrid (Dense + Sparse) | Combines semantic and keyword matching | Complex queries needing both approaches | Computationally heavier |

Selection Guidance:

  • For Short, Context-Heavy Queries (e.g., searching conversational or narrative texts): Cosine Similarity is typically sufficient and performs well.
  • For Length-Sensitive Data (e.g., comparing reviews or recommendations where length variations carry meaning): Euclidean Distance might better capture differences.
  • For Tasks Involving Neural Models or Where Magnitude Also Matters: Use the Dot Product, since it accounts for both vector alignment and magnitude when comparing relevance.
  • For Diverse Content with Need for Both Semantic Understanding and Keyword Matching: The Hybrid approach will provide the most comprehensive retrieval, particularly when exact keywords and contextual meaning are both critical.

Detailed Information

The metric type determines how similarity is calculated between queries and document segments. Your choice affects which content is deemed relevant to a user's question.

Cosine Similarity:

  • What It Does: Measures the angle between two vectors, focusing on the direction rather than their magnitude. It essentially tells you how similar two pieces of text are, regardless of their length.
  • Best For: Comparing the semantic meaning of short texts, sentences, or queries where direction matters more than word count. Ideal for applications like document classification, clustering, and detecting similarities in content with varying lengths.
  • Advantages: Robust to the size of the text, meaning longer or shorter texts can be compared without bias towards length.
  • Disadvantages: Not sensitive to the absolute scale or length of the vectors, which means it might miss finer differences in content magnitude or emphasis.

Euclidean Distance:

  • What It Does: Measures the straight-line distance between two vectors in a multidimensional space, accounting for both magnitude and direction.
  • Best For: Use when exact numerical differences matter, such as in recommendation systems where the precise distance reflects the degree of dissimilarity between user preferences.
  • Advantages: Captures differences in magnitude and can be useful when exact positional differences between vectors are significant.
  • Disadvantages: Sensitive to vector length, meaning it may be biased towards longer texts or documents unless normalized.

Dot Product:

  • What It Does: Calculates the alignment between vectors by multiplying corresponding elements and summing them up, combining both direction and magnitude information.
  • Best For: Applications in neural networks and scenarios where magnitude and direction together define the level of similarity.
  • Advantages: Efficient and directly measures similarity, capturing both intensity and alignment of vectors.
  • Disadvantages: Can be biased towards longer vectors and does not account for scaling differences between vectors unless adjusted.
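
To make the three metrics concrete, here is a small NumPy sketch that computes each one for a toy query vector and document vector (the values are made up for illustration):

```python
import numpy as np

q = np.array([0.2, 0.7, 0.1])   # query embedding (toy values)
d = np.array([0.4, 0.6, 0.3])   # document embedding (toy values)

cosine = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))  # direction only
euclidean = np.linalg.norm(q - d)                         # magnitude + direction
dot = q @ d                                               # unnormalized alignment

print(f"cosine={cosine:.3f}  euclidean={euclidean:.3f}  dot={dot:.3f}")
```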

Hybrid (Dense + Sparse Embeddings):

  • What It Does: Merges dense embeddings (like those generated by neural networks) with sparse embeddings (such as BM25, a traditional information retrieval algorithm that uses term frequency and inverse document frequency). This combination leverages the strengths of both dense, context-aware models and sparse, keyword-focused models.
  • Best For: Scenarios where you need a balance between deep semantic understanding and precise keyword matching. Ideal for mixed-content data where both nuanced meaning and exact term presence are critical.
  • Advantages:
    • Dense Embeddings: Capture complex semantic relationships, making them great for understanding context, synonyms, and nuanced meanings.
    • Sparse Embeddings (BM25): Excel in precise term matching, particularly useful when exact keyword presence is vital (e.g., legal documents or technical queries).
    • Hybrid Strength: Offers a robust, balanced retrieval approach that handles both deep semantic connections and exact term matches.
  • Disadvantages:
    • Computationally heavier due to the integration of both dense and sparse computations.
    • May require tuning to balance the contribution of dense vs. sparse components based on the nature of the data and queries.
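
As a rough illustration of hybrid scoring, the sketch below fuses BM25 keyword scores with dense similarity scores via a weighting parameter. The alpha value, the toy dense scores, and the min-max normalization are assumptions for the example, not Queryloop's internal formula:

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = ["refund policy for damaged goods",
          "shipping times for international orders",
          "warranty coverage and refunds"]
query = "how do refunds work"

# Sparse (keyword) scores via BM25.
bm25 = BM25Okapi([doc.split() for doc in corpus])
sparse = np.array(bm25.get_scores(query.split()))

# Dense (semantic) cosine similarities -- toy values standing in for
# scores from an embedding model.
dense = np.array([0.82, 0.15, 0.74])

def minmax(x):  # put both score ranges on a comparable 0-1 scale
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.6  # weight on the dense component; tune per dataset
hybrid = alpha * minmax(dense) + (1 - alpha) * minmax(sparse)
print(corpus[int(hybrid.argmax())])
```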

3. Retrieval Method

Quick Selection Guide

| Method | Description | Best For | Limitations |
| --- | --- | --- | --- |
| Basic | Simple vector similarity retrieval | Straightforward queries, direct matches | Less effective for complex questions |
| Chunk Window | Includes adjacent text chunks for context | Long-form content, narrative flow | Higher computational cost |
| Paraphrasing | Explores multiple query formulations | Queries with multiple interpretations | May introduce irrelevant results |
| HyDE | Creates a hypothetical answer to guide retrieval | Complex, abstract, exploratory questions | Quality depends on hypothetical document |
| Deconstruction | Breaks complex queries into sub-queries | Multi-part questions, detailed analysis | Computationally intensive |

Selection Guidance:

  • For straightforward searches or when precision without context is sufficient: Choose Basic.
  • When maintaining contextual understanding is crucial (e.g., long texts or sequential data): Use Chunk Window.
  • To explore different ways of framing a question or when diversity in results is needed: Select Paraphrasing.
  • For abstract, open-ended, or complex queries where answers might be indirectly related: Opt for HyDE.
  • To tackle multi-part or very complex queries that need a breakdown for accurate retrieval: Go with Deconstruction.

Detailed Information

The retrieval method determines the technique used to find relevant information in your dataset. Different methods excel at different types of queries.

Basic:

  • What It Does: Retrieves documents by finding the closest matches in a vector space using similarity metrics like Cosine Similarity.
  • Best For: General searches where precise matching is needed without any special contextual requirements.
  • Advantages: Simple and efficient; works well for straightforward queries and data.
  • Disadvantages: Can miss context-specific details as it focuses purely on direct similarity without additional context. Not ideal for nuanced or complex information retrieval.

Chunk Window:

  • What It Does: This method vectorizes individual chunks of text but keeps track of the preceding and following chunks. After retrieval, it appends these adjacent chunks to provide more context to the retrieved content.
  • Best For: Situations where understanding the broader context of a text is important, such as in technical documents, articles, or long-form content.
  • Advantages: Enhances comprehension by preserving the flow of information around the retrieved chunk, making responses more coherent.
  • Disadvantages: Slightly increases computational load as it processes additional contextual chunks, which might not always be necessary for simpler queries.
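
A minimal sketch of the chunk-window idea: after vector search returns a chunk index, the neighboring chunks are stitched back on. The function below is illustrative, not Queryloop's implementation:

```python
def with_window(chunks: list[str], hit_index: int, window: int = 1) -> str:
    """Return a retrieved chunk together with its neighbors.

    Only the center chunk was matched by vector search; the
    surrounding chunks are appended purely for context.
    """
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return " ".join(chunks[start:end])

chunks = ["Intro ...", "Key definition ...", "Worked example ..."]
print(with_window(chunks, hit_index=1))  # returns all three chunks
```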

Paraphrasing:

  • What It Does: Rephrases the query in multiple ways to explore different linguistic expressions and perspectives of the same question, broadening the scope of results.
  • Best For: Use when queries might be interpreted in various ways or when looking for diverse answers from the data, such as user-generated content or feedback analysis.
  • Advantages: Expands the range of results by capturing varied expressions of the query, improving coverage and recall.
  • Disadvantages: Can introduce irrelevant results if not well-tuned, as variations might stray too far from the original intent of the query.
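
A sketch of paraphrasing-based retrieval appears below. The generate and search callables are hypothetical placeholders for an LLM completion call and your vector search, not Queryloop APIs:

```python
# Sketch of query paraphrasing: retrieve for several rewordings of the
# query and merge the results, de-duplicating across variants.
def paraphrase_retrieve(query: str, generate, search, n_variants: int = 3, k: int = 5):
    prompt = f"Rewrite this question {n_variants} different ways, one per line:\n{query}"
    variants = [query] + generate(prompt).splitlines()[:n_variants]
    seen, results = set(), []
    for variant in variants:
        for doc_id, text in search(variant, k=k):  # search yields (id, text) pairs
            if doc_id not in seen:
                seen.add(doc_id)
                results.append((doc_id, text))
    return results
```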

HyDE (Hypothetical Document Embedding):

  • What It Does: Generates a hypothetical answer or document based on the query and uses this generated text to perform retrieval, essentially searching for documents that are most similar to the hypothetical answer.
  • Best For: Complex, open-ended, or abstract questions where direct search might miss relevant information. Great for exploratory searches where you're not exactly sure what the precise answer looks like.
  • Advantages: Helps uncover hidden connections by broadening the retrieval to align with the intent rather than just the words of the query.
  • Disadvantages: May introduce noise if the hypothetical document doesn't closely align with the actual relevant data, requiring careful balancing of generation quality.
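
In sketch form, HyDE swaps the query embedding for the embedding of a generated answer. Here generate, embed, and index are hypothetical placeholders for an LLM call, an embedding model, and a vector index:

```python
# Sketch of HyDE: embed a *generated* hypothetical answer instead of
# the raw query, then search for its nearest neighbors.
def hyde_retrieve(query: str, generate, embed, index, k: int = 5):
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return index.search(embed(hypothetical), k=k)
```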

Deconstruction:

  • What It Does: Breaks down complex queries into simpler, more manageable sub-queries, enhancing the system's ability to retrieve relevant parts of information individually before recomposing them.
  • Best For: Use in scenarios where queries are multi-faceted, layered, or too complex to handle in a single retrieval pass—like legal documents, research papers, or detailed data analysis tasks.
  • Advantages: Improves retrieval accuracy by focusing on each element of a complex question, ensuring no aspect is overlooked.
  • Disadvantages: Can be computationally intensive, as it requires multiple retrieval passes and reassembly, making it less efficient for simple queries.
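
A minimal deconstruction sketch, again with hypothetical generate and search placeholders:

```python
# Sketch of query deconstruction: ask an LLM to split a compound
# question, retrieve for each part separately, then pool the evidence.
def deconstruct_retrieve(query: str, generate, search, k: int = 3):
    prompt = f"Break this question into independent sub-questions, one per line:\n{query}"
    sub_queries = [q.strip() for q in generate(prompt).splitlines() if q.strip()]
    evidence = {}
    for sub in sub_queries:
        evidence[sub] = search(sub, k=k)
    return evidence  # recompose into a final answer downstream
```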

4. Reranker

Quick Selection Guide

| Reranker | Function | Best For | Trade-offs |
| --- | --- | --- | --- |
| None | Uses initial retrieval ranking only | Simple queries, efficiency-focused | May miss nuanced relevance |
| Maximal Marginal Relevance (MMR) | Balances relevance with diversity | Exploration, overview generation | May exclude similar but relevant results |
| Cohere Rerank | Uses cross-encoders for joint analysis | Detailed matching, precise relevance | Computationally intensive |
| LLM Rerank | Leverages a language model for nuanced understanding | Complex queries requiring deep comprehension | Resource-intensive, potentially slower |

Selection Guidance:

  • For quick, straightforward tasks where relevance is the only priority: Choose None.
  • To maintain a diverse set of results that still align with the query: Opt for Maximal Marginal Relevance (MMR).
  • When deep, context-rich alignment between query and results is critical: Use Cohere Rerank.
  • For the most advanced, nuanced reranking with high accuracy and contextual fit: Go with LLM Rerank.

Detailed Information

None (No Reranker):

  • What It Does: Focuses solely on retrieving the most relevant results based on the initial query without any additional post-processing or reordering.
  • Best For: Simple use cases where the initial search quality is sufficient and you want the most straightforward, computationally efficient option.
  • Advantages: Fast and efficient since it involves minimal processing beyond the initial retrieval. It's straightforward and reliable for direct, relevance-focused tasks.
  • Disadvantages: Does not consider diversity or deeper contextual alignment, which can result in repetitive or narrowly focused results.

Maximal Marginal Relevance (MMR):

  • What It Does: Balances relevance and diversity by iteratively selecting results that are both highly relevant to the query and distinct from previously selected items. This approach helps to ensure a varied set of results, reducing redundancy.
  • Best For: Use in scenarios where you need a balanced set of results that capture different aspects of a query, such as when dealing with news aggregation, content curation, or diverse information needs.
  • Advantages: Prevents the retrieval of duplicates and overly similar results, offering a broader perspective on the query topic.
  • Disadvantages: The balance between relevance and diversity can sometimes dilute the precision of the most directly relevant results if not carefully tuned.
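
MMR is simple enough to sketch directly. The version below assumes cosine similarity over unit-normalized embeddings; lam is the relevance-versus-diversity knob:

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=5, lam=0.7):
    """Maximal Marginal Relevance selection (sketch).

    lam=1.0 is pure relevance; lower values trade relevance for diversity.
    """
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_vec = query_vec / np.linalg.norm(query_vec)
    relevance = doc_vecs @ query_vec
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        if selected:
            # Penalize similarity to anything already selected.
            redundancy = (doc_vecs[candidates] @ doc_vecs[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = lam * relevance[candidates] - (1 - lam) * redundancy
        best = candidates[int(scores.argmax())]
        selected.append(best)
        candidates.remove(best)
    return selected  # document indices in MMR selection order
```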

Cohere Rerank:

  • What It Does: Utilizes a cross-encoder to jointly analyze and reorder search results based on coherence and contextual fit. Unlike traditional methods that assess queries and documents independently, cross-encoders evaluate the interaction between the two, offering a more integrated approach to relevance.
  • Best For: Ideal for complex queries where context and detailed matching between the query and results are crucial, such as in academic search, customer support, or detailed content analysis.
  • Advantages: Provides a deeper, joint understanding of query-document relevance, improving the quality of ranking through comprehensive analysis of both elements together.
  • Disadvantages: Computationally intensive as it processes each query-result pair jointly, which may slow down the reranking for large datasets.
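
Queryloop's Cohere Rerank option calls Cohere's hosted model, but the cross-encoder pattern itself can be illustrated with the open-source sentence-transformers library. The sketch below is not Cohere's API; the model name is a commonly used public MS MARCO checkpoint:

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

query = "what is the refund window?"
docs = ["Refunds are accepted within 30 days of purchase.",
        "Our shipping partners deliver within 5 business days.",
        "Warranty claims require proof of purchase."]

# A cross-encoder scores each (query, document) pair jointly.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, d) for d in docs])
reranked = [d for _, d in sorted(zip(scores, docs), reverse=True)]
print(reranked[0])
```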

LLM Rerank:

  • What It Does: Uses a large language model (LLM) to evaluate and reorder search results, enhancing the accuracy and relevance based on nuanced language understanding. The LLM can assess factors such as coherence, context, and semantic fit, making it highly adaptive to various types of queries.
  • Best For: Situations where high-quality, contextually aware reranking is needed, such as in personalized search, advanced document retrieval, and contexts requiring deep semantic understanding.
  • Advantages: Offers sophisticated reranking capabilities by leveraging advanced language models that can interpret complex relationships and contextual nuances.
  • Disadvantages: Resource-intensive and can be slower than other reranking methods, especially with large volumes of data. Requires significant computational power and can be costly in deployment.
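
One common way to implement LLM reranking is to prompt the model to score each candidate. The sketch below uses a hypothetical generate placeholder for the LLM call; production systems typically batch candidates and use more robust prompting:

```python
# Sketch of LLM reranking: ask a language model to rate each
# candidate's relevance, then sort by the returned scores.
def llm_rerank(query: str, docs: list[str], generate) -> list[str]:
    def score(doc: str) -> float:
        prompt = (f"Rate 0-10 how well this passage answers the question.\n"
                  f"Question: {query}\nPassage: {doc}\nAnswer with only a number.")
        try:
            return float(generate(prompt).strip())
        except ValueError:
            return 0.0  # unparseable model output counts as irrelevant
    return sorted(docs, key=score, reverse=True)
```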

5. Top K

Quick Selection Guide

| Value | Returns | Best For | Considerations |
| --- | --- | --- | --- |
| 1 | Single most relevant document | Precise factual queries | May miss important context |
| 5 | Five most relevant documents | Balanced precision and context | Good for most general questions |
| 10 | Ten most relevant documents | Comprehensive overview | Includes broader context |
| 20 | Twenty most relevant documents | Research, exploration | May include less relevant information |

Selection Guidance:

  • For fact-based questions: Lower values (1-5) often provide sufficient information
  • For research or exploration: Higher values (10-20) offer broader perspectives

Detailed Information

Top K is a retrieval parameter that sets a specific limit on the number of documents returned from a search. By choosing a value for "K," you directly control the number of results returned, ensuring that only the most relevant documents are presented.

What It Does: Limits the number of documents retrieved by the search to the top K results, ranked by their relevance to the query. For instance, if you set K = 5, the system will return the five highest-ranked documents.

Best For: Use when you want to streamline the results and focus on the most relevant content, especially in scenarios where too many results would be overwhelming or unnecessary, such as in customer support, product searches, or targeted research.

Advantages:

  • Efficiency: Helps manage large volumes of data by only presenting the most pertinent information, reducing noise and irrelevant content.
  • Simplicity: Easy to implement and understand, making it ideal for straightforward retrieval tasks where only the top matches matter.
  • Control: Gives users direct control over the breadth of results, allowing them to adjust the scope based on the context or complexity of their needs.

Disadvantages:

  • Potentially Missed Information: By limiting results, there is a risk of overlooking less relevant but still useful information that lies beyond the top K threshold.
  • Context Loss: For highly nuanced or complex queries, restricting results might exclude valuable context that would otherwise be captured in a broader search.

Choosing the Right K:

  • Smaller K Values (e.g., K = 1 to 5): Best when precision is critical, and you want the most focused, relevant answers without distraction.
  • Larger K Values (e.g., K = 10 or 20): Use when you need a wider range of insights or are exploring more complex queries that benefit from a broader view of relevant documents.
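
Mechanically, Top K is just a truncation of the ranked score list. A NumPy sketch with toy scores:

```python
import numpy as np

scores = np.array([0.91, 0.42, 0.77, 0.15, 0.88, 0.63])  # toy similarity scores
k = 3
# argpartition finds the top-k in O(n); the final sort orders just those k.
top_k = np.argpartition(scores, -k)[-k:]
top_k = top_k[np.argsort(scores[top_k])[::-1]]
print(top_k, scores[top_k])  # -> [0 4 2] [0.91 0.88 0.77]
```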

6. Embedding Model

Quick Selection Guide

| Model | Dimensionality | Characteristics | Best For |
| --- | --- | --- | --- |
| text-embedding-ada-002 | 1536 | Balanced performance and efficiency | General applications, cost-effective deployment |
| text-embedding-3-small | 1536 | Fast, lightweight processing | High-volume queries, latency-sensitive applications |
| text-embedding-3-large | 3072 | Enhanced semantic understanding | Complex reasoning, nuanced semantic relationships |
| Fine-Tuned Model | Varies | Domain-adapted for specific content | Industry-specific terminology, specialized jargon |

Selection Guidance:

  • For general use with balanced speed and quality: Choose text-embedding-ada-002.
  • For quick, low-latency applications: Use text-embedding-3-small.
  • For in-depth analysis where understanding complex semantics is crucial: Opt for text-embedding-3-large.
  • For specialized contexts requiring high relevance and adaptation to specific terminology: Go with a Fine-Tuned Model, particularly if you have access to relevant training data.

Detailed Information

Embedding models are essential for transforming text into numerical vectors that capture the semantic meaning of the content. These vectors are then used in various retrieval, ranking, and classification tasks. Different embedding models offer varying levels of performance and specificity, and the choice of model can significantly impact retrieval effectiveness.

text-embedding-ada-002:

  • Description: A highly efficient and versatile embedding model known for its balance between performance and computational cost. It captures general semantic information, making it suitable for a wide range of retrieval tasks.
  • Best For: General-purpose searches, topic modeling, and classification tasks where high quality and speed are needed without significant resource demands.
  • Advantages: Fast, accurate, and cost-effective; widely used for various natural language processing tasks.
  • Disadvantages: May not capture domain-specific nuances as effectively as more specialized or fine-tuned models.

text-embedding-3-small:

  • Description: A lightweight embedding model designed for low-latency environments or applications where speed is prioritized over deep semantic understanding.
  • Best For: Scenarios requiring quick responses, like real-time applications or low-resource settings.
  • Advantages: Extremely fast and resource-efficient, ideal for high-volume or low-latency tasks.
  • Disadvantages: Limited in capturing complex or nuanced text relationships compared to larger models.

text-embedding-3-large:

  • Description: A more robust embedding model with increased capacity to understand complex language patterns and relationships.
  • Best For: Advanced retrieval tasks, deep semantic analysis, and contexts where a high degree of text comprehension is required.
  • Advantages: Offers deeper insight and captures complex language interactions, enhancing retrieval quality in intricate queries.
  • Disadvantages: Slower and more resource-intensive, which might not be ideal for high-speed applications.
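
If you call these OpenAI models directly outside Queryloop, an embedding request looks roughly like the following (assumes the openai Python SDK and an OPENAI_API_KEY in the environment):

```python
from openai import OpenAI  # pip install openai

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",  # or text-embedding-3-large / text-embedding-ada-002
    input="How do I configure the retrieval module?",
)
vector = resp.data[0].embedding
print(len(vector))  # 1536 for the small model, 3072 for the large one
```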

Fine-Tuned Model:

  • Description: This model is tailored specifically to your data by training on uploaded documents. The fine-tuning process enhances the model's ability to understand domain-specific language, concepts, and terminology.
  • Best For: Highly specialized retrieval tasks where the general embedding models might miss context-specific nuances, such as industry-specific documents, legal texts, or technical manuals.
  • Advantages:
    • Domain Adaptation: Fine-tuning allows the model to better align with the language and context specific to your data, improving retrieval accuracy.
    • Automatic Training Data Generation: The model leverages your uploaded documents to create training data automatically, streamlining the fine-tuning process without requiring manual data curation.
  • Disadvantages: Requires additional training time and computational resources. Effectiveness depends on the quality and diversity of the training data provided.

Optimizing Retrieval Configuration

Finding the optimal retrieval configuration often requires experimentation and fine-tuning.

Best Practices

  1. Start with defaults for your content type: Queryloop offers recommended starting configurations
  2. Test with representative queries: Use questions that reflect actual use patterns
  3. Review retrieved chunks: Examine whether the system finds relevant information
  4. Iterate methodically: Change one parameter at a time to understand its impact
  5. Consider performance trade-offs: Balance accuracy against computational resources

By carefully configuring these parameters, you can significantly enhance your application's ability to find and utilize relevant information, resulting in more accurate, contextual, and helpful responses to user queries.