Text to NoSQL
This document describes the Text to NoSQL workflow implemented in Queryloop, which transforms natural language questions into NoSQL aggregation queries. The goal is to streamline interactions with MongoDB databases by allowing users to query data through conversational language, which is then translated into MongoDB aggregation queries for accurate data retrieval.
Step 1: Data Dumping and Embedding
The first stage of the process involves creating a set of potential user queries and embedding them into a vector database.
Generating Sample Queries (Utterances)
- Queryloop utilizes the GPT-4o-mini model to generate 20 potential questions based on a MongoDB table schema and example data structure.
- The questions vary in complexity and focus:
- Single-column queries: Involve a single attribute from the table.
- Multi-column queries: Involve two or more columns.
- Query Types: Include diverse operations, such as filtering, sorting, grouping, aggregating, and distinct value retrieval.
- These generated questions are embedded into a vector database, with metadata for each embedded question including the table name, schema, and example data structure.
Prompt for Utterance Generation
- To generate the questions, Queryloop uses a specific prompt:
"You are given a MongoDB table schema and an example of its structure. Generate 20 unique natural language queries that could retrieve data from this MongoDB collection using the aggregation pipeline."
Documenting the "Text to NoSQL" Process in Queryloop
This document describes the Text to NoSQL workflow implemented in Queryloop, which transforms natural language questions into NoSQL aggregation queries. The goal is to streamline interactions with MongoDB databases by allowing users to query data through conversational language, which is then translated into MongoDB aggregation queries for accurate data retrieval.
Step 1: Data Dumping and Embedding
The first stage of the process involves creating a set of potential user queries and embedding them into a vector database.
1.1 Generating Sample Queries (Utterances)
- Queryloop utilizes the GPT-4o-mini model to generate 20 potential questions based on a MongoDB table schema and example data structure.
- The questions are designed to vary in complexity and focus:
- Single-column queries: Involve a single attribute from the table.
- Multi-column queries: Involve two or more columns.
- Query Types: Include diverse operations, such as filtering, sorting, grouping, aggregating, and distinct value retrieval.
- These generated questions are then embedded into a vector database. Metadata for each embedded question includes the table name, schema, and example data structure.
1.2 Prompt for Utterance Generation
To generate the questions, Queryloop uses a specific prompt:
"You are given a MongoDB table schema and an example of its structure. Generate 20 unique natural language queries that could retrieve data from this MongoDB collection using the aggregation pipeline."
This prompt guides GPT-4o-mini to create queries that reflect common natural language questions for interacting with MongoDB collections.
Step 2: Table Retrieval with Similarity Search
To identify the correct table for a user’s query, Queryloop performs a similarity search over the vector database.
2.1 Similarity Search Process
- Queryloop searches the vector database using the user’s query, retrieving the top 100 utterances with the highest similarity scores.
- The number of utterances for each retrieved table is then counted, and tables are sorted by this count. The top five tables with the most matching utterances are selected for further analysis.
2.2 Prompt for Table Selection
The top five tables are presented to an LLM (GPT-4o-mini) along with the user query. The LLM is tasked with identifying the correct table from these top options based on schema relevance:
"Identify the most relevant database collection based on the provided schemas and user query."
Step 3: Query Generation
Once the correct table is identified, Queryloop generates the corresponding NoSQL aggregation query.
3.1 NoSQL Query Generation
- The selected table schema, a sample document, and the user query are provided to GPT-4o-mini to create a MongoDB aggregation query.
- The prompt ensures that the output is in JSON format, using the raw MongoDB aggregation pipeline syntax:
"You are a MongoDB expert. Create a syntactically correct raw aggregation query for the user question."
- The model limits results to 10 unless the user specifies otherwise and utilizes field names accurately, without referencing non-existent fields.
Improvements and Experimentation
To refine the retrieval accuracy, Queryloop performed extensive experiments, analyzing hit rates, accuracy, and prompt effectiveness.
4.1 Experiments on Retrieval Flow
- Queryloop tested retrieval using different top-k values and embedding configurations, finding that embedding utterances with schemas significantly improved accuracy.
- The best hit rate (is_hit percentage) achieved was 86.42% with top_k set to 10 and schema-included utterances embedded in the vector database.
4.2 Experiments on Table Selection Prompts
- Three prompt types were tested: Queryloop’s default prompt, Instruction IO, and Instruction COT.
- Using Queryloop’s prompt, accuracy for table selection improved to 88% with top_k = 2.
4.3 Row Inclusion and Schema Integration
- Including random rows with schemas in prompts improved table selection from 88% to 93% accuracy.
- Full-row inclusion (all rows) further increased accuracy but was deemed impractical for production due to table size constraints.
4.4 Enhanced Retrieval Flow
- Queryloop embedded each row of the table with its schema and performed a similarity search. This resulted in the highest-scoring rows being passed to the LLM, enhancing retrieval accuracy to 95% with top_k = 10.
4.5 Large-Scale Testing
- Queryloop achieved an overall retrieval accuracy of 85% on a dataset of 1,160 points, highlighting the robustness of the optimized retrieval flow.
Summary of Retrieval and Accuracy Improvements
The final optimized workflow showed the following improvements:
- Top_k = 10: Hit rate increased from 86% to 95%.
- Top_k = 2: Accuracy improved from 59% to 83%.
- Overall accuracy: Increased from 67% to 90% after integrating enhanced retrieval and schema strategies.
These improvements showcase Queryloop’s ability to transform natural language queries into efficient NoSQL aggregation queries, maintaining high accuracy and relevance in retrieval tasks. The optimized process balances performance with scalability, offering a streamlined solution for organizations seeking advanced natural language to NoSQL query translation.