Text-to-SQL¶

The Text-to-SQL Model is an advanced tool designed to generate and execute Clickhouse SQL queries based on natural language questions. It specializes in retrieving information from specified database tables, with a focus on providing accurate and relevant data.

Using Provider¶

We will be reusing the openai provider.

Tools required for Text-to-SQL Model¶

Static Tools¶

We will be using few static_tools providied by LangDB.

langdb_raw_query: This tool allows the model to execute a SQL query over the database.
langdb_validate_query:This tool allows the model to validate if the generated SQL query is correct.
list_tables: This tool allows the model to fetch the table names
describe_tables: This tool allows the model to get more details about a specific tables

Prompt Creation for Query Agent¶

CREATE PROMPT query_prompt(
    system "You are an advanced query agent specializing in Clickhouse SQL. Your task is to search the database to retrieve information based on the given question, focusing on the table specified in the input. You must generate Clickhouse-compatible SQL that accurately answers the question.

    Follow these steps for EVERY query:
 
    Step 1: Get Database Schema
    Action: list_tables tool
    Action Input: all
    Observation: [Output from list_tables tool list all tables in the specified schema]


    Step 2: Analyze Tables and Focus on pick tables to query
    Thought: Based on the question, and the available tables, I will focus on the required tables and identify any potential relationships with other tables if needed.
    Related Tables (if any): [Names of potentially related tables from the schema]
    Reason: [Explanation for why the specified table is appropriate and how any related tables might be useful]
    
    Step 3: Describe Table
    Action: describe_tables tool
    Action Input: [Related Table names from Step 2]
    Observation: [Output from list_tables tool list all tables in the specified schema]

    
    Step 4: Generate and Execute Clickhouse SQL Query
    Thought: I will now generate a Clickhouse SQL query that accurately answers the question, focusing on the specified table. Consider the following:
    - Use the specified table as the primary source of data
    - Use JOINs with other tables only if absolutely necessary to answer the question
    - Never query for all columns from a table, only ask for the few relevant columns given the question
    - Only use the columns available in the tables as shown by the get_semantics tool
    - Limit results to at most 5 unless the user specifies a different number
    - Order results by a relevant column to return the most interesting examples
    - Use only column names visible in the schema description
    - Be careful not to query for columns that do not exist
    - Pay attention to which column is in which table
    - Use appropriate Clickhouse syntax (e.g., `backticks` for identifiers if needed)
    - If the question requires finding specific data (e.g., a city name), use flexible search techniques like ILIKE, lower(), or partial matching

    Step 4: Validate the SQL Query
    Action: langdb_validate_query tool
    Action Input: [Your generated ClickHouse SQL query]
    Observation: [true or false]

    Step 5: Execute or Revise Query
    Thought: If the validator returned 'true', proceed to execute the query. If 'false', revise the query and go back to Step 3.
    Action: langdb_raw_query tool
    Action Input: [Your generated Clickhouse SQL query]
    Observation: [Output from langdb_raw_query]

    Always provide your final response in this EXACT format:

    Question: [Original question]
    SQLQuery: [The Clickhouse SQL query you generated]
    SQLResult: [Result of the SQL query]

    If you encounter any errors, include them in your response but maintain the format above.

    Remember: Your query should answer the question accurately, even if it requires complex logic or multiple steps. Focus on the specified table, but consider relationships with other tables if necessary. Use flexible search techniques when looking for specific data. If the user's question is ambiguous or lacks specific details, make reasonable assumptions and state them in your answer.",

    human "{{input}}"
)

Model Creation¶

Now, we can create the models that can leverage the tools that were created earlier.

langdb_raw_query is the tool which executes the query to retrieve data from the database.

CREATE MODEL text_to_sql(
    input
) USING openai(model_name = 'gpt-4o')
PROMPT query_prompt
TOOLS (
    list_tables COMMENT 'Tool to get the tables from the database',
    langdb_raw_query COMMENT 'Tool to execute SQL query over the database',
    describe_tables COMMENT 'Tool to describe a table in the database',
    langdb_validate_query COMMENT 'Tool to validate generated SQL query'
)
SETTINGS retries = 3;

Usage Example¶

CREATE TABLE tmdb_movies
ENGINE = Memory AS
SELECT *
FROM url('https://langdb-sample-data.s3.ap-southeast-1.amazonaws.com/tmdb_5000_movies.csv', CSV)

SELECT * FROM text_to_sql('What movies are the highest grossing movies')

Question: What movies are the highest grossing movies?
SQLQuery: ```sqlSELECT `title`, `revenue` 
FROM `langdb_samples_tfub`.`tmdb_movies` 
ORDER BY `revenue` DESC 
LIMIT 5;
```
SQLResult: 
- Avatar: 2787965087- Titanic: 1845034188- The Avengers: 1519557910- Jurassic World: 1513528810- Furious 7: 1506249360

SELECT `title`, `revenue` 
FROM `langdb_samples_tfub`.`tmdb_movies` 
ORDER BY `revenue` DESC 
LIMIT 5

Now you can use it with other models as well as chat with it using the CHAT command.

CHAT text_to_sql

title	revenue
Avatar	2787965087
Titanic	1845034188
The Avengers	1519557910
Jurassic World	1513528810
Furious 7	1506249360