We will recreate an example use case from LlamaIndex: a model that combines insights from structured data (SQL tables) and unstructured data (Wikipedia articles) to answer user queries.
We first create a cities table that stores each city's geographic location, population, and country.
CREATE TABLE cities (
    `id` UUID DEFAULT generateUUIDv4(),
    `city` String,
    `lat` Decimal64(3),
    `lng` Decimal64(3),
    `country` String,
    `population` UInt64
) ENGINE = MergeTree
ORDER BY id;
INSERT INTO cities(city, lat, lng, country, population) VALUES
('Tokyo','35.6897','139.6922','Japan','37732000'),
('Jakarta','-6.1750','106.8275','Indonesia','33756000'),
('Delhi','28.6100','77.2300','India','32226000'),
('Manila','14.5958','120.9772','Philippines','24922000'),
('Dhaka','23.7639','90.3889','Bangladesh','18627000'),
('Beijing','39.9067','116.3975','China','18522000'),
('Moscow','55.7558','37.6172','Russia','17332000'),
('Karachi','24.8600','67.0100','Pakistan','15738000'),
('Ho Chi Minh City','10.7756','106.7019','Vietnam','15136000'),
('Singapore','1.3000','103.8000','Singapore','5983000'),
('Tashkent','41.3111','69.2797','Uzbekistan','2956384'),
('Phnom Penh','11.5694','104.9211','Cambodia','2129371'),
('Bishkek','42.8747','74.6122','Kyrgyzstan','1120827'),
('Tbilisi','41.7225','44.7925','Georgia','1118035'),
('Sri Jayewardenepura Kotte','6.9108','79.8878','Sri Lanka','115826');
Next, we create a cities_links table that stores, for each city, the URL of its Wikipedia article as a PDF.
CREATE TABLE cities_links (
    `city` String,
    `link` String
) ENGINE = MergeTree
ORDER BY city;
INSERT INTO cities_links(city, link) VALUES
('Beijing','https://en.wikipedia.org/api/rest_v1/page/pdf/Beijing'),
('Tokyo','https://en.wikipedia.org/api/rest_v1/page/pdf/Tokyo'),
('Jakarta','https://en.wikipedia.org/api/rest_v1/page/pdf/Jakarta'),
('Delhi','https://en.wikipedia.org/api/rest_v1/page/pdf/New_Delhi'),
('Manila','https://en.wikipedia.org/api/rest_v1/page/pdf/Manila'),
('Dhaka','https://en.wikipedia.org/api/rest_v1/page/pdf/Dhaka'),
('Moscow','https://en.wikipedia.org/api/rest_v1/page/pdf/Moscow'),
('Karachi','https://en.wikipedia.org/api/rest_v1/page/pdf/Karachi'),
('Singapore','https://en.wikipedia.org/api/rest_v1/page/pdf/Singapore'),
('Tashkent','https://en.wikipedia.org/api/rest_v1/page/pdf/Tashkent'),
('Phnom Penh','https://en.wikipedia.org/api/rest_v1/page/pdf/Phnom_Penh'),
('Bishkek','https://en.wikipedia.org/api/rest_v1/page/pdf/Bishkek'),
('Tbilisi','https://en.wikipedia.org/api/rest_v1/page/pdf/Tbilisi'),
('Sri Jayewardenepura Kotte','https://en.wikipedia.org/api/rest_v1/page/pdf/Sri_Jayawardenepura_Kotte'),
('Ho Chi Minh City','https://en.wikipedia.org/api/rest_v1/page/pdf/Ho_Chi_Minh_City');
We then create a cities_pdf table to store the text of the PDFs. We extract the text with the extract_text() table function, split it into semantic chunks with chunk(), and insert the chunks into the table.
CREATE TABLE cities_pdf (
    `id` UUID DEFAULT generateUUIDv4(),
    `content` String,
    `metadata` String,
    `city` String
) ENGINE = MergeTree
ORDER BY (id, content);
INSERT INTO cities_pdf(content, metadata, city)
SELECT text, metadata, city FROM chunk(
    (
        SELECT content, metadata, city
        FROM extract_text((SELECT link, city FROM cities_links), path => link, type => 'pdf')
    ),
    chunk_size => 500,
    type => 'Word'
);
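To make the chunking step concrete, here is a minimal Python sketch of what word-based chunking with chunk_size => 500 conceptually does (the actual chunk() implementation is LangDB's own; this is only an illustration of splitting text into chunks of at most 500 words):

```python
# Illustrative sketch of word-based chunking (chunk_size => 500, type => 'Word').
# This is NOT LangDB's implementation, just the basic idea.
def chunk_words(text, chunk_size=500):
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

sample = ("word " * 1200).strip()      # a 1200-word document
chunks = chunk_words(sample, chunk_size=500)
print(len(chunks))                     # 3 chunks: 500 + 500 + 200 words
print(len(chunks[-1].split()))         # 200
```

Each chunk is small enough to embed and retrieve individually, which is what makes the similarity search below useful.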
LangDB offers a convenient way to generate embeddings for development and testing by defining a custom embedding model.
CREATE EMBEDDING MODEL generate_embeddings
USING openai(model='text-embedding-ada-002', encoding_format='float');
CREATE TABLE cities_embeddings (
    `id` UUID,
    `city` String,
    `content` String,
    `embeddings` Array(Float32)
) ENGINE = MergeTree
ORDER BY id;
We can now use generate_embeddings() to generate an embedding for each chunk and store it. The LEFT JOIN against cities_embeddings filters out chunks that have already been embedded, so the statement can be re-run safely.
INSERT INTO cities_embeddings
SELECT id, city, content, embedding FROM generate_embeddings((
    SELECT p.id, content, city
    FROM cities_pdf AS p
    LEFT JOIN cities_embeddings AS pe ON p.id = pe.id
    WHERE p.id != pe.id
    ORDER BY p.id
), input => content);
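The WHERE p.id != pe.id predicate works because in ClickHouse a LEFT JOIN fills non-matching right-side rows with default values (a zero UUID for pe.id), so only chunks without an existing embedding survive the filter. A small Python sketch of the same idea (the data here is illustrative):

```python
# Sketch of the anti-join filter: keep only chunks whose id has no
# embedding yet, so re-running the insert does no duplicate work.
pdf_chunks = [
    {"id": "a1", "content": "chunk about Tokyo"},
    {"id": "b2", "content": "chunk about Delhi"},
]
already_embedded = {"a1"}  # ids already present in cities_embeddings

to_embed = [c for c in pdf_chunks if c["id"] not in already_embedded]
print([c["id"] for c in to_embed])  # ['b2']
```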
Next, we create a view that uses vector search to find relevant chunks. The cities_info_generic view can then be queried to retrieve information related to a query.
CREATE VIEW cities_info_generic(query String "description of the information to look up about cities") AS
WITH query AS (
    SELECT embedding::Array(Float32) AS query FROM generate_embeddings($query)
)
SELECT
    p.id AS id,
    p.content AS content,
    cosineDistance(embeddings, query) AS cosineDistance,
    p.city AS city
FROM cities_embeddings AS p
CROSS JOIN query
ORDER BY cosineDistance ASC
LIMIT 5;
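The view ranks chunks by cosineDistance, which is 1 minus the cosine similarity of the two vectors: smaller values mean more similar, hence the ascending ORDER BY. A plain-Python sketch of the metric:

```python
# cosineDistance(a, b) = 1 - (a . b) / (|a| * |b|); lower is more similar.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

q = [1.0, 0.0]
print(cosine_distance(q, [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance(q, [0.0, 1.0]))  # 1.0 (orthogonal)
```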
We can query the view directly to see how similarity search works:
SELECT * FROM cities_info_generic('Olympics');
We will also use the Text-to-SQL model we created earlier to query the cities table for specific information.
We create a prompt for our use case based on the ReAct framework.
CREATE PROMPT cities_prompt (
system "You are a master data agent specializing in providing information about cities. Your task is to answer user questions about cities using the available tools and data sources.
Tools at your disposal:
1. text_to_sql(question): Use this to retrieve data from the specified table in the database. For city-related queries, use the 'cities' table, which contains information such as population, latitude, longitude, and country.
2. cities_info_generic(question): Use this for general information about cities using similarity search based on cosine distance.
Guidelines for tool usage:
- text_to_sql: Prefer this tool when specific data points (population, location, country) are needed or when comparing multiple cities. Make your search intent clear.
- cities_info_generic: Use this when the city is not known, or when seeking general information not available in the database.
Always follow these steps:
1. Analyze the question to determine the best tool(s) to use.
2. Use the chosen tool(s) to gather relevant information.
3. Synthesize the gathered information to provide a comprehensive answer.
Output format:
Question: [Restate the input question]
Thought: [Your reasoning about how to approach the question]
Action: [The tool you decide to use]
Action Input: [For text_to_sql: {'question': 'Your specific question'}, For cities_info_generic: 'Your question']
Observation: [The result returned by the tool]
... (Repeat Thought/Action/Action Input/Observation as needed)
Thought: [Final reasoning about how to answer the question based on all gathered information]
Final Answer: [Comprehensive answer to the question, including:
- Direct response to the question
- Supporting data from the tools used
- If text_to_sql was used, include the full SQL query
- Any relevant additional context or explanations]
Remember:
- Always use the tools to gather information; do not rely on prior knowledge.
- Be thorough in your analysis and provide detailed, informative answers.
- When using text_to_sql, always formulate a clear, specific question for the SQL query. The output of query_model contains SQLQuery and SQLAnswer.
- If the question is ambiguous, state your assumptions clearly in the final answer.
Begin!
Question: {{input}}
Thought: Let's start by analyzing the question and determining the best approach to answer it."
);
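The prompt encodes a ReAct loop: the model alternates Thought/Action/Observation steps, dispatching each Action to a tool until it can produce a Final Answer. A minimal sketch of that dispatch (tool behavior here is mocked; LangDB performs this loop internally):

```python
# Illustrative ReAct tool dispatch: the model's chosen Action is routed to
# the matching tool, and the result becomes the next Observation.
def react_step(action, action_input, tools):
    return tools[action](action_input)

# Mocked tools standing in for the real text_to_sql / cities_info_generic.
tools = {
    "text_to_sql": lambda q: {"SQLQuery": "SELECT ...", "SQLAnswer": "Tokyo"},
    "cities_info_generic": lambda q: "Tokyo hosted the 1964 Summer Olympics.",
}

obs = react_step("text_to_sql", {"question": "most populous city"}, tools)
print(obs["SQLAnswer"])  # Tokyo
```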
Now, we can create the models that can leverage the tools that were created earlier.
CREATE MODEL IF NOT EXISTS cities_info_model(
input
) USING openai(model_name='gpt-4o')
PROMPT cities_prompt
TOOLS (
text_to_sql COMMENT 'Text-to-SQL Model to query the database',
cities_info_generic COMMENT 'Vector Search on Cities Wiki Pages')
SETTINGS retries = 3;
Along with the tools we created, the model also has access to langdb_raw_query, a built-in static tool that allows it to execute raw SELECT (read-only) queries on the database.
Using the created model, we can execute queries which would require the LLM to use both structured (cities table) and unstructured (Wikipedia articles) data through the provided tools.
SELECT * FROM cities_info_model('Tell me about the arts and culture of the city with the highest population');
In the above query, the model generates a SQL query to find the most populous city and invokes the langdb_raw_query tool to execute it. It then takes the result, Tokyo, and invokes the cities_info_generic tool to retrieve information about the city's arts and culture.
SELECT * FROM cities_info_model('Whats the population of the city which conducted the 1964 Summer Olympics');
We can also interact with the created model in a chat session:
CHAT cities_info_model