We will recreate an example use case from LlamaIndex: a model that combines insights from structured data (SQL tables) and unstructured data (Wikipedia articles) to answer user queries.
We first create a cities table that stores each city's geographic location, population, and country.
CREATE TABLE cities (
    `id` UUID DEFAULT generateUUIDv4(),
    `city` String,
    `lat` Decimal64(3),
    `lng` Decimal64(3),
    `country` String,
    `population` UInt64
) ENGINE = MergeTree
ORDER BY id;
INSERT INTO cities(city, lat, lng, country, population) VALUES
('Tokyo','35.6897','139.6922','Japan','37732000'),
('Jakarta','-6.1750','106.8275','Indonesia','33756000'),
('Delhi','28.6100','77.2300','India','32226000'),
('Manila','14.5958','120.9772','Philippines','24922000'),
('Dhaka','23.7639','90.3889','Bangladesh','18627000'),
('Beijing','39.9067','116.3975','China','18522000'),
('Moscow','55.7558','37.6172','Russia','17332000'),
('Karachi','24.8600','67.0100','Pakistan','15738000'),
('Ho Chi Minh City','10.7756','106.7019','Vietnam','15136000'),
('Singapore','1.3000','103.8000','Singapore','5983000'),
('Tashkent','41.3111','69.2797','Uzbekistan','2956384'),
('Phnom Penh','11.5694','104.9211','Cambodia','2129371'),
('Bishkek','42.8747','74.6122','Kyrgyzstan','1120827'),
('Tbilisi','41.7225','44.7925','Georgia','1118035'),
('Sri Jayewardenepura Kotte','6.9108','79.8878','Sri Lanka','115826');
Next, we create a cities_links table that stores, for each city, the URL of its Wikipedia article as a PDF.
CREATE TABLE cities_links (
    `city` String,
    `link` String
) ENGINE = MergeTree
ORDER BY city;
INSERT INTO cities_links(city, link) VALUES
('Beijing','https://en.wikipedia.org/api/rest_v1/page/pdf/Beijing'),
('Tokyo','https://en.wikipedia.org/api/rest_v1/page/pdf/Tokyo'),
('Jakarta','https://en.wikipedia.org/api/rest_v1/page/pdf/Jakarta'),
('Delhi','https://en.wikipedia.org/api/rest_v1/page/pdf/New_Delhi'),
('Manila','https://en.wikipedia.org/api/rest_v1/page/pdf/Manila'),
('Dhaka','https://en.wikipedia.org/api/rest_v1/page/pdf/Dhaka'),
('Moscow','https://en.wikipedia.org/api/rest_v1/page/pdf/Moscow'),
('Karachi','https://en.wikipedia.org/api/rest_v1/page/pdf/Karachi'),
('Singapore','https://en.wikipedia.org/api/rest_v1/page/pdf/Singapore'),
('Tashkent','https://en.wikipedia.org/api/rest_v1/page/pdf/Tashkent'),
('Phnom Penh','https://en.wikipedia.org/api/rest_v1/page/pdf/Phnom_Penh'),
('Bishkek','https://en.wikipedia.org/api/rest_v1/page/pdf/Bishkek'),
('Tbilisi','https://en.wikipedia.org/api/rest_v1/page/pdf/Tbilisi'),
('Sri Jayewardenepura Kotte','https://en.wikipedia.org/api/rest_v1/page/pdf/Sri_Jayawardenepura_Kotte'),
('Ho Chi Minh City','https://en.wikipedia.org/api/rest_v1/page/pdf/Ho_Chi_Minh_City');
We then create a cities_pdf table to store the text of the PDFs. We extract the text with the extract_text() table function, split it into semantic chunks with chunk(), and insert the chunks into the table.
CREATE TABLE cities_pdf (
    `id` UUID DEFAULT generateUUIDv4(),
    `content` String,
    `metadata` String,
    `city` String
) ENGINE = MergeTree
ORDER BY (id, content);
INSERT INTO cities_pdf(content, metadata, city)
SELECT text, metadata, city FROM chunk(
    (
        SELECT content, metadata, city
        FROM extract_text((SELECT link, city FROM cities_links), path => link, type => 'pdf')
    ),
    chunk_size => 500,
    type => 'Word'
);
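To make the chunking step concrete, here is a minimal Python sketch of what word-based chunking with chunk_size => 500 conceptually does (the actual chunk() implementation is LangDB's own; this is only an illustration of splitting text into chunks of at most 500 words):

```python
# Illustrative sketch of word-based chunking (chunk_size => 500, type => 'Word').
# This is NOT LangDB's implementation, just the basic idea.
def chunk_words(text, chunk_size=500):
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

sample = ("word " * 1200).strip()      # a 1200-word document
chunks = chunk_words(sample, chunk_size=500)
print(len(chunks))                     # 3 chunks: 500 + 500 + 200 words
print(len(chunks[-1].split()))         # 200
```

Each chunk is small enough to embed and retrieve individually, which is what makes the similarity search below useful.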
LangDB offers a convenient way to generate embeddings for development and testing by defining a custom embedding model.
CREATE EMBEDDING MODEL generate_embeddings
USING openai(model='text-embedding-ada-002', encoding_format='float');
CREATE TABLE cities_embeddings (
    `id` UUID,
    `city` String,
    `content` String,
    `embeddings` Array(Float32)
) ENGINE = MergeTree
ORDER BY id;
We can now use generate_embeddings() to generate an embedding for each chunk and store it. The LEFT JOIN against cities_embeddings filters out chunks that have already been embedded, so the statement can be re-run safely.
INSERT INTO cities_embeddings
SELECT id, city, content, embedding FROM generate_embeddings((
    SELECT p.id, content, city
    FROM cities_pdf AS p
    LEFT JOIN cities_embeddings AS pe ON p.id = pe.id
    WHERE p.id != pe.id
    ORDER BY p.id
), input => content);
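The WHERE p.id != pe.id predicate works because in ClickHouse a LEFT JOIN fills non-matching right-side rows with default values (a zero UUID for pe.id), so only chunks without an existing embedding survive the filter. A small Python sketch of the same idea (the data here is illustrative):

```python
# Sketch of the anti-join filter: keep only chunks whose id has no
# embedding yet, so re-running the insert does no duplicate work.
pdf_chunks = [
    {"id": "a1", "content": "chunk about Tokyo"},
    {"id": "b2", "content": "chunk about Delhi"},
]
already_embedded = {"a1"}  # ids already present in cities_embeddings

to_embed = [c for c in pdf_chunks if c["id"] not in already_embedded]
print([c["id"] for c in to_embed])  # ['b2']
```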
Next, we create a view that uses vector search to find relevant chunks. The cities_info_generic view can then be queried to retrieve information related to a query.
CREATE VIEW cities_info_generic(query String "description of the information to look up about cities") AS
WITH query AS (
    SELECT embedding::Array(Float32) AS query FROM generate_embeddings($query)
)
SELECT
    p.id AS id,
    p.content AS content,
    cosineDistance(embeddings, query) AS cosineDistance,
    p.city AS city
FROM cities_embeddings AS p
CROSS JOIN query
ORDER BY cosineDistance ASC
LIMIT 5;
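The view ranks chunks by cosineDistance, which is 1 minus the cosine similarity of the two vectors: smaller values mean more similar, hence the ascending ORDER BY. A plain-Python sketch of the metric:

```python
# cosineDistance(a, b) = 1 - (a . b) / (|a| * |b|); lower is more similar.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

q = [1.0, 0.0]
print(cosine_distance(q, [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance(q, [0.0, 1.0]))  # 1.0 (orthogonal)
```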
We can query the view directly to see how similarity search works:
SELECT * FROM cities_info_generic('Olympics');
We will also use the Text-to-SQL model we created earlier to query the cities table for specific information.
We create a prompt for our use case based on the ReAct framework.
CREATE PROMPT cities_prompt (
system "You are a master data agent specializing in providing information about cities. Your task is to answer user questions about cities using the available tools and data sources.
Tools at your disposal:
1. text_to_sql(question): Use this to retrieve data from the specified table in the database. For city-related queries, use the 'cities' table, which contains information such as population, latitude, longitude, and country.
2. cities_info_generic(question): Use this for general information about cities using similarity search based on cosine distance.
Guidelines for tool usage:
- text_to_sql: Prefer this tool when specific data points (population, location, country) are needed or when comparing multiple cities. Make your search intent clear.
- cities_info_generic: Use this when the city is not known, or when seeking general information not available in the database.
Always follow these steps:
1. Analyze the question to determine the best tool(s) to use.
2. Use the chosen tool(s) to gather relevant information.
3. Synthesize the gathered information to provide a comprehensive answer.
Output format:
Question: [Restate the input question]
Thought: [Your reasoning about how to approach the question]
Action: [The tool you decide to use]
Action Input: [For text_to_sql: {'question': 'Your specific question'}, For cities_info_generic: 'Your question']
Observation: [The result returned by the tool]
... (Repeat Thought/Action/Action Input/Observation as needed)
Thought: [Final reasoning about how to answer the question based on all gathered information]
Final Answer: [Comprehensive answer to the question, including:
- Direct response to the question
- Supporting data from the tools used
- If text_to_sql was used, include the full SQL query
- Any relevant additional context or explanations]
Remember:
- Always use the tools to gather information; do not rely on prior knowledge.
- Be thorough in your analysis and provide detailed, informative answers.
- When using text_to_sql, always formulate a clear, specific question for the SQL query. The output of query_model contains SQLQuery and SQLAnswer.
- If the question is ambiguous, state your assumptions clearly in the final answer.
Begin!
Question: {{input}}
Thought: Let's start by analyzing the question and determining the best approach to answer it."
);
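The prompt encodes a ReAct loop: the model alternates Thought/Action/Observation steps, dispatching each Action to a tool until it can produce a Final Answer. A minimal sketch of that dispatch (tool behavior here is mocked; LangDB performs this loop internally):

```python
# Illustrative ReAct tool dispatch: the model's chosen Action is routed to
# the matching tool, and the result becomes the next Observation.
def react_step(action, action_input, tools):
    return tools[action](action_input)

# Mocked tools standing in for the real text_to_sql / cities_info_generic.
tools = {
    "text_to_sql": lambda q: {"SQLQuery": "SELECT ...", "SQLAnswer": "Tokyo"},
    "cities_info_generic": lambda q: "Tokyo hosted the 1964 Summer Olympics.",
}

obs = react_step("text_to_sql", {"question": "most populous city"}, tools)
print(obs["SQLAnswer"])  # Tokyo
```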
Now, we can create the models that can leverage the tools that were created earlier.
CREATE MODEL IF NOT EXISTS cities_info_model(
input
) USING openai(model_name='gpt-4o')
PROMPT cities_prompt
TOOLS (
text_to_sql COMMENT 'Text-to-SQL Model to query the database',
cities_info_generic COMMENT 'Vector Search on Cities Wiki Pages')
SETTINGS retries = 3;
Along with the tools we created, the model also has access to langdb_raw_query, a built-in static tool that allows it to execute raw SELECT (read-only) queries on the database.
Using the created model, we can execute queries which would require the LLM to use both structured (cities table) and unstructured (Wikipedia articles) data through the provided tools.
SELECT * FROM cities_info_model('Tell me about the arts and culture of the city with the highest population');
In the above query, the model generates a SQL query to find the most populous city and invokes the langdb_raw_query tool to execute it. It then takes the result, Tokyo, and invokes the cities_info_generic tool to retrieve information about the city's arts and culture.
SELECT * FROM cities_info_model('Whats the population of the city which conducted the 1964 Summer Olympics');
We can also interact with the created model in a chat session:
CHAT cities_info_model