LangDB aims to bridge the gap between the data and AI worlds by running them side by side. You can build an entire LangDB experience using nothing but SQL commands, all the way from text extraction to creating a chat agent.
This sample demonstrates the following:
- the extract_text command
- Views
- a model that connects to OpenAI APIs
- the CHAT command

-- Create table for storing pdf chunks
CREATE TABLE pdf_chunks (
    id UUID DEFAULT generateUUIDv4(),
    content String
)
ENGINE = MergeTree
ORDER BY (id, content);
-- Extract text from the PDF and split it into 200-word chunks
INSERT INTO pdf_chunks(content)
SELECT text FROM chunk(
    (
        SELECT content FROM
        extract_text( path => 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/faab4555-c69b-438a-aaf7-e09305f87ca3.pdf',
                      type => 'pdf')
    ),
    chunk_size => 200,
    type => 'Word'
);
-- Verify that the chunks were stored
SELECT count(*) FROM pdf_chunks;
SELECT * FROM pdf_chunks LIMIT 2;
We will reuse the open_ai_provider we created in Using Providers. If you haven't created one yet, create it with the following command:
CREATE PROVIDER open_ai_provider
ENGINE = OpenAI(
    api_key = 'sk-proj-xxx',
    model_name = 'gpt-4o-mini'
);
LangDB offers a convenient way to generate embeddings through custom embedding-type models, which is useful for development and testing. Alternatively, the in-built embed() function can generate embeddings directly.
CREATE EMBEDDING MODEL IF NOT EXISTS generate_embeddings(
    input COMMENT 'Input text for which embeddings are generated'
) USING open_ai_provider(embedding_model='text-embedding-ada-002', encoding_format='float', dimensions=100);
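For ad-hoc experiments, the in-built embed() function mentioned above can be used without defining a model. A minimal sketch, assuming embed() takes the text to embed as its single argument:
-- Sketch: embed an ad-hoc string (exact signature is an assumption)
SELECT embed('What were Apple''s net sales?');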
-- Create a table for storing embeddings
CREATE TABLE pdf_embeddings (
    id UUID,
    embeddings Array(Float32)
)
ENGINE = MergeTree
ORDER BY id;
Use the generate_embeddings model to generate embeddings for each chunk and store them in the pdf_embeddings table.
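For a one-off backfill, the same query can run as a plain INSERT; a sketch reusing the generate_embeddings model defined above:
-- Embed all chunks in a single pass
INSERT INTO pdf_embeddings
SELECT id, generate_embeddings(content)
FROM pdf_chunks;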
You can also use the SPAWN TASK feature to run this incrementally in the background:
SPAWN TASK store_embeddings
BEGIN
    -- Embed up to 5 chunks per run, skipping chunks that already have embeddings
    INSERT INTO pdf_embeddings
    SELECT p.id, generate_embeddings(content)
    FROM pdf_chunks AS p
    LEFT JOIN pdf_embeddings AS pe ON p.id = pe.id
    WHERE p.id != pe.id
    ORDER BY p.id
    LIMIT 5
END
EVERY 1 MINUTE
WITH MAX_POOL_SIZE 5;
-- check embeddings
SELECT * FROM pdf_embeddings LIMIT 2;
Views let you create access endpoints that can be used both for API consumption and as RAG inputs for LLMs. Here we create a view named similar() that performs a vector search for the chunks most similar to a query, which we then feed into our LLM model.
CREATE VIEW similar(query String "Query to search similar sections in pdf documents") AS
WITH tbl AS (
    -- Embed the incoming query once
    SELECT CAST(generate_embeddings($query) AS Array(Float32)) AS query
)
SELECT
    p.id AS id,
    content,
    -- cosineDistance: a smaller value means a closer match
    cosineDistance(embeddings, query) AS similarity
FROM
    pdf_embeddings AS pe
JOIN
    pdf_chunks AS p ON p.id = pe.id
CROSS JOIN
    tbl
ORDER BY
    similarity ASC
LIMIT 5;
Sample vector searches using the similar() view:
-- Returns similar sections in pdf documents matching the input string
SELECT * FROM similar('Apple liabilities') LIMIT 2;
SELECT * FROM similar('PLEASE NOTE: THERE IS NO PROOF OF CLAIM FORM FOR') LIMIT 2;
You can dynamically create models and prompts on the fly using CREATE commands that leverage the tools you have created. Here we create a search model that uses the similar view as a tool.

CREATE MODEL IF NOT EXISTS search(
input
) USING open_ai_provider()
PROMPT (system "You are a helpful assistant. You have been provided with the similar tool to search over the SEC filings.
Your task is to help users with any queries they might have about the SEC filings.
Go through all the content and then respond.",
human "{{input}}")
TOOLS (
similar COMMENT 'View to search embeddings of the PDF'
)
SETTINGS retries = 1;
Using the created model, we can run a search() query that leverages the similar() tool to answer questions. This can be plugged straight into a frontend to provide a chat interface.
SELECT search('Earnings of Apple in 2022');
We can also use the created model as a chat agent. The following pops up a chat interface where you can interact with the model:
CHAT search WITH input = 'Hi';
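Finally, drop the tables created in this sample: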
DROP TABLE pdf_chunks;
DROP TABLE pdf_embeddings;