LangDB aims to bridge the gap between the data and AI worlds by running them side by side. You can build an entire LangDB experience using nothing but SQL commands, all the way from text extraction to creating a chat agent.
This sample demonstrates the following:
- the extract_text command
- Views
- a model that connects to OpenAI APIs
- the CHAT command

-- Create table for storing pdf chunks
CREATE TABLE pdf_chunks (
    id UUID DEFAULT generateUUIDv4(),
    content String
)
ENGINE = MergeTree
ORDER BY (id, content);
-- Extract text from the PDF and split it into 200-word chunks
INSERT INTO pdf_chunks(content)
SELECT text FROM chunk(
    (
        SELECT content FROM
        extract_text( path => 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/faab4555-c69b-438a-aaf7-e09305f87ca3.pdf',
                      type => 'pdf')
    ),
    chunk_size => 200,
    type => 'Word'
);
-- Verify that the chunks were stored
SELECT count(*) FROM pdf_chunks;
SELECT * FROM pdf_chunks LIMIT 2;
We will reuse the open_ai_provider we created in Using Providers. If you haven't created one yet, create it with the following command:
CREATE PROVIDER open_ai_provider
ENGINE = OpenAI(
    api_key = 'sk-proj-xxx',
    model_name = 'gpt-4o-mini'
);
LangDB offers a convenient way to generate embeddings through custom embedding-type models, which is useful for development and testing. Alternatively, the in-built embed() function can generate embeddings directly.
CREATE EMBEDDING MODEL IF NOT EXISTS generate_embeddings(
    input COMMENT 'Input text for which embeddings are generated'
) USING open_ai_provider(embedding_model='text-embedding-ada-002', encoding_format='float', dimensions=100);
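For ad-hoc experiments, the in-built embed() function mentioned above can be used without defining a model. A minimal sketch, assuming embed() takes the text to embed as its single argument:
-- Sketch: embed an ad-hoc string (exact signature is an assumption)
SELECT embed('What were Apple''s net sales?');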
-- Create a table for storing embeddings
CREATE TABLE pdf_embeddings (
    id UUID,
    embeddings Array(Float32)
)
ENGINE = MergeTree
ORDER BY id;
Use the generate_embeddings model to generate embeddings for each chunk and store them in the pdf_embeddings table.
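For a one-off backfill, the same query can run as a plain INSERT; a sketch reusing the generate_embeddings model defined above:
-- Embed all chunks in a single pass
INSERT INTO pdf_embeddings
SELECT id, generate_embeddings(content)
FROM pdf_chunks;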
You can also use the SPAWN TASK feature to run this incrementally in the background:
SPAWN TASK store_embeddings
BEGIN
    -- Embed up to 5 chunks per run, skipping chunks that already have embeddings
    INSERT INTO pdf_embeddings
    SELECT p.id, generate_embeddings(content)
    FROM pdf_chunks AS p
    LEFT JOIN pdf_embeddings AS pe ON p.id = pe.id
    WHERE p.id != pe.id
    ORDER BY p.id
    LIMIT 5
END
EVERY 1 MINUTE
WITH MAX_POOL_SIZE 5;
-- check embeddings
SELECT * FROM pdf_embeddings LIMIT 2;
Views let you create access endpoints that can be used both for API consumption and as RAG inputs for LLMs. Here we create a view named similar() that performs a vector search for the chunks most similar to a query, which we then feed into our LLM model.
CREATE VIEW similar(query String "Query to search similar sections in pdf documents") AS
WITH tbl AS (
    -- Embed the incoming query once
    SELECT CAST(generate_embeddings($query) AS Array(Float32)) AS query
)
SELECT
    p.id AS id,
    content,
    -- cosineDistance: a smaller value means a closer match
    cosineDistance(embeddings, query) AS similarity
FROM
    pdf_embeddings AS pe
JOIN
    pdf_chunks AS p ON p.id = pe.id
CROSS JOIN
    tbl
ORDER BY
    similarity ASC
LIMIT 5;
Sample vector searches using the similar() view:
-- Returns similar sections in pdf documents matching the input string
SELECT * FROM similar('Apple liabilities') LIMIT 2;
SELECT * FROM similar('PLEASE NOTE: THERE IS NO PROOF OF CLAIM FORM FOR') LIMIT 2;
You can dynamically create models and prompts on the fly using CREATE commands that leverage the tools you have created. Here we create a search model that uses the similar view as a tool.

CREATE MODEL IF NOT EXISTS search(
input
) USING open_ai_provider()
PROMPT (system "You are a helpful assistant. You have been provided with the similar tool to search over the SEC filings.
Your task is to help users with any queries they might have about the SEC filings.
Go through all the content and then respond.",
human "{{input}}")
TOOLS (
similar COMMENT 'View to search embeddings of the PDF'
)
SETTINGS retries = 1;
Using the created model, we can run a search() query that leverages the similar() tool to answer questions. This can be plugged straight into a frontend to provide a chat interface.
SELECT search('Earnings of Apple in 2022');
We can also use the created model as a chat agent. The following pops up a chat interface where you can interact with the model:
CHAT search WITH input = 'Hi';
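Finally, drop the tables created in this sample: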
DROP TABLE pdf_chunks;
DROP TABLE pdf_embeddings;