We have already covered extracting semantic text chunks from PDF documents, but we often have to work with images and preserve the layout to get the most out of LLMs. In this article we'll go through the following example.
For this example, let's use Wikipedia's The World's Billionaires article. We want to extract the table in the 2024 section. Let's get started.
-- Load pages 2 and 3 of the PDF and store the extracted layout blocks in memory.
CREATE TABLE pdf_blocks_billionaires
ENGINE = Memory AS
SELECT * FROM extract_layout(
    (SELECT * FROM load(
        'https://en.wikipedia.org/api/rest_v1/page/pdf/The_World%27s_Billionaires'
    )),
    type => 'PDF', page_range => [2, 3]
);
This extracts structured blocks from the PDF along with their layout information. To see what was found, count the layout blocks on each page:
select page, block_type, count(block_type) from pdf_blocks_billionaires
where block_type like 'LAYOUT%'
group by page, block_type;
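To make the aggregation concrete, here is a minimal Python sketch of what the query above computes: a count per (page, block_type) pair, restricted to types starting with LAYOUT. The rows and the block type names are hypothetical placeholders, not output from the real pdf_blocks_billionaires table.

```python
from collections import Counter

# Hypothetical (page, block_type) rows; the type names are illustrative
# assumptions, not values taken from the real table.
rows = [
    (2, "LAYOUT_TEXT"),
    (2, "LAYOUT_TABLE"),
    (2, "LAYOUT_TEXT"),
    (2, "WORD"),
    (3, "LAYOUT_TABLE"),
]


def count_layout_blocks(rows):
    """Count blocks per (page, block_type), keeping only LAYOUT* types."""
    return Counter(
        (page, block_type)
        for page, block_type in rows
        if block_type.startswith("LAYOUT")
    )


counts = count_layout_blocks(rows)
for (page, block_type), n in sorted(counts.items()):
    print(page, block_type, n)
```

The startswith check mirrors the `like 'LAYOUT%'` filter, and the Counter keyed on the pair plays the role of `group by page, block_type`.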
Let's now extract a structured table from these blocks and transpose it.
select * from show_parsed_schemas((select * from pdf_blocks_billionaires)) where schema <> '';
select * from transpose_parsed_tables(table_index => 0, (select * from pdf_blocks_billionaires));
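For readers unfamiliar with the operation, transposing a table swaps its rows and columns. The sketch below shows the idea in plain Python on placeholder cells; it is only an illustration of the concept, and the exact output shape of transpose_parsed_tables may differ.

```python
def transpose(table):
    """Swap rows and columns of a rectangular list-of-lists table."""
    return [list(column) for column in zip(*table)]


# Placeholder cells only; these are not values from the Wikipedia table.
table = [
    ["col_a", "col_b", "col_c"],
    ["a1", "b1", "c1"],
    ["a2", "b2", "c2"],
]

transposed = transpose(table)
for row in transposed:
    print(row)
```

Each original column becomes one row, so a header row turns into the first column of the result. Transposing twice returns the original table.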
Compare the result with the original table on the Wikipedia page. As mentioned in other articles, you can mix and match structured and unstructured data and use the pretty_print function to feed data to LLMs.
You can also convert the blocks directly to Markdown, using the schema provided, to feed the entire document with its layout to LLMs:
select * from print_parsed_markdown((select * from pdf_blocks_billionaires));
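As a rough picture of what a table looks like once rendered as Markdown, here is a small Python sketch that formats a header and rows as a Markdown table. The helper name to_markdown_table and the placeholder cells are assumptions made for illustration; this is not how print_parsed_markdown is implemented, and its real output may be formatted differently.

```python
def to_markdown_table(header, rows):
    """Render a header and rows as a Markdown table string."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Placeholder cells only; these are not values from the Wikipedia table.
markdown = to_markdown_table(
    ["col_a", "col_b"],
    [["a1", "b1"], ["a2", "b2"]],
)
print(markdown)
```

Keeping the table as Markdown preserves the row and column structure in plain text, which is the layout information an LLM can use when it reads the document.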