Structured Extraction

We have already covered extracting semantic text chunks from PDF documents, but we often have to work with images and preserve the layout to get the most out of LLMs. In this article we'll go through the following:

Loading PDF tables
Querying table schemas
Transposing into structured tables

For this example, let's use Wikipedia's The World's Billionaires data set. We want to extract the table in the 2024 section. Let's get started.

In [34]:

-- Load the layout blocks for the requested page range into an in-memory table.
CREATE TABLE pdf_blocks_billionaires
ENGINE = Memory AS
SELECT page, block_idx, block_id, block_type, row_id, col_id, text, confidence, entity_types, relationships
FROM extract_layout(
    'https://en.wikipedia.org/api/rest_v1/page/pdf/The_World%27s_Billionaires',
    type => 'PDF', page_range => [2, 3])

This extracts the PDF's content as blocks, each carrying its layout information (block type, row and column IDs, confidence, and so on).

In [35]:

select page, block_type, count(block_type) from pdf_blocks_billionaires 
    where block_type like 'LAYOUT%' 
    group by page, block_type;

	block_type	count(block_type)
0	LAYOUT_TEXT	1
1	LAYOUT_TABLE	1
2	LAYOUT_SECTION_HEADER	1

Let's now extract a structured table from these blocks and transpose it.

In [37]:

select * from show_parsed_schemas((select * from pdf_blocks_billionaires)) where schema<> ''

	table_index	schema
0	0	`Primary source(s) of wealth` String,`Net worth (USD)` String,`No.` String,`Nationality` String,`Name` String,`Age` String
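If you need the column names on the client side, the schema string can be split into name and type pairs. The following is a minimal sketch, assuming the schema is always a comma-separated list of backtick-quoted names followed by a type, as in the output above; `parse_schema` is a hypothetical helper, not part of the extraction functions.

```python
import re

# Schema string copied from the show_parsed_schemas output above.
SCHEMA = (
    "`Primary source(s) of wealth` String,`Net worth (USD)` String,"
    "`No.` String,`Nationality` String,`Name` String,`Age` String"
)


def parse_schema(schema: str) -> list[tuple[str, str]]:
    """Return (column_name, column_type) pairs from a backtick-quoted schema string.

    Hypothetical helper: assumes each column is written as `name` Type.
    """
    # Names may contain spaces, dots and parentheses, so match up to the closing backtick.
    return re.findall(r"`([^`]+)`\s+(\w+)", schema)


columns = parse_schema(SCHEMA)
for name, col_type in columns:
    print(f"{name}: {col_type}")
```

Splitting on the backticks rather than on commas keeps names such as `Primary source(s) of wealth` intact even if a column name were to contain a comma.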

In [39]:

select * from transpose_parsed_tables(table_index => 0, (select * from pdf_blocks_billionaires))

	No.	Name	Age	Net worth (USD)	Nationality	Primary source(s) of wealth
0
1	75	1	LVMH	Bernard Amault & family	France	$233 billion
2	52	2 -	Tesla, SpaceX, Twitter (Currently	Elon Musk	South Africa Canada United States	$195 billion
3	60	3	Amazon	Jeff Bezos	United States	$194 billion
4	39	4 A	Meta Platforms	Mark Zuckerberg	United States	$177 billion
5	79	5	Oracle Corporation	Larry Ellison	United States	$141 billion
6	93	6	Berkshire Hathaway	Warren Buffett	United States	$133 billion
7	68	7	Microsoft	Bill Gates	United States	$128 billion
8	68	8 A	Microsoft	Steve Ballmer	United States	$121 billion
9	65	9	Reliance Industries	Mukesh Ambani	India	$116 billion
10	51	10	Google	Larry Page	United States	$114 billion
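Every column comes back as a string, so numeric work needs a conversion step first. Below is a minimal sketch, assuming net worth values always look like `$233 billion` and that rank cells may carry trailing marks such as `4 A` or `2 -`; `parse_net_worth` and `parse_rank` are hypothetical helpers written for this example.

```python
import re

# Multipliers for the magnitude words seen in the net worth values.
_MULTIPLIERS = {"million": 1e6, "billion": 1e9, "trillion": 1e12}


def parse_net_worth(text: str) -> float:
    """Convert a string such as '$233 billion' to a number of US dollars."""
    match = re.fullmatch(
        r"\s*\$([\d.,]+)\s*(million|billion|trillion)\s*", text, re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"unrecognized net worth: {text!r}")
    amount = float(match.group(1).replace(",", ""))
    return amount * _MULTIPLIERS[match.group(2).lower()]


def parse_rank(text: str) -> int:
    """Keep only the leading integer of a rank cell such as '4 A' or '2 -'."""
    match = re.match(r"\s*(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized rank: {text!r}")
    return int(match.group(1))


print(parse_net_worth("$233 billion"))
print(parse_rank("4 A"))
```

Both helpers raise `ValueError` on unexpected input rather than guessing, which makes extraction noise easy to spot when processing the full table.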

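Check out the original table on the Wikipedia page to compare. Note that the column headers in the output above do not line up with the values (the ages, for example, appear under `No.`), so verify the column mapping before relying on it. As mentioned in other articles, you can mix and match structured and unstructured data and use the pretty_print function to feed data to LLMs.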

You can also convert the document directly to Markdown using the provided schema, which lets you feed the entire document, layout included, to LLMs:

In [41]:

select * from print_parsed_markdown((select * from pdf_blocks_billionaires))

	content
0	| Net worth (USD) | Primary source(s) of wealth | No. | Name | Age | Nationality |\n|-----------------|-----------------------------|--------------|------|-----------------------------------|-----------------------------------|\n| null | null | null | null | null | null |\n| 1 | Bernard Amault & family | $233 billion | 75 | France | LVMH |\n| 2 - | Elon Musk | $195 billion | 52 | South Africa Canada United States | Tesla, SpaceX, Twitter (Currently |\n| 3 | Jeff Bezos | $194 billion | 60 | United States | Amazon |\n| 4 A | Mark Zuckerberg | $177 billion | 39 | United States | Meta Platforms |\n| 5 | Larry Ellison | $141 billion | 79 | United States | Oracle Corporation |\n| 6 | Warren Buffett | $133 billion | 93 | United States | Berkshire Hathaway |\n| 7 | Bill Gates | $128 billion | 68 | United States | Microsoft |\n| 8 A | Steve Ballmer | $121 billion | 68 | United States | Microsoft |\n| 9 | Mukesh Ambani | $116 billion | 65 | India | Reliance Industries |\n| 10 | Larry Page | $114 billion | 51 | United States | Google |\n ## 2023\n In the 37th annual Forbes list of the world's billionaires, the list included 2,640 billionaires with a total net\n wealth of $12.2 trillion, down 28 members and $500 billion from 2022. Over half of the list is less wealthy\n compared to the previous year, including Elon Musk, who fell from No. 1 to No. 2. [7] The list also marks\n for the first time a French citizen was in the top position as well as a non-American for the first time since\n 2013 when the Mexican Carlos Slim Helu was the world's richest person. The list, like in 2022. counted 15\n under 30 billionaires with the richest of them being Red Bull heir Mark Mateschitz with a net worth of\n $34.7 billion. The youngest of the lot were Clemente Del Vecchio, heir to the Luxottica fortune shared with\n his six siblings and stepmother, and Kim Jung-yang, whose fortune lies in Japanese-South Korean gaming\n giant Nexon, both under-20s.(11)\n

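If you want to post-process the Markdown before passing it to an LLM, the table portion can be read back into rows. This is a minimal sketch, assuming the content string uses real newline characters and that the table is the first pipe-delimited block; `parse_markdown_table` is a hypothetical helper and `CONTENT` is a shortened excerpt of the output above. The header labels are used exactly as emitted, so any mismatch between labels and values in the output carries over.

```python
# Shortened excerpt of the print_parsed_markdown output (header, separator, three rows, next heading).
CONTENT = (
    "| Net worth (USD) | Primary source(s) of wealth | No. | Name | Age | Nationality |\n"
    "|-----------------|-----------------------------|--------------|------|"
    "-----------------------------------|-----------------------------------|\n"
    "| null | null | null | null | null | null |\n"
    "| 1 | Bernard Amault & family | $233 billion | 75 | France | LVMH |\n"
    "| 3 | Jeff Bezos | $194 billion | 60 | United States | Amazon |\n"
    " ## 2023\n"
)


def parse_markdown_table(content: str) -> list[dict[str, str]]:
    """Parse the first pipe-delimited Markdown table into a list of row dicts."""
    header = None
    rows = []
    for raw_line in content.splitlines():
        line = raw_line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            if header is not None:
                break  # the table has ended; ignore the remaining prose
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first table line holds the column labels
        elif all(cell and set(cell) <= set("-: ") for cell in cells):
            continue  # skip the |---|---| separator line
        else:
            rows.append(dict(zip(header, cells)))
    return rows


rows = parse_markdown_table(CONTENT)
# Drop rows in which every cell is the literal string "null".
rows = [row for row in rows if any(value != "null" for value in row.values())]
for row in rows:
    print(row)
```

Filtering the all-null row after parsing keeps the parser itself generic and makes the clean-up step easy to adjust.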