HTML Extraction Functions

These functions extract data from HTML documents.

html_extract_text

Extract text content from HTML.

Syntax:

html_extract_text(html)
html_extract_text(html, xpath)

Parameters:

  • html (VARCHAR or HTML): The HTML content

  • xpath (VARCHAR, optional): XPath expression to match specific elements

Returns: VARCHAR - The extracted text content.

Examples:

-- Extract all text
SELECT html_extract_text('<html><body><p>Hello World</p></body></html>');
-- Result: "Hello World"

-- Extract specific element
SELECT html_extract_text('<html><body><h1>Title</h1><p>Body</p></body></html>', '//h1');
-- Result: "Title"

Note

When using XPath, only the first matching element’s text is returned.

html_extract_images

Extract all images from HTML with metadata.

Syntax:

html_extract_images(html)

Returns: LIST<STRUCT(alt VARCHAR, src VARCHAR, title VARCHAR, width INTEGER, height INTEGER, line_number INTEGER)>

Example:

SELECT html_extract_images(
    '<img src="photo.jpg" alt="A photo" width="800" height="600">'
);
-- Result: [{alt: "A photo", src: "photo.jpg", title: NULL, width: 800, height: 600, line_number: 1}]

html_extract_tables

Extract HTML tables as rows (table function).

Syntax:

SELECT * FROM html_extract_tables(html)

Returns: TABLE(table_index INTEGER, row_index INTEGER, columns VARCHAR[])

Example:

SELECT * FROM html_extract_tables(
    '<table><tr><th>Name</th><th>Age</th></tr><tr><td>John</td><td>30</td></tr></table>'
);
-- Result:
-- table_index | row_index | columns
-- 0           | 0         | ["Name", "Age"]
-- 0           | 1         | ["John", "30"]

html_extract_table_rows

Extract table data as structured rows.

Syntax:

html_extract_table_rows(html)

Returns: LIST<STRUCT> - Structured table data.

html_extract_tables_json

Extract tables with rich JSON structure including headers.

Syntax:

html_extract_tables_json(html)

Returns: LIST<STRUCT(headers VARCHAR[], rows VARCHAR[][], row_count INTEGER)>

html_escape

Escape HTML special characters.

Syntax:

html_escape(text)

Parameters:

  • text (VARCHAR): Text to escape

Returns: VARCHAR - Text with HTML entities escaped.

Example:

SELECT html_escape('<p>Hello & World</p>');
-- Result: "&lt;p&gt;Hello &amp; World&lt;/p&gt;"

html_unescape

Decode HTML entities to text.

Syntax:

html_unescape(text)

Parameters:

  • text (VARCHAR): Text with HTML entities

Returns: VARCHAR - Decoded text.

Example:

SELECT html_unescape('&lt;p&gt;Hello &amp; World&lt;/p&gt;');
-- Result: "<p>Hello & World</p>"

parse_html

Parse an HTML string into the HTML type.

Syntax:

parse_html(content)

Parameters:

  • content (VARCHAR): HTML string to parse

Returns: HTML - Parsed HTML document.

Example:

SELECT parse_html('<html><body><p>Hello</p></body></html>');