HTML Extraction Functions
=========================

These functions extract data from HTML documents.

html_extract_text
-----------------

Extract text content from HTML.

**Syntax:**

.. code-block:: sql

   html_extract_text(html)
   html_extract_text(html, xpath)

**Parameters:**

- ``html`` (VARCHAR or HTML): The HTML content
- ``xpath`` (VARCHAR, optional): XPath expression to match specific elements

**Returns:** VARCHAR - The extracted text content.

**Examples:**

.. code-block:: sql

   -- Extract all text
   SELECT html_extract_text('<p>Hello World</p>');
   -- Result: "Hello World"

   -- Extract a specific element
   SELECT html_extract_text('<h1>Title</h1><p>Body</p>', '//h1');
   -- Result: "Title"

.. note::

   When using XPath, only the first matching element's text is returned.

html_extract_links
------------------

Extract all hyperlinks from HTML with metadata.

**Syntax:**

.. code-block:: sql

   html_extract_links(html)

**Returns:** LIST - One entry per link, with the fields ``text``, ``href``, ``title``, and ``line_number`` shown in the example.

**Example:**

.. code-block:: sql

   SELECT html_extract_links(
       '<a href="/home" title="Home Page">Home</a><a href="/about">About</a>'
   );
   -- Result: [
   --   {text: "Home", href: "/home", title: "Home Page", line_number: 1},
   --   {text: "About", href: "/about", title: NULL, line_number: 1}
   -- ]

   -- Unnest to get individual links
   SELECT (unnest(html_extract_links(html))).href AS url
   FROM read_html_objects('page.html');

html_extract_images
-------------------

Extract all images from HTML with metadata.

**Syntax:**

.. code-block:: sql

   html_extract_images(html)

**Returns:** LIST - One entry per image, with the fields ``alt``, ``src``, ``title``, ``width``, ``height``, and ``line_number`` shown in the example.

**Example:**

.. code-block:: sql

   SELECT html_extract_images(
       '<img src="photo.jpg" alt="A photo" width="800" height="600">'
   );
   -- Result: [{alt: "A photo", src: "photo.jpg", title: NULL, width: 800, height: 600, line_number: 1}]

html_extract_tables
-------------------

Extract HTML tables as rows (table function).

**Syntax:**

.. code-block:: sql

   SELECT * FROM html_extract_tables(html)

**Returns:** TABLE(table_index INTEGER, row_index INTEGER, columns VARCHAR[])

**Example:**

.. code-block:: sql

   SELECT * FROM html_extract_tables(
       '<table>
          <tr><th>Name</th><th>Age</th></tr>
          <tr><td>John</td><td>30</td></tr>
        </table>'
   );
   -- Result:
   -- table_index | row_index | columns
   -- 0           | 0         | ["Name", "Age"]
   -- 0           | 1         | ["John", "30"]

html_extract_table_rows
-----------------------

Extract table data as structured rows.

**Syntax:**

.. code-block:: sql

   html_extract_table_rows(html)

**Returns:** LIST - Structured table data.

html_extract_tables_json
------------------------

Extract tables with rich JSON structure including headers.

**Syntax:**

.. code-block:: sql

   html_extract_tables_json(html)

**Returns:** LIST

html_escape
-----------

Escape HTML special characters.

**Syntax:**

.. code-block:: sql

   html_escape(text)

**Parameters:**

- ``text`` (VARCHAR): Text to escape

**Returns:** VARCHAR - Text with HTML special characters replaced by entities.

**Example:**

.. code-block:: sql

   SELECT html_escape('<p>Hello & World</p>');
   -- Result: "&lt;p&gt;Hello &amp; World&lt;/p&gt;"

html_unescape
-------------

Decode HTML entities to text.

**Syntax:**

.. code-block:: sql

   html_unescape(text)

**Parameters:**

- ``text`` (VARCHAR): Text with HTML entities

**Returns:** VARCHAR - Decoded text.

**Example:**

.. code-block:: sql

   SELECT html_unescape('&lt;p&gt;Hello &amp; World&lt;/p&gt;');
   -- Result: "<p>Hello & World</p>"

parse_html
----------

Parse an HTML string into the HTML type.

**Syntax:**

.. code-block:: sql

   parse_html(content)

**Parameters:**

- ``content`` (VARCHAR): HTML string to parse

**Returns:** HTML - Parsed HTML document.

**Example:**

.. code-block:: sql

   SELECT parse_html('<p>Hello</p>');
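
Because ``html_extract_text`` is documented as accepting either VARCHAR or HTML input, the value returned by ``parse_html`` can presumably be passed to it directly. The queries below are a minimal sketch under that assumption; the literal inputs and the column aliases ``heading`` and ``round_trip`` are illustrative only, and no output is shown because the behavior has not been verified here.

.. code-block:: sql

   -- Parse once, then extract from the typed HTML value
   -- (assumes the HTML type is accepted wherever ``html`` is documented as "VARCHAR or HTML")
   SELECT html_extract_text(parse_html('<h1>Title</h1><p>Body</p>'), '//h1') AS heading;

   -- Escape, then unescape; going by the function descriptions,
   -- this is expected to give back the original string
   SELECT html_unescape(html_escape('<p>Hello & World</p>')) AS round_trip;

Parsing first may be convenient when several extraction functions are applied to the same document, although whether it avoids repeated parsing is not stated in this reference.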