HTML Extraction Functions
=========================

These functions extract data from HTML documents.

html_extract_text
-----------------

Extract text content from HTML.

**Syntax:**

.. code-block:: sql

   html_extract_text(html)
   html_extract_text(html, xpath)

**Parameters:**

- ``html`` (VARCHAR or HTML): The HTML content
- ``xpath`` (VARCHAR, optional): XPath expression to match specific elements

**Returns:** VARCHAR - The extracted text content.

**Examples:**

.. code-block:: sql

   -- Extract all text
   SELECT html_extract_text('<p>Hello World</p>');
   -- Result: "Hello World"

   -- Extract specific element
   SELECT html_extract_text('<h1>Title</h1><p>Body</p>', '//h1');
   -- Result: "Title"

.. note::

   When using XPath, only the first matching element's text is returned.

html_extract_links
------------------

Extract all hyperlinks from HTML with metadata.

**Syntax:**

.. code-block:: sql

   html_extract_links(html)

**Returns:** LIST

.. code-block:: sql

   -- Result: [{alt: "A photo", src: "photo.jpg", title: NULL, width: 800, height: 600, line_number: 1}]

html_extract_tables
-------------------

Extract HTML tables as rows (table function).

**Syntax:**

.. code-block:: sql

   SELECT * FROM html_extract_tables(html)

**Returns:** TABLE(table_index INTEGER, row_index INTEGER, columns VARCHAR[])

**Example:**

.. code-block:: sql

   SELECT * FROM html_extract_tables(
       '<table><tr><th>Name</th><th>Age</th></tr><tr><td>John</td><td>30</td></tr></table>'
   );


.. code-block:: sql

   -- Input:  '<p>Hello & World</p>'
   -- Result: "&lt;p&gt;Hello &amp; World&lt;/p&gt;"

html_unescape
-------------

Decode HTML entities to text.

**Syntax:**

.. code-block:: sql

   html_unescape(text)

**Parameters:**

- ``text`` (VARCHAR): Text with HTML entities

**Returns:** VARCHAR - Decoded text.

**Example:**

.. code-block:: sql

   SELECT html_unescape('&lt;p&gt;Hello &amp; World&lt;/p&gt;');
   -- Result: "<p>Hello & World</p>"

parse_html
----------

Parse an HTML string into the HTML type.

**Syntax:**

.. code-block:: sql

   parse_html(content)

**Parameters:**

- ``content`` (VARCHAR): HTML string to parse

**Returns:** HTML - Parsed HTML document.

**Example:**

.. code-block:: sql

   SELECT parse_html('<p>Hello</p>');
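
Since ``html_extract_text`` is documented as accepting either VARCHAR or HTML input, a value produced by ``parse_html`` can presumably be passed to it directly. The query below is a minimal sketch under that assumption. The sample markup is illustrative only, and no result is shown because the combination has not been verified here.

.. code-block:: sql

   -- Assumption: the HTML value returned by parse_html is accepted wherever
   -- the html parameter is documented as "VARCHAR or HTML".
   SELECT html_extract_text(
       parse_html('<h1>Title</h1><p>Body</p>'),  -- illustrative markup
       '//h1'                                    -- XPath: first <h1> element
   );

Parsing once and reusing the HTML value may be convenient when several extraction functions are applied to the same document, though whether it avoids repeated parsing depends on the implementation.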