XPath Guide
===========
The webbed extension supports XPath 1.0 expressions for extracting data from XML and HTML documents.
Basic Syntax
------------
XPath uses path expressions to navigate XML documents:
.. code-block:: sql
-- Select all
elements anywhere in document
SELECT xml_extract_text(xml, '//title');
-- Select elements that are children of
SELECT xml_extract_text(xml, '/catalog/book/title');
-- Select the first - element
SELECT xml_extract_text(xml, '//item[1]');
Path Expressions
----------------
.. list-table::
:header-rows: 1
:widths: 30 70
* - Expression
- Description
* - ``/``
- Root element
* - ``//``
- Descendants at any depth
* - ``.``
- Current node
* - ``..``
- Parent node
* - ``@``
- Attribute selector
* - ``*``
- Any element
Examples
~~~~~~~~
.. code-block:: sql
-- Absolute path from root
SELECT xml_extract_text(xml, '/catalog/book/title');
-- Any anywhere
SELECT xml_extract_text(xml, '//title');
-- All child elements of current node
SELECT xml_extract_text(xml, '//*');
Attribute Selection
-------------------
Use ``@`` to select attributes:
.. code-block:: sql
-- Get the 'id' attribute
SELECT xml_extract_text(xml, '//item/@id');
-- Get all attributes of
- elements
SELECT xml_extract_attributes(xml, '//item');
Predicates
----------
Use ``[...]`` to filter results:
Position Predicates
~~~~~~~~~~~~~~~~~~~
.. code-block:: sql
-- First element
SELECT xml_extract_text(xml, '//item[1]');
-- Last element
SELECT xml_extract_text(xml, '//item[last()]');
-- First three elements
SELECT xml_extract_text(xml, '//item[position() <= 3]');
Attribute Predicates
~~~~~~~~~~~~~~~~~~~~
.. code-block:: sql
-- Elements with specific attribute value
SELECT xml_extract_text(xml, '//book[@category="fiction"]/title');
-- Elements that have an attribute
SELECT xml_extract_text(xml, '//item[@id]');
-- Numeric comparison
SELECT xml_extract_text(xml, '//product[@price > 100]/name');
Text Predicates
~~~~~~~~~~~~~~~
.. code-block:: sql
-- Elements containing specific text
SELECT xml_extract_text(xml, '//item[contains(text(), "sale")]');
-- Elements starting with text
SELECT xml_extract_text(xml, '//item[starts-with(text(), "New")]');
Namespace Handling
------------------
For namespaced documents, simple paths may not work:
.. code-block:: sql
-- This might return empty for namespaced XML:
SELECT xml_extract_text(xml, '//element');
Use ``local-name()`` to match regardless of namespace:
.. code-block:: sql
-- Match any element with local name 'element'
SELECT xml_extract_text(xml, '//*[local-name()="element"]');
-- With predicates
SELECT xml_extract_text(xml, '//*[local-name()="item" and @id="123"]');
-- Nested elements
SELECT xml_extract_text(xml, '//*[local-name()="person"]/*[local-name()="name"]');
.. note::
The ``read_xml()`` function automatically strips namespaces during schema inference, so column names won't include namespace prefixes. However, XPath queries against raw XML documents require namespace handling.
Common Functions
----------------
XPath 1.0 supports these functions:
String Functions
~~~~~~~~~~~~~~~~
.. list-table::
:header-rows: 1
:widths: 30 70
* - Function
- Description
* - ``text()``
- Select text content
* - ``contains(str, substr)``
- Check if string contains substring
* - ``starts-with(str, prefix)``
- Check if string starts with prefix
* - ``string-length(str)``
- Get string length
* - ``concat(s1, s2, ...)``
- Concatenate strings
* - ``normalize-space(str)``
- Normalize whitespace
Numeric Functions
~~~~~~~~~~~~~~~~~
.. list-table::
:header-rows: 1
:widths: 30 70
* - Function
- Description
* - ``sum(nodeset)``
- Sum of numeric values
* - ``count(nodeset)``
- Count nodes
* - ``floor(num)``
- Round down
* - ``ceiling(num)``
- Round up
Boolean Functions
~~~~~~~~~~~~~~~~~
.. list-table::
:header-rows: 1
:widths: 30 70
* - Function
- Description
* - ``not(expr)``
- Boolean negation
* - ``true()``
- Returns true
* - ``false()``
- Returns false
Node Functions
~~~~~~~~~~~~~~
.. list-table::
:header-rows: 1
:widths: 30 70
* - Function
- Description
* - ``local-name()``
- Element name without namespace
* - ``name()``
- Element name with namespace prefix
* - ``namespace-uri()``
- Namespace URI
* - ``last()``
- Last position in context
* - ``position()``
- Current position
Examples
--------
Complex Queries
~~~~~~~~~~~~~~~
.. code-block:: sql
-- Products over $100 in electronics category
SELECT xml_extract_text(xml,
'//product[@category="electronics" and @price > 100]/name'
);
-- Second author of first book
SELECT xml_extract_text(xml, '//book[1]/author[2]');
-- All items with 'sale' in description
SELECT xml_extract_text(xml,
'//item[contains(description, "sale")]/name'
);
-- Count of items
SELECT xml_extract_text(xml, 'count(//item)');
HTML-Specific Patterns
~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: sql
-- All paragraph text
SELECT html_extract_text(html, '//p');
-- Links in navigation
SELECT html_extract_text(html, '//nav//a/@href');
-- Images with alt text
SELECT html_extract_text(html, '//img[@alt]/@src');
-- Table cells in first row
SELECT html_extract_text(html, '//table//tr[1]//td');