XPath Guide

The webbed extension supports XPath 1.0 expressions for extracting data from XML and HTML documents.

Basic Syntax

XPath uses path expressions to navigate XML documents:

-- Select all <title> elements anywhere in document
SELECT xml_extract_text(xml, '//title');

-- Select <title> elements that are children of <book>
SELECT xml_extract_text(xml, '/catalog/book/title');

-- Select the first <item> element
SELECT xml_extract_text(xml, '//item[1]');

Path Expressions

Expression

Description

/

Root element

//

Descendants at any depth

.

Current node

..

Parent node

@

Attribute selector

*

Any element

Examples

-- Absolute path from root
SELECT xml_extract_text(xml, '/catalog/book/title');

-- Any <title> anywhere
SELECT xml_extract_text(xml, '//title');

-- All child elements of current node
SELECT xml_extract_text(xml, '//*');

Attribute Selection

Use @ to select attributes:

-- Get the 'id' attribute
SELECT xml_extract_text(xml, '//item/@id');

-- Get all attributes of <item> elements
SELECT xml_extract_attributes(xml, '//item');

Predicates

Use [...] to filter results:

Position Predicates

-- First element
SELECT xml_extract_text(xml, '//item[1]');

-- Last element
SELECT xml_extract_text(xml, '//item[last()]');

-- First three elements
SELECT xml_extract_text(xml, '//item[position() <= 3]');

Attribute Predicates

-- Elements with specific attribute value
SELECT xml_extract_text(xml, '//book[@category="fiction"]/title');

-- Elements that have an attribute
SELECT xml_extract_text(xml, '//item[@id]');

-- Numeric comparison
SELECT xml_extract_text(xml, '//product[@price > 100]/name');

Text Predicates

-- Elements containing specific text
SELECT xml_extract_text(xml, '//item[contains(text(), "sale")]');

-- Elements starting with text
SELECT xml_extract_text(xml, '//item[starts-with(text(), "New")]');

Namespace Handling

For namespaced documents, simple paths may not work:

-- This might return empty for namespaced XML:
SELECT xml_extract_text(xml, '//element');

Use local-name() to match regardless of namespace:

-- Match any element with local name 'element'
SELECT xml_extract_text(xml, '//*[local-name()="element"]');

-- With predicates
SELECT xml_extract_text(xml, '//*[local-name()="item" and @id="123"]');

-- Nested elements
SELECT xml_extract_text(xml, '//*[local-name()="person"]/*[local-name()="name"]');

Note

The read_xml() function automatically strips namespaces during schema inference, so column names won’t include namespace prefixes. However, XPath queries against raw XML documents require namespace handling.

Common Functions

XPath 1.0 supports these functions:

String Functions

Function

Description

text()

Select text content

contains(str, substr)

Check if string contains substring

starts-with(str, prefix)

Check if string starts with prefix

string-length(str)

Get string length

concat(s1, s2, ...)

Concatenate strings

normalize-space(str)

Normalize whitespace

Numeric Functions

Function

Description

sum(nodeset)

Sum of numeric values

count(nodeset)

Count nodes

floor(num)

Round down

ceiling(num)

Round up

Boolean Functions

Function

Description

not(expr)

Boolean negation

true()

Returns true

false()

Returns false

Node Functions

Function

Description

local-name()

Element name without namespace

name()

Element name with namespace prefix

namespace-uri()

Namespace URI

last()

Last position in context

position()

Current position

Examples

Complex Queries

-- Products over $100 in electronics category
SELECT xml_extract_text(xml,
    '//product[@category="electronics" and @price > 100]/name'
);

-- Second author of first book
SELECT xml_extract_text(xml, '//book[1]/author[2]');

-- All items with 'sale' in description
SELECT xml_extract_text(xml,
    '//item[contains(description, "sale")]/name'
);

-- Count of items
SELECT xml_extract_text(xml, 'count(//item)');

HTML-Specific Patterns

-- All paragraph text
SELECT html_extract_text(html, '//p');

-- Links in navigation
SELECT html_extract_text(html, '//nav//a/@href');

-- Images with alt text
SELECT html_extract_text(html, '//img[@alt]/@src');

-- Table cells in first row
SELECT html_extract_text(html, '//table//tr[1]//td');