Quick Start

This guide will get you started with the webbed extension in just a few minutes.

Loading the Extension

LOAD webbed;

Reading XML Files

The simplest way to work with XML is using read_xml:

-- Read an XML file directly into a table
SELECT * FROM read_xml('data.xml');

-- Read multiple files with a glob pattern
SELECT * FROM read_xml('config/*.xml');

-- Read with schema inference options
SELECT * FROM read_xml('data.xml', record_element := 'item');

Reading HTML Files

Similarly for HTML:

-- Read HTML files
SELECT * FROM read_html('page.html');

-- Extract specific elements
SELECT * FROM read_html('page.html', record_element := 'article');

Extracting Data with XPath

Use XPath expressions to extract specific content:

-- Extract text from XML
SELECT xml_extract_text('<book><title>DuckDB Guide</title></book>', '//title');
-- Result: "DuckDB Guide"

-- Extract from HTML
SELECT html_extract_text('<html><body><h1>Welcome</h1></body></html>', '//h1');
-- Result: "Welcome"

-- Extract attributes
SELECT xml_extract_attributes('<item id="123" type="book"/>', '/item');
-- Result: [{id: "123", type: "book"}]

Working with Document Objects

For more control, use the _objects variants:

-- Get raw document objects
SELECT xml, filename
FROM read_xml_objects('data/*.xml', filename=true);

-- Process each document
SELECT
    filename,
    xml_extract_text(xml, '//title') as title,
    xml_stats(xml::VARCHAR) as stats
FROM read_xml_objects('books/*.xml', filename=true);

Converting Between Formats

Convert XML to JSON and vice versa:

-- XML to JSON
SELECT xml_to_json('<person><name>John</name><age>30</age></person>');
-- Result: {"person":{"name":{"#text":"John"},"age":{"#text":"30"}}}

-- JSON to XML
SELECT json_to_xml('{"name":"John","age":30}');
-- Result: <root><name>John</name><age>30</age></root>

Parsing XML/HTML Strings

Parse XML or HTML content directly from strings:

-- Parse an XML string with schema inference
SELECT * FROM parse_xml('<data><item><name>Widget</name><price>9.99</price></item></data>');

-- Parse HTML content
SELECT * FROM parse_html('<div><p>Hello</p><p>World</p></div>', record_element := 'p');

Controlling Date/Time Parsing

Use datetime_format to control how dates and timestamps are detected:

-- Parse European dates (DD/MM/YYYY)
SELECT * FROM read_xml('data.xml', datetime_format := 'eu');

-- Parse US dates (MM/DD/YYYY)
SELECT * FROM read_xml('data.xml', datetime_format := 'us');

-- Use a custom format string
SELECT * FROM read_xml('data.xml', datetime_format := '%Y/%m/%d');

-- Disable date detection entirely
SELECT * FROM read_xml('data.xml', datetime_format := 'none');

Handling NULL Values

Use nullstr to specify values that should be treated as NULL:

-- Treat "N/A" and "-" as NULL
SELECT * FROM read_xml('data.xml', nullstr := ['N/A', '-']);

Processing Large Files

Files exceeding maximum_file_size (16MB by default) are automatically streamed using a SAX-based parser that processes XML in chunks — peak memory stays proportional to a single record rather than the entire file:

-- Large files are streamed automatically (default behavior)
SELECT count(*) FROM read_xml('huge_file.xml');

-- Force DOM mode (errors if file is too large)
SELECT * FROM read_xml('file.xml', streaming := false);

-- Adjust the file size limit for DOM parsing
SELECT * FROM read_xml('file.xml', maximum_file_size := 268435456);  -- 256MB

Extracting HTML Tables

-- Extract tables as rows
SELECT table_index, row_index, columns
FROM html_extract_tables('<table><tr><th>Name</th></tr><tr><td>John</td></tr></table>');

Next Steps