Quick Start
This guide will get you started with the webbed extension in just a few minutes.
Loading the Extension
LOAD webbed;
Reading XML Files
The simplest way to work with XML is using read_xml:
-- Read an XML file directly into a table
SELECT * FROM read_xml('data.xml');
-- Read multiple files with a glob pattern
SELECT * FROM read_xml('config/*.xml');
-- Read with schema inference options
SELECT * FROM read_xml('data.xml', record_element := 'item');
Reading HTML Files
Similarly for HTML:
-- Read HTML files
SELECT * FROM read_html('page.html');
-- Extract specific elements
SELECT * FROM read_html('page.html', record_element := 'article');
Extracting Data with XPath
Use XPath expressions to extract specific content:
-- Extract text from XML
SELECT xml_extract_text('<book><title>DuckDB Guide</title></book>', '//title');
-- Result: "DuckDB Guide"
-- Extract from HTML
SELECT html_extract_text('<html><body><h1>Welcome</h1></body></html>', '//h1');
-- Result: "Welcome"
-- Extract attributes
SELECT xml_extract_attributes('<item id="123" type="book"/>', '/item');
-- Result: [{id: "123", type: "book"}]
Working with Document Objects
For more control, use the _objects variants:
-- Get raw document objects
SELECT xml, filename
FROM read_xml_objects('data/*.xml', filename=true);
-- Process each document
SELECT
filename,
xml_extract_text(xml, '//title') as title,
xml_stats(xml::VARCHAR) as stats
FROM read_xml_objects('books/*.xml', filename=true);
Converting Between Formats
Convert XML to JSON and vice versa:
-- XML to JSON
SELECT xml_to_json('<person><name>John</name><age>30</age></person>');
-- Result: {"person":{"name":{"#text":"John"},"age":{"#text":"30"}}}
-- JSON to XML
SELECT json_to_xml('{"name":"John","age":30}');
-- Result: <root><name>John</name><age>30</age></root>
Extracting Links and Images from HTML
-- Extract all links
SELECT (unnest(html_extract_links(html))).href as url,
(unnest(html_extract_links(html))).text as link_text
FROM read_html_objects('page.html');
-- Extract all images
SELECT (unnest(html_extract_images(html))).src as image_url,
(unnest(html_extract_images(html))).alt as alt_text
FROM read_html_objects('page.html');
Parsing XML/HTML Strings
Parse XML or HTML content directly from strings:
-- Parse an XML string with schema inference
SELECT * FROM parse_xml('<data><item><name>Widget</name><price>9.99</price></item></data>');
-- Parse HTML content
SELECT * FROM parse_html('<div><p>Hello</p><p>World</p></div>', record_element := 'p');
Controlling Date/Time Parsing
Use datetime_format to control how dates and timestamps are detected:
-- Parse European dates (DD/MM/YYYY)
SELECT * FROM read_xml('data.xml', datetime_format := 'eu');
-- Parse US dates (MM/DD/YYYY)
SELECT * FROM read_xml('data.xml', datetime_format := 'us');
-- Use a custom format string
SELECT * FROM read_xml('data.xml', datetime_format := '%Y/%m/%d');
-- Disable date detection entirely
SELECT * FROM read_xml('data.xml', datetime_format := 'none');
Handling NULL Values
Use nullstr to specify values that should be treated as NULL:
-- Treat "N/A" and "-" as NULL
SELECT * FROM read_xml('data.xml', nullstr := ['N/A', '-']);
Processing Large Files
Files exceeding maximum_file_size (16MB by default) are automatically streamed
using a SAX-based parser that processes XML in chunks — peak memory stays proportional
to a single record rather than the entire file:
-- Large files are streamed automatically (default behavior)
SELECT count(*) FROM read_xml('huge_file.xml');
-- Force DOM mode (errors if file is too large)
SELECT * FROM read_xml('file.xml', streaming := false);
-- Adjust the file size limit for DOM parsing
SELECT * FROM read_xml('file.xml', maximum_file_size := 268435456); -- 256MB
Extracting HTML Tables
-- Extract tables as rows
SELECT table_index, row_index, columns
FROM html_extract_tables('<table><tr><th>Name</th></tr><tr><td>John</td></tr></table>');
Next Steps
See Function Reference for a complete function reference
Learn about Parameters Reference for customization options
Explore XPath Guide for advanced XPath queries
Understand Schema Inference for automatic type detection