File Reading & String Parsing Functions
These functions read XML and HTML from files or parse them from strings directly into DuckDB tables.
read_xml
Read XML files with automatic schema inference.
Syntax:
read_xml(pattern [, options...])
Parameters:
Parameter |
Type |
Description |
|---|---|---|
|
VARCHAR |
File path or glob pattern (e.g., |
|
BOOLEAN |
Include filename column in output (default: false) |
|
BOOLEAN |
Skip files that fail to parse (default: false) |
|
BIGINT |
Maximum file size in bytes for DOM parsing (default: 16MB). Files above this use SAX streaming. |
|
VARCHAR |
XPath or tag name for elements that become rows |
|
VARCHAR |
Specify root element for schema inference |
|
VARCHAR[] |
Column names that should always be LIST type |
|
BOOLEAN |
Enable automatic schema detection (default: true) |
|
INTEGER |
Maximum nesting depth to parse (default: 10) |
|
BOOLEAN |
Force all scalar types to VARCHAR (default: false) |
|
BOOLEAN |
Combine columns by name for multiple files (default: false) |
|
VARCHAR |
Attribute handling: ‘prefix’, ‘merge’, ‘ignore’ (default: ‘prefix’) |
|
VARCHAR |
Prefix for attribute columns (default: ‘@’) |
|
VARCHAR |
Key for text content in mixed elements (default: ‘#text’) |
|
VARCHAR |
Empty element handling: ‘object’, ‘null’, ‘string’ (default: ‘object’) |
|
VARCHAR |
Namespace handling: ‘strip’, ‘expand’, ‘keep’ (default: ‘strip’) |
|
STRUCT |
Explicit column schema (e.g., |
|
VARCHAR or VARCHAR[] |
Controls date/time detection. Accepts |
|
VARCHAR or VARCHAR[] |
String value(s) to interpret as NULL (e.g., |
|
BOOLEAN |
Enable SAX streaming for files exceeding |
Examples:
-- Basic usage
SELECT * FROM read_xml('catalog.xml');
-- With glob pattern
SELECT * FROM read_xml('data/*.xml', filename=true);
-- Extract specific records
SELECT * FROM read_xml('feed.xml', record_element := 'item');
-- Force array type for optional elements
SELECT * FROM read_xml('products.xml',
record_element := 'product',
force_list := ['tag', 'category']
);
-- Combine files with different schemas
SELECT * FROM read_xml('configs/*.xml', union_by_name := true);
read_xml_objects
Read XML files as raw document objects for custom processing.
Syntax:
read_xml_objects(pattern [, filename=false, ignore_errors=false])
Returns:
A table with column xml of type XML. If filename=true, also includes filename column.
Examples:
-- Get raw XML documents
SELECT xml FROM read_xml_objects('data.xml');
-- Process with XPath
SELECT
filename,
xml_extract_text(xml, '//title') as title
FROM read_xml_objects('books/*.xml', filename=true);
read_html
Read HTML files with automatic schema inference.
Syntax:
read_html(pattern [, options...])
Parameters:
Same parameters as read_xml, plus HTML-specific handling.
Examples:
-- Extract articles from HTML pages
SELECT * FROM read_html('pages/*.html', record_element := 'article');
read_html_objects
Read HTML files as raw document objects.
Syntax:
read_html_objects(pattern [, filename=false, ignore_errors=false])
Returns:
A table with column html of type HTML. If filename=true, also includes filename column.
Examples:
-- Get raw HTML documents
SELECT html FROM read_html_objects('page.html');
-- Extract links from multiple pages
SELECT
filename,
html_extract_links(html) as links
FROM read_html_objects('pages/*.html', filename=true);
String Parsing Functions
These functions parse XML and HTML content directly from strings, complementing the file-based read_* functions.
parse_xml
Parse XML string with automatic schema inference.
Syntax:
parse_xml(content [, options...])
Parameters:
Parameter |
Type |
Description |
|---|---|---|
|
VARCHAR |
XML content to parse |
|
BOOLEAN |
Return empty result instead of error on invalid input (default: false) |
|
VARCHAR |
XPath or tag name for elements that become rows |
|
VARCHAR |
Specify root element for schema inference |
|
VARCHAR[] |
Column names that should always be LIST type |
|
BOOLEAN |
Enable automatic schema detection (default: true) |
|
INTEGER |
Maximum nesting depth to parse (default: 10) |
|
BOOLEAN |
Force all scalar types to VARCHAR (default: false) |
|
STRUCT |
Explicit column schema (e.g., |
Examples:
-- Parse XML string with schema inference
SELECT * FROM parse_xml('<catalog><book><title>DuckDB</title><price>29.99</price></book></catalog>');
-- Parse multiple records
SELECT name, value FROM parse_xml(
'<items><item><name>A</name><value>1</value></item><item><name>B</name><value>2</value></item></items>'
);
-- With explicit schema
SELECT * FROM parse_xml('<root><item>42</item></root>', columns := {item: 'INTEGER'});
-- Ignore parse errors
SELECT * FROM parse_xml(xml_column, ignore_errors := true) FROM raw_data;
parse_xml_objects
Parse XML string and return as raw XML type.
Syntax:
parse_xml_objects(content [, ignore_errors=false])
Returns:
A table with column xml of type XML.
Examples:
-- Parse and validate XML string
SELECT * FROM parse_xml_objects('<root><item>test</item></root>');
-- Process with XPath
SELECT xml_extract_text(xml, '//item') as items
FROM parse_xml_objects('<root><item>A</item><item>B</item></root>');
parse_html
Parse HTML string with automatic schema inference.
Syntax:
parse_html(content [, options...])
Parameters:
Same parameters as parse_xml. HTML parsing is more lenient than XML parsing.
Examples:
-- Parse HTML content
SELECT * FROM parse_html('<html><body><div>Content</div></body></html>');
-- Extract from HTML with record element
SELECT * FROM parse_html('<ul><li>Item 1</li><li>Item 2</li></ul>', record_element := 'li');
parse_html_objects
Parse HTML string and return as raw HTML type.
Syntax:
parse_html_objects(content [, ignore_errors=false])
Returns:
A table with column html of type HTML.
Examples:
-- Parse HTML string
SELECT * FROM parse_html_objects('<div><p>Hello World</p></div>');
-- Process with extraction functions
SELECT html_extract_text(html) as text
FROM parse_html_objects('<p>Paragraph <strong>with bold</strong> text.</p>');