File Reading & String Parsing Functions

These functions read XML and HTML from files or parse them from strings directly into DuckDB tables.

read_xml

Read XML files with automatic schema inference.

Syntax:

read_xml(pattern [, options...])

Parameters:

Parameter	Type	Description
`pattern`	VARCHAR	File path or glob pattern (e.g., `'data.xml'`, `'*.xml'`)
`filename`	BOOLEAN	Include filename column in output (default: false)
`ignore_errors`	BOOLEAN	Skip files that fail to parse (default: false)
`maximum_file_size`	BIGINT	Maximum file size in bytes for DOM parsing (default: 16MB). Files above this use SAX streaming.
`record_element`	VARCHAR	XPath or tag name for elements that become rows
`root_element`	VARCHAR	Specify root element for schema inference
`force_list`	VARCHAR[]	Column names that should always be LIST type
`auto_detect`	BOOLEAN	Enable automatic schema detection (default: true)
`max_depth`	INTEGER	Maximum nesting depth to parse (default: 10)
`all_varchar`	BOOLEAN	Force all scalar types to VARCHAR (default: false)
`union_by_name`	BOOLEAN	Combine columns by name for multiple files (default: false)
`attr_mode`	VARCHAR	Attribute handling: ‘prefix’, ‘merge’, ‘ignore’ (default: ‘prefix’)
`attr_prefix`	VARCHAR	Prefix for attribute columns (default: ‘@’)
`text_key`	VARCHAR	Key for text content in mixed elements (default: ‘#text’)
`empty_elements`	VARCHAR	Empty element handling: ‘object’, ‘null’, ‘string’ (default: ‘object’)
`namespaces`	VARCHAR	Namespace handling: ‘strip’, ‘expand’, ‘keep’ (default: ‘strip’)
`columns`	STRUCT	Explicit column schema (e.g., `{name: 'VARCHAR', price: 'DOUBLE'}`)
`datetime_format`	VARCHAR or VARCHAR[]	Controls date/time detection. Accepts `'auto'` (default), `'none'`, preset names (`'us'`, `'eu'`, `'iso'`), custom strftime strings, or a list of formats.
`nullstr`	VARCHAR or VARCHAR[]	String value(s) to interpret as NULL (e.g., `'N/A'` or `['N/A', '-']`)
`streaming`	BOOLEAN	Enable SAX streaming for files exceeding `maximum_file_size` (default: true). SAX mode only supports simple tag names for `record_element`. Not available for HTML.

Examples:

-- Basic usage
SELECT * FROM read_xml('catalog.xml');

-- With glob pattern
SELECT * FROM read_xml('data/*.xml', filename=true);

-- Extract specific records
SELECT * FROM read_xml('feed.xml', record_element := 'item');

-- Force array type for optional elements
SELECT * FROM read_xml('products.xml',
    record_element := 'product',
    force_list := ['tag', 'category']
);

-- Combine files with different schemas
SELECT * FROM read_xml('configs/*.xml', union_by_name := true);

read_xml_objects

Read XML files as raw document objects for custom processing.

Syntax:

read_xml_objects(pattern [, filename=false, ignore_errors=false])

Returns:

A table with column xml of type XML. If filename=true, also includes filename column.

Examples:

-- Get raw XML documents
SELECT xml FROM read_xml_objects('data.xml');

-- Process with XPath
SELECT
    filename,
    xml_extract_text(xml, '//title') as title
FROM read_xml_objects('books/*.xml', filename=true);

read_html

Read HTML files with automatic schema inference.

Syntax:

read_html(pattern [, options...])

Parameters:

Same parameters as read_xml, plus HTML-specific handling.

Examples:

-- Extract articles from HTML pages
SELECT * FROM read_html('pages/*.html', record_element := 'article');

read_html_objects

Read HTML files as raw document objects.

Syntax:

read_html_objects(pattern [, filename=false, ignore_errors=false])

Returns:

A table with column html of type HTML. If filename=true, also includes filename column.

Examples:

-- Get raw HTML documents
SELECT html FROM read_html_objects('page.html');

-- Extract links from multiple pages
SELECT
    filename,
    html_extract_links(html) as links
FROM read_html_objects('pages/*.html', filename=true);

String Parsing Functions

These functions parse XML and HTML content directly from strings, complementing the file-based read_* functions.

parse_xml

Parse XML string with automatic schema inference.

Syntax:

parse_xml(content [, options...])

Parameters:

Parameter	Type	Description
`content`	VARCHAR	XML content to parse
`ignore_errors`	BOOLEAN	Return empty result instead of error on invalid input (default: false)
`record_element`	VARCHAR	XPath or tag name for elements that become rows
`root_element`	VARCHAR	Specify root element for schema inference
`force_list`	VARCHAR[]	Column names that should always be LIST type
`auto_detect`	BOOLEAN	Enable automatic schema detection (default: true)
`max_depth`	INTEGER	Maximum nesting depth to parse (default: 10)
`all_varchar`	BOOLEAN	Force all scalar types to VARCHAR (default: false)
`columns`	STRUCT	Explicit column schema (e.g., `{item: 'INTEGER'}`)

Examples:

-- Parse XML string with schema inference
SELECT * FROM parse_xml('<catalog><book><title>DuckDB</title><price>29.99</price></book></catalog>');

-- Parse multiple records
SELECT name, value FROM parse_xml(
    '<items><item><name>A</name><value>1</value></item><item><name>B</name><value>2</value></item></items>'
);

-- With explicit schema
SELECT * FROM parse_xml('<root><item>42</item></root>', columns := {item: 'INTEGER'});

-- Ignore parse errors
SELECT * FROM parse_xml(xml_column, ignore_errors := true) FROM raw_data;

parse_xml_objects

Parse XML string and return as raw XML type.

Syntax:

parse_xml_objects(content [, ignore_errors=false])

Returns:

A table with column xml of type XML.

Examples:

-- Parse and validate XML string
SELECT * FROM parse_xml_objects('<root><item>test</item></root>');

-- Process with XPath
SELECT xml_extract_text(xml, '//item') as items
FROM parse_xml_objects('<root><item>A</item><item>B</item></root>');

parse_html

Parse HTML string with automatic schema inference.

Syntax:

parse_html(content [, options...])

Parameters:

Same parameters as parse_xml. HTML parsing is more lenient than XML parsing.

Examples:

-- Parse HTML content
SELECT * FROM parse_html('<html><body><div>Content</div></body></html>');

-- Extract from HTML with record element
SELECT * FROM parse_html('<ul><li>Item 1</li><li>Item 2</li></ul>', record_element := 'li');

parse_html_objects

Parse HTML string and return as raw HTML type.

Syntax:

parse_html_objects(content [, ignore_errors=false])

Returns:

A table with column html of type HTML.

Examples:

-- Parse HTML string
SELECT * FROM parse_html_objects('<div><p>Hello World</p></div>');

-- Process with extraction functions
SELECT html_extract_text(html) as text
FROM parse_html_objects('<p>Paragraph <strong>with bold</strong> text.</p>');