File Reading & String Parsing Functions

These functions read XML and HTML from files or parse them from strings directly into DuckDB tables.

read_xml

Read XML files with automatic schema inference.

Syntax:

read_xml(pattern [, options...])

Parameters:

Parameter

Type

Description

pattern

VARCHAR

File path or glob pattern (e.g., 'data.xml', '*.xml')

filename

BOOLEAN

Include filename column in output (default: false)

ignore_errors

BOOLEAN

Skip files that fail to parse (default: false)

maximum_file_size

BIGINT

Maximum file size in bytes for DOM parsing (default: 16MB). Files above this use SAX streaming.

record_element

VARCHAR

XPath or tag name for elements that become rows

root_element

VARCHAR

Specify root element for schema inference

force_list

VARCHAR[]

Column names that should always be LIST type

auto_detect

BOOLEAN

Enable automatic schema detection (default: true)

max_depth

INTEGER

Maximum nesting depth to parse (default: 10)

all_varchar

BOOLEAN

Force all scalar types to VARCHAR (default: false)

union_by_name

BOOLEAN

Combine columns by name for multiple files (default: false)

attr_mode

VARCHAR

Attribute handling: ‘prefix’, ‘merge’, ‘ignore’ (default: ‘prefix’)

attr_prefix

VARCHAR

Prefix for attribute columns (default: ‘@’)

text_key

VARCHAR

Key for text content in mixed elements (default: ‘#text’)

empty_elements

VARCHAR

Empty element handling: ‘object’, ‘null’, ‘string’ (default: ‘object’)

namespaces

VARCHAR

Namespace handling: ‘strip’, ‘expand’, ‘keep’ (default: ‘strip’)

columns

STRUCT

Explicit column schema (e.g., {name: 'VARCHAR', price: 'DOUBLE'})

datetime_format

VARCHAR or VARCHAR[]

Controls date/time detection. Accepts 'auto' (default), 'none', preset names ('us', 'eu', 'iso'), custom strftime strings, or a list of formats.

nullstr

VARCHAR or VARCHAR[]

String value(s) to interpret as NULL (e.g., 'N/A' or ['N/A', '-'])

streaming

BOOLEAN

Enable SAX streaming for files exceeding maximum_file_size (default: true). SAX mode only supports simple tag names for record_element. Not available for HTML.

Examples:

-- Basic usage
SELECT * FROM read_xml('catalog.xml');

-- With glob pattern
SELECT * FROM read_xml('data/*.xml', filename=true);

-- Extract specific records
SELECT * FROM read_xml('feed.xml', record_element := 'item');

-- Force array type for optional elements
SELECT * FROM read_xml('products.xml',
    record_element := 'product',
    force_list := ['tag', 'category']
);

-- Combine files with different schemas
SELECT * FROM read_xml('configs/*.xml', union_by_name := true);

read_xml_objects

Read XML files as raw document objects for custom processing.

Syntax:

read_xml_objects(pattern [, filename=false, ignore_errors=false])

Returns:

A table with column xml of type XML. If filename=true, also includes filename column.

Examples:

-- Get raw XML documents
SELECT xml FROM read_xml_objects('data.xml');

-- Process with XPath
SELECT
    filename,
    xml_extract_text(xml, '//title') as title
FROM read_xml_objects('books/*.xml', filename=true);

read_html

Read HTML files with automatic schema inference.

Syntax:

read_html(pattern [, options...])

Parameters:

Same parameters as read_xml, plus HTML-specific handling.

Examples:

-- Extract articles from HTML pages
SELECT * FROM read_html('pages/*.html', record_element := 'article');

read_html_objects

Read HTML files as raw document objects.

Syntax:

read_html_objects(pattern [, filename=false, ignore_errors=false])

Returns:

A table with column html of type HTML. If filename=true, also includes filename column.

Examples:

-- Get raw HTML documents
SELECT html FROM read_html_objects('page.html');

-- Extract links from multiple pages
SELECT
    filename,
    html_extract_links(html) as links
FROM read_html_objects('pages/*.html', filename=true);

String Parsing Functions

These functions parse XML and HTML content directly from strings, complementing the file-based read_* functions.

parse_xml

Parse XML string with automatic schema inference.

Syntax:

parse_xml(content [, options...])

Parameters:

Parameter

Type

Description

content

VARCHAR

XML content to parse

ignore_errors

BOOLEAN

Return empty result instead of error on invalid input (default: false)

record_element

VARCHAR

XPath or tag name for elements that become rows

root_element

VARCHAR

Specify root element for schema inference

force_list

VARCHAR[]

Column names that should always be LIST type

auto_detect

BOOLEAN

Enable automatic schema detection (default: true)

max_depth

INTEGER

Maximum nesting depth to parse (default: 10)

all_varchar

BOOLEAN

Force all scalar types to VARCHAR (default: false)

columns

STRUCT

Explicit column schema (e.g., {item: 'INTEGER'})

Examples:

-- Parse XML string with schema inference
SELECT * FROM parse_xml('<catalog><book><title>DuckDB</title><price>29.99</price></book></catalog>');

-- Parse multiple records
SELECT name, value FROM parse_xml(
    '<items><item><name>A</name><value>1</value></item><item><name>B</name><value>2</value></item></items>'
);

-- With explicit schema
SELECT * FROM parse_xml('<root><item>42</item></root>', columns := {item: 'INTEGER'});

-- Ignore parse errors
SELECT * FROM parse_xml(xml_column, ignore_errors := true) FROM raw_data;

parse_xml_objects

Parse XML string and return as raw XML type.

Syntax:

parse_xml_objects(content [, ignore_errors=false])

Returns:

A table with column xml of type XML.

Examples:

-- Parse and validate XML string
SELECT * FROM parse_xml_objects('<root><item>test</item></root>');

-- Process with XPath
SELECT xml_extract_text(xml, '//item') as items
FROM parse_xml_objects('<root><item>A</item><item>B</item></root>');

parse_html

Parse HTML string with automatic schema inference.

Syntax:

parse_html(content [, options...])

Parameters:

Same parameters as parse_xml. HTML parsing is more lenient than XML parsing.

Examples:

-- Parse HTML content
SELECT * FROM parse_html('<html><body><div>Content</div></body></html>');

-- Extract from HTML with record element
SELECT * FROM parse_html('<ul><li>Item 1</li><li>Item 2</li></ul>', record_element := 'li');

parse_html_objects

Parse HTML string and return as raw HTML type.

Syntax:

parse_html_objects(content [, ignore_errors=false])

Returns:

A table with column html of type HTML.

Examples:

-- Parse HTML string
SELECT * FROM parse_html_objects('<div><p>Hello World</p></div>');

-- Process with extraction functions
SELECT html_extract_text(html) as text
FROM parse_html_objects('<p>Paragraph <strong>with bold</strong> text.</p>');