File Reading & String Parsing Functions ======================================= These functions read XML and HTML from files or parse them from strings directly into DuckDB tables. read_xml -------- Read XML files with automatic schema inference. **Syntax:** .. code-block:: sql read_xml(pattern [, options...]) **Parameters:** .. list-table:: :header-rows: 1 :widths: 20 15 65 * - Parameter - Type - Description * - ``pattern`` - VARCHAR - File path or glob pattern (e.g., ``'data.xml'``, ``'*.xml'``) * - ``filename`` - BOOLEAN - Include filename column in output (default: false) * - ``ignore_errors`` - BOOLEAN - Skip files that fail to parse (default: false) * - ``maximum_file_size`` - BIGINT - Maximum file size in bytes for DOM parsing (default: 16MB). Files above this use SAX streaming. * - ``record_element`` - VARCHAR - XPath or tag name for elements that become rows * - ``root_element`` - VARCHAR - Specify root element for schema inference * - ``force_list`` - VARCHAR[] - Column names that should always be LIST type * - ``auto_detect`` - BOOLEAN - Enable automatic schema detection (default: true) * - ``max_depth`` - INTEGER - Maximum nesting depth to parse (default: 10) * - ``all_varchar`` - BOOLEAN - Force all scalar types to VARCHAR (default: false) * - ``union_by_name`` - BOOLEAN - Combine columns by name for multiple files (default: false) * - ``attr_mode`` - VARCHAR - Attribute handling: 'prefix', 'merge', 'ignore' (default: 'prefix') * - ``attr_prefix`` - VARCHAR - Prefix for attribute columns (default: '@') * - ``text_key`` - VARCHAR - Key for text content in mixed elements (default: '#text') * - ``empty_elements`` - VARCHAR - Empty element handling: 'object', 'null', 'string' (default: 'object') * - ``namespaces`` - VARCHAR - Namespace handling: 'strip', 'expand', 'keep' (default: 'strip') * - ``columns`` - STRUCT - Explicit column schema (e.g., ``{name: 'VARCHAR', price: 'DOUBLE'}``) * - ``datetime_format`` - VARCHAR or VARCHAR[] - Controls date/time detection. Accepts ``'auto'`` (default), ``'none'``, preset names (``'us'``, ``'eu'``, ``'iso'``), custom strftime strings, or a list of formats. * - ``nullstr`` - VARCHAR or VARCHAR[] - String value(s) to interpret as NULL (e.g., ``'N/A'`` or ``['N/A', '-']``) * - ``streaming`` - BOOLEAN - Enable SAX streaming for files exceeding ``maximum_file_size`` (default: true). SAX mode only supports simple tag names for ``record_element``. Not available for HTML. **Examples:** .. code-block:: sql -- Basic usage SELECT * FROM read_xml('catalog.xml'); -- With glob pattern SELECT * FROM read_xml('data/*.xml', filename=true); -- Extract specific records SELECT * FROM read_xml('feed.xml', record_element := 'item'); -- Force array type for optional elements SELECT * FROM read_xml('products.xml', record_element := 'product', force_list := ['tag', 'category'] ); -- Combine files with different schemas SELECT * FROM read_xml('configs/*.xml', union_by_name := true); read_xml_objects ---------------- Read XML files as raw document objects for custom processing. **Syntax:** .. code-block:: sql read_xml_objects(pattern [, filename=false, ignore_errors=false]) **Returns:** A table with column ``xml`` of type ``XML``. If ``filename=true``, also includes ``filename`` column. **Examples:** .. code-block:: sql -- Get raw XML documents SELECT xml FROM read_xml_objects('data.xml'); -- Process with XPath SELECT filename, xml_extract_text(xml, '//title') as title FROM read_xml_objects('books/*.xml', filename=true); read_html --------- Read HTML files with automatic schema inference. **Syntax:** .. code-block:: sql read_html(pattern [, options...]) **Parameters:** Same parameters as ``read_xml``, plus HTML-specific handling. **Examples:** .. code-block:: sql -- Extract articles from HTML pages SELECT * FROM read_html('pages/*.html', record_element := 'article'); read_html_objects ----------------- Read HTML files as raw document objects. **Syntax:** .. code-block:: sql read_html_objects(pattern [, filename=false, ignore_errors=false]) **Returns:** A table with column ``html`` of type ``HTML``. If ``filename=true``, also includes ``filename`` column. **Examples:** .. code-block:: sql -- Get raw HTML documents SELECT html FROM read_html_objects('page.html'); -- Extract links from multiple pages SELECT filename, html_extract_links(html) as links FROM read_html_objects('pages/*.html', filename=true); String Parsing Functions ------------------------ These functions parse XML and HTML content directly from strings, complementing the file-based ``read_*`` functions. parse_xml ~~~~~~~~~ Parse XML string with automatic schema inference. **Syntax:** .. code-block:: sql parse_xml(content [, options...]) **Parameters:** .. list-table:: :header-rows: 1 :widths: 20 15 65 * - Parameter - Type - Description * - ``content`` - VARCHAR - XML content to parse * - ``ignore_errors`` - BOOLEAN - Return empty result instead of error on invalid input (default: false) * - ``record_element`` - VARCHAR - XPath or tag name for elements that become rows * - ``root_element`` - VARCHAR - Specify root element for schema inference * - ``force_list`` - VARCHAR[] - Column names that should always be LIST type * - ``auto_detect`` - BOOLEAN - Enable automatic schema detection (default: true) * - ``max_depth`` - INTEGER - Maximum nesting depth to parse (default: 10) * - ``all_varchar`` - BOOLEAN - Force all scalar types to VARCHAR (default: false) * - ``columns`` - STRUCT - Explicit column schema (e.g., ``{item: 'INTEGER'}``) **Examples:** .. code-block:: sql -- Parse XML string with schema inference SELECT * FROM parse_xml('DuckDB29.99'); -- Parse multiple records SELECT name, value FROM parse_xml( 'A1B2' ); -- With explicit schema SELECT * FROM parse_xml('42', columns := {item: 'INTEGER'}); -- Ignore parse errors SELECT * FROM parse_xml(xml_column, ignore_errors := true) FROM raw_data; parse_xml_objects ~~~~~~~~~~~~~~~~~ Parse XML string and return as raw XML type. **Syntax:** .. code-block:: sql parse_xml_objects(content [, ignore_errors=false]) **Returns:** A table with column ``xml`` of type ``XML``. **Examples:** .. code-block:: sql -- Parse and validate XML string SELECT * FROM parse_xml_objects('test'); -- Process with XPath SELECT xml_extract_text(xml, '//item') as items FROM parse_xml_objects('AB'); parse_html ~~~~~~~~~~ Parse HTML string with automatic schema inference. **Syntax:** .. code-block:: sql parse_html(content [, options...]) **Parameters:** Same parameters as ``parse_xml``. HTML parsing is more lenient than XML parsing. **Examples:** .. code-block:: sql -- Parse HTML content SELECT * FROM parse_html('
Content
'); -- Extract from HTML with record element SELECT * FROM parse_html('', record_element := 'li'); parse_html_objects ~~~~~~~~~~~~~~~~~~ Parse HTML string and return as raw HTML type. **Syntax:** .. code-block:: sql parse_html_objects(content [, ignore_errors=false]) **Returns:** A table with column ``html`` of type ``HTML``. **Examples:** .. code-block:: sql -- Parse HTML string SELECT * FROM parse_html_objects('

Hello World

'); -- Process with extraction functions SELECT html_extract_text(html) as text FROM parse_html_objects('

Paragraph with bold text.

');