Schema Inference
================
The webbed extension automatically infers DuckDB table schemas from XML and HTML documents using a 3-phase deterministic approach.
Overview
--------
When you read an XML file without specifying a schema, the extension:
1. **Identifies Records** - Determines which XML elements represent table rows
2. **Identifies Columns** - Analyzes child elements and attributes to create columns
3. **Infers Types** - Determines appropriate DuckDB types for each column
Phase 1: Identify Records
-------------------------
By default, immediate children of the root element become rows:
.. code-block:: xml
...
...
...
You can customize this with the ``record_element`` parameter:
.. code-block:: sql
-- Extract nested items as rows
SELECT * FROM read_xml('feed.xml', record_element := 'item');
-- Use XPath syntax
SELECT * FROM read_xml('data.xml', record_element := '//entry');
.. warning::
``record_element`` matches at ALL depths. If your document has ``- `` elements at multiple levels, all will become rows. Filter by column presence if needed:
.. code-block:: sql
SELECT * FROM read_xml('data.xml', record_element := 'item')
WHERE specific_column IS NOT NULL;
Phase 2: Identify Columns
-------------------------
For each record element, the extension creates columns from:
1. **Attributes** on the record element
2. **Child elements** of the record
.. code-block:: xml
Widget
29.99
new
sale
Results in columns:
- ``@id`` (from attribute)
- ``@category`` (from attribute)
- ``name`` (from child element)
- ``price`` (from child element)
- ``tags`` (nested STRUCT with LIST)
Repeated Elements
~~~~~~~~~~~~~~~~~
When an element name repeats within a record, it becomes a LIST:
.. code-block:: xml
- Product A
- Product B
- Product C
The ``item`` column will be ``LIST``.
Use ``force_list`` to ensure array types even for single occurrences:
.. code-block:: sql
SELECT * FROM read_xml('orders.xml', force_list := ['item']);
Phase 3: Infer Types
--------------------
The extension automatically detects these types:
.. list-table::
:header-rows: 1
:widths: 25 35 40
* - DuckDB Type
- Detection Pattern
- Examples
* - BOOLEAN
- true/false, 1/0
- ``true``, ``false``, ``1``, ``0``
* - INTEGER/BIGINT
- Whole numbers
- ``42``, ``-100``, ``999999``
* - DOUBLE
- Decimal numbers
- ``3.14``, ``-0.5``, ``1.0e10``
* - DATE
- Date formats (ISO, US, EU)
- ``2024-01-15``, ``03/15/2024``
* - TIMESTAMP
- Date with time
- ``2024-01-15T10:30:00``
* - TIMESTAMPTZ
- Date with time and timezone
- ``2024-01-15T10:30:00+05:00``
* - TIME
- Time of day
- ``10:30:00``, ``02:30:00 PM``
* - VARCHAR
- Everything else
- ``hello``, ``mixed123``
* - STRUCT
- Nested elements
- ``NYC``
* - LIST
- Repeated elements
- Multiple ```` elements
Temporal type detection uses DuckDB's ``StrpTimeFormat`` with a candidate elimination approach.
A list of format candidates is tested against all sample values; candidates that fail on any
sample are eliminated. The first surviving candidate determines the column type. By default,
auto-detection prioritizes ISO formats, then US, then EU for ambiguous dates.
Use the ``datetime_format`` parameter to control this behavior — see :doc:`parameters` for details.
Forcing VARCHAR Types
~~~~~~~~~~~~~~~~~~~~~
Use ``all_varchar`` to prevent type inference issues:
.. code-block:: sql
-- All scalar values become VARCHAR
SELECT * FROM read_xml('data.xml', all_varchar := true);
This preserves nested structure (STRUCT, LIST) but converts leaf values:
- ``STRUCT(a INT, b FLOAT)`` → ``STRUCT(a VARCHAR, b VARCHAR)``
- ``LIST`` → ``LIST``
Controlling Depth
-----------------
Limit parsing depth with ``max_depth``:
.. code-block:: sql
-- Only parse 3 levels deep
SELECT * FROM read_xml('deep.xml', max_depth := 3);
-- Unlimited (capped at 10 for safety)
SELECT * FROM read_xml('deep.xml', max_depth := -1);
SAX Streaming and Schema Inference
-----------------------------------
When files exceed ``maximum_file_size`` (16MB by default), the extension uses SAX-based
streaming instead of building a full DOM tree. This affects schema inference in two ways:
1. **Sample-based inference** — SAX mode reads the first ``sample_size`` records (default: 50)
to infer the schema, then streams the rest for extraction. The schema is not revised after
the sample window, so columns or types that only appear in later records may not be detected.
2. **Simple ``record_element`` only** — SAX mode matches record elements by simple tag name
(e.g., ``'item'``). XPath expressions like ``'//ns:item[@type="active"]'`` or path-based
patterns like ``'/root/data/item'`` require DOM parsing. When the file is too large for DOM
and the ``record_element`` contains XPath syntax, the extension raises an error.
.. code-block:: sql
-- Works in SAX mode (simple tag name)
SELECT * FROM read_xml('huge.xml', record_element := 'item');
-- Falls back to DOM (XPath expression)
SELECT * FROM read_xml('data.xml', record_element := '//item[@status="active"]');
Set ``streaming := false`` to force DOM mode for any file (will error if the file exceeds
``maximum_file_size``).
Common Patterns
---------------
RSS Feeds
~~~~~~~~~
.. code-block:: sql
-- Default: Returns channel metadata
SELECT * FROM read_xml('feed.rss');
-- Extract individual items
SELECT * FROM read_xml('feed.rss', record_element := 'item');
AWS API Responses
~~~~~~~~~~~~~~~~~
.. code-block:: sql
-- Extract EC2 volumes
SELECT * FROM read_xml('volumes.xml', record_element := 'item');
Configuration Files
~~~~~~~~~~~~~~~~~~~
.. code-block:: sql
-- Preserve structure
SELECT * FROM read_xml('config.xml', unnest_as := 'struct');
-- Flatten to columns
SELECT * FROM read_xml('config.xml', unnest_as := 'columns');