Schema Inference
The webbed extension automatically infers DuckDB table schemas from XML and HTML documents using a 3-phase deterministic approach.
Overview
When you read an XML file without specifying a schema, the extension:
Identifies Records - Determines which XML elements represent table rows
Identifies Columns - Analyzes child elements and attributes to create columns
Infers Types - Determines appropriate DuckDB types for each column
Phase 1: Identify Records
By default, immediate children of the root element become rows:
<catalog>
<product>...</product> <!-- Row 1 -->
<product>...</product> <!-- Row 2 -->
<product>...</product> <!-- Row 3 -->
</catalog>
You can customize this with the record_element parameter:
-- Extract nested items as rows
SELECT * FROM read_xml('feed.xml', record_element := 'item');
-- Use XPath syntax
SELECT * FROM read_xml('data.xml', record_element := '//entry');
Warning
record_element matches at ALL depths. If your document has <item> elements at multiple levels, all will become rows. Filter by column presence if needed:
SELECT * FROM read_xml('data.xml', record_element := 'item')
WHERE specific_column IS NOT NULL;
Phase 2: Identify Columns
For each record element, the extension creates columns from:
Attributes on the record element
Child elements of the record
<product id="123" category="electronics">
<name>Widget</name>
<price>29.99</price>
<tags>
<tag>new</tag>
<tag>sale</tag>
</tags>
</product>
Results in columns:
@id(from attribute)@category(from attribute)name(from child element)price(from child element)tags(nested STRUCT with LIST)
Repeated Elements
When an element name repeats within a record, it becomes a LIST:
<order>
<item>Product A</item>
<item>Product B</item>
<item>Product C</item>
</order>
The item column will be LIST<VARCHAR>.
Use force_list to ensure array types even for single occurrences:
SELECT * FROM read_xml('orders.xml', force_list := ['item']);
Phase 3: Infer Types
The extension automatically detects these types:
DuckDB Type |
Detection Pattern |
Examples |
|---|---|---|
BOOLEAN |
true/false, 1/0 |
|
INTEGER/BIGINT |
Whole numbers |
|
DOUBLE |
Decimal numbers |
|
DATE |
Date formats (ISO, US, EU) |
|
TIMESTAMP |
Date with time |
|
TIMESTAMPTZ |
Date with time and timezone |
|
TIME |
Time of day |
|
VARCHAR |
Everything else |
|
STRUCT |
Nested elements |
|
LIST |
Repeated elements |
Multiple |
Temporal type detection uses DuckDB’s StrpTimeFormat with a candidate elimination approach.
A list of format candidates is tested against all sample values; candidates that fail on any
sample are eliminated. The first surviving candidate determines the column type. By default,
auto-detection prioritizes ISO formats, then US, then EU for ambiguous dates.
Use the datetime_format parameter to control this behavior — see Parameters Reference for details.
Forcing VARCHAR Types
Use all_varchar to prevent type inference issues:
-- All scalar values become VARCHAR
SELECT * FROM read_xml('data.xml', all_varchar := true);
This preserves nested structure (STRUCT, LIST) but converts leaf values:
STRUCT(a INT, b FLOAT)→STRUCT(a VARCHAR, b VARCHAR)LIST<INTEGER>→LIST<VARCHAR>
Controlling Depth
Limit parsing depth with max_depth:
-- Only parse 3 levels deep
SELECT * FROM read_xml('deep.xml', max_depth := 3);
-- Unlimited (capped at 10 for safety)
SELECT * FROM read_xml('deep.xml', max_depth := -1);
SAX Streaming and Schema Inference
When files exceed maximum_file_size (16MB by default), the extension uses SAX-based
streaming instead of building a full DOM tree. This affects schema inference in two ways:
Sample-based inference — SAX mode reads the first
sample_sizerecords (default: 50) to infer the schema, then streams the rest for extraction. The schema is not revised after the sample window, so columns or types that only appear in later records may not be detected.Simple ``record_element`` only — SAX mode matches record elements by simple tag name (e.g.,
'item'). XPath expressions like'//ns:item[@type="active"]'or path-based patterns like'/root/data/item'require DOM parsing. When the file is too large for DOM and therecord_elementcontains XPath syntax, the extension raises an error.
-- Works in SAX mode (simple tag name)
SELECT * FROM read_xml('huge.xml', record_element := 'item');
-- Falls back to DOM (XPath expression)
SELECT * FROM read_xml('data.xml', record_element := '//item[@status="active"]');
Set streaming := false to force DOM mode for any file (will error if the file exceeds
maximum_file_size).
Common Patterns
RSS Feeds
-- Default: Returns channel metadata
SELECT * FROM read_xml('feed.rss');
-- Extract individual items
SELECT * FROM read_xml('feed.rss', record_element := 'item');
AWS API Responses
-- Extract EC2 volumes
SELECT * FROM read_xml('volumes.xml', record_element := 'item');
Configuration Files
-- Preserve structure
SELECT * FROM read_xml('config.xml', unnest_as := 'struct');
-- Flatten to columns
SELECT * FROM read_xml('config.xml', unnest_as := 'columns');