Schema Inference

The webbed extension automatically infers DuckDB table schemas from XML and HTML documents using a 3-phase deterministic approach.

Overview

When you read an XML file without specifying a schema, the extension:

Identifies Records - Determines which XML elements represent table rows
Identifies Columns - Analyzes child elements and attributes to create columns
Infers Types - Determines appropriate DuckDB types for each column

Phase 1: Identify Records

By default, immediate children of the root element become rows:

<catalog>
  <product>...</product>  <!-- Row 1 -->
  <product>...</product>  <!-- Row 2 -->
  <product>...</product>  <!-- Row 3 -->
</catalog>

You can customize this with the record_element parameter:

-- Extract nested items as rows
SELECT * FROM read_xml('feed.xml', record_element := 'item');

-- Use XPath syntax
SELECT * FROM read_xml('data.xml', record_element := '//entry');

Warning

record_element matches at ALL depths. If your document has <item> elements at multiple levels, all will become rows. Filter by column presence if needed:

SELECT * FROM read_xml('data.xml', record_element := 'item')
WHERE specific_column IS NOT NULL;

Phase 2: Identify Columns

For each record element, the extension creates columns from:

Attributes on the record element
Child elements of the record

<product id="123" category="electronics">
  <name>Widget</name>
  <price>29.99</price>
  <tags>
    <tag>new</tag>
    <tag>sale</tag>
  </tags>
</product>

Results in columns:

@id (from attribute)
@category (from attribute)
name (from child element)
price (from child element)
tags (nested STRUCT with LIST)

Repeated Elements

When an element name repeats within a record, it becomes a LIST:

<order>
  <item>Product A</item>
  <item>Product B</item>
  <item>Product C</item>
</order>

The item column will be LIST<VARCHAR>.

Use force_list to ensure array types even for single occurrences:

SELECT * FROM read_xml('orders.xml', force_list := ['item']);

Phase 3: Infer Types

The extension automatically detects these types:

DuckDB Type	Detection Pattern	Examples
BOOLEAN	true/false, 1/0	`true`, `false`, `1`, `0`
INTEGER/BIGINT	Whole numbers	`42`, `-100`, `999999`
DOUBLE	Decimal numbers	`3.14`, `-0.5`, `1.0e10`
DATE	Date formats (ISO, US, EU)	`2024-01-15`, `03/15/2024`
TIMESTAMP	Date with time	`2024-01-15T10:30:00`
TIMESTAMPTZ	Date with time and timezone	`2024-01-15T10:30:00+05:00`
TIME	Time of day	`10:30:00`, `02:30:00 PM`
VARCHAR	Everything else	`hello`, `mixed123`
STRUCT	Nested elements	`<address><city>NYC</city></address>`
LIST	Repeated elements	Multiple `<tag>` elements

Temporal type detection uses DuckDB’s StrpTimeFormat with a candidate elimination approach. A list of format candidates is tested against all sample values; candidates that fail on any sample are eliminated. The first surviving candidate determines the column type. By default, auto-detection prioritizes ISO formats, then US, then EU for ambiguous dates.

Use the datetime_format parameter to control this behavior — see Parameters Reference for details.

Forcing VARCHAR Types

Use all_varchar to prevent type inference issues:

-- All scalar values become VARCHAR
SELECT * FROM read_xml('data.xml', all_varchar := true);

This preserves nested structure (STRUCT, LIST) but converts leaf values:

STRUCT(a INT, b FLOAT) → STRUCT(a VARCHAR, b VARCHAR)
LIST<INTEGER> → LIST<VARCHAR>

Controlling Depth

Limit parsing depth with max_depth:

-- Only parse 3 levels deep
SELECT * FROM read_xml('deep.xml', max_depth := 3);

-- Unlimited (capped at 10 for safety)
SELECT * FROM read_xml('deep.xml', max_depth := -1);

SAX Streaming and Schema Inference

When files exceed maximum_file_size (16MB by default), the extension uses SAX-based streaming instead of building a full DOM tree. This affects schema inference in two ways:

Sample-based inference — SAX mode reads the first sample_size records (default: 10240), buffers them to build a synthetic document that the normal inference logic types, then streams the rest for extraction. The schema is not revised after the sample window, so columns or types that only appear beyond sample_size records may not be detected. Because the sampled records are held in memory, a very large sample_size raises memory use in SAX mode roughly in proportion to the prefix size — the 10240 default is a finite cap chosen to balance detection accuracy against that cost.

The DOM path (files under maximum_file_size) instead applies sample_size as a cap on the number of sampled values per field. There, sample_size := -1 samples every value for always-correct detection at the cost of scanning the whole input; in SAX mode a non-positive sample_size falls back to a finite 50-record prefix rather than reading the entire file.
Simple ``record_element`` only — SAX mode matches record elements by simple tag name (e.g., 'item'). XPath expressions like '//ns:item[@type="active"]' or path-based patterns like '/root/data/item' require DOM parsing. When the file is too large for DOM and the record_element contains XPath syntax, the extension raises an error.

-- Works in SAX mode (simple tag name)
SELECT * FROM read_xml('huge.xml', record_element := 'item');

-- Falls back to DOM (XPath expression)
SELECT * FROM read_xml('data.xml', record_element := '//item[@status="active"]');

Set streaming := false to force DOM mode for any file (will error if the file exceeds maximum_file_size).

Common Patterns

RSS Feeds

-- Default: Returns channel metadata
SELECT * FROM read_xml('feed.rss');

-- Extract individual items
SELECT * FROM read_xml('feed.rss', record_element := 'item');

AWS API Responses

-- Extract EC2 volumes
SELECT * FROM read_xml('volumes.xml', record_element := 'item');

Configuration Files

-- Preserve structure
SELECT * FROM read_xml('config.xml', unnest_as := 'struct');

-- Flatten to columns
SELECT * FROM read_xml('config.xml', unnest_as := 'columns');