Conversion Functions
These functions convert between XML, JSON, and other formats.
xml_to_json
Convert XML to JSON with configurable options.
Syntax:
xml_to_json(xml [, options...])
Parameters:
Parameter |
Type |
Description |
|---|---|---|
|
VARCHAR/XML |
The XML content to convert |
|
VARCHAR[] |
Element names to always convert to JSON arrays |
|
VARCHAR |
Prefix for attributes (default: |
|
VARCHAR |
Key for text content (default: |
|
VARCHAR |
Namespace handling: |
|
VARCHAR |
Key for namespace declarations (default: empty/disabled) |
|
VARCHAR |
Empty element handling: |
Returns: VARCHAR (JSON string)
Examples:
-- Basic conversion
SELECT xml_to_json('<person><name>John</name><age>30</age></person>');
-- Result: {"person":{"name":{"#text":"John"},"age":{"#text":"30"}}}
-- Force specific elements to be arrays
SELECT xml_to_json(
'<catalog><book><title>Book 1</title></book></catalog>',
force_list := ['book']
);
-- Custom attribute prefix and text key
SELECT xml_to_json(
'<item id="123">Product Name</item>',
attr_prefix := '_',
text_key := 'value'
);
-- Result: {"item":{"_id":"123","value":"Product Name"}}
-- Handle namespaces
SELECT xml_to_json(
'<root xmlns:ns="http://example.com"><ns:item>Test</ns:item></root>',
namespaces := 'keep'
);
-- Handle empty elements as null
SELECT xml_to_json('<root><item/></root>', empty_elements := 'null');
-- Result: {"root":{"item":null}}
json_to_xml
Convert JSON to XML.
Syntax:
json_to_xml(json)
Parameters:
json(VARCHAR): JSON string to convert
Returns: VARCHAR (XML string)
Example:
SELECT json_to_xml('{"name":"John","age":30}');
-- Result: <root><name>John</name><age>30</age></root>
to_xml
Convert any value to XML.
Syntax:
to_xml(value)
to_xml(value, node_name)
Parameters:
value: Any value to convertnode_name(VARCHAR, optional): Custom node name for the root element
Returns: VARCHAR (XML string)
Examples:
-- Convert string
SELECT to_xml('Hello World');
-- Result: <value>Hello World</value>
-- With custom node name
SELECT to_xml('John Doe', 'author');
-- Result: <author>John Doe</author>
-- Convert number
SELECT to_xml(42, 'count');
-- Result: <count>42</count>
xml (alias)
Alias for to_xml.
Syntax:
xml(value)
Example:
SELECT xml('Hello');
-- Result: <value>Hello</value>
Document Block Functions
These functions convert HTML documents to and from the duck_block structured format, enabling document analysis, transformation, and format conversion pipelines.
Note
The duck_block type is compatible with the duck_block_utils extension, which provides additional functions for working with document blocks including:
duck_blocks_to_markdown()- Convert blocks to Markdownmarkdown_to_duck_blocks()- Parse Markdown into blocksDocument block filtering and transformation utilities
When both extensions are loaded, you can build powerful document conversion pipelines (e.g., HTML to Markdown, Markdown to HTML).
html_to_duck_blocks
Convert HTML content into a list of structured document blocks. This function parses HTML and extracts block-level elements (headings, paragraphs, code blocks, lists, tables, etc.) into a structured format suitable for document processing and analysis.
Syntax:
html_to_duck_blocks(html)
Parameters:
html(HTML/VARCHAR): The HTML content to parse
Returns: LIST(duck_block) - A list of document blocks
The duck_block Type:
Each block is a struct with the following fields:
Field |
Type |
Description |
|---|---|---|
|
VARCHAR |
Type of block: |
|
VARCHAR |
Text content of the block (or JSON for complex blocks) |
|
INTEGER |
Heading level (1-6) or blockquote nesting depth |
|
VARCHAR |
Content encoding: |
|
MAP(VARCHAR, VARCHAR) |
Additional attributes (id, class, language, src, alt, etc.) |
|
INTEGER |
Zero-based position of the block in the document |
Block Type Details:
Block Type |
Description |
|---|---|
|
H1-H6 elements. |
|
P elements. Content is plain text. |
|
PRE/CODE elements. |
|
UL/OL elements. |
|
BLOCKQUOTE elements. |
|
TABLE elements. |
|
HR elements. Content is empty. |
|
IMG elements. |
|
FIGURE elements with optional caption. |
Examples:
-- Extract blocks from HTML
SELECT html_to_duck_blocks('<h1>Title</h1><p>Some text</p>');
-- Returns list with 2 blocks: heading and paragraph
-- Get all headings from a document
SELECT block.content, block.level
FROM (SELECT unnest(html_to_duck_blocks(html)) as block FROM documents)
WHERE block.block_type = 'heading';
-- Count blocks by type
SELECT block.block_type, COUNT(*)
FROM (SELECT unnest(html_to_duck_blocks(html)) as block FROM documents)
GROUP BY block.block_type;
-- Extract code blocks with their language
SELECT block.content, block.attributes['language'] as language
FROM (SELECT unnest(html_to_duck_blocks(
'<pre><code class="language-python">print("hello")</code></pre>'
)) as block)
WHERE block.block_type = 'code';
duck_blocks_to_html
Convert a list of document blocks back to HTML. This is the inverse of html_to_duck_blocks.
Syntax:
duck_blocks_to_html(blocks)
Parameters:
blocks(LIST(duck_block)): A list of document blocks
Returns: HTML - The reconstructed HTML content
Examples:
-- Round-trip conversion
SELECT duck_blocks_to_html(html_to_duck_blocks('<h1>Title</h1><p>Text</p>'));
-- Result: <h1>Title</h1><p>Text</p>
-- Filter and reconstruct (keep only headings and paragraphs)
SELECT duck_blocks_to_html(
list_filter(
html_to_duck_blocks(html),
block -> block.block_type IN ('heading', 'paragraph')
)
) FROM documents;
-- Reorder blocks
SELECT duck_blocks_to_html(
list_sort(
html_to_duck_blocks(html),
block -> block.block_order DESC
)
) FROM documents;
Using with duck_block_utils for Markdown Conversion
When combined with the duck_block_utils extension, you can convert between HTML and Markdown formats.
Setup:
-- Load both extensions
INSTALL webbed FROM community;
LOAD webbed;
INSTALL duck_block_utils FROM community;
LOAD duck_block_utils;
HTML to Markdown:
-- Convert HTML to Markdown
SELECT duck_blocks_to_markdown(html_to_duck_blocks(
'<h1>My Document</h1><p>This is a paragraph.</p><ul><li>Item 1</li><li>Item 2</li></ul>'
));
-- Result:
-- # My Document
--
-- This is a paragraph.
--
-- - Item 1
-- - Item 2
-- Convert a batch of HTML documents to Markdown
SELECT
filename,
duck_blocks_to_markdown(html_to_duck_blocks(content)) as markdown
FROM read_html_objects('docs/*.html');
Markdown to HTML:
-- Convert Markdown to HTML
SELECT duck_blocks_to_html(markdown_to_duck_blocks(
'# Hello World
This is a paragraph with **bold** text.
- First item
- Second item'
));
-- Result: <h1>Hello World</h1><p>This is a paragraph with <strong>bold</strong> text.</p><ul><li>First item</li><li>Second item</li></ul>
Document Processing Pipeline:
-- Extract and convert only headings and paragraphs from HTML to Markdown
SELECT duck_blocks_to_markdown(
list_filter(
html_to_duck_blocks(html_content),
b -> b.block_type IN ('heading', 'paragraph')
)
) as simplified_markdown
FROM web_pages;
-- Build a table of contents from HTML documents
SELECT
url,
block.content as heading,
block.level
FROM web_pages,
LATERAL unnest(html_to_duck_blocks(html_content)) as block
WHERE block.block_type = 'heading'
ORDER BY url, block.block_order;
-- Convert code blocks from one language syntax highlighting to another format
SELECT duck_blocks_to_html(
list_transform(
html_to_duck_blocks(html),
b -> CASE
WHEN b.block_type = 'code'
THEN {'block_type': 'code', 'content': b.content, 'level': b.level,
'encoding': b.encoding, 'block_order': b.block_order,
'attributes': map_from_entries([('language', 'python')])}
ELSE b
END
)
) FROM documents;
Python xmltodict Compatibility
For Python-style xmltodict behavior, create a macro:
CREATE MACRO xmltodict(xml,
attr_prefix := '@',
text_key := '#',
process_namespaces := false,
empty_elements := 'object',
force_list := []) AS
xml_to_json(xml,
attr_prefix := attr_prefix,
text_key := text_key,
empty_elements := empty_elements,
force_list := force_list,
namespaces := IF(process_namespaces, 'expand', 'strip')
);
-- Usage matches Python's xmltodict.parse()
SELECT xmltodict('<root><item>Test</item></root>');