Conversion Functions

These functions convert between XML, JSON, and other formats.

xml_to_json

Convert XML to JSON with configurable options.

Syntax:

xml_to_json(xml [, options...])

Parameters:

Parameter

Type

Description

xml

VARCHAR/XML

The XML content to convert

force_list

VARCHAR[]

Element names to always convert to JSON arrays

attr_prefix

VARCHAR

Prefix for attributes (default: '@')

text_key

VARCHAR

Key for text content (default: '#text')

namespaces

VARCHAR

Namespace handling: 'strip', 'expand', 'keep'

xmlns_key

VARCHAR

Key for namespace declarations (default: empty/disabled)

empty_elements

VARCHAR

Empty element handling: 'object', 'null', 'string'

Returns: VARCHAR (JSON string)

Examples:

-- Basic conversion
SELECT xml_to_json('<person><name>John</name><age>30</age></person>');
-- Result: {"person":{"name":{"#text":"John"},"age":{"#text":"30"}}}

-- Force specific elements to be arrays
SELECT xml_to_json(
    '<catalog><book><title>Book 1</title></book></catalog>',
    force_list := ['book']
);

-- Custom attribute prefix and text key
SELECT xml_to_json(
    '<item id="123">Product Name</item>',
    attr_prefix := '_',
    text_key := 'value'
);
-- Result: {"item":{"_id":"123","value":"Product Name"}}

-- Handle namespaces
SELECT xml_to_json(
    '<root xmlns:ns="http://example.com"><ns:item>Test</ns:item></root>',
    namespaces := 'keep'
);

-- Handle empty elements as null
SELECT xml_to_json('<root><item/></root>', empty_elements := 'null');
-- Result: {"root":{"item":null}}

json_to_xml

Convert JSON to XML.

Syntax:

json_to_xml(json)

Parameters:

  • json (VARCHAR): JSON string to convert

Returns: VARCHAR (XML string)

Example:

SELECT json_to_xml('{"name":"John","age":30}');
-- Result: <root><name>John</name><age>30</age></root>

to_xml

Convert any value to XML.

Syntax:

to_xml(value)
to_xml(value, node_name)

Parameters:

  • value: Any value to convert

  • node_name (VARCHAR, optional): Custom node name for the root element

Returns: VARCHAR (XML string)

Examples:

-- Convert string
SELECT to_xml('Hello World');
-- Result: <value>Hello World</value>

-- With custom node name
SELECT to_xml('John Doe', 'author');
-- Result: <author>John Doe</author>

-- Convert number
SELECT to_xml(42, 'count');
-- Result: <count>42</count>

xml (alias)

Alias for to_xml.

Syntax:

xml(value)

Example:

SELECT xml('Hello');
-- Result: <value>Hello</value>

Document Block Functions

These functions convert HTML documents to and from the duck_block structured format, enabling document analysis, transformation, and format conversion pipelines.

Note

The duck_block type is compatible with the duck_block_utils extension, which provides additional functions for working with document blocks including:

  • duck_blocks_to_markdown() - Convert blocks to Markdown

  • markdown_to_duck_blocks() - Parse Markdown into blocks

  • Document block filtering and transformation utilities

When both extensions are loaded, you can build powerful document conversion pipelines (e.g., HTML to Markdown, Markdown to HTML).

html_to_duck_blocks

Convert HTML content into a list of structured document blocks. This function parses HTML and extracts block-level elements (headings, paragraphs, code blocks, lists, tables, etc.) into a structured format suitable for document processing and analysis.

Syntax:

html_to_duck_blocks(html)

Parameters:

  • html (HTML/VARCHAR): The HTML content to parse

Returns: LIST(duck_block) - A list of document blocks

The duck_block Type:

Each block is a struct with the following fields:

Field

Type

Description

block_type

VARCHAR

Type of block: 'heading', 'paragraph', 'code', 'list', 'blockquote', 'table', 'hr', 'image', 'figure'

content

VARCHAR

Text content of the block (or JSON for complex blocks)

level

INTEGER

Heading level (1-6) or blockquote nesting depth

encoding

VARCHAR

Content encoding: 'text' for plain text, 'json' for structured content

attributes

MAP(VARCHAR, VARCHAR)

Additional attributes (id, class, language, src, alt, etc.)

block_order

INTEGER

Zero-based position of the block in the document

Block Type Details:

Block Type

Description

heading

H1-H6 elements. level indicates heading level (1-6). attributes may contain id.

paragraph

P elements. Content is plain text.

code

PRE/CODE elements. attributes['language'] contains the programming language if specified.

list

UL/OL elements. encoding is 'json'. attributes['ordered'] is 'true' or 'false'.

blockquote

BLOCKQUOTE elements. level indicates nesting depth.

table

TABLE elements. encoding is 'json' with rows/cells structure.

hr

HR elements. Content is empty.

image

IMG elements. attributes contains src and alt.

figure

FIGURE elements with optional caption.

Examples:

-- Extract blocks from HTML
SELECT html_to_duck_blocks('<h1>Title</h1><p>Some text</p>');
-- Returns list with 2 blocks: heading and paragraph

-- Get all headings from a document
SELECT block.content, block.level
FROM (SELECT unnest(html_to_duck_blocks(html)) as block FROM documents)
WHERE block.block_type = 'heading';

-- Count blocks by type
SELECT block.block_type, COUNT(*)
FROM (SELECT unnest(html_to_duck_blocks(html)) as block FROM documents)
GROUP BY block.block_type;

-- Extract code blocks with their language
SELECT block.content, block.attributes['language'] as language
FROM (SELECT unnest(html_to_duck_blocks(
    '<pre><code class="language-python">print("hello")</code></pre>'
)) as block)
WHERE block.block_type = 'code';

duck_blocks_to_html

Convert a list of document blocks back to HTML. This is the inverse of html_to_duck_blocks.

Syntax:

duck_blocks_to_html(blocks)

Parameters:

  • blocks (LIST(duck_block)): A list of document blocks

Returns: HTML - The reconstructed HTML content

Examples:

-- Round-trip conversion
SELECT duck_blocks_to_html(html_to_duck_blocks('<h1>Title</h1><p>Text</p>'));
-- Result: <h1>Title</h1><p>Text</p>

-- Filter and reconstruct (keep only headings and paragraphs)
SELECT duck_blocks_to_html(
    list_filter(
        html_to_duck_blocks(html),
        block -> block.block_type IN ('heading', 'paragraph')
    )
) FROM documents;

-- Reorder blocks
SELECT duck_blocks_to_html(
    list_sort(
        html_to_duck_blocks(html),
        block -> block.block_order DESC
    )
) FROM documents;

Using with duck_block_utils for Markdown Conversion

When combined with the duck_block_utils extension, you can convert between HTML and Markdown formats.

Setup:

-- Load both extensions
INSTALL webbed FROM community;
LOAD webbed;

INSTALL duck_block_utils FROM community;
LOAD duck_block_utils;

HTML to Markdown:

-- Convert HTML to Markdown
SELECT duck_blocks_to_markdown(html_to_duck_blocks(
    '<h1>My Document</h1><p>This is a paragraph.</p><ul><li>Item 1</li><li>Item 2</li></ul>'
));
-- Result:
-- # My Document
--
-- This is a paragraph.
--
-- - Item 1
-- - Item 2

-- Convert a batch of HTML documents to Markdown
SELECT
    filename,
    duck_blocks_to_markdown(html_to_duck_blocks(content)) as markdown
FROM read_html_objects('docs/*.html');

Markdown to HTML:

-- Convert Markdown to HTML
SELECT duck_blocks_to_html(markdown_to_duck_blocks(
    '# Hello World

This is a paragraph with **bold** text.

- First item
- Second item'
));
-- Result: <h1>Hello World</h1><p>This is a paragraph with <strong>bold</strong> text.</p><ul><li>First item</li><li>Second item</li></ul>

Document Processing Pipeline:

-- Extract and convert only headings and paragraphs from HTML to Markdown
SELECT duck_blocks_to_markdown(
    list_filter(
        html_to_duck_blocks(html_content),
        b -> b.block_type IN ('heading', 'paragraph')
    )
) as simplified_markdown
FROM web_pages;

-- Build a table of contents from HTML documents
SELECT
    url,
    block.content as heading,
    block.level
FROM web_pages,
     LATERAL unnest(html_to_duck_blocks(html_content)) as block
WHERE block.block_type = 'heading'
ORDER BY url, block.block_order;

-- Convert code blocks from one language syntax highlighting to another format
SELECT duck_blocks_to_html(
    list_transform(
        html_to_duck_blocks(html),
        b -> CASE
            WHEN b.block_type = 'code'
            THEN {'block_type': 'code', 'content': b.content, 'level': b.level,
                  'encoding': b.encoding, 'block_order': b.block_order,
                  'attributes': map_from_entries([('language', 'python')])}
            ELSE b
        END
    )
) FROM documents;

Python xmltodict Compatibility

For Python-style xmltodict behavior, create a macro:

CREATE MACRO xmltodict(xml,
                       attr_prefix := '@',
                       text_key := '#',
                       process_namespaces := false,
                       empty_elements := 'object',
                       force_list := []) AS
  xml_to_json(xml,
    attr_prefix := attr_prefix,
    text_key := text_key,
    empty_elements := empty_elements,
    force_list := force_list,
    namespaces := IF(process_namespaces, 'expand', 'strip')
  );

-- Usage matches Python's xmltodict.parse()
SELECT xmltodict('<root><item>Test</item></root>');