Quick Start
===========
This guide will get you started with the webbed extension in just a few minutes.

Loading the Extension
---------------------

.. code-block:: sql

   LOAD webbed;

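
``LOAD`` only succeeds once the extension has been installed. A minimal sketch of a first-time setup, assuming ``webbed`` is distributed through the DuckDB community extensions repository:

.. code-block:: sql

   -- Assumption: webbed is published in the community repository
   INSTALL webbed FROM community;
   LOAD webbed;
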
Reading XML Files
-----------------
The simplest way to work with XML is using ``read_xml``:

.. code-block:: sql

   -- Read an XML file directly into a table
   SELECT * FROM read_xml('data.xml');

   -- Read multiple files with a glob pattern
   SELECT * FROM read_xml('config/*.xml');

   -- Read with schema inference options
   SELECT * FROM read_xml('data.xml', record_element := 'item');

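
Because ``read_xml`` is used like any other table function, its result can be materialized and inspected with standard DuckDB statements. A small sketch; the table name ``items`` is hypothetical:

.. code-block:: sql

   -- Store the parsed records in a regular table (hypothetical name: items)
   CREATE TABLE items AS
   SELECT * FROM read_xml('data.xml', record_element := 'item');

   -- Show the column names and types that schema inference produced
   DESCRIBE items;
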
Reading HTML Files
------------------
HTML files are read the same way with ``read_html``:

.. code-block:: sql

   -- Read HTML files
   SELECT * FROM read_html('page.html');

   -- Extract specific elements
   SELECT * FROM read_html('page.html', record_element := 'article');

Extracting Data with XPath
--------------------------
Use XPath expressions to extract specific content:

.. code-block:: sql

   -- Extract text from XML
   SELECT xml_extract_text('<book><title>DuckDB Guide</title></book>', '//title');
   -- Result: "DuckDB Guide"

   -- Extract from HTML
   SELECT html_extract_text('<html><body><h1>Welcome</h1></body></html>', '//h1');
   -- Result: "Welcome"

   -- Extract attributes
   SELECT xml_extract_attributes('<item id="123" type="book"/>', '/item');
   -- Result: [{id: "123", type: "book"}]

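
The same functions can be applied to a column instead of a literal. The sketch below builds a small inline table (the name ``docs`` and its sample rows are made up for illustration) and extracts one title per row:

.. code-block:: sql

   -- Hypothetical inline data; each row holds one XML document as text
   WITH docs(xml_text) AS (
       VALUES
           ('<book><title>First</title></book>'),
           ('<book><title>Second</title></book>')
   )
   SELECT xml_extract_text(xml_text, '//title') AS title
   FROM docs;
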
Working with Document Objects
-----------------------------
For more control, use the ``_objects`` variants:

.. code-block:: sql

   -- Get raw document objects
   SELECT xml, filename
   FROM read_xml_objects('data/*.xml', filename=true);

   -- Process each document
   SELECT
       filename,
       xml_extract_text(xml, '//title') AS title,
       xml_stats(xml::VARCHAR) AS stats
   FROM read_xml_objects('books/*.xml', filename=true);

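
When the same documents are queried repeatedly, it may be convenient to extract the fields once and keep them in a table. A sketch under that assumption; ``book_titles`` is a hypothetical table name:

.. code-block:: sql

   -- Extract one title per file and persist the result
   CREATE TABLE book_titles AS
   SELECT
       filename,
       xml_extract_text(xml, '//title') AS title
   FROM read_xml_objects('books/*.xml', filename=true);

   -- Later queries work on the table without re-parsing the files
   SELECT title, filename
   FROM book_titles
   ORDER BY title;
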
Converting Between Formats
--------------------------
Convert XML to JSON and vice versa:

.. code-block:: sql

   -- XML to JSON
   SELECT xml_to_json('<person><name>John</name><age>30</age></person>');
   -- Result: {"person":{"name":{"#text":"John"},"age":{"#text":"30"}}}

   -- JSON to XML
   SELECT json_to_xml('{"name":"John","age":30}');
   -- Result: <name>John</name><age>30</age> inside a wrapping root element

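
Once a document is in JSON form, DuckDB's own JSON functions can pick values out of it. The sketch below assumes the output shape shown in the result comment above and that DuckDB's ``json`` extension is available; the ``#text`` key has to be quoted inside the path:

.. code-block:: sql

   -- Assumption: xml_to_json produces {"person":{"name":{"#text":...}}}
   SELECT json_extract_string(
       xml_to_json('<person><name>John</name><age>30</age></person>'),
       '$.person.name."#text"'
   ) AS person_name;
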
Extracting Links and Images from HTML
-------------------------------------
.. code-block:: sql

   -- Extract all links
   SELECT (unnest(html_extract_links(html))).href AS url,
          (unnest(html_extract_links(html))).text AS link_text
   FROM read_html_objects('page.html');

   -- Extract all images
   SELECT (unnest(html_extract_images(html))).src AS image_url,
          (unnest(html_extract_images(html))).alt AS alt_text
   FROM read_html_objects('page.html');

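
An alternative formulation unnests the list once in a subquery and then reads the struct fields by name, which also makes filtering straightforward. This is a sketch that assumes the ``href`` and ``text`` fields used above; the ``http%`` filter is only an illustration:

.. code-block:: sql

   -- Unnest once, then access fields of the resulting struct
   SELECT link.href AS url,
          link.text AS link_text
   FROM (
       SELECT unnest(html_extract_links(html)) AS link
       FROM read_html_objects('page.html')
   )
   WHERE link.href LIKE 'http%';  -- keep absolute links only
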
Parsing XML/HTML Strings
------------------------
Parse XML or HTML content directly from strings:

.. code-block:: sql

   -- Parse an XML string with schema inference
   SELECT * FROM parse_xml('<items><item><name>Widget</name><price>9.99</price></item></items>');

   -- Parse HTML content
   SELECT * FROM parse_html('', record_element := 'p');

Controlling Date/Time Parsing
-----------------------------
Use ``datetime_format`` to control how dates and timestamps are detected:

.. code-block:: sql

   -- Parse European dates (DD/MM/YYYY)
   SELECT * FROM read_xml('data.xml', datetime_format := 'eu');

   -- Parse US dates (MM/DD/YYYY)
   SELECT * FROM read_xml('data.xml', datetime_format := 'us');

   -- Use a custom format string
   SELECT * FROM read_xml('data.xml', datetime_format := '%Y/%m/%d');

   -- Disable date detection entirely
   SELECT * FROM read_xml('data.xml', datetime_format := 'none');

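
To check which columns a given setting actually turns into dates or timestamps, the query can be wrapped in DuckDB's ``DESCRIBE``. A small sketch that compares two of the settings above on the same file:

.. code-block:: sql

   -- Column types with European date detection
   DESCRIBE SELECT * FROM read_xml('data.xml', datetime_format := 'eu');

   -- Column types with date detection disabled
   DESCRIBE SELECT * FROM read_xml('data.xml', datetime_format := 'none');
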
Handling NULL Values
--------------------
Use ``nullstr`` to specify values that should be treated as NULL:

.. code-block:: sql

   -- Treat "N/A" and "-" as NULL
   SELECT * FROM read_xml('data.xml', nullstr := ['N/A', '-']);

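
One way to confirm the effect is to count how many values ended up as NULL. The sketch below assumes the file yields a column named ``price``, which is hypothetical and should be replaced with a real column:

.. code-block:: sql

   -- count(price) skips NULLs, so the difference is the number of NULL prices
   SELECT count(*)                AS total_rows,
          count(*) - count(price) AS null_prices
   FROM read_xml('data.xml', nullstr := ['N/A', '-']);
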
Processing Large Files
----------------------
Files exceeding ``maximum_file_size`` (16 MB by default) are streamed automatically
by a SAX-based parser that processes the XML in chunks, so peak memory stays proportional
to a single record rather than to the entire file:

.. code-block:: sql

   -- Large files are streamed automatically (default behavior)
   SELECT count(*) FROM read_xml('huge_file.xml');

   -- Force DOM mode (errors if file is too large)
   SELECT * FROM read_xml('file.xml', streaming := false);

   -- Adjust the file size limit for DOM parsing
   SELECT * FROM read_xml('file.xml', maximum_file_size := 268435456); -- 256MB

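
The ``maximum_file_size`` value is given in bytes. A quick check of the arithmetic, assuming the sizes are binary megabytes (which is consistent with the 268435456 literal used above for 256MB):

.. code-block:: sql

   -- Byte values for the default and the raised limit
   SELECT 16 * 1024 * 1024  AS default_limit_bytes,  -- 16777216
          256 * 1024 * 1024 AS raised_limit_bytes;   -- 268435456
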
Extracting HTML Tables
----------------------
.. code-block:: sql

   -- Extract tables as rows
   SELECT table_index, row_index, columns
   FROM html_extract_tables('');

Next Steps
----------
- See :doc:`functions/index` for a complete function reference
- Learn about :doc:`parameters` for customization options
- Explore :doc:`xpath_guide` for advanced XPath queries
- Understand :doc:`schema_inference` for automatic type detection