Quick Start
===========

This guide will get you started with the webbed extension in just a few minutes.

Loading the Extension
---------------------

.. code-block:: sql

    LOAD webbed;

Reading XML Files
-----------------

The simplest way to work with XML is to use ``read_xml``:

.. code-block:: sql

    -- Read an XML file directly into a table
    SELECT * FROM read_xml('data.xml');

    -- Read multiple files with a glob pattern
    SELECT * FROM read_xml('config/*.xml');

    -- Read with schema inference options
    SELECT * FROM read_xml('data.xml', record_element := 'item');

Reading HTML Files
------------------

``read_html`` works the same way for HTML:

.. code-block:: sql

    -- Read HTML files
    SELECT * FROM read_html('page.html');

    -- Extract specific elements
    SELECT * FROM read_html('page.html', record_element := 'article');

Extracting Data with XPath
--------------------------

Use XPath expressions to extract specific content:

.. code-block:: sql

    -- Extract text from XML
    SELECT xml_extract_text('<title>DuckDB Guide</title>', '//title');
    -- Result: "DuckDB Guide"

    -- Extract from HTML
    SELECT html_extract_text('<h1>Welcome</h1>', '//h1');
    -- Result: "Welcome"

    -- Extract attributes
    SELECT xml_extract_attributes('<item id="123" type="book"/>', '/item');
    -- Result: [{id: "123", type: "book"}]

Working with Document Objects
-----------------------------

For more control, use the ``_objects`` variants:

.. code-block:: sql

    -- Get raw document objects
    SELECT xml, filename FROM read_xml_objects('data/*.xml', filename=true);

    -- Process each document
    SELECT
        filename,
        xml_extract_text(xml, '//title') AS title,
        xml_stats(xml::VARCHAR) AS stats
    FROM read_xml_objects('books/*.xml', filename=true);

Converting Between Formats
--------------------------

Convert XML to JSON and vice versa:

.. code-block:: sql

    -- XML to JSON
    SELECT xml_to_json('<person><name>John</name><age>30</age></person>');
    -- Result: {"person":{"name":{"#text":"John"},"age":{"#text":"30"}}}

    -- JSON to XML
    SELECT json_to_xml('{"name":"John","age":30}');
    -- Result: XML containing <name>John</name> and <age>30</age>

Extracting Links and Images from HTML
-------------------------------------

.. code-block:: sql

    -- Extract all links
    SELECT
        (unnest(html_extract_links(html))).href AS url,
        (unnest(html_extract_links(html))).text AS link_text
    FROM read_html_objects('page.html');

    -- Extract all images
    SELECT
        (unnest(html_extract_images(html))).src AS image_url,
        (unnest(html_extract_images(html))).alt AS alt_text
    FROM read_html_objects('page.html');

Parsing XML/HTML Strings
------------------------

Parse XML or HTML content directly from strings:

.. code-block:: sql

    -- Parse an XML string with schema inference
    -- (element names here are illustrative)
    SELECT * FROM parse_xml('<items><item><name>Widget</name><price>9.99</price></item></items>');

    -- Parse HTML content
    SELECT * FROM parse_html('<p>Hello</p><p>World</p>', record_element := 'p');

Controlling Date/Time Parsing
-----------------------------

Use ``datetime_format`` to control how dates and timestamps are detected:

.. code-block:: sql

    -- Parse European dates (DD/MM/YYYY)
    SELECT * FROM read_xml('data.xml', datetime_format := 'eu');

    -- Parse US dates (MM/DD/YYYY)
    SELECT * FROM read_xml('data.xml', datetime_format := 'us');

    -- Use a custom format string
    SELECT * FROM read_xml('data.xml', datetime_format := '%Y/%m/%d');

    -- Disable date detection entirely
    SELECT * FROM read_xml('data.xml', datetime_format := 'none');

Handling NULL Values
--------------------

Use ``nullstr`` to specify values that should be treated as NULL:

.. code-block:: sql

    -- Treat "N/A" and "-" as NULL
    SELECT * FROM read_xml('data.xml', nullstr := ['N/A', '-']);

Processing Large Files
----------------------

Files exceeding ``maximum_file_size`` (16 MB by default) are automatically
streamed using a SAX-based parser that processes XML in chunks, so peak memory
stays proportional to a single record rather than to the entire file:

.. code-block:: sql

    -- Large files are streamed automatically (default behavior)
    SELECT count(*) FROM read_xml('huge_file.xml');

    -- Force DOM mode (errors if the file is too large)
    SELECT * FROM read_xml('file.xml', streaming := false);

    -- Adjust the file size limit for DOM parsing
    SELECT * FROM read_xml('file.xml', maximum_file_size := 268435456);  -- 256 MB

Extracting HTML Tables
----------------------

.. code-block:: sql

    -- Extract tables as rows
    SELECT table_index, row_index, columns
    FROM html_extract_tables('<table><tr><th>Name</th></tr><tr><td>John</td></tr></table>');

Next Steps
----------

- See :doc:`functions/index` for a complete function reference
- Learn about :doc:`parameters` for customization options
- Explore :doc:`xpath_guide` for advanced XPath queries
- Understand :doc:`schema_inference` for automatic type detection
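
As a closing sketch, the reader options shown in this guide can presumably be
combined in a single call. The example below assumes that ``record_element``,
``datetime_format``, and ``nullstr`` may be passed together to ``read_xml``.
The guide only shows them individually, and the file name ``orders.xml`` and
the element name ``order`` are hypothetical:

.. code-block:: sql

    -- Assumption: these named parameters can be combined in one call.
    -- 'orders.xml' and 'order' are hypothetical names used for illustration.
    SELECT *
    FROM read_xml(
        'orders.xml',
        record_element := 'order',    -- treat each <order> element as a record
        datetime_format := 'eu',      -- detect DD/MM/YYYY dates
        nullstr := ['N/A', '-']       -- treat these strings as NULL
    );

If the combination is rejected, pass the options one at a time, as in the
sections above, and consult :doc:`parameters` for the supported set.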