Changelog
=========

v2.0.0 (Current)
----------------

**New Features**

- SAX-based streaming parser for very large XML files: files exceeding ``maximum_file_size`` are
  automatically parsed using SAX mode, reducing peak memory from ~4x file size (DOM) to an amount
  proportional to a single record (Issue #68)

  - New ``streaming`` parameter (default: ``true``). When enabled, oversized XML files are streamed
    via libxml2's SAX push parser in 64KB chunks instead of building a full DOM tree. Set
    ``streaming:=false`` to restore the previous behavior of erroring on oversized files.
  - SAX mode supports simple tag-name ``record_element`` values (e.g., ``'item'``). XPath
    expressions automatically fall back to DOM parsing.
  - Not available for HTML files (libxml2 HTML parser is DOM-only).

**Changes**

- Reduced default ``maximum_file_size`` from 128MB to 16MB. With SAX streaming enabled by default,
  this threshold now controls when to switch from DOM to SAX rather than when to reject files.
  Files above 16MB are streamed automatically. Set ``maximum_file_size`` higher to use DOM for
  larger files, or set ``streaming:=false`` to error on oversized files (previous behavior).

**Limitations**

- SAX mode currently handles flat records (scalars, attributes, repeated elements). Nested STRUCT
  extraction from SAX events is not yet implemented; deeply nested records fall back to raw XML
  string values.
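
The chunked SAX approach described above can be illustrated with a minimal sketch. It uses Python's standard ``xml.sax`` incremental parser (expat-based) rather than libxml2, so it is an analogy and not the extension's implementation. The names ``FlatRecordHandler`` and ``stream_records`` are hypothetical, and only flat records (attributes plus scalar child elements) are handled.

```python
import io
import xml.sax

CHUNK_SIZE = 64 * 1024  # 64KB, the chunk size named in the changelog


class FlatRecordHandler(xml.sax.ContentHandler):
    """Hypothetical handler: builds one dict per flat `record_tag` element."""

    def __init__(self, record_tag):
        super().__init__()
        self.record_tag = record_tag
        self.records = []     # completed records not yet handed to the caller
        self._current = None  # record being built, or None outside a record
        self._field = None    # name of the child element being read
        self._text = []       # text pieces of the current child element

    def startElement(self, name, attrs):
        if name == self.record_tag and self._current is None:
            # Attributes of the record element become its first fields.
            self._current = dict(attrs.items())
        elif self._current is not None and self._field is None:
            self._field = name
            self._text = []

    def characters(self, content):
        # Text may arrive in several pieces, e.g. across chunk boundaries.
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if self._current is None:
            return
        if name == self._field:
            self._current[name] = "".join(self._text).strip()
            self._field = None
        elif name == self.record_tag and self._field is None:
            self.records.append(self._current)
            self._current = None

    def drain(self):
        """Return the completed records and forget them."""
        done, self.records = self.records, []
        return done


def stream_records(stream, record_tag, chunk_size=CHUNK_SIZE):
    """Yield records while feeding the parser one chunk at a time."""
    parser = xml.sax.make_parser()
    handler = FlatRecordHandler(record_tag)
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # push parsing: no full tree is ever built
        yield from handler.drain()
    parser.close()
    yield from handler.drain()


if __name__ == "__main__":
    sample = (b'<root><item id="1"><name>a</name></item>'
              b'<item id="2"><name>b</name></item></root>')
    # A tiny chunk size forces tags to be split across chunk boundaries.
    for record in stream_records(io.BytesIO(sample), "item", chunk_size=16):
        print(record)
```

Because completed records are drained after every chunk, memory held by the sketch stays bounded by the chunk size plus the records completed within one chunk, which mirrors the motivation given for SAX mode.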

**Testing**

- 68 test suites, 2511 assertions
- Comprehensive DOM/SAX equivalence tests covering type inference, datetime_format,
  record_element, cross-record attribute discovery, large row counts (3000 rows across chunk
  boundaries), UTF-8 content, and nullstr interaction
- Stress tested with 382MB file (1M records): zero data loss, 5x faster than DOM, 184x less
  memory (25MB vs 4.6GB peak)

v1.5.0
------

**New Features**

- Added ``datetime_format`` parameter to ``read_xml``, ``read_html``, ``parse_xml``, and
  ``parse_html`` for controlling date/time detection and parsing. Supports preset names
  (``auto``, ``none``, ``us``, ``eu``, ``iso``, etc.), custom strftime format strings, and lists
  of formats. Replaces regex-based temporal detection with DuckDB's ``StrpTimeFormat`` candidate
  elimination approach (Issue #38)
- Added ``nullstr`` parameter for custom NULL value representation (Issue #40)
- Lazy DOM extraction for reduced peak memory: records are now extracted one at a time directly
  from the DOM instead of caching all rows at once (Issue #17, Phase 1)
- Type inference for elements with attributes: ``#text`` field now infers proper types (DOUBLE,
  INTEGER, DATE, BOOLEAN) instead of defaulting to VARCHAR (Issues #49, #46)

**Improvements**

- Increased default ``maximum_file_size`` from 16MB to 128MB (Issue #66)

**Bug Fixes**

- Fixed ``read_xml`` returning NULL for non-Latin text content: Cyrillic, CJK, and other
  multi-byte UTF-8 characters were being stripped by whitespace trimming (Issue #64)

v1.4.0
------

**New Features**

- Added ``parse_xml(content)`` table function to parse XML strings with schema inference
- Added ``parse_xml_objects(content)`` table function to parse XML strings and return raw XML type
- Added ``parse_html(content)`` table function to parse HTML strings with schema inference
- Added ``parse_html_objects(content)`` table function to parse HTML strings and return raw HTML
  type

**Bug Fixes**

- Fixed CDATA sections being converted to
  empty objects in ``xml_to_json`` (Issue #63)

v1.3.3
------

**Bug Fixes**

- Fixed table blocks rendering to HTML (Issue #62)

**Testing**

- Added comprehensive HTML ↔ Duck Block conversion tests

v1.3.2
------

**New Features**

- Added ``filename`` parameter to ``read_xml`` and ``read_html`` functions

**Documentation**

- Fixed high priority documentation issues
- Added documentation badge linking to readthedocs

v1.3.1
------

**Bug Fixes**

- Fixed ``duck_blocks_to_html()`` outputting literal "NULL" for parent elements with NULL content
  (parent blocks with inline children)

v1.3.0
------

**New Features**

- Added ``html_to_duck_blocks`` function to convert HTML into structured document blocks
- Added ``duck_blocks_to_html`` function to convert document blocks back to HTML
- Added namespace parameter to XPath scalar functions (``xml_extract_text``,
  ``xml_extract_elements``, etc.)
- Added ``xml_lookup_namespace(prefix)`` to look up common namespace URIs
- Added ``xml_find_undefined_prefixes(xml, xpath)`` to detect undeclared namespace prefixes
- Added implicit casting from XML/HTML types to VARCHAR, enabling string functions on XML/HTML
  values

**Bug Fixes**

- Fixed UTF-8 encoding in ``html_extract_text``; characters like "chère" are now correctly
  preserved (Issue #53)
- Fixed documentation mismatches between README and actual function behavior (Issue #54)
- Added regression tests for ``xml_extract_attributes`` segfault report (Issue #55)

**Documentation**

- Added comprehensive XPath namespace handling documentation with ``local-name()`` examples
- Updated test statistics: 58 test suites, 1901 assertions
- Added documentation for ``html_escape`` and ``html_unescape`` functions
- Created Read the Docs documentation structure

**New Test Coverage**

- Added test suite for namespace handling patterns (Issue #4)
- Added test suite for batch file processing (Issue #17)
- Added tests for UTF-8 encoding with various character sets

v1.2.0
------

**New Features**

- Added
  ``union_by_name`` parameter for combining files with different schemas
- Added ``all_varchar`` parameter for forcing VARCHAR types
- Added ``force_list`` parameter for ensuring LIST types

**Bug Fixes**

- Fixed cross-record attribute discovery for nested elements (Issue #50)
- Fixed LIST extraction and record element serialization
- Fixed schema consistency for multi-file reads

**Improvements**

- Enhanced thread safety with per-operation configuration (Issue #7)
- Improved error handling for malformed documents

v1.1.0
------

**New Features**

- Added ``read_html`` and ``read_html_objects`` functions
- Added HTML table extraction functions
- Added ``html_extract_links`` and ``html_extract_images``
- Added ``xml_to_json`` with comprehensive options

**Improvements**

- Improved schema inference for complex nested structures
- Better handling of repeated elements
- Enhanced type detection for dates and timestamps

v1.0.0
------

**Initial Release**

- Core XML parsing with libxml2
- ``read_xml`` and ``read_xml_objects`` functions
- XPath extraction functions
- XML validation and formatting utilities
- Basic schema inference