Changelog
=========

v2.0.0 (Current)
-----------------

**New Features**

- SAX-based streaming parser for very large XML files — files exceeding
  ``maximum_file_size`` are automatically parsed using SAX mode, reducing peak
  memory from ~4x file size (DOM) to proportional to a single record (Issue #68)

  - New ``streaming`` parameter (default: ``true``). When enabled, oversized XML
    files are streamed via libxml2's SAX push parser in 64KB chunks instead of
    building a full DOM tree. Set ``streaming:=false`` to restore the previous
    behavior of erroring on oversized files.
  - SAX mode supports simple tag-name ``record_element`` values (e.g., ``'item'``).
    XPath expressions automatically fall back to DOM parsing.
  - Not available for HTML files (libxml2 HTML parser is DOM-only).

**Changes**

- Reduced default ``maximum_file_size`` from 128MB to 16MB. With SAX streaming
  enabled by default, this threshold now controls when to switch from DOM to SAX
  rather than when to reject files. Files above 16MB are streamed automatically.
  Set ``maximum_file_size`` higher to use DOM for larger files, or set
  ``streaming:=false`` to error on oversized files (previous behavior).

**Limitations**

- SAX mode currently handles flat records (scalars, attributes, repeated elements).
  Nested STRUCT extraction from SAX events is not yet implemented — deeply nested
  records fall back to raw XML string values.

**Testing**

- 68 test suites, 2511 assertions
- Comprehensive DOM/SAX equivalence tests covering type inference, datetime_format,
  record_element, cross-record attribute discovery, large row counts (3000 rows
  across chunk boundaries), UTF-8 content, and nullstr interaction
- Stress tested with 382MB file (1M records): zero data loss, 5x faster than DOM,
  184x less memory (25MB vs 4.6GB peak)

v1.5.0
-----------------

**New Features**

- Added ``datetime_format`` parameter to ``read_xml``, ``read_html``, ``parse_xml``, and
  ``parse_html`` for controlling date/time detection and parsing — supports preset names
  (``auto``, ``none``, ``us``, ``eu``, ``iso``, etc.), custom strftime format strings, and
  lists of formats. Replaces regex-based temporal detection with DuckDB's ``StrpTimeFormat``
  candidate elimination approach (Issue #38)
- Added ``nullstr`` parameter for custom NULL value representation (Issue #40)
- Lazy DOM extraction for reduced peak memory — records are now extracted one at a time
  directly from the DOM instead of caching all rows at once (Issue #17, Phase 1)
- Type inference for elements with attributes — ``#text`` field now infers proper types
  (DOUBLE, INTEGER, DATE, BOOLEAN) instead of defaulting to VARCHAR (Issues #49, #46)

**Improvements**

- Increased default ``maximum_file_size`` from 16MB to 128MB (Issue #66)

**Bug Fixes**

- Fixed ``read_xml`` returning NULL for non-Latin text content — Cyrillic, CJK, and other
  multi-byte UTF-8 characters were being stripped by whitespace trimming (Issue #64)

v1.4.0
------

**New Features**

- Added ``parse_xml(content)`` table function to parse XML strings with schema inference
- Added ``parse_xml_objects(content)`` table function to parse XML strings and return raw XML type
- Added ``parse_html(content)`` table function to parse HTML strings with schema inference
- Added ``parse_html_objects(content)`` table function to parse HTML strings and return raw HTML type

**Bug Fixes**

- Fixed CDATA sections being converted to empty objects in ``xml_to_json`` (Issue #63)

v1.3.3
------

**Bug Fixes**

- Fixed table blocks rendering to HTML (Issue #62)

**Testing**

- Added comprehensive HTML ↔ Duck Block conversion tests

v1.3.2
------

**New Features**

- Added ``filename`` parameter to ``read_xml`` and ``read_html`` functions

**Documentation**

- Fixed high priority documentation issues
- Added documentation badge linking to readthedocs

v1.3.1
------

**Bug Fixes**

- Fixed ``duck_blocks_to_html()`` outputting literal "NULL" for parent elements with NULL content (parent blocks with inline children)

v1.3.0
------

**New Features**

- Added ``html_to_duck_blocks`` function to convert HTML into structured document blocks
- Added ``duck_blocks_to_html`` function to convert document blocks back to HTML
- Added namespace parameter to XPath scalar functions (``xml_extract_text``, ``xml_extract_elements``, etc.)
- Added ``xml_lookup_namespace(prefix)`` to look up common namespace URIs
- Added ``xml_find_undefined_prefixes(xml, xpath)`` to detect undeclared namespace prefixes
- Added implicit casting from XML/HTML types to VARCHAR, enabling string functions on XML/HTML values

**Bug Fixes**

- Fixed UTF-8 encoding in ``html_extract_text`` - characters like "chère" are now correctly preserved (Issue #53)
- Fixed documentation mismatches between README and actual function behavior (Issue #54)
- Added regression tests for ``xml_extract_attributes`` segfault report (Issue #55)

**Documentation**

- Added comprehensive XPath namespace handling documentation with ``local-name()`` examples
- Updated test statistics: 58 test suites, 1901 assertions
- Added documentation for ``html_escape`` and ``html_unescape`` functions
- Created Read the Docs documentation structure

**New Test Coverage**

- Added test suite for namespace handling patterns (Issue #4)
- Added test suite for batch file processing (Issue #17)
- Added tests for UTF-8 encoding with various character sets

v1.2.0
------

**New Features**

- Added ``union_by_name`` parameter for combining files with different schemas
- Added ``all_varchar`` parameter for forcing VARCHAR types
- Added ``force_list`` parameter for ensuring LIST types

**Bug Fixes**

- Fixed cross-record attribute discovery for nested elements (Issue #50)
- Fixed LIST extraction and record element serialization
- Fixed schema consistency for multi-file reads

**Improvements**

- Enhanced thread safety with per-operation configuration (Issue #7)
- Improved error handling for malformed documents

v1.1.0
------

**New Features**

- Added ``read_html`` and ``read_html_objects`` functions
- Added HTML table extraction functions
- Added ``html_extract_links`` and ``html_extract_images``
- Added ``xml_to_json`` with comprehensive options

**Improvements**

- Improved schema inference for complex nested structures
- Better handling of repeated elements
- Enhanced type detection for dates and timestamps

v1.0.0
------

**Initial Release**

- Core XML parsing with libxml2
- ``read_xml`` and ``read_xml_objects`` functions
- XPath extraction functions
- XML validation and formatting utilities
- Basic schema inference