mluttikh/xml2arrow - CodSpeed

xml2arrow


Performance History

Latest Results

perf: skip attribute decode/unescape for entity-free values

Attribute values were unconditionally run through decode_and_unescape_value, costing a per-attribute decode + unescape even for plain UTF-8 values with no entity references.

Mirror the Event::Text path: append the raw attribute bytes directly and only fall back to full decode/unescape when an entity ('&') is present. Utf8 fields are validated once at row finalization anyway, and numeric fields parse straight from bytes.

Also mark parse_attributes #[inline(never)] so the added check does not bloat the shared handle_event dispatch loop (kept the hot Start/Text/End path compact and avoided a code-layout regression).

Measured -3.2% to -4.2% on the attribute-heavy parse_small/parse_medium benches (both buffered and zero-copy, p<0.01).
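The fast path described in this commit can be sketched with the standard library alone. This is a minimal illustration, not the crate's actual code: `append_attr_value` and `unescape_into` are hypothetical names, `unescape_into` is a simplified stand-in for the full decode/unescape step (it resolves only the five predefined XML entities), and a plain `slice::contains` stands in for whatever byte scan the real implementation uses.

```rust
/// Hypothetical stand-in for the slow path: resolves only the five
/// predefined XML entities; any other byte is copied through unchanged.
fn unescape_into(raw: &[u8], out: &mut Vec<u8>) {
    const ENTITIES: [(&[u8], u8); 5] = [
        (b"&amp;", b'&'),
        (b"&lt;", b'<'),
        (b"&gt;", b'>'),
        (b"&quot;", b'"'),
        (b"&apos;", b'\''),
    ];
    let mut i = 0;
    'outer: while i < raw.len() {
        if raw[i] == b'&' {
            for (name, ch) in ENTITIES {
                if raw[i..].starts_with(name) {
                    out.push(ch);
                    i += name.len();
                    continue 'outer;
                }
            }
        }
        // Not a recognized entity: copy the byte as-is.
        out.push(raw[i]);
        i += 1;
    }
}

/// Appends an attribute value to `out`.
/// Returns true when the raw-copy fast path was taken.
fn append_attr_value(raw: &[u8], out: &mut Vec<u8>) -> bool {
    if !raw.contains(&b'&') {
        // No '&' means no entity reference: the raw bytes are the value.
        out.extend_from_slice(raw);
        true
    } else {
        unescape_into(raw, out);
        false
    }
}
```

Calling `append_attr_value(b"plain", &mut buf)` copies the bytes directly, while a value such as `a &amp; b` goes through the fallback. The sketch defers UTF-8 validation entirely, matching the commit's note that Utf8 fields are validated once at row finalization.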

perf/skip-attribute-decode

9 days ago

perf: add opt-in strip_namespaces to skip per-name prefix scan

Element and attribute names were always resolved via quick-xml's local_name(), which runs a memchr(':') namespace-prefix scan on every name. A function-level profile showed this name resolution is ~5.8% of total parse time, the bulk of it that scan.

Add ParserOptions::strip_namespaces (default true, preserving current behavior). When false, names resolve via the raw qualified name(), skipping the scan. For prefix-free documents name() and local_name() are byte-identical, so disabling is free and existing configs are unaffected; for prefixed input, configured paths must spell out the prefix.

Wired as a runtime bool on the converter (like validate_attributes), gating the element Start/Empty and attribute-key sites.

Measured -4.1% to -6.7% across parse_small/parse_medium/parse_wide_fanout (both buffered and zero-copy, p<0.01).
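The effect of the flag on name resolution can be illustrated with a small standalone sketch. The functions `local_name` and `resolve_name` below are hypothetical helpers written for this example (they are not quick-xml's or xml2arrow's API), and a simple iterator search stands in for the memchr scan.

```rust
/// Hypothetical equivalent of a local-name lookup: scans for ':' and
/// returns the part after the prefix, or the whole name if none is found.
fn local_name(qname: &[u8]) -> &[u8] {
    match qname.iter().position(|&b| b == b':') {
        Some(i) => &qname[i + 1..],
        None => qname,
    }
}

/// Resolves a name according to a runtime flag.
/// With `strip_namespaces == false` the scan is skipped entirely and the
/// raw qualified name is returned.
fn resolve_name(qname: &[u8], strip_namespaces: bool) -> &[u8] {
    if strip_namespaces {
        local_name(qname)
    } else {
        qname
    }
}
```

For a prefix-free name such as `item` both settings return the same bytes. For `ns:item`, stripping yields `item`, while the unstripped result stays `ns:item`, which is why configured paths must include the prefix when the flag is off.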

perf/strip-namespaces

9 days ago

docs: update version in README

main

10 days ago

chore: Release v0.17 (#84)

main

10 days ago

chore: Release v0.17

release_v0_17

10 days ago

perf: add reusable Parser to amortize path-trie compilation (#83)

* perf: add opt-in validate_closing_tags to skip end-tag checks

Add ParserOptions::validate_closing_tags (defaults to true, preserving prior behavior). Setting it false disables quick-xml's per-end-tag name validation, since PathTracker already enforces nesting via depth tracking.

Measured ~2-6% throughput gain across parse_small, parse_medium, and parse_wide_fanout (all p<0.01). The trade-off is that opening/closing-tag mismatches are no longer rejected, so the fast path is opt-in for trusted inputs only.

Also factor the shared reader configuration into configure_reader() so both entry points apply ParserOptions identically.

* perf: add opt-in validate_attributes to skip duplicate-attribute check

Add ParserOptions::validate_attributes (defaults to true, preserving current behavior). When false, the attribute iterator runs with quick-xml's duplicate-key detection disabled via with_checks(false).

Beyond skipping an O(n^2) scan of an element's attributes, this removes a heap allocation quick-xml otherwise makes per attribute-bearing element (it records each seen key's byte range in a Vec). On attribute-heavy documents that allocation dominates the check's cost.

Measured on the attribute-heavy benches (clean A/B, only the config bool flipped):

    parse_small  / buffered   -6.5% (p=0.00)
    parse_small  / zero_copy  -6.2% (p=0.00)
    parse_medium / buffered   -4.9% (p=0.00)
    parse_medium / zero_copy  -7.1% (p=0.00)

The trade-off mirrors validate_closing_tags: a duplicated attribute is no longer rejected. Because field values accumulate by appending, a duplicate's values are concatenated rather than reported as an error, so the fast path is opt-in for trusted inputs only.

Covered by test_validate_attributes_false_still_parses_attributes and test_validate_attributes_false_tolerates_duplicate_attribute.

* perf: add reusable Parser to amortize path-trie compilation

parse_xml/parse_xml_slice rebuild the PathRegistry trie and re-validate the Config on every call, a fixed cost (~8.5us here) paid before any XML is read. On large documents it is amortized to nothing, but on small ones it dominates: a measured ~33% of total parse time on a 2KB document.

Introduce a `Parser` type that compiles the config + path registry once and exposes parse()/parse_slice() so callers processing many documents with one schema pay that cost a single time. The existing free functions become thin wrappers over a throwaway Parser, so the public API is purely additive and behavior is unchanged.

Internally the per-parse converter now borrows the registry from the Parser rather than owning it, and builds only the fresh Arrow builders each parse.

Adds a parse_tiny benchmark (with reused-Parser variants) to guard the setup cost, and an integration test proving no state leaks between documents parsed through one Parser.

* docs: document reusable Parser for many-document workloads

* style: rustfmt run_parse signature
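The compile-once, parse-many structure described in the last perf commit can be sketched as follows. This is a toy model under stated assumptions, not xml2arrow's implementation: the `PathRegistry` here is a plain `HashMap` rather than a trie, documents are modeled as pre-extracted `(path, value)` pairs instead of XML, the builders are `Vec<String>` columns rather than Arrow builders, and `parse_once` is a hypothetical name for the one-shot wrapper.

```rust
use std::collections::HashMap;

/// Toy stand-in for the compiled path registry: path -> column index.
struct PathRegistry {
    columns: HashMap<String, usize>,
    width: usize,
}

impl PathRegistry {
    /// The one-off setup cost: build the lookup structure from the config.
    fn compile(paths: &[&str]) -> Self {
        let columns = paths
            .iter()
            .enumerate()
            .map(|(i, p)| (p.to_string(), i))
            .collect();
        PathRegistry { columns, width: paths.len() }
    }
}

/// Per-parse state: borrows the registry, owns only the fresh builders.
struct Converter<'a> {
    registry: &'a PathRegistry,
    builders: Vec<Vec<String>>,
}

impl<'a> Converter<'a> {
    fn new(registry: &'a PathRegistry) -> Self {
        Converter { registry, builders: vec![Vec::new(); registry.width] }
    }

    /// Appends a value if its path is configured; other paths are ignored.
    fn push(&mut self, path: &str, value: &str) {
        if let Some(&col) = self.registry.columns.get(path) {
            self.builders[col].push(value.to_string());
        }
    }
}

/// Reusable parser: compile the registry once, parse many documents.
struct Parser {
    registry: PathRegistry,
}

impl Parser {
    fn new(paths: &[&str]) -> Self {
        Parser { registry: PathRegistry::compile(paths) }
    }

    /// Each call starts from empty builders, so no state carries over.
    fn parse(&self, events: &[(&str, &str)]) -> Vec<Vec<String>> {
        let mut conv = Converter::new(&self.registry);
        for &(path, value) in events {
            conv.push(path, value);
        }
        conv.builders
    }
}

/// One-shot convenience wrapper over a throwaway Parser.
fn parse_once(paths: &[&str], events: &[(&str, &str)]) -> Vec<Vec<String>> {
    Parser::new(paths).parse(events)
}
```

A caller handling many small documents would build one `Parser` and call `parse` in a loop, paying for `PathRegistry::compile` a single time. Because `parse` takes `&self` and the converter only borrows the registry, the borrow checker itself rules out per-document state being written back into the shared parser.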

main

10 days ago

style: rustfmt run_parse signature

perf/reuse-parser

10 days ago

docs: document reusable Parser for many-document workloads

perf/reuse-parser

10 days ago

Latest Branches

+4%

perf: skip attribute decode/unescape for entity-free values #85

9 days ago

587540a

perf/skip-attribute-decode

+4%

perf: add opt-in strip_namespaces to skip per-name prefix scan #86

9 days ago

56a611d

perf/strip-namespaces

0%

chore: Release v0.17 #84

10 days ago

f9d475f

release_v0_17

© 2026 CodSpeed Technology
