perf(neon): combine 4 escape masks in SIMD domain before GPR extraction

The chunk loop (64 bytes/iteration) previously extracted each vector's
escape mask to a GPR individually, creating 4 serial shrn→fmov→cbz
chains on the critical path (~32 cycles on Apple Silicon).
Now compute all 4 mask vectors independently (so they can pipeline),
combine them with 3 vorrq instructions in the vector domain, and extract
a single combined bitmask. Individual bitmasks are extracted lazily,
only on the slow path (when escapes exist).

Fast path savings per 64-byte chunk:
- 3 fewer shrn instructions (6 cycles)
- 3 fewer fmov d→x transfers (15 cycles)
- 3 fewer cbz branches (3 cycles)
- Stores optimized from 3×str to 2×stp by the compiler

Benchmark (5.5MB source fixtures): ~3.3% faster (10.47ms → 10.12ms)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>