feat: adds try_encode and try_decode with utf-8 special-case (#4060)
## Summary
**This gives a ~3.2x speedup over a Python UDF when decoding binary
arrays into string arrays.**
This PR adds `try_encode` and `try_decode`, with a special case for
utf-8. They cover binary-to-binary transforms such as gzip compress and
decompress, as well as binary-to-text and text-to-binary transforms such
as converting bytes to utf-8 and vice versa. We can continue to build on
this [with additional
encodings](https://docs.python.org/3/library/codecs.html#standard-encodings).
I've also carved out a special no-copy path for utf-8.
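
As a rough illustration of the two families of transforms described above, here is a minimal pure-Python sketch. It assumes that the `try_` variants yield a null instead of raising when a value cannot be converted; the helper names (`try_decode_text`, `try_encode_text`, `try_decode_gzip`) are hypothetical and are not the functions added by this PR.

```python
import gzip
import zlib
from typing import Optional


def try_decode_text(value: Optional[bytes], encoding: str = "utf-8") -> Optional[str]:
    """Binary-to-text: decode bytes, returning None for nulls or invalid input."""
    if value is None:
        return None
    try:
        return value.decode(encoding)
    except UnicodeDecodeError:
        return None


def try_encode_text(value: Optional[str], encoding: str = "utf-8") -> Optional[bytes]:
    """Text-to-binary: encode a string, returning None if it cannot be represented."""
    if value is None:
        return None
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None


def try_decode_gzip(value: Optional[bytes]) -> Optional[bytes]:
    """Binary-to-binary: decompress gzip data, returning None if it is malformed."""
    if value is None:
        return None
    try:
        return gzip.decompress(value)
    except (OSError, EOFError, zlib.error):
        return None


rows = [b"hello", b"\xff\xfe", None]
decoded = [try_decode_text(row) for row in rows]  # invalid and null rows become None
```

The point of the sketch is only the contract: one bad row produces a null for that row rather than failing the whole column.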
## Performance Results
Three runs of 10 iterations each (+1 warmup) on 1 million rows show a
~3.2x speedup of the native `try_decode` over the UDF version.
```
❯ pytest ./tests/functions/test_codecs.py -k test_try_decode_utf8_perf -s
Native try_decode stats (seconds): {'mean': 0.1969996452331543, 'median': 0.19691014289855957, 'min': 0.1933138370513916, 'max': 0.20042800903320312, 'stdev': 0.0018028098671721037}
UDF try_decode stats (seconds): {'mean': 0.6376919507980346, 'median': 0.6374071836471558, 'min': 0.6186070442199707, 'max': 0.6605658531188965, 'stdev': 0.011603869017790357}
**Average speedup: 3.24x**
❯ pytest ./tests/functions/test_codecs.py -k test_try_decode_utf8_perf -s
Native try_decode stats (seconds): {'mean': 0.19709632396697999, 'median': 0.19748806953430176, 'min': 0.19363689422607422, 'max': 0.1991891860961914, 'stdev': 0.00167838499446807}
UDF try_decode stats (seconds): {'mean': 0.6387589693069458, 'median': 0.639365553855896, 'min': 0.6251809597015381, 'max': 0.651353120803833, 'stdev': 0.0075957305958397415}
**Average speedup: 3.24x**
❯ pytest ./tests/functions/test_codecs.py -k test_try_decode_utf8_perf -s
Native try_decode stats (seconds): {'mean': 0.19655859470367432, 'median': 0.19698894023895264, 'min': 0.19165897369384766, 'max': 0.19891595840454102, 'stdev': 0.0019603584148133366}
UDF try_decode stats (seconds): {'mean': 0.6334790706634521, 'median': 0.6332188844680786, 'min': 0.6258370876312256, 'max': 0.6455898284912109, 'stdev': 0.0063130945873989455}
**Average speedup: 3.22x**
```
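
For readers who want to reproduce numbers of this shape, the following is a minimal sketch of a timing harness. The actual `test_try_decode_utf8_perf` code is not shown here, so `bench` and `speedup` are hypothetical helpers. Taking the speedup as the ratio of the two means is also an assumption, though it matches the figures reported above.

```python
import statistics
import time
from typing import Callable, Dict


def bench(fn: Callable[[], object], iterations: int = 10, warmup: int = 1) -> Dict[str, float]:
    """Time fn over several iterations, discarding the warmup runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
        "stdev": statistics.stdev(samples),
    }


def speedup(native_mean: float, udf_mean: float) -> float:
    """Assumed definition: how many times faster the native path is than the UDF."""
    return udf_mean / native_mean


# Means copied from the first run above.
first_run = round(speedup(0.1969996452331543, 0.6376919507980346), 2)
```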
## Related Issues
#3989
#4062
## Changes Made
* Adds codec kind to differentiate between text and binary encodings
* Adds `try_encode` and `try_decode` to the Python expression API (and all
layers beneath)
* Adds a special-case UDF for decoding utf-8 since we only need to
validate the bytes
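
To make the first and last bullets concrete, here is a small sketch of how a codec kind can determine the result type, and why utf-8 decoding reduces to validation. Everything in it (`CodecKind`, `CODEC_KINDS`, `result_type`, `is_valid_utf8`) is a hypothetical illustration, not the code in this PR. A plain Python check like this one still builds a temporary string, whereas a no-copy path would validate the existing buffer in place.

```python
from enum import Enum
from typing import List, Optional


class CodecKind(Enum):
    TEXT = "text"      # converts between bytes and str, e.g. utf-8
    BINARY = "binary"  # converts between bytes and bytes, e.g. gzip


# Illustrative registry only; the real set of codecs may differ.
CODEC_KINDS = {
    "utf-8": CodecKind.TEXT,
    "gzip": CodecKind.BINARY,
}


def result_type(codec: str, decoding: bool) -> type:
    """Only decoding with a text codec yields strings; every other case yields bytes."""
    if CODEC_KINDS[codec] is CodecKind.TEXT and decoding:
        return str
    return bytes


def is_valid_utf8(value: bytes) -> bool:
    """Check whether the bytes are well-formed UTF-8 without keeping the decoded result."""
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def utf8_validity(values: List[Optional[bytes]]) -> List[bool]:
    """Per-row mask: True where the row is non-null and valid UTF-8."""
    return [v is not None and is_valid_utf8(v) for v in values]
```

With a mask like this, rows marked `False` can be nulled out while the valid rows keep their original bytes, reinterpreted as text.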
## Checklist
- [x] All tests have passed
- [x] Documented in API Docs
- [x] Documented in User Guide
- [x] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [x] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)