messense/jieba-rs - CodSpeed

jieba-rs


Performance History

Latest Results

Add tag field to Keyword struct

Closes #86. The `extract_tags` API now includes POS tag information on each `Keyword`, matching the Python jieba API.

main

14 days ago
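To illustrate what a POS tag on each keyword makes possible, here is a minimal sketch. The struct below is a hypothetical stand-in, not the crate's actual definition: the field names, the `String` type of `tag`, and the `filter_by_tag` helper are all assumptions made for this example.

```rust
/// Hypothetical shape of an extracted keyword. Field names and types are
/// assumptions for illustration; the real struct in jieba-rs may differ.
#[derive(Debug, Clone, PartialEq)]
struct Keyword {
    keyword: String,
    weight: f64,
    /// Part-of-speech tag, e.g. "n" (noun) or "ns" (place name).
    tag: String,
}

/// Hypothetical helper: keep only keywords whose POS tag appears in
/// `allowed`, preserving the original order.
fn filter_by_tag(keywords: Vec<Keyword>, allowed: &[&str]) -> Vec<Keyword> {
    keywords
        .into_iter()
        .filter(|k| allowed.contains(&k.tag.as_str()))
        .collect()
}
```

With a tag available per keyword, callers could restrict results to nouns or proper names after extraction instead of re-tagging each word.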

feat: add posseg (part-of-speech tagging) HMM for OOV words (#146)

Implement a compound-state HMM Viterbi (4 positions × 64 POS tags = 256 states) for POS-tagging unknown/OOV Chinese words, ported from Python jieba's posseg module.

- Add `posseg.rs` with dense 256×256 transition matrix for O(1) lookup
- Embed probability data (start, trans, emit, char_state_tab) via `include_flate`, gated on `default-dict` feature
- Improve `tag()` to use posseg HMM for OOV CJK words instead of falling back to `"x"` (e.g. `"张尧"` now correctly tagged as `"nr"`/person name)
- Add conversion script (`scripts/convert_posseg.py`) for regenerating `posseg.txt` from Python jieba's pickle files
- Add benchmark for `tag_with_oov` (~2.5µs per call)

main

14 days ago
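The compound-state idea in the posseg commit above (4 positions × 64 POS tags = 256 states, with a dense 256×256 transition matrix) can be sketched as follows. This is an illustration under stated assumptions, not the contents of `posseg.rs`: the packing order, the `Transitions` type, and all function names are hypothetical.

```rust
/// Sizes taken from the commit message: 4 word positions and 64 POS tags.
const POSITIONS: usize = 4;
const TAGS: usize = 64;
const STATES: usize = POSITIONS * TAGS; // 256 compound states

/// Assumed encoding: pack (position, tag) into a single index in 0..256.
fn state_index(position: usize, tag: usize) -> usize {
    debug_assert!(position < POSITIONS && tag < TAGS);
    position * TAGS + tag
}

/// Inverse of `state_index`: recover (position, tag) from a compound state.
fn decode_state(state: usize) -> (usize, usize) {
    (state / TAGS, state % TAGS)
}

/// Dense row-major table of log-probabilities. Transitions that were never
/// set stay at negative infinity, i.e. they are treated as impossible.
struct Transitions {
    log_probs: Vec<f64>, // STATES * STATES entries
}

impl Transitions {
    fn new() -> Self {
        Self { log_probs: vec![f64::NEG_INFINITY; STATES * STATES] }
    }

    fn set(&mut self, from: usize, to: usize, log_prob: f64) {
        self.log_probs[from * STATES + to] = log_prob;
    }

    /// O(1) lookup: one multiply, one add, one array index.
    fn get(&self, from: usize, to: usize) -> f64 {
        self.log_probs[from * STATES + to]
    }
}
```

A dense table of 256 × 256 `f64` values occupies 512 KiB, which trades memory for a lookup with no hashing in the Viterbi inner loop.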

feat: add posseg (part-of-speech tagging) HMM for OOV words

Implement a compound-state HMM Viterbi (4 positions × 64 POS tags = 256 states) for POS-tagging unknown/OOV Chinese words, ported from Python jieba's posseg module.

- Add `posseg.rs` with dense 256×256 transition matrix for O(1) lookup
- Embed probability data (start, trans, emit, char_state_tab) via `include_flate`, gated on `default-dict` feature
- Improve `tag()` to use posseg HMM for OOV CJK words instead of falling back to `"x"` (e.g. `"张尧"` now correctly tagged as `"nr"`/person name)
- Add conversion script (`scripts/convert_posseg.py`) for regenerating `posseg.txt` from Python jieba's pickle files
- Add benchmark for `tag_with_oov` (~2.5µs per call)

feat/posseg

14 days ago

Optimize HMM emit prob lookups and precompute log frequencies (#145)

- Change HMM emit probability maps from `phf::Map<&str, f64>` to `phf::Map<char, f64>` since all keys are single characters. This avoids string hashing/comparison overhead in the Viterbi inner loop.
- Precompute `log(freq)` in `Record::log_freq` to eliminate per-lookup `f64::ln()` calls in `calc()`.
- Simplify `viterbi()` to iterate chars directly instead of using byte offset peekable iterator.
- Use direct array indexing for `TRANS_PROBS` instead of `.get().unwrap()`.

Benchmarks show 9-13% improvement on segmentation workloads.

main

14 days ago
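Two of the optimizations listed in the commit above, char-keyed emission lookups and a precomputed `log(freq)`, can be sketched as follows. This is an illustration only: it uses `std::collections::HashMap` as a stand-in for `phf::Map`, and the `Record` layout, the `MIN_LOG_PROB` sentinel, and the helper names are assumptions rather than the crate's actual code.

```rust
use std::collections::HashMap;

/// Illustrative sentinel for characters with no emission entry.
const MIN_LOG_PROB: f64 = -1.0e100;

/// Hypothetical dictionary record: the logarithm is computed once, when the
/// record is built, instead of on every lookup.
struct Record {
    freq: usize,
    log_freq: f64,
}

impl Record {
    fn new(freq: usize) -> Self {
        Self { freq, log_freq: (freq as f64).ln() }
    }
}

/// Emission lookup keyed by `char`, so no string is hashed or compared.
fn emit_log_prob(table: &HashMap<char, f64>, ch: char) -> f64 {
    table.get(&ch).copied().unwrap_or(MIN_LOG_PROB)
}

/// Log-probability of a word relative to the total frequency mass:
/// ln(freq / total) = ln(freq) - ln(total), with no `ln()` call here.
fn word_score(record: &Record, log_total: f64) -> f64 {
    record.log_freq - log_total
}
```

The point of both changes is the same: move work out of the per-character and per-word hot paths, leaving only a lookup and a subtraction.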

Optimize HMM emit prob lookups and precompute log frequencies

- Change HMM emit probability maps from `phf::Map<&str, f64>` to `phf::Map<char, f64>` since all keys are single characters. This avoids string hashing/comparison overhead in the Viterbi inner loop.
- Precompute `log(freq)` in `Record::log_freq` to eliminate per-lookup `f64::ln()` calls in `calc()`.
- Simplify `viterbi()` to iterate chars directly instead of using byte offset peekable iterator.
- Use direct array indexing for `TRANS_PROBS` instead of `.get().unwrap()`.

Benchmarks show 9-13% improvement on segmentation workloads.

optimize-hmm-and-calc

14 days ago

Replace regex splitting with hand-rolled `SplitByCharacterClass` (#144)

Replace `RE_HAN_DEFAULT`, `RE_HAN_CUT_ALL`, `RE_SKIP_CUT_ALL`, and `RE_SKIP_DEFAULT` regexes in `lib.rs` with inline character classifiers (`is_han_default`, `is_han_cut_all`, `is_skip_cut_all`) and a generic `SplitByCharacterClass` iterator.

Also replace HMM `RE_HAN` with `is_hmm_han` classifier, keeping only `RE_SKIP` as regex due to its complex pattern `([a-zA-Z0-9]+(?:.\d+)?%?)`.

Profiling showed regex `find_fwd`/`find_rev` accounted for ~29% of CPU time; this drops to <1% with the character class approach.

Benchmark improvements (vs previous commit on `add-byte-positions`):

- `no_hmm`: 1.22 µs → 1.02 µs (-16%)
- `with_hmm`: 1.82 µs → 1.50 µs (-18%)
- `cut_for_search`: 2.31 µs → 2.00 µs (-14%)

main

14 days ago
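The regex-free splitting described in the commit above can be sketched as an iterator that yields maximal runs of characters sharing the same classification. This is a simplified illustration, not the crate's `SplitByCharacterClass`: the `Run` enum, the `SplitByClass` type, the `split_by_class` constructor, and the narrow `is_han` range are all assumptions.

```rust
/// A maximal run of text, labeled by whether its characters matched.
#[derive(Debug, PartialEq)]
enum Run<'a> {
    Matched(&'a str),
    Unmatched(&'a str),
}

/// Hypothetical iterator: splits `text` into alternating matched and
/// unmatched runs using a plain character predicate instead of a regex.
struct SplitByClass<'a, F: Fn(char) -> bool> {
    text: &'a str,
    pos: usize, // byte offset of the next unread character
    classify: F,
}

impl<'a, F: Fn(char) -> bool> Iterator for SplitByClass<'a, F> {
    type Item = Run<'a>;

    fn next(&mut self) -> Option<Run<'a>> {
        let text: &'a str = self.text;
        let rest = &text[self.pos..];
        let first = rest.chars().next()?; // None once the input is exhausted
        let class = (self.classify)(first);
        // The run ends at the first character whose class differs.
        let len = rest
            .char_indices()
            .find(|&(_, c)| (self.classify)(c) != class)
            .map_or(rest.len(), |(i, _)| i);
        self.pos += len;
        let run = &rest[..len];
        Some(if class { Run::Matched(run) } else { Run::Unmatched(run) })
    }
}

/// Illustrative classifier covering only the CJK Unified Ideographs block.
fn is_han(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

fn split_by_class<F: Fn(char) -> bool>(text: &str, classify: F) -> SplitByClass<'_, F> {
    SplitByClass { text, pos: 0, classify }
}
```

For example, `split_by_class("我爱Rust编程", is_han)` yields a matched run `"我爱"`, an unmatched run `"Rust"`, and a matched run `"编程"`. Each character is classified with a range comparison, which is consistent with the commit's observation that regex search time largely disappears.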

Latest Branches

0%

feat: add posseg (part-of-speech tagging) HMM for OOV words #146

14 days ago

18a9fff

feat/posseg

+16%

Optimize HMM emit prob lookups and precompute log frequencies #145

14 days ago

306b0c2

optimize-hmm-and-calc

+38%

Replace regex splitting with hand-rolled `SplitByCharacterClass` #144

14 days ago

9babb6a

replace-regex-with-char-class

© 2026 CodSpeed Technology
