messense
jieba-rs
Latest Results
Add tag field to Keyword struct

Closes #86. The extract_tags API now includes POS tag information on each Keyword, matching the Python jieba API.
main
14 days ago
feat: add posseg (part-of-speech tagging) HMM for OOV words (#146)

Implement a compound-state HMM Viterbi (4 positions × 64 POS tags = 256 states) for POS-tagging unknown/OOV Chinese words, ported from Python jieba's posseg module.

- Add `posseg.rs` with dense 256×256 transition matrix for O(1) lookup
- Embed probability data (start, trans, emit, char_state_tab) via `include_flate`, gated on `default-dict` feature
- Improve `tag()` to use posseg HMM for OOV CJK words instead of falling back to `"x"` (e.g. `"张尧"` now correctly tagged as `"nr"`/person name)
- Add conversion script (`scripts/convert_posseg.py`) for regenerating `posseg.txt` from Python jieba's pickle files
- Add benchmark for `tag_with_oov` (~2.5µs per call)
main
14 days ago
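The compound-state idea in the commit above can be sketched as follows. This is a minimal illustration, not the code in `posseg.rs`: the index layout in `state_id`, the function names, and the closure-based emission lookup are all assumptions, and the four positions are presumed to be the B/M/E/S labels used by Python jieba. It shows how a (position, tag) pair can be packed into one state index so that a transition lookup becomes a single read from a dense row-major matrix.

```rust
/// Sizes stated in the commit message: 4 positions × 64 POS tags.
const POSITIONS: usize = 4;
const TAGS: usize = 64;
const STATES: usize = POSITIONS * TAGS; // 256 compound states

/// Pack a (position, tag) pair into one compound state index.
/// The layout is hypothetical; the real crate may order states differently.
fn state_id(pos: usize, tag: usize) -> usize {
    debug_assert!(pos < POSITIONS && tag < TAGS);
    pos * TAGS + tag
}

/// Inverse of `state_id`: recover (position, tag) from a compound index.
fn split_state(id: usize) -> (usize, usize) {
    (id / TAGS, id % TAGS)
}

/// Viterbi decoding over `n` states using log-probabilities.
/// `trans` is a dense row-major n×n matrix: `trans[from * n + to]`.
/// `emit(state, ch)` returns the log emission probability of `ch` in `state`.
fn viterbi(
    chars: &[char],
    n: usize,
    start: &[f64],
    trans: &[f64],
    emit: impl Fn(usize, char) -> f64,
) -> Vec<usize> {
    if chars.is_empty() {
        return Vec::new();
    }
    // score[s] = best log-probability of any path ending in state s.
    let mut score: Vec<f64> = (0..n).map(|s| start[s] + emit(s, chars[0])).collect();
    // back[t][s] = best predecessor of state s at step t + 1.
    let mut back: Vec<Vec<usize>> = Vec::with_capacity(chars.len() - 1);

    for &ch in &chars[1..] {
        let mut next = vec![f64::NEG_INFINITY; n];
        let mut prev = vec![0usize; n];
        for to in 0..n {
            for from in 0..n {
                // Dense matrix: one multiply-add to find the cell, no hashing.
                let cand = score[from] + trans[from * n + to];
                if cand > next[to] {
                    next[to] = cand;
                    prev[to] = from;
                }
            }
            next[to] += emit(to, ch);
        }
        score = next;
        back.push(prev);
    }

    // Pick the best final state, then follow back-pointers to the start.
    let mut best = 0;
    for s in 1..n {
        if score[s] > score[best] {
            best = s;
        }
    }
    let mut path = vec![best];
    for prev in back.iter().rev() {
        best = prev[best];
        path.push(best);
    }
    path.reverse();
    path
}
```

With the sizes from the commit message, `n` would be `STATES` and `trans` would hold 256 × 256 entries; each decoded index can then be passed to `split_state` to read off the POS tag.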
feat: add posseg (part-of-speech tagging) HMM for OOV words

Implement a compound-state HMM Viterbi (4 positions × 64 POS tags = 256 states) for POS-tagging unknown/OOV Chinese words, ported from Python jieba's posseg module.

- Add `posseg.rs` with dense 256×256 transition matrix for O(1) lookup
- Embed probability data (start, trans, emit, char_state_tab) via `include_flate`, gated on `default-dict` feature
- Improve `tag()` to use posseg HMM for OOV CJK words instead of falling back to `"x"` (e.g. `"张尧"` now correctly tagged as `"nr"`/person name)
- Add conversion script (`scripts/convert_posseg.py`) for regenerating `posseg.txt` from Python jieba's pickle files
- Add benchmark for `tag_with_oov` (~2.5µs per call)
feat/posseg
14 days ago
Optimize HMM emit prob lookups and precompute log frequencies (#145)

- Change HMM emit probability maps from phf::Map<&str, f64> to phf::Map<char, f64> since all keys are single characters. This avoids string hashing/comparison overhead in the Viterbi inner loop.
- Precompute log(freq) in Record::log_freq to eliminate per-lookup f64::ln() calls in calc().
- Simplify viterbi() to iterate chars directly instead of using byte offset peekable iterator.
- Use direct array indexing for TRANS_PROBS instead of .get().unwrap().

Benchmarks show 9-13% improvement on segmentation workloads.
main
14 days ago
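Two of the optimizations listed above can be illustrated with a small sketch. It is not the crate's code: the `Record` field layout, the `word_score` and `emit_prob` helpers, and the `floor` fallback are hypothetical, and a standard-library `HashMap<char, f64>` stands in for the `phf::Map<char, f64>` named in the commit. The point is that the logarithm is computed once at construction, and that emission lookups are keyed by `char` rather than by string slice.

```rust
use std::collections::HashMap;

/// Dictionary entry with the logarithm cached at construction time.
/// The field layout is an assumption for illustration.
struct Record {
    freq: usize,
    log_freq: f64,
}

impl Record {
    /// Compute ln(freq) once here instead of on every lookup.
    /// Note that a frequency of 0 yields negative infinity.
    fn new(freq: usize) -> Self {
        Record {
            freq,
            log_freq: (freq as f64).ln(),
        }
    }
}

/// Log-probability of a word given the log of the total frequency mass.
/// Both logarithms are precomputed, so the hot path is a single subtraction.
fn word_score(record: &Record, log_total: f64) -> f64 {
    record.log_freq - log_total
}

/// Emission lookup keyed by `char`, so no string hashing or comparison occurs.
/// `floor` is a hypothetical fallback for characters missing from the table.
fn emit_prob(table: &HashMap<char, f64>, ch: char, floor: f64) -> f64 {
    table.get(&ch).copied().unwrap_or(floor)
}
```

A caller would build each `Record` when the dictionary is loaded and compute `log_total` once per dictionary, leaving only arithmetic and map reads inside the scoring loop.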
Optimize HMM emit prob lookups and precompute log frequencies

- Change HMM emit probability maps from phf::Map<&str, f64> to phf::Map<char, f64> since all keys are single characters. This avoids string hashing/comparison overhead in the Viterbi inner loop.
- Precompute log(freq) in Record::log_freq to eliminate per-lookup f64::ln() calls in calc().
- Simplify viterbi() to iterate chars directly instead of using byte offset peekable iterator.
- Use direct array indexing for TRANS_PROBS instead of .get().unwrap().

Benchmarks show 9-13% improvement on segmentation workloads.
optimize-hmm-and-calc
14 days ago
Replace regex splitting with hand-rolled `SplitByCharacterClass` (#144)

Replace `RE_HAN_DEFAULT`, `RE_HAN_CUT_ALL`, `RE_SKIP_CUT_ALL`, and `RE_SKIP_DEFAULT` regexes in `lib.rs` with inline character classifiers (`is_han_default`, `is_han_cut_all`, `is_skip_cut_all`) and a generic `SplitByCharacterClass` iterator. Also replace HMM `RE_HAN` with `is_hmm_han` classifier, keeping only `RE_SKIP` as regex due to its complex pattern `([a-zA-Z0-9]+(?:.\d+)?%?)`.

Profiling showed regex `find_fwd`/`find_rev` accounted for ~29% of CPU time; this drops to <1% with the character class approach.

Benchmark improvements (vs previous commit on `add-byte-positions`):

- `no_hmm`: 1.22 µs → 1.02 µs (-16%)
- `with_hmm`: 1.82 µs → 1.50 µs (-18%)
- `cut_for_search`: 2.31 µs → 2.00 µs (-14%)
main
14 days ago
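A character-class splitter of the kind described above might look roughly like this. Only the name `SplitByCharacterClass` comes from the commit message; the struct fields, the `(bool, &str)` item type, and the `is_han_like` classifier are assumptions made for illustration, and the classifier is deliberately much narrower than the character ranges the real regexes covered. The sketch yields maximal runs of characters that share the same classifier result, which is the job the regex `find` calls were doing.

```rust
/// Iterator over maximal runs of characters that share a classifier result.
/// Hypothetical shape; the real type in `lib.rs` may differ.
struct SplitByCharacterClass<'a, F: Fn(char) -> bool> {
    rest: &'a str,
    classify: F,
}

impl<'a, F: Fn(char) -> bool> Iterator for SplitByCharacterClass<'a, F> {
    /// (classifier result for the run, the run itself)
    type Item = (bool, &'a str);

    fn next(&mut self) -> Option<Self::Item> {
        // An empty remainder ends the iteration.
        let first = self.rest.chars().next()?;
        let class = (self.classify)(first);
        // Byte offset of the first char whose class differs, or end of input.
        let end = self
            .rest
            .char_indices()
            .find(|&(_, c)| (self.classify)(c) != class)
            .map_or(self.rest.len(), |(i, _)| i);
        // `end` always falls on a char boundary, so `split_at` cannot panic.
        let (run, rest) = self.rest.split_at(end);
        self.rest = rest;
        Some((class, run))
    }
}

/// Simplified stand-in classifier: CJK Unified Ideographs plus ASCII
/// alphanumerics. This is an assumption, not the crate's `is_han_default`.
fn is_han_like(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c) || c.is_ascii_alphanumeric()
}
```

Constructing `SplitByCharacterClass { rest: text, classify: is_han_like }` and iterating gives alternating matched and unmatched slices that borrow from `text`, so no allocation or regex engine is involved in the split.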
Latest Branches
Performance gauge: 0%
feat: add posseg (part-of-speech tagging) HMM for OOV words
#146
14 days ago
18a9fff
feat/posseg
Performance gauge: +16%
Optimize HMM emit prob lookups and precompute log frequencies
#145
14 days ago
306b0c2
optimize-hmm-and-calc
Performance gauge: +38%
Replace regex splitting with hand-rolled `SplitByCharacterClass`
#144
14 days ago
9babb6a
replace-regex-with-char-class