Releases: benbrandt/text-splitter
v0.22.0
v0.21.0
Breaking Changes
- Special tokens are now also encoded by both the Huggingface and Tiktoken tokenizers. This is closer to the default behavior on the Python side, and ensures that any tokens a model adds at the beginning or end of a sequence are counted as well. This is especially important for embedding models that add a special token to the beginning of the sequence: previously, the generated chunks could fail to fit within the context window because these tokens were not counted.
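To illustrate why counting special tokens matters, here is a minimal sketch using a toy whitespace tokenizer. The `encode` and `fits` helpers and the `[CLS]`/`[SEP]` markers are hypothetical stand-ins chosen for illustration; they are not the crate's API or the behavior of any particular model.

```python
# Toy stand-in for a real tokenizer: splits on whitespace and, like many
# embedding models, optionally wraps the sequence in two special tokens.
# These names are hypothetical and are not part of text-splitter.
def encode(text: str, add_special_tokens: bool) -> list[str]:
    tokens = text.split()
    if add_special_tokens:
        return ["[CLS]", *tokens, "[SEP]"]
    return tokens


def fits(text: str, capacity: int, add_special_tokens: bool) -> bool:
    """Return True if the encoded chunk is within the token capacity."""
    return len(encode(text, add_special_tokens)) <= capacity


chunk = "the quick brown fox"  # 4 whitespace tokens
capacity = 4

# Sized without special tokens, the chunk appears to fit ...
fits_without = fits(chunk, capacity, add_special_tokens=False)
# ... but the model would receive 6 tokens, which overflows the window.
fits_with = fits(chunk, capacity, add_special_tokens=True)
print(fits_without, fits_with)
```

Counting with special tokens included means a splitter has to produce a smaller chunk here, which is the trade-off this release makes.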
What's New
Rust
- MSRV is now 1.80, which removes the dependency on `once_cell`.
Full Changelog: v0.20.1...v0.21.0
v0.20.1
Fixes
- Python: correctly specify version for compatibility with `uv` installations.
Full Changelog: v0.20.0...v0.20.1
v0.20.0
Breaking Changes
- Switched backing Unicode segmentation implementation from `unicode-segmentation` to `icu_segmenter`. This brings some modest performance gains, along with being able to leverage the official Unicode crate. There may be slight differences in chunk behavior in some edge cases, so I am treating this as a breaking change.
Full Changelog: v0.19.1...v0.20.0
v0.19.1
What's New
- Python splitters have new `chunk_all` and `chunk_all_indices` methods so that multiple texts can be processed in parallel. (For Rust, you should be able to use `rayon` to do this already.)
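The shape of such a batch call can be sketched in plain Python. Everything below is a hypothetical stand-in: the `chunks` helper is a naive word packer, not the library's splitter, and this `chunk_all` only mirrors the idea of mapping a splitter over many texts and returning one list of chunks per input. The exact signature of the real method is an assumption not confirmed by the note above.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(text: str, capacity: int) -> list[str]:
    """Hypothetical stand-in splitter: greedily pack words into chunks
    of at most `capacity` characters (a single oversized word is kept whole)."""
    result, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= capacity or not current:
            current = candidate
        else:
            result.append(current)
            current = word
    if current:
        result.append(current)
    return result


def chunk_all(texts: list[str], capacity: int) -> list[list[str]]:
    """Split every text, returning one list of chunks per input, in input order.

    Threads are used only to show the batch shape; pure-Python work like this
    does not gain real parallelism from threads, unlike a native implementation.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda t: chunks(t, capacity), texts))


texts = ["a bb ccc", "dddd e"]
results = chunk_all(texts, capacity=4)
print(results)
```

Because `pool.map` preserves input order, `results[i]` always holds the chunks for `texts[i]`, which is the property a caller would rely on when processing a batch.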
Full Changelog: v0.19.0...v0.19.1
v0.19.0
v0.18.1
v0.18.0
v0.17.1
v0.17.0
Breaking Changes
- Support [email protected] for CodeSplitters.
- Due to a slight change in the backing unicode segmentation implementation, there are some slight shifts in behavior for CodeSplitters as well (in my tests, mostly that semicolons have a more logical grouping with previous content).
Full Changelog: v0.16.1...v0.17.0