Releases: benbrandt/text-splitter

v0.22.0

17 Jan 10:15
217fb50

Breaking Changes

  • Reverted the change to special token behavior introduced in v0.21. It had many unintended side effects and does not seem to be recommended for chunking.

Full Changelog: v0.21.0...v0.22.0

v0.21.0

16 Jan 07:55
9da8748

Breaking Changes

  • Special tokens are now also encoded by both Huggingface and Tiktoken tokenizers. This is closer to the default behavior on the Python side, and ensures that any tokens a model adds at the beginning or end of a sequence are counted as well. This is especially important for embedding models that add a special token to the beginning of the sequence: previously, the generated chunks could end up not fitting within the context window because of it.
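
To make the sizing problem concrete, here is a minimal sketch using a toy whitespace tokenizer. The function names and the assumption of exactly two special tokens per sequence are hypothetical and are not part of the text-splitter, tokenizers, or tiktoken-rs APIs.

```rust
/// Toy whitespace tokenizer (illustrative only, not the crate's sizer).
/// When `add_special_tokens` is true, we assume a hypothetical model that
/// wraps every sequence in two special tokens (e.g. a start and an end marker).
fn token_count(text: &str, add_special_tokens: bool) -> usize {
    let content = text.split_whitespace().count();
    let special = if add_special_tokens { 2 } else { 0 };
    content + special
}

/// Returns true when the chunk, as the model would actually see it,
/// fits inside the context window.
fn fits_context(chunk: &str, context_window: usize) -> bool {
    token_count(chunk, true) <= context_window
}
```

With a window of 4 tokens, the chunk "a b c d" looks fine when special tokens are ignored (4 tokens) but overflows once they are counted (6 tokens), which is the mismatch this release tried to address.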

What's New

Rust

  • MSRV is now 1.80 to remove dependency on once_cell.
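
The note does not say which standard library API replaced once_cell. Rust 1.80 stabilized `std::sync::LazyLock`, so a plausible reading is a swap along these lines. The `SEPARATORS` table and `separator_level` function are hypothetical and only illustrate the pattern.

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

// Hypothetical lookup table, built once on first access.
// Before Rust 1.80 this pattern typically needed `once_cell::sync::Lazy`.
static SEPARATORS: LazyLock<HashMap<&'static str, usize>> =
    LazyLock::new(|| HashMap::from([("\n\n", 2), ("\n", 1)]));

/// Looks up the (made-up) semantic level of a separator.
fn separator_level(sep: &str) -> Option<usize> {
    SEPARATORS.get(sep).copied()
}
```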

Full Changelog: v0.20.1...v0.21.0

v0.20.1

01 Jan 20:22

Fixes

  • Python: correctly specify version for compatibility with uv installations.

Full Changelog: v0.20.0...v0.20.1

v0.20.0

14 Dec 20:50

Breaking Changes

  • Switched the backing Unicode segmentation implementation from unicode-segmentation to icu_segmenter. This brings some modest performance gains, along with being able to leverage the official Unicode crate. There may be slight differences in chunk behavior in some edge cases, so this is treated as a breaking change.

Full Changelog: v0.19.1...v0.20.0

v0.19.1

14 Dec 07:07

What's New

  • Python splitters have new chunk_all and chunk_all_indices methods so that multiple texts can be processed in parallel. (For Rust, you should be able to use rayon to do this already.)
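
As a rough illustration of processing several texts in parallel on the Rust side, the sketch below uses only the standard library so that it stays self-contained. The `chunk` function is a stand-in fixed-width splitter, not the crate's splitter, and `chunk_all` here is a hypothetical helper that merely borrows the Python method's name. With rayon, the same shape would usually be written with a parallel iterator over the texts.

```rust
use std::thread;

/// Stand-in for a real splitter: fixed-size chunks by character count.
/// `max_chars` must be greater than zero.
fn chunk(text: &str, max_chars: usize) -> Vec<String> {
    text.chars()
        .collect::<Vec<_>>()
        .chunks(max_chars)
        .map(|c| c.iter().collect())
        .collect()
}

/// Chunks every text on its own scoped thread and returns the results
/// in the same order as the inputs.
fn chunk_all(texts: &[&str], max_chars: usize) -> Vec<Vec<String>> {
    thread::scope(|s| {
        let handles: Vec<_> = texts
            .iter()
            .map(|&t| s.spawn(move || chunk(t, max_chars)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

One thread per text is fine for a handful of documents. For large batches a work-stealing pool such as rayon's avoids oversubscribing the machine.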

Full Changelog: v0.19.0...v0.19.1

v0.19.0

28 Nov 10:49
9248906

Breaking Changes

  • Update to tokenizers v0.21

Full Changelog: v0.18.1...v0.19.0

v0.18.1

25 Oct 19:31
977b0c6

What's New

  • Ensure tokenizer sizers with truncation parameters count their overflow encodings by @Jeadie in #433
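
The idea behind this fix can be sketched with a toy model of truncation. The `Encoding` struct and helper functions below are hypothetical simplifications and do not mirror the tokenizers crate's types. They assume truncation keeps the first `max_len` tokens and moves the rest, without overlap, into overflow encodings.

```rust
/// Toy encoding: the truncated main tokens plus any overflow pieces.
struct Encoding {
    tokens: Vec<String>,
    overflowing: Vec<Vec<String>>,
}

/// Splits on whitespace and truncates to `max_len` tokens (must be > 0),
/// spilling the remainder into overflow pieces of at most `max_len` tokens.
fn encode_truncated(text: &str, max_len: usize) -> Encoding {
    let all: Vec<String> = text.split_whitespace().map(String::from).collect();
    let mut pieces = all.chunks(max_len).map(|c| c.to_vec());
    Encoding {
        tokens: pieces.next().unwrap_or_default(),
        overflowing: pieces.collect(),
    }
}

/// Undercounts: only sees the truncated main encoding.
fn size_ignoring_overflow(e: &Encoding) -> usize {
    e.tokens.len()
}

/// Counts the main encoding and every overflow piece.
fn size_with_overflow(e: &Encoding) -> usize {
    e.tokens.len() + e.overflowing.iter().map(Vec::len).sum::<usize>()
}
```

For a five-token text truncated to two tokens, the first sizer reports 2 while the second reports 5. A chunk sizer that ignores overflow would therefore treat arbitrarily long text as fitting.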

Full Changelog: v0.18.0...v0.18.1

v0.18.0

14 Oct 12:57
27fefce

Breaking Changes

  • Change supported tiktoken-rs version to 0.6.x

Full Changelog: v0.17.1...v0.18.0

v0.17.1

11 Oct 05:07
4eb54cf

What's New

  • Loosen regex crate version requirement

Full Changelog: v0.17.0...v0.17.1

v0.17.0

06 Oct 13:33
474f5a6

Breaking Changes

  • Support [email protected] for CodeSplitters.
  • Due to a slight change in the backing Unicode segmentation implementation, there are some small shifts in behavior for CodeSplitters as well (in my tests, mostly that semicolons are grouped more logically with the preceding content).

Full Changelog: v0.16.1...v0.17.0