[Backport 2.x] feat: implement text chunking processor with fixed token length and delimiter algorithm #644
Merged
…elimiter algorithm (#607) * implement chunking processor and fixed token length Signed-off-by: yuye-aws <[email protected]> * initialize node client for document chunking processor Signed-off-by: yuye-aws <[email protected]> * initialize document chunking processor with analysis registry Signed-off-by: yuye-aws <[email protected]> * chunker factory create with analysis registry Signed-off-by: yuye-aws <[email protected]> * implement tokenizer in fixed token length algorithm with analysis registry Signed-off-by: yuye-aws <[email protected]> * add max token count parsing logic Signed-off-by: yuye-aws <[email protected]> * bug fix for non-existing index Signed-off-by: yuye-aws <[email protected]> * change error log Signed-off-by: yuye-aws <[email protected]> * implement evenly chunk Signed-off-by: yuye-aws <[email protected]> * unit tests for chunker factory Signed-off-by: yuye-aws <[email protected]> * unit tests for chunker factory Signed-off-by: yuye-aws <[email protected]> * add error message for chunker factory tests Signed-off-by: yuye-aws <[email protected]> * resolve comments Signed-off-by: yuye-aws <[email protected]> * Revert "implement evenly chunk" This reverts commit 93dd2f4. 
Signed-off-by: yuye-aws <[email protected]> * add default value logic back Signed-off-by: yuye-aws <[email protected]> * implement unit test for fixed token length chunker Signed-off-by: yuye-aws <[email protected]> * add test cases in unit test for fixed token length chunker Signed-off-by: yuye-aws <[email protected]> * support map type as an input Signed-off-by: yuye-aws <[email protected]> * support map type as an input Signed-off-by: yuye-aws <[email protected]> * bug fix for map type Signed-off-by: yuye-aws <[email protected]> * bug fix for map type Signed-off-by: yuye-aws <[email protected]> * bug fix for map type in document chunking processor Signed-off-by: yuye-aws <[email protected]> * remove system out println Signed-off-by: yuye-aws <[email protected]> * add delimiter chunker Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add UT for delimiter chunker Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add delimiter chunker processor Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add more UTs Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add more UTs Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * basic unit tests for document chunking processor Signed-off-by: yuye-aws <[email protected]> * fix tests for getProcessors in neural search Signed-off-by: yuye-aws <[email protected]> * add unit tests with string, map and nested map type for document chunking processor Signed-off-by: yuye-aws <[email protected]> * add unit tests for parameter valdiation in document chunking processor Signed-off-by: yuye-aws <[email protected]> * add back deleted xml file Signed-off-by: yuye-aws <[email protected]> * restore xml file Signed-off-by: yuye-aws <[email protected]> * integration tests for document chunking processor Signed-off-by: yuye-aws <[email protected]> * add 
back Run_Neural_Search.xml Signed-off-by: yuye-aws <[email protected]> * restore Run_Neural_Search.xml Signed-off-by: yuye-aws <[email protected]> * add changelog Signed-off-by: yuye-aws <[email protected]> * update integration test for cascade processor Signed-off-by: yuye-aws <[email protected]> * add max chunk limit Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * remove useless and apply spotless Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * update error message Signed-off-by: yuye-aws <[email protected]> * change field UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * remove useless and apply spotless Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * change logic of max chunk number Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add max chunk limit into fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * Support list<list<string>> type in embedding and extract validation logic to common class Signed-off-by: zane-neo <[email protected]> Signed-off-by: yuye-aws <[email protected]> * fix unit tests for inference processor Signed-off-by: yuye-aws <[email protected]> * implement unit tests for unit tests with max_chunk_limit in fixed token length Signed-off-by: yuye-aws <[email protected]> * constructor for inference processor Signed-off-by: yuye-aws <[email protected]> * use inference processor Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * draft code for extending inference processor with document chunking processor Signed-off-by: yuye-aws <[email protected]> * api refactor for document chunking processor Signed-off-by: yuye-aws <[email protected]> * remove nested list key for chunking processor Signed-off-by: yuye-aws <[email protected]> * remove unused function Signed-off-by: yuye-aws <[email 
protected]> * remove processor validator Signed-off-by: yuye-aws <[email protected]> * remove processor validator Signed-off-by: yuye-aws <[email protected]> * Revert InferenceProcessor.java Signed-off-by: Yuye Zhu <[email protected]> Signed-off-by: yuye-aws <[email protected]> * revert changes in text embedding and sparse encoding processor Signed-off-by: yuye-aws <[email protected]> * implement chunk with map in document chunking processor Signed-off-by: yuye-aws <[email protected]> * add default delimiter value Signed-off-by: Lu <[email protected]> Signed-off-by: yuye-aws <[email protected]> * implement max chunk logic in document chunking processor Signed-off-by: yuye-aws <[email protected]> * add initial value for max chunk limit in document chunking processor Signed-off-by: yuye-aws <[email protected]> * bug fix in chunking processor: allow 0 max_chunk_limit Signed-off-by: yuye-aws <[email protected]> * implement overlap rate with big decimal Signed-off-by: yuye-aws <[email protected]> * update max chunk limit in delimiter Signed-off-by: yuye-aws <[email protected]> * update parameter setting for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update max chunk limit implementation in chunking processor Signed-off-by: yuye-aws <[email protected]> * fix unit tests for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * spotless apply for document chunking processor Signed-off-by: yuye-aws <[email protected]> * initialize current chunk count Signed-off-by: yuye-aws <[email protected]> * parameter validation for max chunk limit Signed-off-by: yuye-aws <[email protected]> * fix integration tests Signed-off-by: yuye-aws <[email protected]> * fix current UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * change delimiter UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * remove delimiter useless code Signed-off-by: xinyual <[email 
protected]> Signed-off-by: yuye-aws <[email protected]> * add more UT Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add UT for list inside map Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * add UT for list inside map Signed-off-by: xinyual <[email protected]> Signed-off-by: yuye-aws <[email protected]> * update unit tests for chunking processor Signed-off-by: yuye-aws <[email protected]> * add more unit tests for chunking processor Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * add java doc Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * fix import order Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * fix java doc error Signed-off-by: yuye-aws <[email protected]> * fix update ut for fixed token length chunker Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * implement chunk count wrapper for max chunk limit Signed-off-by: yuye-aws <[email protected]> * rename variable end to nextDelimiterPosition Signed-off-by: yuye-aws <[email protected]> * adjust method place Signed-off-by: yuye-aws <[email protected]> * update java doc for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * reanme interface name and fixed token length algorithm name Signed-off-by: yuye-aws <[email protected]> * update fixed token length algorithm configuration for integration tests Signed-off-by: yuye-aws <[email protected]> * make delimiter member 
variables static Signed-off-by: yuye-aws <[email protected]> * remove redundant set field value in execute method Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * add integration tests with more tokenizers Signed-off-by: yuye-aws <[email protected]> * bug fix: unit test failure due to invalid tokenizer Signed-off-by: yuye-aws <[email protected]> * bug fix: token concatenation in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update chunker interface Signed-off-by: yuye-aws <[email protected]> * track chunkCount within function Signed-off-by: yuye-aws <[email protected]> * bug fix: allow white space as the delimiter Signed-off-by: yuye-aws <[email protected]> * fix fixed length chunker Signed-off-by: xinyual <[email protected]> * fix delimiter chunker Signed-off-by: xinyual <[email protected]> * fix chunker factory Signed-off-by: xinyual <[email protected]> * fix UTs Signed-off-by: xinyual <[email protected]> * fix UT and chunker factory Signed-off-by: xinyual <[email protected]> * move analysis_registry to non-runtime parameters Signed-off-by: xinyual <[email protected]> * fix Uts Signed-off-by: xinyual <[email protected]> * avoid java doc change Signed-off-by: xinyual <[email protected]> * move validate to commonUtlis Signed-off-by: xinyual <[email protected]> * remove useless function Signed-off-by: xinyual <[email protected]> * change java doc Signed-off-by: xinyual <[email protected]> * fix Document process ut Signed-off-by: xinyual <[email protected]> * fixed token length: re-implement with start and end offset Signed-off-by: yuye-aws <[email protected]> * update exception message Signed-off-by: yuye-aws <[email protected]> * fix document chunking processor IT Signed-off-by: yuye-aws <[email protected]> * bug fix: adjust start, end content position in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update changelog for 2.x release 
Signed-off-by: yuye-aws <[email protected]> * rename processor Signed-off-by: yuye-aws <[email protected]> * update default delimiter to be \n\n Signed-off-by: yuye-aws <[email protected]> * remove change log in 3.0 unreleased Signed-off-by: yuye-aws <[email protected]> * fix IT failure due to chunking processor rename Signed-off-by: yuye-aws <[email protected]> * update javadoc for text chunking processor factory Signed-off-by: yuye-aws <[email protected]> * adjust functions in chunker interface Signed-off-by: yuye-aws <[email protected]> * move algorithm name definition to concrete chunker class Signed-off-by: yuye-aws <[email protected]> * update string formatted message for text chunking processor Signed-off-by: yuye-aws <[email protected]> * update string formatted message for chunker factory Signed-off-by: yuye-aws <[email protected]> * update string formatted message for chunker parameter validator Signed-off-by: yuye-aws <[email protected]> * update java doc for delimiter algorithm Signed-off-by: yuye-aws <[email protected]> * support range double in chunker parameter validator Signed-off-by: yuye-aws <[email protected]> * update string formatted message for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update sneaky throw with text chunking processor it Signed-off-by: yuye-aws <[email protected]> * add word tokenizer restriction for fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * update error message for multiple algorithms in text chunking processor Signed-off-by: yuye-aws <[email protected]> * add comment in text chunking processor Signed-off-by: yuye-aws <[email protected]> * validate max chunk limit with util parameter class Signed-off-by: yuye-aws <[email protected]> * update comments Signed-off-by: yuye-aws <[email protected]> * update comments Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws <[email protected]> * update java doc Signed-off-by: yuye-aws 
<[email protected]> * make parameter final Signed-off-by: yuye-aws <[email protected]> * implement a map from chunker name to constuctor function in chunker factory Signed-off-by: yuye-aws <[email protected]> * bug fix in chunker factory Signed-off-by: yuye-aws <[email protected]> * remove get all chunkers in chunker factory Signed-off-by: yuye-aws <[email protected]> * remove type check for parameter check for max token count Signed-off-by: yuye-aws <[email protected]> * remove type check for parameter check for analysis registry Signed-off-by: yuye-aws <[email protected]> * implement parser and validator Signed-off-by: yuye-aws <[email protected]> * update comment Signed-off-by: yuye-aws <[email protected]> * provide fixed token length as the default algorithm Signed-off-by: yuye-aws <[email protected]> * adjust exception message Signed-off-by: yuye-aws <[email protected]> * adjust exception message Signed-off-by: yuye-aws <[email protected]> * use object nonnull and require nonnull Signed-off-by: yuye-aws <[email protected]> * apply final to ingest document and chunk count Signed-off-by: yuye-aws <[email protected]> * merge parameter validator into the parser Signed-off-by: yuye-aws <[email protected]> * assign positive default value for max chunk limit Signed-off-by: yuye-aws <[email protected]> * validate supported chunker algorithm in text chunking processor Signed-off-by: yuye-aws <[email protected]> * update parameter setting of max chunk limit Signed-off-by: yuye-aws <[email protected]> * add unit test with non list of string Signed-off-by: yuye-aws <[email protected]> * add unit test with null input Signed-off-by: yuye-aws <[email protected]> * add unit test for tokenization excpetion in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * tune method name in text chunking processor unit test Signed-off-by: yuye-aws <[email protected]> * tune method name in delimiter algorithm unit test Signed-off-by: yuye-aws <[email protected]> * 
add unit test for overlap rate too small in fixed token length algorithm Signed-off-by: yuye-aws <[email protected]> * tune method modifier for all classes Signed-off-by: yuye-aws <[email protected]> * tune code Signed-off-by: yuye-aws <[email protected]> * tune code Signed-off-by: yuye-aws <[email protected]> * tune exception type in parameter parser Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * include max chunk limit in both algorithms Signed-off-by: yuye-aws <[email protected]> * tune comment Signed-off-by: yuye-aws <[email protected]> * allow 0 for max chunk limit Signed-off-by: yuye-aws <[email protected]> * update runtime max chunk limit in text chunking processor Signed-off-by: yuye-aws <[email protected]> * tune code for chunker Signed-off-by: yuye-aws <[email protected]> * implement test for multiple field max chunk limit exceed Signed-off-by: yuye-aws <[email protected]> * tune methods name in text chunking proceesor unit tests Signed-off-by: yuye-aws <[email protected]> * add unit tests for both algorithms with max chunk limit Signed-off-by: yuye-aws <[email protected]> * optimize code Signed-off-by: yuye-aws <[email protected]> * extract max chunk limit check to util class Signed-off-by: yuye-aws <[email protected]> * resolve code review comments Signed-off-by: yuye-aws <[email protected]> * fix unit tests Signed-off-by: yuye-aws <[email protected]> * bug fix: only update runtime max chunk limit when enabled Signed-off-by: yuye-aws <[email protected]> --------- Signed-off-by: yuye-aws <[email protected]> Signed-off-by: xinyual <[email protected]> Signed-off-by: zane-neo <[email protected]> Signed-off-by: Yuye Zhu <[email protected]> Signed-off-by: Lu <[email protected]> Co-authored-by: xinyual <[email protected]> Co-authored-by: zane-neo <[email protected]> Co-authored-by: Lu <[email protected]> (cherry picked from commit eea53aa)
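For readers skimming the commit list above, the fixed token length algorithm it describes (a token limit, an overlap rate handled with decimal arithmetic, and a max chunk limit) can be sketched roughly as follows. This is a minimal Python sketch, not the plugin's Java code: the function name `chunk_fixed_token_length`, the whitespace tokenizer, the default values, and raising an error once the limit is exceeded are all assumptions made for illustration. Per the commits, the actual processor tokenizes through the analysis registry and slices the original text by start and end offsets.

```python
from decimal import Decimal


def chunk_fixed_token_length(text, token_limit=4, overlap_rate="0.0", max_chunk_limit=100):
    """Split text into chunks of at most token_limit tokens (hypothetical helper).

    Assumptions: whitespace tokenization, chunks rebuilt by joining tokens with
    a single space, and a ValueError when max_chunk_limit would be exceeded.
    """
    if token_limit <= 0:
        raise ValueError("token_limit must be positive")

    tokens = text.split()

    # Decimal avoids binary floating-point surprises when scaling the rate;
    # int() truncates, which equals floor for non-negative values.
    overlap = int(Decimal(str(overlap_rate)) * token_limit)
    if not 0 <= overlap < token_limit:
        raise ValueError("overlap_rate must leave at least one new token per chunk")

    step = token_limit - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        if len(chunks) >= max_chunk_limit:
            raise ValueError("chunk count exceeds max_chunk_limit")
        chunks.append(" ".join(tokens[start:start + token_limit]))
        if start + token_limit >= len(tokens):
            break  # this window already reached the last token
        start += step
    return chunks


# With a limit of 4 tokens and 50% overlap, each chunk repeats the last
# two tokens of the previous one.
print(chunk_fixed_token_length("a b c d e f g h i j", token_limit=4, overlap_rate="0.5"))
# ['a b c d', 'c d e f', 'e f g h', 'g h i j']
```

The window advances by `token_limit - overlap` tokens, so a larger overlap rate yields more chunks for the same input, which is one reason a chunk limit is useful.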
opensearch-trigger-bot (bot) requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, zane-neo, ylwu-amzn, jngz-es and vibrantvarun as code owners, March 18, 2024 04:15
model-collapse approved these changes, Mar 18, 2024
5 tasks
Signed-off-by: yuye-aws <[email protected]>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##                2.x     #644      +/-   ##
============================================
+ Coverage     83.28%   84.77%   +1.48%
- Complexity      674      751      +77
============================================
  Files            52       59       +7
  Lines          2088     2325     +237
  Branches        338      374      +36
============================================
+ Hits           1739     1971     +232
- Misses          196      198       +2
- Partials        153      156       +3
zane-neo approved these changes, Mar 18, 2024
Backport eea53aa from #607
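
The delimiter algorithm named in the PR title can be illustrated in the same spirit. The commit list sets the default delimiter to `\n\n` and allows white space as a delimiter; everything else in this Python sketch is an assumption rather than the plugin's Java implementation, in particular the function name `chunk_by_delimiter`, keeping the delimiter attached to the end of each chunk, the default limit, and raising an error once the limit is exceeded.

```python
def chunk_by_delimiter(text, delimiter="\n\n", max_chunk_limit=100):
    """Split text after each occurrence of delimiter (hypothetical helper).

    Assumptions: the delimiter stays attached to the chunk it terminates, and
    a ValueError is raised when max_chunk_limit would be exceeded.
    """
    if delimiter == "":
        raise ValueError("delimiter must not be empty")

    chunks = []
    start = 0
    while True:
        position = text.find(delimiter, start)
        if position == -1:
            break  # no further delimiter in the remaining text
        if len(chunks) >= max_chunk_limit:
            raise ValueError("chunk count exceeds max_chunk_limit")
        end = position + len(delimiter)
        chunks.append(text[start:end])
        start = end

    # Whatever follows the last delimiter becomes the final chunk.
    if start < len(text):
        if len(chunks) >= max_chunk_limit:
            raise ValueError("chunk count exceeds max_chunk_limit")
        chunks.append(text[start:])
    return chunks


print(chunk_by_delimiter("first paragraph\n\nsecond paragraph"))
# ['first paragraph\n\n', 'second paragraph']
```

Keeping the delimiter inside each chunk means the chunks concatenate back to the original text, which makes the sketch easy to check.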