Output well-formed UTF-8 bytes in SimpleTextCodec's segmentinfos #12897
If needed, I'm happy to add versions of
The change looks good; I just left a minor suggestion about the test.
}
byte[] bytes = new byte[(int) length];
input.readBytes(bytes, 0, bytes.length);
StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
Maybe use Term#toString(BytesRef), which includes such checks already and configures a few more flags on the decoder that look useful?
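For readers following the review, here is a minimal sketch of the kind of strict UTF-8 check under discussion, using only JDK classes. The class and method names (`StrictUtf8`, `isWellFormed`) are hypothetical and this is not the code of `Term#toString(BytesRef)` or of the test in this PR. It only illustrates setting the decoder's error actions explicitly so that malformed input raises an exception rather than being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Lucene.
final class StrictUtf8 {

  /** Returns true if the whole array decodes as UTF-8 without any malformed sequence. */
  static boolean isWellFormed(byte[] bytes) {
    try {
      StandardCharsets.UTF_8
          .newDecoder()
          // REPORT makes the decoder throw instead of substituting U+FFFD.
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Plain ASCII text is valid UTF-8.
    System.out.println(isWellFormed("id 0a1b".getBytes(StandardCharsets.UTF_8)));
    // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
    System.out.println(isWellFormed(new byte[] {(byte) 0xC3, (byte) 0x28}));
  }
}
```

`REPORT` is already the default action for a decoder obtained from `newDecoder()`, so setting it explicitly mainly documents the intent.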
The SimpleTextSegmentInfoFormat was writing the random byte array used as a segment's ID directly -- not converting to a simple text representation of the byte array. As a result, the segment infos were often malformed.
f103563 to 59c3b8c
I implemented a similar change for binary doc values at #12987.
Description
The SimpleTextSegmentInfoFormat was writing the random byte array used as a segment's ID directly -- not converting to a simple text representation of the byte array. As a result, the segment infos were often malformed (as UTF-8 text).
The included test was failing before the change to write out the text representation of the byte array.
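To illustrate the idea of writing a text representation instead of raw bytes, here is a minimal sketch using only JDK classes. Hex is an assumption chosen for the example; the exact textual encoding used by the PR is not shown on this page, and `SegmentIdText`, `toHex`, and `fromHex` are hypothetical names. The point is that the encoded ID contains only ASCII characters, so it is always well-formed UTF-8 and still round-trips to the original bytes.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical helper for illustration only; the PR's actual encoding may differ.
final class SegmentIdText {
  private static final char[] HEX = "0123456789abcdef".toCharArray();

  /** Encodes each byte as two lowercase hex digits. */
  static String toHex(byte[] id) {
    StringBuilder sb = new StringBuilder(id.length * 2);
    for (byte b : id) {
      sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
    }
    return sb.toString();
  }

  /** Decodes a string produced by toHex back into the original bytes. */
  static byte[] fromHex(String text) {
    byte[] id = new byte[text.length() / 2];
    for (int i = 0; i < id.length; i++) {
      int hi = Character.digit(text.charAt(2 * i), 16);
      int lo = Character.digit(text.charAt(2 * i + 1), 16);
      id[i] = (byte) ((hi << 4) | lo);
    }
    return id;
  }

  public static void main(String[] args) {
    byte[] id = new byte[16]; // a random ID, as in the description above
    new SecureRandom().nextBytes(id);

    String text = toHex(id);
    byte[] onDisk = text.getBytes(StandardCharsets.UTF_8); // ASCII only, so valid UTF-8

    // Reading the text back recovers the original ID.
    byte[] roundTripped = fromHex(new String(onDisk, StandardCharsets.UTF_8));
    System.out.println(Arrays.equals(id, roundTripped));
  }
}
```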