HPCC-30559 Update DataPatterns.Profile to v1.9.4 #17907
https://track.hpccsystems.com/browse/HPCC-30559
@dcamper One improvement suggestion, and one comment about size vs. length, which makes this trickier to review.
    // ASCII; continue scan
    bytes += 1;
}
else if ((0xC2 <= bytes[0] && bytes[0] <= 0xDF) && (0x80 <= bytes[1] && bytes[1] <= 0xBF))
Possibility of accessing memory outside the buffer if bytes+1 == endPtr. The same applies to the rest of the cases.
Actually, because the input is a UTF-8 string, this is valid, but it is confusing because of the size/length mix-up, so it is worth a comment explaining why it is valid.
Added inline checks on each if() to avoid scanning past the end of the buffer.
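For readers following the thread, here is a minimal sketch of what such an inline check can look like for the two-byte case. The helper name startsTwoByteSequence is hypothetical and is not taken from the PR; only the byte ranges and the (bytes+1 < endPtr) idea come from the snippets quoted in this conversation.

```cpp
#include <cstddef>

// Hypothetical helper (not from the PR): true if a well-formed two-byte
// UTF-8 sequence starts at p. Each range check comes before the read it
// guards, so && short-circuits and p[1] is never read when p is the last byte.
bool startsTwoByteSequence(const unsigned char* p, const unsigned char* endPtr)
{
    return (p < endPtr)
        && (0xC2 <= p[0] && p[0] <= 0xDF)   // lead byte of a two-byte sequence
        && (p + 1 < endPtr)                 // is there room for a second byte?
        && (0x80 <= p[1] && p[1] <= 0xBF);  // continuation byte
}
```

The three- and four-byte cases would presumably follow the same pattern, with (p + 2 < endPtr) and (p + 3 < endPtr) placed ahead of the corresponding reads.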
    return false;

const unsigned char* bytes = reinterpret_cast<const unsigned char*>(str);
const unsigned char* endPtr = bytes + lenStr;
Technically you should have
size32_t sizeUtf8 = rtlUtf8Length(lenStr, str);
since length is the number of code points, not the number of bytes. But that would be inefficient.
I think the clean way is two loops: one that scans all the ASCII characters, then a check of if (bytes < endPtr) for the last code point.
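One possible reading of the two-loop idea, sketched under the assumption that the input is well-formed UTF-8: an ASCII byte is always exactly one code point, so scanning at most "length" bytes never leaves the buffer, and if all of them are ASCII the whole string is ASCII. The function name hasNonAsciiWithinLength is hypothetical and this is not the code that was merged.

```cpp
#include <cstddef>

// Hypothetical helper, not part of the PR. lenCodePoints is the UTF-8
// *length* (number of code points), which for well-formed UTF-8 is never
// larger than the buffer size in bytes.
bool hasNonAsciiWithinLength(const char* str, std::size_t lenCodePoints)
{
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(str);
    const unsigned char* endPtr = bytes + lenCodePoints;

    // Loop 1: skip the leading run of ASCII bytes (one byte per code point).
    while (bytes < endPtr && *bytes < 0x80)
        ++bytes;

    // Stopping early means a non-ASCII byte was found; a second loop would
    // validate the multi-byte sequences from this position onward.
    return bytes < endPtr;
}
```

The second loop is left out here because it needs the true byte size of the buffer, which is the point the following reply addresses.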
Actually, I realized that this function shouldn't be using UTF8 as a parameter data type at all. It should be DATA. Making that change means lenStr makes sense again.
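To illustrate the size/length distinction discussed above: for UTF-8 text the length counts code points, while a byte-oriented parameter reports the number of bytes. The sketch below counts code points in a well-formed UTF-8 buffer by skipping continuation bytes; countCodePoints is a hypothetical helper for illustration only and is not an HPCC runtime function.

```cpp
#include <cstddef>

// Hypothetical helper (illustration only): counts code points in a
// well-formed UTF-8 buffer of sizeBytes bytes. Continuation bytes have the
// bit pattern 10xxxxxx, so every other byte starts a new code point.
std::size_t countCodePoints(const char* str, std::size_t sizeBytes)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < sizeBytes; ++i)
    {
        const unsigned char b = static_cast<unsigned char>(str[i]);
        if ((b & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For pure ASCII input the two numbers are equal, which is why computing endPtr from a length only becomes a problem once multi-byte characters appear.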
// Determine if a UTF-8 string really contains UTF-8 characters
#UNIQUENAME(IsUTF8);
LOCAL BOOLEAN %IsUTF8%(UTF8 str) := EMBED(C++)
    if (lenStr == 0)
Also worth
#option pure
in the function
Added!
    // ASCII; continue scan
    bytes += 1;
}
else if ((0xC2 <= bytes[0] && bytes[0] <= 0xDF) && (bytes+1 < endPtr) && (0x80 <= bytes[1] && bytes[1] <= 0xBF))
If the parameter is a DATA, you need to protect against accessing beyond the end of the buffer when you are looking at the last bytes of the string.
That is the intended purpose of the newly-added (bytes+1 < endPtr) clause in this if(). With short-circuiting, that should prevent scanning beyond endPtr, no?
Yes it does. I should have paid closer attention to the change.
@dcamper please squash and I will merge
Changes include:
* Support UTF-8 strings in Mode values and example text patterns
* Security updates
* Better identify upper- and lower-case Unicode characters in text patterns
* Scan Unicode and UTF-8 strings to see if they can be represented with a STRING data type instead
Signed-off-by: Dan S. Camper <[email protected]>
6083814 to bc99237
@ghalliday Commits squashed. Please merge. Thanks!
Changes include:
Support UTF-8 strings in Mode values and example text patterns
Security updates
Better identify upper- and lower-case Unicode characters in text patterns
Scan Unicode and UTF-8 strings to see if they can be represented with a STRING data type instead
Testing:
Manual testing against select datasets.