HPCC-30559 Update DataPatterns.Profile to v1.9.4 #17907

dcamper · 2023-10-17T14:31:00Z

Changes include:

Support UTF-8 strings in Mode values and example text patterns
Security updates
Better identify upper- and lower-case Unicode characters in text patterns
Scan Unicode and UTF-8 strings to see if they can be represented with a STRING data type instead
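
As a rough illustration of the last item, the simplest version of such a scan checks whether every byte is 7-bit ASCII. This is a minimal sketch only, not the code in Profile.ecl, which checks multi-byte UTF-8 ranges as the review hunks on this page show. The helper name `isAllAscii` and its signature are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of DataPatterns): returns true when every
// byte in the buffer is 7-bit ASCII, meaning no multi-byte UTF-8 sequence
// is present and a single-byte string type could hold the same text.
static bool isAllAscii(const unsigned char* data, std::size_t sizeBytes)
{
    assert(data != nullptr || sizeBytes == 0);

    for (std::size_t i = 0; i < sizeBytes; ++i)
    {
        if (data[i] >= 0x80)
            return false;   // lead or continuation byte of a multi-byte sequence
    }

    return true;
}
```

Note that the second argument is a size in bytes, not a count of code points.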

Type of change:

This change is a bug fix (non-breaking change which fixes an issue).
This change is a new feature (non-breaking change which adds functionality).
This change improves the code (refactor or other change that does not change the functionality)
This change fixes warnings (the fix does not alter the functionality or the generated code)
This change is a breaking change (fix or feature that will cause existing behavior to change).
This change alters the query API (existing queries will have to be recompiled)

Checklist:

Smoketest:

Send notifications about my Pull Request position in Smoketest queue.
Test my draft Pull Request.

Testing:

Manual testing against select datasets.

github-actions · 2023-10-17T14:31:23Z

https://track.hpccsystems.com/browse/HPCC-30559
Jira updated

ghalliday

@dcamper one improvement suggestion, and one comment about size vs. length, which makes this trickier to review.

ghalliday · 2023-10-19T16:56:05Z

ecllibrary/std/DataPatterns/Profile.ecl

+                    // ASCII; continue scan
+                    bytes += 1;
+                }
+                else if ((0xC2 <= bytes[0] && bytes[0] <= 0xDF) && (0x80 <= bytes[1] && bytes[1] <= 0xBF))


Possibility of accessing memory outside the buffer if bytes+1 == endPtr. The same applies to the rest of the cases.

Actually, because the input is a UTF-8 string this is valid, but it is confusing because of the size/length mix-up, so it is worth a comment to explain why it is valid.

Added inline checks on each if() to avoid scanning past the end of the buffer.

ghalliday · 2023-10-19T17:01:08Z

ecllibrary/std/DataPatterns/Profile.ecl

+                return false;
+
+            const unsigned char*    bytes = reinterpret_cast<const unsigned char*>(str);
+            const unsigned char*    endPtr = bytes + lenStr;


Technically you should have
size32_t sizeUtf8 = rtlUtf8Length(lenStr, str);
since length is the number of code points, not the number of bytes. But that would be inefficient.

I think the clean way is two loops. One which scans all the ASCII characters. Then a check if (bytes < endPtr) of the last codepoint.

Actually, I realized that this function shouldn't be using UTF8 as a parameter data type at all. It should be DATA. Making that change means lenStr makes sense again.
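
To make the size/length distinction concrete, here is a minimal sketch that counts code points in a UTF-8 buffer of known byte size. The helper name `countCodePoints` is an assumption for illustration and is not a runtime library function. It assumes the input is well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is NOT a continuation byte (continuation bytes match the
// bit pattern 10xxxxxx).
static std::size_t countCodePoints(const unsigned char* data, std::size_t sizeBytes)
{
    assert(data != nullptr || sizeBytes == 0);

    std::size_t count = 0;
    for (std::size_t i = 0; i < sizeBytes; ++i)
    {
        if ((data[i] & 0xC0) != 0x80)
            ++count;
    }

    return count;
}
```

For the five bytes of "café" encoded as UTF-8 (`63 61 66 C3 A9`), the size is 5 bytes but the length is 4 code points. If a length in code points is used as though it were a byte count, `bytes + lenStr` stops one byte short of the real end of this buffer.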

ghalliday · 2023-10-19T17:12:05Z

ecllibrary/std/DataPatterns/Profile.ecl

+        // Determine if a UTF-8 string really contains UTF-8 characters
+        #UNIQUENAME(IsUTF8);
+        LOCAL BOOLEAN %IsUTF8%(UTF8 str) := EMBED(C++)
+            if (lenStr == 0)


Also worth
#option pure
in the function

ghalliday · 2023-10-24T11:16:58Z

ecllibrary/std/DataPatterns/Profile.ecl

+                    // ASCII; continue scan
+                    bytes += 1;
+                }
+                else if ((0xC2 <= bytes[0] && bytes[0] <= 0xDF) && (bytes+1 < endPtr) && (0x80 <= bytes[1] && bytes[1] <= 0xBF))


If the parameter is a DATA, you need to protect against accessing beyond the end of the buffer when you are looking at the last bytes of the string.

That is the intended purpose of the newly-added (bytes+1 < endPtr) clause in this if(). With short-circuiting, that should prevent scanning beyond endPtr, no?

Yes it does. I should have paid closer attention to the change.
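
The short-circuit behavior discussed in this thread can be isolated in a small sketch. The condition is the one from the hunk above. The wrapper function `isTwoByteSequence` is a hypothetical name added for illustration and does not appear in the pull request.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrapper around the condition from the diff hunk: returns true
// when a well-formed two-byte UTF-8 sequence starts at 'bytes'.
// Because && evaluates left to right and stops at the first false operand,
// bytes[1] is only read after (bytes + 1 < endPtr) has succeeded.
static bool isTwoByteSequence(const unsigned char* bytes, const unsigned char* endPtr)
{
    assert(bytes < endPtr);   // caller guarantees at least one readable byte

    return (0xC2 <= bytes[0] && bytes[0] <= 0xDF)
        && (bytes + 1 < endPtr)
        && (0x80 <= bytes[1] && bytes[1] <= 0xBF);
}
```

With a buffer holding only the lead byte `0xC3`, the second operand is false and the function returns false without touching memory past `endPtr`.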

ghalliday

@dcamper please squash and I will merge

ghalliday · 2023-10-25T10:55:49Z

ecllibrary/std/DataPatterns/Profile.ecl

+                    // ASCII; continue scan
+                    bytes += 1;
+                }
+                else if ((0xC2 <= bytes[0] && bytes[0] <= 0xDF) && (bytes+1 < endPtr) && (0x80 <= bytes[1] && bytes[1] <= 0xBF))


Yes it does. I should have paid closer attention to the change.

Changes include:

* Support UTF-8 strings in Mode values and example text patterns
* Security updates
* Better identify upper- and lower-case Unicode characters in text patterns
* Scan Unicode and UTF-8 strings to see if they can be represented with a STRING data type instead

Signed-off-by: Dan S. Camper <[email protected]>

dcamper · 2023-10-25T11:48:52Z

@ghalliday Commits squashed. Please merge. Thanks!

dcamper requested a review from ghalliday October 17, 2023 14:31

ghalliday reviewed Oct 19, 2023

dcamper requested a review from ghalliday October 19, 2023 18:55

dcamper changed the title from "HPCC-30559 Update DataPatterns.Profile to v1.9.3" to "HPCC-30559 Update DataPatterns.Profile to v1.9.4" Oct 20, 2023

ghalliday requested changes Oct 24, 2023

dcamper requested a review from ghalliday October 24, 2023 12:44

ghalliday approved these changes Oct 25, 2023

dcamper force-pushed the hpcc-30559-datapatterns-profile-1.9.3 branch from 6083814 to bc99237 Compare October 25, 2023 11:41

ghalliday merged commit 3522729 into hpcc-systems:candidate-9.4.x Oct 26, 2023
30 checks passed
