Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-30559 Update DataPatterns.Profile to v1.9.4 #17907

Conversation

dcamper
Copy link
Contributor

@dcamper dcamper commented Oct 17, 2023

Changes include:

  • Support UTF-8 strings in Mode values and example text patterns

  • Security updates

  • Better identify upper- and lower-case Unicode characters in text patterns

  • Scan Unicode and UTF-8 strings to see if they can be represented with a STRING data type instead

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

Manual testing against select datasets.

@dcamper dcamper requested a review from ghalliday October 17, 2023 14:31
@github-actions
Copy link

Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcamper one improvement suggestion, and one comment about size v length making it trickier to review.

// ASCII; continue scan
bytes += 1;
}
else if ((0xC2 <= bytes[0] && bytes[0] <= 0xDF) && (0x80 <= bytes[1] && bytes[1] <= 0xBF))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibility of accessing out of buffer memory if bytes+1==endPtr. Same for rest of the cases.

Actually because the input is a utf8 string then this is valid, but confusing because of the size/length confusion - so worth a comment to explain why it is valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added inline checks on each if() to avoid scanning past the end of the buffer.

return false;

const unsigned char* bytes = reinterpret_cast<const unsigned char*>(str);
const unsigned char* endPtr = bytes + lenStr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically you should have
size32_t sizeUtf8 = rtlUtf8Length(lenStr, str);
since length is the number of code points, not the number of bytes. But that would be inefficient.

I think the clean way is a two loops. One which scans all the ascii characters. Then a check if (bytes < endPtr) of the last codepoint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I realized that this function shouldn't be using UTF8 as a parameter data type at all. It should be DATA. Making that change means lenStr makes sense again.

// Determine if a UTF-8 string really contains UTF-8 characters
#UNIQUENAME(IsUTF8);
LOCAL BOOLEAN %IsUTF8%(UTF8 str) := EMBED(C++)
if (lenStr == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also worth
#option pure
in the function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added!

@dcamper dcamper requested a review from ghalliday October 19, 2023 18:55
@dcamper dcamper changed the title HPCC-30559 Update DataPatterns.Profile to v1.9.3 HPCC-30559 Update DataPatterns.Profile to v1.9.4 Oct 20, 2023
// ASCII; continue scan
bytes += 1;
}
else if ((0xC2 <= bytes[0] && bytes[0] <= 0xDF) && (bytes+1 < endPtr) && (0x80 <= bytes[1] && bytes[1] <= 0xBF))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the parameter is a data you need to protect against accessing beyond the end of the buffer when you are looking at the last bytes of the string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the intended purpose of the newly-added (bytes+1 < endPtr) clause in this if(). With short-circuiting, that should prevent scanning beyond endPtr, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it does. I should have paid closer attention to the change.

@dcamper dcamper requested a review from ghalliday October 24, 2023 12:44
Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcamper please squash and I will merge

// ASCII; continue scan
bytes += 1;
}
else if ((0xC2 <= bytes[0] && bytes[0] <= 0xDF) && (bytes+1 < endPtr) && (0x80 <= bytes[1] && bytes[1] <= 0xBF))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it does. I should have paid closer attention to the change.

Changes include:

* Support UTF-8 strings in Mode values and example text patterns

* Security updates

* Better identify upper- and lower-case Unicode characters in text patterns

* Scan Unicode and UTF-8 strings to see if they can be represented with a STRING data type instead

Signed-off-by: Dan S. Camper <[email protected]>
@dcamper dcamper force-pushed the hpcc-30559-datapatterns-profile-1.9.3 branch from 6083814 to bc99237 Compare October 25, 2023 11:41
@dcamper
Copy link
Contributor Author

dcamper commented Oct 25, 2023

@ghalliday Commits squashed. Please merge. Thanks!

@ghalliday ghalliday merged commit 3522729 into hpcc-systems:candidate-9.4.x Oct 26, 2023
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants