-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-31921 Add caching of regex compiled search patterns #18748
HPCC-31921 Add caching of regex compiled search patterns #18748
Conversation
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-31921 Jirabot Action Result: |
e6c944b
to
da6cb86
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcamper looks good. One functional question, and a couple of performance comments.
system/jlib/jhash.hpp
Outdated
|
||
recentList.erase(foundIter->second.second); | ||
recentList.push_front(key); | ||
lookupMap[key] = {foundIter->second.first, recentList.begin()}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than replacing the entry you can just update the pointer in the LRU list.
foundIter->second.second = recentList.begin();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch; fixed.
system/jlib/jhash.hpp
Outdated
{ | ||
recentList.erase(foundIter->second.second); | ||
recentList.push_front(key); | ||
lookupMap[key] = {value, recentList.begin()}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foundIter->second = ...
is probably more efficient, but it shouldn't be hit in the current usage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
Agreed that in the current regex scenario this block won't be hit. I really struggled with creating a general purpose feature at this level of code versus just what I needed, and reigning in my desire to add another version of this LRU cache that would be more efficient for non-pointer objects.
rtl/eclrtl/eclregex.cpp
Outdated
// Check the cache | ||
{ | ||
CriticalBlock lock(compiledStrRegExprLock); | ||
compiledObjPtr = dynamic_cast<CCompiledStrRegExpr*>(compiledStrRegExprCache.get(regexHash)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the danger/likelihood of false positive matches caused by clashing hash codes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do the math, but the likelihood is "very very low" considering cache size is only 500. It is never impossible to prevent collisions on a hash, but our key space is huge compared to the number of cached items.
Earlier versions of this code used separate caches for the three data types (string, UTF-8 and unicode). That would reduce the possibility of collision but not enough, IMO. Separate caches would also eliminate the need for the added parent class and the virtualization of destructors.
rtl/eclrtl/eclregex.cpp
Outdated
} | ||
|
||
CCompiledStrRegExpr(CCompiledStrRegExpr & other) = default; // Note non-const argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the need for the non-const argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The std::shared_ptr in other apparently needs to be updated (the compiler complained about the original const version I had). If you know of a pattern where I can make other const, please tell me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, never mind. I think I had a different underlying issue at the time. All copy constructors now accept const arguments.
ecl/hql/hqlfold.cpp
Outdated
@@ -2796,7 +2796,7 @@ IHqlExpression * foldConstantOperator(IHqlExpression * expr, unsigned foldOption | |||
StringBuffer pattern, search; | |||
v0->getUTF8Value(pattern); | |||
v1->getUTF8Value(search); | |||
ICompiledStrRegExpr * compiled = rtlCreateCompiledU8StrRegExpr(pattern, !expr->hasAttribute(noCaseAtom)); | |||
ICompiledStrRegExpr * compiled = rtlCreateCompiledU8StrRegExpr(pattern.length(), pattern, !expr->hasAttribute(noCaseAtom)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right. I think it is passing a size to a length parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent catch; fixed.
rtl/eclrtl/eclregex.cpp
Outdated
// Check the cache | ||
{ | ||
CriticalBlock lock(compiledStrRegExprLock); | ||
compiledObjPtr = dynamic_cast<CCompiledStrRegExpr*>(compiledStrRegExprCache.get(regexHash)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My recollection is that a dynamic cast is significantly less efficient than a static cast, or a virtual method call. It might be worth considering if it can be removed, but probably not worth it for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is less efficient, but safer; I blame muscle memory for this one. I've swapped in static_cast because "I know what I'm doing" (hahaha). static_cast is resolved at compile time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcamper looks good. My only concerns are
i) Please can you include a regression test (that is ok to be released open source) and with a smaller number of iterations.
ii) There is no way to configure on/off. I think that's probably ok.
iii) How likely are there to be clashes in hash codes. How much less efficient is a hash table that stores strings? The problem is that mismatches would be silent, and could cause chaos.
That cached object now retains the original regex pattern and the options. If the object is retrieved from the cache via the hash key, then the saved pattern and options are compared to ensure that there is no collision. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks functionally correct. A couple of suggestions to discuss about the structure.
system/jlib/jhash.hpp
Outdated
if (foundIter == lookupMap.end()) | ||
return nullptr; | ||
|
||
recentList.erase(foundIter->second.second); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recentList.splice(0, recentList, foundIter->second.second)
is probably more efficient - it should avoid cloning the list element. From the docs that form should be valid if source and target are the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Changed here and in the set() method as well.
rtl/eclrtl/eclregex.cpp
Outdated
* <Software/Globals> section for an optional "regex" subsection with a "cacheSize" attribute | ||
* By default, the maximum cache size is set to 500 patterns. Override with 0 to disable caching. | ||
*/ | ||
void initMaxCacheSize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add static to avoid possible namespace pollution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch; fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcamper looks very close. A few suggests to clean up the code, and I think this will currently leak RegexCacheEntry objects.
rtl/eclrtl/eclregex.cpp
Outdated
: savedOptions(_options), savedPattern(_pattern, _patternSize), compiledRegex16(_compiledRegex16) | ||
{} | ||
|
||
RegexCacheEntry(const RegexCacheEntry & other) = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can now be = delete
rtl/eclrtl/eclregex.cpp
Outdated
|
||
RegexCacheEntry(const RegexCacheEntry & other) = default; | ||
|
||
virtual ~RegexCacheEntry() = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can now be removed.
rtl/eclrtl/eclregex.cpp
Outdated
|
||
compiledRegex = pcre2_compile_8((PCRE2_SPTR8)_regex, regexSize, options, &errNum, &errOffset, pcre2CompileContext8); | ||
CCompiledStrRegExpr(const CCompiledStrRegExpr & other) = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can now be = delete? or remove and just use the default.
rtl/eclrtl/eclregex.cpp
Outdated
failWithPCRE2Error(errNum, "Error in regex pattern: ", _regex, errOffset); | ||
} | ||
} | ||
virtual ~CCompiledStrRegExpr() = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can also probably now go away
rtl/eclrtl/eclregex.cpp
Outdated
{ | ||
CCompiledStrRegExpr * compiledObjPtr = nullptr; | ||
uint32_t options = (_isCaseSensitive ? 0 : PCRE2_CASELESS); | ||
hash64_t regexHash = HASH64_INIT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cleanup:
regexHash = RegexHashEntry::hashValue(_regexLength, _regex, options);
and elsewhere
while (lookupMap.size() > maxCacheSize) | ||
{ | ||
lookupMap.erase(recentList.back()); | ||
recentList.pop_back(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will the cache entry be deleted? Should it be a map to a std::unique_ptr?
1d747f5
to
261c5f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcamper looks good and also a clean implementation. Thanks, please squash and I will merge.
fae681d
to
7637764
Compare
7637764
to
7997a0a
Compare
Clean implementation with a great deal of help. Thank you! Rebased to latest master branch and squashed. Please merge. |
Type of change:
Checklist:
Smoketest:
Testing:
Manual execution of regression tests. Performance testing with pathological code (wall clock timings).