HPCC-30299 Update secrets in the background to avoid roxie stalls #18071

Merged
merged 1 commit into from
Jan 3, 2024

Conversation

ghalliday
Member

@ghalliday ghalliday commented Nov 22, 2023

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@ghalliday ghalliday requested a review from afishbeck November 22, 2023 14:40

@ghalliday
Member Author

This is easiest to review as two separate commits - one which refactors the code and rearranges it. The second implements the background updater and fixes problems with access times not being updated.

@ghalliday
Member Author

Worth checking the unit tests are correct, and also that the timeouts are sensible.
By default it checks every 30 seconds (timeout/20) for any items that are due to need refreshing within the next 2 minutes (timeout/5).
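The two ratios above can be sketched as follows. This is only an illustration: `RefreshSchedule` and `deriveSchedule` are hypothetical names, and the 10 minute timeout used in the usage note is inferred from the stated figures (30 s × 20), not quoted from the code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; only the ratios timeout/20 and timeout/5 come from the comment above.
struct RefreshSchedule
{
    std::uint64_t checkIntervalMs;  // how often the background check runs
    std::uint64_t refreshAheadMs;   // refresh items due to need refreshing within this window
};

inline RefreshSchedule deriveSchedule(std::uint64_t timeoutMs)
{
    assert(timeoutMs > 0);
    return { timeoutMs / 20, timeoutMs / 5 };
}
```

With a timeout of 600000 ms, `deriveSchedule` gives a 30000 ms check interval and a 120000 ms look-ahead window, matching the 30 seconds and 2 minutes quoted above.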

Member

@richardkchapman richardkchapman left a comment

Some minor comments, plus a question on the logic

{
const char * slash = strchr(key, '/');
assertex(slash);
const char * at = strchr(slash, '@');
Member

This will go wrong if anyone puts a # into a vault id. Perhaps that is checked elsewhere.

Member

Agree. The vault id is just the name of the vault config entry in values.yaml or environment.xml. I don't think there is currently any checking for allowed characters. We have control over what is allowed and probably should limit the characters used. Could add checking to the helm chart and the CVault constructor.

Member Author

Added HPCC-30124 to check them.

const char * at = strchr(slash, '@');
const char * hash = strchr(slash, '#');

const char * end = nullptr;
Member

Feels like this code would be simpler if you initialised end = key+strlen(key) rather than null

Member Author

It would be, but I'm not sure I want to change it now though.
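A minimal sketch of the reviewer's suggestion, assuming the goal is to find where the segment after the first '/' ends. The function name `segmentAfterSlash` is hypothetical, and the excerpt does not show what each part of the key means, so the example treats the separators purely as delimiters.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Returns the text after the first '/' up to the first '@' or '#',
// or up to the end of the key if neither separator is present.
inline std::string segmentAfterSlash(const char * key)
{
    const char * slash = std::strchr(key, '/');
    assert(slash);
    // Start with end pointing at the terminating null, so no null checks are needed later.
    const char * end = key + std::strlen(key);
    const char * at = std::strchr(slash, '@');
    const char * hash = std::strchr(slash, '#');
    if (at && at < end)
        end = at;
    if (hash && hash < end)
        end = hash;
    return std::string(slash + 1, end);
}
```

Because `end` always points somewhere valid, each separator only has to be compared against it, whichever order the separators appear in.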

    // We should never replace known contents for unknown contents
    // so once this returns true it should always return true
    bool hasContents() const
    {
        return contents != nullptr;
    }

//Has the secret value been used since it was last checked for an update?
Member

I'm struggling to understand why it is relevant whether a secret value has been used recently, when updating them. Unless you are removing secrets from the cache if not used recently (which I don't think you are).

Member Author

The aim is to not refresh secrets in the background that have not been used recently - to avoid unnecessary load on the vault.

for (auto & entry : secrets)
{
SecretCacheEntry * secret = entry.second.get();
if (secret->isActive() && secret->needsRefresh(when))
Member

This doesn't feel right - just because a secret has not been read recently doesn't mean that the next time it IS used it's ok to use an old version, does it?

Member Author

Next time it is actually used it will check to see if it needs a refresh, and if it does it will go and get the new value. This PR doesn't change that behaviour, it preemptively refreshes secrets that are in active use behind the scenes to avoid roxie pausing when it needs to access them.
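The selection rule discussed in this thread might be sketched like this. `CachedSecret` and `selectForBackgroundRefresh` are hypothetical stand-ins, not the PR's `SecretCacheEntry` API; the sketch only shows that the background pass picks entries that are both recently used and close to needing a refresh, while idle entries are left for the normal check on next access.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical simplified cache entry, for illustration only.
struct CachedSecret
{
    std::string name;
    std::uint64_t expiresAt;        // time at which the cached value needs refreshing
    bool accessedSinceLastCheck;    // has the value been used since the last update check?
};

// Returns the names of secrets the background pass would refresh.
// Idle secrets are skipped to avoid unnecessary load on the vault.
inline std::vector<std::string> selectForBackgroundRefresh(
    const std::vector<CachedSecret> & cache, std::uint64_t now, std::uint64_t refreshAhead)
{
    std::vector<std::string> due;
    for (const auto & secret : cache)
    {
        bool active = secret.accessedSinceLastCheck;
        bool needsRefresh = secret.expiresAt <= now + refreshAhead;
        if (active && needsRefresh)
            due.push_back(secret.name);
    }
    return due;
}
```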

@AttilaVamos
Contributor

AttilaVamos commented Nov 23, 2023

I found a bunch of core files (in CentOS based On-Demand Smoketest) and the generated trace files show similar problem:

Core was generated by `hthor --workunit=W20231123-104712-1 --daliServers=10.22.254.118:7070'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f8a3e4df387 in raise () from /lib64/libc.so.6

Backtrace for all threads
==========================
Thread 1 (Thread 0x7f8a3cb1f300 (LWP 15963)):
#0  0x00007f8a3e4df387 in raise () from /lib64/libc.so.6
#1  0x00007f8a3e4e0a78 in abort () from /lib64/libc.so.6
#2  0x00007f8a3edf0a95 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3  0x00007f8a3edeea06 in ?? () from /lib64/libstdc++.so.6
#4  0x00007f8a3eded9b9 in ?? () from /lib64/libstdc++.so.6
#5  0x00007f8a3edee624 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
#6  0x00007f8a3e8868e3 in ?? () from /lib64/libgcc_s.so.1
#7  0x00007f8a3e886e17 in _Unwind_Resume () from /lib64/libgcc_s.so.1
#8  0x00007f8a4138d8c2 in CInterfaceOf<ISpan>::Release() const () from /opt/HPCCSystems/lib/libjlib.so
#9  0x00007f8a41291041 in DummyLogCtx::~DummyLogCtx() () from /opt/HPCCSystems/lib/libjlib.so
#10 0x00007f8a3e4e305a in __cxa_finalize () from /lib64/libc.so.6
#11 0x00007f8a4123daa7 in ?? () from /opt/HPCCSystems/lib/libjlib.so
#12 0x00007fff042b6aa0 in ?? ()
#13 0x00007f8a44c7708a in _dl_fini () from /lib64/ld-linux-x86-64.so.2
Backtrace stopped: frame did not save the PC

 Registers:
==========================
rax            0x0                 0
rbx            0x7f8a3e871868      140231731255400
rcx            0xffffffffffffffff  -1
rdx            0x6                 6
rsi            0x3e5b              15963
rdi            0x3e5b              15963
rbp            0x403148            0x403148 <typeinfo name for IException*>
rsp            0x7fff042b6078      0x7fff042b6078
r8             0x7fff042b5cb0      140733263338672
r9             0x4                 4
r10            0x8                 8
r11            0x202               514
r12            0x575590            5723536
r13            0x0                 0
r14            0x0                 0
r15            0x0                 0
rip            0x7f8a3e4df387      0x7f8a3e4df387 <raise+55>
eflags         0x202               [ IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0

 Disas:
==========================
Dump of assembler code for function raise:
   0x00007f8a3e4df350 <+0>:	mov    %fs:0x2d4,%ecx
   0x00007f8a3e4df358 <+8>:	mov    %fs:0x2d0,%esi
   0x00007f8a3e4df360 <+16>:	test   %esi,%esi
   0x00007f8a3e4df362 <+18>:	jne    0x7f8a3e4df398 <raise+72>
   0x00007f8a3e4df364 <+20>:	mov    $0xba,%eax
   0x00007f8a3e4df369 <+25>:	syscall 
   0x00007f8a3e4df36b <+27>:	mov    %eax,%ecx
   0x00007f8a3e4df36d <+29>:	mov    %eax,%fs:0x2d0
   0x00007f8a3e4df375 <+37>:	mov    %eax,%esi
   0x00007f8a3e4df377 <+39>:	movslq %edi,%rdx
   0x00007f8a3e4df37a <+42>:	movslq %esi,%rsi
   0x00007f8a3e4df37d <+45>:	movslq %ecx,%rdi
   0x00007f8a3e4df380 <+48>:	mov    $0xea,%eax
   0x00007f8a3e4df385 <+53>:	syscall 
=> 0x00007f8a3e4df387 <+55>:	cmp    $0xfffffffffffff000,%rax
   0x00007f8a3e4df38d <+61>:	ja     0x7f8a3e4df3ad <raise+93>
   0x00007f8a3e4df38f <+63>:	repz ret 
   0x00007f8a3e4df391 <+65>:	nopl   0x0(%rax)
   0x00007f8a3e4df398 <+72>:	test   %ecx,%ecx
   0x00007f8a3e4df39a <+74>:	jg     0x7f8a3e4df377 <raise+39>
   0x00007f8a3e4df39c <+76>:	mov    %ecx,%eax
   0x00007f8a3e4df39e <+78>:	neg    %eax
   0x00007f8a3e4df3a0 <+80>:	and    $0x7fffffff,%ecx
   0x00007f8a3e4df3a6 <+86>:	cmove  %esi,%eax
   0x00007f8a3e4df3a9 <+89>:	mov    %eax,%ecx
   0x00007f8a3e4df3ab <+91>:	jmp    0x7f8a3e4df377 <raise+39>
   0x00007f8a3e4df3ad <+93>:	mov    0x390a9c(%rip),%rdx        # 0x7f8a3e86fe50
   0x00007f8a3e4df3b4 <+100>:	neg    %eax
   0x00007f8a3e4df3b6 <+102>:	mov    %eax,%fs:(%rdx)
   0x00007f8a3e4df3b9 <+105>:	or     $0xffffffffffffffff,%rax
   0x00007f8a3e4df3bd <+109>:	ret    
End of assembler dump.

I don't know whether it is related to your changes or not.
If not I will raise a JIRA for it.

    if (resolved)
        globalSecretCache.updateSecret(secret, resolved, now, accessed);
    else
        secret->noteFailedUpdate(now, accessed);
Contributor

Would we always honor accessed if not resolved?

const char * slash = strchr(key, '/');
assertex(slash);
const char * at = strchr(slash, '@');
const char * hash = strchr(slash, '#');
Contributor

Yes - are these special chars ('/', '#', '@') always ok to use as separators?

Contributor

@mckellyln mckellyln left a comment

I have two minor comments, but otherwise looks good to me.

Member

@afishbeck afishbeck left a comment

Very minor comments, otherwise looks good. The vault name should be restricted to a basic charset and validated in the helm chart and perhaps elsewhere.

MilliSleep(30); // elapsed=180 = 80 + 80 + 20
CPPUNIT_ASSERT(secret6->isValid());
CPPUNIT_ASSERT(!secret6->isStale());
unsigned version3 = secret6->getVersion(); // Mark the value as accessed, but too early to be refreshed
Member

Is the above comment wrong? No longer too early to be refreshed?

Member Author

You are correct, the comment is wrong.

@ghalliday
Member Author

@afishbeck I have created a jira, fixed the comment and squashed.
What do you think is the best target for this? Is it important enough to go in 9.4.x? Without it we may get periodic pauses accessing secrets in roxie queries.

@afishbeck
Member

> @afishbeck I have created a jira, fixed the comment and squashed. What do you think is the best target for this. Is it important enough to go in 9.4.x? Without it we may get periodic pauses accessing secrets in roxie queries.

I agree that this is an important improvement. One way of looking at it is that since this can be turned off via config, what is the chance the code change would break something even if turned off?

@ghalliday
Member Author

@mckellyln I decided not to merge because I was concerned about msTick() wrapping. In this case, secrets are never removed from the hash table - which means that if a roxie has been up for more than 25 days, and early in its life it accessed a secret, but hasn't since, the test for needs refresh and expired may be wrong.
I think the only ill effect is that isActive() will start returning true - so all secrets will be requested from the vault every 25 days, but switching to nsTick() avoids the problem for 250 years.
Please can you check the update - the unit tests caught several mistakes.
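One plausible way the wrap problem arises can be sketched as follows. This assumes a 32-bit millisecond tick compared through a signed difference, which would explain the 25 day figure (2^31 ms is just under 25 days); the comparison actually used in the PR is not shown in this thread, and `hasExpired32` and `hasExpired64` are hypothetical names.

```cpp
#include <cstdint>

// 32-bit millisecond ticks compared via a signed difference.
// Correct only while the two ticks are less than 2^31 ms (about 25 days) apart.
// The unsigned-to-signed conversion wraps on mainstream compilers
// and is defined as modular from C++20.
inline bool hasExpired32(std::uint32_t nowMs, std::uint32_t expiryMs)
{
    return static_cast<std::int32_t>(nowMs - expiryMs) >= 0;
}

// 64-bit nanosecond ticks: the counter range spans centuries,
// so a plain comparison is sufficient in practice.
inline bool hasExpired64(std::uint64_t nowNs, std::uint64_t expiryNs)
{
    return nowNs >= expiryNs;
}
```

For an entry whose expiry tick was 1000 ms and which is then left untouched for slightly more than 2^31 ms, `hasExpired32` reports that it has not expired, while `hasExpired64` with the same instants expressed in nanoseconds still reports that it has.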

}

extern jlib_decl void setSecretTimeout(unsigned timeoutMs)
{
secretTimeoutMs = timeoutMs;
secretTimeoutNs = (unsigned __int64)timeoutMs * 1000000;
Contributor

Why unsigned __int64 and not __uint64? Does it matter?
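Whichever spelling of the 64-bit type is preferred, the widening itself does matter. The sketch below uses `std::uint64_t` and hypothetical function names, and assumes `unsigned` is 32 bits wide, to show what is lost if the multiplication is done before widening.

```cpp
#include <cstdint>

// Widen first, then multiply: the product is computed in 64 bits.
inline std::uint64_t msToNs(unsigned timeoutMs)
{
    return static_cast<std::uint64_t>(timeoutMs) * 1000000;
}

// Multiply in unsigned int: the product wraps modulo 2^32 (where unsigned is 32 bits)
// before it is ever converted to a wider type.
inline std::uint64_t msToNsNarrow(unsigned timeoutMs)
{
    return timeoutMs * 1000000u;
}
```

For a 600000 ms timeout, `msToNs` returns 600000000000, whereas `msToNsNarrow` returns a value below 2^32 on platforms with a 32-bit `unsigned`.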

Contributor

@mckellyln mckellyln left a comment

Commit 2 looks good.
Approved.

@ghalliday ghalliday merged commit ababf40 into hpcc-systems:candidate-9.4.x Jan 3, 2024
22 checks passed
5 participants