Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-32193 Fix some issues with spill stats in smart join activity #18866

Merged
merged 1 commit into from
Sep 16, 2024

Conversation

shamser
Copy link
Contributor

@shamser shamser commented Jul 10, 2024

The current temp file statistics for smartjoin did not include all stats from all temp files:

  1. temp files were closed before its sizes were recorded in the stats
  2. stats from some types of temp files were not being tracked such as overflowWriteFile from RHS
  3. stats from temp files that were closed in CSpillableStreamBase were not preserved
  4. peak temp file size was not tracked in CThorSpillableRowArray
  5. make CThorRowCollectorBase use of stats from CThorSpillableRowArray for more accurate and simpler temp stats tracking (for CThorRowCollectorBase)

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

Copy link

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32193

Jirabot Action Result:
Workflow Transition: Merge Pending
Updated PR

@shamser shamser force-pushed the issue32193 branch 2 times, most recently from 30cc34f to 2a67393 Compare July 16, 2024 08:36
@shamser shamser changed the base branch from candidate-9.8.x to candidate-9.6.x August 8, 2024 08:29
@shamser shamser force-pushed the issue32193 branch 2 times, most recently from 5aaba8c to 333932c Compare August 8, 2024 09:40
@shamser shamser marked this pull request as ready for review August 8, 2024 09:40
@shamser shamser requested review from ghalliday and jakesmith August 8, 2024 09:40
Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broadly looks good. One question about thread safety

@@ -3018,6 +3035,13 @@ class CLookupJoinActivityBase : public CInMemJoinBase<HTHELPER, IHThorHashJoinAr
activeStats.setStatistic(StNumSmartJoinDegradedToLocal, aggregateFailoversToLocal); // NB: is going to be same for all slaves.
activeStats.setStatistic(StNumSmartJoinSlavesDegradedToStd, aggregateFailoversToStandard);
}
if (overflowWriteFileIO)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does there need to be a critical block incase the overflow file, or the rhsSlaveRows are killed when this is being called?

Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - please see comments

@@ -413,7 +413,7 @@ class graph_decl CThorSpillableRowArray : private CThorExpandingRowArray, implem
mutable CriticalSection cs;
ICopyArrayOf<IWritePosCallback> writeCallbacks;
size32_t compBlkSz = 0; // means use default

CRuntimeStatisticCollection inactiveStats; // reset after each kill
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they're not really "inactive" at this point afaics, they're the current stats of CThorSpillableRowArray (they get updated every save()). I think would be better to rename to 'stats'

overflowWriteStream.clear();
if (overflowWriteFileIO)
{
mergeRemappedStats(PARENT::inactiveStats, overflowWriteFileIO, diskToTempStatsMap);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I think this is correct approach - utility class uses standard IO stats types, and the activity remaps them to what it wants to

thorlcr/activities/lookupjoin/thlookupjoinslave.cpp Outdated Show resolved Hide resolved
@@ -2678,8 +2690,12 @@ class CLookupJoinActivityBase : public CInMemJoinBase<HTHELPER, IHThorHashJoinAr
IOutputMetaData *inputOutputMeta = rightITDL->queryFromActivity()->queryContainer().queryHelper()->queryOutputMeta();
// rows may either be in separate slave row arrays or in single rhs array, or split.
rowcount_t total = rightCollector ? rightCollector->numRows() : (getGlobalRHSTotal() + rhs.ordinality());
if (rightCollector && rightCollector->hasSpilt())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think should move above the if !isOOMException .. i.e. still want to capture stats.
It would be cleaner if there was a Owned (before the try) and the code in the exception handler only differentiated the exception type, e.g.:

        catch (IException *e)
        {
            if (!isOOMException(e))
                exception.setown(e);
            else
            {
                IOutputMetaData *inputOutputMeta = rightITDL->queryFromActivity()->queryContainer().queryHelper()->queryOutputMeta();
                // rows may either be in separate slave row arrays or in single rhs array, or split.
                rowcount_t total = rightCollector ? rightCollector->numRows() : (getGlobalRHSTotal() + rhs.ordinality());
                exception.setown(checkAndCreateOOMContextException(this, e, "gathering RHS rows for lookup join", total, inputOutputMeta, NULL));
            }
        }
        if (rightCollector && rightCollector->hasSpilt())
            mergeStats(PARENT::inactiveStats, rightCollector);
        if (exception)
            throw exception.getClear();

@@ -234,6 +234,7 @@ class CSpillableStreamBase : public CSpillable
unsigned spillCompInfo;
CThorSpillableRowArray rows;
Owned<CFileOwner> spillFile;
CRuntimeStatisticCollection inactiveStats;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are not really inactiveStats at this stage, I'd rename 'stats'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment missed? still think clearer if this is renamed

{
throwOnOom = false;
}

CThorSpillableRowArray::CThorSpillableRowArray(CActivityBase &activity, IThorRowInterfaces *rowIf, EmptyRowSemantics emptyRowSemantics, StableSortFlag stableSort, rowidx_t initialSize, size32_t _commitDelta)
: CThorExpandingRowArray(activity, rowIf, ers_forbidden, stableSort, false, initialSize), commitDelta(_commitDelta)
: CThorExpandingRowArray(activity, rowIf, ers_forbidden, stableSort, false, initialSize), commitDelta(_commitDelta), inactiveStats(spillStatistics)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a mixture of some utility containers (inc jfile objects) that create standard stats (iCSmartRowBuffer creates regular IO stats). and others that now create 'spillStatistics'.
It would be clearer/more consistent, if we always let the utility classes create standard stats., and only in the activity code were they remapped.
i.e. I don't think CThorSpillableRowArray should itself capture "spillStatistics" and do the remapping. The activities that use them should do the remapping.

Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - please see comments.

I think it would aide the review process, to break this down into 2 or more separate PR's if it can be made consistent.
e.g. > make CThorRowCollectorBase use of stats from CThorSpillableRowArray for more accurate and simpler temp stats tracking (for CThorRowCollectorBase)
(and any knock on effects) could be own PR?

thorlcr/activities/lookupjoin/thlookupjoinslave.cpp Outdated Show resolved Hide resolved
thorlcr/activities/lookupjoin/thlookupjoinslave.cpp Outdated Show resolved Hide resolved
@@ -234,6 +234,7 @@ class CSpillableStreamBase : public CSpillable
unsigned spillCompInfo;
CThorSpillableRowArray rows;
Owned<CFileOwner> spillFile;
CRuntimeStatisticCollection inactiveStats;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment missed? still think clearer if this is renamed

thorlcr/thorutil/thmem.cpp Outdated Show resolved Hide resolved
thorlcr/thorutil/thmem.cpp Show resolved Hide resolved
thorlcr/thorutil/thmem.cpp Show resolved Hide resolved
@shamser shamser requested a review from jakesmith September 2, 2024 10:22
Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - looks good, only one very minor comment.

@@ -1638,11 +1648,8 @@ class CThorRowCollectorBase : public CSpillable
Owned<CSharedSpillableRowSet> spillableRowSet;
unsigned options = 0;
unsigned spillCompInfo = 0;
RelaxedAtomic<unsigned> statOverflowCount{0};
RelaxedAtomic<offset_t> statSizeSpill{0};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: now also looks unused

@shamser shamser requested a review from jakesmith September 4, 2024 15:31
Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - looks good. Please squash.

Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser - reapproving.

@ghalliday
Copy link
Member

@jakesmith are we still happy for this to go in 9.6.x? Confident it will not cause a regression?

Copy link
Member

@ghalliday ghalliday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser one question

@@ -3001,12 +3022,13 @@ class CLookupJoinActivityBase : public CInMemJoinBase<HTHELPER, IHThorHashJoinAr
GetTempFilePath(tempFilename, "lookup_local");
ActPrintLog("Overflowing RHS broadcast rows to spill file: %s", tempFilename.str());
overflowWriteFile.setown(container.queryActivity()->createOwnedTempFile(tempFilename.str()));
overflowWriteStream.setown(createRowWriter(&(overflowWriteFile->queryIFile()), queryRowInterfaces(rightITDL), rwFlags));
overflowWriteFileIO.setown(overflowWriteFile->queryIFile().open(IFOcreate));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamser should this be being cleared at some point - otherwise I think it will stay open longer than you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghalliday I'm now clearing overflowWriteFileIO at the same time that overflowWriteStream is cleared (Line 2118).

@shamser shamser requested a review from ghalliday September 13, 2024 11:33
@ghalliday
Copy link
Member

@shamser I think you have accidentally included a vcpkg update. Please squash, but remove that extra update.

The current temp file statistics for smartjoin did not include all
stats from all temp files:
1) temp files were closed before its sizes were recorded in the stats
2) stats from some types of temp files were not being tracked such
as overflowWriteFile from RHS
3) stats from temp files that were closed in CSpillableStreamBase
were not preserved
4) peak temp file size was not tracked in CThorSpillableRowArray
5) make CThorRowCollectorBase use of stats from CThorSpillableRowArray
for more accurate and simpler temp stats tracking.

Signed-off-by: Shamser Ahmed <[email protected]>

HPCC-32193 Close overflowWriteFileIO when it's no longer needed

Signed-off-by: Shamser Ahmed <[email protected]>
@ghalliday ghalliday merged commit 0f9e143 into hpcc-systems:candidate-9.6.x Sep 16, 2024
52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants