[Automation Enhancement] Mechanism to close the created Gradle Check AUTOCUT flaky test issues. #14475
Comments
I think the 1st option is pretty simple and straightforward, thanks @prudhvigodithi!
Agree that the 1st option is the simpler one and probably worth trying first.
Thanks, solution 1 is in place now; here is an example: #14499 (comment). Thank you.
Closing this issue, as we now have a mechanism to close the created Gradle Check AUTOCUT flaky test issues.
Is your feature request related to a problem? Please describe
Background
Following the initial implementation in #13950, the automation described in the DEVELOPER_GUIDE identifies flaky tests and creates flaky test report issues based on test failures in the post-merge actions. The data used to create these issues comes from the OpenSearch Metrics Project (for more details, refer to the Gradle Check Metrics Dashboard). The initial goal of finding flaky tests and creating a detailed issue report has been met.
Problem Statement
Issues auto-created by the automation can currently only be closed once the failures have been absent from the post-merge actions for 30 days (the query executed on the metrics cluster filters for tests that failed in the past 30 days). For example, here is an AUTOCUT issue created for RemoteStoreClusterStateRestoreIT. Even though the failure was identified and fixed promptly, there is no way for a user to close the issue: the automation flags RemoteStoreClusterStateRestoreIT again and re-opens the issue, because the test was identified as failing in the past 30 days. As a result, the issue remains open for the next 30 days (provided the test does not fail again in post-merge builds) even though the user has fixed the flaky test.
Describe the solution you'd like
Proposed Solution
Solution 1
As proposed here
If the issue is closed (on the assumption that the user has fixed the flaky test), the automation should not re-open it unless the data differs from what is shown in the issue body. If anything in the issue body would be different after the issue was closed, the automation should re-open it. The data to compare is the markdown table, not the linked PRs, because failures during PR creation can sometimes be genuine. In other words, re-open the issue when a new failure (with a different post-merge commit) is seen after the issue is closed. This also covers the case where a flaky test is thought to be fixed but recurs: the new occurrence re-opens the issue with the new data.
This solution is a simple comparison against the test names and git references in the existing issue body, used to decide whether to re-open an issue closed by the user or keep it closed.
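The comparison in Solution 1 could look roughly like the sketch below. It is an illustration only, not the automation's actual code: the table layout (git reference in the first column, test name in the second), the single-table assumption, and the names parse_table_rows and should_reopen are all hypothetical.

```python
from typing import Set, Tuple

# (git_reference, test_name) pair taken from one table row.
Row = Tuple[str, str]


def parse_table_rows(issue_body: str) -> Set[Row]:
    """Extract (git_reference, test_name) pairs from a markdown table.

    Assumes the body contains exactly one table whose first two lines are
    the header and the separator, and whose first two columns are the git
    reference and the test name (a hypothetical layout).
    """
    table_lines = [
        line.strip()
        for line in issue_body.splitlines()
        if line.strip().startswith("|")
    ]
    rows: Set[Row] = set()
    for line in table_lines[2:]:  # skip header and separator rows
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) >= 2:
            rows.add((cells[0], cells[1]))
    return rows


def should_reopen(issue_state: str, issue_body: str, latest_rows: Set[Row]) -> bool:
    """Re-open only a closed issue, and only when the latest data contains
    a failure that is not already recorded in the issue body."""
    if issue_state != "closed":
        return False
    return bool(latest_rows - parse_table_rows(issue_body))
```

With this shape, a closed issue stays closed while the 30-day query keeps returning the same rows, and re-opens as soon as a failure with a different post-merge commit appears.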
Solution 2
This solution maintains a database of events and uses those events to decide whether to open a new issue or keep an issue closed.
Create a new index, gradle-check-flaky-tests, from the flaky test names identified in OpenSearch Gradle Check Metrics; this would be part of the automation FetchPostMergeFailedTestClass. Then create a new document for each test name, associated with a test_class and a git_reference. The document fields are:
The flaky_identified_at is the date when the document was first created.
The updated_at is when the daily automation was triggered.
(Optional) The time_open_in_days is the difference (updated_at - flaky_identified_at).
(Optional) The time_closed_in_days is the difference (updated_at - flaky_identified_at) once flaky is set to false.
The flaky field is set to false once the issue is closed by the user.
The fixed_at is the current updated_at after flaky is set to false (approximately the time when the issue was closed).
The issue_number is the GitHub issue number created for the test_class (for example, #14326).
For upcoming automation runs, if the automation identifies the test_name for the same git_reference with "flaky": false, it should not re-open the issue. If it finds the test_name for a different git_reference, then even though the flaky test was fixed, it has failed for another post-merge commit (git_reference); the automation should create a new document and a new issue flagging the test as flaky for the different commit. For open issues, the automation continues to update the issue body and the document fields above while keeping "flaky": true.
The assumption here is that the user will only close the issue when all the test names that are part of the issue (for example, #14381) are fixed. The framework maintains one GitHub issue for all test failures grouped by test class, and separate documents in the cluster, one for each test name.
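As a rough sketch of the Solution 2 flow, the snippet below models one document per test name and the re-open decision. It is an assumption-laden illustration: a plain dictionary stands in for the gradle-check-flaky-tests index, the field names follow the description above, and FlakyTestDoc, decide, and mark_closed are hypothetical names rather than part of the automation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional, Tuple


@dataclass
class FlakyTestDoc:
    """One document per test name, tied to a test class and git reference."""
    test_class: str
    test_name: str
    git_reference: str
    issue_number: int
    flaky_identified_at: date
    updated_at: date
    flaky: bool = True
    fixed_at: Optional[date] = None

    @property
    def time_open_in_days(self) -> int:
        # Difference (updated_at - flaky_identified_at), in days.
        return (self.updated_at - self.flaky_identified_at).days


# In-memory stand-in for the index, keyed by (test_name, git_reference).
Key = Tuple[str, str]


def decide(index: Dict[Key, FlakyTestDoc], test_name: str, git_reference: str) -> str:
    """Return the action for a failure seen in the current automation run."""
    doc = index.get((test_name, git_reference))
    if doc is None:
        # Unseen test, or a known test failing on a different commit.
        return "create_new_document_and_issue"
    if not doc.flaky:
        # Same test and commit, already closed by the user.
        return "keep_closed"
    return "update_open_issue"


def mark_closed(doc: FlakyTestDoc, today: date) -> None:
    """Record that the user closed the issue for this test."""
    doc.flaky = False
    doc.updated_at = today
    doc.fixed_at = today
```

Keying on both test_name and git_reference means a fixed test stays closed for the commit it was fixed against, while a failure on a new post-merge commit produces a new document and a new issue.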
With this solution we can even build trends on these flaky test documents using the OpenSearch Metrics Dashboard.
Related component
Other
Describe alternatives you've considered
No response
Additional context
No response