Unstable Reliability test results #4386

Closed
pro-akim opened this issue Aug 2, 2023 · 8 comments

@pro-akim
Member

pro-akim commented Aug 2, 2023

Description

Different results were obtained when running the reliability tests over the same artifacts from the workload benchmark test.
This should be investigated.

Details

Performed: wazuh/wazuh#18017
A member of the framework team tried to recreate #4364 and obtained a different reliability report.

Running pytest on test_cluster_logs/test_cluster_connection, the following was reported:

  ~/Doc/W/R/wazuh-qa/t/r/test_cluster on   master ❯ python3 -m pytest test_cluster_logs/test_cluster_connection/test_cluster_connection.py --artifacts_path='/Users/framos/Desktop/artifacts' --html=report.html
================================================================= test session starts ==================================================================
platform darwin -- Python 3.9.6, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /Users/framos/Documents/Wazuh/Repositories/wazuh-qa/tests, configfile: pytest.ini
plugins: testinfra-5.0.0, html-3.1.1, metadata-2.0.4
collected 1 item

test_cluster_logs/test_cluster_connection/test_cluster_connection.py F                                                                           [100%]

======================================================================= FAILURES =======================================================================
_______________________________________________________________ test_cluster_connection ________________________________________________________________

artifacts_path = '/Users/framos/Desktop/artifacts'

    def test_cluster_connection(artifacts_path):
        """Verify that no worker disconnects from the master once they are connected.

        For each worker, this test looks for the first successful connection message
        in its logs. Then it looks for any failed connection attempts after the successful
        connection found above.

        Args:
            artifacts_path (str): Path where folders with cluster information can be found.
        """
        if not artifacts_path:
            pytest.fail("Parameter '--artifacts_path=<path>' is required.")

        cluster_log_files = glob(join(artifacts_path, 'worker_*', 'logs', 'cluster.log'))
        if len(cluster_log_files) == 0:
            pytest.fail(f'No files found inside {artifacts_path}.')

        for log_file in cluster_log_files:
            with open(log_file) as f:
                s = mmap(f.fileno(), 0, access=ACCESS_READ)
                # Search first successful connection message.
                conn = re.search(rb'^.*Successfully connected to master.*$', s, flags=re.MULTILINE)
                if not conn:
                    pytest.fail(f'Could not find "Sucessfully connected to master" message in the '
                                f'{node_name.search(log_file)[1]}')

                # Search if there are any connection attempts after the message found above.
                if re.search(rb'^.*Could not connect to master. Trying.*$|^.*Sucessfully connected to master.*$',
                             s[conn.end():], flags=re.MULTILINE):
                    disconnected_nodes.append(node_name.search(log_file)[1])

        if disconnected_nodes:
>           pytest.fail(f'The following nodes disconnected from master at any point:\n- ' + '\n- '.join(disconnected_nodes))
E           Failed: The following nodes disconnected from master at any point:
E           - worker_2
E           - worker_5
E           - worker_4
E           - worker_3
E           - worker_15
E           - worker_12
E           - worker_24
E           - worker_23
E           - worker_22
E           - worker_25
E           - worker_13
E           - worker_14
E           - worker_6
E           - worker_1
E           - worker_8
E           - worker_9
E           - worker_7
E           - worker_11
E           - worker_16
E           - worker_20
E           - worker_18
E           - worker_19
E           - worker_21
E           - worker_17
E           - worker_10

test_cluster_logs/test_cluster_connection/test_cluster_connection.py:47: Failed
-------------- generated html file: file:///Users/framos/Documents/Wazuh/Repositories/wazuh-qa/tests/reliability/test_cluster/report.html --------------
=============================================================== short test summary info ================================================================
FAILED test_cluster_logs/test_cluster_connection/test_cluster_connection.py::test_cluster_connection - Failed: The following nodes disconnected from ...
================================================================== 1 failed in 0.84s ===================================================================

Meanwhile, the failure in the first run was:

                if not conn:
>                   pytest.fail(f'Could not find "Sucessfully connected to master" message in the '
                                f'{node_name.search(log_file)[1]}')
E                   Failed: Could not find "Sucessfully connected to master" message in the worker_21

Therefore, using the same artifact downloaded from 274 and running the same test one more time, the two reports are different:

Previous report
New report

Checking the data from the logs, both runs worked with the same data (the dates are identical in reliability/test_cluster/test_cluster_logs/test_cluster_worker_logs_order/test_cluster_worker_logs_order.py::test_check_logs_order_workers).
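For reference, a quick way to confirm that two runs used the same artifacts is to compare the first and last timestamps in each worker's cluster.log. The sketch below is only illustrative: the worker_*/logs/cluster.log layout is taken from the test quoted above, and the timestamp format is assumed from the log lines shown in this issue.

# Minimal sketch: print the first and last log timestamps per worker so two
# report runs can be compared by hand. The artifacts layout is an assumption
# based on the test above; the timestamp format is assumed from this issue.
import re
from glob import glob
from os.path import join

ARTIFACTS_PATH = '/Users/framos/Desktop/artifacts'  # same value used in the run above
TIMESTAMP = re.compile(r'^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}')

for log_file in sorted(glob(join(ARTIFACTS_PATH, 'worker_*', 'logs', 'cluster.log'))):
    with open(log_file, errors='replace') as f:
        stamps = [m.group(0) for line in f if (m := TIMESTAMP.match(line))]
    if stamps:
        print(f'{log_file}: {stamps[0]} .. {stamps[-1]}')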

@pro-akim
Member Author

pro-akim commented Aug 2, 2023

Updated reliability test results:

3/5 tests failed.

Failures:

  • reliability/test_cluster/test_cluster_logs/test_cluster_connection/test_cluster_connection.py::test_cluster_connection

Testing: Verify that no worker disconnects from the master once they are connected.

  if disconnected_nodes:
>           pytest.fail(f'The following nodes disconnected from master at any point:\n- ' + '\n- '.join(disconnected_nodes))
E           Failed: The following nodes disconnected from master at any point:
E           - worker_21
E           - worker_13
E           - worker_4
E           - worker_6
E           - worker_18
E           - worker_20
E           - worker_11
E           - worker_2
E           - worker_14
E           - worker_3
E           - worker_9
E           - worker_17
E           - worker_22
E           - worker_7
E           - worker_5
E           - worker_19
E           - worker_12
E           - worker_25
E           - worker_8
E           - worker_24
E           - worker_23
E           - worker_10
E           - worker_16
E           - worker_15
E           - worker_1

test_cluster_logs/test_cluster_connection/test_cluster_connection.py:47: Failed
  • reliability/test_cluster/test_cluster_logs/test_cluster_error_logs/test_cluster_error_logs.py::test_cluster_error_logs
Testing: Look for any error messages in the logs of the cluster nodes.

It seems related to:

The main difference from the mentioned issue is that this behavior occurred throughout all the activity in all 25 workers; it was not related to the OS.

>       assert not nodes_with_errors, 'Errors were found in the "cluster.log" file of ' \
                                      'these nodes: \n- ' + '\n- '.join(nodes_with_errors)
E       AssertionError: Errors were found in the "cluster.log" file of these nodes: 
E         - worker_21
E         - worker_13
E         - worker_4
E         - worker_6
E         - worker_18
E         - worker_20
E         - worker_11
E         - worker_2
E         - worker_14
E         - worker_3
E         - worker_9
E         - worker_17
E         - worker_22
E         - worker_7
E         - worker_5
E         - worker_19
E         - worker_12
E         - worker_25
E         - worker_8
E         - worker_24
E         - worker_23
E         - worker_10
E         - worker_16
E         - worker_15
E         - worker_1
E       assert not {'worker_1': [b'2023/07/25 09:03:37 ERROR: [Worker CLUSTER-Workload_benchmarks_metrics_B274_manager_1] [Main] Error se...nager_12] [Main] Error sending sendsync response to local client: Error 3020 - Timeout sending request: ok', ...], ...}

test_cluster/test_cluster_logs/test_cluster_error_logs/test_cluster_error_logs.py:57: AssertionError

The detected ERROR always had the following form; a sketch for counting it across workers follows the log line:

2023/07/25 09:28:04 ERROR: [Worker CLUSTER-Workload_benchmarks_metrics_B274_manager_25] [Main] Error sending sendsync response to local client: Error 3020 - Timeout sending request: ok
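As a quick cross-check of the claim that this error shows up in every worker, a short script like the following counts the sendsync timeout errors per node. This is only a sketch: the artifacts path and file layout are assumptions based on the tests above, and the matched substring is taken from the log line quoted here.

# Sketch: count sendsync timeout errors ("Error 3020") in each worker's cluster.log.
# The artifacts path and layout are assumptions taken from the tests quoted above.
from glob import glob
from os.path import basename, dirname, join

ARTIFACTS_PATH = '/Users/framos/Desktop/artifacts'  # adjust to the artifacts in use
ERROR_TEXT = 'Error sending sendsync response to local client: Error 3020'

for log_file in sorted(glob(join(ARTIFACTS_PATH, 'worker_*', 'logs', 'cluster.log'))):
    with open(log_file, errors='replace') as f:
        hits = sum(ERROR_TEXT in line for line in f)
    worker = basename(dirname(dirname(log_file)))  # e.g. 'worker_25'
    print(f'{worker}: {hits} sendsync timeout error(s)')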
  • reliability/test_cluster/test_cluster_logs/test_cluster_worker_logs_order/test_cluster_worker_logs_order.py::test_check_logs_order_workers
Testing: Check that cluster logs appear in the expected order.

This pattern happens in every worker.

E            - Log type: Agent-groups recv
E              Expected logs: ['The checksum of both databases match.*Reset the attempt counter.', 'The checksum of both databases match.*', 'Checksum comparison failed. Attempt 10/10.*', 'Checksum comparison failed. Attempt .*', "The master's checksum and the worker's checksum are different. Local checksum:.*| Master checksum:.*."]
E              Found log: 2023/07/25 09:25:34 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B274_manager_21] [Agent-groups recv] The checksum of master (0107654add85786339eba7c5fa72cee25783dd6c) and worker (4ba9e4d09c32baef1d763fd81e3cf804aa06b607) are different.

It seems that the log messages in Integrity check and Agent-groups recv have changed; however, the messages clearly have the same meaning.
The log messages expected by the test should be updated to follow the new conventions, as sketched below.
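For illustration only, a pattern like the one below would cover the new wording for the Agent-groups recv checksum mismatch. This is a sketch, not the actual change to the test data; the pattern name is hypothetical and the sample line is the one quoted above.

# Sketch: a regex matching the new checksum-mismatch wording quoted above.
# NEW_CHECKSUM_MISMATCH is a hypothetical name, not the test's actual data.
import re

NEW_CHECKSUM_MISMATCH = r"The checksum of master \(\w+\) and worker \(\w+\) are different\."

found_log = ('2023/07/25 09:25:34 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B274_manager_21] '
             '[Agent-groups recv] The checksum of master (0107654add85786339eba7c5fa72cee25783dd6c) '
             'and worker (4ba9e4d09c32baef1d763fd81e3cf804aa06b607) are different.')

assert re.search(NEW_CHECKSUM_MISMATCH, found_log) is not None
print('New pattern matches the found log line.')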
This issue was updated:

Previous report
New report

@pro-akim pro-akim self-assigned this Aug 2, 2023
@pro-akim pro-akim changed the title Inconsistent Reliability test results Unstable Reliability test results Aug 2, 2023
@wazuhci wazuhci moved this to In progress in Release 4.6.0 Aug 2, 2023
@pro-akim
Member Author

pro-akim commented Aug 3, 2023

Update:

Tests were conducted to investigate whether the unstable results could be observed again; however, the results remain consistent. The attached reports show the same outcomes as the updated results reported in this issue.

Here are the links to the reports:

report1
report2
report3

It still seems peculiar that, when searching for specific entries from the old report results, some of the logs reported before the update cannot be found in the log files. However, it is evident that the same artifacts were used, since the dates of the footprinted logs remain consistent across the test results.

On the other hand, the only variation observed was the Python and Linux versions of the old report compared to the new ones.

Test reports will need to be monitored over time, watching for repetitions, in order to identify a common factor associated with the result variations.

@wazuhci wazuhci moved this from In progress to On hold in Release 4.6.0 Aug 3, 2023
@roronoasins roronoasins self-assigned this Aug 3, 2023
@roronoasins

I was not able to reproduce the first error

E                   Failed: Could not find "Sucessfully connected to master" message in the worker_21

but I could reproduce the second one:

        if disconnected_nodes:
>           pytest.fail(f'The following nodes disconnected from master at any point:\n- ' + '\n- '.join(disconnected_nodes))
E           Failed: The following nodes disconnected from master at any point:
E           - worker_3
E           - worker_1
E           - worker_2
E           - worker_24
E           - worker_21
E           - worker_22
E           - worker_23
E           - worker_11
E           - worker_20
E           - worker_9
E           - worker_15
E           - worker_17
E           - worker_18
E           - worker_14
E           - worker_12
E           - worker_4
E           - worker_10
E           - worker_7
E           - worker_25
E           - worker_6
E           - worker_13
E           - worker_8
E           - worker_5
E           - worker_19
E           - worker_16

It seems that some error lines from the first iteration appear in the attached artifacts, but the expected behavior is the one I obtained in this attempt. The failure itself is expected, because the worker logs contain the first successful connection as well as the subsequent connection loss and reconnection attempts.
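To double-check this by hand for a single worker, the test's logic can be replayed outside pytest and the flagged lines printed. The sketch below uses the corrected "Successfully" spelling on purpose and an example log path; both are assumptions, not the test's actual code.

# Sketch: replay the test's check for one worker log and print the lines it
# would flag. The path is an example and the corrected spelling is intentional.
import re

LOG_FILE = '/Users/framos/Desktop/artifacts/worker_21/logs/cluster.log'  # example path

with open(LOG_FILE, errors='replace') as f:
    lines = f.readlines()

first_conn = next((i for i, line in enumerate(lines)
                   if 'Successfully connected to master' in line), None)
if first_conn is None:
    print('No successful connection message found.')
else:
    print('First connection:', lines[first_conn].rstrip())
    pattern = re.compile(r'Could not connect to master\. Trying|Successfully connected to master')
    for line in lines[first_conn + 1:]:
        if pattern.search(line):
            print('After first connection:', line.rstrip())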


In addition, commit 2bde406 should also fix the error message

pytest.fail(f'Could not find "Sucessfully connected to master" message in the '
                            f'{node_name.search(log_file)[1]}')

and the following regex:

re.search(rb'^.*Could not connect to master. Trying.*$|^.*Sucessfully connected to master.*$',
                         s[conn.end():], flags=re.MULTILINE):
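To make the regex typo concrete, the snippet below shows that the misspelled pattern never matches the real reconnection message, while the corrected one does. The sample line is illustrative, not taken from the artifacts; the actual fix lives in commit 2bde406.

# Sketch: the misspelled "Sucessfully" pattern cannot match the real message,
# so reconnections are only caught via the "Could not connect" branch.
# The sample line below is illustrative, not an actual artifact log entry.
import re

sample = b'2023/07/25 09:10:00 INFO: [Worker ...] [Main] Successfully connected to master.'

typo_pattern = rb'^.*Sucessfully connected to master.*$'
fixed_pattern = rb'^.*Successfully connected to master.*$'

print(re.search(typo_pattern, sample, flags=re.MULTILINE))   # -> None
print(re.search(fixed_pattern, sample, flags=re.MULTILINE))  # -> match object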

@pro-akim
Member Author

pro-akim commented Aug 4, 2023

Update

Tests were executed from different venvs to check their stability across different environment settings.

report1
report2
report3

No differences were detected from the last report.

A new related issue was opened:

@wazuhci wazuhci moved this from In progress to On hold in Release 4.6.0 Aug 4, 2023
@pro-akim
Member Author

pro-akim commented Aug 4, 2023

Update

Unfortunately, despite conducting tests in different environments and on several occasions, we have been unable to reproduce the error.
The test's behavior was stable and as expected.
This issue will be kept as a reference in case a similar behavior occurs in future executions.
This issue is now moved to pending review.

@wazuhci wazuhci moved this from On hold to Pending review in Release 4.6.0 Aug 4, 2023
@jnasselle
Member

Update

Moving the ETA from Aug 4 to Aug 8 because release-related tasks were worked on during this issue.

@wazuhci wazuhci moved this from Pending review to In review in Release 4.6.0 Aug 8, 2023
@jnasselle
Member

Update

Moving the ETA from Aug 8 to Aug 9 because release-related tasks were worked on during the review of this issue.

@jnasselle
Member

LGTM! Great analysis @pro-akim @roronoasins
