Unstable Reliability test results #4386

Closed
pro-akim opened this issue Aug 2, 2023 · 8 comments

@pro-akim
Member

pro-akim commented Aug 2, 2023

Description

Different results were obtained when running the reliability tests over the same artifacts from the workload benchmark test.
This should be investigated.

Details

Performed: wazuh/wazuh#18017
A member of the framework team tried to recreate #4364 and obtained a different reliability report.

Running pytest on test_cluster_logs/test_cluster_connection, the following was reported:

  ~/Doc/W/R/wazuh-qa/t/r/test_cluster on   master ❯ python3 -m pytest test_cluster_logs/test_cluster_connection/test_cluster_connection.py --artifacts_path='/Users/framos/Desktop/artifacts' --html=report.html
================================================================= test session starts ==================================================================
platform darwin -- Python 3.9.6, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /Users/framos/Documents/Wazuh/Repositories/wazuh-qa/tests, configfile: pytest.ini
plugins: testinfra-5.0.0, html-3.1.1, metadata-2.0.4
collected 1 item

test_cluster_logs/test_cluster_connection/test_cluster_connection.py F                                                                           [100%]

======================================================================= FAILURES =======================================================================
_______________________________________________________________ test_cluster_connection ________________________________________________________________

artifacts_path = '/Users/framos/Desktop/artifacts'

    def test_cluster_connection(artifacts_path):
        """Verify that no worker disconnects from the master once they are connected.

        For each worker, this test looks for the first successful connection message
        in its logs. Then it looks for any failed connection attempts after the successful
        connection found above.

        Args:
            artifacts_path (str): Path where folders with cluster information can be found.
        """
        if not artifacts_path:
            pytest.fail("Parameter '--artifacts_path=<path>' is required.")

        cluster_log_files = glob(join(artifacts_path, 'worker_*', 'logs', 'cluster.log'))
        if len(cluster_log_files) == 0:
            pytest.fail(f'No files found inside {artifacts_path}.')

        for log_file in cluster_log_files:
            with open(log_file) as f:
                s = mmap(f.fileno(), 0, access=ACCESS_READ)
                # Search first successful connection message.
                conn = re.search(rb'^.*Successfully connected to master.*$', s, flags=re.MULTILINE)
                if not conn:
                    pytest.fail(f'Could not find "Sucessfully connected to master" message in the '
                                f'{node_name.search(log_file)[1]}')

                # Search if there are any connection attempts after the message found above.
                if re.search(rb'^.*Could not connect to master. Trying.*$|^.*Sucessfully connected to master.*$',
                             s[conn.end():], flags=re.MULTILINE):
                    disconnected_nodes.append(node_name.search(log_file)[1])

        if disconnected_nodes:
>           pytest.fail(f'The following nodes disconnected from master at any point:\n- ' + '\n- '.join(disconnected_nodes))
E           Failed: The following nodes disconnected from master at any point:
E           - worker_2
E           - worker_5
E           - worker_4
E           - worker_3
E           - worker_15
E           - worker_12
E           - worker_24
E           - worker_23
E           - worker_22
E           - worker_25
E           - worker_13
E           - worker_14
E           - worker_6
E           - worker_1
E           - worker_8
E           - worker_9
E           - worker_7
E           - worker_11
E           - worker_16
E           - worker_20
E           - worker_18
E           - worker_19
E           - worker_21
E           - worker_17
E           - worker_10

test_cluster_logs/test_cluster_connection/test_cluster_connection.py:47: Failed
-------------- generated html file: file:///Users/framos/Documents/Wazuh/Repositories/wazuh-qa/tests/reliability/test_cluster/report.html --------------
=============================================================== short test summary info ================================================================
FAILED test_cluster_logs/test_cluster_connection/test_cluster_connection.py::test_cluster_connection - Failed: The following nodes disconnected from ...
================================================================== 1 failed in 0.84s ===================================================================

Meanwhile, the failure in the first run was:

                if not conn:
>                   pytest.fail(f'Could not find "Sucessfully connected to master" message in the '
                                f'{node_name.search(log_file)[1]}')
E                   Failed: Could not find "Sucessfully connected to master" message in the worker_21

Therefore, using the same artifact downloaded from 274 and running the same test one more time, the two reports are different:

Previous report
New report

Checking the data from the logs, both runs worked with the same data (the dates are identical in reliability/test_cluster/test_cluster_logs/test_cluster_worker_logs_order/test_cluster_worker_logs_order.py::test_check_logs_order_workers).
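For reference, a quick way to confirm that two runs used the same artifacts is to compare the first and last timestamps in each worker's cluster.log. The sketch below is only illustrative: the worker_*/logs/cluster.log layout is taken from the test quoted above, and the timestamp format is assumed from the log lines shown in this issue.

# Minimal sketch: print the first and last log timestamps per worker so two
# report runs can be compared by hand. The artifacts layout is an assumption
# based on the test above; the timestamp format is assumed from this issue.
import re
from glob import glob
from os.path import join

ARTIFACTS_PATH = '/Users/framos/Desktop/artifacts'  # same value used in the run above
TIMESTAMP = re.compile(r'^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}')

for log_file in sorted(glob(join(ARTIFACTS_PATH, 'worker_*', 'logs', 'cluster.log'))):
    with open(log_file, errors='replace') as f:
        stamps = [m.group(0) for line in f if (m := TIMESTAMP.match(line))]
    if stamps:
        print(f'{log_file}: {stamps[0]} .. {stamps[-1]}')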

@pro-akim
Member Author

pro-akim commented Aug 2, 2023

Updated reliability test results:

3/5 tests failed.

Failures:

  • reliability/test_cluster/test_cluster_logs/test_cluster_connection/test_cluster_connection.py::test_cluster_connection

Testing: Verify that no worker disconnects from the master once they are connected.

  if disconnected_nodes:
>           pytest.fail(f'The following nodes disconnected from master at any point:\n- ' + '\n- '.join(disconnected_nodes))
E           Failed: The following nodes disconnected from master at any point:
E           - worker_21
E           - worker_13
E           - worker_4
E           - worker_6
E           - worker_18
E           - worker_20
E           - worker_11
E           - worker_2
E           - worker_14
E           - worker_3
E           - worker_9
E           - worker_17
E           - worker_22
E           - worker_7
E           - worker_5
E           - worker_19
E           - worker_12
E           - worker_25
E           - worker_8
E           - worker_24
E           - worker_23
E           - worker_10
E           - worker_16
E           - worker_15
E           - worker_1

test_cluster_logs/test_cluster_connection/test_cluster_connection.py:47: Failed
  • reliability/test_cluster/test_cluster_logs/test_cluster_error_logs/test_cluster_error_logs.py::test_cluster_error_logs
Testing: Look for any error messages in the logs of the cluster nodes.

It seems related to:

The main difference from the mentioned issue is that this behavior occurred throughout all the activity in all 25 workers; it was not related to the OS.

>       assert not nodes_with_errors, 'Errors were found in the "cluster.log" file of ' \
                                      'these nodes: \n- ' + '\n- '.join(nodes_with_errors)
E       AssertionError: Errors were found in the "cluster.log" file of these nodes: 
E         - worker_21
E         - worker_13
E         - worker_4
E         - worker_6
E         - worker_18
E         - worker_20
E         - worker_11
E         - worker_2
E         - worker_14
E         - worker_3
E         - worker_9
E         - worker_17
E         - worker_22
E         - worker_7
E         - worker_5
E         - worker_19
E         - worker_12
E         - worker_25
E         - worker_8
E         - worker_24
E         - worker_23
E         - worker_10
E         - worker_16
E         - worker_15
E         - worker_1
E       assert not {'worker_1': [b'2023/07/25 09:03:37 ERROR: [Worker CLUSTER-Workload_benchmarks_metrics_B274_manager_1] [Main] Error se...nager_12] [Main] Error sending sendsync response to local client: Error 3020 - Timeout sending request: ok', ...], ...}

test_cluster/test_cluster_logs/test_cluster_error_logs/test_cluster_error_logs.py:57: AssertionError

The detected ERROR always had the following form; a sketch for counting it across workers follows the log line:

2023/07/25 09:28:04 ERROR: [Worker CLUSTER-Workload_benchmarks_metrics_B274_manager_25] [Main] Error sending sendsync response to local client: Error 3020 - Timeout sending request: ok
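As a quick cross-check of the claim that this error shows up in every worker, a short script like the following counts the sendsync timeout errors per node. This is only a sketch: the artifacts path and file layout are assumptions based on the tests above, and the matched substring is taken from the log line quoted here.

# Sketch: count sendsync timeout errors ("Error 3020") in each worker's cluster.log.
# The artifacts path and layout are assumptions taken from the tests quoted above.
from glob import glob
from os.path import basename, dirname, join

ARTIFACTS_PATH = '/Users/framos/Desktop/artifacts'  # adjust to the artifacts in use
ERROR_TEXT = 'Error sending sendsync response to local client: Error 3020'

for log_file in sorted(glob(join(ARTIFACTS_PATH, 'worker_*', 'logs', 'cluster.log'))):
    with open(log_file, errors='replace') as f:
        hits = sum(ERROR_TEXT in line for line in f)
    worker = basename(dirname(dirname(log_file)))  # e.g. 'worker_25'
    print(f'{worker}: {hits} sendsync timeout error(s)')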
  • reliability/test_cluster/test_cluster_logs/test_cluster_worker_logs_order/test_cluster_worker_logs_order.py::test_check_logs_order_workers
Testing: Check that cluster logs appear in the expected order.

This pattern happens in every worker.

E            - Log type: Agent-groups recv
E              Expected logs: ['The checksum of both databases match.*Reset the attempt counter.', 'The checksum of both databases match.*', 'Checksum comparison failed. Attempt 10/10.*', 'Checksum comparison failed. Attempt .*', "The master's checksum and the worker's checksum are different. Local checksum:.*| Master checksum:.*."]
E              Found log: 2023/07/25 09:25:34 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B274_manager_21] [Agent-groups recv] The checksum of master (0107654add85786339eba7c5fa72cee25783dd6c) and worker (4ba9e4d09c32baef1d763fd81e3cf804aa06b607) are different.

It seems that the log messages in Integrity check and Agent-groups recv have changed; however, the messages clearly have the same meaning.
The log messages expected by the test should be updated to follow the new conventions, as sketched below.
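For illustration only, a pattern like the one below would cover the new wording for the Agent-groups recv checksum mismatch. This is a sketch, not the actual change to the test data; the pattern name is hypothetical and the sample line is the one quoted above.

# Sketch: a regex matching the new checksum-mismatch wording quoted above.
# NEW_CHECKSUM_MISMATCH is a hypothetical name, not the test's actual data.
import re

NEW_CHECKSUM_MISMATCH = r"The checksum of master \(\w+\) and worker \(\w+\) are different\."

found_log = ('2023/07/25 09:25:34 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B274_manager_21] '
             '[Agent-groups recv] The checksum of master (0107654add85786339eba7c5fa72cee25783dd6c) '
             'and worker (4ba9e4d09c32baef1d763fd81e3cf804aa06b607) are different.')

assert re.search(NEW_CHECKSUM_MISMATCH, found_log) is not None
print('New pattern matches the found log line.')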
This issue was updated:

Previous report
New report

@pro-akim pro-akim self-assigned this Aug 2, 2023
@pro-akim pro-akim changed the title Inconsistent Reliability test results Unstable Reliability test results Aug 2, 2023
@wazuhci wazuhci moved this to In progress in Release 4.6.0 Aug 2, 2023
@pro-akim
Member Author

pro-akim commented Aug 3, 2023

Update:

Tests were conducted to investigate whether the unstable results could be observed again; however, the results remain consistent. The attached reports show the same outcomes as the updated results reported in this issue.

Here are the links to the reports:

report1
report2
report3

It still seems peculiar that, when searching for specific entries from the old report results, some of the logs reported before the update cannot be found in the log files. However, it is evident that the same artifacts were used, since the dates of the footprinted logs remain consistent across the test results.

On the other hand, the only variation observed was the Python and Linux versions of the old report compared to the new ones.

Test reports will need to be monitored over time, watching for repetitions, in order to identify a common factor associated with the result variations.

@wazuhci wazuhci moved this from In progress to On hold in Release 4.6.0 Aug 3, 2023
@roronoasins roronoasins self-assigned this Aug 3, 2023
@roronoasins

I was not able to reproduce the first error

E                   Failed: Could not find "Sucessfully connected to master" message in the worker_21

but I could reproduce the second one:

        if disconnected_nodes:
>           pytest.fail(f'The following nodes disconnected from master at any point:\n- ' + '\n- '.join(disconnected_nodes))
E           Failed: The following nodes disconnected from master at any point:
E           - worker_3
E           - worker_1
E           - worker_2
E           - worker_24
E           - worker_21
E           - worker_22
E           - worker_23
E           - worker_11
E           - worker_20
E           - worker_9
E           - worker_15
E           - worker_17
E           - worker_18
E           - worker_14
E           - worker_12
E           - worker_4
E           - worker_10
E           - worker_7
E           - worker_25
E           - worker_6
E           - worker_13
E           - worker_8
E           - worker_5
E           - worker_19
E           - worker_16

It seems that some error lines from the first iteration appear in the attached artifacts, but the expected behavior is the one I obtained in this attempt. The failure itself is expected, because the worker logs contain the first successful connection as well as the subsequent connection loss and reconnection attempts.
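To double-check this by hand for a single worker, the test's logic can be replayed outside pytest and the flagged lines printed. The sketch below uses the corrected "Successfully" spelling on purpose and an example log path; both are assumptions, not the test's actual code.

# Sketch: replay the test's check for one worker log and print the lines it
# would flag. The path is an example and the corrected spelling is intentional.
import re

LOG_FILE = '/Users/framos/Desktop/artifacts/worker_21/logs/cluster.log'  # example path

with open(LOG_FILE, errors='replace') as f:
    lines = f.readlines()

first_conn = next((i for i, line in enumerate(lines)
                   if 'Successfully connected to master' in line), None)
if first_conn is None:
    print('No successful connection message found.')
else:
    print('First connection:', lines[first_conn].rstrip())
    pattern = re.compile(r'Could not connect to master\. Trying|Successfully connected to master')
    for line in lines[first_conn + 1:]:
        if pattern.search(line):
            print('After first connection:', line.rstrip())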


In addition, commit 2bde406 should also fix the error message

pytest.fail(f'Could not find "Sucessfully connected to master" message in the '
                            f'{node_name.search(log_file)[1]}')

and the following regex:

re.search(rb'^.*Could not connect to master. Trying.*$|^.*Sucessfully connected to master.*$',
                         s[conn.end():], flags=re.MULTILINE):
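To make the regex typo concrete, the snippet below shows that the misspelled pattern never matches the real reconnection message, while the corrected one does. The sample line is illustrative, not taken from the artifacts; the actual fix lives in commit 2bde406.

# Sketch: the misspelled "Sucessfully" pattern cannot match the real message,
# so reconnections are only caught via the "Could not connect" branch.
# The sample line below is illustrative, not an actual artifact log entry.
import re

sample = b'2023/07/25 09:10:00 INFO: [Worker ...] [Main] Successfully connected to master.'

typo_pattern = rb'^.*Sucessfully connected to master.*$'
fixed_pattern = rb'^.*Successfully connected to master.*$'

print(re.search(typo_pattern, sample, flags=re.MULTILINE))   # -> None
print(re.search(fixed_pattern, sample, flags=re.MULTILINE))  # -> match object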

@pro-akim
Member Author

pro-akim commented Aug 4, 2023

Update

Tests were executed from different venvs to check their stability across different environment settings.

report1
report2
report3

No differences were detected from the last report.

A new related issue was opened:

@wazuhci wazuhci moved this from In progress to On hold in Release 4.6.0 Aug 4, 2023
@pro-akim
Member Author

pro-akim commented Aug 4, 2023

Update

Unfortunately, despite conducting tests in different environments and on several occasions, we have been unable to reproduce the error.
The test's behavior was stable and as expected.
This issue will be kept as a reference in case a similar behavior occurs in future executions.
This issue is now moved to pending review.

@wazuhci wazuhci moved this from On hold to Pending review in Release 4.6.0 Aug 4, 2023
@jnasselle
Member

Update

Moving the ETA from Aug 4 to Aug 8 because release-related tasks were worked on during this issue.

@wazuhci wazuhci moved this from Pending review to In review in Release 4.6.0 Aug 8, 2023
@jnasselle
Member

Update

Moving the ETA from Aug 8 to Aug 9 because release-related tasks were worked on during the review of this issue.

@jnasselle
Member

LGTM! Great analysis @pro-akim @roronoasins
