Add test scripts to the tools folder #74

Open
wants to merge 28 commits into base: main

Conversation

khaledk2
Collaborator

I have added the scripts which we used to test the Elasticsearch cluster to the tools folder, along with some instructions to guide the user.
I have modified the code to copy them automatically to the host machine (searchengine/searchengine/maintenance_scripts/).
They include a script for indexing or re-indexing the data (index_data.sh) and a script to check the progress of the indexing process (check_indexing_process.sh).
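A rough usage sketch, assuming the scripts have already been copied to the maintenance_scripts folder on the host (paths may differ per deployment):

# run from the folder the scripts are copied to on the host
cd /data/searchengine/searchengine/maintenance_scripts/
# index (or re-index) the data
bash index_data.sh
# check the progress of the indexing process
bash check_indexing_process.sh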

@khaledk2
Collaborator Author

I have added a get_search_terms_from_log method to manage.py.
This method analyses the log file and generates CSV files containing statistics for the search terms used for each of the resources. The user should provide the folder which contains the log file(s).

@khaledk2
Collaborator Author

khaledk2 commented Mar 20, 2023

The methods can be used on the idr-testing server.

The following command will analyse the log files saved in /data/searchengine/searchengine/logs/ and generate three CSV files (report_image.csv, report_project.csv, and screen_project.csv) which contain statistics about the search terms.

sudo docker run --rm -v /data/searchengine/searchengine/logs/:/data/searchengine/searchengine/logs/ khaledk2/searchengine:test get_search_terms_from_log -l /data/searchengine/searchengine/logs/

When run for the first time, it will copy the maintenance scripts into the /data/searchengine/searchengine/maintenance_scripts/ folder. There is a description of the scripts in this file:

60177ad#diff-3e5af57bb465c1a51df3b132974d8cfda29bc937fe791761c142a4309c96c038

@jburel
Member

jburel commented May 12, 2023

@khaledk2 typo in the file name. It should be queries.
I think what matters is more the examples of complex queries using and/or filters than the number of queries.

@khaledk2
Collaborator Author

khaledk2 commented May 12, 2023

@jburel I have fixed the typo and renamed the script to complex_queries.

@khaledk2
Collaborator Author

khaledk2 commented Oct 30, 2023

I have added two endpoints to access some stats:

Each returns an Excel file which contains three sheets (image, project and screen).

  • The first one returns some metadata about the most common key/value pairs for each resource
  • The second one returns the most searched terms for each resource (i.e. image, project, screen)

They have been deployed on the idr-testing server.

  • The following URL returns the metadata, which contains the attribute, the number of unique buckets and the number of images:

https://idr-testing.openmicroscopy.org/searchengine/api/stats/metadata

  • The following URL returns the most common search terms:

https://idr-testing.openmicroscopy.org/searchengine/api/stats/searchterms
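As a quick sketch of fetching these from the command line (curl and the output file names here are only illustrative):

# download the metadata stats as an Excel file
curl -o stats_metadata.xlsx https://idr-testing.openmicroscopy.org/searchengine/api/stats/metadata
# download the most common search terms as an Excel file
curl -o stats_searchterms.xlsx https://idr-testing.openmicroscopy.org/searchengine/api/stats/searchterms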


* The searchEngine functions can be tested using the ``check_searchengine_health.sh`` script. The script takes about 15 minutes to run. The script output is saved to a text file check_report.txt in the``/data/searchengine/searchengine/`` folder.

* It is possible to stop an elasticsearch cluster node using this script::

Suggested change
* It is possible to stop an elasticsearch cluster node using this script::
* It is possible to stop an elasticsearch cluster node using this script (replace n with an integer, e.g. 1,2,3)::

* It is possible to stop an elasticsearch cluster node using this script::

bash stop_node.sh n
where n is an integer, e.g. 1,2, 3.

Suggested change
where n is an integer, e.g. 1,2, 3.


* The ``check_cluster_health.sh`` script is used to check the cluster status at any time.

* The searchEngine functions can be tested using the ``check_searchengine_health.sh`` script. The script takes about 15 minutes to run. The script output is saved to a text file check_report.txt in the``/data/searchengine/searchengine/`` folder.

Suggested change
* The searchEngine functions can be tested using the ``check_searchengine_health.sh`` script. The script takes about 15 minutes to run. The script output is saved to a text file check_report.txt in the``/data/searchengine/searchengine/`` folder.
* The searchEngine functions can be tested using the ``check_searchengine_health.sh`` script. The script takes about 15 minutes to run. The script output is saved to a text file check_report.txt in the ``/data/searchengine/searchengine/`` folder.
The added space will hopefully fix the formatting issue

where n is an integer, e.g. 1,2, 3.
* backup_elasticsearch_data.sh script is used to backup the Elasticsearch data.

* It is possible to index or re-index the data using this bash ``scrpt index_data.sh``.

Suggested change
* It is possible to index or re-index the data using this bash ``scrpt index_data.sh``.
* It is possible to index or re-index the data using the ``index_data.sh`` script.


* It is possible to index or re-index the data using this bash ``scrpt index_data.sh``.

* It is possible to restore the Elasticsearch data from the backup (snapshot) using the following command::

Suggested change
* It is possible to restore the Elasticsearch data from the backup (snapshot) using the following command::
* Restore the Elasticsearch data from the backup (snapshot) using the following command::


bash restore_elasticsearch_data.sh

* It may take up to 15 minutes to restore the data.

This should not be a new bullet point. It is an explanation for the previous bullet point.


* It may take up to 15 minutes to restore the data.

* The ``check_indexing_process.sh`` script is used to check the indexing data progress.

Suggested change
* The ``check_indexing_process.sh`` script is used to check the indexing data progress.
* Check the progress of the data indexing using the ``check_indexing_process.sh`` script.

@pwalczysko
Member

pwalczysko commented Nov 20, 2023

Studied the two produced excel sheets. I found both of them very useful.

  1. searchterms excel sheet:

[Screenshot (2023-11-20): searchterms sheet with an example header row]

  • Insert a header (a separate top row) explaining that the hits are attempted searches with KVPs. See the example above.
  • Show a column diagram in addition to the pie chart
  • Show the first 5 values of the KVPs which were searched for (e.g. Cell Line: Hela, Cell Line: blah1, Cell Line: Blah2, Cell Line: Blah3, Cell Line: Blah4, with the numbers in a column graph representation, to make clear which Cell Lines people search for in the first place)
  • Try to remove nonsensical KVP terms which were inserted by a user searching for nonsense strings; if the string is not in IDR, remove it.
  • Use the term Container (with an explanation) rather than Project and Screen
  2. metadata excel sheet:
  • Explain that "Bucket" means unique term.
  • Do not leave the Publication Title etc. just for Project; we must have them for Screen too, and please also produce a summary for both Projects and Screens.

Member

@pwalczysko left a comment


Some text formatting suggestions have been made. Also, please improve the layout of the resulting Excel sheets as per #74 (comment)

@khaledk2
Collaborator Author

I have implemented the suggested modifications, and they have been deployed on the idr-testing server.
By default, the following endpoint displays the first 5 values of the KVPs which were searched for, along with the number of searches for each:
https://idr-testing.openmicroscopy.org/searchengine/api/stats/searchterms
The user can specify the number of returned KVP values by setting return_values in the URL, e.g. to return 4 values only:
https://idr-testing.openmicroscopy.org/searchengine/api/stats/searchterms?return_values=4
It is also possible to return all the searched values of the KVPs by setting return_values=all:
https://idr-testing.openmicroscopy.org/searchengine/api/stats/searchterms?return_values=all
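For example (a sketch; curl and the output file names are only illustrative):

# default: first 5 searched values per key
curl -o searchterms_top5.xlsx https://idr-testing.openmicroscopy.org/searchengine/api/stats/searchterms
# return 4 values only
curl -o searchterms_top4.xlsx "https://idr-testing.openmicroscopy.org/searchengine/api/stats/searchterms?return_values=4"
# return all searched values
curl -o searchterms_all.xlsx "https://idr-testing.openmicroscopy.org/searchengine/api/stats/searchterms?return_values=all"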

@pwalczysko
Member

Thank you @khaledk2. Imho, this is very helpful. Lgtm.

@khaledk2
Collaborator Author

The stats endpoints (search terms and metadata) have been secured, so a username and password are now required to access them.
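For example, assuming HTTP basic authentication (the authentication scheme and the username below are assumptions; adjust to the actual deployment):

# curl prompts for the password when only the username is given
curl -u your_username -o searchterms.xlsx https://idr-testing.openmicroscopy.org/searchengine/api/stats/searchterms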
