add dry_run flag #44

Open
wants to merge 10 commits into
base: main

jburel
Member

@jburel jburel commented Aug 30, 2022

This PR re-activates the work started by @joshmoore in #8.

@jburel jburel requested a review from khaledk2 August 30, 2022 12:12
Set dry_run flag as part of vals to be consumed within the process pool
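
One way to read this commit: the flag travels with each work item instead of living in a global, so every worker process sees the same value. The following is a minimal sketch of that idea, not the code from this PR; `handle_chunk` and `run` are hypothetical names, and the actual indexing step is left out.

```python
from multiprocessing import Pool


def handle_chunk(vals):
    """Worker: unpack one work item; the dry_run flag travels with it."""
    chunk_id, rows, dry_run = vals
    if dry_run:
        # Dry run: report what would be indexed and touch nothing.
        return (chunk_id, len(rows), "skipped")
    # A real run would push `rows` to the index here (omitted in this sketch).
    return (chunk_id, len(rows), "indexed")


def run(chunks, dry_run, processes=2):
    """Fan the chunks out to a process pool, bundling dry_run into each item."""
    vals = [(i, rows, dry_run) for i, rows in enumerate(chunks)]
    with Pool(processes=processes) as pool:
        return pool.map(handle_chunk, vals)
```

On platforms that start workers by spawning a fresh interpreter, `run(...)` should be called from under an `if __name__ == "__main__":` guard.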
@jburel jburel mentioned this pull request Aug 31, 2022
@khaledk2
Collaborator

It may be a good idea to write the data to a JSON or CSV file rather than print it. What do you think?

@jburel
Member Author

jburel commented Aug 31, 2022

Printing to a file makes sense. Do we want a flag to indicate the output?
No flag: print to the console
output: specify the format, e.g. json or csv

@khaledk2
Collaborator

I think we do not need another flag, as we can set the dry_mode as an integer which can have three values:
0 off (default, i.e. push the data to the Elasticsearch index),
1 print to the console,
2 save to a file.
What do you think?

@jburel
Member Author

jburel commented Aug 31, 2022

You mean re-using the --dry_run flag, so that it takes an int rather than a boolean?
That might be confusing. In that case let's go for JSON, and if an error occurs when saving to JSON we print to the console instead.

validate the indexing if the dry_run is False
clean up the code for writing JSON file
@khaledk2
Collaborator

It is now writing a JSON file in case the dry_run is True, and printing to the console in case of any error.
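
The behaviour described here (write JSON, fall back to the console on any error) could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: `save_or_print` is a hypothetical name, and the choice of which exceptions count as "any error" is mine.

```python
import json


def save_or_print(docs, path):
    """Write docs to a JSON file; fall back to the console on any error.

    Returns True if the file was written, False if the fallback was used.
    """
    try:
        # Serialise first so a failure never leaves a half-written file.
        text = json.dumps(docs, indent=4)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        return True
    except (OSError, TypeError, ValueError) as error:
        print(f"Could not write {path}: {error}")
        print(docs)
        return False
```

Returning a boolean lets the caller log which path was taken without parsing console output.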

I have tested it and it seems to be working fine.

Use the JSON format which is used to insert the data into Elasticsearch.
@khaledk2
Collaborator

khaledk2 commented Sep 2, 2022

I have fixed the JSON format; it is now similar to the one used to insert the data into Elasticsearch.

@joshmoore I have deployed it in the pilot-idr0000-omeroreadwrite searchengine. You can use the following docker command to run it in dry_mode:

sudo docker run -d --name searchengine_d --rm -v /data/searchengine/searchengine/:/etc/searchengine/ --network=searchengine-net khaledk2/searchengine:latest get_index_data_from_database -d

It will save the results to files named data_n.json, where n = 1, 2, ...

The files are saved to:
/data/searchengine/searchengine/
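
For anyone inspecting the output, note that a plain alphabetical listing puts data_10.json before data_2.json. A small sketch for collecting the files in numeric order follows; `numbered_dumps` is a hypothetical helper, not part of the searchengine code, and it assumes the file names follow the data_n.json pattern exactly.

```python
import re
from pathlib import Path

# Matches data_1.json, data_2.json, ... and captures n.
_PATTERN = re.compile(r"^data_(\d+)\.json$")


def numbered_dumps(directory):
    """Return the data_n.json paths in `directory`, ordered by n."""
    found = []
    for path in Path(directory).iterdir():
        match = _PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Sort on the integer n, not on the file name string.
    return [path for _, path in sorted(found)]
```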

@khaledk2
Collaborator

khaledk2 commented Sep 2, 2022

The JSON is a list of dicts, each with a format like this:

{
        "_index": "image_keyvalue_pair_metadata",
        "_source": {
            "doc_type": "image_keyvalue_pair_metadata",
            "id": 1462,
            "owner_id": 2,
            "experiment": null,
            "group_id": 3,
            "name": "X_110222_S1 [Well C-11; Field #1]",
            "description": null,
            "project_name": null,
            "project_id": null,
            "dataset_name": null,
            "dataset_id": null,
            "screen_id": 3,
            "screen_name": "idr0001-graml-sysgro/screenA",
            "plate_id": 53,
            "plate_name": "X_110222_S1",
            "well_id": 293,
            "wellsample_id": 1939,
            "key_values": [
                {
                    "name": "Gene Identifier",
                    "value": "SPAC25G10.06",
                    "index": 0
                },
                {
                    "name": "Organism",
                    "value": "Schizosaccharomyces pombe",
                    "index": 0
                },
                {
                    "name": "Strain",
                    "value": "rps2801",
                    "index": 0
                },
                {
                    "name": "Channels",
                    "value": "GFP:endogenous alpha tubulin 2;Cascade blue:growth media",
                    "index": 1
                },
                {
                    "name": "Gene Identifier URL",
                    "value": "http://www.pombase.org/spombe/result/SPAC25G10.06",
                    "index": 1
                },
                {
                    "name": "Gene Symbol",
                    "value": "rps2801",
                    "index": 2
                },
                {
                    "name": "Replicate Group",
                    "value": "1",
                    "index": 2
                }
            ]
        }
    }
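
Each entry pairs an `_index` name with a `_source` document, which is the action shape the bulk helpers of the Elasticsearch Python client accept. As a rough sketch of how a dry-run file could be sanity-checked before a real run, the helper below counts documents per index and rejects entries missing either key. `summarise` and `summarise_file` are hypothetical names, and the check is deliberately shallow: it does not validate the fields inside `_source`.

```python
import json


def summarise(entries):
    """Count documents per index; raise if an entry lacks an expected key."""
    counts = {}
    for position, entry in enumerate(entries):
        missing = {"_index", "_source"}.difference(entry)
        if missing:
            raise ValueError(f"entry {position} is missing {sorted(missing)}")
        counts[entry["_index"]] = counts.get(entry["_index"], 0) + 1
    return counts


def summarise_file(path):
    """Load one dry-run JSON file (a list of dicts) and summarise it."""
    with open(path, encoding="utf-8") as handle:
        return summarise(json.load(handle))
```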
