A microservice which parses CSV files of COVID-19 sample information, validates the information and saves valid data to MongoDB.
- Requirements for Development
- Getting Started
- Running
- Migrations
- Priority Samples
- Testing
- Formatting, Type Checking and Linting
- Miscellaneous
The following tools are required for development:
- python (use pyenv or something similar to install the python version specified in the Pipfile)
- install the required packages using pipenv:
  brew install pipenv
- Optionally, to test SFTP, this Docker image is helpful.
- mongodb (currently 4.2 is running in production)
  brew tap mongodb/brew
  brew install mongodb-community@4.2
  brew services start mongodb-community@4.2
- To support the parsing of messages from RabbitMQ instead of via SFTP, both RabbitMQ and Redpanda must be available. Running these from Docker is highly recommended. Follow the instructions under the Docker section of this document to bring up the dependencies and ensure they are available.
Some activities in Crawler require additional dependencies from other repositories.
If you intend to test/develop/use the test data generation functionality of Crawler at the /v1/cherrypick-test-data endpoint, you will also need to be running a local instance of Baracoda. If you need to change the port Crawler uses to contact Baracoda, you can do so in the BARACODA_BASE_URL in the config file crawler/config/defaults.py.
If you will be sending RabbitMQ messages to update plate map samples, the system will check Cherrytrack to see whether the plate has already been picked. Cherrytrack can be run locally from a clone of its repository. If you need Crawler to communicate with Cherrytrack on a different port, you can change the port used by updating the CHERRYTRACK_BASE_URL in the config file crawler/config/defaults.py.
The app is set to run with development settings when not deployed via Ansible. To change this you can update the line in .flaskenv to another module if desired:
SETTINGS_MODULE=crawler.config.development
Install the required dependencies:
pipenv install --dev
Once all the required packages are installed, enter the virtual environment with (this will also load the .env file):
pipenv shell
Crawler requires access to an instance of RabbitMQ, which is among the dependencies Docker Compose can set up for you. Otherwise, if you have a local instance of RabbitMQ, it can also be used after changing the RabbitMQ host and port in ./setup_dev_rabbit.py and in ./crawler/config/defaults.py. Crawler will generate errors if the expected resources are not present in RabbitMQ, so run the following to generate those in your dev environment:
python ./setup_dev_rabbit.py
To then run the app, use the command:
flask run
This will cause the crawler to execute an ingest every 30 minutes, triggered by cron, so at 10 and 40 minutes past the hour.
This scheduled behaviour can be turned off by adding the following to the development.py file:
SCHEDULER_RUN = False
You can also adjust the behaviour of the scheduled ingest using the settings in the same file.
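To make the schedule described above concrete, here is a minimal sketch that computes when the next scheduled ingest would fire, assuming runs at 10 and 40 minutes past every hour. The names INGEST_MINUTES and next_ingest_time are hypothetical and are not part of Crawler; the real schedule is driven by the scheduler settings in the config file.

```python
from datetime import datetime, timedelta

# Minutes past the hour at which the scheduled ingest fires (taken from the text above).
# INGEST_MINUTES and next_ingest_time are illustrative names, not Crawler settings.
INGEST_MINUTES = (10, 40)


def next_ingest_time(now: datetime) -> datetime:
    """Return the first scheduled ingest time strictly after `now`."""
    base = now.replace(second=0, microsecond=0)
    for minute in INGEST_MINUTES:
        candidate = base.replace(minute=minute)
        if candidate > now:
            return candidate
    # Both slots in this hour have passed, so use the first slot of the next hour.
    return (base + timedelta(hours=1)).replace(minute=INGEST_MINUTES[0])
```

For example, calling next_ingest_time(datetime(2021, 1, 1, 12, 5)) gives 12:10 on the same day, and a call made at 23:50 rolls over to 00:10 on the following day.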
To run an ingest immediately, whether Flask is running or not, the runner.py file can be used with the arguments shown:
python runner.py --help
usage: runner.py [-h] [--sftp] [--keep-files] [--add-to-dart] [--centre_prefix {ALDP,MILK,QEUH,CAMC,RAND,HSLL,PLYM,BRBR}]
Parse CSV files from the Lighthouse Labs and store the sample information in MongoDB
optional arguments:
-h, --help show this help message and exit
--sftp use SFTP to download CSV files, defaults to using local files
--keep-files keeps the CSV files after the runner has been executed
--add-to-dart on processing samples, also add them to DART
--centre_prefix {ALDP,MILK,QEUH,CAMC,RAND,HSLL,PLYM,BRBR}
process only this centre's plate map files
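For readers who want to script around the runner, the following is a rough sketch of an argparse parser that would accept the same arguments as the help output above. It is an assumption about how such a parser could look, not the actual contents of runner.py; build_parser and CENTRE_PREFIXES are hypothetical names.

```python
import argparse

# Centre prefixes copied from the help output above.
CENTRE_PREFIXES = ["ALDP", "MILK", "QEUH", "CAMC", "RAND", "HSLL", "PLYM", "BRBR"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented runner.py options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="runner.py",
        description="Parse CSV files from the Lighthouse Labs and store the sample information in MongoDB",
    )
    parser.add_argument(
        "--sftp",
        action="store_true",
        help="use SFTP to download CSV files, defaults to using local files",
    )
    parser.add_argument(
        "--keep-files",
        action="store_true",
        help="keeps the CSV files after the runner has been executed",
    )
    parser.add_argument(
        "--add-to-dart",
        action="store_true",
        help="on processing samples, also add them to DART",
    )
    parser.add_argument(
        "--centre_prefix",
        choices=CENTRE_PREFIXES,
        help="process only this centre's plate map files",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this sketch, parsing ["--sftp", "--centre_prefix", "MILK"] yields sftp=True, keep_files=False, add_to_dart=False and centre_prefix="MILK".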
When the crawler process runs every 30 minutes, it should update the MLWH lighthouse_sample table as it goes with records for all rows that are inserted into MongoDB. If that MLWH insert process fails, you should see a critical exception for the file in Lighthouse-UI. This may happen after the records were inserted correctly into MongoDB, in which case re-running the file will not re-attempt the MLWH inserts.
There is a manual migration task (update_mlwh_with_legacy_samples) that can be run to fix this discrepancy. It allows insertion of rows into the MLWH between two MongoDB created_at datetimes.
NB: Both datetimes are inclusive: the range includes rows greater than or equal to the start datetime and less than or equal to the end datetime.
Usage (inside pipenv shell):
python run_migration.py update_mlwh_with_legacy_samples 200115_1200 200116_1600
Where the time format is YYMMDD_HHmm. Both start and end timestamps must be present.
The process should not duplicate rows that are already present in MLWH, so you can be generous with your timestamp range.
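As an illustration of the timestamp format and the inclusive range described above, here is a minimal sketch that parses the two command-line timestamps and builds a MongoDB filter on created_at. The helper name created_at_filter and the constant MIGRATION_TIMESTAMP_FORMAT are hypothetical; the migration itself may construct its query differently.

```python
from datetime import datetime

# strptime equivalent of the documented YYMMDD_HHmm format (hypothetical constant name).
MIGRATION_TIMESTAMP_FORMAT = "%y%m%d_%H%M"


def created_at_filter(start: str, end: str) -> dict:
    """Build an inclusive MongoDB filter on created_at from two YYMMDD_HHmm strings."""
    start_dt = datetime.strptime(start, MIGRATION_TIMESTAMP_FORMAT)
    end_dt = datetime.strptime(end, MIGRATION_TIMESTAMP_FORMAT)
    if start_dt > end_dt:
        raise ValueError("start datetime must not be after end datetime")
    # $gte and $lte make both ends of the range inclusive.
    return {"created_at": {"$gte": start_dt, "$lte": end_dt}}
```

Under these assumptions, the arguments 200115_1200 and 200116_1600 select rows created from 12:00 on 15 January 2020 up to and including 16:00 on 16 January 2020.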
When the Beckman robots come online, we need to populate the DART database with the filtered positive samples that are available physically. This can be achieved using the 'update_dart' migration.
This can also be used similarly to the existing MLWH migration: if a DART insert process fails, you will see a critical exception for the file in the Lighthouse-UI. After addressing the reason for the failure, run the migration between the relevant timestamps to re-insert/update data into DART.
In short, this migration performs the following steps:
- Gets the RESULT = positive samples (which are not controls) from mongo between a start and end date
- Removes samples from this list which have already been cherrypicked by inspecting the events in the MLWH
- Determines whether they are filtered positive samples using the latest rule
- Determines the plate barcode UUID
- Updates mongo with the filtered positive and UUID values
- Updates MLWH with the same filtered positive and UUID values
- Creates/updates the DART database with all the positive samples, setting the filtered positive samples as 'pickable'
To run the migration:
python run_migration.py update_mlwh_and_dart_with_legacy_samples 200115_1200 200116_1600
Where the time format is YYMMDD_HHmm. Both start and end timestamps must be present.
If a sample is prioritised (has the must_sequence flag set) it will be treated the same as a fit_to_pick sample.
During the prioritisation run (after all the centres' files have been processed), any existing priority samples flagged as 'unprocessed' will be:
- Updated in the MLWH lighthouse_sample table with the values of the priority (must_sequence and preferentially_sequence) added to it
- Inserted in DART as 'pickable' if the plate is in state pending
- Updated as 'processed' in mongo so it won't be processed again unless there is a change for it
This will be applied with the following set of rules:
- All records in mongodb from the priority_samples collection where processed is true will be ignored
- All new updates of prioritisation will be updated in the MLWH
- If the sample is in a plate that is not in a pending state, no updates will be performed in DART for this sample, even if there is a new prioritisation set for it
- If the sample has filtered_positive set to true, the sample will be flagged as 'pickable' in DART
- If the sample has a 'must_sequence' priority setting, the sample will be flagged as 'pickable' in DART (even when the filtered_positive for that sample is set to 'false')
- If the sample changes its prioritisation, the setting for 'pickable' will be removed in DART
- After a record from the priority_samples collection has been processed, it will be flagged by setting processed to true
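The DART-related rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name dart_pickable_update, its arguments and its return convention are hypothetical, and it reads "the setting for 'pickable' will be removed" as returning False when a sample is neither filtered positive nor must_sequence.

```python
from typing import Optional


def dart_pickable_update(
    priority_sample: dict, plate_state: str, filtered_positive: bool
) -> Optional[bool]:
    """Return the DART 'pickable' value to set, or None when DART is left untouched.

    `priority_sample` is a record shaped like those in the priority_samples
    collection; only its processed and must_sequence fields are read here.
    """
    if priority_sample.get("processed"):
        # Records already marked as processed are ignored.
        return None
    if plate_state != "pending":
        # No DART updates are performed for plates that are not pending.
        return None
    # must_sequence makes a sample pickable even when it is not filtered positive.
    return bool(filtered_positive or priority_sample.get("must_sequence"))
```

Returning None rather than False for the skipped cases keeps "leave DART alone" distinct from "clear the pickable flag", which the rules treat as different outcomes.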
This is a history of past and current rules by which positive samples are further filtered and identified as 'filtered positive'. Note that any rule change requires the update_filtered_positives migration to be run, as outlined in the relevant section below.
The implementation of the current version can be found in FilteredPositiveIdentifier, with the implementation of previous versions (if any) in the git history.
A sample is filtered positive if:
- it has a positive RESULT
This is the pre-"fit-to-pick" implementation, without any extra filtering on top of the RESULT=Positive requirement.
A sample is filtered positive if:
- it has a positive RESULT
- it is not a control (ROOT_SAMPLE_ID does not start with 'CBIQA_')
- all of CH1_CQ, CH2_CQ and CH3_CQ are None, or one of these is less than or equal to 30
More information on this version can be found on this Confluence page.
A sample is filtered positive if:
- it has a 'Positive' RESULT
- it is not a control (ROOT_SAMPLE_ID does not start with 'CBIQA_', 'QC0', or 'ZZA000')
- all of CH1_CQ, CH2_CQ and CH3_CQ are None, or one of these is less than or equal to 30
More information on this version can be found on this Confluence page.
A sample is filtered positive if:
- it has a 'Positive' RESULT
- it is not a control (ROOT_SAMPLE_ID does not start with 'CBIQA_', 'QC0', or 'ZZA')
- all of CH1_CQ, CH2_CQ and CH3_CQ are None, or one of these is less than or equal to 30
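The most recent rule listed above can be sketched as a plain function. This is not the FilteredPositiveIdentifier implementation; it is a minimal illustration that assumes samples are dictionaries with plain numeric CQ values and an exact 'Positive' RESULT string. The names is_filtered_positive, CONTROL_PREFIXES, CQ_FIELDS and CQ_LIMIT are hypothetical.

```python
# Prefixes identifying control samples, and the CQ threshold, taken from the rule above.
CONTROL_PREFIXES = ("CBIQA_", "QC0", "ZZA")
CQ_FIELDS = ("CH1_CQ", "CH2_CQ", "CH3_CQ")
CQ_LIMIT = 30


def is_filtered_positive(sample: dict) -> bool:
    """Apply the rule above to one sample document (illustrative sketch only)."""
    # Assumption: RESULT is compared as an exact string match.
    if sample.get("RESULT") != "Positive":
        return False
    root_sample_id = str(sample.get("ROOT_SAMPLE_ID") or "")
    if root_sample_id.startswith(CONTROL_PREFIXES):
        return False  # control samples are never filtered positive
    cq_values = [sample.get(field) for field in CQ_FIELDS]
    if all(value is None for value in cq_values):
        return True
    # Otherwise at least one recorded CQ value must be at or below the limit.
    return any(value is not None and value <= CQ_LIMIT for value in cq_values)
```

For instance, a positive, non-control sample with CH1_CQ of 35.0 and CH2_CQ of 29.5 passes because one value is at or below 30, while the same sample with only CH1_CQ of 31 recorded does not.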
On changing the positive filtering version/definition, all unpicked samples stored in MongoDB, MLWH and DART need updating to determine whether they are still filtered positive under the new rules, and can therefore be cherrypicked. In order to keep the databases in sync, the update process for all is performed in a single manual migration (update_filtered_positives) which identifies unpicked samples, re-determines their filtered positive value, and updates the databases.
Usage (inside pipenv shell):
python run_migration.py update_filtered_positives
OR
python run_migration.py update_filtered_positives omit_dart
By default, the migration will attempt to use DART, as it will safely fail if DART cannot be accessed, hence warning the user to reconsider what they are doing. However, using DART can be omitted by including the omit_dart flag.
Neither process duplicates any data, instead updating existing entries.
The tests require a connection to the 'lighthouse_sample' table in the Multi-LIMS Warehouse (MLWH). The credentials for connecting to the MLWH are configured in the defaults.py file, or in the relevant environment file, for example test.py. You can run the tests by connecting to the UAT instance of the MLWH, or to an existing local copy you already have. Alternatively, you can create a basic local one containing just the relevant table by running the following from the top level folder (this is what the CI does):
python setup_test_db.py
To run the tests, execute:
python -m pytest -vs
Black is used as a formatter, to format code before committing:
black .
Mypy is used as a type checker, to execute:
mypy .
Flake8 is used for linting, to execute:
flake8
A little convenience script can be used to run the formatting, type checking and linting:
./forlint.sh
If you do not have root access, pyodbc will not work if you use brew. Using Docker Compose, you can set up the full stack, and it will also set the correct environment variables.
To run the dependencies used by Crawler and also Lighthouse, there is a separate configuration for Docker Compose:
./dependencies/up.sh
Note: These dependencies are shared with Lighthouse so if you start these dependencies here, there's no need to also attempt to do so in the Lighthouse repository. They are the same resources in both and the second one to be started will show exceptions about ports already being allocated.
When you want to shut the dependencies back down, you can do so with:
./dependencies/down.sh
To build and run the container for Crawler, run from the root of the repository:
docker-compose up
To run the tests:
You will need to find the id of the container with image name crawler_runner
docker exec -ti <container_id> python -m pytest -vs
There is now a volume for the runner, so there is hot reloading, i.e. changes in the code and tests will be picked up when you rerun the tests.
To make use of the ODBC driver on macOS, follow this guide by Microsoft.
This post was used for the naming conventions within mongo.
Node is required to run npx:
npx markdown-toc -i README.md