Merge branch 'sharing_functions' into sharing_github_action
yulric authored Jan 4, 2024
2 parents 3c396b7 + dfa9798 commit 26e054c
Showing 3 changed files with 167 additions and 132 deletions.
264 changes: 151 additions & 113 deletions README.md
@@ -195,129 +195,167 @@ This section summarizes all the columns part of the file

**license**: The shortform identifier of a license for this data.

# Feature implementation
# Implementation

## How the features will be implemented?
## Function Signature

The features that data custodians request, filtering the data through the rules defined in the schema, will be implemented as a Python script that uses the `Pandas` library. There will be one main function, `create_filtered_dataset(org_group, data, sharing_rules)`, that takes three arguments. `org_group` is the name of the organization, `sharing_rules` is the schema, given as a list, and `data` is a dictionary. The data custodian provides these arguments through the web application.
The function which implements the sharing feature takes three arguments:

The data custodian enters the name of the organization that they wish to share their data with.
The data custodian will also provide the data that they want to filter. The data will be a dictionary whose keys are table names and whose values are lists of dictionaries. Each nested dictionary in a list corresponds to a single row of the table and contains only one value per key. An example:
1. `data`: A Python dictionary containing the data for each table to filter. The argument does not have to contain all the entities defined in the ODM; it need only contain those to which the sharing rules should be applied. An example is shown below,

```
{
"WWMeasure": [
{
"uWwMeasureID": "Measure WW100",
"SampleID": "Sample S100",
"type": "covN1",
"value": 20000,
},
{
"uWwMeasureID": "Measure WW101",
"SampleID": "Sample S101",
"type": "covN1",
"value": 15000,
},
],
"Sample": [
{
"SampleID": "Sample S100",
"siteID": "Site T100",
"datetime": "2021-02-01 9:00:00 PM",
"type": "RawWW",
},
{
"SampleID": "Sample S101",
"siteID": "Site T101",
"datetime": "2021-02-01 9:00:00 PM",
"type": "RawWW",
},
{
"SampleID": "Sample S102",
"siteID": "Site T102",
"datetime": "2021-02-01 9:00:00 PM",
"type": "RawWW",
},
],
"Lab": [
```
{
"WWMeasure": [
{
"uWwMeasureID": "Measure WW100",
"sampleID": "Sample S100",
"type": "covN1",
"value": 20000,
},
{
"uWwMeasureID": "Measure WW101",
"sampleID": "Sample S101",
"type": "covN1",
"value": 15000,
},
],
"Sample": [
{
"sampleID": "Sample S100",
"siteID": "Site T100",
"dateTime": "2021-02-01 9:00:00 PM",
"type": "RawWW",
},
{
"sampleID": "Sample S101",
"siteID": "Site T101",
"dateTime": "2021-02-01 9:00:00 PM",
"type": "RawWW",
},
{
"sampleID": "Sample S102",
"siteID": "Site T102",
"dateTime": "2021-02-01 9:00:00 PM",
"type": "RawWW",
},
],
"Lab": [
{
"labID": "Lab L100",
"assayMethodIDDefault": "Assay Y100",
"name": "University L100 Lab",
},
{
"labID": "Lab L101",
"assayMethodIDDefault": "Assay Y101",
"name": "University L100 Lab",
},
],
}
```

The above `data` argument example has three tables, **WWMeasure**, **Sample**, and **Lab**; **WWMeasure** and **Lab** contain two rows each and **Sample** contains three. The dictionary keys are the table names as specified in the ODM and each value is a list of dictionaries, one per row in that table. Once again, the names of the columns and their value types should match their specification in the ODM.

2. `sharing_rules`: A list containing the sharing rules to be applied to the data argument. Each item in the list is a dictionary containing all the fields as defined in the schema section above. An example is shown below,

```
[
{
"labID": "Lab L100",
"assayMethodIDDefault": "Assay Y100",
"name": "University L100 Lab",
"ruleID": "1",
"sharedWith": "Public;PHAC;Local;Provincial;Quebec;OntarioWSI;CanadianWasteWaterDatabase",
"table": "ALL",
"variable": "sampleID",
"direction": "row",
"ruleValue": "S101;S102",
},
{
"labID": "Lab L101",
"assayMethodIDDefault": "Assay Y101",
"name": "University L100 Lab",
"ruleID": "2",
"sharedWith": "Public",
"table": "WWMeasure",
"variable": "analysisDate",
"direction": "row",
"ruleValue": "[2021-01-25, 2021-01-26); (2021-01-26,2021-01-31]",
},
],
}
```
]
```

The above example of data provides rows for three different tables, 'WWMeasure', 'Sample', and 'Lab'. Each table is represented by a key in the main dictionary whose value is a list of dictionaries, and each dictionary is a row within that table. In the example, the `WWMeasure` table has two rows, the `Sample` table has three, and the `Lab` table has two.
The above `sharing_rules` example contains two rules to apply to the data.

3. `organization`: A string containing the name of the organization with whom the filtered data will be shared. The value of this argument should match an organization listed in the `sharing_rules` argument.
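
As a rough illustration of how the `organization` argument relates to the `sharedWith` field, the sketch below selects the rules that name a given organization. It assumes that `sharedWith` is a semicolon-separated string, as in the example rules; `rules_for_organization` is a hypothetical helper written for this illustration and is not taken from the repository.

```python
def rules_for_organization(sharing_rules, organization):
    """Return the rules whose sharedWith field names the organization.

    Hypothetical helper: assumes sharedWith is a semicolon-separated string.
    """
    selected = []
    for rule in sharing_rules:
        shared_with = [
            name.strip() for name in rule.get("sharedWith", "").split(";")
        ]
        if organization in shared_with:
            selected.append(rule)
    return selected


# Shortened versions of the two example rules.
example_rules = [
    {
        "ruleID": "1",
        "sharedWith": "Public;PHAC;Local",
        "table": "ALL",
        "variable": "sampleID",
        "direction": "row",
        "ruleValue": "S101;S102",
    },
    {
        "ruleID": "2",
        "sharedWith": "Public",
        "table": "WWMeasure",
        "variable": "analysisDate",
        "direction": "row",
        "ruleValue": "[2021-01-25, 2021-01-26); (2021-01-26,2021-01-31]",
    },
]

phac_rule_ids = [
    rule["ruleID"] for rule in rules_for_organization(example_rules, "PHAC")
]
public_rule_ids = [
    rule["ruleID"] for rule in rules_for_organization(example_rules, "Public")
]
```

Matching on whole names after splitting, rather than on a substring of `sharedWith`, avoids selecting a rule for an organization whose name merely appears inside a longer one.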

The data custodian will provide a schema with predefined rules to filter the data. This schema is a list of dictionaries. An example of how the schema might look follows:
The function will return a dictionary whose keys and values are given below:

```
[
{
"filterValue": "S101;S102",
"sharedWith": "Public;PHAC;Local;Provincial;Quebec;OntarioWSI;CanadianWasteWaterDatabase",
"table": "ALL",
"variable": "sampleID",
"direction": "row",
"ruleID": 1,
},
{
"filterValue": '["2021-01-25", "2021-01-26"); ("2021-01-26","2021-01-31"]',
"sharedWith": "Public",
"table": "WWMeasure",
"variable": "analysisDate",
"direction": "row",
"ruleID": 2,
},
{
"filterValue": '2021-01-20; ("2021-01-25", "infinity"]',
"sharedWith": "PHAC;Local;Provincial;Quebec;OntarioWSI;CanadianWasteWaterDatabase",
"tableName": "WWMeasure",
"variableName": "analysisDate",
"direction": "row",
"ruleID": 3,
},
{
"filterValue": "siteT101;siteT102",
"sharedWith": "Public;PHAC;Local;Provincial;Quebec;OntarioWSI;CanadianWasteWaterDatabase",
"tableName": "ALL",
"variable": "siteID",
"direction": "column",
"ruleID": 4,
},
{
"filterValue": "ALL",
"sharedWith": "Public;Local",
"table": "ALL",
"variable": "contactName;contactEmail;contactPhone;contactPhoneExt",
"direction": "column",
"ruleID": 5,
},
]
```

The schema above has five rules, one per dictionary. The first rule applies to every table that has a `sampleID` column: it removes the rows whose `sampleID` is `S101` or `S102`. The rows are removed only for the organizations listed in the `sharedWith` property.

The second rule removes the rows of the `WWMeasure` table whose `analysisDate` is 2021-01-25, or falls after 2021-01-26 up to and including 2021-01-31; the date 2021-01-26 itself is excluded. The rows are removed only from the data shared with `Public`. Because the first and second rules remove rows, their `direction` property is set to 'row'.
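
The interval notation in the rule values can be read mechanically: a square bracket includes its bound and a round bracket excludes it. The sketch below is one possible way to check a date against such a value. The names `parse_bound` and `value_matches` are hypothetical, and treating `infinity` as the largest representable date is an assumption of this sketch, not something the schema specifies.

```python
import re
from datetime import date

# One interval such as [2021-01-25, 2021-01-26), with optional double quotes.
INTERVAL = re.compile(
    r'^([\[(])\s*"?([^",]+?)"?\s*,\s*"?([^",]+?)"?\s*([\])])$'
)


def parse_bound(text):
    """Parse an ISO date; 'infinity' maps to the largest date (assumption)."""
    text = text.strip().strip('"')
    if text.lower() == "infinity":
        return date.max
    return date.fromisoformat(text)


def value_matches(rule_value, day):
    """Return True if day matches any semicolon-separated part of rule_value."""
    for part in rule_value.split(";"):
        part = part.strip()
        match = INTERVAL.match(part)
        if match is None:
            # Not an interval: treat the part as a single date.
            if parse_bound(part) == day:
                return True
            continue
        opening, low, high, closing = match.groups()
        low_date, high_date = parse_bound(low), parse_bound(high)
        above = day >= low_date if opening == "[" else day > low_date
        below = day <= high_date if closing == "]" else day < high_date
        if above and below:
            return True
    return False


rule_value = "[2021-01-25, 2021-01-26); (2021-01-26,2021-01-31]"
on_first_day = value_matches(rule_value, date(2021, 1, 25))
on_excluded_day = value_matches(rule_value, date(2021, 1, 26))
on_last_day = value_matches(rule_value, date(2021, 1, 31))
```

With the second rule's value, 2021-01-25 and 2021-01-31 match while 2021-01-26 does not, which agrees with the description of that rule.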

The fifth and last rule removes the columns listed in its `variable` property ('contactName', 'contactEmail', 'contactPhone', and 'contactPhoneExt') from all tables, but only for the `Public` and `Local` organizations named in the `sharedWith` property. Because this rule removes columns, its `direction` property is set to 'column'.
* **filtered_data**: The data to share with an organization. This is a copy of the `data` parameter with the columns and rows that meet the exclusion rules defined in the sharing rules for the passed organization filtered out. It has the same structure as the `data` argument described above.
* **sharing_summary**: A list of dictionaries containing the columns/rows removed, the name of the table they were removed from and the ID of the rule that removed them. An example is shown below,

The `create_filtered_dataset()` function returns two things:
1. The filtered dataset, as a Python dictionary with tables as keys and lists of dictionaries as values.
2. A list of dictionaries, where each dictionary is a removed row together with the ID of the rule that removed it.

The function uses several sub-functions to carry out the filtering with the Pandas library.
The activity diagram below shows the flow of actions:

<img src="activitydiagram.svg">

The .puml files contain the code for the PlantUML diagrams.
```
[
{
"rule_id": "1",
"entities_filtered": [
{
"table": "WWMeasure",
"columns_removed": {
"type": [
{
"uWwMeasureID": "Measure WW100",
"type": "covN1",
},
{
"uWwMeasureID": "Measure WW101",
"type": "covN2",
}
]
}
}
]
},
{
"rule_id": "2",
"entities_filtered": [
{
"table": "Sample",
"rows_removed": [
{
"sampleID": "Sample S100",
"siteID": "Site T100",
"dateTime": "2021-02-01 9:00:00 PM",
"type": "RawWW",
},
{
"sampleID": "Sample S101",
"siteID": "Site T101",
"dateTime": "2021-02-01 9:00:00 PM",
"type": "RawWW",
},
]
},
{
"table": "Lab",
"rows_removed": [
{
"labID": "Lab L100",
"assayMethodIDDefault": "Assay Y100",
"name": "University L100 Lab",
},
]
}
]
}
]
```

The above example contains two dictionaries, which describe the entities filtered out by the rules with IDs 1 and 2.

The `rule_id` field in each dictionary gives the ID of the rule due to which the entities in the `entities_filtered` field were filtered.

The `entities_filtered` field is a list of dictionaries where each dictionary gives the name of the table and the rows/columns that were removed from it. The keys in each dictionary are described below:

* `table`: The name of the table from which the rows/columns were removed
* `rows_removed`: A list of dictionaries, where each dictionary is a row that was removed from the table
* `columns_removed`: A dictionary where each key is the name of a column that was removed from the table and the value is a list of dictionaries, each containing the value of one cell in the removed column along with the primary key of its row.

Describing the example above,

1. For the rule with ID 1, the **type** column was removed from the **WWMeasure** table. The summary also lists the cells of the removed column: here they contained the values **covN1** and **covN2**, in the rows with primary keys **Measure WW100** and **Measure WW101**.
2. For the rule with ID 2, two rows were removed from the **Sample** table and one row was removed from the **Lab** table. The summary also gives the actual rows that were removed.
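
To make the shape of the return value concrete, the sketch below removes a single column from a single table and builds the matching `sharing_summary` entry. It is only an illustration under stated assumptions: `remove_column` is a hypothetical helper, and the primary key is passed in explicitly, whereas a real implementation would presumably look it up from the ODM.

```python
import copy


def remove_column(data, table, column, primary_key, rule_id):
    """Drop one column from one table and describe what was removed.

    Hypothetical helper; primary_key is supplied by the caller here.
    """
    filtered_data = copy.deepcopy(data)  # leave the caller's data untouched
    removed_cells = []
    for row in filtered_data.get(table, []):
        if column in row:
            key_value = row.get(primary_key)
            removed_cells.append({primary_key: key_value, column: row.pop(column)})
    summary_entry = {
        "rule_id": rule_id,
        "entities_filtered": [
            {"table": table, "columns_removed": {column: removed_cells}}
        ],
    }
    return {"filtered_data": filtered_data, "sharing_summary": [summary_entry]}


data = {
    "WWMeasure": [
        {"uWwMeasureID": "Measure WW100", "sampleID": "Sample S100",
         "type": "covN1", "value": 20000},
        {"uWwMeasureID": "Measure WW101", "sampleID": "Sample S101",
         "type": "covN1", "value": 15000},
    ]
}

result = remove_column(data, "WWMeasure", "type", "uWwMeasureID", "1")
removed = result["sharing_summary"][0]["entities_filtered"][0]["columns_removed"]
```

Working on a deep copy mirrors the statement that `filtered_data` is a copy of the `data` parameter, so the original rows remain available for building the summary.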
5 changes: 0 additions & 5 deletions sharing/create_dataset.py
@@ -1,8 +1,3 @@
import pandas as pd # pylint: disable=import-error
from numpy import nan # pylint: disable=import-error
from pandas import Timestamp # pylint: disable=import-error


def create_dataset(rules: list, data: dict = {}, org: str = '') -> dict:
"""Filters data and returns filtered data and shared summary in dictionary.
30 changes: 16 additions & 14 deletions sharing/create_filtered_dataset.puml
@@ -4,43 +4,45 @@ skinparam wrapWidth 80

start

:Create a copy of the data parameter (filtered_data). This is the data that will be returned to the user. Rows will be removed from here as we iterate through each rule that applies to the passed organization;
:Create a copy of the data parameter (filtered_data). This is the data that will be returned to the user. Rows/columns will be removed from here as we iterate through each rule that applies to the passed organization;

:Filter the sharing rules parameter to only find the ones that apply to the passed organization parameter (org_rules);

:Create a variable which will hold the summary of the data removed for each rule. This should include the ID of the rule, the names of the columns removed and from which table, and the primary key of the rows removed and the table they were removed from;
:Create a variable which will hold the summary of the data removed for each rule. This should include the ID of the rule, the names of the table from where rows/columns were removed, the rows that were removed, and the names of the columns that were removed;

while (Are there more rules in org_rules to apply?) is (yes)
while (Iterate through each rule in org_rules)
:Create a subset of filtered_data containing only the tables and columns that the current rule applies to (current_rule_data);

:Create a variable (current_rule_summary) which will hold a summary of the data removed for the current;
:Create a variable (current_rule_summary) which will hold a summary of the data removed for the current rule;

while (Are there more rule values in the current rule to check?) is (yes)
while (Are there more tables in current_rule_data to check?) is (yes)
while (Iterate through all the values in the ruleValues column for the current rule)
while (Iterate through each table in current_rule_data)
if (The current rule's direction is **row**?) then (yes)
while(Are there more rows in current_rule_data for the current table to check?) is (yes)
if (Do any cells in the current row meets this rule value's constraints (See the check_cell diagram)) then (yes)
while(Iterate through each row of the current table in current_rule_data)
if (Do any cells in the current row meet the current rule value's constraints (See the check_cell diagram)) then (yes)
:Add an entry in current_rule_summary for this row with this table;

:Remove this row from filtered_data;

:Remove rows from filtered_data that reference the row just removed from filtered_data;
endif
endwhile (no)
elseif (The current rule's direction is **column**) then (yes)
while (Are there more columns in current_rule_data for the current table to check?) is (yes)
while (Iterate through each column of the current table in current_rule_data)
if (Do any cells in the current column meet this rule value's constraints (See the check_cell diagram)) then (yes)
:Add an entry in current_rule_summary for this columns with this table;
:Add an entry in current_rule_summary for this column with this table;

:Remove this column from filtered_data;
endif
endwhile (no)
endwhile
else
:Throw an error informing the user about an unknown direction value;
endif
endwhile (no)
endwhile

:Add the data from current_rule_summary to the master summary variable including the ID of the current rule;
endwhile (no)
endwhile (no)
endwhile
endwhile

:Use the master summary variable to create the report object;
