Gcc puller generic #1044

Merged 14 commits on Oct 7, 2024
92 changes: 64 additions & 28 deletions gis/gccview/README.md
# GCCVIEW pipeline
* [Overview](#overview)
* [Where the layers are pulled](#where-the-layers-are-pulled)
* [Data Pipeline](#data-pipeline)
* [Adding new layers to GCC Puller DAG](#adding-new-layers-to-gcc-puller-dag)
* [Manually fetch layers](#manually-fetch-layers)

## Overview

The GCC pipeline will be pulling multiple layers into the `gis_core` and `gis` schemas, including:

|Layer|Mapserver ID|Layer ID|
|-----|------------|--------|
|private_road|27|13|
|school|28|17|
|library|28|28|
|pavement_asset|2|36|

## Data Pipeline

The pipeline consists of two files: `gcc_puller_functions.py` for the functions and `/dags/gcc_layers_pull.py` for the Airflow DAG. The main function that fetches the layers is called `get_layer`, and it takes in eight parameters, described below (a sample call is sketched after the list):

- mapserver_n (int): ID of the mapserver that hosts the desired layer
- layer_id (int): ID of the layer within the particular mapserver
- schema_name (string): name of the destination schema
- is_audited (Boolean): True if the layer will be loaded into an audited table, False if the layer will be non-audited
- cred (Airflow PostgresHook): the Airflow PostgresHook that points to the credentials enabling a connection to the particular database
- con (used when pulling manually): the path to the credential config file. Default is ~/db.cfg
- primary_key (used when pulling an audited table): primary key for this layer; it is taken from the dictionary pk_dict when pulled by the Airflow DAG, and set manually when pulling a layer yourself
- is_partitioned (Boolean): True if the layer will be inserted as a child table of a parent table, False if the layer will be neither audited nor partitioned
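
As a rough illustration, here is a minimal sketch of a one-off call, assuming the parameters behave as described above; check `gcc_puller_functions.py` for the exact signature and defaults.

```python
# Hedged sketch: pull the library layer (mapserver 28, layer 28) into the
# gis schema as a plain table. cred is omitted because it is only needed
# when the function runs inside Airflow.
from gcc_puller_functions import get_layer

get_layer(
    mapserver_n=28,        # mapserver that hosts the layer
    layer_id=28,           # the library layer
    schema_name="gis",     # destination schema
    is_audited=False,      # not audited...
    con="~/db.cfg",        # path to the credential config file
    is_partitioned=False,  # ...and not partitioned: a plain table
)
```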

In the DAG file, the arguments for each layer are stored in dictionaries called "bigdata_layers" and "ptc_layers", in the order above. The DAG is executed once every 3 months, on the 15th of March, June, September, and December. The DAG will only pull audited or partitioned tables, since the "is_partitioned" argument is not stored in the dictionaries and is set to its default value, True.
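
For reference, the quarterly schedule described above (midnight on the 15th of March, June, September, and December) could be written as the cron rule below; this is only an illustration, and the actual DAG may define its schedule differently.

```python
# Hypothetical cron expression for the DAG's quarterly schedule:
# minute hour day-of-month month day-of-week
schedule_interval = "0 0 15 3,6,9,12 *"
```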

## Adding new layers to GCC Puller DAG
1. Identify the mapserver_n and layer_id for the layer you wish to add. You can find COT transportation layers here: https://insideto-gis.toronto.ca/arcgis/rest/services/cot_geospatial2/FeatureServer, where mapserver_n is 2 and the layer_id is in brackets after the layer name.
2. Add a new entry to the "bigdata_layers" or "ptc_layers" dictionary in Airflow's variables, depending on the destination database (a hypothetical entry is sketched after this list).
3. If is_audited = True, you must also add a primary key for the new layer to "gcc_layers" in the corresponding Airflow variable.
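
A hypothetical shape for step 2's entry, assuming the Airflow variable mirrors the get_layer arguments in the order described above; the real structure of the variable may differ:

```python
# Hypothetical entries only: mapserver_n, layer_id, schema_name, is_audited.
# city_ward is audited, so its primary key must also be registered (step 3).
ptc_layers = {
    "city_ward": (0, 0, "gis_core", True),
    "library": (28, 28, "gis", False),
}
```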

## Manually fetch layers

If you need to pull a layer as a one-off task, `gcc_puller_functions.py` allows you to pull any layer from the GCC REST API. Please note that the script must be run locally or on an on-prem server, as it needs a connection to insideto.

A layer can be pulled in one of three modes: audited, partitioned, or neither (a plain table). Each mode has different input requirements, described below, and for audited and partitioned pulls the target table likely needs to exist in the database already; refer to the .sql files in `/gis/gccview/sql`.

Before running the script, ensure that you have set up an appropriate environment with all the necessary packages installed. You might have to set the `https_proxy` in your environment, with your novell username and password, in order to clone this repo or install packages. If you run into any issues, don't hesitate to ask a sysadmin. You can then install all the packages in `requirements.txt`, either with:

1) A virtual environment, which installs them for you once activated:

Pipenv:

`pipenv shell`

`pipenv install`

Venv:

`source .venv/bin/activate`

`python3 -m pip install -r requirements.txt`

2) pip, if you are not using a virtual environment (you should):

`pip install -r requirements.txt`


Now you are set to run the script!

There are 7 inputs that can be entered:

`--mapserver`: Mapserver number, e.g. for cot_geospatial2 this will be 2

`--layer-id`: Layer ID

`--schema-name`: Name of the destination schema

`--con` (optional): The path to the credential config file. Default is ~/db.cfg

`--is-audited`: Whether the table will be audited. Specifying this flag on the command line sets it to True; omitting it gives the default, False.

`--primary-key` (required when pulling an audited table): Primary key for the layer

`--is-partitioned`: Whether the table will be inserted as a child table of a parent table. Specifying this flag on the command line sets it to True; omitting it gives the default, False.
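
You can also ask Click itself for the details of each option by running the script with `--help`:

```python
python3 bdit_data-sources/gis/gccview/gcc_puller_functions.py --help
```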

Note that if you want to pull a partitioned child table into your personal schema, you need to set up the parent table first; refer to the .sql files in `/gis/gccview/sql`.

Example of pulling the library layer (a plain table, neither audited nor partitioned) to the gis schema:

```python
python3 bdit_data-sources/gis/gccview/gcc_puller_functions.py --mapserver 28 --layer-id 28 --schema-name gis --con db.cfg
```

Example of pulling the intersection layer (partitioned) to the gis_core schema:

```python
python3 bdit_data-sources/gis/gccview/gcc_puller_functions.py --mapserver 12 --layer-id 42 --schema-name gis_core --con db.cfg --is-partitioned
```

Example of pulling the city_ward layer (audited) to the gis_core schema:

```python
python3 bdit_data-sources/gis/gccview/gcc_puller_functions.py --mapserver 0 --layer-id 0 --schema-name gis_core --con db.cfg --is-audited --primary-key area_id
```