- The data used is available here. Provide an accessible path to the CSV file in the `data_url` field of the `config.yaml` file, and ensure that the file can be downloaded using `curl`. Alternatively, provide the URL in the `DATA_URL` environment variable.
- Update the Python environment in the `.env` file.
- Install `poetry` if it is not already installed.
- Install the dependencies: `poetry install`
- Update the config and model parameters in the `config.yaml` file.
- Add `./src` to the `PYTHONPATH`: `export PYTHONPATH="${PYTHONPATH}:./src"`
- Run the pipeline: `poetry run python src/main.py` (the full sequence is sketched as a shell session after this list)
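Taken together, the steps above amount to the following shell session. This is a minimal sketch assuming a POSIX shell, the repo root as the working directory, and `pip` available for installing poetry:

```bash
# Point the pipeline at the raw CSV (the repo's default URL shown here).
export DATA_URL="https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv"

# Confirm the CSV is actually downloadable with curl before running anything.
curl --fail --silent --show-error --output /dev/null "$DATA_URL" && echo "data url OK"

# Install poetry if missing (pip is one option), then the project dependencies.
command -v poetry >/dev/null 2>&1 || pip install poetry
poetry install

# Make src/ importable and run the pipeline.
export PYTHONPATH="${PYTHONPATH}:./src"
poetry run python src/main.py
```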
The manual steps below are automated by the data ingestion DAG in the DAGs repo.
- Run `dvc init` from the root of the repo to initialise it as a DVC repo, if this is not already done.
- Add a DVC remote: `dvc remote add -f <dvc-remote-name> <dvc-remote-path>`
- Add the files that need to be tracked to DVC: `dvc add artefacts/test_data.csv artefacts/train_data.csv artefacts/val_data.csv`
- Add the generated `.dvc` files to git: `git add artefacts/test_data.csv.dvc artefacts/train_data.csv.dvc artefacts/val_data.csv.dvc`
- Push the data to the DVC remote: `dvc push -r <dvc-remote-name>`
- Push to git and tag the repo with the data version for future use (a concrete sequence is sketched below).
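For concreteness, here is the manual sequence with hypothetical values substituted for the placeholders; the remote name `my-dvc-remote`, the remote path `s3://my-bucket/dvc-store`, and the tag `data-v1.0.0` are all illustrative:

```bash
# Hypothetical remote name and path; substitute your own.
dvc init
dvc remote add -f my-dvc-remote s3://my-bucket/dvc-store

# Track the dataset splits with DVC; this writes matching .dvc pointer files.
dvc add artefacts/test_data.csv artefacts/train_data.csv artefacts/val_data.csv

# Commit the pointer files to git and push the actual data to the remote.
git add artefacts/test_data.csv.dvc artefacts/train_data.csv.dvc artefacts/val_data.csv.dvc
git commit -m "Track new data version with DVC"
dvc push -r my-dvc-remote

# Tag the repo with the data version (hypothetical tag name) and push.
git tag data-v1.0.0
git push && git push --tags
```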
- Build the docker image: `docker build -t data-ingestion .`
- Run the container with the correct `DATA_URL` and `DVC_REMOTE` set as environment variables (refer to the Environment Variables table below for the complete list; an expanded example with the authentication variables follows this list): `docker run -e DVC_REMOTE=s3:some/remote -e DATA_URL=https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv --rm data-ingestion`
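When the remote is an authenticated S3-compatible service such as MinIO, the remaining variables from the table below can be passed the same way. A sketch with hypothetical credentials (the real defaults are embedded in the infra repo):

```bash
# The access key and secret here are placeholders; real values live in the infra repo.
docker run --rm \
  -e DATA_URL=https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv \
  -e DVC_REMOTE=s3:some/remote \
  -e DVC_REMOTE_NAME=regression-model-remote \
  -e DVC_ENDPOINT_URL=http://minio \
  -e DVC_ACCESS_KEY_ID=<access-key> \
  -e DVC_SECRET_ACCESS_KEY=<secret-key> \
  data-ingestion
```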
- Set up the kubernetes cluster and the required infrastructure using the Infrastructure repo.
- Access the airflow UI made available by the above infra repo.
- Update the airflow variables as required (a CLI sketch follows below).
- Trigger the `data_ingestion_dag`.
Once the DAG execution completes, the data ingestion repo is updated with a new data version on the specified branch.
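Airflow variables can be edited in the UI under Admin → Variables, or with the Airflow CLI. A hedged sketch, assuming the DAG reads variables named after the environment variables in the table below:

```bash
# Assumption: the DAG reads Airflow Variables mirroring the env vars below.
airflow variables set DATA_URL "https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv"
airflow variables set DVC_REMOTE "s3:some/remote"

# Trigger the DAG from the CLI instead of the UI, if preferred.
airflow dags trigger data_ingestion_dag
```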
The following environment variables can be set to configure the training:
| Variable | Default Value | Description |
|---|---|---|
| `DATA_URL` | `https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv` | URL to the raw CSV data used for training |
| `CONFIG_PATH` | `./config.yaml` | File path to the data cleansing, versioning, and other configuration file |
| `LOG_LEVEL` | `INFO` | The logging level for the application. Valid values are `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. |
| `DVC_REMOTE` | `/tmp/test-dvc-remote` | A DVC remote path |
| `DVC_ENDPOINT_URL` | `http://minio` | The URL endpoint for the DVC storage backend. This is typically the URL of an S3-compatible service, such as MinIO, used to store and manage datasets and model files. |
| `DVC_REMOTE_NAME` | `regression-model-remote` | The name of the DVC remote |
| `DVC_ACCESS_KEY_ID` | None | The access key ID for the DVC remote endpoint URL (default value is embedded in the infra repo) |
| `DVC_SECRET_ACCESS_KEY` | None | The secret access key for the DVC remote endpoint URL (default value is embedded in the infra repo) |
| `GITHUB_USERNAME` | None | GitHub username used to push new data version files to GitHub (default value is embedded in the infra repo) |
| `GITHUB_PASSWORD` | None | GitHub token for the above username (default value is embedded in the infra repo) |
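For a local (non-Docker) run, the same variables can simply be exported before invoking the pipeline. The values below are the defaults from the table, except `LOG_LEVEL`, which is raised for illustration:

```bash
export DATA_URL="https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv"
export CONFIG_PATH="./config.yaml"
export LOG_LEVEL="DEBUG"            # one of DEBUG, INFO, WARNING, ERROR, CRITICAL
export DVC_REMOTE="/tmp/test-dvc-remote"
poetry run python src/main.py
```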
Ensure that the project requirements are already set up by following the Data Ingestion and Versioning instructions above.
- Ensure `pytest` is installed; `poetry install` will install it as a dependency.
- Run the tests with `poetry run pytest ./tests`
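Standard pytest selection flags work through poetry as well; the keyword below is hypothetical:

```bash
poetry run pytest ./tests -v              # verbose test names and outcomes
poetry run pytest ./tests -k "ingestion"  # run only tests matching a keyword (hypothetical)
```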