- The data used is available here. Provide an accessible path to the CSV file in the `data_url` field of the `config.yaml` file, and ensure that the file can be downloaded using `curl`. Alternatively, provide the URL in the `DATA_URL` environment variable.
- Update the Python environment in the `.env` file.
- Install `poetry` if it is not already installed.
- Install the dependencies: `poetry install`
- Update the config and model parameters in the `config.yaml` file.
- Add `./src` to the `PYTHONPATH`: `export PYTHONPATH="${PYTHONPATH}:./src"`
- Run the pipeline: `poetry run python src/main.py` (the full sequence is sketched as a shell session after this list)
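Taken together, the steps above amount to the following shell session. This is a minimal sketch assuming a POSIX shell, the repo root as the working directory, and `pip` available for installing poetry:

```bash
# Point the pipeline at the raw CSV (the repo's default URL shown here).
export DATA_URL="https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv"

# Confirm the CSV is actually downloadable with curl before running anything.
curl --fail --silent --show-error --output /dev/null "$DATA_URL" && echo "data url OK"

# Install poetry if missing (pip is one option), then the project dependencies.
command -v poetry >/dev/null 2>&1 || pip install poetry
poetry install

# Make src/ importable and run the pipeline.
export PYTHONPATH="${PYTHONPATH}:./src"
poetry run python src/main.py
```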
The manual steps below are automated by the data ingestion DAG in the DAGs repo.
- Run `dvc init` from the root of the repo to initialise it as a DVC repo, if this is not already done.
- Add a DVC remote: `dvc remote add -f <dvc-remote-name> <dvc-remote-path>`
- Add the files that need to be tracked to DVC: `dvc add artefacts/test_data.csv artefacts/train_data.csv artefacts/val_data.csv`
- Add the generated `.dvc` files to git: `git add artefacts/test_data.csv.dvc artefacts/train_data.csv.dvc artefacts/val_data.csv.dvc`
- Push the data to the DVC remote: `dvc push -r <dvc-remote-name>`
- Push to git and tag the repo with the data version for future use (a concrete sequence is sketched below).
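For concreteness, here is the manual sequence with hypothetical values substituted for the placeholders; the remote name `my-dvc-remote`, the remote path `s3://my-bucket/dvc-store`, and the tag `data-v1.0.0` are all illustrative:

```bash
# Hypothetical remote name and path; substitute your own.
dvc init
dvc remote add -f my-dvc-remote s3://my-bucket/dvc-store

# Track the dataset splits with DVC; this writes matching .dvc pointer files.
dvc add artefacts/test_data.csv artefacts/train_data.csv artefacts/val_data.csv

# Commit the pointer files to git and push the actual data to the remote.
git add artefacts/test_data.csv.dvc artefacts/train_data.csv.dvc artefacts/val_data.csv.dvc
git commit -m "Track new data version with DVC"
dvc push -r my-dvc-remote

# Tag the repo with the data version (hypothetical tag name) and push.
git tag data-v1.0.0
git push && git push --tags
```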
- Build the docker image: `docker build -t data-ingestion .`
- Run the container with the correct `DATA_URL` and `DVC_REMOTE` set as environment variables (refer to the Environment Variables table below for the complete list; an expanded example with the authentication variables follows this list): `docker run -e DVC_REMOTE=s3:some/remote -e DATA_URL=https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv --rm data-ingestion`
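When the remote is an authenticated S3-compatible service such as MinIO, the remaining variables from the table below can be passed the same way. A sketch with hypothetical credentials (the real defaults are embedded in the infra repo):

```bash
# The access key and secret here are placeholders; real values live in the infra repo.
docker run --rm \
  -e DATA_URL=https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv \
  -e DVC_REMOTE=s3:some/remote \
  -e DVC_REMOTE_NAME=regression-model-remote \
  -e DVC_ENDPOINT_URL=http://minio \
  -e DVC_ACCESS_KEY_ID=<access-key> \
  -e DVC_SECRET_ACCESS_KEY=<secret-key> \
  data-ingestion
```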
- Set up the kubernetes cluster and the required infrastructure using the Infrastructure repo.
- Access the airflow UI made available by the above infra repo.
- Update the airflow variables as required (a CLI sketch follows below).
- Trigger the `data_ingestion_dag`.
Once the DAG execution completes, the data ingestion repo is updated with a new data version on the specified branch.
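Airflow variables can be edited in the UI under Admin → Variables, or with the Airflow CLI. A hedged sketch, assuming the DAG reads variables named after the environment variables in the table below:

```bash
# Assumption: the DAG reads Airflow Variables mirroring the env vars below.
airflow variables set DATA_URL "https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv"
airflow variables set DVC_REMOTE "s3:some/remote"

# Trigger the DAG from the CLI instead of the UI, if preferred.
airflow dags trigger data_ingestion_dag
```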
The following environment variables can be set to configure the training:
| Variable | Default Value | Description |
|---|---|---|
| `DATA_URL` | `https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv` | URL to the raw CSV data used for training |
| `CONFIG_PATH` | `./config.yaml` | File path to the data cleansing, versioning, and other configuration file |
| `LOG_LEVEL` | `INFO` | The logging level for the application. Valid values are `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. |
| `DVC_REMOTE` | `/tmp/test-dvc-remote` | A DVC remote path |
| `DVC_ENDPOINT_URL` | `http://minio` | The URL endpoint for the DVC storage backend. This is typically the URL of an S3-compatible service, such as MinIO, used to store and manage datasets and model files. |
| `DVC_REMOTE_NAME` | `regression-model-remote` | The name of the DVC remote |
| `DVC_ACCESS_KEY_ID` | None | The access key ID for the DVC remote endpoint URL (default value is embedded in the infra repo) |
| `DVC_SECRET_ACCESS_KEY` | None | The secret access key for the DVC remote endpoint URL (default value is embedded in the infra repo) |
| `GITHUB_USERNAME` | None | GitHub username used to push new data version files to GitHub (default value is embedded in the infra repo) |
| `GITHUB_PASSWORD` | None | GitHub token for the above username (default value is embedded in the infra repo) |
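For a local (non-Docker) run, the same variables can simply be exported before invoking the pipeline. The values below are the defaults from the table, except `LOG_LEVEL`, which is raised for illustration:

```bash
export DATA_URL="https://raw.githubusercontent.com/renjith-digicat/random_file_shares/main/HousingData.csv"
export CONFIG_PATH="./config.yaml"
export LOG_LEVEL="DEBUG"            # one of DEBUG, INFO, WARNING, ERROR, CRITICAL
export DVC_REMOTE="/tmp/test-dvc-remote"
poetry run python src/main.py
```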
Ensure that the project requirements are already set up by following the Data Ingestion and Versioning instructions above.
- Ensure `pytest` is installed; `poetry install` will install it as a dependency.
- Run the tests with `poetry run pytest ./tests`
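Standard pytest selection flags work through poetry as well; the keyword below is hypothetical:

```bash
poetry run pytest ./tests -v              # verbose test names and outcomes
poetry run pytest ./tests -k "ingestion"  # run only tests matching a keyword (hypothetical)
```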