App-K8s-HF-WnB

This project aims to create an end-to-end ML app as a functional MVP. The app uses Hugging Face (HF) and Weights & Biases (WandB) to reduce initial complexity. The ML modules should be interchangeable without interrupting the pipeline. The app can be deployed into a Python venv, a Docker image, or Kubernetes to showcase the separation of concerns between the pipeline components.

Status

[DRAFT] [WIP] ----> Not fully implemented yet

For version history have a look at CHANGELOG.md.

Quickstart

  • TODO

TOC

Usage

If inside poetry venv

python -m app

or if outside

poetry run python -m app

Install

Python

From a venv with poetry available

make install

or with conda

envname='App-K8s-HF-WnB'
conda create -ym -n $envname poetry
conda activate $envname
make install

Container

  • TODO

Kubernetes

  • TODO

Reason

  • TODO

Purpose

  • Showcase an end-to-end app with train and inference mode
  • Implement self-contained modular pipeline

Paradigms

  • TDD/BDD
  • Mostly functional
  • Time-to-value, time-to-market
  • Light-weight
  • Code should (Dave Farley)
    • Work
    • Be modular
    • Be cohesive
    • Be appropriately coupled
    • Be separated by concerns
    • Hide/abstract information

App Structure

Show essential structure
/
├─ app/
│  ├─ config/
│  ├─ payload/
│  ├─ pipeline/
│  ├─ utils/
│  └─ app.py
├─ assets/
├─ container/
├─ kubernetes/
│  ├─ base/
│  └─ overlay/
├─ tests/
├─ CHANGELOG.md
├─ make.bat
├─ Makefile
├─ pyproject.toml
└─ README.md
Show full structure
/
├─ .github/
│  ├─ workflows/
│  │  └─ links-fail-fast.yml
│  └─ dependabot.yml
├─ app/
│  ├─ config/
│  │  ├─ defaults.yml
│  │  ├─ huggingface.yml
│  │  ├─ logging.conf
│  │  ├─ parameters.dummy.json
│  │  ├─ sweep-wandb.yml
│  │  ├─ sweep.yml
│  │  ├─ task.yml
│  │  ├─ wandb.key.dummy.yml
│  │  └─ wandb.yml
│  ├─ payload/
│  │  ├─ handle_hf.py
│  │  ├─ handle_sweep.py
│  │  ├─ infer_model.py
│  │  └─ train_model.py
│  ├─ pipeline/
│  │  ├─ load_hf_components.py
│  │  ├─ prepare_pipe_data.py
│  │  └─ prepare_pipe_params.py
│  ├─ utils/
│  │  ├─ handle_logging.py
│  │  ├─ handle_paths.py
│  │  ├─ load_configs.py
│  │  ├─ log_system_info.py
│  │  ├─ parse_args.py
│  │  └─ toggle_features.py
│  ├─ __main__.py
│  ├─ __version__.py
│  ├─ _version.py
│  ├─ app.py
│  └─ py.typed
├─ assets/
│  ├─ tuna_importtime_dark.PNG
│  └─ tuna_importtime_light.PNG
├─ container/
│  └─ Dockerfile.PNG
├─ kubernetes/
│  ├─ base/
│  │  ├─ deployment.yml
│  │  ├─ kustomization.yml
│  │  ├─ pvc.yml
│  │  └─ service.yml
│  └─ overlay/
│     ├─ prod/
│     │  ├─ ingress.yml
│     │  ├─ kustomization.yml
│     │  └─ namespace.yml
│     └─ test/
│        ├─ ingress.yml
│        ├─ kustomization.yml
│        └─ namespace.yml
├─ tests/
│  ├─ behavior/
│  │  ├─ test_load_hf_components_behavior.py
│  │  └─ test_train_model_behavior.py
│  └─ functionality/
│     └─ test_load_hf_components_functionality.py
├─ .bumpversion.cfg
├─ .cirrus.yml
├─ .coveragerc
├─ .flake8
├─ .gitattributes
├─ .gitignore
├─ .gitmessage
├─ .markdownlint.yml
├─ .pre-commit-config.yaml
├─ .yamllint.yml
├─ CHANGELOG.md
├─ LICENSE
├─ make.bat
├─ Makefile
├─ pyproject.toml
└─ README.md

App Details

Import performance

The import performance of the app can be measured with python -X importtime -m app and visualized with tuna. From root this flow can be invoked by:

make importtime

An example of how the visualized import time could look:

[Image: python importtime visualized with tuna]
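
To inspect the raw numbers without tuna, the stderr output of `-X importtime` can also be parsed directly. The following is a minimal sketch and not part of the repository: the helper names `parse_importtime` and `slowest_imports` are hypothetical, and it assumes CPython's usual `import time: self [us] | cumulative | imported package` line layout.

```python
from __future__ import annotations

import subprocess
import sys

PREFIX = "import time:"


def parse_importtime(stderr_text: str) -> list[tuple[str, int, int]]:
    """Return (module, self_us, cumulative_us) for each importtime line."""
    rows = []
    for line in stderr_text.splitlines():
        if not line.startswith(PREFIX):
            continue
        fields = [part.strip() for part in line[len(PREFIX):].split("|")]
        if len(fields) != 3:
            continue
        self_us, cumulative_us, module = fields
        if not (self_us.isdigit() and cumulative_us.isdigit()):
            continue  # skips the column header line
        rows.append((module, int(self_us), int(cumulative_us)))
    return rows


def slowest_imports(rows: list[tuple[str, int, int]], top: int = 5):
    """Sort by cumulative time, slowest first (hypothetical helper)."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:top]


if __name__ == "__main__":
    # Profiles a stdlib import for illustration; inside the project,
    # "-c", "import json" would be replaced by "-m", "app".
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", "import json"],
        capture_output=True,
        text=True,
        check=True,
    )
    for module, self_us, cumulative_us in slowest_imports(parse_importtime(proc.stderr)):
        print(f"{cumulative_us:>8} us  {module}")
```

Sorting by cumulative time points at the packages that dominate startup, which is roughly the same information the tuna chart shows graphically.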

TODO

ML

  • Get WandB sweep config
    • Implemented and functional
    • May be extended to other providers, but sufficient for the MVP
  • Save models, datasets, tokenizer and metrics in local folder other than cache
  • Define the core of the app
    • train
    • infer
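
One possible shape for the train/infer core is a small dispatch table keyed by mode. This is only a sketch under assumptions: the `--mode` and `--model` flags, the `MODES` table, and the placeholder bodies of `train` and `infer` are hypothetical and do not reflect the actual modules in `app/payload/`.

```python
from __future__ import annotations

import argparse
from typing import Callable, Optional


def train(model_name: str) -> str:
    """Placeholder for the training payload (hypothetical)."""
    return f"training {model_name}"


def infer(model_name: str) -> str:
    """Placeholder for the inference payload (hypothetical)."""
    return f"inference with {model_name}"


# Each mode maps to one callable, so modes can be swapped or added
# without touching the dispatch logic.
MODES: dict[str, Callable[[str], str]] = {"train": train, "infer": infer}


def run(argv: Optional[list[str]] = None) -> str:
    """Parse the mode from argv and call the matching payload."""
    parser = argparse.ArgumentParser(prog="app")
    parser.add_argument("--mode", choices=sorted(MODES), default="train")
    parser.add_argument("--model", default="dummy-model")
    args = parser.parse_args(argv)
    return MODES[args.mode](args.model)


# Example call with explicit arguments; pass None to read from sys.argv.
print(run(["--mode", "infer", "--model", "dummy-model"]))
```

Keeping the payloads behind a plain mapping keeps the core free of branching and fits the goal of interchangeable modules.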

Coding

  • Basic exception handling
    • May be problematic with function returns
  • Type hinting in function calls
  • Read multiple yml inside one file inside config loader
    • Abandoned, adds unnecessary complexity, use separate yml
  • Expand into typing — Support for type hints
  • Use if checks to verify that a feature can be provided properly, instead of catching an Exception
  • Try dataclass and field from dataclasses
    • Used to auto-add special methods like __init__, __str__, __repr__
    • Uses type hinting and decorators
  • Factor out Pipeline.py to prepare for functional only
    • Sole purpose of Pipeline.py is to represent the gathered configs
    • Replaced by dataclasses
    • Switch to recursion instead of for-loops for FP
  • Propagate debug state through app
    • Env APP_DEBUG_IS_ON checked by modules and written to global debug_on_global
  • Use omegaconf to load configs instead of own helper implementation
    • This package manages loading of configs from json or yaml
    • Offers type checking at load-time with Structured configs
  • Align to PEP 257 – Docstring Conventions
    • Multi-line docstrings
  • Try argparse
    • Fetch user or command input
  • Check whether dataclasses
    • Are suitable for functional programming
      • Low priority right now, design choice for a later stage
    • Are a code smell because they provide only structure, not behavior
      • Designed to hold data, may be comparable to struct and enum
  • Have a look at PyTest
    • Explored in repo TDD-Playground
  • Line-continuation inside docstrings
  • Incorporate test objects
    • Fake, Mock, Stub
  • Refactor logging according to Martin Fowler Domain-Oriented Observability
    • Domain Probing 'A Domain Probe[...] enables us to add observability to domain logic while still talking in the language of the domain'
    • Decouple logging into dedicated functions and module
    • Define logging types, e.g. log, metrics, analytics
  • Explore feature toggles
    • Testing the logging and observability
  • Test pydantic for type checking
    • pydantic is built for parsing and checking types at runtime
    • If the app only uses data it produced itself, pydantic may not be suitable because of the speed loss
    • Use pydantic.BaseModel or pydantic.dataclasses.dataclass
    • Read in and validate mode at the same time
    • Try to unpack yaml into pydantic w/o customized parsing before
    • BaseSettings could be useful to combine files and env
  • Use hydra to parametrize the app
    • Hydra supports importing of custom dataclasses.dataclass
  • Decouple concerns into separate containers, e.g. avoid big container because of torch
    • Difference between Abstraction vs Decoupling
    • Difference between Cohesion and Coupling
  • Implement basic API, e.g. with gunicorn or FastAPI
  • Use distroless containers
    • Reduce signal-to-noise ratio of scanners
    • Reduce size
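
Two of the items above, dataclasses as a container for the gathered configs and the propagation of the debug state, can be sketched together. This is an illustration under assumptions, not the repository's implementation: `PipelineConfig` and its fields are hypothetical, the set of accepted truthy spellings for `APP_DEBUG_IS_ON` is assumed, and the flag is stored on the config object here rather than in the global `debug_on_global`.

```python
from __future__ import annotations

import os
from dataclasses import dataclass, field
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings, not from the repo


def debug_is_on(env: Optional[Mapping[str, str]] = None) -> bool:
    """Check APP_DEBUG_IS_ON with a plain lookup instead of try/except."""
    source = os.environ if env is None else env
    return source.get("APP_DEBUG_IS_ON", "").strip().lower() in TRUTHY


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical data-only container for the gathered configs.

    The decorator generates __init__, __repr__ and __eq__ from the
    annotated fields; frozen=True blocks attribute reassignment.
    """

    model_name: str
    dataset_name: str
    metrics: tuple[str, ...] = ()
    sweep: dict[str, object] = field(default_factory=dict)
    debug: bool = field(default_factory=debug_is_on)


config = PipelineConfig(model_name="dummy-model", dataset_name="dummy-dataset")
print(config)
```

Because the environment is read once when the config is created, modules can receive the flag as data instead of each checking a global on their own.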

Dependency tracking and packaging

  • Explore use of pipenv with Pipfile & Pipfile.lock as a proposed replacement to requirements.txt
    • Auto-creation of venv
    • pipenv install -e for editable mode, i.e. 'dependency resolution can be performed with an up to date copy of the repository each time it is performed'
  • Use Poetry as replacement for pipenv
    • Auto-creation of venv
    • Build-tool for packaging
  • Experiment with pyproject.toml to build app wheel
    • Used to pool information for build, package, tools etc into one file
    • Some tools like flake8 do not support this approach
  • Create a package

Project management

  • Use Makefile instead of self-implemented imperative setup.sh
    • Implemented and functional
    • Needs improvement for local venv install, because source cannot run inside make
  • Adopt CHANGELOG.md
    • 'A changelog is a file which contains a curated, chronologically ordered list of notable changes for each version of a project.'
    • Seems to be reasonable
  • Adopt SemVer for semantic versioning
    • Seems to be reasonable
  • Implement basic CI/CD-Skeleton
    • Using bump2version, pre-commit, black etc
    • Rationale:
      • Get fast feedback
      • Raise confidence in codebase
      • Always keep codebase in releasable state
  • Adopt TDD/BDD as described by Dave Farley in "TDD Is The Best Design Technique" and "TDD vs BDD"
    • Goals
      • Think of specification first, then test
      • Confirm behavior instead of testing the code
    • Structure
      • Specification (Test Suite) ==> Test (Scenario) ==> "Given, When, Then"
    • Sequence
      • Red: Write test ==> Green: Write code passing test ==> Blue: Refactor code and test
      • Test: Arrange ==> Act ==> Assert ==> Clean
    • Frameworks Gherkin and Cucumber
  • Move from Makefile to Cirrus CLI
    • Use --dirty for write-backs to files instead of rsync instance, e.g. for isort
  • Implement pydoc-action to auto-generate docs into gh-pages /docs, e.g. Sphinx Build Action for Sphinx
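
The "Given, When, Then" structure and the Arrange, Act, Assert, Clean sequence listed above could look like the following behavior test. It is a minimal sketch: `merge_configs` is a hypothetical unit under test invented for illustration, and the test is written so it runs both under pytest and as a plain script.

```python
def merge_configs(defaults: dict, overrides: dict) -> dict:
    """Hypothetical unit under test: values in overrides win over defaults."""
    return {**defaults, **overrides}


def test_given_defaults_when_overridden_then_override_wins():
    # Arrange (Given): a default config and a user override
    defaults = {"mode": "train", "epochs": 1}
    overrides = {"epochs": 3}

    # Act (When): the two configs are merged
    merged = merge_configs(defaults, overrides)

    # Assert (Then): the override wins and untouched keys survive
    assert merged == {"mode": "train", "epochs": 3}

    # Clean: nothing to tear down, but confirm the inputs were not mutated
    assert defaults == {"mode": "train", "epochs": 1}


if __name__ == "__main__":
    test_given_defaults_when_overridden_then_override_wins()
    print("behavior confirmed")
```

In the red-green-refactor cycle the test function would be written first and fail (red) until `merge_configs` exists and passes (green).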

Inspirations

Resources

  • TODO