Contributing Guide

Welcome! There are many ways to contribute, including submitting bug reports, improving documentation, submitting feature requests, reviewing new submissions, or contributing code that can be incorporated into the project.

Limitations

Keep the following constraints in mind during development:

  • Some companies still use old Spark versions, such as 2.3.1, so compatibility should be kept where possible, e.g. by adding branches for different Spark versions.
  • Different users use onETL in different ways: some use only DB connectors, others only file connectors. Connector-specific dependencies should therefore be optional.
  • Instead of creating classes with many different options, prefer splitting them into smaller classes (e.g. an options class, a context manager) and using composition.

Initial setup for local development

Install Git

Please follow the official Git installation instructions.

Create a fork

If you are not a member of the development team building onETL, you should create a fork before making any changes.

Please follow the GitHub instructions for creating a fork.

Clone the repo

Open a terminal and run these commands:

git clone git@github.com:myuser/onetl.git -b develop

cd onetl

Setup environment

Create a virtualenv and install dependencies:

python -m venv venv
source venv/bin/activate
pip install -U wheel
pip install -U pip setuptools
pip install -U \
    -r requirements/core.txt \
    -r requirements/ftp.txt \
    -r requirements/hdfs.txt \
    -r requirements/kerberos.txt \
    -r requirements/s3.txt \
    -r requirements/sftp.txt \
    -r requirements/webdav.txt \
    -r requirements/dev.txt \
    -r requirements/docs.txt \
    -r requirements/tests/base.txt \
    -r requirements/tests/clickhouse.txt \
    -r requirements/tests/kafka.txt \
    -r requirements/tests/mongodb.txt \
    -r requirements/tests/mssql.txt \
    -r requirements/tests/mysql.txt \
    -r requirements/tests/postgres.txt \
    -r requirements/tests/oracle.txt \
    -r requirements/tests/pydantic-2.txt \
    -r requirements/tests/spark-3.5.4.txt

# TODO: remove after https://github.com/zqmillet/sphinx-plantuml/pull/4
pip install sphinx-plantuml --no-deps

Enable pre-commit hooks

Install pre-commit hooks:

pre-commit install --install-hooks

Check that the pre-commit hooks run:

pre-commit run

How to

Run tests locally

Using docker-compose

Build the image for running tests:

docker-compose build

Start all containers with dependencies:

docker-compose up -d

You can also start only a limited set of dependencies:

docker-compose up -d mongodb

Run tests:

docker-compose run --rm onetl ./run_tests.sh

Any additional arguments are passed through to pytest:

docker-compose run --rm onetl ./run_tests.sh -m mongodb -lsx -vvvv --log-cli-level=INFO

You can also start an interactive bash session and run the tests from it:

docker-compose run --rm onetl bash

./run_tests.sh -m mongodb -lsx -vvvv --log-cli-level=INFO

View the logs of the test container:

docker-compose logs -f onetl

Stop all containers and remove created volumes:

docker-compose down -v
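The start, test and cleanup commands above can be chained into one helper. The following is only a sketch: the function name run_marked_tests is hypothetical, and it assumes that the docker-compose service name and the pytest marker are the same word, as they are for mongodb in the examples above.

```shell
# Hypothetical helper, not part of the repository.
# Assumption: the service name in docker-compose matches the pytest marker.
run_marked_tests() {
    service="$1"

    # Start only the dependency needed for this group of tests.
    docker-compose up -d "$service" || return 1

    # Run the tests selected by the marker and remember the exit status.
    docker-compose run --rm onetl ./run_tests.sh -m "$service"
    status=$?

    # Always stop the containers and remove the created volumes.
    docker-compose down -v

    return "$status"
}

# Usage (left commented out because it starts containers):
# run_marked_tests mongodb
```

The exit status of the test run is returned after cleanup, so a failing test run is still visible to the caller.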

Without docker-compose

Warning

To run HDFS tests locally you should add the following line to your /etc/hosts (file path depends on OS):

# HDFS server returns container hostname as connection address, causing error in DNS resolution
127.0.0.1 hdfs
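If you prefer to script this step, a minimal sketch of an idempotent append is shown here. The function name ensure_hdfs_host is hypothetical, and the example writes to a scratch file rather than to the real hosts file.

```shell
# Hypothetical helper: append the HDFS entry only if it is not present yet.
ensure_hdfs_host() {
    hosts_file="$1"
    entry="127.0.0.1 hdfs"
    # -x matches the whole line, -F treats the entry as a fixed string.
    grep -qxF "$entry" "$hosts_file" 2>/dev/null || echo "$entry" >> "$hosts_file"
}

# Try it on a scratch file first.
scratch=$(mktemp)
ensure_hdfs_host "$scratch"
ensure_hdfs_host "$scratch"   # the second call does not add a duplicate line
cat "$scratch"
```

Applying this to /etc/hosts itself usually requires elevated privileges, so run the function from a root shell in that case.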

Note

To run Oracle tests you need to install Oracle instantclient and pass its path in the ONETL_ORA_CLIENT_PATH environment variable, e.g. ONETL_ORA_CLIENT_PATH=/path/to/client64/lib.

You may also need to add the same path to the LD_LIBRARY_PATH environment variable.
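As a sketch, both variables could be set as follows. The path is the placeholder from the note above and has to be replaced with the real instantclient location.

```shell
# Placeholder path; replace it with the directory of your instantclient libraries.
export ONETL_ORA_CLIENT_PATH=/path/to/client64/lib

# Prepend the same path to LD_LIBRARY_PATH, keeping any existing entries.
export LD_LIBRARY_PATH="${ONETL_ORA_CLIENT_PATH}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
```

The `${VAR:+...}` expansion adds the colon separator only when LD_LIBRARY_PATH was already set, which avoids a trailing empty entry.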

Note

To run Greenplum tests, you should:

  • Download VMware Greenplum connector for Spark

  • Either move it to ~/.ivy2/jars/, or pass file path to CLASSPATH

  • Set environment variable ONETL_GP_PACKAGE_VERSION=local.

  • On Linux, you may have to set the environment variable SPARK_EXTERNAL_IP to the IP of the onetl_onetl network gateway:

    export SPARK_EXTERNAL_IP=$(docker network inspect onetl_onetl --format '{{ (index .IPAM.Config 0).Gateway }}')

    This is because in some cases Spark does not properly detect the host machine IP address, so Greenplum segments cannot connect to Spark executors.
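A minimal sketch of the CLASSPATH variant of these steps follows. The jar location is hypothetical: use the actual path of the connector file you downloaded.

```shell
# Hypothetical location of the downloaded connector jar.
GP_CONNECTOR_JAR=/path/to/greenplum-connector.jar

# Make the jar visible through CLASSPATH, keeping any existing entries.
export CLASSPATH="${GP_CONNECTOR_JAR}${CLASSPATH:+:${CLASSPATH}}"

# Use the local jar, as described in the list above.
export ONETL_GP_PACKAGE_VERSION=local
```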

Start all containers with dependencies:

docker-compose up -d

You can also start only a limited set of dependencies:

docker-compose up -d mongodb

Load environment variables with connection properties:

source .env.local

Run tests:

./run_tests.sh

Any additional arguments are passed through to pytest:

./run_tests.sh -m mongodb -lsx -vvvv --log-cli-level=INFO

Stop all containers and remove created volumes:

docker-compose down -v

Build documentation

Build documentation using Sphinx:

cd docs
make html

Then open docs/_build/index.html in a browser.

Review process

Please create a new GitHub issue for any significant changes and enhancements that you wish to make. Provide the feature you would like to see, why you need it, and how it will work. Discuss your ideas transparently and get community feedback before proceeding.

Significant Changes that you wish to contribute to the project should be discussed first in a GitHub issue that clearly outlines the changes and benefits of the feature.

Small Changes can directly be crafted and submitted to the GitHub Repository as a Pull Request.

Create pull request

Commit your changes:

git commit -m "Commit message"
git push

Then open the GitHub interface and create a pull request. Please follow the guide in the PR body template.

After the pull request is created, it gets a corresponding number, e.g. 123 (pr_number).

Write release notes

onETL uses towncrier for changelog management.

To submit a change note about your PR, add a text file into the docs/changelog/next_release folder. It should contain an explanation of what applying this PR will change in the way end-users interact with the project. One sentence is usually enough but feel free to add as many details as you feel necessary for the users to understand what it means.

Use the past tense for the text in your fragment because, combined with others, it will be a part of the "news digest" telling the readers what changed in a specific version of the library since the previous version.

You should also use reStructuredText syntax for highlighting code (inline or block) and for linking to parts of the docs or external sites. If you wish to sign your change, feel free to add -- by :github:user:`github-username` at the end (replace github-username with your own!).

Finally, name your file following the convention that Towncrier understands: it should start with the number of an issue or a PR followed by a dot, then add a patch type, like feature, doc, misc etc., and add .rst as a suffix. If you need to add more than one fragment, you may add an optional sequence number (delimited with another period) between the type and the suffix.

In general the name will follow <pr_number>.<category>.rst pattern, where the categories are:

  • feature: Any new feature
  • bugfix: A bug fix
  • improvement: An improvement
  • doc: A change to the documentation
  • dependency: Dependency-related changes
  • misc: Changes internal to the repo like CI, test and build changes

A pull request may have more than one of these components, for example a code change may introduce a new feature that deprecates an old feature, in which case two fragments should be added. It is not necessary to make a separate documentation fragment for documentation changes accompanying the relevant code changes.
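The naming convention above can be checked mechanically. The following sketch uses a hypothetical helper, is_valid_fragment_name, and hard-codes the categories listed above, so it is only an illustration. The authoritative category list lives in pyproject.toml.

```shell
# Hypothetical helper: succeeds when the file name follows
# <pr_number>.<category>[.<sequence>].rst for the categories listed above.
is_valid_fragment_name() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9]+\.(feature|bugfix|improvement|doc|dependency|misc)(\.[0-9]+)?\.rst$'
}

is_valid_fragment_name "123.feature.rst"  && echo "123.feature.rst: ok"
is_valid_fragment_name "123.bugfix.1.rst" && echo "123.bugfix.1.rst: ok"
is_valid_fragment_name "123.typo.rst"     || echo "123.typo.rst: rejected"
```

The second name shows the optional sequence number used when one pull request adds several fragments of the same category.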

Examples for adding changelog entries to your Pull Requests

Added a ``:github:user:`` role to Sphinx config -- by :github:user:`someuser`
Fixed behavior of ``WebDAV`` connector -- by :github:user:`someuser`
Added support of ``timeout`` in ``S3`` connector
-- by :github:user:`someuser`, :github:user:`anotheruser` and :github:user:`otheruser`
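Putting the pieces together, a fragment file could be created from the repository root as sketched here. The PR number 123 and the user name are placeholders, and the text reuses one of the examples above.

```shell
# Placeholder values; use the real PR number and the matching category.
pr_number=123
category=feature
fragment="docs/changelog/next_release/${pr_number}.${category}.rst"

# Create the folder if needed and write a one-sentence note in the past tense.
mkdir -p "$(dirname "$fragment")"
cat > "$fragment" <<'EOF'
Added support of ``timeout`` in ``S3`` connector -- by :github:user:`someuser`
EOF

cat "$fragment"
```

After that, commit the new file together with the rest of the pull request.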

Tip

See pyproject.toml for all available categories (tool.towncrier.type).

How to skip change notes check?

Just add the ci:skip-changelog label to the pull request.

Release Process

Before making a release from the develop branch, follow these steps:

  1. Check out the develop branch and update it to the latest state
git checkout develop
git pull -p
  2. Back up NEXT_RELEASE.rst
cp "docs/changelog/NEXT_RELEASE.rst" "docs/changelog/temp_NEXT_RELEASE.rst"
  3. Build the release notes with Towncrier
VERSION=$(cat onetl/VERSION)
towncrier build "--version=${VERSION}" --yes
  4. Rename the changelog file to the release version number
mv docs/changelog/NEXT_RELEASE.rst "docs/changelog/${VERSION}.rst"
  5. Remove content above the version number heading in the ${VERSION}.rst file
awk '!/^.*towncrier release notes start/' "docs/changelog/${VERSION}.rst" > temp && mv temp "docs/changelog/${VERSION}.rst"
  6. Update the changelog index
awk -v version="${VERSION}" '/DRAFT/{print;print "    " version;next}1' docs/changelog/index.rst > temp && mv temp docs/changelog/index.rst
  7. Restore the NEXT_RELEASE.rst file from backup
mv "docs/changelog/temp_NEXT_RELEASE.rst" "docs/changelog/NEXT_RELEASE.rst"
  8. Commit and push changes to the develop branch
git add .
git commit -m "Prepare for release ${VERSION}"
git push
  9. Merge the develop branch into master, WITHOUT squashing
git checkout master
git pull
git merge develop
git push
  10. Add a git tag to the latest commit in the master branch
git tag "$VERSION"
git push origin "$VERSION"
  11. Update the version in the develop branch after the release:
git checkout develop

NEXT_VERSION=$(echo "$VERSION" | awk -F. '/[0-9]+\./{$NF++;print}' OFS=.)
echo "$NEXT_VERSION" > onetl/VERSION

git add .
git commit -m "Bump version"
git push
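To see what the two less obvious awk commands in this process do, they can be tried on scratch data first. In this sketch the version 0.12.5 and the content of the index file are made-up examples, and nothing in the repository is touched.

```shell
# Work in a scratch directory so the repository stays unchanged.
workdir=$(mktemp -d)

# Made-up current version, standing in for the content of onetl/VERSION.
VERSION="0.12.5"

# Version bump: split on dots, increment the last field, join with dots again.
NEXT_VERSION=$(echo "$VERSION" | awk -F. '/[0-9]+\./{$NF++;print}' OFS=.)
echo "$NEXT_VERSION"   # prints 0.12.6

# Made-up minimal changelog index containing the DRAFT line.
printf '%s\n' '.. toctree::' '' '    DRAFT' '    0.12.4' > "$workdir/index.rst"

# Index update: print every line, and insert the released version right after DRAFT.
awk -v version="$VERSION" '/DRAFT/{print;print "    " version;next}1' \
    "$workdir/index.rst" > "$workdir/index.new"
cat "$workdir/index.new"
```

The resulting index lists 0.12.5 directly below DRAFT and above 0.12.4, so newer releases stay at the top.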