dqt - Data Quality Tool

A simple data quality tool. Collect and publish metrics about the quality of your data, wherever it lives.

Docs

  1. Dimensions of data quality

Features [2/12]

  • [X] get metrics
  • [X] run tests
  • [ ] publish metrics to aws
  • [ ] publish metrics to prometheus
  • [ ] publish metadata to DataHub
  • [ ] build dashboards
  • [ ] alerts based on CloudWatch
  • [ ] other cloud providers?
  • [ ] multiple data sources
  • [ ] example dag with airflow
  • [ ] example with prefect
  • [ ] reconciliation between two data sources: % missing, matching vs. mismatched columns

Installation

Download from https://github.com/warfox/dqt

Usage

`dqt` is a command-line tool that runs on the JVM.

Make sure the JDBC driver for your database is on the classpath.

java -jar dqt.jar run -d datasource.edn -t table.edn
java -cp "/path/to/jdbc/driver/jar/:./dqt.jar" dqt.core run -d examples/postgres.edn -t examples/tables/employees.edn

datasource.edn

table.edn

Development

Run the project directly, via `:main-opts` (`-m dqt.core`):

$ clojure -M:run

Run the project, with parameters

$ clojure -M:run -d datasource.edn -t table.edn

Run the project’s tests (they’ll fail until you edit them):

$ clojure -T:build test
$ ./bin/kaocha

Build uberjar

$ clojure -T:build uberjar

This will produce an updated pom.xml file with synchronized dependencies inside the META-INF directory inside target/classes and the uberjar in target. You can update the version (and SCM tag) information in generated pom.xml by updating build.clj.

If you don’t want the pom.xml file in your project, you can remove it. The ci task will still generate a minimal pom.xml as part of the uber task, unless you remove version from build.clj.

Run that uberjar:

$ java -jar target/dqt-0.1.0-SNAPSHOT.jar

If you remove version from build.clj, the uberjar will become target/dqt-standalone.jar.

Options

FIXME: listing of options this app accepts.

Examples

datasource file datasource.edn

{:dbtype     "postgresql"
 :dbname     "postgres"
 :host       #or [#env DATABASE_HOSTNAME "localhost"]
 :user       "postgres"
 :password   "postgres"
 :ssl        false
 :classname  "org.postgresql.Driver"
 :sslfactory "org.postgresql.ssl.NonValidatingFactory"}
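
The `#or` and `#env` tags in the `:host` entry look like Aero-style tag literals: take the `DATABASE_HOSTNAME` environment variable if it is set, otherwise fall back to `"localhost"`. The sketch below imitates that behaviour with plain `clojure.edn` readers. It is an illustration only, not how dqt itself loads the file; the namespace, `tag-readers`, and `read-datasource` are hypothetical names, and the environment is passed in as a map so the example does not depend on the real one.

```clojure
(ns dqt-sketch.datasource
  (:require [clojure.edn :as edn]))

(defn tag-readers
  "Builds reader functions for the #env and #or tags.
  `env` is a map from variable name to value (an assumption of this sketch)."
  [env]
  {;; #env NAME -> value of NAME in `env`, or nil when it is not set
   'env (fn [var-name] (get env (str var-name)))
   ;; #or [a b ...] -> first non-nil candidate
   'or  (fn [candidates] (first (filter some? candidates)))})

(defn read-datasource
  "Reads EDN text, resolving #env and #or against `env`."
  [env text]
  (edn/read-string {:readers (tag-readers env)} text))

(def snippet "{:host #or [#env DATABASE_HOSTNAME \"localhost\"]}")

;; Variable not set: :host falls back to the default.
(def default-host (:host (read-datasource {} snippet)))

;; Variable set: :host takes the value from the environment map.
(def custom-host
  (:host (read-datasource {"DATABASE_HOSTNAME" "db.internal"} snippet)))
```

To resolve against the real process environment, pass `(System/getenv)` as the `env` argument.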

table file employees.edn

{:table-name :employees
 :metrics    [:row-count
              :avg-length
              :max-length
              :min-length
              :avg
              :sum
              :max
              :min
              :stddev
              :variance]

 :tests      [[:row-count > 10]
              [:avg-length-phone-number < 13]
              [:stddev-salary > 4500]
              [:sum-salary > 20000]
              [:max-length-email < 30]]}
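
Each entry under `:tests` is a triple of metric name, comparison operator, and threshold. As a rough sketch of how such a triple could be checked, the code below assumes the collected metrics are available as a flat map keyed by names like `:row-count` and `:sum-salary`. That map shape, the namespace, `operators`, and `run-test` are assumptions made for illustration, not dqt's actual implementation, and the metric values are made up.

```clojure
(ns dqt-sketch.tests
  (:require [clojure.edn :as edn]))

;; EDN reads `>` and `<` as plain symbols, so they have to be mapped
;; to functions before they can be applied.
(def operators
  {'> > '< < '>= >= '<= <= '= =})

(defn run-test
  "Checks one [metric operator threshold] triple against a metrics map.
  A missing metric or an unknown operator counts as a failure."
  [metrics [metric op threshold :as triple]]
  (let [compare-fn (get operators op)
        value      (get metrics metric)]
    {:test  triple
     :value value
     :pass? (boolean (and compare-fn
                          (some? value)
                          (compare-fn value threshold)))}))

;; Made-up metric values, for illustration only.
(def sample-metrics {:row-count 42 :sum-salary 15000})

(def table-config
  (edn/read-string "{:tests [[:row-count > 10] [:sum-salary > 20000]]}"))

(def results
  (mapv #(run-test sample-metrics %) (:tests table-config)))
;; The first test passes (42 > 10); the second fails (15000 is not > 20000).
```

Keeping the operator lookup in a map means an unrecognised operator in the table file produces a failed test rather than an exception; a stricter tool might prefer to reject the file instead.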

Run

clj -M:dev:run run -d datasource.edn -t tables/employees.edn
bb run-example

Development

Run development mode with babashka

bb dev

Test database

Run `docker compose up` to start PostgreSQL.

Run migration

bb migrate

Run test

clj -M:dev:test
clj -M:dev:test --watch
bb test
bb test:watch
$ bin/kaocha
$ bin/kaocha --watch

References

License

Copyright © 2021 Warfox

Distributed under the MIT License.
