A simple data quality tool. Collect and publish metrics about the quality of data anywhere.
- [X] get metrics
- [X] run tests
- [ ] publish metrics to aws
- [ ] publish metrics to prometheus
- [ ] publish metadata to DataHub
- [ ] build dashboards
- [ ] alerts based on CloudWatch
- [ ] other cloud providers?
- [ ] multiple data sources
- [ ] example dag with airflow
- [ ] example with prefect
- [ ] reconciliation between two data sources: % missing, matching vs. mismatched columns
Download from https://github.com/warfox/dqt
`dqt` is a command-line tool that runs on the JVM.
Make sure the JDBC driver for your database is on the classpath.
java -jar dqt.jar run -d datasource.edn -t table.edn
java -cp "/path/to/jdbc/driver/jar/:./dqt.jar" dqt.core run -d examples/postgres.edn -t examples/tables/employees.edn
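The classpath invocation above can be wrapped in a small helper. This is a minimal sketch, not part of dqt: the function name `dqt_classpath` and the jar paths are hypothetical, and it assumes a Linux/macOS shell where classpath entries are separated by `:`.

```shell
#!/bin/sh
# Hypothetical helper: join a JDBC driver jar and the dqt jar into a
# JVM classpath string (entries separated by ':' on Linux/macOS).
dqt_classpath() {
  # $1: path to the JDBC driver jar (placeholder, supplied by the caller)
  driver_jar="$1"
  # $2: path to the dqt jar; defaults to ./dqt.jar when omitted
  dqt_jar="${2:-./dqt.jar}"
  printf '%s:%s\n' "$driver_jar" "$dqt_jar"
}

# Example usage (paths are placeholders):
# java -cp "$(dqt_classpath /path/to/jdbc/driver.jar)" dqt.core run \
#   -d examples/postgres.edn -t examples/tables/employees.edn
```

Keeping the driver jar as an argument means the same helper works for any database driver you place on the classpath.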
Run the project directly, via `:main-opts` (`-m dqt.core`):
$ clojure -M:run
Run the project with parameters:
$ clojure -M:run -d datasource.edn -t table.edn
Run the project’s tests (they’ll fail until you edit them):
$ clojure -T:build test
$ ./bin/kaocha
Build the uberjar:
$ clojure -T:build uberjar
This will produce an updated `pom.xml` file with synchronized dependencies inside the `META-INF` directory inside `target/classes` and the uberjar in `target`. You can update the version (and SCM tag) information in the generated `pom.xml` by updating `build.clj`.
If you don’t want the `pom.xml` file in your project, you can remove it. The `ci` task will still generate a minimal `pom.xml` as part of the `uber` task, unless you remove `version` from `build.clj`.
Run that uberjar:
$ java -jar target/dqt-0.1.0-SNAPSHOT.jar
If you remove `version` from `build.clj`, the uberjar will become `target/dqt-standalone.jar`.
FIXME: listing of options this app accepts.
{:dbtype "postgresql"
:dbname "postgres"
:host #or [#env DATABASE_HOSTNAME "localhost"]
:user "postgres"
:password "postgres"
:ssl false
:classname "org.postgresql.Driver"
:sslfactory "org.postgresql.ssl.NonValidatingFactory"}
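The `:host` entry above reads the `DATABASE_HOSTNAME` environment variable and falls back to `"localhost"`. Assuming the `#or` and `#env` tags behave like the Aero-style reader tags they resemble (the first non-nil value wins), the lookup is roughly equivalent to the shell sketch below. `resolve_host` is a hypothetical name used only for illustration, and the hostname in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical illustration of the :host fallback in the datasource file:
# use DATABASE_HOSTNAME when it is set, otherwise "localhost".
resolve_host() {
  # ${VAR-default} substitutes the default only when VAR is unset.
  printf '%s\n' "${DATABASE_HOSTNAME-localhost}"
}

# Override the host for a single run (hostname is a placeholder):
# DATABASE_HOSTNAME=db.example.internal java -jar dqt.jar run -d datasource.edn -t table.edn
```

Setting the variable per invocation keeps the datasource file itself unchanged across environments.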
{:table-name :employees
:metrics [:row-count
:avg-length
:max-length
:min-length
:avg
:sum
:max
:min
:stddev
:variance]
:tests [[:row-count > 10]
[:avg-length-phone-number < 13]
[:stddev-salary > 4500]
[:sum-salary > 20000]
[:max-length-email < 30]]}
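Each entry under `:tests` appears to be a triple of metric name, comparison operator, and threshold, where the metric name combines a metric from `:metrics` with a column name (for example `:stddev-salary`). As a rough illustration of that reading, and not dqt's actual implementation, the sketch below evaluates such triples in shell. `check_metric` and the sample metric values are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed when "<value> <op> <threshold>" holds.
# awk is used so decimal metric values (e.g. a stddev) compare numerically.
check_metric() {
  value="$1"; op="$2"; threshold="$3"
  awk -v v="$value" -v o="$op" -v t="$threshold" 'BEGIN {
    if (o == ">")      exit !(v > t)
    else if (o == "<") exit !(v < t)
    else               exit 2   # unsupported operator
  }'
}

# Sample values are made up for illustration.
check_metric 12 ">" 10 && echo "row-count: pass"
check_metric 5120.7 ">" 4500 && echo "stddev-salary: pass"
check_metric 31 "<" 30 || echo "max-length-email: fail"
```

A failing comparison returns a non-zero exit status, so a wrapper script could stop a pipeline when any test does not hold.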
clj -M:dev:run run -d datasource.edn -t tables/employees.edn
bb run-example
bb dev
Run `docker compose up` to have Postgres running.
bb migrate
clj -M:dev:test
clj -M:dev:test --watch
bb test
bb test:watch
$ bin/kaocha
$ bin/kaocha --watch
Copyright © 2021 Warfox
Distributed under the MIT License.