Commit 38250ed

Merge branch 'datahub-project:master' into master

anshbansal authored Oct 9, 2024
2 parents a74219b + 576ae8a commit 38250ed
Showing 54 changed files with 1,432 additions and 74 deletions.
99 changes: 99 additions & 0 deletions .github/workflows/docker-unified.yml
@@ -479,6 +479,39 @@ jobs:
          context: .
          file: ./docker/kafka-setup/Dockerfile
          platforms: linux/amd64,linux/arm64/v8
  kafka_setup_scan:
    permissions:
      contents: read # for actions/checkout to fetch code
      security-events: write # for github/codeql-action/upload-sarif to upload SARIF results
      actions: read # only required for a private repository by github/codeql-action/upload-sarif to get the Action run status
    name: "[Monitoring] Scan Kafka Setup images for vulnerabilities"
    runs-on: ubuntu-latest
    needs: [ setup, kafka_setup_build ]
    if: ${{ needs.setup.outputs.kafka_setup_change == 'true' || (needs.setup.outputs.publish == 'true' || needs.setup.outputs.pr-publish == 'true') }}
    steps:
      - name: Checkout # adding checkout step just to make trivy upload happy
        uses: acryldata/sane-checkout-action@v3
      - name: Download image
        uses: ishworkh/docker-image-artifact-download@v1
        if: ${{ needs.setup.outputs.publish != 'true' && needs.setup.outputs.pr-publish != 'true' }}
        with:
          image: ${{ env.DATAHUB_KAFKA_SETUP_IMAGE }}:${{ needs.setup.outputs.unique_tag }}
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@0.8.0
        env:
          TRIVY_OFFLINE_SCAN: true
        with:
          image-ref: ${{ env.DATAHUB_KAFKA_SETUP_IMAGE }}:${{ needs.setup.outputs.unique_tag }}
          format: "template"
          template: "@/contrib/sarif.tpl"
          output: "trivy-results.sarif"
          severity: "CRITICAL,HIGH"
          ignore-unfixed: true
          vuln-type: "os,library"
      - name: Upload Trivy scan results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: "trivy-results.sarif"

  mysql_setup_build:
    name: Build and Push DataHub MySQL Setup Docker Image
@@ -500,6 +533,39 @@ jobs:
          context: .
          file: ./docker/mysql-setup/Dockerfile
          platforms: linux/amd64,linux/arm64/v8
  mysql_setup_scan:
    permissions:
      contents: read # for actions/checkout to fetch code
      security-events: write # for github/codeql-action/upload-sarif to upload SARIF results
      actions: read # only required for a private repository by github/codeql-action/upload-sarif to get the Action run status
    name: "[Monitoring] Scan MySQL Setup images for vulnerabilities"
    runs-on: ubuntu-latest
    needs: [ setup, mysql_setup_build ]
    if: ${{ needs.setup.outputs.mysql_setup_change == 'true' || (needs.setup.outputs.publish == 'true' || needs.setup.outputs.pr-publish == 'true') }}
    steps:
      - name: Checkout # adding checkout step just to make trivy upload happy
        uses: acryldata/sane-checkout-action@v3
      - name: Download image
        uses: ishworkh/docker-image-artifact-download@v1
        if: ${{ needs.setup.outputs.publish != 'true' && needs.setup.outputs.pr-publish != 'true' }}
        with:
          image: ${{ env.DATAHUB_MYSQL_SETUP_IMAGE }}:${{ needs.setup.outputs.unique_tag }}
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@0.8.0
        env:
          TRIVY_OFFLINE_SCAN: true
        with:
          image-ref: ${{ env.DATAHUB_MYSQL_SETUP_IMAGE }}:${{ needs.setup.outputs.unique_tag }}
          format: "template"
          template: "@/contrib/sarif.tpl"
          output: "trivy-results.sarif"
          severity: "CRITICAL,HIGH"
          ignore-unfixed: true
          vuln-type: "os,library"
      - name: Upload Trivy scan results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: "trivy-results.sarif"

  elasticsearch_setup_build:
    name: Build and Push DataHub Elasticsearch Setup Docker Image
@@ -521,6 +587,39 @@ jobs:
          context: .
          file: ./docker/elasticsearch-setup/Dockerfile
          platforms: linux/amd64,linux/arm64/v8
  elasticsearch_setup_scan:
    permissions:
      contents: read # for actions/checkout to fetch code
      security-events: write # for github/codeql-action/upload-sarif to upload SARIF results
      actions: read # only required for a private repository by github/codeql-action/upload-sarif to get the Action run status
    name: "[Monitoring] Scan ElasticSearch setup images for vulnerabilities"
    runs-on: ubuntu-latest
    needs: [ setup, elasticsearch_setup_build ]
    if: ${{ needs.setup.outputs.elasticsearch_setup_change == 'true' || (needs.setup.outputs.publish == 'true' || needs.setup.outputs.pr-publish == 'true') }}
    steps:
      - name: Checkout # adding checkout step just to make trivy upload happy
        uses: acryldata/sane-checkout-action@v3
      - name: Download image
        uses: ishworkh/docker-image-artifact-download@v1
        if: ${{ needs.setup.outputs.publish != 'true' && needs.setup.outputs.pr-publish != 'true' }}
        with:
          image: ${{ env.DATAHUB_ELASTIC_SETUP_IMAGE }}:${{ needs.setup.outputs.unique_tag }}
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@0.8.0
        env:
          TRIVY_OFFLINE_SCAN: true
        with:
          image-ref: ${{ env.DATAHUB_ELASTIC_SETUP_IMAGE }}:${{ needs.setup.outputs.unique_tag }}
          format: "template"
          template: "@/contrib/sarif.tpl"
          output: "trivy-results.sarif"
          severity: "CRITICAL,HIGH"
          ignore-unfixed: true
          vuln-type: "os,library"
      - name: Upload Trivy scan results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: "trivy-results.sarif"

  datahub_ingestion_base_build:
    name: Build and Push DataHub Ingestion (Base) Docker Image
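The three scan jobs above are identical apart from the image they target, and each renders Trivy results through the bundled SARIF template. As an aside, newer releases of aquasecurity/trivy-action can emit SARIF natively; a roughly equivalent step might look like this sketch (the version pin is an assumption, not part of this commit):

```yaml
- name: Run Trivy vulnerability scanner
  uses: aquasecurity/trivy-action@0.24.0 # hypothetical pin; any release with native SARIF output
  with:
    image-ref: ${{ env.DATAHUB_KAFKA_SETUP_IMAGE }}:${{ needs.setup.outputs.unique_tag }}
    format: "sarif" # replaces format: "template" + template: "@/contrib/sarif.tpl"
    output: "trivy-results.sarif"
    severity: "CRITICAL,HIGH"
    ignore-unfixed: true
```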
1 change: 0 additions & 1 deletion docker/airflow/docker-compose.yaml
@@ -38,7 +38,6 @@
#
# Feel free to modify this file to suit your needs.
---
version: '3'
x-airflow-common:
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
1 change: 0 additions & 1 deletion docker/cassandra/docker-compose.cassandra.yml
@@ -1,6 +1,5 @@
# Override to use Cassandra as a backing store for datahub-gms.
---
version: '3.8'
services:
  cassandra:
    hostname: cassandra
1 change: 0 additions & 1 deletion docker/docker-compose-with-cassandra.yml
@@ -4,7 +4,6 @@

# NOTE: This file does not build! No dockerfiles are set. See the README.md in this directory.
---
version: '3.9'
services:
  datahub-frontend-react:
    hostname: datahub-frontend-react
1 change: 0 additions & 1 deletion docker/docker-compose-without-neo4j.override.yml
@@ -1,5 +1,4 @@
---
version: '3.9'
services:
  datahub-gms:
    env_file: datahub-gms/env/docker-without-neo4j.env
1 change: 0 additions & 1 deletion docker/docker-compose-without-neo4j.postgres.override.yml
@@ -1,6 +1,5 @@
# Override to use PostgreSQL as a backing store for datahub-gms.
---
version: '3.9'
services:
  datahub-gms:
    env_file:
1 change: 0 additions & 1 deletion docker/docker-compose-without-neo4j.yml
@@ -4,7 +4,6 @@

# NOTE: This file cannot build! No dockerfiles are set. See the README.md in this directory.
---
version: '3.9'
services:
  datahub-frontend-react:
    hostname: datahub-frontend-react
1 change: 0 additions & 1 deletion docker/docker-compose.consumers-without-neo4j.yml
@@ -1,5 +1,4 @@
# Service definitions for standalone Kafka consumer containers.
version: '3.9'
services:
  datahub-gms:
    environment:
1 change: 0 additions & 1 deletion docker/docker-compose.consumers.dev.yml
@@ -1,4 +1,3 @@
version: '3.9'
services:
  datahub-mae-consumer:
    image: acryldata/datahub-mae-consumer:debug
1 change: 0 additions & 1 deletion docker/docker-compose.consumers.yml
@@ -1,5 +1,4 @@
# Service definitions for standalone Kafka consumer containers.
version: '3.9'
services:
  datahub-gms:
    environment:
1 change: 0 additions & 1 deletion docker/docker-compose.dev.yml
@@ -8,7 +8,6 @@
# To make a JVM app debuggable via IntelliJ, go to its env file and add JVM debug flags, and then add the JVM debug
# port to this file.
---
version: '3.9'
services:
  datahub-frontend-react:
    image: acryldata/datahub-frontend-react:head
1 change: 0 additions & 1 deletion docker/docker-compose.kafka-setup.yml
@@ -1,3 +1,2 @@
# Empty docker compose for kafka-setup as we have moved kafka-setup back into the main compose
version: '3.9'
services:
1 change: 0 additions & 1 deletion docker/docker-compose.override.yml
@@ -1,6 +1,5 @@
# Default override to use MySQL as a backing store for datahub-gms (same as docker-compose.mysql.yml).
---
version: '3.9'
services:
  datahub-gms:
    env_file: datahub-gms/env/docker.env
1 change: 0 additions & 1 deletion docker/docker-compose.tools.yml
@@ -1,6 +1,5 @@
# Tools useful for operating & debugging DataHub.
---
version: '3.8'
services:
  kafka-rest-proxy:
    image: confluentinc/cp-kafka-rest:7.4.0
1 change: 0 additions & 1 deletion docker/docker-compose.yml
@@ -4,7 +4,6 @@

# NOTE: This file does not build! No dockerfiles are set. See the README.md in this directory.
---
version: '3.9'
services:
  datahub-frontend-react:
    hostname: datahub-frontend-react
1 change: 0 additions & 1 deletion docker/ingestion/docker-compose.yml
@@ -1,5 +1,4 @@
---
version: '3.5'
services:
  ingestion:
    build:
1 change: 0 additions & 1 deletion docker/mariadb/docker-compose.mariadb.yml
@@ -1,6 +1,5 @@
# Override to use MariaDB as a backing store for datahub-gms.
---
version: '3.8'
services:
  mariadb:
    hostname: mariadb
1 change: 0 additions & 1 deletion docker/monitoring/docker-compose.consumers.monitoring.yml
@@ -1,5 +1,4 @@
---
version: '3.8'
services:
  datahub-mae-consumer:
    environment:
1 change: 0 additions & 1 deletion docker/monitoring/docker-compose.monitoring.yml
@@ -1,5 +1,4 @@
---
version: '3.9'
services:
  datahub-frontend-react:
    environment:
1 change: 0 additions & 1 deletion docker/mysql/docker-compose.mysql.yml
@@ -1,6 +1,5 @@
# Override to use MySQL as a backing store for datahub-gms.
---
version: '3.8'
services:
  mysql:
    hostname: mysql
1 change: 0 additions & 1 deletion docker/quickstart/docker-compose-m1.quickstart.yml
@@ -291,7 +291,6 @@ services:
    volumes:
      - zkdata:/var/lib/zookeeper/data
      - zklogs:/var/lib/zookeeper/log
version: '3.9'
volumes:
  broker: null
  esdata: null
@@ -266,7 +266,6 @@ services:
    volumes:
      - zkdata:/var/lib/zookeeper/data
      - zklogs:/var/lib/zookeeper/log
version: '3.9'
volumes:
  broker: null
  esdata: null
@@ -266,7 +266,6 @@ services:
    volumes:
      - zkdata:/var/lib/zookeeper/data
      - zklogs:/var/lib/zookeeper/log
version: '3.9'
volumes:
  broker: null
  esdata: null
@@ -55,4 +55,3 @@ services:
    image: ${DATAHUB_MCE_CONSUMER_IMAGE:-acryldata/datahub-mce-consumer}:${DATAHUB_VERSION:-head}
    ports:
      - 9090:9090
version: '3.9'
1 change: 0 additions & 1 deletion docker/quickstart/docker-compose.consumers.quickstart.yml
@@ -69,4 +69,3 @@ services:
    image: ${DATAHUB_MCE_CONSUMER_IMAGE:-acryldata/datahub-mce-consumer}:${DATAHUB_VERSION:-head}
    ports:
      - 9090:9090
version: '3.9'
@@ -1,2 +1 @@
services: {}
version: '3.9'
1 change: 0 additions & 1 deletion docker/quickstart/docker-compose.monitoring.quickstart.yml
@@ -41,6 +41,5 @@ services:
      - 9089:9090
    volumes:
      - ../monitoring/prometheus.yaml:/etc/prometheus/prometheus.yml
version: '3.9'
volumes:
  grafana-storage: null
1 change: 0 additions & 1 deletion docker/quickstart/docker-compose.quickstart.yml
@@ -291,7 +291,6 @@ services:
    volumes:
      - zkdata:/var/lib/zookeeper/data
      - zklogs:/var/lib/zookeeper/log
version: '3.9'
volumes:
  broker: null
  esdata: null
5 changes: 0 additions & 5 deletions docker/quickstart/generate_docker_quickstart.py
@@ -120,11 +120,6 @@ def modify_docker_config(base_path, docker_yaml_config):
        elif volumes[i].startswith("./"):
            volumes[i] = "." + volumes[i]

    # 10. Set docker compose version to 3.
    # We need at least this version, since we use features like start_period for
    # healthchecks (with services dependencies based on them) and shell-like variable interpolation.
    docker_yaml_config["version"] = "3.9"


def dedup_env_vars(merged_docker_config):
    for service in merged_docker_config["services"]:
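These `version` deletions, together with the generator change above, follow the Compose Specification: Docker Compose v2 treats the top-level `version` key as obsolete and ignores it. A minimal sketch of the resulting file shape (the image comes from this diff; the port mapping is illustrative):

```yaml
# Compose v2 infers the schema from the file contents; no `version:` key is needed.
services:
  datahub-frontend-react:
    image: acryldata/datahub-frontend-react:head
    ports:
      - "9002:9002"
```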
14 changes: 14 additions & 0 deletions docs/how/search.md
@@ -105,6 +105,20 @@ If you want to:
  - ```/q customProperties: encoding*``` [Sample results](https://demo.datahubproject.io/search?page=1&query=%2Fq%20customProperties%3A%20encoding%2A)
  - Dataset Properties are indexed in ElasticSearch in the form key=value. Hence, if you know the precise key-value pair, you can search using ```"key=value"```. However, if you only know the key, you can use wildcards to replace the value, which is what is done here.

- Find an entity with an **unversioned** structured property
  - ```/q structuredProperties.io_acryl_privacy_retentionTime01:60```
  - This returns results for an **unversioned** structured property with qualified name `io.acryl.privacy.retentionTime01` and value `60`.
  - ```/q _exists_:structuredProperties.io_acryl_privacy_retentionTime01```
  - In this example, the query returns any entity that has any value for the **unversioned** structured property with qualified name `io.acryl.privacy.retentionTime01`.

- Find an entity with a **versioned** structured property
  - ```/q structuredProperties._versioned.io_acryl_privacy_retentionTime.20240614080000.number:365```
  - This query returns results for a **versioned** structured property with qualified name `io.acryl.privacy.retentionTime`, version `20240614080000`, type `number`, and value `365`.
  - ```/q _exists_:structuredProperties._versioned.io_acryl_privacy_retentionTime.20240614080000.number```
  - Returns results for a **versioned** structured property with qualified name `io.acryl.privacy.retentionTime`, version `20240614080000`, and type `number`.
  - ```/q structuredProperties._versioned.io_acryl_privacy_retentionTime.\*.\*:365```
  - Returns results for a **versioned** structured property with any version and type, with a value of `365`.

- Find a dataset with a column name, **latitude**
  - ```/q fieldPaths: latitude``` [Sample results](https://demo.datahubproject.io/search?page=1&query=%2Fq%20fieldPaths%3A%20latitude)
  - fieldPaths is the name of the attribute that holds the column name in Datasets.
24 changes: 24 additions & 0 deletions metadata-ingestion/docs/sources/datahub/datahub_pre.md
@@ -71,3 +71,27 @@ and [mce-consumer](../../../../metadata-jobs/mce-consumer-job/README.md))
- Increase the number of gms pods to add redundancy and increase resilience to node evictions
- If you are migrating large amounts of data, consider increasing Elasticsearch's
  thread count via the `ELASTICSEARCH_THREAD_COUNT` environment variable (see the sketch below).
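As a minimal sketch, a compose override along these lines could raise the thread count, assuming the variable is read by the GMS container (the `datahub-gms` service name matches this repo's compose files; the value `4` is only an example):

```yaml
# Hypothetical override file; merge it with the main compose file, e.g.
#   docker compose -f docker-compose.yml -f docker-compose.thread-count.yml up
services:
  datahub-gms:
    environment:
      - ELASTICSEARCH_THREAD_COUNT=4
```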

#### Exclusions

You will likely want to exclude some URN types from your ingestion, since they carry instance-specific
metadata such as settings, roles, policies, ingestion sources, and ingestion runs. A typical starting
point is a deny list like the following:

```yaml
source:
  config:
    urn_pattern: # URN pattern to ignore/include in the ingestion
      deny:
        # Ignores all datahub metadata where the urn matches the regex
        - ^urn:li:role.* # Only exclude if you do not want to ingest roles
        - ^urn:li:dataHubRole.* # Only exclude if you do not want to ingest roles
        - ^urn:li:dataHubPolicy.* # Only exclude if you do not want to ingest policies
        - ^urn:li:dataHubIngestionSource.* # Only exclude if you do not want to ingest ingestion sources
        - ^urn:li:dataHubSecret.*
        - ^urn:li:dataHubExecutionRequest.*
        - ^urn:li:dataHubAccessToken.*
        - ^urn:li:dataHubUpgrade.*
        - ^urn:li:inviteToken.*
        - ^urn:li:globalSettings.*
        - ^urn:li:dataHubStepState.*
```
16 changes: 12 additions & 4 deletions metadata-ingestion/src/datahub/cli/delete_cli.py
@@ -338,10 +338,18 @@ def by_filter(
    # TODO: add some validation on entity_type

    if not force and not soft and not dry_run:
        click.confirm(
            "This will permanently delete data from DataHub. Do you want to continue?",
            abort=True,
        )
        if only_soft_deleted:
            click.confirm(
                "This will permanently delete data from DataHub. Do you want to continue?",
                abort=True,
            )
        else:
            click.confirm(
                "Hard deletion will permanently delete data from DataHub and can be slow. "
                "We generally recommend using soft deletes instead. "
                "Do you want to continue?",
                abort=True,
            )

    graph = get_default_graph()
    logger.info(f"Using {graph}")